Research Article

Web Crawler: A Review

by Md. Abu Kausar, V. S. Dhaka, Sanjeev Kumar Singh
International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 63 - Number 2
Year of Publication: 2013
Authors: Md. Abu Kausar, V. S. Dhaka, Sanjeev Kumar Singh
DOI: 10.5120/10440-5125

Md. Abu Kausar, V. S. Dhaka, Sanjeev Kumar Singh. Web Crawler: A Review. International Journal of Computer Applications. 63, 2 (February 2013), 31-36. DOI=10.5120/10440-5125

@article{ 10.5120/10440-5125,
author = { Md. Abu Kausar, V. S. Dhaka, Sanjeev Kumar Singh },
title = { Web Crawler: A Review },
journal = { International Journal of Computer Applications },
issue_date = { February 2013 },
volume = { 63 },
number = { 2 },
month = { February },
year = { 2013 },
issn = { 0975-8887 },
pages = { 31-36 },
numpages = {6},
url = { https://ijcaonline.org/archives/volume63/number2/10440-5125/ },
doi = { 10.5120/10440-5125 },
publisher = {Foundation of Computer Science (FCS), NY, USA},
address = {New York, USA}
}
%0 Journal Article
%1 2024-02-06T21:13:08.264448+05:30
%A Md. Abu Kausar
%A V. S. Dhaka
%A Sanjeev Kumar Singh
%T Web Crawler: A Review
%J International Journal of Computer Applications
%@ 0975-8887
%V 63
%N 2
%P 31-36
%D 2013
%I Foundation of Computer Science (FCS), NY, USA
Abstract

Information retrieval deals with searching for and retrieving information within documents, as well as in online databases and on the Internet. A web crawler is a program that traverses the Web and downloads web documents in a methodical, automated manner. Based on the type of knowledge, crawling techniques are usually divided into three types: general-purpose crawling, focused crawling, and distributed crawling. This paper discusses the applicability of web crawlers in the field of web search and reviews their use across different problem domains in web search.
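
The traversal described in the abstract can be illustrated with a minimal sketch of a general-purpose, breadth-first crawler. This is not the authors' implementation. The names `crawl`, `LinkExtractor`, `fetch`, and `max_pages` are hypothetical and chosen only for illustration, and the download step is passed in as a function so that the sketch can be run without network access.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in an HTML document."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed_urls, fetch, max_pages=100):
    """Breadth-first crawl starting from seed_urls.

    fetch is a caller-supplied function (an assumption of this sketch) that
    maps a URL to its HTML text, or returns None if the page is unavailable.
    Returns a dict of URL -> HTML in the order the pages were downloaded.
    """
    frontier = deque(seed_urls)   # URLs waiting to be downloaded (FIFO)
    seen = set(seed_urls)         # URLs already queued, to avoid revisits
    pages = {}
    while frontier and len(pages) < max_pages:
        url = frontier.popleft()
        html = fetch(url)
        if html is None:
            continue              # skip pages that could not be downloaded
        pages[url] = html
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop "#fragment" suffixes.
            absolute, _fragment = urldefrag(urljoin(url, href))
            if absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)
    return pages


if __name__ == "__main__":
    # A tiny in-memory "web" standing in for real HTTP downloads.
    site = {
        "http://example.test/": '<a href="/a">A</a> <a href="b.html#top">B</a>',
        "http://example.test/a": '<a href="/">home</a>',
        "http://example.test/b.html": "<p>no links</p>",
    }
    for visited_url in crawl(["http://example.test/"], site.get):
        print(visited_url)
```

In this sketch the first-in, first-out frontier gives general-purpose behaviour. A focused crawler could plausibly replace it with a priority queue ordered by an estimate of topical relevance, and a distributed crawler could partition the frontier across several machines. A production crawler would also need to respect robots.txt and limit its request rate, which the sketch omits.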

Index Terms

Computer Science
Information Sciences

Keywords

WWW, Web Crawler, Crawling techniques, Web Crawler Survey, Search engine, Parallel Crawler