Research Article

Improvised Architecture for Distributed Web Crawling

by Tilak Patidar, Aditya Ambasth
International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 151 - Number 9
Year of Publication: 2016
Authors: Tilak Patidar, Aditya Ambasth
10.5120/ijca2016911857

Tilak Patidar, Aditya Ambasth. Improvised Architecture for Distributed Web Crawling. International Journal of Computer Applications. 151, 9 (Oct 2016), 14-20. DOI=10.5120/ijca2016911857

@article{ 10.5120/ijca2016911857,
author = { Tilak Patidar and Aditya Ambasth },
title = { Improvised Architecture for Distributed Web Crawling },
journal = { International Journal of Computer Applications },
issue_date = { Oct 2016 },
volume = { 151 },
number = { 9 },
month = { Oct },
year = { 2016 },
issn = { 0975-8887 },
pages = { 14-20 },
numpages = {9},
url = { https://ijcaonline.org/archives/volume151/number9/26260-2016911857/ },
doi = { 10.5120/ijca2016911857 },
publisher = {Foundation of Computer Science (FCS), NY, USA},
address = {New York, USA}
}
%0 Journal Article
%1 2024-02-06T23:56:38.354287+05:30
%A Tilak Patidar
%A Aditya Ambasth
%T Improvised Architecture for Distributed Web Crawling
%J International Journal of Computer Applications
%@ 0975-8887
%V 151
%N 9
%P 14-20
%D 2016
%I Foundation of Computer Science (FCS), NY, USA
Abstract

Web crawlers are programs designed to fetch web pages for information retrieval systems. Crawlers facilitate this process by following hyperlinks in web pages to automatically download new pages or update existing pages in the repository. A web crawler interacts with millions of hosts, fetches millions of pages per second, and writes these pages to a database, so I/O performance and network resource usage must be kept within OS limits to achieve high performance at a reasonable cost. This paper aims to showcase efficient techniques for developing a scalable web crawling system, addressing challenges related to the structure of the web, distributed computing, job scheduling, spider traps, canonicalizing URLs, and inconsistent data formats on the web. It also briefly discusses a new web crawler architecture.
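
Canonicalizing URLs, one of the challenges named in the abstract, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `canonicalize` and `dedupe`, the chosen normalization rules, and the use of a plain Python set in place of a Bloom filter are all assumptions made for illustration.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Default ports per scheme (assumed rule set for this sketch).
DEFAULT_PORTS = {"http": 80, "https": 443}


def canonicalize(url: str) -> str:
    """Return one normalized form for URLs that point to the same resource.

    Hypothetical helper: lowercases scheme and host, drops default ports,
    user info and fragments, sorts query parameters, and uses "/" for an
    empty path. IPv6 hosts and percent-encoding are not handled.
    """
    parts = urlsplit(url.strip())
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    port = parts.port
    # Keep the port only when it differs from the scheme's default.
    if port is None or port == DEFAULT_PORTS.get(scheme):
        netloc = host
    else:
        netloc = f"{host}:{port}"
    path = parts.path or "/"
    # Sort query parameters so that parameter order does not matter.
    query = urlencode(sorted(parse_qsl(parts.query, keep_blank_values=True)))
    # The fragment is never sent to the server, so it is dropped.
    return urlunsplit((scheme, netloc, path, query, ""))


def dedupe(urls):
    """Return canonical URLs in first-seen order, skipping duplicates.

    A plain set stands in for the Bloom filter named in the keywords.
    """
    seen = set()
    unique = []
    for url in urls:
        canonical = canonicalize(url)
        if canonical not in seen:
            seen.add(canonical)
            unique.append(canonical)
    return unique
```

With these rules, `HTTP://Example.COM:80/a?b=2&a=1#top` and `http://example.com/a?a=1&b=2` map to the same canonical string, so a crawler frontier would fetch the page only once. A production crawler would likely swap the set for a Bloom filter to bound memory, accepting a small false-positive rate.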

Index Terms

Computer Science
Information Sciences

Keywords

Web Crawler, Distributed Computing, Bloom Filter, Batch Crawling, Selection Policy, Politeness Policy