Research Article

Improvised Architecture for Distributed Web Crawling

by Tilak Patidar, Aditya Ambasth
International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 151 - Number 9
Year of Publication: 2016
Authors: Tilak Patidar, Aditya Ambasth
DOI: 10.5120/ijca2016911857

Tilak Patidar, Aditya Ambasth. Improvised Architecture for Distributed Web Crawling. International Journal of Computer Applications 151, 9 (Oct 2016), 14-20. DOI=10.5120/ijca2016911857

@article{10.5120/ijca2016911857,
  author     = {Tilak Patidar and Aditya Ambasth},
  title      = {Improvised Architecture for Distributed Web Crawling},
  journal    = {International Journal of Computer Applications},
  issue_date = {Oct 2016},
  volume     = {151},
  number     = {9},
  month      = {Oct},
  year       = {2016},
  issn       = {0975-8887},
  pages      = {14-20},
  numpages   = {7},
  url        = {https://ijcaonline.org/archives/volume151/number9/26260-2016911857/},
  doi        = {10.5120/ijca2016911857},
  publisher  = {Foundation of Computer Science (FCS), NY, USA},
  address    = {New York, USA}
}

%0 Journal Article
%A Tilak Patidar
%A Aditya Ambasth
%T Improvised Architecture for Distributed Web Crawling
%J International Journal of Computer Applications
%@ 0975-8887
%V 151
%N 9
%P 14-20
%D 2016
%I Foundation of Computer Science (FCS), NY, USA
Abstract

Web crawlers are programs designed to fetch web pages for an information retrieval system. They do so by following hyperlinks in web pages to automatically download new pages or update existing ones in the repository. A web crawler interacts with millions of hosts, fetches millions of pages per second and writes these pages into a database, which creates a need to sustain I/O performance and keep network resource usage within OS limits, both essential for achieving high performance at a reasonable cost. This paper presents efficient techniques for building a scalable web crawling system, addressing challenges related to the structure of the web, distributed computing, job scheduling, spider traps, canonicalizing URLs and inconsistent data formats on the web. The paper also briefly discusses a new web crawler architecture.
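
The abstract lists URL canonicalization and duplicate detection among the crawling challenges; the short Python sketch below (illustrative only, not taken from the paper, with hypothetical helper names such as canonicalize and should_enqueue) shows one common way the two steps are combined, using a plain set where the Bloom filter named in the keywords would be used at crawl scale.

from urllib.parse import urlsplit, urlunsplit

def canonicalize(url):
    """Return a normalized form of url for duplicate detection."""
    scheme, netloc, path, query, _fragment = urlsplit(url.strip())
    scheme = scheme.lower()
    netloc = netloc.lower()
    # Default ports add nothing, so strip them from the host part.
    if scheme == "http" and netloc.endswith(":80"):
        netloc = netloc[: -len(":80")]
    elif scheme == "https" and netloc.endswith(":443"):
        netloc = netloc[: -len(":443")]
    # An empty path and "/" refer to the same resource.
    if not path:
        path = "/"
    # Fragments never reach the server, so they are dropped.
    return urlunsplit((scheme, netloc, path, query, ""))

seen = set()  # at crawl scale a Bloom filter would stand in for this set

def should_enqueue(url):
    """True if the canonical form of url has not been scheduled before."""
    canonical = canonicalize(url)
    if canonical in seen:
        return False
    seen.add(canonical)
    return True
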

Index Terms

Computer Science
Information Sciences

Keywords

Web Crawler, Distributed Computing, Bloom Filter, Batch Crawling, Selection Policy, Politeness Policy