International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 151 - Number 9
Year of Publication: 2016 |
Authors: Tilak Patidar, Aditya Ambasth |
10.5120/ijca2016911857 |
Tilak Patidar, Aditya Ambasth. Improvised Architecture for Distributed Web Crawling. International Journal of Computer Applications. 151, 9 (Oct 2016), 14-20. DOI=10.5120/ijca2016911857
Web crawlers are programs designed to fetch web pages for information retrieval systems. Crawlers facilitate this process by following hyperlinks in web pages to automatically download new pages or update existing pages in the repository. A web crawler interacts with millions of hosts, fetches millions of pages per second, and writes these pages to a database. This creates a need to maintain I/O performance and to keep network resource usage within OS limits, both of which are essential to achieving high performance at a reasonable cost. This paper aims to showcase efficient techniques for developing a scalable web crawling system, addressing challenges related to the structure of the web, distributed computing, job scheduling, spider traps, URL canonicalization, and inconsistent data formats on the web. It also briefly discusses a new web crawler architecture.
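As an illustration of one challenge named above, URL canonicalization, the following is a minimal Python sketch and not the method described in the paper. The function name `canonicalize`, the `DEFAULT_PORTS` table, and the specific normalization rules (lowercasing scheme and host, dropping default ports and fragments, sorting query parameters) are assumptions chosen for the example; a production crawler would likely apply more rules.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Hypothetical table of default ports that can be omitted from a URL.
DEFAULT_PORTS = {"http": 80, "https": 443}


def canonicalize(url: str) -> str:
    """Return a normalized form of `url` so equivalent URLs compare equal.

    Illustrative rules only: lowercase scheme and host, drop default
    ports, use "/" for an empty path, sort query parameters, and drop
    the fragment. Userinfo is discarded and IPv6 hosts are not handled.
    """
    parts = urlsplit(url.strip())
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    port = parts.port

    # Keep the port only when it differs from the scheme's default.
    if port is None or DEFAULT_PORTS.get(scheme) == port:
        netloc = host
    else:
        netloc = f"{host}:{port}"

    path = parts.path or "/"

    # Sorting parameters makes "?b=2&a=1" and "?a=1&b=2" identical.
    query = urlencode(sorted(parse_qsl(parts.query, keep_blank_values=True)))

    # The fragment is never sent to the server, so it is dropped.
    return urlunsplit((scheme, netloc, path, query, ""))


if __name__ == "__main__":
    seen = set()
    for raw in ["HTTP://Example.COM:80/a?b=2&a=1#top",
                "http://example.com/a?a=1&b=2"]:
        seen.add(canonicalize(raw))
    print(len(seen))  # both inputs map to the same canonical URL
```

Storing only canonical forms in the set of visited URLs lets a crawler avoid fetching the same page twice under different spellings.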