Crawling the Web Surface Databases

Vidushi Singhal; Sachin Sharma

Research Article

International Journal of Computer Applications

Foundation of Computer Science (FCS), NY, USA

Volume 52 - Number 19

Year of Publication: 2012

Authors: Vidushi Singhal, Sachin Sharma

10.5120/8309-1827

Vidushi Singhal, Sachin Sharma. Crawling the Web Surface Databases. International Journal of Computer Applications. 52, 19 (August 2012), 15-22. DOI=10.5120/8309-1827

@article{10.5120/8309-1827,
author = {Vidushi Singhal and Sachin Sharma},
title = {Crawling the Web Surface Databases},
journal = {International Journal of Computer Applications},
issue_date = {August 2012},
volume = {52},
number = {19},
month = {August},
year = {2012},
issn = {0975-8887},
pages = {15-22},
numpages = {9},
url = {https://ijcaonline.org/archives/volume52/number19/8309-1827/},
doi = {10.5120/8309-1827},
publisher = {Foundation of Computer Science (FCS), NY, USA},
address = {New York, USA}
}

%0 Journal Article
%1 2024-02-06T20:52:40.178508+05:30
%A Vidushi Singhal
%A Sachin Sharma
%T Crawling the Web Surface Databases
%J International Journal of Computer Applications
%@ 0975-8887
%V 52
%N 19
%P 15-22
%D 2012
%I Foundation of Computer Science (FCS), NY, USA

Abstract

The World Wide Web is growing at a rapid rate. A web crawler is a computer program that browses the World Wide Web autonomously. As of February 2007, the size of the web was 29 billion pages. One of the most important uses of a crawler is to index web pages and keep the stored copies up to date, so that a search engine can serve end-user queries. Because the web is dynamic, stored pages must be updated constantly. In this paper, we put forward an efficient technique for refreshing a page stored in a web repository. We propose two methods for refreshing a page, both based on comparing page structure. The first method compares page structure using the tags that appear in the page. The second method builds a document tree for each page and compares the structures of the pages.
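The abstract names the two comparisons but does not give their details, so the following is only a minimal sketch of the general idea, not the authors' algorithm. It assumes the stored and freshly fetched pages are available as HTML strings and uses Python's standard `html.parser` and `difflib` modules. The names `StructureParser`, `parse_structure`, `tag_similarity`, and `same_tree` are hypothetical, and the similarity ratio is an illustrative choice of measure.

```python
from difflib import SequenceMatcher
from html.parser import HTMLParser

# Void elements never get a closing tag, so they are not pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class StructureParser(HTMLParser):
    """Collects the opening-tag sequence and a nested (tag, children) tree."""

    def __init__(self):
        super().__init__()
        self.tags = []                 # flat tag sequence (tag-based method)
        self.root = ("#root", [])      # nested tree (document-tree method)
        self._stack = [self.root]

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)
        node = (tag, [])
        self._stack[-1][1].append(node)
        if tag not in VOID_TAGS:
            self._stack.append(node)

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open element, if there is one.
        for i in range(len(self._stack) - 1, 0, -1):
            if self._stack[i][0] == tag:
                del self._stack[i:]
                break


def parse_structure(html):
    """Return (tag sequence, tree root) for an HTML string; text is ignored."""
    parser = StructureParser()
    parser.feed(html)
    parser.close()
    return parser.tags, parser.root


def tag_similarity(old_html, new_html):
    """Tag-based sketch: ratio in [0, 1] between the two tag sequences."""
    old_tags, _ = parse_structure(old_html)
    new_tags, _ = parse_structure(new_html)
    return SequenceMatcher(None, old_tags, new_tags).ratio()


def same_tree(old_html, new_html):
    """Tree-based sketch: True when both pages yield identical tag trees."""
    _, old_root = parse_structure(old_html)
    _, new_root = parse_structure(new_html)
    return old_root == new_root


if __name__ == "__main__":
    stored = "<html><body><div><p>Hi</p></div></body></html>"
    fetched = "<html><body><div><p>Hi</p><p>More</p></div></body></html>"
    print(tag_similarity(stored, fetched), same_tree(stored, fetched))
```

In this sketch the tree comparison is stricter than the tag comparison: two pages can use the same tags in the same order yet nest them differently, in which case the tag sequences match but the trees do not. Neither function looks at text content, so a page whose wording changes while its markup stays the same is reported as structurally unchanged.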

Index Terms

Computer Science

Information Sciences

Keywords

Web Crawler, WWW, Spidering, Search Engine, Surface Web, Deep Web, Document Tree Structure