Research Article

Crawling the Web Surface Databases

by Vidushi Singhal, Sachin Sharma
International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 52 - Number 19
Year of Publication: 2012
Authors: Vidushi Singhal, Sachin Sharma
10.5120/8309-1827

Vidushi Singhal, Sachin Sharma. Crawling the Web Surface Databases. International Journal of Computer Applications. 52, 19 (August 2012), 15-22. DOI=10.5120/8309-1827

@article{ 10.5120/8309-1827,
author = { Vidushi Singhal and Sachin Sharma },
title = { Crawling the Web Surface Databases },
journal = { International Journal of Computer Applications },
issue_date = { August 2012 },
volume = { 52 },
number = { 19 },
month = { August },
year = { 2012 },
issn = { 0975-8887 },
pages = { 15-22 },
numpages = {9},
url = { https://ijcaonline.org/archives/volume52/number19/8309-1827/ },
doi = { 10.5120/8309-1827 },
publisher = {Foundation of Computer Science (FCS), NY, USA},
address = {New York, USA}
}
%0 Journal Article
%1 2024-02-06T20:52:40.178508+05:30
%A Vidushi Singhal
%A Sachin Sharma
%T Crawling the Web Surface Databases
%J International Journal of Computer Applications
%@ 0975-8887
%V 52
%N 19
%P 15-22
%D 2012
%I Foundation of Computer Science (FCS), NY, USA
Abstract

The World Wide Web is growing at a rapid rate. A web crawler is a computer program that browses the World Wide Web autonomously. As of February 2007, the web comprised 29 billion pages. One of the most important uses of crawled web pages is indexing: a search engine keeps its stored pages up to date so that it can serve end-user queries. Because the web is dynamic, stored pages must be updated constantly. In this paper, we put forward an efficient technique for refreshing a page stored in a web repository. We propose two methods for refreshing a page, both of which compare page structure. The first method compares page structure using the tags that appear in the page. The second method builds a document tree for each page and compares the structures of the trees.
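
The abstract does not give the algorithms themselves, so the following is only a minimal Python sketch of the two ideas it names, under stated assumptions. It assumes that "comparing tags" means comparing the ordered sequence of opening tags, and that "comparing document trees" means comparing nested (tag, children) structures for equality. The names `StructureParser`, `parse_structure`, `tags_changed`, and `tree_changed` are hypothetical and are not taken from the paper.

```python
from html.parser import HTMLParser

# Void elements never have a closing tag, so they are not pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class StructureParser(HTMLParser):
    """Collects a flat tag sequence and a nested (tag, children) tree."""

    def __init__(self):
        super().__init__()
        self.tags = []                 # opening tags in document order
        self.root = ("#root", [])      # document tree as (tag, children)
        self._stack = [self.root]      # path from the root to the open node

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)
        node = (tag, [])
        self._stack[-1][1].append(node)
        if tag not in VOID_TAGS:
            self._stack.append(node)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; stray closing tags are ignored.
        for i in range(len(self._stack) - 1, 0, -1):
            if self._stack[i][0] == tag:
                del self._stack[i:]
                break


def parse_structure(html):
    """Parse an HTML string and return the populated parser."""
    parser = StructureParser()
    parser.feed(html)
    parser.close()
    return parser


def tags_changed(old_html, new_html):
    """Assumed reading of the first method: compare the tag sequences."""
    return parse_structure(old_html).tags != parse_structure(new_html).tags


def tree_changed(old_html, new_html):
    """Assumed reading of the second method: compare the document trees."""
    return parse_structure(old_html).root != parse_structure(new_html).root


# Same tags in the same order, but <span> moves inside <div>.
stored = "<div><p>a</p></div><span>b</span>"
fetched = "<div><p>a</p><span>b</span></div>"
print(tags_changed(stored, fetched))   # False
print(tree_changed(stored, fetched))   # True
```

In this sketch the tree comparison catches a change in nesting that the flat tag comparison misses, while neither check reacts to a change in text alone. Whether the paper's methods behave the same way cannot be determined from the abstract.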

Index Terms

Computer Science
Information Sciences

Keywords

Web Crawler, WWW, Spidering, Search Engine, Surface Web, Deep Web, Document Tree Structure