Research Article

Available Challenges and Guidelines in the Field of Deep Web and Intensive Crawling

by Yasin Ezatdoost, Ali Tourani, Amir Seyed Danesh
International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 77 - Number 1
Year of Publication: 2013
Authors: Yasin Ezatdoost, Ali Tourani, Amir Seyed Danesh
10.5120/13355-0948

Yasin Ezatdoost, Ali Tourani, Amir Seyed Danesh. Available Challenges and Guidelines in the Field of Deep Web and Intensive Crawling. International Journal of Computer Applications. 77, 1 (September 2013), 1-5. DOI=10.5120/13355-0948

@article{ 10.5120/13355-0948,
author = { Yasin Ezatdoost and Ali Tourani and Amir Seyed Danesh },
title = { Available Challenges and Guidelines in the Field of Deep Web and Intensive Crawling },
journal = { International Journal of Computer Applications },
issue_date = { September 2013 },
volume = { 77 },
number = { 1 },
month = { September },
year = { 2013 },
issn = { 0975-8887 },
pages = { 1-5 },
numpages = {9},
url = { https://ijcaonline.org/archives/volume77/number1/13355-0948/ },
doi = { 10.5120/13355-0948 },
publisher = {Foundation of Computer Science (FCS), NY, USA},
address = {New York, USA}
}
%0 Journal Article
%1 2024-02-06T21:49:05.686948+05:30
%A Yasin Ezatdoost
%A Ali Tourani
%A Amir Seyed Danesh
%T Available Challenges and Guidelines in the Field of Deep Web and Intensive Crawling
%J International Journal of Computer Applications
%@ 0975-8887
%V 77
%N 1
%P 1-5
%D 2013
%I Foundation of Computer Science (FCS), NY, USA
Abstract

Today, a vast amount of information is available on the Web, and the only way to access it is through search interfaces. A web crawler is an automated script that browses the Web independently. It starts its task from a "seed URL" and then follows the links available on each page. This approach confronts many existing crawlers with fundamental difficulties. Identifying the search interface and selecting a suitable query, on the one hand, and retrieving the documents the Web returns as results, on the other, are issues that intensify the challenges facing web crawlers. The aim of the present paper is to investigate the existing challenges and guidelines in the field of deep web and intensive crawling.
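
The crawling loop described in the abstract (start from a seed URL, then follow the links on each page) can be illustrated with a minimal sketch. This is not the authors' implementation: the breadth-first visiting order, the `crawl` and `LinkExtractor` names, the `max_pages` limit, and the in-memory `PAGES` site under `example.test` are all assumptions made so the example runs without network access.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed_url, fetch, max_pages=100):
    """Breadth-first crawl from seed_url.

    fetch(url) is a caller-supplied function that returns the page's
    HTML as a string, or None if the page cannot be retrieved.
    """
    frontier = deque([seed_url])  # URLs waiting to be visited
    seen = {seed_url}             # URLs already queued, to avoid repeats
    visited = []                  # URLs fetched successfully, in order
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        html = fetch(url)
        if html is None:
            continue
        visited.append(url)
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            link = urljoin(url, href)  # resolve relative links
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return visited


# Hypothetical in-memory "web" standing in for real HTTP requests.
PAGES = {
    "http://example.test/": (
        '<a href="/a">A</a> <a href="/b">B</a>'
        '<form action="/search"><input name="q"></form>'
    ),
    "http://example.test/a": '<a href="/b">B</a>',
    "http://example.test/b": '<a href="/">home</a>',
    # Produced only by submitting the form, so no hyperlink leads here.
    "http://example.test/search?q=crawler": "<p>result</p>",
}

visited = crawl("http://example.test/", PAGES.get)
print(visited)
```

In this toy site the result page behind the search form is never visited, because the sketch only follows hyperlinks. That is a small-scale picture of the difficulty the abstract raises: reaching such content would require recognizing the form and choosing a query to submit.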

Index Terms

Computer Science
Information Sciences

Keywords

Intensive crawler, search engine, genetic algorithm, deep web