Research Article

Learning Capable Focused Crawler for Information Technology Domain

by Mukesh Kumar, Renu Vig
International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 43 - Number 23
Year of Publication: 2012
Authors: Mukesh Kumar, Renu Vig
10.5120/6416-7849

Mukesh Kumar, Renu Vig. Learning Capable Focused Crawler for Information Technology Domain. International Journal of Computer Applications. 43, 23 (April 2012), 1-4. DOI=10.5120/6416-7849

@article{ 10.5120/6416-7849,
author = { Mukesh Kumar and Renu Vig },
title = { Learning Capable Focused Crawler for Information Technology Domain },
journal = { International Journal of Computer Applications },
issue_date = { April 2012 },
volume = { 43 },
number = { 23 },
month = { April },
year = { 2012 },
issn = { 0975-8887 },
pages = { 1-4 },
numpages = {9},
url = { https://ijcaonline.org/archives/volume43/number23/6416-7849/ },
doi = { 10.5120/6416-7849 },
publisher = {Foundation of Computer Science (FCS), NY, USA},
address = {New York, USA}
}
%0 Journal Article
%1 2024-02-06T20:34:03.676677+05:30
%A Mukesh Kumar
%A Renu Vig
%T Learning Capable Focused Crawler for Information Technology Domain
%J International Journal of Computer Applications
%@ 0975-8887
%V 43
%N 23
%P 1-4
%D 2012
%I Foundation of Computer Science (FCS), NY, USA
Abstract

The Web is a huge and ever-growing source of information, but its rapid growth poses a great challenge for general-purpose crawlers and search engines: no search engine can index the whole Web. A focused crawler collects domain-relevant pages while avoiding the irrelevant portions of the Web. It can help a search engine index all documents on the Web related to a specific domain, which in turn gives the search engine's users complete and up-to-date content. In this paper we present a focused crawler capable of learning from previous crawl results to collect relevant documents. Crawling results for three consecutive learning phases are shown. The results indicate a significant improvement in relevancy to the focused domain.
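
The general idea of a focused crawler that learns between crawl phases can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not specify their relevance measure or learning rule, so the toy in-memory web `WEB`, the term-weight scoring in `relevance`, the best-first frontier in `focused_crawl`, and the frequency-based update in `learn_weights` are all assumptions made purely for illustration.

```python
import heapq
from collections import Counter

# Hypothetical toy "web": url -> (page text, outgoing links). No network access.
WEB = {
    "seed": ("software database network", ["a", "b", "c"]),
    "a": ("database server compiler server", ["d"]),
    "b": ("football cricket tennis", []),
    "c": ("cooking recipes", ["e"]),
    "d": ("compiler server kernel", []),
    "e": ("server kernel compiler", []),
}


def relevance(text, weights):
    """Average topic-term weight over the words of a page (0.0 if empty)."""
    words = text.lower().split()
    if not words:
        return 0.0
    return sum(weights.get(w, 0.0) for w in words) / len(words)


def focused_crawl(web, seeds, weights, threshold=0.2, limit=10):
    """Best-first crawl: links found on higher-scoring pages are visited first.

    Returns {url: score} for pages whose score reaches the threshold.
    """
    frontier = [(-1.0, url) for url in seeds]  # negated priority = max-heap
    heapq.heapify(frontier)
    seen = set(seeds)
    relevant = {}
    fetched = 0
    while frontier and fetched < limit:
        _, url = heapq.heappop(frontier)
        text, links = web[url]
        fetched += 1
        score = relevance(text, weights)
        if score >= threshold:
            relevant[url] = score
        for link in links:
            if link in web and link not in seen:
                seen.add(link)
                # A link inherits the score of the page it was found on.
                heapq.heappush(frontier, (-score, link))
    return relevant


def learn_weights(web, relevant, weights, top_k=3, new_weight=0.5):
    """Assumed learning rule: promote the most frequent words of the
    relevant pages that are not yet topic terms."""
    counts = Counter()
    for url in relevant:
        counts.update(w for w in web[url][0].lower().split() if w not in weights)
    updated = dict(weights)
    for word, _ in counts.most_common(top_k):
        updated[word] = new_weight
    return updated


initial_weights = {"software": 1.0, "database": 1.0, "network": 1.0}
phase1 = focused_crawl(WEB, ["seed"], initial_weights)
learned_weights = learn_weights(WEB, phase1, initial_weights)
phase2 = focused_crawl(WEB, ["seed"], learned_weights)
print(sorted(phase1))  # ['a', 'seed']
print(sorted(phase2))  # ['a', 'd', 'e', 'seed']
```

In this toy run the first phase judges only two pages relevant, while the second phase, using terms learned from the first, also accepts pages `d` and `e`. A learning rule this simple can also drift off topic, so a real crawler would need a more careful update than the one assumed here.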

Index Terms

Computer Science
Information Sciences

Keywords

Web Internet Retrieval Focused Web Crawler Search Engine Etc