Research Article

Unvisited URL Relevancy Calculation in Focused Crawling Based on Naïve Bayesian Classification

by Lizashree Mishra, Amritesh Kumar, Debashis Hati
International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 3 - Number 9
Year of Publication: 2010
Authors: Lizashree Mishra, Amritesh Kumar, Debashis Hati
DOI: 10.5120/767-1074

Lizashree Mishra, Amritesh Kumar, Debashis Hati. Unvisited URL Relevancy Calculation in Focused Crawling Based on Naïve Bayesian Classification. International Journal of Computer Applications 3, 9 (July 2010), 23-30. DOI=10.5120/767-1074

@article{ 10.5120/767-1074,
author = { Lizashree Mishra, Amritesh Kumar, Debashis Hati },
title = { Unvisited URL Relevancy Calculation in Focused Crawling Based on Naïve Bayesian Classification },
journal = { International Journal of Computer Applications },
issue_date = { July 2010 },
volume = { 3 },
number = { 9 },
month = { July },
year = { 2010 },
issn = { 0975-8887 },
pages = { 23-30 },
numpages = {9},
url = { https://ijcaonline.org/archives/volume3/number9/767-1074/ },
doi = { 10.5120/767-1074 },
publisher = {Foundation of Computer Science (FCS), NY, USA},
address = {New York, USA}
}
%0 Journal Article
%A Lizashree Mishra
%A Amritesh Kumar
%A Debashis Hati
%T Unvisited URL Relevancy Calculation in Focused Crawling Based on Naïve Bayesian Classification
%J International Journal of Computer Applications
%@ 0975-8887
%V 3
%N 9
%P 23-30
%D 2010
%I Foundation of Computer Science (FCS), NY, USA
Abstract

Vertical search engines use a focused crawler as their key component and develop specific algorithms to select web pages relevant to a pre-defined set of topics. Crawlers are software programs that traverse the Internet and retrieve web pages by following hyperlinks. The focused crawler of a special-purpose search engine aims to selectively seek out pages that are relevant to a pre-defined set of topics, rather than to explore all regions of the Web. Maintaining the currency of search engine indices by exhaustive crawling is rapidly becoming impossible due to the increasing size of the Web; a focused crawler, which searches only the subset of the Web related to a specific topic, offers a potential solution to this problem. A focused crawler is an agent that targets a particular topic and visits and gathers only a relevant, narrow web segment while trying not to waste resources on irrelevant material. Since the crawler is only a computer program, it cannot by itself determine how relevant a web page is; the major problem is how to retrieve the maximal set of relevant, high-quality pages. In our proposed approach, we classify an unvisited URL as relevant or irrelevant to the topic based on the attribute scores of the visited URLs, and then decide based on the seed page attribute scores. Based on each score, a “Yes” or “No” value is placed in the attribute table. The URL attributes are: the relevancy of its anchor text, the similarity score of its description in the Google search engine with the topic keywords, the cohesive text similarity with the topic keywords, and the relevancy score of its parent pages. The relevancy score is calculated using the vector space model, and classification is done by the Naïve Bayesian classification method.
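The abstract describes two computational steps: a vector space relevancy score between page text and topic keywords, and a Naïve Bayesian decision over the Yes/No attribute table. The Python sketch below illustrates one plausible reading of those steps; the cosine similarity over term-frequency vectors, the Laplace smoothing, the attribute names, and the toy training table are all illustrative assumptions and are not taken from the paper itself.

# Illustrative sketch only: cosine similarity (vector space model) for relevancy
# scoring and a Bernoulli-style Naive Bayes over Yes/No URL attributes.
# Attribute order and the toy training rows are assumptions, not the paper's data.
import math
from collections import Counter

def cosine_similarity(text, topic_keywords):
    """Vector space model relevancy: cosine of term-frequency vectors."""
    a = Counter(text.lower().split())
    b = Counter(k.lower() for k in topic_keywords)
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def train_naive_bayes(rows, labels, alpha=1.0):
    """Estimate P(class) and P(attribute='Yes' | class) with Laplace smoothing."""
    model = {}
    for c in set(labels):
        idx = [i for i, label in enumerate(labels) if label == c]
        prior = len(idx) / len(labels)
        likelihood = [
            (sum(rows[i][j] == "Yes" for i in idx) + alpha) / (len(idx) + 2 * alpha)
            for j in range(len(rows[0]))
        ]
        model[c] = (prior, likelihood)
    return model

def classify(model, row):
    """Return the class with the highest log-posterior for a Yes/No attribute row."""
    best, best_score = None, float("-inf")
    for c, (prior, likelihood) in model.items():
        score = math.log(prior)
        for j, value in enumerate(row):
            p = likelihood[j] if value == "Yes" else 1.0 - likelihood[j]
            score += math.log(p)
        if score > best_score:
            best, best_score = c, score
    return best

# Hypothetical attribute columns: anchor-text relevancy, description similarity,
# cohesive-text similarity, parent-page relevancy (each thresholded to "Yes"/"No").
training_rows = [
    ["Yes", "Yes", "Yes", "Yes"],
    ["Yes", "No", "Yes", "Yes"],
    ["No", "No", "No", "Yes"],
    ["No", "No", "No", "No"],
]
training_labels = ["relevant", "relevant", "irrelevant", "irrelevant"]

model = train_naive_bayes(training_rows, training_labels)
print(classify(model, ["Yes", "No", "Yes", "No"]))
print(cosine_similarity("focused crawler for topic specific search",
                        ["focused", "crawler", "search"]))

In practice the continuous similarity scores would first be thresholded into the Yes/No table the abstract mentions; the classifier then only has to combine those binary attributes.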

Index Terms

Computer Science
Information Sciences

Keywords

Crawler, Focused crawler, Vector space model, Naïve Bayesian classification methods