International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 61 - Number 12
Year of Publication: 2013
Authors: Shikha Shukla, Nitin, Sitendra Tamrakar
DOI: 10.5120/9981-4811
Shikha Shukla, Nitin, Sitendra Tamrakar. Web Information Extraction: Tag Density and Keyword Approach. International Journal of Computer Applications. 61, 12 (January 2013), 28-30. DOI=10.5120/9981-4811
Web pages contain a great deal of noise in the form of advertisements, irrelevant information, copyright notices, and menus. To extract information from the web, we use two concepts: text density and the title of the page. Generally, the main content of a page is denser than the rest of the page, while noise carries less text. The title is the most important information on the page, because it tells us what the page is about. We therefore extract every block that is denser than a particular threshold or that contains at least one of the keywords derived from the page title. Because the keyword test retains relevant blocks that fall below the density threshold, more false negatives are avoided. This approach gives very satisfactory results.
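The rule described in the abstract (keep a block if it is dense enough or if it shares a keyword with the title) can be illustrated with a minimal Python sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: density is measured as non-whitespace characters per tag, the threshold of 20 is arbitrary, the stopword list is a small placeholder, and the names `title_keywords`, `text_density`, and `extract_content` are hypothetical. The page is also assumed to be already split into HTML blocks.

```python
import re

# Assumption: a simple regex is enough to find tags in well-formed snippets.
TAG_RE = re.compile(r"<[^>]+>")
# Assumption: a tiny placeholder stopword list, not the paper's.
STOPWORDS = {"a", "an", "the", "of", "and", "for", "to", "in", "on", "is"}


def title_keywords(title):
    """Lowercased title words with stopwords removed."""
    words = re.findall(r"[a-z0-9]+", title.lower())
    return {w for w in words if w not in STOPWORDS}


def text_density(block_html):
    """Non-whitespace text characters per tag (assumed density measure)."""
    n_tags = len(TAG_RE.findall(block_html))
    text = TAG_RE.sub(" ", block_html)
    n_chars = len("".join(text.split()))
    return n_chars / max(1, n_tags)


def extract_content(blocks, title, threshold=20.0):
    """Keep blocks that are dense OR contain at least one title keyword."""
    keywords = title_keywords(title)
    kept = []
    for block in blocks:
        text = " ".join(TAG_RE.sub(" ", block).split())
        words = set(re.findall(r"[a-z0-9]+", text.lower()))
        if text_density(block) > threshold or (keywords & words):
            kept.append(text)
    return kept


if __name__ == "__main__":
    page_title = "Web Information Extraction"
    page_blocks = [
        "<p>Tag density measures how much text each tag carries "
        "in a block of the page.</p>",
        '<ul><li><a href="/">Home</a></li>'
        '<li><a href="/about">About</a></li></ul>',
        "<span><b>Extraction</b> <i>notes</i></span>",
    ]
    for line in extract_content(page_blocks, page_title):
        print(line)
```

In this sketch the navigation menu is dropped because it is both sparse and unrelated to the title, while the short third block survives only through the keyword test, which is the case the abstract credits with reducing false negatives.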