Research Article

Web Information Extraction: Tag Density and Keyword Approach

by Shikha Shukla, Nitin, Sitendra Tamrakar
International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 61 - Number 12
Year of Publication: 2013
Authors: Shikha Shukla, Nitin, Sitendra Tamrakar
10.5120/9981-4811

Shikha Shukla, Nitin, Sitendra Tamrakar. Web Information Extraction: Tag Density and Keyword Approach. International Journal of Computer Applications. 61, 12 (January 2013), 28-30. DOI=10.5120/9981-4811

@article{ 10.5120/9981-4811,
author = { Shikha Shukla, Nitin, Sitendra Tamrakar },
title = { Web Information Extraction: Tag Density and Keyword Approach },
journal = { International Journal of Computer Applications },
issue_date = { January 2013 },
volume = { 61 },
number = { 12 },
month = { January },
year = { 2013 },
issn = { 0975-8887 },
pages = { 28-30 },
numpages = {9},
url = { https://ijcaonline.org/archives/volume61/number12/9981-4811/ },
doi = { 10.5120/9981-4811 },
publisher = {Foundation of Computer Science (FCS), NY, USA},
address = {New York, USA}
}
%0 Journal Article
%1 2024-02-06T21:08:55.919474+05:30
%A Shikha Shukla
%A Nitin
%A Sitendra Tamrakar
%T Web Information Extraction: Tag Density and Keyword Approach
%J International Journal of Computer Applications
%@ 0975-8887
%V 61
%N 12
%P 28-30
%D 2013
%I Foundation of Computer Science (FCS), NY, USA
Abstract

Web pages contain a great deal of noise in the form of advertisements, irrelevant information, copyright notices, and menus. To extract information from the web, we use two concepts: text density and the title of the page. Generally, the main content of a page is denser than the rest, while noise carries less text. The title is the most important information on the page, because it tells us what the page is about. We therefore extract all content that is denser than a particular threshold or that contains at least one of the keywords derived from the title of the page. With this approach, more false negatives can be avoided, and it gives very satisfactory results.
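The selection rule described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not define density, so the sketch assumes it to be characters of text per HTML tag within a block, treats each direct child of <body> as one block, and uses an arbitrary threshold of 20 and a small hand-picked stopword list. The names BlockCollector, title_keywords, and extract_main_content are hypothetical.

```python
import re
from html.parser import HTMLParser

# Assumptions for illustration only: void tags that never get a closing tag,
# and a tiny stopword list used when turning the title into keywords.
VOID = {"br", "img", "hr", "input", "meta", "link"}
STOPWORDS = {"a", "an", "and", "at", "the", "of", "for", "to", "in", "on"}


class BlockCollector(HTMLParser):
    """Collects the <title> text and one record per direct child of <body>."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.blocks = []        # each block: {"text": str, "tags": int}
        self._in_title = False
        self._depth = None      # nesting depth below <body>; None outside it

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "body":
            self._depth = 0
        elif self._depth is not None:
            if self._depth == 0:            # a new top-level block starts
                self.blocks.append({"text": "", "tags": 0})
            self.blocks[-1]["tags"] += 1
            if tag not in VOID:
                self._depth += 1

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag == "body":
            self._depth = None
        elif self._depth and tag not in VOID:
            self._depth -= 1

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif self._depth:                   # text inside some block
            self.blocks[-1]["text"] += data


def title_keywords(title):
    """Lowercased title words with stopwords removed."""
    words = re.findall(r"[a-z0-9]+", title.lower())
    return {w for w in words if w not in STOPWORDS}


def extract_main_content(html, density_threshold=20.0):
    """Keep blocks that are dense enough OR mention a title keyword."""
    parser = BlockCollector()
    parser.feed(html)
    keywords = title_keywords(parser.title)
    kept = []
    for block in parser.blocks:
        text = " ".join(block["text"].split())   # normalize whitespace
        if not text:
            continue
        density = len(text) / block["tags"]      # assumed: chars per tag
        words = set(re.findall(r"[a-z0-9]+", text.lower()))
        if density > density_threshold or words & keywords:
            kept.append(text)
    return kept


SAMPLE = """<html><head><title>Growing Tomatoes in Pots</title></head><body>
<div><a href="/">Home</a> <a href="/about">About</a> <a href="/ads">Deals</a></div>
<div><p>Plant seedlings in deep, well-drained soil and water them at the base every morning during the first month.</p></div>
<div><span>Tomatoes: quick tips</span></div>
<div><small>Copyright 2013</small></div>
</body></html>"""

if __name__ == "__main__":
    for text in extract_main_content(SAMPLE):
        print(text)
```

In this sample, the menu and copyright blocks fall below the threshold and share no word with the title, so they are dropped. The long paragraph passes on density alone, and the short "quick tips" block is kept only because it contains the title keyword "tomatoes", which is the kind of false negative the keyword check is meant to avoid. A fuller version would also need to skip script and style content, which this sketch counts as text.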

Index Terms

Computer Science
Information Sciences

Keywords

Crawler, Web mining, Information extraction