International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 91 - Number 3
Year of Publication: 2014
Authors: K. Nethra, J. Anitha
DOI: 10.5120/15861-4785
K. Nethra, J. Anitha. Web Content Extraction by Integrating Textual and Visual Importance of Web Pages. International Journal of Computer Applications. 91, 3 (April 2014), 20-24. DOI=10.5120/15861-4785
Web pages carry a large amount of information that is useful in real-world applications. Additional content such as links, footers, headers and advertisements complicates content extraction. Such irrelevant content is treated as noisy content, so a method is needed to extract the informative content and discard the noisy content from Web pages. This work extracts the informative content by integrating textual and visual importance. A Web page is first converted into a DOM (Document Object Model) tree. For each node in the DOM tree, textual importance and visual importance are calculated and then combined to form a hybrid density. A density sum is then calculated and used in the content extraction algorithm to extract the informative content from Web pages. The performance of Web content extraction is evaluated by calculating precision, recall, F-measure and accuracy.
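The pipeline summarized in the abstract (DOM tree, per-node importance, hybrid density, density sum, extraction) can be illustrated with a minimal sketch. The abstract does not give the paper's formulas, so everything concrete here is an assumption made for illustration: textual importance is approximated as characters per tag, visual importance is read from a hypothetical `data-visual` attribute standing in for features a real system would take from the rendered page, hybrid density is their product, and the block with the largest density sum is selected. All function and class names are hypothetical.

```python
from dataclasses import dataclass, field
from html.parser import HTMLParser


@dataclass
class Node:
    tag: str
    attrs: dict = field(default_factory=dict)
    children: list = field(default_factory=list)
    text: str = ""        # text directly inside this node
    visual: float = 1.0   # ASSUMPTION: visual importance weight in [0, 1]


class TreeBuilder(HTMLParser):
    """Builds a simplified DOM tree from well-formed HTML."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        # ASSUMPTION: `data-visual` is a made-up attribute; children
        # inherit the weight of their parent when it is absent.
        visual = float(a.get("data-visual") or self.stack[-1].visual)
        node = Node(tag, attrs=a, visual=visual)
        self.stack[-1].children.append(node)
        self.stack.append(node)

    def handle_endtag(self, tag):
        if len(self.stack) > 1 and self.stack[-1].tag == tag:
            self.stack.pop()

    def handle_data(self, data):
        self.stack[-1].text += data.strip()


def walk(node):
    yield node
    for child in node.children:
        yield from walk(child)


def char_count(node):
    return sum(len(n.text) for n in walk(node))


def tag_count(node):
    return sum(1 for _ in walk(node))


def hybrid_density(node):
    # ASSUMPTION: textual importance = characters per tag,
    # combined with visual importance by multiplication.
    return node.visual * char_count(node) / tag_count(node)


def density_sum(node):
    # Sum of the hybrid densities of the node's direct children.
    return sum(hybrid_density(child) for child in node.children)


def extract_main_block(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    # ASSUMPTION: the informative block is the node with maximum density sum.
    return max(walk(builder.root), key=density_sum)


def text_of(node):
    return " ".join(n.text for n in walk(node) if n.text)


PAGE = (
    '<body>'
    '<div id="nav" data-visual="0.2"><a>Home</a><a>News</a></div>'
    '<div id="main" data-visual="1.0">'
    '<p>Content extraction keeps the article text.</p>'
    '<p>Noisy blocks such as menus are discarded.</p></div>'
    '<div id="footer" data-visual="0.1"><a>Terms</a></div>'
    '</body>'
)

best = extract_main_block(PAGE)
print(best.attrs.get("id"), "->", text_of(best))
```

In this toy page the navigation and footer blocks score low because they hold little text per tag and carry a small visual weight, so the `main` block is selected. A real implementation would need rendered layout data for the visual term and a more robust HTML parser.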