International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 4 - Number 11
Year of Publication: 2010
Authors: Amir Masoud Rahmani, Mir Mohsen Pedram, Mohsen Asfia |
10.5120/869-1219 |
Amir Masoud Rahmani, Mir Mohsen Pedram, Mohsen Asfia. Main Content Extraction from Detailed Web Pages. International Journal of Computer Applications. 4, 11 (August 2010), 18-21. DOI=10.5120/869-1219
Detailed web pages on the internet contain information that is not considered primary content, such as advertisements, headers, footers, navigation links, and copyright information. Search engines also prefer not to index information such as comments and reviews as informative content, so an algorithm that extracts only the main content could improve the quality of web page indexing. Almost all previously proposed algorithms are tag dependent, meaning they can only look for primary content among specific tags such as <TABLE> or <DIV>. The algorithm in this paper simulates a user's visit to a web page and how the user finds the position of the main content block on the page. The proposed method is tag independent and works in two phases. First, it transforms the DOM tree obtained from the input HTML detailed web page into a block tree based on visual representation and DOM structure, such that every node carries a specification vector. Then it traverses the resulting small block tree to find the main block, that is, the node whose computed value, derived from its specification vector, dominates the values of the other block nodes. The method has no learning phase and can find informative content on any arbitrary input detailed web page. It has been tested on a large variety of websites and, as we will show, it achieves better precision and recall than the compared method, K-FE.
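The second phase described in the abstract, traversing a block tree and picking the node with the dominant value, can be illustrated with a minimal sketch. The abstract does not state what the specification vector contains or how the value is computed, so the features (`text_len`, `link_text_len`, `area`) and the `score` formula below are assumptions chosen only for illustration, not the authors' method. The block tree is built by hand here instead of being derived from a real DOM.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Block:
    """A block-tree node; its three numeric fields are a hypothetical specification vector."""
    name: str
    text_len: int = 0        # characters of visible text in this block (assumed feature)
    link_text_len: int = 0   # characters of that text inside links (assumed feature)
    area: float = 0.0        # fraction of the rendered page covered, 0..1 (assumed feature)
    children: List["Block"] = field(default_factory=list)


def score(block: Block) -> float:
    """Hypothetical value: favour large, text-rich blocks with a low share of link text."""
    if block.text_len == 0:
        return 0.0
    link_density = block.link_text_len / block.text_len
    return block.text_len * (1.0 - link_density) * block.area


def find_main_block(root: Block) -> Block:
    """Traverse the whole block tree and return the node with the highest score."""
    best = root
    stack = [root]
    while stack:
        node = stack.pop()
        if score(node) > score(best):
            best = node
        stack.extend(node.children)
    return best


# A hand-built block tree standing in for the output of the first phase.
page = Block("page", children=[
    Block("header", text_len=80, link_text_len=60, area=0.10),
    Block("sidebar", text_len=300, link_text_len=280, area=0.15),
    Block("article", text_len=2400, link_text_len=120, area=0.55),
    Block("footer", text_len=120, link_text_len=90, area=0.05),
])
main = find_main_block(page)
print(main.name)
```

No tag names are inspected anywhere in the traversal, which mirrors the tag-independent idea: only the per-node vector decides which block wins. A real implementation would need the first phase to compute these vectors from the rendered page.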