International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 15 - Number 5
Year of Publication: 2011
Authors: Neha Gupta, Dr. Saba Hilal
DOI: 10.5120/1945-2601
Neha Gupta, Dr. Saba Hilal. A Heuristic Approach for Web Content Extraction. International Journal of Computer Applications. 15, 5 (February 2011), 20-24. DOI=10.5120/1945-2601
People today depend on the internet in daily life, and almost anything can be searched for online. Web pages usually contain a large amount of information that does not interest the user because it is not part of the page's main content. Extracting the main content of a web page requires data mining techniques. Considerable research has already been done in this field, but current automatic techniques are unsatisfactory because their output does not match the user's query. In this paper, we present an automatic approach that extracts the main content of a web page, using a tag tree and heuristics to filter out the clutter and display the main content. Experimental results show that the technique presented in this paper outperforms existing techniques dramatically.
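The abstract does not spell out the heuristics themselves, so the following is only a minimal sketch of the general idea of building a tag tree and filtering clutter, not the authors' method. Everything in it is an assumption made for illustration: the tag sets `CLUTTER_TAGS` and `CANDIDATE_TAGS`, the link-density scoring rule in `score`, and the helper names `TagTreeBuilder`, `texts`, `candidates`, and `extract_main_content`. It uses only Python's standard `html.parser` module.

```python
from html.parser import HTMLParser

# Assumed tag sets for illustration; the paper's actual heuristics may differ.
CLUTTER_TAGS = {"script", "style", "nav", "header", "footer", "aside", "form"}
CANDIDATE_TAGS = {"div", "article", "section", "main", "td"}
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class Node:
    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []  # child Nodes and text strings, in document order


class TagTreeBuilder(HTMLParser):
    """Builds a simple tag tree from HTML."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:  # void elements never get a closing tag
            self.current = node

    def handle_endtag(self, tag):
        # Close the nearest open node with this tag; ignore stray closers.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.current.children.append(text)


def texts(node, links_only=False, inside_link=False):
    """Return text fragments under node in document order, skipping clutter."""
    if node.tag in CLUTTER_TAGS:
        return []
    inside_link = inside_link or node.tag == "a"
    found = []
    for child in node.children:
        if isinstance(child, str):
            if inside_link or not links_only:
                found.append(child)
        else:
            found.extend(texts(child, links_only, inside_link))
    return found


def score(node):
    # Assumed heuristic: plain text counts for a block, link text against it,
    # so link-heavy blocks such as menus score low.
    total = sum(len(t) for t in texts(node))
    linked = sum(len(t) for t in texts(node, links_only=True))
    return total - 2 * linked


def candidates(node, depth=0):
    """Yield (node, depth) for each candidate block outside clutter subtrees."""
    if node.tag in CLUTTER_TAGS:
        return
    if node.tag in CANDIDATE_TAGS:
        yield node, depth
    for child in node.children:
        if not isinstance(child, str):
            yield from candidates(child, depth + 1)


def extract_main_content(html):
    builder = TagTreeBuilder()
    builder.feed(html)
    builder.close()
    ranked = [(score(n), d, n) for n, d in candidates(builder.root)]
    if not ranked:
        return " ".join(texts(builder.root))
    # Highest score wins; ties go to the deeper (more specific) block.
    best = max(ranked, key=lambda item: (item[0], item[1]))[2]
    return " ".join(texts(best))
```

Calling `extract_main_content(page_html)` on a page with a link menu, an article block, and a footer would be expected to return the article block's text. Scoring on link density is one common choice for this kind of filter; a real system would likely combine several such signals.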