Emerging Technology Trends on Advanced Engineering Research - 2012
Foundation of Computer Science USA
ICETT - Number 3
January 2012
Authors: Teena Merin Thomas, V. Vidhya
Teena Merin Thomas, V. Vidhya. A Novel Approach for Automatic Data Extraction from Heterogeneous Web Pages. Emerging Technology Trends on Advanced Engineering Research - 2012. ICETT, 3 (January 2012), 24-28.
The World Wide Web is a vast and rapidly growing source of information. Web pages combine unique data with template material that is repeated across multiple pages to make publishing more productive. Template detection has become an attractive technique because unknown templates degrade the performance of web applications by introducing irrelevant terms. In the proposed approach, web pages are clustered with an agglomerative clustering algorithm based on the similarity of their templates. The unknown number of clusters and the partitioning of the web pages are handled using Rissanen's Minimum Description Length principle. Wrappers are generated for the clustered heterogeneous web pages, and the data encoded in the pages are extracted automatically. The proposed approach thus lets users access web data quickly and easily, with better effectiveness and scalability.
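
The clustering step described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes each page has already been reduced to a set of template features (here, hypothetical tag-path strings), measures template similarity with the Jaccard coefficient, and stops merging at a fixed similarity threshold instead of applying the Minimum Description Length criterion. The names `jaccard`, `cluster_pages`, and `threshold` are illustrative only.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two sets; defined as 1.0 when both are empty."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def cluster_pages(pages, threshold=0.5):
    """Agglomeratively cluster pages by the similarity of their templates.

    pages: dict mapping a page id to a set of template features.
    Returns a list of (member_ids, shared_template) tuples.

    Assumption: a fixed threshold decides when to stop merging. The
    abstract uses the MDL principle for this decision instead.
    """
    # Start with one cluster per page; its template is the page's feature set.
    clusters = [([pid], set(feats)) for pid, feats in pages.items()]

    while len(clusters) > 1:
        # Find the pair of clusters whose templates are most similar.
        i, j = max(
            combinations(range(len(clusters)), 2),
            key=lambda ij: jaccard(clusters[ij[0]][1], clusters[ij[1]][1]),
        )
        if jaccard(clusters[i][1], clusters[j][1]) < threshold:
            break  # no remaining pair is similar enough to merge

        # The merged cluster keeps only the features shared by both templates.
        merged = (clusters[i][0] + clusters[j][0], clusters[i][1] & clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)

    return clusters


# Hypothetical input: two product pages and two blog pages.
pages = {
    "a": {"body/div.header", "body/table.product", "body/div.footer"},
    "b": {"body/div.header", "body/table.product", "body/div.footer", "body/div.ad"},
    "c": {"body/nav", "body/article.post"},
    "d": {"body/nav", "body/article.post", "body/aside"},
}
result = cluster_pages(pages)
for members, template in result:
    print(sorted(members), sorted(template))
```

In this sketch the features shared by a cluster play the role of its template. A wrapper for the cluster could then treat everything outside that shared set as page-specific data to extract.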