Research Article

A Novel Approach for Automatic Data Extraction from Heterogeneous Web Pages

Published on January 2012 by Teena Merin Thomas, V. Vidhya
Emerging Technology Trends on Advanced Engineering Research - 2012
Foundation of Computer Science USA
ICETT - Number 3
January 2012
Authors: Teena Merin Thomas, V. Vidhya

Teena Merin Thomas, V. Vidhya. A Novel Approach for Automatic Data Extraction from Heterogeneous Web Pages. Emerging Technology Trends on Advanced Engineering Research - 2012. ICETT, 3 (January 2012), 24-28.

@article{
author = { Teena Merin Thomas, V. Vidhya },
title = { A Novel Approach for Automatic Data Extraction from Heterogeneous Web Pages },
journal = { Emerging Technology Trends on Advanced Engineering Research - 2012 },
issue_date = { January 2012 },
volume = { ICETT },
number = { 3 },
month = { January },
year = { 2012 },
issn = { 0975-8887 },
pages = { 24-28 },
numpages = 5,
url = { /proceedings/icett/number3/9845-1025/ },
publisher = {Foundation of Computer Science (FCS), NY, USA},
address = {New York, USA}
}
%0 Proceeding Article
%1 Emerging Technology Trends on Advanced Engineering Research - 2012
%A Teena Merin Thomas
%A V. Vidhya
%T A Novel Approach for Automatic Data Extraction from Heterogeneous Web Pages
%J Emerging Technology Trends on Advanced Engineering Research - 2012
%@ 0975-8887
%V ICETT
%N 3
%P 24-28
%D 2012
%I International Journal of Computer Applications
Abstract

The World Wide Web is a vast and rapidly growing source of information. Web pages combine unique data with template material that is repeated across multiple pages to improve publishing productivity. Template detection has become an attractive technique because unknown templates degrade the performance of web applications by introducing irrelevant terms. In the proposed approach, web pages are clustered with an agglomerative clustering algorithm based on the similarity of their templates. The unknown number of clusters and the partitioning of the web pages are handled with Rissanen's Minimum Description Length (MDL) principle. Wrappers are then generated for the clustered heterogeneous web pages, and the data encoded in the pages are extracted automatically. The proposed approach thus lets users access web page data quickly and easily, with improved effectiveness and scalability.
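
The clustering step described in the abstract can be illustrated with a minimal sketch. It is not the authors' algorithm: it assumes each page is already reduced to a set of path strings, treats a cluster's template as the paths shared by all of its pages, and uses a deliberately simplified description length (template size plus the leftover paths of each page). The names `cluster_cost` and `mdl_agglomerative` and the example paths are hypothetical.

```python
from itertools import combinations


def cluster_cost(pages):
    """Simplified description length of one cluster (an assumption, not the paper's cost).

    The template is the set of paths shared by every page in the cluster;
    each page then pays for the paths that the template does not cover.
    """
    template = frozenset.intersection(*pages)
    return len(template) + sum(len(p - template) for p in pages)


def mdl_agglomerative(pages):
    """Greedy agglomerative clustering that stops when no merge shortens the description."""
    pages = [frozenset(p) for p in pages]
    clusters = [[i] for i in range(len(pages))]  # start with one cluster per page

    def cost(cluster):
        return cluster_cost([pages[i] for i in cluster])

    while len(clusters) > 1:
        best = None
        for a, b in combinations(range(len(clusters)), 2):
            gain = cost(clusters[a]) + cost(clusters[b]) - cost(clusters[a] + clusters[b])
            if gain > 0 and (best is None or gain > best[0]):
                best = (gain, a, b)
        if best is None:  # no merge reduces the total cost, so the cluster count is settled
            break
        _, a, b = best
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return clusters


# Hypothetical pages: three built from one template, two from another.
example_pages = [
    {"html/body/div.header", "html/body/div.nav", "html/body/div.item", "text:shoes"},
    {"html/body/div.header", "html/body/div.nav", "html/body/div.item", "text:hats"},
    {"html/body/div.header", "html/body/div.nav", "html/body/div.item", "text:bags"},
    {"html/body/table.news", "html/body/table.ads", "html/body/p.story", "text:rain"},
    {"html/body/table.news", "html/body/table.ads", "html/body/p.story", "text:snow"},
]
print(sorted(sorted(c) for c in mdl_agglomerative(example_pages)))
# prints [[0, 1, 2], [3, 4]]
```

Because merging stops as soon as no merge lowers the total cost, the number of clusters does not have to be given in advance. Under the same assumptions, a wrapper for a cluster could extract each page's data as the paths left over after removing the cluster's template.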

References
  1. Abdur Chowdhury, Ling Ma and Nazli Goharian 2008, Automatic Data Extraction from Template Generated Web Pages, Journal of Software, vol. 19, pp. 209-223.
  2. Arasu A. and Garcia-Molina H. 2003, Extracting Structured Data from Web Pages, in Proceedings of the ACM SIGMOD International Conference on Management of Data, San Diego, USA, pp. 337-348.
  3. Cavalcanti J., da Silva A., de Moura E., Freire J., Pinto N. and Vieira K. 2006, A Fast and Robust Method for Web Page Template Detection and Removal, in Proceedings of the 15th ACM International Conference on Information and Knowledge Management, Virginia, USA, pp. 258-267.
  4. Gibson D., Punera K. and Tomkins A. 2005, The Volume and Evolution of Web Page Templates, in Proceedings of the 14th International Conference on World Wide Web, Chiba, Japan, pp. 830-838.
  5. Golgher P., Laender A., Reis D. and Silva A. 2004, Automatic Web News Extraction from Tree Edit Distance, in Proceedings of the 13th International Conference on World Wide Web, New York, USA, pp. 502-511.
  6. Kim C. and Shim K. 2010, TEXT: Automatic Template Extraction from Heterogeneous Web Pages, IEEE Transactions on Knowledge and Data Engineering, vol. 23, no. 4, pp. 612-626.
  7. Liu B., Grossman R. and Zhai Y. 2003, Mining Data Records in Web Pages, in Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Washington, USA, pp. 601-606.
  8. Merialdo P., Missier P. and Crescenzi V. 2005, Clustering Web Pages Based on Their Structure, IEEE Transactions on Data and Knowledge Engineering, vol. 54, no. 3, pp. 279-299.
  9. Rissanen J. 1978, Modeling by Shortest Data Description, Automatica, vol. 14, pp. 465-471.
  10. Song R., Wen J., Wu D. and Zheng S. 2007, Joint Optimization of Wrapper Generation and Template Detection, in Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, California, USA, pp. 894-902.
  11. Hongkun Zhao and Weiyi Meng 2005, Fully Automatic Wrapper Generation for Search Engines, in Proceedings of the 14th International Conference on World Wide Web, New York, USA, pp. 66-75.
  12. HTMLParser: http://htmlparser.sourceforge.net
Index Terms

Computer Science
Information Sciences

Keywords

Wrap_match, MDL, Clustering, Essential Paths