A Novel Approach for Automatic Data Extraction from Heterogeneous Web Pages

Research Article

Published on January 2012 by Teena Merin Thomas, V. Vidhya

Emerging Technology Trends on Advanced Engineering Research - 2012

Foundation of Computer Science USA

ICETT - Number 3

January 2012

Authors: Teena Merin Thomas, V. Vidhya

Teena Merin Thomas, V. Vidhya . A Novel Approach for Automatic Data Extraction from Heterogeneous Web Pages. Emerging Technology Trends on Advanced Engineering Research - 2012. ICETT, 3 (January 2012), 24-28.

@article{
author = { Teena Merin Thomas, V. Vidhya },
title = { A Novel Approach for Automatic Data Extraction from Heterogeneous Web Pages },
journal = { Emerging Technology Trends on Advanced Engineering Research - 2012 },
issue_date = { January 2012 },
volume = { ICETT },
number = { 3 },
month = { January },
year = { 2012 },
issn = { 0975-8887 },
pages = { 24-28 },
numpages = { 5 },
url = { /proceedings/icett/number3/9845-1025/ },
publisher = {Foundation of Computer Science (FCS), NY, USA},
address = {New York, USA}
}

%0 Proceeding Article
%1 Emerging Technology Trends on Advanced Engineering Research - 2012
%A Teena Merin Thomas
%A V. Vidhya
%T A Novel Approach for Automatic Data Extraction from Heterogeneous Web Pages
%J Emerging Technology Trends on Advanced Engineering Research - 2012
%@ 0975-8887
%V ICETT
%N 3
%P 24-28
%D 2012
%I International Journal of Computer Applications

Abstract

The World Wide Web is a vast and rapidly growing source of information. Web pages contain a combination of unique data and template material, which is shared across multiple pages to achieve high publishing productivity. Template detection has become an attractive technique because unknown templates degrade the performance of web applications through the irrelevant terms they contain. The web pages are clustered using an agglomerative clustering algorithm based on the similarity of their templates. The unknown number of web pages and the partitioning of web pages are handled with Rissanen's Minimum Description Length (MDL) principle. Wrappers are generated for the clustered heterogeneous web pages, and the data encoded in the web pages is extracted automatically. Hence, the proposed approach for automatic data extraction lets web page users access the data quickly and easily, with better effectiveness and scalability.
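As a rough illustration of the clustering idea summarized in the abstract, the following Python sketch merges pages agglomeratively and stops when no merge shortens a simple MDL-style description length. It is not the authors' algorithm. The page representation (a set of path strings), the majority rule in `template_of`, the cost function `description_length`, and all function names are assumptions made for this example only.

```python
from itertools import combinations


def template_of(cluster):
    """Assumed template rule: paths found in more than half of the pages."""
    counts = {}
    for page in cluster:
        for path in page:
            counts[path] = counts.get(path, 0) + 1
    return {path for path, n in counts.items() if 2 * n > len(cluster)}


def description_length(cluster):
    """Toy MDL cost: template size plus each page's deviation from it."""
    template = template_of(cluster)
    return len(template) + sum(len(page ^ template) for page in cluster)


def cluster_pages(pages):
    """Greedy agglomerative clustering driven by the toy MDL cost.

    Starts with one cluster per page and repeatedly merges the pair of
    clusters whose merge reduces the total cost the most. Stops when no
    merge reduces the cost, so the number of clusters is not fixed ahead.
    """
    clusters = [[frozenset(page)] for page in pages]
    while len(clusters) > 1:
        best_gain, best_pair = 0, None
        for i, j in combinations(range(len(clusters)), 2):
            separate = description_length(clusters[i]) + description_length(clusters[j])
            gain = separate - description_length(clusters[i] + clusters[j])
            if gain > best_gain:
                best_gain, best_pair = gain, (i, j)
        if best_pair is None:
            break  # no merge shortens the description
        i, j = best_pair
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters


def extract_data(page, template):
    """Treat the non-template paths of a page as its unique data."""
    return set(page) - set(template)
```

Once pages are grouped, `template_of` on a cluster gives a stand-in for the shared template, and `extract_data` returns what is left of a page after the template paths are removed. A real wrapper would work on the parsed document structure rather than on flat path sets.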

References

Abdur Chowdury, Ling Ma and Nazli Goharian 2008, Automatic Data Extraction from Template Generated Web Pages, Journal of Software, vol. 19, pp. 209-223.
Arasu A. and Garcia-Molina H. 2003, Extracting Structured Data from Web Pages, in Proceedings of the ACM SIGMOD International Conference on Management of Data, San Diego, USA, pp. 337-348.
Cavalcanti J., da Silva A., de Moura E., Freire J., Pinto N. and Vieira K. 2006, A Fast and Robust Method for Web Page Template Detection and Removal, in Proceedings of the 15th ACM International Conference on Information and Knowledge Management, Virginia, USA, pp. 258-267.
Gibson D., Punera K. and Tomkins A. 2005, The Volume and Evolution of Web Page Templates, in Proceedings of the 14th International Conference on World Wide Web, Chiba, Japan, pp. 830-838.
Golgher P., Laender A., Reis D. and Silva A. 2004, Automatic Web News Extraction from Tree Edit Distance, in Proceedings of the 13th International Conference on World Wide Web, New York, USA, pp. 502-511.
Kim C. and Shim K. 2010, TEXT: Automatic Template Extraction from Heterogeneous Web Pages, IEEE Transactions on Knowledge and Data Engineering, vol. 23, no. 4, pp. 612-626.
Liu B., Grossman R. and Zhai Y. 2003, Mining Data Records in Web Pages, in Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Washington, USA, pp. 601-606.
Merialdo P., Missier P. and Crescenzi V. 2005, Clustering Web Pages Based on Their Structure, IEEE Transactions on Data and Knowledge Engineering, vol. 54, no. 3, pp. 279-299.
Rissanen J. 1978, Modeling by Shortest Data Description, Automatica, vol. 14, pp. 465-471.
Song R., Wen J., Wu D. and Zheng S. 2007, Joint Optimization of Wrapper Generation and Template Detection, in Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, California, USA, pp. 894-902.
Hongkun Zhao and Weiyi Meng 2005, Fully Automatic Wrapper Generation for Search Engines, in Proceedings of the 14th International Conference on World Wide Web, New York, USA, pp. 66-75.
HTMLParser: http://htmlparser.sourceforge.net

Index Terms

Computer Science

Information Sciences

Keywords

Wrap_match, MDL, Clustering, Essential Paths