Research Article

Data Record Extraction using Tag Tree Comparison

by Aleem Ansari, Hemlata Vasishtha
International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 117 - Number 11
Year of Publication: 2015
10.5120/20599-3170

Aleem Ansari, Hemlata Vasishtha. Data Record Extraction using Tag Tree Comparison. International Journal of Computer Applications. 117, 11 (May 2015), 20-24. DOI=10.5120/20599-3170

@article{ 10.5120/20599-3170,
author = { Aleem Ansari and Hemlata Vasishtha },
title = { Data Record Extraction using Tag Tree Comparison },
journal = { International Journal of Computer Applications },
issue_date = { May 2015 },
volume = { 117 },
number = { 11 },
month = { May },
year = { 2015 },
issn = { 0975-8887 },
pages = { 20-24 },
numpages = {9},
url = { https://ijcaonline.org/archives/volume117/number11/20599-3170/ },
doi = { 10.5120/20599-3170 },
publisher = {Foundation of Computer Science (FCS), NY, USA},
address = {New York, USA}
}
%0 Journal Article
%1 2024-02-06T22:59:07.748549+05:30
%A Aleem Ansari
%A Hemlata Vasishtha
%T Data Record Extraction using Tag Tree Comparison
%J International Journal of Computer Applications
%@ 0975-8887
%V 117
%N 11
%P 20-24
%D 2015
%I Foundation of Computer Science (FCS), NY, USA
Abstract

This paper presents a robust unsupervised approach for extracting data records from dynamic web pages using tag tree comparison. Extraction proceeds in four steps. First, we download the web pages of interest to our system. Next, we construct DOM trees for those pages using a parser. We then compare two or more pages to eliminate noisy, unwanted content such as headers, menu bars, navigation bars, and advertisements, and to locate the region of interest, called the data region or object region. Finally, we traverse the subtrees of the data region to detect individual data records and write them to an XML file. The main contribution of this paper is a fully unsupervised approach for extracting structured as well as semi-structured data records from web pages. The proposed system can extract data records precisely from many commercial websites. It can therefore serve as a source for integrating information from various web sources, which can be used to provide value-added services such as comparative shopping, market intelligence, meta-querying, and search.
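The steps in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the comparison or record-detection rules, so the sketch assumes two simple heuristics. Subtrees that are identical (tags and text) on two pages from the same site are treated as noise, and the data region is taken to be the node with the most children sharing one tag structure. All names (`Node`, `TreeBuilder`, `extract_records`, `records_to_xml`) and the sample pages are hypothetical, and the download step is replaced by inline HTML strings.

```python
from collections import Counter
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}  # tags with no end tag

class Node:
    """One element of the tag tree."""
    def __init__(self, tag, parent=None):
        self.tag, self.parent = tag, parent
        self.children, self.text = [], []

    def signature(self):
        # Tag structure only: records of the same kind share a signature.
        return (self.tag, tuple(c.signature() for c in self.children))

    def fingerprint(self):
        # Structure plus text: equal fingerprints on two pages suggest noise.
        return (self.tag, tuple(self.text),
                tuple(c.fingerprint() for c in self.children))

    def all_text(self):
        return self.text + [t for c in self.children for t in c.all_text()]

class TreeBuilder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.root = self.current = Node("document")

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Close the nearest open ancestor with this tag, if there is one.
        node = self.current
        while node is not None and node.tag != tag:
            node = node.parent
        if node is not None and node.parent is not None:
            self.current = node.parent

    def handle_data(self, data):
        if data.strip():
            self.current.text.append(data.strip())

def build_tree(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root

def collect_fingerprints(node, out):
    out.add(node.fingerprint())
    for child in node.children:
        collect_fingerprints(child, out)
    return out

def prune_shared(node, shared):
    # Drop subtrees that also occur, unchanged, on the comparison page.
    node.children = [c for c in node.children if c.fingerprint() not in shared]
    for child in node.children:
        prune_shared(child, shared)

def find_data_region(root):
    # Assumed heuristic: the node with the most structurally equal children.
    best_count, best_node, best_sig = 1, None, None
    stack = [root]
    while stack:
        node = stack.pop()
        stack.extend(node.children)
        if node.children:
            counts = Counter(c.signature() for c in node.children)
            sig, count = counts.most_common(1)[0]
            if count > best_count:
                best_count, best_node, best_sig = count, node, sig
    return best_node, best_sig

def extract_records(page_html, other_html):
    tree = build_tree(page_html)
    prune_shared(tree, collect_fingerprints(build_tree(other_html), set()))
    region, sig = find_data_region(tree)
    if region is None:
        return []
    return [c for c in region.children if c.signature() == sig]

def records_to_xml(records):
    root = ET.Element("records")
    for record in records:
        item = ET.SubElement(root, "record")
        for value in record.all_text():
            ET.SubElement(item, "field").text = value
    return ET.tostring(root, encoding="unicode")

# Two hypothetical pages that share a navigation bar and a footer.
TEMPLATE = ('<html><body><div id="nav"><a>Home</a><a>Deals</a></div>'
            '<ul>{}</ul><p>Copyright Example Store</p></body></html>')
PAGE_A = TEMPLATE.format('<li><b>Pen</b><i>1.50</i></li>'
                         '<li><b>Ink</b><i>4.00</i></li>'
                         '<li><b>Pad</b><i>2.25</i></li>')
PAGE_B = TEMPLATE.format('<li><b>Cup</b><i>9.99</i></li>'
                         '<li><b>Mug</b><i>7.49</i></li>')

records = extract_records(PAGE_A, PAGE_B)
print(records_to_xml(records))
```

Both heuristics are deliberately crude. A value that happens to appear on both pages would be pruned as noise, and a record with many same-tag fields could outscore the real data region, so a practical system would need more careful tree matching than this sketch provides.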

Index Terms

Computer Science
Information Sciences

Keywords

Data Record Detection, Information Extraction, Automatic Extraction, Web Mining, Semi-Structured Data, Wrapper Generation