Research Article

Structure based Data Extraction from Hidden Web Sources: A Review

by Anuradha, A.K.Sharma
International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 25 - Number 3
Year of Publication: 2011
Authors: Anuradha, A.K.Sharma
10.5120/3010-4060

Anuradha, A.K.Sharma. Structure based Data Extraction from Hidden Web Sources: A Review. International Journal of Computer Applications. 25, 3 (July 2011), 32-37. DOI=10.5120/3010-4060

@article{ 10.5120/3010-4060,
author = { Anuradha and A.K.Sharma },
title = { Structure based Data Extraction from Hidden Web Sources: A Review },
journal = { International Journal of Computer Applications },
issue_date = { July 2011 },
volume = { 25 },
number = { 3 },
month = { July },
year = { 2011 },
issn = { 0975-8887 },
pages = { 32-37 },
numpages = {6},
url = { https://ijcaonline.org/archives/volume25/number3/3010-4060/ },
doi = { 10.5120/3010-4060 },
publisher = {Foundation of Computer Science (FCS), NY, USA},
address = {New York, USA}
}
%0 Journal Article
%1 2024-02-06T20:10:49.594388+05:30
%A Anuradha
%A A.K.Sharma
%T Structure based Data Extraction from Hidden Web Sources: A Review
%J International Journal of Computer Applications
%@ 0975-8887
%V 25
%N 3
%P 32-37
%D 2011
%I Foundation of Computer Science (FCS), NY, USA
Abstract

Many semi-automatic and automatic techniques have been proposed for extracting data from the web pages of Hidden Web sources, based on the structure and tags of HTML documents. These techniques include machine learning and schema-matching approaches to the data extraction problem. This paper reviews the research that has been done on data extraction from Hidden Web sources and discusses the advantages and disadvantages of the existing techniques.

Index Terms

Computer Science
Information Sciences

Keywords

Surface Web, Hidden Web, Information Extraction