Research Article

Vide: A Vision-based Approach for Deep Web Data Extraction

Published in August 2011 by Snehal M. Shewale, Trupti S. Patil
National Technical Symposium on Advancements in Computing Technologies
Foundation of Computer Science USA
NTSACT - Number 5
August 2011
Authors: Snehal M. Shewale, Trupti S. Patil

Snehal M. Shewale, Trupti S. Patil. Vide: A Vision-based Approach for Deep Web Data Extraction. National Technical Symposium on Advancements in Computing Technologies. NTSACT, 5 (August 2011), 34-40.

@article{
author = { Snehal M. Shewale and Trupti S. Patil },
title = { Vide: A Vision-based Approach for Deep Web Data Extraction },
journal = { National Technical Symposium on Advancements in Computing Technologies },
issue_date = { August 2011 },
volume = { NTSACT },
number = { 5 },
month = { August },
year = { 2011 },
issn = { 0975-8887 },
pages = { 34-40 },
numpages = 7,
url = { /proceedings/ntsact/number5/3213-ntst034/ },
publisher = {Foundation of Computer Science (FCS), NY, USA},
address = {New York, USA}
}
%0 Proceeding Article
%1 National Technical Symposium on Advancements in Computing Technologies
%A Snehal M. Shewale
%A Trupti S. Patil
%T Vide: A Vision-based Approach for Deep Web Data Extraction
%J National Technical Symposium on Advancements in Computing Technologies
%@ 0975-8887
%V NTSACT
%N 5
%P 34-40
%D 2011
%I International Journal of Computer Applications
Abstract

The data available on the Web is voluminous and heterogeneous. The deep Web contains orders of magnitude more information, and more valuable information, than the surface Web. Deep Web content is accessed through queries submitted to Web databases, and the returned data records are wrapped in dynamically generated Web pages. Many techniques have been proposed for extracting these records, but all of them depend on the Web page programming language. In this paper we review ViDE, a novel vision-based approach that is independent of the Web page programming language. ViDE uses the visual features of deep Web pages to perform deep Web data extraction, including data record extraction and data item extraction. Our experiments on a large set of Web databases show that the proposed vision-based approach is highly effective for deep Web data extraction.
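
The idea of extracting records from visual layout rather than from page markup can be illustrated with a minimal sketch. It is not the ViDE algorithm itself: the `Block` type, the `group_records` and `record_items` helpers, the 10-pixel `gap` threshold, and the sample data are all assumptions made for illustration. The sketch assumes a renderer has already produced text blocks with bounding boxes, starts a new data record wherever the vertical whitespace between blocks exceeds the threshold, and then reads each record's data items in reading order.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Block:
    """A rendered text block with its on-screen bounding box (pixels). Hypothetical type."""
    text: str
    x: float
    y: float
    width: float
    height: float


def group_records(blocks: List[Block], gap: float = 10.0) -> List[List[Block]]:
    """Split blocks into records wherever the vertical whitespace exceeds `gap`."""
    records: List[List[Block]] = []
    bottom: Optional[float] = None  # lowest edge seen so far in the current record
    for block in sorted(blocks, key=lambda b: (b.y, b.x)):
        if bottom is None or block.y - bottom > gap:
            records.append([])  # large gap: start a new record
            bottom = block.y + block.height
        else:
            bottom = max(bottom, block.y + block.height)
        records[-1].append(block)
    return records


def record_items(record: List[Block]) -> List[str]:
    """Read a record's data items top-to-bottom, then left-to-right."""
    return [b.text for b in sorted(record, key=lambda b: (b.y, b.x))]


# Invented sample layout: two result entries separated by vertical whitespace.
sample = [
    Block("Book A", 20, 100, 200, 16),
    Block("$10", 240, 100, 40, 16),
    Block("Author X", 20, 118, 200, 14),
    Block("Book B", 20, 160, 200, 16),
    Block("$12", 240, 160, 40, 16),
]

for record in group_records(sample):
    print(record_items(record))
```

Because the grouping uses only positions and sizes, the same code would apply regardless of how the page was generated, which is the property the abstract attributes to vision-based extraction. A real system would need additional visual cues, such as fonts and alignment, that this sketch omits.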

Index Terms

Computer Science
Information Sciences

Keywords

Deep Web, Data Mining, Data Extraction, Visual Features