Vide: A Vision-based Approach for Deep Web Data Extraction

Research Article

Published on August 2011 by Snehal M. Shewale, Trupti S. Patil

National Technical Symposium on Advancements in Computing Technologies

Foundation of Computer Science USA

NTSACT - Number 5

August 2011

Authors: Snehal M. Shewale, Trupti S. Patil

Snehal M. Shewale, Trupti S. Patil. Vide: A Vision-based Approach for Deep Web Data Extraction. National Technical Symposium on Advancements in Computing Technologies. NTSACT, 5 (August 2011), 34-40.

@article{
author = { Snehal M. Shewale and Trupti S. Patil },
title = { Vide: A Vision-based Approach for Deep Web Data Extraction },
journal = { National Technical Symposium on Advancements in Computing Technologies },
issue_date = { August 2011 },
volume = { NTSACT },
number = { 5 },
month = { August },
year = { 2011 },
issn = { 0975-8887 },
pages = { 34-40 },
numpages = { 7 },
url = { /proceedings/ntsact/number5/3213-ntst034/ },
publisher = { Foundation of Computer Science (FCS), NY, USA },
address = { New York, USA }
}

%0 Proceeding Article

%1 National Technical Symposium on Advancements in Computing Technologies

%A Snehal M. Shewale

%A Trupti S. Patil

%T Vide: A Vision-based Approach for Deep Web Data Extraction

%J National Technical Symposium on Advancements in Computing Technologies

%@ 0975-8887

%V NTSACT

%N 5

%P 34-40

%D 2011

%I International Journal of Computer Applications

Abstract

The data available on the Web is voluminous and heterogeneous. The deep Web contains orders of magnitude more information, and more valuable information, than the surface Web. Deep Web contents are accessed by queries submitted to Web databases, and the returned data records are wrapped in dynamically generated Web pages. Many techniques have been proposed for extracting these records, but all of them depend on the Web page programming language. In this paper we review a novel vision-based approach, ViDE, that is independent of the Web page programming language. ViDE uses the visual features of deep Web pages to perform deep Web data extraction, including data record extraction and data item extraction. Our experiments on a large set of Web databases show that the proposed vision-based approach is highly effective for deep Web data extraction.
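To make the idea of using visual features concrete, the following is a minimal sketch, not the algorithm described in the paper. It assumes each rendered page block is available as a bounding box with its visible text, and that items within one data record sit closer together vertically than adjacent records do. The names `Block`, `group_into_records`, and `gap_threshold` are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Block:
    """A rendered page block: bounding box in pixels plus its visible text."""
    x: int
    y: int
    width: int
    height: int
    text: str


def group_into_records(blocks: List[Block], gap_threshold: int) -> List[List[Block]]:
    """Group blocks into records using only their layout.

    Blocks are visited top to bottom. A new record starts whenever the
    vertical gap between a block and the previous one exceeds
    gap_threshold (an assumed, hand-tuned value).
    """
    records: List[List[Block]] = []
    for block in sorted(blocks, key=lambda b: (b.y, b.x)):
        if records:
            prev = records[-1][-1]
            gap = block.y - (prev.y + prev.height)
            if gap <= gap_threshold:
                # Close to the previous block: same record.
                records[-1].append(block)
                continue
        # First block, or a large gap: start a new record.
        records.append([block])
    return records


if __name__ == "__main__":
    page = [
        Block(10, 0, 200, 20, "Title A"),
        Block(10, 22, 200, 15, "Author A"),
        Block(10, 60, 200, 20, "Title B"),
        Block(10, 82, 200, 15, "Author B"),
    ]
    for record in group_into_records(page, gap_threshold=10):
        print([b.text for b in record])
```

The sketch reads no HTML tags at all, which is the sense in which a vision-based method can be independent of the Web page programming language. A real system would also need to locate the data region and align items across records, which this example does not attempt.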


Index Terms

Computer Science

Information Sciences

Keywords

Deep Web, Data Mining, Data Extraction, Visual Features