Research Article

Web Mining: A Brief Survey on Data Extraction Techniques

Published on May 2012 by Vijay Bagdi, Sulabha Patil
National Conference on Recent Trends in Computing
Foundation of Computer Science USA
NCRTC - Number 5
May 2012
Authors: Vijay Bagdi, Sulabha Patil

Vijay Bagdi, Sulabha Patil. Web Mining: A Brief Survey on Data Extraction Techniques. National Conference on Recent Trends in Computing. NCRTC, 5 (May 2012), 21-24.

@article{
author = { Vijay Bagdi, Sulabha Patil },
title = { Web Mining: A Brief Survey on Data Extraction Techniques },
journal = { National Conference on Recent Trends in Computing },
issue_date = { May 2012 },
volume = { NCRTC },
number = { 5 },
month = { May },
year = { 2012 },
issn = { 0975-8887 },
pages = { 21-24 },
numpages = 4,
url = { /proceedings/ncrtc/number5/6548-1038/ },
publisher = {Foundation of Computer Science (FCS), NY, USA},
address = {New York, USA}
}
%0 Proceeding Article
%1 National Conference on Recent Trends in Computing
%A Vijay Bagdi
%A Sulabha Patil
%T Web Mining: A Brief Survey on Data Extraction Techniques
%J National Conference on Recent Trends in Computing
%@ 0975-8887
%V NCRTC
%N 5
%P 21-24
%D 2012
%I International Journal of Computer Applications
Abstract

In the last few years, extracting data from web pages has become a recurring problem. This paper addresses that problem by surveying web data extraction techniques drawn from areas such as natural language processing, languages and grammars, machine learning, information retrieval, and ontologies. As a consequence of these different origins, the techniques offer very distinct features and capabilities, which makes direct comparison difficult.

Index Terms

Computer Science
Information Sciences

Keywords

Ontology, Web Data Extraction, Web Crawler, Road Runner