Research Article

Web Data Extraction

Published in April 2012 by P. A. Chaudhari, R. L. Paikrao
Emerging Trends in Computer Science and Information Technology (ETCSIT2012)
Foundation of Computer Science USA
ETCSIT - Number 4
April 2012
Authors: P. A. Chaudhari, R. L. Paikrao

P. A. Chaudhari, R. L. Paikrao. Web Data Extraction. Emerging Trends in Computer Science and Information Technology (ETCSIT2012). ETCSIT, 4 (April 2012), 13-17.

@article{
author = { P. A. Chaudhari and R. L. Paikrao },
title = { Web Data Extraction },
journal = { Emerging Trends in Computer Science and Information Technology (ETCSIT2012) },
issue_date = { April 2012 },
volume = { ETCSIT },
number = { 4 },
month = { April },
year = { 2012 },
issn = { 0975-8887 },
pages = { 13-17 },
numpages = 5,
url = { /proceedings/etcsit/number4/5984-1027/ },
publisher = {Foundation of Computer Science (FCS), NY, USA},
address = {New York, USA}
}
%0 Proceeding Article
%1 Emerging Trends in Computer Science and Information Technology (ETCSIT2012)
%A P. A. Chaudhari
%A R. L. Paikrao
%T Web Data Extraction
%J Emerging Trends in Computer Science and Information Technology (ETCSIT2012)
%@ 0975-8887
%V ETCSIT
%N 4
%P 13-17
%D 2012
%I International Journal of Computer Applications
Abstract

The Web is a huge reservoir of information, and the data available on it is extremely diverse and abundant. To find specific information, a user has to go through many pages, filter the data, and download the related documents and files. This searching and downloading is time-consuming. Web pages are in unstructured HTML format, so there is a need to convert them into a structured format such as XML or XHTML. We propose an approach for implementing web data extraction and developing a Mashup from HTML web pages. The stages of building a Mashup are Data Retrieval, Data Source Modeling, Data Cleaning/Filtering, Data Integration, and Data Visualization. The data modeling stage renders a Document Object Model (DOM) tree with the help of an HTML parser. Algorithms and rules are then used to analyze the HTML tags and extract the data. Furthermore, our application lets users perform this task without writing a script or program, and without any knowledge of computer programming. The approach will manage multiple servers and ensure that the website always has the latest data. The resulting Mashup will support the decision-making process, which is a prime requirement for success in the corporate world.
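
As a rough illustration of the HTML-to-XML conversion described in the abstract, the following minimal Python sketch reads an HTML table and emits structured XML records. It is not the authors' implementation: it uses the standard library's event-based `html.parser` rather than building a full DOM tree, and the names `TableRowExtractor`, `html_table_to_xml`, `SAMPLE_HTML`, and the `records`/`record` element names are hypothetical, chosen only for this example. It also assumes the first table row holds column headers that are valid XML element names once lowercased.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET


class TableRowExtractor(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # completed rows, each a list of cell strings
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        # A simple tag rule: only table rows and cells are of interest.
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Keep text only while inside a cell; everything else is filtered out.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None and self._row is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None  # discard any cell left open by malformed HTML


def html_table_to_xml(html):
    """Return an XML <records> element built from the table rows in `html`."""
    parser = TableRowExtractor()
    parser.feed(html)
    parser.close()

    root = ET.Element("records")
    if not parser.rows:
        return root

    # Assumption: the first row is the header and names the fields.
    header, *body = parser.rows
    for cells in body:
        record = ET.SubElement(root, "record")
        for name, value in zip(header, cells):
            field = ET.SubElement(record, name.lower().replace(" ", "_"))
            field.text = value
    return root


# Hypothetical input standing in for a retrieved web page.
SAMPLE_HTML = """
<html><body>
<table>
  <tr><th>Product</th><th>Price</th></tr>
  <tr><td>Widget</td><td>9.99</td></tr>
  <tr><td>Gadget</td><td>24.50</td></tr>
</table>
</body></html>
"""

if __name__ == "__main__":
    xml_root = html_table_to_xml(SAMPLE_HTML)
    print(ET.tostring(xml_root, encoding="unicode"))
```

In terms of the Mashup stages, `SAMPLE_HTML` stands in for Data Retrieval, the parser class for Data Source Modeling and Filtering, and the XML output for the structured form that later integration and visualization steps would consume. A real page would need more robust rules, for example for nested tables or missing header rows.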

Index Terms

Computer Science
Information Sciences

Keywords

Web Data Extraction, Making Mashup, Mashup Stages, HTML, XML, DOM Tree