Emerging Trends in Computer Science and Information Technology (ETCSIT2012)
Foundation of Computer Science USA
ETCSIT - Number 4
April 2012
Authors: P. A. Chaudhari, R. L. Paikrao
P. A. Chaudhari, R. L. Paikrao. Web Data Extraction. Emerging Trends in Computer Science and Information Technology (ETCSIT2012). ETCSIT, 4 (April 2012), 13-17.
The Web is a huge reservoir of information, and the data available on it is extremely diverse and abundant. To find specific information, a user has to go through many web pages, filter the data, and download the related documents and files. This searching and downloading is time-consuming. Web pages are published as unstructured HTML, so they need to be converted into a structured format such as XML or XHTML. We propose an approach for implementing web data extraction and developing a Mashup from HTML web pages. The stages of building a Mashup are Data Retrieval, Data Source Modeling, Data Cleaning/Filtering, Data Integration, and Data Visualization. The data source modeling stage builds a Document Object Model (DOM) tree with the help of an HTML parser. Algorithms and rules are then used to analyze the HTML tags and extract the data. Furthermore, our application enables users to perform this task without writing a script or program, and without any knowledge of computer programming. The approach manages multiple servers and ensures that the website always has the latest data. The resulting Mashup supports the decision-making process, which is a primary requirement for success in the corporate world.
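As an illustration of the extraction step described above, the following is a minimal sketch, not the authors' implementation. It assumes the data of interest sits in an HTML table whose first row holds the column names, and it uses only Python's standard `html.parser` and `xml.etree.ElementTree` modules. The names `TableExtractor`, `html_table_to_xml`, and `SAMPLE_HTML`, as well as the `records`/`record`/`field` XML layout, are hypothetical and chosen for this example only. A tag rule (collect the text of every `td` or `th` inside a `tr`) stands in for the paper's extraction rules, and whitespace normalization stands in for the cleaning stage.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET


class TableExtractor(HTMLParser):
    """Collects the text of each <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # completed rows, each a list of cell strings
        self._row = None   # cells of the row being parsed, or None
        self._cell = None  # text fragments of the open cell, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Cleaning step: collapse runs of whitespace and trim the ends.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None  # discard any cell left unclosed in this row


def html_table_to_xml(html):
    """Convert table rows in `html` to a <records> XML element.

    Assumption: the first row holds the column names.
    """
    parser = TableExtractor()
    parser.feed(html)
    parser.close()
    root = ET.Element("records")
    if not parser.rows:
        return root
    header, *body = parser.rows
    for cells in body:
        record = ET.SubElement(root, "record")
        for name, value in zip(header, cells):
            field = ET.SubElement(record, "field", name=name)
            field.text = value
    return root


# Hypothetical input page, invented for this example.
SAMPLE_HTML = """
<html><body>
<table>
  <tr><th>Product</th><th>Price</th></tr>
  <tr><td>Laptop</td><td> 55,000 </td></tr>
  <tr><td>Phone</td><td>20,000</td></tr>
</table>
</body></html>
"""

root = html_table_to_xml(SAMPLE_HTML)
print(ET.tostring(root, encoding="unicode"))
```

The sketch covers only source modeling, extraction, and a trivial cleaning step. Retrieval from live servers, integration of several sources, and visualization are left out, and real pages would likely need more robust rules than a single table pattern.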