Research Article

Comparative Cost Analysis Of Template Extraction from Heterogeneous Web Documents

Published in December 2014 by Jyoti Mhaske
Innovations and Trends in Computer and Communication Engineering
Foundation of Computer Science USA
ITCCE - Number 2
December 2014
Authors: Jyoti Mhaske

Jyoti Mhaske. Comparative Cost Analysis Of Template Extraction from Heterogeneous Web Documents. Innovations and Trends in Computer and Communication Engineering. ITCCE, 2 (December 2014), 16-18.

@article{
author = { Jyoti Mhaske },
title = { Comparative Cost Analysis Of Template Extraction from Heterogeneous Web Documents },
journal = { Innovations and Trends in Computer and Communication Engineering },
issue_date = { December 2014 },
volume = { ITCCE },
number = { 2 },
month = { December },
year = { 2014 },
issn = { 0975-8887 },
pages = { 16-18 },
numpages = 3,
url = { /proceedings/itcce/number2/19048-2013/ },
publisher = {Foundation of Computer Science (FCS), NY, USA},
address = {New York, USA}
}
%0 Proceeding Article
%1 Innovations and Trends in Computer and Communication Engineering
%A Jyoti Mhaske
%T Comparative Cost Analysis Of Template Extraction from Heterogeneous Web Documents
%J Innovations and Trends in Computer and Communication Engineering
%@ 0975-8887
%V ITCCE
%N 2
%P 16-18
%D 2014
%I International Journal of Computer Applications
Abstract

Automatically extracting structured information from unstructured and semi-structured machine-readable documents plays a vital role today, and the Internet is the major source of such information. Most websites populate common templates with content to achieve good publishing productivity. Template detection techniques have recently received a lot of attention as a way to improve the performance of search engines and the clustering and classification of web documents, because irrelevant template terms degrade the performance and accuracy of web applications that process pages by machine. Novel algorithms are therefore useful for extracting templates from a large number of web documents generated from heterogeneous templates. The web documents are clustered based on the similarity of their underlying template structures, so that the template for each cluster is extracted simultaneously.
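
The clustering idea in the abstract can be illustrated with a minimal Python sketch. It is not the algorithm evaluated in the article: it assumes each document is represented by its set of root-to-element tag paths, it substitutes a plain Jaccard similarity with a greedy threshold for whatever cost model the article uses, and the names `tag_paths`, `jaccard`, `cluster_by_structure` and the `threshold` value are hypothetical.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they are not pushed on the stack.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class PathCollector(HTMLParser):
    """Collects the set of root-to-element tag paths of one HTML document."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.paths = set()

    def handle_starttag(self, tag, attrs):
        self.paths.add("/".join(self.stack + [tag]))
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop back to the matching open tag (tolerates unclosed children).
            while self.stack.pop() != tag:
                pass


def tag_paths(html):
    collector = PathCollector()
    collector.feed(html)
    collector.close()
    return frozenset(collector.paths)


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def cluster_by_structure(docs, threshold=0.6):
    """Greedy clustering (assumed stand-in, not the article's method).

    Each document joins the first cluster whose current template is similar
    enough; the template is then narrowed to the paths shared by all members.
    """
    clusters = []
    for index, html in enumerate(docs):
        paths = tag_paths(html)
        for cluster in clusters:
            if jaccard(paths, cluster["template"]) >= threshold:
                cluster["members"].append(index)
                cluster["template"] &= paths
                break
        else:
            clusters.append({"template": set(paths), "members": [index]})
    return clusters
```

Calling `cluster_by_structure` on a list of HTML strings returns one entry per cluster, each holding the member indices and the tag paths common to all members, so clusters and their templates are produced in the same pass.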

Index Terms

Computer Science
Information Sciences

Keywords

Web Template Extraction, Clustering Documents, Minimum Description Length Principle