Extraction of Template using Clustering from Heterogeneous Web Documents

Rashmi D Thakare; Manisha R Patil

Research Article

International Journal of Computer Applications

Foundation of Computer Science (FCS), NY, USA

Volume 119 - Number 11

Year of Publication: 2015

Authors: Rashmi D Thakare, Manisha R Patil

10.5120/21112-3906

Rashmi D Thakare, Manisha R Patil. Extraction of Template using Clustering from Heterogeneous Web Documents. International Journal of Computer Applications. 119, 11 (June 2015), 23-31. DOI=10.5120/21112-3906

@article{ 10.5120/21112-3906,

author = { Rashmi D Thakare, Manisha R Patil },

title = { Extraction of Template using Clustering from Heterogeneous Web Documents },

journal = { International Journal of Computer Applications },

issue_date = { June 2015 },

volume = { 119 },

number = { 11 },

month = { June },

year = { 2015 },

issn = { 0975-8887 },

pages = { 23-31 },

numpages = {9},

url = { https://ijcaonline.org/archives/volume119/number11/21112-3906/ },

doi = { 10.5120/21112-3906 },

publisher = {Foundation of Computer Science (FCS), NY, USA},

address = {New York, USA}

}

%0 Journal Article

%1 2024-02-06T23:03:46.527635+05:30

%A Rashmi D Thakare

%A Manisha R Patil

%T Extraction of Template using Clustering from Heterogeneous Web Documents

%J International Journal of Computer Applications

%@ 0975-8887

%V 119

%N 11

%P 23-31

%D 2015

%I Foundation of Computer Science (FCS), NY, USA

Abstract

In general, websites generate sets of pages from a common template or layout. For example, Google Books lays out details such as author name, book title, and reviews or comments in the same way on all of its book pages, while a database supplies the different values used to generate each page. This paper studies the problem of automatically extracting these database values from web pages without any human input. A template is formally defined, which provides a framework describing how values are inserted into pages. At the core is an extraction algorithm that pulls values out of web pages. The algorithm is trained to generate the template from defined sets of words that occur in common across pages; as a result, the extracted values are semantically similar in most cases. Our focus is on extracting templates from heterogeneous web pages. Because websites contain a large variety of web documents, an unknown number of templates must be managed, and this is achieved by clustering the web documents. Three clustering methods are compared: i) TEXT Minimum Description Length (TEXTMDL), ii) MinHash using the Jaccard coefficient, and iii) MinHash using the Dice coefficient.
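To make the two MinHash variants named in the abstract concrete, the following is a minimal Python sketch and not the authors' implementation. It assumes each page is represented as a set of HTML tag paths (a hypothetical representation chosen for illustration), and all function names, the hash construction, and the signature length of 128 are assumptions. It computes exact Jaccard and Dice coefficients, estimates the Jaccard coefficient from MinHash signatures, and converts that estimate to a Dice estimate using the identity D = 2J / (1 + J).

```python
import hashlib


def jaccard(a, b):
    """Exact Jaccard coefficient |A ∩ B| / |A ∪ B| of two sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def dice(a, b):
    """Exact Dice coefficient 2|A ∩ B| / (|A| + |B|) of two sets."""
    if not a and not b:
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))


def _hash(item, seed):
    # Illustrative seeded hash: one 64-bit value per (seed, item) pair.
    data = f"{seed}:{item!r}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(items, num_hashes=128):
    """MinHash signature: the minimum hash of the set under each seeded hash."""
    if not items:
        raise ValueError("items must be a non-empty set")
    return [min(_hash(x, seed) for x in items) for seed in range(num_hashes)]


def estimate_jaccard(sig_a, sig_b):
    """Fraction of signature positions that agree; estimates the Jaccard coefficient."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)


def jaccard_to_dice(j):
    """Convert a Jaccard value to the corresponding Dice value: D = 2J / (1 + J)."""
    return 2 * j / (1 + j)


# Hypothetical pages, each described by the set of tag paths it contains.
page_a = {"html/body/div/h1", "html/body/div/p", "html/body/ul/li"}
page_b = {"html/body/div/h1", "html/body/div/p", "html/body/table/tr"}

sig_a = minhash_signature(page_a)
sig_b = minhash_signature(page_b)
est_j = estimate_jaccard(sig_a, sig_b)
est_d = jaccard_to_dice(est_j)
```

A clustering step could then group pages whose estimated similarity exceeds a chosen threshold. The threshold and the grouping strategy are left out here because the abstract does not specify them.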

References

A. Arasu and H. Garcia-Molina, Extracting Structured Data from Web Pages. Proc. ACM SIGMOD, 2003.
A. Z. Broder, M. Charikar, A. M. Frieze, and M. Mitzenmacher, Min-Wise Independent Permutations. J. Computer and System Sciences, vol. 60, no. 3, pp. 630-659, 2000.
D. Chakrabarti, R. Kumar, and K. Punera, Page-Level Template Detection via Isotonic Smoothing. Proc. 16th Int'l Conf. World Wide Web (WWW), 2007.
Z. Chen, F. Korn, N. Koudas, and S. Muthukrishnan, Selectivity Estimation for Boolean Queries. Proc. ACM SIGMOD-SIGACT-SIGART Symp. Principles of Database Systems (PODS), 2000.
V. Crescenzi, G. Mecca, and P. Merialdo, Roadrunner: Towards Automatic Data Extraction from Large Web Sites. Proc. 27th Int'l Conf. Very Large Data Bases (VLDB), 2001.
V. Crescenzi, P. Merialdo, and P. Missier, Clustering Web Pages Based on Their Structure. Data and Knowledge Eng., vol. 54, pp. 279-299, 2005.
I. S. Dhillon, S. Mallela, and D. S. Modha, Information-Theoretic Co-Clustering. Proc. ACM SIGKDD, 2003.
D. Gibson, K. Punera, and A. Tomkins, The Volume and Evolution of Web Page Templates. Proc. 14th Int'l Conf. World Wide Web (WWW), 2005.
B. Long, Z. Zhang, and P. S. Yu, Co-Clustering by Block Value Decomposition. Proc. ACM SIGKDD, 2005.
F. Pan, X. Zhang, and W. Wang, CRD: Fast Co-Clustering on Large Data Sets Utilizing Sampling-Based Matrix Decomposition. Proc. ACM SIGMOD, 2008.
C. Kim and K. Shim, TEXT: Automatic Template Extraction from Heterogeneous Web Pages. IEEE Transactions on Knowledge and Data Engineering, vol. 23, no. 4, April 2011.
Hanady Abdul Salam and David B. Skillicorn, Classification Using Streaming Random Forests. IEEE Computer Society, IEEE Transactions on Knowledge and Data Engineering, vol. 23, no. 1, January 2011.

Index Terms

Computer Science

Information Sciences

Keywords

Webpage sectioning, webpage segmentation, template detection, information extraction, clustering, web data modelling, web data mining, template extraction, data mining, information search and retrieval