International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 119 - Number 11
Year of Publication: 2015
Authors: Rashmi D Thakare, Manisha R Patil
DOI: 10.5120/21112-3906
Rashmi D Thakare, Manisha R Patil. Extraction of Template using Clustering from Heterogeneous Web Documents. International Journal of Computer Applications. 119, 11 (June 2015), 23-31. DOI=10.5120/21112-3906
In general, websites generate sets of pages from a common template or layout. For example, Google Books lays out details such as the author name, book title, and reviews or comments in the same way on all of its book pages, while a database supplies the values that differ from page to page. This paper studies the problem of automatically extracting these database values from web pages without any human input. A template is formally defined as the framework that describes how the values are inserted into the pages. At the core is an extraction algorithm that is trained to generate the template by referring to a defined set of words that occur in common across pages. As a result, the extracted values are semantically similar in most cases. Our work focuses on extracting templates from heterogeneous web pages. Because websites contain a large variety of web documents, an unknown number of templates must be managed, and this is achieved by clustering the web documents. Three clustering methods are compared: i) TEXT Minimum Description Length (TEXTMDL), ii) MinHash using the Jaccard coefficient, and iii) MinHash using the Dice coefficient.
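The two MinHash-based similarity measures named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that each web page has already been reduced to a set of string features (for instance tag paths), and the function names, the seeded MD5 hashing scheme, and the default of 128 hash functions are all hypothetical choices made for illustration. The identity used to convert a Jaccard value J into a Dice value, D = 2J / (1 + J), follows directly from the two set definitions.

```python
import hashlib


def jaccard(a, b):
    """Exact Jaccard coefficient: intersection size over union size."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def dice(a, b):
    """Exact Dice coefficient: twice the intersection size over the summed sizes."""
    if not a and not b:
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))


def _seeded_hash(seed, item):
    # Hypothetical hashing scheme: a seeded MD5 digest truncated to 64 bits.
    # Python's built-in hash() is avoided because it is salted per process.
    digest = hashlib.md5(f"{seed}:{item}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_signature(items, num_hashes=128):
    """Keep the minimum hash value per seeded hash function (items must be non-empty)."""
    return [min(_seeded_hash(seed, x) for x in items) for seed in range(num_hashes)]


def estimate_jaccard(sig_a, sig_b):
    """The fraction of matching signature positions estimates the Jaccard coefficient."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)


def dice_from_jaccard(j):
    """Convert a Jaccard value J to the equivalent Dice value 2J / (1 + J)."""
    return 2 * j / (1 + j)


if __name__ == "__main__":
    # Toy feature sets standing in for two pages built from a similar template.
    page_a = {"html/body/h1", "html/body/div/p", "html/body/div/span"}
    page_b = {"html/body/div/p", "html/body/div/span", "html/body/table"}
    est = estimate_jaccard(minhash_signature(page_a), minhash_signature(page_b))
    print("exact Jaccard:", jaccard(page_a, page_b))
    print("exact Dice:", dice(page_a, page_b))
    print("MinHash Jaccard estimate:", est)
    print("Dice derived from estimate:", dice_from_jaccard(est))
```

In a clustering setting, pages whose estimated similarity exceeds a chosen threshold would be candidates for the same template cluster. The threshold and the feature representation are left open here because the abstract does not specify them.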