International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 19 - Number 7
Year of Publication: 2011
Authors: Midhun Mathew, Shine N Das, T R Lakshmi Narayanan, Pramod K Vijayaraghavan
DOI: 10.5120/2374-3128
Midhun Mathew, Shine N Das, T R Lakshmi Narayanan, Pramod K Vijayaraghavan. A Novel Approach for Near-Duplicate Detection of Web Pages using TDW Matrix. International Journal of Computer Applications. 19, 7 (April 2011), 16-21. DOI=10.5120/2374-3128
The sheer volume of web documents has weakened the performance and reliability of web search engines. The existence of near-duplicate data is an issue that accompanies the growing need to incorporate heterogeneous data. Web content mining faces serious problems due to duplicate and near-duplicate web pages. These pages increase the index storage space or the serving costs, and they irritate users. Near-duplicate detection has been recognized as an important problem in plagiarism detection, spam detection, and focused web crawling. Here we propose a novel approach for finding near-duplicates of an input web page in a huge repository. We propose a TDW matrix based algorithm with three phases: rendering, filtering, and verification. The rendering phase receives an input web page and a threshold; the filtering phase applies prefix filtering and positional filtering to reduce the number of candidate records; and the verification phase computes the similarity and returns an optimal set of near-duplicate web pages. The experimental results show that our algorithm performs well on two benchmark measures, precision and recall, and reduces the size of the competing record set.
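The filter-then-verify idea described above can be illustrated with a minimal sketch. It is not the paper's TDW matrix algorithm, whose details the abstract does not give. It assumes, purely for illustration, that pages are represented as sets of whitespace-separated tokens and that similarity is Jaccard similarity. It shows prefix filtering followed by verification, and omits the rendering step and positional filtering. All function names are hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: a page is reduced to its set of lowercase word tokens.
    return set(text.lower().split())


def prefix(tokens, freq, threshold):
    # Order tokens by ascending global frequency (rarest first), ties by name.
    ranked = sorted(tokens, key=lambda tok: (freq[tok], tok))
    # Standard prefix length for a Jaccard threshold: n - ceil(t * n) + 1.
    # The small epsilon guards against float noise such as 0.7 * 10.
    n = len(ranked)
    keep = n - math.ceil(threshold * n - 1e-9) + 1
    return set(ranked[:keep])


def jaccard(a, b):
    return len(a & b) / len(a | b) if (a or b) else 1.0


def near_duplicates(query, repository, threshold):
    """Return {name: similarity} for pages with Jaccard >= threshold."""
    q = tokenize(query)
    docs = {name: tokenize(text) for name, text in repository.items()}
    freq = Counter(tok for toks in [q, *docs.values()] for tok in toks)
    q_prefix = prefix(q, freq, threshold)

    results = {}
    for name, toks in docs.items():
        # Filtering: pages whose prefixes share no token with the query
        # prefix cannot reach the threshold, so they are skipped.
        if not (q_prefix & prefix(toks, freq, threshold)):
            continue
        # Verification: compute the exact similarity for the survivors.
        sim = jaccard(q, toks)
        if sim >= threshold:
            results[name] = sim
    return results


if __name__ == "__main__":
    repo = {
        "a": "the quick brown fox jumps over the lazy cat",
        "b": "web search engines index billions of pages",
    }
    print(near_duplicates("the quick brown fox jumps over the lazy dog", repo, 0.7))
```

In this sketch the filter only discards pages that cannot meet the threshold, so the verification step sees a smaller candidate set without losing any true near-duplicates under the stated Jaccard assumption.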