International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 61 - Number 12
Year of Publication: 2013
Authors: M. Padmanaban, T. Bhuvaneswari |
10.5120/9977-4100 |
M. Padmanaban, T. Bhuvaneswari. A Technique for Data Deduplication using Q-Gram Concept with Support Vector Machine. International Journal of Computer Applications. 61, 12 (January 2013), 1-9. DOI=10.5120/9977-4100
Several systems that rely on consistent data to offer high-quality services, such as digital libraries and e-commerce brokers, may be affected by the existence of duplicates, quasi-replicas, or near-duplicate entries in their repositories. Because of that, private and government organizations have invested significantly in developing methods for removing replicas from their data repositories. Accordingly, in this paper we propose a deduplication technique. In our previous work, duplicate record detection was done using three different similarity measures and a neural network: a feature vector was generated from the similarity measures, and the neural network was then used to find the duplicate records. In this paper, we develop a Q-gram concept with a support vector machine for the deduplication process. The similarity functions used for similarity measurement are the Dice coefficient, the Damerau–Levenshtein distance, and the Tversky index. Finally, a support vector machine is used to test whether a data record is a duplicate or not. A set of data generated from the similarity measures is used as the input to the proposed system. Two processes characterize the proposed deduplication technique: the training phase and the testing phase. The experimental results showed that the proposed deduplication technique has higher accuracy than the existing method. The accuracy obtained for the proposed deduplication technique is 88%.
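To make the feature-generation step concrete, the following is a minimal sketch of how a pair of record strings might be turned into a three-element similarity vector. It is not the authors' implementation. The function names, the padding character `#`, the choice q = 2, the Tversky weights alpha = beta = 1, and the normalization of the edit distance into a similarity are all assumptions made for illustration. The edit distance shown is the optimal string alignment variant, a common restricted form of the Damerau–Levenshtein distance.

```python
from collections import Counter


def qgrams(text, q=2):
    """Multiset of overlapping q-grams of a lowercased, '#'-padded string (padding is an assumption)."""
    padded = "#" * (q - 1) + text.lower() + "#" * (q - 1)
    return Counter(padded[i:i + q] for i in range(len(padded) - q + 1))


def dice(a, b, q=2):
    """Dice coefficient on q-gram multisets: 2|A n B| / (|A| + |B|)."""
    ga, gb = qgrams(a, q), qgrams(b, q)
    total = sum(ga.values()) + sum(gb.values())
    if total == 0:
        return 1.0
    return 2 * sum((ga & gb).values()) / total


def tversky(a, b, q=2, alpha=1.0, beta=1.0):
    """Tversky index on q-gram multisets; alpha = beta = 1 (assumed here) reduces to Jaccard."""
    ga, gb = qgrams(a, q), qgrams(b, q)
    common = sum((ga & gb).values())
    only_a = sum((ga - gb).values())
    only_b = sum((gb - ga).values())
    denom = common + alpha * only_a + beta * only_b
    return common / denom if denom else 1.0


def osa_distance(a, b):
    """Optimal string alignment distance (restricted Damerau-Levenshtein)."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
            # Transposition of two adjacent characters.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


def feature_vector(a, b, q=2):
    """Hypothetical feature vector: [Dice, normalized edit similarity, Tversky]."""
    longest = max(len(a), len(b))
    edit_sim = 1.0 if longest == 0 else 1 - osa_distance(a.lower(), b.lower()) / longest
    return [dice(a, b, q), edit_sim, tversky(a, b, q)]
```

In the training phase, vectors like these, computed for record pairs labeled as duplicate or non-duplicate, could be passed to any support vector machine implementation (for example, scikit-learn's `SVC`). In the testing phase, the trained classifier would then label unseen pairs.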