International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 82 - Number 12
Year of Publication: 2013
Authors: M. Padmanaban, R. Radha
DOI: 10.5120/14166-9829
M. Padmanaban, R. Radha. PSO Algorithm to Select Subsets of Q-Gram Features for Record Duplicate Detection. International Journal of Computer Applications. 82, 12 (November 2013), 7-14. DOI=10.5120/14166-9829
Data quality problems grow with the ever-increasing volume of data, but data engineering has recently made significant progress in addressing them. Private and government organizations have accordingly invested heavily in methods for removing replicas from data repositories, and researchers have shown considerable interest in developing efficient and effective duplicate detection strategies using modern and emerging techniques. This paper proposes one such technique. In previous work, duplicate record detection was performed using the Q-gram concept and a fuzzy classifier. Computing a varied set of features from the data with Q-grams, however, makes the process computationally expensive. To reduce the computational load, a subset of important Q-gram-based features is selected. The proposed technique proceeds in three steps: 1) feature computation, 2) feature selection, and 3) detection. First, the features are computed using the Q-gram concept. Next, the optimal feature subset is identified using the particle swarm optimization (PSO) algorithm, one of the most effective optimization algorithms. Finally, once the optimal feature subset is selected, a Naïve Bayes classifier is used to detect duplicate records. The proposed duplicate record detection technique operates in two phases: a training phase and a testing phase. Experimental results show that the proposed technique achieves higher accuracy than the existing method, reaching an accuracy of 89%.
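The feature computation and feature selection steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation, and several details are assumptions, since the abstract does not specify them: the Q-gram features are taken to be Dice similarities between the padded Q-gram sets of corresponding fields, the PSO variant is the sigmoid-based binary PSO, and the fitness is a simple threshold rule standing in for the Naïve Bayes classifier. All function names, parameter values, and the toy records are hypothetical.

```python
import math
import random


def qgrams(text, q=2):
    """Return the overlapping q-grams of a lowercased, '#'-padded string."""
    padded = "#" * (q - 1) + text.lower() + "#" * (q - 1)
    return [padded[i:i + q] for i in range(len(padded) - q + 1)]


def qgram_similarity(a, b, q=2):
    """Dice coefficient over the two q-gram sets (assumed similarity measure)."""
    ga, gb = set(qgrams(a, q)), set(qgrams(b, q))
    if not ga and not gb:
        return 1.0
    return 2 * len(ga & gb) / (len(ga) + len(gb))


def pair_features(rec_a, rec_b, q_values=(2, 3)):
    """One similarity feature per (field, q) combination for a record pair."""
    return [qgram_similarity(x, y, q)
            for x, y in zip(rec_a, rec_b) for q in q_values]


def make_fitness(X, y, threshold=0.5):
    """Accuracy of a threshold rule on the selected features.

    Stand-in for the classifier accuracy; the paper uses Naive Bayes.
    """
    def fitness(mask):
        if not any(mask):
            return 0.0  # an empty feature subset cannot classify anything
        correct = 0
        for row, label in zip(X, y):
            chosen = [v for v, m in zip(row, mask) if m]
            pred = 1 if sum(chosen) / len(chosen) >= threshold else 0
            correct += (pred == label)
        return correct / len(y)
    return fitness


def binary_pso(fitness, n_features, n_particles=10, n_iters=30,
               w=0.7, c1=1.5, c2=1.5, seed=0):
    """Binary PSO: each particle is a 0/1 mask over the features."""
    rng = random.Random(seed)
    pos = [[rng.randint(0, 1) for _ in range(n_features)]
           for _ in range(n_particles)]
    vel = [[0.0] * n_features for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    pbest_fit = [fitness(p) for p in pos]
    g = max(range(n_particles), key=lambda i: pbest_fit[i])
    gbest, gbest_fit = pbest[g][:], pbest_fit[g]
    for _ in range(n_iters):
        for i in range(n_particles):
            for d in range(n_features):
                # Standard velocity update: inertia + cognitive + social terms.
                v = (w * vel[i][d]
                     + c1 * rng.random() * (pbest[i][d] - pos[i][d])
                     + c2 * rng.random() * (gbest[d] - pos[i][d]))
                vel[i][d] = max(-6.0, min(6.0, v))  # clamp the velocity
                # Sigmoid maps the velocity to the probability of bit = 1.
                prob = 1.0 / (1.0 + math.exp(-vel[i][d]))
                pos[i][d] = 1 if rng.random() < prob else 0
            fit = fitness(pos[i])
            if fit > pbest_fit[i]:
                pbest[i], pbest_fit[i] = pos[i][:], fit
                if fit > gbest_fit:
                    gbest, gbest_fit = pos[i][:], fit
    return gbest, gbest_fit


# Toy labeled record pairs (made up for illustration): 1 = duplicate.
pairs = [
    (("John Smith", "12 Oak Street"), ("Jon Smith", "12 Oak St"), 1),
    (("Mary Jones", "4 Elm Road"), ("Marie Jones", "4 Elm Rd"), 1),
    (("Mary Jones", "4 Elm Road"), ("Peter Brown", "98 Pine Avenue"), 0),
    (("John Smith", "12 Oak Street"), ("Alice Wong", "7 Lake View"), 0),
]
X = [pair_features(a, b) for a, b, _ in pairs]
y = [label for _, _, label in pairs]
best_mask, best_accuracy = binary_pso(make_fitness(X, y), n_features=len(X[0]))
print("selected feature mask:", best_mask, "training accuracy:", best_accuracy)
```

In a fuller pipeline, the fitness function would train a Naïve Bayes classifier on the masked features during the training phase and score it on held-out pairs, so that the swarm favors subsets that generalize rather than subsets that merely fit the training records.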