International Journal of Computer Applications |
Foundation of Computer Science (FCS), NY, USA |
Volume 127 - Number 6 |
Year of Publication: 2015 |
Authors: Arfa Skandar, Mariam Rehman, Maria Anjum |
10.5120/ijca2015906401 |
Arfa Skandar, Mariam Rehman, Maria Anjum . An Efficient Duplication Record Detection Algorithm for Data Cleansing. International Journal of Computer Applications. 127, 6 ( October 2015), 28-37. DOI=10.5120/ijca2015906401
The purpose of this research was to review, analyze and compare algorithms lying under the empirical technique in order to suggest the most effective algorithm in terms of efficiency and accuracy. The research process was initiated by collecting the relevant research papers with the query of “duplication record detection” from IEEE database. After that, papers were categorized on the basis of different techniques proposed in the literature. In this research, the focus was made on empirical technique. The papers lying under this technique were further analyzed in order to come up with the algorithms. Finally, the comparison was performed in order to come up with the best algorithm i.e. DCS++. The selected algorithm was critically analyzed in order to improve its working. On the basis of limitations of selected algorithm, variation in algorithm was proposed and validated by developed prototype. After implementation of existing DCS++ and its proposed variation, it was found that the proposed variation in DCS++ producing better results in term of efficiency and accuracy. The algorithms lying under the empirical technique of duplicate records deduction were focused. The research material was gathered from the single digital library i.e. IEEE. A restaurant dataset was selected and the results were evaluated on the specified dataset which can be considered as a limitation of the research. The existing algorithm i.e. DCS++ and proposed variation in DCS++ were implemented in C#. As a result, it was concluded that proposed algorithm is performing outstanding than the existing algorithm.