Research Article

Automatic Threshold Selection using PSO for GA based Duplicate Record Detection

by K. Deepa, R. Rangarajan, M. Senthamil Selvi
International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 62 - Number 4
Year of Publication: 2013
DOI: 10.5120/10068-4674

K. Deepa, R. Rangarajan, M. Senthamil Selvi. Automatic Threshold Selection using PSO for GA based Duplicate Record Detection. International Journal of Computer Applications. 62, 4 (January 2013), 22-27. DOI=10.5120/10068-4674

@article{ 10.5120/10068-4674,
author = { K. Deepa and R. Rangarajan and M. Senthamil Selvi },
title = { Automatic Threshold Selection using PSO for GA based Duplicate Record Detection },
journal = { International Journal of Computer Applications },
issue_date = { January 2013 },
volume = { 62 },
number = { 4 },
month = { January },
year = { 2013 },
issn = { 0975-8887 },
pages = { 22-27 },
numpages = {6},
url = { https://ijcaonline.org/archives/volume62/number4/10068-4674/ },
doi = { 10.5120/10068-4674 },
publisher = {Foundation of Computer Science (FCS), NY, USA},
address = {New York, USA}
}
%0 Journal Article
%1 2024-02-06T21:10:47.111439+05:30
%A K. Deepa
%A R. Rangarajan
%A M. Senthamil Selvi
%T Automatic Threshold Selection using PSO for GA based Duplicate Record Detection
%J International Journal of Computer Applications
%@ 0975-8887
%V 62
%N 4
%P 22-27
%D 2013
%I Foundation of Computer Science (FCS), NY, USA
Abstract

Setting the threshold is an important issue in applications that use similarity functions, and it usually relies on human intervention. The proposed work addresses two issues: first, it finds the optimal equation using a Genetic Algorithm (GA); next, it adopts a swarm intelligence algorithm, Particle Swarm Optimization (PSO), to obtain the optimal threshold, so that duplicate records are detected more accurately and with less human intervention. The Restaurant and CORA data sets are used to analyze the proposed algorithm, and its performance is compared against the MARLIN method and genetic programming using evaluation metrics.
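The threshold search described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the fitness function or the PSO parameters, so the sketch assumes that each candidate record pair already has a similarity score in [0, 1] (for example, from a GA-evolved equation), that labeled pairs are available, and that fitness is the F1 score. The names `f1_at_threshold` and `pso_threshold`, and the values of `w`, `c1`, and `c2`, are hypothetical choices for illustration.

```python
import random


def f1_at_threshold(scores, labels, threshold):
    """F1 score when pairs with similarity >= threshold are flagged as duplicates."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and not y)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def pso_threshold(scores, labels, n_particles=10, n_iters=30,
                  w=0.7, c1=1.5, c2=1.5, seed=0):
    """One-dimensional PSO over thresholds in [0, 1], maximizing F1 (assumed fitness)."""
    rng = random.Random(seed)
    pos = [rng.random() for _ in range(n_particles)]   # candidate thresholds
    vel = [0.0] * n_particles
    pbest = pos[:]                                     # best position per particle
    pbest_fit = [f1_at_threshold(scores, labels, p) for p in pos]
    g = max(range(n_particles), key=lambda i: pbest_fit[i])
    gbest, gbest_fit = pbest[g], pbest_fit[g]          # best position in the swarm

    for _ in range(n_iters):
        for i in range(n_particles):
            r1, r2 = rng.random(), rng.random()
            # Standard velocity update: inertia + cognitive pull + social pull.
            vel[i] = (w * vel[i]
                      + c1 * r1 * (pbest[i] - pos[i])
                      + c2 * r2 * (gbest - pos[i]))
            # Clamp the new position so it remains a valid threshold.
            pos[i] = min(1.0, max(0.0, pos[i] + vel[i]))
            fit = f1_at_threshold(scores, labels, pos[i])
            if fit > pbest_fit[i]:
                pbest[i], pbest_fit[i] = pos[i], fit
                if fit > gbest_fit:
                    gbest, gbest_fit = pos[i], fit
    return gbest, gbest_fit
```

Calling `pso_threshold(scores, labels)` with a list of similarity scores and a parallel list of booleans (True for a known duplicate pair) returns the best threshold found and its F1 score. A different fitness, such as precision or recall alone, could be substituted without changing the search loop.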

Index Terms

Computer Science
Information Sciences

Keywords

GA, PSO, Similarity metrics, Threshold