Research Article

An Optimized Approach of Modified BAT Algorithm to Record Deduplication

by Faritha Banu. A, Chandrasekar. C
International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 62 - Number 1
Year of Publication: 2013
Authors: Faritha Banu. A, Chandrasekar. C
DOI: 10.5120/10043-4627

Faritha Banu. A, Chandrasekar. C. An Optimized Approach of Modified BAT Algorithm to Record Deduplication. International Journal of Computer Applications. 62, 1 (January 2013), 10-15. DOI=10.5120/10043-4627

@article{ 10.5120/10043-4627,
author = { Faritha Banu. A and Chandrasekar. C },
title = { An Optimized Approach of Modified BAT Algorithm to Record Deduplication },
journal = { International Journal of Computer Applications },
issue_date = { January 2013 },
volume = { 62 },
number = { 1 },
month = { January },
year = { 2013 },
issn = { 0975-8887 },
pages = { 10-15 },
numpages = {6},
url = { https://ijcaonline.org/archives/volume62/number1/10043-4627/ },
doi = { 10.5120/10043-4627 },
publisher = {Foundation of Computer Science (FCS), NY, USA},
address = {New York, USA}
}
%0 Journal Article
%1 2024-02-06T21:10:29.738809+05:30
%A Faritha Banu. A
%A Chandrasekar. C
%T An Optimized Approach of Modified BAT Algorithm to Record Deduplication
%J International Journal of Computer Applications
%@ 0975-8887
%V 62
%N 1
%P 10-15
%D 2013
%I Foundation of Computer Science (FCS), NY, USA
Abstract

Record deduplication is the task of identifying, in a data warehouse, records that refer to the same real-world entity despite misspellings, typos, different writing styles, or even different schema representations or data types. Existing research has proposed a genetic programming (GP) approach to record deduplication. That approach combines several different pieces of evidence extracted from the data content to produce a deduplication function that can identify whether two or more entries in a repository are duplicates. Because record deduplication is a time-consuming task even for small repositories, the aim of that work is a method that finds a proper combination of the best pieces of evidence, yielding a deduplication function that maximizes performance while using only a small representative portion of the data for training. However, the approach offers limited optimization of the process. Our research addresses these issues with a novel technique, a modified bat algorithm for record deduplication. The motivation is to create a flexible and effective method that employs data mining algorithms. The method shares many similarities with evolutionary computation techniques such as genetic programming: it is initialized with a population of random solutions and searches for optima by updating successive generations of bats. Unlike GP, however, the modified bat algorithm has no evolution operators such as crossover and mutation. We also compare the proposed algorithm experimentally with existing algorithms, including GP.
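
The abstract does not spell out how the bat algorithm is modified, so the following is only a minimal sketch of the idea it describes, using the standard bat algorithm (reference 11) as a stand-in. It starts from a population of random solutions and updates the bats over generations, without crossover or mutation, to tune a deduplication function that combines per-attribute similarity evidence. The function names (`evidence`, `error_rate`, `bat_search`), the parameter values, and the toy record pairs are all assumptions made for illustration and are not taken from the paper.

```python
import math
import random
from difflib import SequenceMatcher


def evidence(a, b):
    """One similarity score in [0, 1] per attribute of the two records."""
    return [SequenceMatcher(None, x.lower(), y.lower()).ratio() for x, y in zip(a, b)]


# Toy labelled pairs (record A, record B, is_duplicate), invented for illustration.
PAIRS = [
    (("John Smith", "12 Main Street"), ("Jon Smith", "12 Main St."), True),
    (("Maria Garcia", "4 Oak Avenue"), ("Maria Garcia", "4 Oak Ave"), True),
    (("John Smith", "12 Main Street"), ("Jane Smythe", "98 Hill Road"), False),
    (("Peter Chan", "7 Lake Drive"), ("Maria Garcia", "4 Oak Avenue"), False),
]
SCORES = [(evidence(a, b), dup) for a, b, dup in PAIRS]


def error_rate(params):
    """Fraction of misclassified pairs; params = attribute weights + threshold."""
    *weights, threshold = params
    wrong = 0
    for ev, dup in SCORES:
        score = sum(w * e for w, e in zip(weights, ev)) / (sum(weights) or 1.0)
        wrong += (score >= threshold) != dup
    return wrong / len(SCORES)


def bat_search(objective, dim, n_bats=20, n_iter=100, f_min=0.0, f_max=2.0,
               alpha=0.9, gamma=0.9, lower=0.0, upper=1.0, seed=0):
    """Minimize `objective` over [lower, upper]^dim (standard bat algorithm)."""
    rng = random.Random(seed)

    def clip(v):
        return min(upper, max(lower, v))

    # Population of random solutions: bat positions, velocities start at zero.
    pos = [[rng.uniform(lower, upper) for _ in range(dim)] for _ in range(n_bats)]
    vel = [[0.0] * dim for _ in range(n_bats)]
    fit = [objective(p) for p in pos]
    loud = [1.0] * n_bats   # loudness A_i, decreases as a bat improves
    rate = [0.0] * n_bats   # pulse emission rate r_i, increases over time
    r0 = 0.5                # assumed limiting pulse rate
    best_i = min(range(n_bats), key=fit.__getitem__)
    best, best_fit = pos[best_i][:], fit[best_i]

    for t in range(1, n_iter + 1):
        mean_loud = sum(loud) / n_bats
        for i in range(n_bats):
            # Frequency-tuned move relative to the current best solution.
            freq = f_min + (f_max - f_min) * rng.random()
            vel[i] = [v + (x - b) * freq for v, x, b in zip(vel[i], pos[i], best)]
            cand = [clip(x + v) for x, v in zip(pos[i], vel[i])]
            if rng.random() > rate[i]:
                # Local random walk around the best solution.
                cand = [clip(b + rng.uniform(-1, 1) * mean_loud) for b in best]
            cand_fit = objective(cand)
            if cand_fit <= fit[i] and rng.random() < loud[i]:
                pos[i], fit[i] = cand, cand_fit
                loud[i] *= alpha
                rate[i] = r0 * (1 - math.exp(-gamma * t))
            if cand_fit <= best_fit:
                best, best_fit = cand[:], cand_fit
    return best, best_fit


# Two attribute weights plus one decision threshold.
best, err = bat_search(error_rate, dim=3)
print("weights:", best[:2], "threshold:", best[2], "error rate:", err)
```

In this sketch each bat encodes the attribute weights and the decision threshold of the deduplication function, and the objective is the error rate on a small labelled sample, mirroring the abstract's point about training on a small representative portion of the data. A real study would replace the toy pairs and the simple weighted sum with the evidence and fitness measure defined in the paper.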

Index Terms

Computer Science
Information Sciences

Keywords

Genetic Programming, Deduplication Function, Modified Bat Algorithm, Data Mining Algorithms