Research Article

Handling Duplicate Data in Data Warehouse for Data Mining

by Dr. J. Jebamalar Tamilselvi, C. Brilly Gifta
International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 15 - Number 4
Year of Publication: 2011
Authors: Dr. J. Jebamalar Tamilselvi, C. Brilly Gifta
DOI: 10.5120/1939-2590

Dr. J. Jebamalar Tamilselvi, C. Brilly Gifta. Handling Duplicate Data in Data Warehouse for Data Mining. International Journal of Computer Applications. 15, 4 (February 2011), 7-15. DOI=10.5120/1939-2590

@article{ 10.5120/1939-2590,
author = { Dr. J. Jebamalar Tamilselvi, C. Brilly Gifta },
title = { Handling Duplicate Data in Data Warehouse for Data Mining },
journal = { International Journal of Computer Applications },
issue_date = { February 2011 },
volume = { 15 },
number = { 4 },
month = { February },
year = { 2011 },
issn = { 0975-8887 },
pages = { 7-15 },
numpages = {9},
url = { https://ijcaonline.org/archives/volume15/number4/1939-2590/ },
doi = { 10.5120/1939-2590 },
publisher = {Foundation of Computer Science (FCS), NY, USA},
address = {New York, USA}
}
Abstract

Detecting and eliminating duplicate data is one of the major problems in data cleaning and data quality for data warehouses. The same logical real-world entity often has multiple representations in a warehouse. Duplicate elimination is hard because duplicates arise from several types of errors, such as typographical errors and different representations of the same logical value. It is also important to detect and clean equivalence errors, because a single equivalence error may result in several duplicate tuples. Recent research efforts have focused on duplicate elimination in data warehouses, which entails matching inexact duplicate records: records that refer to the same real-world entity without being syntactically equivalent. This paper focuses on the efficient detection and elimination of duplicate data. Its main objective is to detect exact and inexact duplicates by using duplicate detection and elimination rules. This approach is intended to improve the efficiency of the data.
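
The distinction between exact and inexact duplicates can be illustrated with a minimal Python sketch. It is not the authors' rule set, which the abstract does not spell out. It assumes that records are tuples of strings with the same number of fields, that exact duplicates are records that match after simple normalization (lowercasing and whitespace collapsing), and that inexact duplicates are records whose mean per-field similarity meets a threshold. The function names, the sample records, and the 0.85 threshold are all hypothetical choices made for illustration.

```python
from difflib import SequenceMatcher


def normalize(record):
    """Lowercase each field and collapse runs of whitespace."""
    return tuple(" ".join(str(field).lower().split()) for field in record)


def similarity(a, b):
    """Mean per-field similarity ratio in [0, 1] for two normalized records.

    Assumes both records are non-empty and have the same number of fields.
    """
    ratios = [SequenceMatcher(None, x, y).ratio() for x, y in zip(a, b)]
    return sum(ratios) / len(ratios)


def find_duplicates(records, threshold=0.85):
    """Return (exact, inexact) lists of index pairs (i, j) with i < j."""
    normalized = [normalize(r) for r in records]
    exact, inexact = [], []
    for i in range(len(normalized)):
        for j in range(i + 1, len(normalized)):
            if normalized[i] == normalized[j]:
                # Identical after normalization: treated as an exact duplicate.
                exact.append((i, j))
            elif similarity(normalized[i], normalized[j]) >= threshold:
                # Similar but not identical: treated as an inexact duplicate.
                inexact.append((i, j))
    return exact, inexact


# Hypothetical sample records: (name, address).
records = [
    ("John Smith", "12 Main Street"),
    ("john  smith", "12 main street"),   # differs only in case and spacing
    ("Jon Smith", "12 Main Street"),     # typographical variant of the name
    ("Mary Jones", "4 Oak Avenue"),      # unrelated entity
]

exact, inexact = find_duplicates(records)
print("exact:", exact)
print("inexact:", inexact)
```

The sketch compares every pair of records, so its cost grows quadratically with the number of records. That is acceptable for a small illustration but would need a blocking or sorting step to scale to warehouse-sized tables.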

Index Terms

Computer Science
Information Sciences

Keywords

Data Cleaning, Duplicate Data, Data Warehouse, Data Mining