Handling Duplicate Data in Data Warehouse for Data Mining

Dr. J. Jebamalar Tamilselvi; C. Brilly Gifta

International Journal of Computer Applications

Foundation of Computer Science (FCS), NY, USA

Volume 15 - Number 4

Year of Publication: 2011

Authors: Dr. J. Jebamalar Tamilselvi, C. Brilly Gifta

10.5120/1939-2590

Dr. J. Jebamalar Tamilselvi, C. Brilly Gifta. Handling Duplicate Data in Data Warehouse for Data Mining. International Journal of Computer Applications. 15, 4 (February 2011), 7-15. DOI=10.5120/1939-2590

@article{ 10.5120/1939-2590,
author = { Dr. J. Jebamalar Tamilselvi and C. Brilly Gifta },
title = { Handling Duplicate Data in Data Warehouse for Data Mining },
journal = { International Journal of Computer Applications },
issue_date = { February 2011 },
volume = { 15 },
number = { 4 },
month = { February },
year = { 2011 },
issn = { 0975-8887 },
pages = { 7-15 },
numpages = {9},
url = { https://ijcaonline.org/archives/volume15/number4/1939-2590/ },
doi = { 10.5120/1939-2590 },
publisher = {Foundation of Computer Science (FCS), NY, USA},
address = {New York, USA}
}

Abstract

Detecting and eliminating duplicate data is one of the major problems in data cleaning and data quality for data warehouses. The same logical real-world entity often has multiple representations in the data warehouse. Duplicate elimination is hard because duplicates arise from several types of errors, such as typographical errors and different representations of the same logical value. It is also important to detect and clean equivalence errors, because a single equivalence error may result in several duplicate tuples. Recent research has focused on duplicate elimination in data warehouses, which entails matching inexact duplicate records: records that refer to the same real-world entity but are not syntactically equivalent. This paper focuses on the efficient detection and elimination of duplicate data. Its main objective is to detect exact and inexact duplicates by using duplicate detection and elimination rules. The approach is intended to improve the efficiency of the data.
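The distinction between exact and inexact duplicates can be illustrated with a small sketch. The abstract does not spell out the paper's detection and elimination rules, so everything below is an assumption made for illustration: the function names, the normalization step, the use of Python's standard difflib similarity ratio, and the 0.85 threshold are hypothetical stand-ins, not the authors' method.

```python
from difflib import SequenceMatcher


def normalize(record):
    """Lowercase each field and collapse whitespace so formatting noise is ignored."""
    return tuple(" ".join(str(field).lower().split()) for field in record)


def similarity(a, b):
    """Average per-field similarity ratio in [0, 1] between two normalized records."""
    ratios = [SequenceMatcher(None, x, y).ratio() for x, y in zip(a, b)]
    return sum(ratios) / len(ratios)


def find_duplicates(records, threshold=0.85):
    """Return (exact, inexact) lists of index pairs (i, j) with i < j.

    A pair is 'exact' when the normalized records are identical and
    'inexact' when they differ but their similarity reaches the threshold.
    The threshold value is an illustrative assumption.
    """
    normalized = [normalize(r) for r in records]
    exact, inexact = [], []
    for i in range(len(normalized)):
        for j in range(i + 1, len(normalized)):
            if normalized[i] == normalized[j]:
                exact.append((i, j))
            elif similarity(normalized[i], normalized[j]) >= threshold:
                inexact.append((i, j))
    return exact, inexact


def eliminate_duplicates(records, threshold=0.85):
    """Keep the first record of each duplicate pair and drop the later one."""
    exact, inexact = find_duplicates(records, threshold)
    dropped = {j for _, j in exact + inexact}
    return [r for k, r in enumerate(records) if k not in dropped]


if __name__ == "__main__":
    sample = [
        ("John Smith", "12 Main Street", "Chennai"),
        ("john  smith", "12 main street", "chennai"),  # differs only in case and spacing
        ("Jon Smith", "12 Main Stret", "Chennai"),     # typographical errors
        ("Mary Jones", "4 Park Road", "Madurai"),
    ]
    print(find_duplicates(sample))
    print(eliminate_duplicates(sample))
```

The pairwise comparison is quadratic in the number of records, which is acceptable for a sketch but would typically be replaced by some form of blocking or sorting on larger tables.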

Index Terms

Computer Science

Information Sciences

Keywords

Data Cleaning, Duplicate Data, Data Warehouse, Data Mining