International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 15 - Number 4
Year of Publication: 2011
Authors: Dr. J. Jebamalar Tamilselvi, C. Brilly Gifta
DOI: 10.5120/1939-2590
Dr. J. Jebamalar Tamilselvi, C. Brilly Gifta. Handling Duplicate Data in Data Warehouse for Data Mining. International Journal of Computer Applications. 15, 4 (February 2011), 7-15. DOI=10.5120/1939-2590
Detecting and eliminating duplicate data is one of the major problems in data cleaning and data quality for data warehouses. The same real-world entity often has multiple representations in a data warehouse. Duplicate elimination is hard because duplicates arise from several types of errors, such as typographical errors and different representations of the same logical value. It is also important to detect and clean equivalence errors, because a single equivalence error may result in several duplicate tuples. Recent research efforts have focused on duplicate elimination in data warehouses, which entails matching inexact duplicate records: records that refer to the same real-world entity but are not syntactically equivalent. This paper focuses on the efficient detection and elimination of duplicate data. The main objective of this research is to detect exact and inexact duplicates using duplicate detection and elimination rules. This approach is used to improve the efficiency of the data.
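To make the distinction between exact and inexact duplicates concrete, the following is a minimal sketch and not the authors' rule set, which the abstract does not spell out. It assumes that each record is a single string, that similarity is measured with the ratio from Python's standard `difflib.SequenceMatcher`, and that a threshold of 0.85 separates duplicates from distinct records. The function names `normalize`, `similarity`, `find_duplicates`, and `eliminate_duplicates` are hypothetical.

```python
from difflib import SequenceMatcher


def normalize(value):
    """Lowercase and collapse whitespace so trivial formatting differences vanish."""
    return " ".join(value.lower().split())


def similarity(a, b):
    """Similarity in [0, 1] between two normalized records (assumed measure)."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def find_duplicates(records, threshold=0.85):
    """Return (i, j, kind) for every pair of records judged to be duplicates.

    kind is "exact" when the raw strings are identical, and "inexact" when
    they differ but their similarity reaches the assumed threshold.
    """
    pairs = []
    for i in range(len(records)):
        for j in range(i + 1, len(records)):
            if records[i] == records[j]:
                pairs.append((i, j, "exact"))
            elif similarity(records[i], records[j]) >= threshold:
                pairs.append((i, j, "inexact"))
    return pairs


def eliminate_duplicates(records, threshold=0.85):
    """Keep the first record of each duplicate pair and drop the later one."""
    dropped = {j for _, j, _ in find_duplicates(records, threshold)}
    return [r for k, r in enumerate(records) if k not in dropped]


records = [
    "John Smith, 12 Main Street",
    "John Smith, 12 Main Street",  # exact copy of the first record
    "Jon Smith, 12 Main Street",   # typographical variant
    "Mary Jones, 4 Oak Avenue",    # a different entity
]
print(find_duplicates(records))
print(eliminate_duplicates(records))
```

The pairwise comparison is quadratic in the number of records, so a sketch like this only suits small inputs; the threshold would also need tuning against real data.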