Research Article

Article:Novel Framework and Model for Data Warehouse Cleansing

by Daya Gupta, Payal Pahwa, Rajiv Arora
International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 32 - Number 8
Year of Publication: 2011
Authors: Daya Gupta, Payal Pahwa, Rajiv Arora
10.5120/3922-5533

Daya Gupta, Payal Pahwa, Rajiv Arora. Article:Novel Framework and Model for Data Warehouse Cleansing. International Journal of Computer Applications. 32, 8 (October 2011), 6-13. DOI=10.5120/3922-5533

@article{ 10.5120/3922-5533,
author = { Daya Gupta, Payal Pahwa, Rajiv Arora },
title = { Article:Novel Framework and Model for Data Warehouse Cleansing },
journal = { International Journal of Computer Applications },
issue_date = { October 2011 },
volume = { 32 },
number = { 8 },
month = { October },
year = { 2011 },
issn = { 0975-8887 },
pages = { 6-13 },
numpages = {9},
url = { https://ijcaonline.org/archives/volume32/number8/3922-5533/ },
doi = { 10.5120/3922-5533 },
publisher = {Foundation of Computer Science (FCS), NY, USA},
address = {New York, USA}
}
%0 Journal Article
%1 2024-02-06T20:18:37.756697+05:30
%A Daya Gupta
%A Payal Pahwa
%A Rajiv Arora
%T Article:Novel Framework and Model for Data Warehouse Cleansing
%J International Journal of Computer Applications
%@ 0975-8887
%V 32
%N 8
%P 6-13
%D 2011
%I Foundation of Computer Science (FCS), NY, USA
Abstract

Data cleansing is the process of identifying corrupt and duplicate data in the data sets of a data warehouse in order to improve data quality. This paper aims to facilitate data cleansing by addressing the problem of duplicate record detection in the ‘name’ attributes of the data sets. It provides a sequence of algorithms, organized in a novel framework, for identifying duplicates in the ‘name’ attribute of the data sets of an existing data warehouse. The key features of the research include its proposal of a novel framework through a well-defined sequence of algorithms and its refinement of the application of alliance rules [1] by incorporating existing, well-defined similarity computation measures. The results show the feasibility and validity of the suggested method.
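
As a rough illustration of what a similarity computation measure on a ‘name’ attribute can look like, the sketch below flags candidate duplicate names using a normalized Levenshtein (edit-distance) similarity. It is not the paper's framework or its alliance rules: the choice of Levenshtein distance, the function names, the sample names, and the 0.8 threshold are all assumptions made for illustration only.

```python
from itertools import combinations


def normalize(name: str) -> str:
    """Lowercase the name and collapse runs of whitespace (assumed preprocessing)."""
    return " ".join(name.lower().split())


def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum insertions, deletions, and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def name_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1]; 1.0 means identical after normalization."""
    a, b = normalize(a), normalize(b)
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def candidate_duplicates(names, threshold=0.8):
    """Return pairs of names whose similarity meets the (hypothetical) threshold."""
    return [(x, y) for x, y in combinations(names, 2)
            if name_similarity(x, y) >= threshold]


if __name__ == "__main__":
    # Hypothetical sample values for a 'name' column.
    print(candidate_duplicates(["Jon Smith", "John Smith", "Mary Jones"]))
```

A pairwise comparison like this grows quadratically with the number of records, so a real warehouse would typically need some form of blocking or pre-grouping before comparing names; that step is left out here.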

References
  1. Rajiv Arora, Payal Pahwa, Shubha Bansal, “Alliance Rules for Data Warehouse Cleansing”, International Conference on Signal Processing Systems, IEEE Xplore, DOI 10.1109/ICSPS, 133, pp. 743-747, 2009.
  2. P. Ponniah, “Data Warehousing Fundamentals: A Comprehensive Guide for IT Professionals”, 1st ed., second reprint, ISBN 81-265-0919-8, Glorious Printers: New Delhi, India, 2007.
  3. A. Marcus, J. I. Maletic, “Utilizing Association Rules for the Identification of Errors in Data”, TR-CS-00-04, University of Memphis, 2004.
  4. A. Marcus, J. I. Maletic, “Data Cleansing: Beyond Integrity Analysis”, Proceedings of the Conference on Information Quality (IQ2000), Boston: Massachusetts Institute of Technology, pp. 200-209, 2000.
  5. T. Redman, “The Impact of Poor Data Quality on the Typical Enterprise”, Communications of the ACM, Vol. 41. 8, February 1998.
  6. A. Marcus, J. I. Maletic, “Automated Identification of Errors in Data Sets”, TR-CS-00-02, University of Memphis, 2002.
  7. A. Marcus, J. I. Maletic and K.-I. Lin, “Association Rules for Error Identification in Data Sets”, Proceedings of the 10th ACM Conference on Information and Knowledge Management (ACM CIKM 2001), Atlanta, GA, pp. 589-591, 2001.
  8. Peter Christen, “A Comparison of Personal Name Matching: Techniques and Practical Issues”, Joint Computer Science Technical Report Series, TR-CS-06-02, September 2006.
  9. Gérard Bouchard and Christian Pouyez, “Name Variations and Computerised Record Linkage”, Historical Methods, Vol. 13, No. 2, Springer, 1980, pp. 119-125.
  10. Timothy E. Ohanekwu, C. I. Ezeife, “A Token-Based Data Cleaning Technique for Data Warehouse Systems”, Ontario, Canada N9B 3P4.
  11. Surajit Chaudhuri, Kris Ganjam, Venkatesh Ganti, Rajeev Motwani, “Robust and Efficient Fuzzy Match for Online Data Cleaning”, ACM SIGMOD, 2003.
  12. Amit Rudra, Emilie Yeo, “Key Issues in Achieving Data Quality and Consistency in Data Warehousing among Large Organisations in Australia”, Proceedings of the 32nd Hawaii International Conference on System Sciences, 1999.
  13. E. Rahm, H. H. Do, “Data Cleaning: Problems and Current Approaches”, IEEE Techn. Bull. Data Eng., Dec. 2000.
  14. Heiko Müller, Johann-Christoph Freytag, “Problems, Methods, and Challenges in Comprehensive Data Cleansing”, 10099 Berlin, Germany.
  15. A. D. Chapman, “Principles and Methods of Data Cleaning: Primary Species and Species-Occurrence Data”, version 1.0, Report for the Global Biodiversity Information Facility, Copenhagen, 2005.
  16. Rohit Ananthakrishna (Cornell University), Surajit Chaudhuri, Venkatesh Ganti (Microsoft Research), “Eliminating Fuzzy Duplicates in Data Warehouses”.
  17. Ahmed K. Elmagarmid, Panagiotis G. Ipeirotis, Vassilios S. Verykios, “Duplicate Record Detection: A Survey”, IEEE Transactions on Knowledge and Data Engineering, vol. 19, no. 1, January 2007.
  18. Oktie Hassanzadeh, Mohammad Sadoghi, Renée J. Miller, “Accuracy of Approximate String Joins Using Grams”, University of Toronto, 10 King’s College Rd., Toronto, ON M5S 3G4, Canada.
  19. Jakub Piskorski, Marcin Sydow, “Usability of String Distance Metrics for Name Matching Tasks in Polish”.
Index Terms

Computer Science
Information Sciences

Keywords

Data warehouse, data cleansing, fuzzy logic, data mining