Research Article

Hybrid Approaches for Data Cleaning in Data Warehouse

by Prerana S. Kulkarni, J. W. Bakal
International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 88 - Number 18
Year of Publication: 2014
Authors: Prerana S. Kulkarni, J. W. Bakal
10.5120/15450-3813

Prerana S. Kulkarni, J. W. Bakal. Hybrid Approaches for Data Cleaning in Data Warehouse. International Journal of Computer Applications. 88, 18 (February 2014), 7-10. DOI=10.5120/15450-3813

@article{ 10.5120/15450-3813,
author = { Prerana S. Kulkarni and J. W. Bakal },
title = { Hybrid Approaches for Data Cleaning in Data Warehouse },
journal = { International Journal of Computer Applications },
issue_date = { February 2014 },
volume = { 88 },
number = { 18 },
month = { February },
year = { 2014 },
issn = { 0975-8887 },
pages = { 7-10 },
numpages = {9},
url = { https://ijcaonline.org/archives/volume88/number18/15450-3813/ },
doi = { 10.5120/15450-3813 },
publisher = {Foundation of Computer Science (FCS), NY, USA},
address = {New York, USA}
}
%0 Journal Article
%1 2024-02-06T22:07:56.271454+05:30
%A Prerana S. Kulkarni
%A J. W. Bakal
%T Hybrid Approaches for Data Cleaning in Data Warehouse
%J International Journal of Computer Applications
%@ 0975-8887
%V 88
%N 18
%P 7-10
%D 2014
%I Foundation of Computer Science (FCS), NY, USA
Abstract

Correct data is essential for well-informed and reliable decision making, and a data warehouse is the only viable solution that can make such decision making a reality. The quality of the data in a warehouse can be ensured only by cleaning the data before it is loaded, which makes data cleaning a very important process in data warehousing. It is not an easy process, because many different types of unclean data can be present, and whether data is clean or dirty depends heavily on the nature and source of the raw data. Many attempts have been made to clean data using different types of algorithms. This paper proposes a hybrid approach to data cleaning that combines modified versions of the PNRS, transitive closure, and semantic data matching algorithms to obtain better results in data correction.
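
The abstract does not specify how the algorithms are modified or combined, so the following is only a minimal sketch of two of the underlying ideas, not the authors' method. It assumes that PNRS-style correction can be approximated by edit-distance matching against a dictionary of known names, and that transitive closure means grouping records that share any field value, directly or through a chain of other records. Semantic data matching is left out. All function names, the `max_distance` threshold, and the sample records are hypothetical.

```python
from collections import defaultdict


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def correct_name(value: str, dictionary: list[str], max_distance: int = 2) -> str:
    """Return the closest dictionary entry if it is within max_distance edits.

    max_distance is an illustrative threshold, not a value from the paper.
    """
    if not dictionary:
        return value
    best = min(dictionary, key=lambda word: edit_distance(value.lower(), word.lower()))
    if edit_distance(value.lower(), best.lower()) <= max_distance:
        return best
    return value


def transitive_closure(records: list[tuple[str, ...]]) -> list[set[int]]:
    """Group record indices linked by a shared field value (union-find)."""
    parent = list(range(len(records)))

    def find(x: int) -> int:
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    first_seen: dict[tuple[int, str], int] = {}
    for idx, record in enumerate(records):
        for field_pos, field_value in enumerate(record):
            key = (field_pos, field_value)
            if key in first_seen:
                # Same value in the same field: merge the two groups.
                parent[find(idx)] = find(first_seen[key])
            else:
                first_seen[key] = idx

    groups: dict[int, set[int]] = defaultdict(set)
    for idx in range(len(records)):
        groups[find(idx)].add(idx)
    return list(groups.values())


# Hypothetical sample data: (name, phone) pairs with a misspelled name.
known_names = ["John", "Mary", "Peter"]
raw = [
    ("Jhon", "555-0101"),
    ("John", "555-0199"),
    ("J. Smith", "555-0199"),
    ("Mary", "555-0142"),
]
cleaned = [(correct_name(name, known_names), phone) for name, phone in raw]
duplicate_groups = transitive_closure(cleaned)
print(cleaned)
print(duplicate_groups)
```

In this sample, correcting "Jhon" to "John" links records 0 and 1 by name, and the shared phone number links records 1 and 2, so all three fall into one group even though records 0 and 2 have no field in common. Running correction before closure is one possible ordering. The abstract does not state which order the paper uses.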

Index Terms

Computer Science
Information Sciences

Keywords

PNRS, Transitive closure, Semantic Data matching