Research Article

A Comparison Study of Data Scrubbing Algorithms and Frameworks in Data Warehousing

by Hamed Ibrahim Housien, Zhang Zuping, Zainab Qays Abdulhadi
International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 68 - Number 25
Year of Publication: 2013
Authors: Hamed Ibrahim Housien, Zhang Zuping, Zainab Qays Abdulhadi
10.5120/11752-7406

Hamed Ibrahim Housien, Zhang Zuping, Zainab Qays Abdulhadi. A Comparison Study of Data Scrubbing Algorithms and Frameworks in Data Warehousing. International Journal of Computer Applications. 68, 25 (April 2013), 26-32. DOI=10.5120/11752-7406

@article{ 10.5120/11752-7406,
author = { Hamed Ibrahim Housien and Zhang Zuping and Zainab Qays Abdulhadi },
title = { A Comparison Study of Data Scrubbing Algorithms and Frameworks in Data Warehousing },
journal = { International Journal of Computer Applications },
issue_date = { April 2013 },
volume = { 68 },
number = { 25 },
month = { April },
year = { 2013 },
issn = { 0975-8887 },
pages = { 26-32 },
numpages = {9},
url = { https://ijcaonline.org/archives/volume68/number25/11752-7406/ },
doi = { 10.5120/11752-7406 },
publisher = {Foundation of Computer Science (FCS), NY, USA},
address = {New York, USA}
}
%0 Journal Article
%1 2024-02-06T21:29:02.410209+05:30
%A Hamed Ibrahim Housien
%A Zhang Zuping
%A Zainab Qays Abdulhadi
%T A Comparison Study of Data Scrubbing Algorithms and Frameworks in Data Warehousing
%J International Journal of Computer Applications
%@ 0975-8887
%V 68
%N 25
%P 26-32
%D 2013
%I Foundation of Computer Science (FCS), NY, USA
Abstract

Many organizations now use a Data Warehouse (DW) to support their decision-making processes, achieve their goals, and satisfy their customers. A DW gives executives timely access to the information they need to make the right decisions. A Decision Support System (DSS) is one of the means applied in data mining. The robustness and quality of its decisions depend on a decisive factor, Data Quality (DQ). High data quality is obtained through Data Scrubbing (DS), one of the Extraction, Transformation and Loading (ETL) tools, which makes data scrubbing essential in the DW. High DQ and effective DS are increasingly linked. DS algorithms address the constraints that limit DQ, constraints that lead to weak decisions and high financial costs. These constraints are: dirty data, noisy data, missing values, inconsistent data, uncertain data, ambiguous data, conflicting data, duplicated records, and similar columns. Their sources and causes are many, including input errors, merging data from different sources, and differences in how the same information is represented. In addition, more than 35 sources and causes of poor-quality data arise during the ETL stage. This paper presents a comparison and analysis of DS algorithms, covering the pros and cons, accuracy, and time complexity of each. It also presents a comparative analysis of Data Scrubbing frameworks and determines the best framework.
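
To make two of the constraints named in the abstract concrete (missing values, and duplicated records that arise when the same information is represented differently), the following is a minimal Python sketch. It is not one of the algorithms compared in the paper. The function names, the sample records, and the 0.85 similarity threshold are all assumptions chosen for illustration.

```python
from difflib import SequenceMatcher


def normalize(value):
    """Collapse whitespace and lowercase a value; None becomes an empty string."""
    if value is None:
        return ""
    return " ".join(str(value).split()).lower()


def missing_fields(record, required):
    """Return the required fields that are absent or blank in a record."""
    return [f for f in required if normalize(record.get(f)) == ""]


def similarity(a, b, fields):
    """Average character-level similarity of two records over the given fields."""
    scores = [
        SequenceMatcher(None, normalize(a.get(f)), normalize(b.get(f))).ratio()
        for f in fields
    ]
    return sum(scores) / len(scores)


def find_similar_duplicates(records, fields, threshold=0.85):
    """Return index pairs whose similarity reaches the threshold.

    This is a plain pairwise scan, so it costs O(n^2) comparisons.
    The threshold is an illustrative assumption, not a recommended value.
    """
    pairs = []
    for i in range(len(records)):
        for j in range(i + 1, len(records)):
            if similarity(records[i], records[j], fields) >= threshold:
                pairs.append((i, j))
    return pairs


# Hypothetical sample data: rows 0 and 1 differ only in case and spacing.
records = [
    {"name": "John Smith", "city": "New York"},
    {"name": "john  smith", "city": "New York "},
    {"name": "Alice Brown", "city": None},
]

duplicates = find_similar_duplicates(records, ["name", "city"])
incomplete = [i for i, r in enumerate(records) if missing_fields(r, ["name", "city"])]
```

The pairwise scan is the simplest way to express the idea. Its quadratic cost is one reason time complexity matters when comparing scrubbing algorithms on warehouse-scale data.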

Index Terms

Computer Science
Information Sciences

Keywords

Data scrubbing
Data warehousing
Data Quality
Extract-Transform-Load (ETL)