We apologize for a recent technical issue with our email system, which temporarily affected account activations. Accounts have now been activated. Authors may proceed with paper submissions. PhDFocusTM
CFP last date
20 December 2024
Reseach Article

Performance Analysis of Apriori Algorithm with Different Data Structures on Hadoop Cluster

by Sudhakar Singh, Rakhi Garg, P.K. Mishra
International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 128 - Number 9
Year of Publication: 2015
Authors: Sudhakar Singh, Rakhi Garg, P.K. Mishra
10.5120/ijca2015906632

Sudhakar Singh, Rakhi Garg, P.K. Mishra . Performance Analysis of Apriori Algorithm with Different Data Structures on Hadoop Cluster. International Journal of Computer Applications. 128, 9 ( October 2015), 45-51. DOI=10.5120/ijca2015906632

@article{ 10.5120/ijca2015906632,
author = { Sudhakar Singh, Rakhi Garg, P.K. Mishra },
title = { Performance Analysis of Apriori Algorithm with Different Data Structures on Hadoop Cluster },
journal = { International Journal of Computer Applications },
issue_date = { October 2015 },
volume = { 128 },
number = { 9 },
month = { October },
year = { 2015 },
issn = { 0975-8887 },
pages = { 45-51 },
numpages = {9},
url = { https://ijcaonline.org/archives/volume128/number9/22905-2015906632/ },
doi = { 10.5120/ijca2015906632 },
publisher = {Foundation of Computer Science (FCS), NY, USA},
address = {New York, USA}
}
%0 Journal Article
%1 2024-02-06T23:21:14.643626+05:30
%A Sudhakar Singh
%A Rakhi Garg
%A P.K. Mishra
%T Performance Analysis of Apriori Algorithm with Different Data Structures on Hadoop Cluster
%J International Journal of Computer Applications
%@ 0975-8887
%V 128
%N 9
%P 45-51
%D 2015
%I Foundation of Computer Science (FCS), NY, USA
Abstract

Mining frequent itemsets from massive datasets is always being a most important problem of data mining. Apriori is the most popular and simplest algorithm for frequent itemset mining. To enhance the efficiency and scalability of Apriori, a number of algorithms have been proposed addressing the design of efficient data structures, minimizing database scan and parallel and distributed processing. MapReduce is the emerging parallel and distributed technology to process big datasets on Hadoop Cluster. To mine big datasets it is essential to re-design the data mining algorithm on this new paradigm. In this paper, we implement three variations of Apriori algorithm using data structures hash tree, trie and hash table trie i.e. trie with hash technique on MapReduce paradigm. We emphasize and investigate the significance of these three data structures for Apriori algorithm on Hadoop cluster, which has not been given attention yet. Experiments are carried out on both real life and synthetic datasets which shows that hash table trie data structures performs far better than trie and hash tree in terms of execution time. Moreover the performance in case of hash tree becomes worst.

References
  1. Agrawal, R., Imielinski, T. and Swami, A. 1993. Mining Association Rules between Sets of Items in Large Databases. In ACM SIGMOD Conf. Management of Data, Washington, D.C., 207–216.
  2. Agrawal, R. and Srikant, R. 1994. Fast Algorithms for Mining Association Rules. In Proceedings of the Twentieth International Conference on Very Large Databases, Santiago, Chile, 487–499.
  3. Ward, J. S. and Barker, A. Undefined By Data: A Survey of Big Data Definitions. http://arxiv.org/abs/1309.5821v1. Retrieved Sept. 2015.
  4. Apache Hadoop. http://hadoop.apache.org
  5. Bodon, F. and Rónyai, L. “Trie: an alternative data structure for data mining algorithms”, Mathematical and Computer Modelling, 2003, 38(7), 739-751.
  6. Bodon, F. 2010. A fast apriori implementation. In Proceedings of IEEE ICDM workshop on frequent itemset mining implementations (FIMI’03), Vol. 90.
  7. Han, J. and Kamber, M. 2006. Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers.
  8. HDFS Architecture Guide. https://hadoop.apache.org/docs/r1.2.1/hdfs_design.html
  9. Yahoo! Hadoop Tutorial. http://developer.yahoo.com/hadoop/tutorial/index.html
  10. Ghemawat, S., Gobioff, H. and Leung, S. “The Google File System”, ACM SIGOPS Operating Systems Review, 2003, 37(5), 29–43.
  11. Dean, J. and Ghemawat, S. “MapReduce: Simplified Data Processing on Large Clusters”, ACM Commun., 2008, vol. 51, 107–113.
  12. Li, J., Roy, P., Khan, S. U., Wang, L. and Bai, Y. “Data Mining Using Clouds: An Experimental Implementation of Apriori over MapReduce”, http://sameekhan.org/pub/L−K−2012−SCALCOM.pdf, Retrieved March 2014.
  13. Lin, X. 2014. MR-Apriori: Association Rules Algorithm Based on MapReduce. In Proceedings of IEEE International Conference on Software Engineering and Service Science (ICSESS).
  14. Yang, X. Y., Liu, Z. and Fu, Y. 2010. MapReduce as a Programming Model for Association Rules Algorithm on Hadoop. In Proceedings of 3rd International Conference on Information Sciences and Interaction Sciences (ICIS), 99(102), 23–25.
  15. Li, N., Zeng, L., He, Q. and Shi, Z. 2012. Parallel Implementation of Apriori Algorithm based on MapReduce. In Proceedings of 13th ACIS IEEE International Conference on Software Engineering, Artificial Intelligence, Networking and Parallel & Distributed Computing, 236–241.
  16. Oruganti, S., Ding, Q. and Tabrizi, N. 2013. Exploring HADOOP as a Platform for Distributed Association Rule Mining. In FUTURE COMPUTING 2013 the Fifth International Conference on Future Computational Technologies and Applications, 62–67.
  17. Lin, M-Y., Lee, P-Y. and Hsueh, S-C. 2012. Apriori-based Frequent Itemset Mining Algorithms on MapReduce. In Proceedings of 6th International Conference on Ubiquitous Information Management and Communication (ICUIMC ’12), ACM, New York, Article 76.
  18. Kovacs, F. and Illes, J. 2013. Frequent Itemset Mining on Hadoop. In Proceedings of IEEE 9th International Conference on Computational Cybernetics (ICCC), Hungry, 241–245.
  19. Li, L. and Zhang, M. 2011. The Strategy of Mining Association Rule Based on Cloud Computing. In Proceedings of IEEE International Conference on Business Computing and Global Informatization (BCGIN), 29–31.
  20. Yu, H., Wen, J., Wang, H. and Jun, L. An improved Apriori Algorithm Based on the Boolean Matrix and Hadoop", Procedia Engineering 15 (2011), Elsevier, 1827-1831.
  21. Riondato, M., DeBrabant, J. A., Fonseca, R. and Upfal, E. 2012. PARMA: A Parallel Randomized Algorithm for Approximate Association Rules Mining in MapReduce. In Proceedings of the 21st ACM international conference on Information and Knowledge Management, 85-94.
  22. Singh, S., Garg, R. and Mishra, P. K. 2014. Review of Apriori Based Algorithms on MapReduce Framework. In Proceedings of International Conference on Communication and Computing (ICC - 2014), Elsevier Science and Technology Publications, 593–604.
  23. Borgelt, C. 2003. Efficient implementations of apriori and éclat. In Proceedings of IEEE ICDM workshop on frequent itemset mining implementations (FIMI’03).
  24. Bodon, F. 2004. Surprising Results of Trie-based FIM Algorithms. FIMI 2004.
  25. Bodon, F. 2005. A trie-based APRIORI implementation for mining frequent item sequences. In Proceedings 1st international workshop on open source data mining: frequent pattern mining implementations, ACM.
  26. Bodon, F. and Schmidt-Thieme, L. 2005. The relation of closed itemset mining, complete pruning strategies and item ordering in apriori-based fim algorithms. In Knowledge Discovery in Databases: PKDD, Springer Berlin Heidelberg, 437-444.
  27. SPMF Datasets. http://www.philippe-fournier-viger.com/spmf/index.php?link=datasets.php
  28. Frequent Itemset Mining Dataset Repository. http://fimi.ua.ac.be/data/
Index Terms

Computer Science
Information Sciences

Keywords

Frequent Itemset Mining Apriori Hash Tree Trie Hadoop MapReduce