CFP last date
20 September 2024
Reseach Article

Robust Frequent Patterns Mining Algorithms on Parallel Systems

by Sameh Abdulah, Walid Atwa
International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 186 - Number 36
Year of Publication: 2024
Authors: Sameh Abdulah, Walid Atwa
10.5120/ijca2024923878

Sameh Abdulah, Walid Atwa . Robust Frequent Patterns Mining Algorithms on Parallel Systems. International Journal of Computer Applications. 186, 36 ( Aug 2024), 24-30. DOI=10.5120/ijca2024923878

@article{ 10.5120/ijca2024923878,
author = { Sameh Abdulah, Walid Atwa },
title = { Robust Frequent Patterns Mining Algorithms on Parallel Systems },
journal = { International Journal of Computer Applications },
issue_date = { Aug 2024 },
volume = { 186 },
number = { 36 },
month = { Aug },
year = { 2024 },
issn = { 0975-8887 },
pages = { 24-30 },
numpages = {9},
url = { https://ijcaonline.org/archives/volume186/number36/robust-frequent-patterns-mining-algorithms-on-parallel-systems/ },
doi = { 10.5120/ijca2024923878 },
publisher = {Foundation of Computer Science (FCS), NY, USA},
address = {New York, USA}
}
%0 Journal Article
%1 2024-08-28T02:16:49.594313+05:30
%A Sameh Abdulah
%A Walid Atwa
%T Robust Frequent Patterns Mining Algorithms on Parallel Systems
%J International Journal of Computer Applications
%@ 0975-8887
%V 186
%N 36
%P 24-30
%D 2024
%I Foundation of Computer Science (FCS), NY, USA
Abstract

Data Mining (DM) algorithms have become increasingly prevalent in analyzing vast amounts of data generated in scientific fields like instrument and simulation data and in areas such as social networks and financial transactions. The availability of High-Performance Computing (HPC) systems has made parallel implementations of these algorithms commonplace. However, these systems, which were designed with data movement constraints in mind, often experience faults in computing devices, resulting in permanent process or node failures. This paper presents fault-tolerant parallel algorithms that enable checkpointing and recovery in memory for frequent pattern mining algorithms. Long-running data-intensive applications typically utilize the Message Passing Interface (MPI). Therefore, we tackle the challenge of fault tolerance in MPI-based applications by leveraging internal algorithm features and using MPI one-sided communication technology. Although this paper focuses on the FP-Growth frequent mining algorithm, we anticipate that the proposed approaches will serve as a foundation for designing fault-tolerant DM algorithms in general, given the effectiveness of the proposed implementations. Our evaluation demonstrates that MPI one-sided communication can act as a robust support system for efficient memory-based fault tolerance in parallel algorithms, even when compared to existing parallel programming models like Hadoop and Spark. To evaluate our fault-tolerant (FT) algorithms, we conduct tests on a large-scale InfiniBand cluster using several extensive datasets, employing up to 2K cores. Our evaluation reveals excellent efficiency in checkpointing and recovery compared to the disk-based approach. Furthermore, we observe an average speed-up of 20X for the FP-Growth algorithm compared to Spark. This establishes that a well-designed algorithm can easily surpass a solution based on a general fault-tolerant programming model.

References
  1. Bilge Acun, Abhishek Gupta, Nikhil Jain, Akhil Langer, Harshitha Menon, Eric Mikida, Xiang Ni, Michael Robson, Yanhua Sun, Ehsan Totoni, et al. Parallel programming with migratable objects: Charm++ in practice. In SC’14: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pages 647– 658. IEEE, 2014.
  2. R Agrawal and R Srikant. Quest synthetic data generator. ibm almaden research center, san jose, california, 2009.
  3. Rakesh Agrawal, Tomasz Imieli´nski, and Arun Swami. Mining association rules between sets of items in large databases. In Acm sigmod record, volume 22, pages 207–216. ACM, 1993.
  4. Albert Alexandrov, Mihai F Ionescu, Klaus E Schauser, and Chris Scheiman. Loggp: Incorporating long messages into the logp model—one step closer towards a realistic model for parallel computation. In Proceedings of the seventh annual ACM symposium on Parallel algorithms and architectures, pages 95–105, 1995.
  5. Chong Bao and Shancong Zhang. Algorithm-based fault tolerance for discrete wavelet transform implemented on gpus. Journal of systems architecture, 108:101823, 2020.
  6. Christian Borgelt. An implementation of the fp-growth algorithm. In Proceedings of the 1st international workshop on open source data mining: frequent pattern mining implementations, pages 1–5. ACM, 2005.
  7. Claus Braun, Sebastian Halder, and Hans Joachim Wunderlich. A-abft: Autonomous algorithm-based fault tolerance for matrix multiplications on graphics processing units. In Dependable Systems and Networks (DSN), 2014 44th Annual IEEE/IFIP International Conference on, pages 443–454. IEEE, 2014.
  8. Jiajun Cao, Kapil Arya, Rohan Garg, Shawn Matott, Dhabaleswar K Panda, Hari Subramoni, J´erˆome Vienne, and Gene Cooperman. System-level scalable checkpoint-restart for petascale computing. In 2016 IEEE 22nd International Conference on Parallel and Distributed Systems (ICPADS), pages 932–941. IEEE, 2016.
  9. Philippe Charles, Christian Grothoff, Vijay Saraswat, Christopher Donawa, Allan Kielstra, Kemal Ebcioglu, Christoph Von Praun, and Vivek Sarkar. X10: an object-oriented approach to non-uniform cluster computing. Acm Sigplan Notices, 40(10):519–538, 2005.
  10. Ifeanyi P Egwutuoha, David Levy, Bran Selic, and Shiping Chen. A survey of fault tolerance mechanisms and checkpoint/ restart implementations for high performance computing systems. The Journal of Supercomputing, 65(3):1302– 1326, 2013.
  11. Karam Gouda and Mohammed J Zaki. Genmax: An efficient algorithm for mining maximal frequent itemsets. Data Mining and Knowledge Discovery, 11(3):223–242, 2005.
  12. Upama Kabir and Dhrubajyoti Goswami. An abft scheme based on communication characteristics. In Cluster Computing (CLUSTER), 2016 IEEE International Conference on, pages 515–523. IEEE, 2016.
  13. Laxmikant V Kale and Sanjeev Krishnan. Charm++ a portable concurrent object oriented system based on c++. In Proceedings of the eighth annual conference on Objectoriented programming systems, languages, and applications, pages 91–108, 1993.
  14. Khushboo Kalia and Neeraj Gupta. Analysis of hadoop mapreduce scheduling in heterogeneous environment. Ain Shams Engineering Journal, 12(1):1101–1110, 2021.
  15. Jia-Ling Koh and Pei-Wy Yo. An efficient approach for mining fault-tolerant frequent patterns based on bit vector representations. In DASFAA, volume 5, pages 568–575. Springer, 2005.
  16. Jay Lewis, Ryan G Benton, David Bourrie, and Jennifer Lavergne. Enhancing itemset tree rules and performance. In 2019 IEEE International Conference on Big Data (Big Data), pages 1143–1150. IEEE, 2019.
  17. Wei-Tee Lin and Chih-Ping Chu. Determining the appropriate number of nodes for fast mining of frequent patterns in distributed computing environments. International Journal of Parallel, Emergent and Distributed Systems, (aheadof- print):1–13, 2014.
  18. Adam Moody, Greg Bronevetsky, Kathryn Mohror, and Bronis R De Supinski. Design, modeling, and evaluation of a scalable multi-level checkpointing system. In High Performance Computing, Networking, Storage and Analysis (SC), 2010 International Conference for, pages 1–11. IEEE, 2010.
  19. Faisal Shahzad, Markus Wittmann, Moritz Kreutzer, Thomas Zeiser, Georg Hager, and GerhardWellein. A survey of checkpoint/ restart techniques on distributed memory systems. Parallel Processing Letters, 23(04):1340011, 2013.
  20. Mai Shawkat, Mahmoud Badawi, Sally El-ghamrawy, Reham Arnous, and Ali El-desoky. An optimized fp-growth algorithm for discovery of association rules. The Journal of Supercomputing, pages 1–28, 2022.
  21. Kamalpreet Singh and Ravinder Kaur. Hadoop: addressing challenges of big data. In 2014 IEEE International Advance Computing Conference (IACC), pages 686–689. IEEE, 2014.
  22. Panruo Wu, Qiang Guan, Nathan DeBardeleben, Sean Blanchard, Dingwen Tao, Xin Liang, Jieyang Chen, and Zizhong Chen. Towards practical algorithm based fault tolerance in dense linear algebra. In Proceedings of the 25th ACM International Symposium on High-Performance Parallel and Distributed Computing, pages 31–42, 2016.
  23. Cheng Yang, Usama Fayyad, and Paul S Bradley. Efficient discovery of error-tolerant frequent itemsets in high dimensions. In Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining, pages 194–203. ACM, 2001.
  24. Matei Zaharia and et al. Resilient distributed datasets: A faulttolerant abstraction for in-memory cluster computing. In Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation, NSDI’12, Berkeley, CA, USA, 2012. USENIX Association.
  25. Mohammed Javeed Zaki. Scalable algorithms for association mining. IEEE Transactions on Knowledge and Data Engineering, 12(3):372–390, 2000.
  26. Kai Zhao, Sheng Di, Sihuan Li, Xin Liang, Yujia Zhai, Jieyang Chen, Kaiming Ouyang, Franck Cappello, and Zizhong Chen. Ft-cnn: Algorithm-based fault tolerance for convolutional neural networks. IEEE Transactions on Parallel and Distributed Systems, 32(7):1677–1689, 2020.
Index Terms

Computer Science
Information Sciences
Message Passing Interface (MPI)
One-side communication
fault tolerance

Keywords

MPI one-sided communication MPI-based applications fault tolerance algorithms FP-Growth Frequent mining algorithm