Research Article

Parallel and Distributed Code Clone Detection using Sequential Pattern Mining

by Ali El-matarawy, Mohammad El-ramly, Reem Bahgat
International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 62 - Number 10
Year of Publication: 2013
DOI: 10.5120/10118-4792

Ali El-matarawy, Mohammad El-ramly, Reem Bahgat. Parallel and Distributed Code Clone Detection using Sequential Pattern Mining. International Journal of Computer Applications. 62, 10 (January 2013), 25-31. DOI=10.5120/10118-4792

@article{ 10.5120/10118-4792,
author = { Ali El-matarawy and Mohammad El-ramly and Reem Bahgat },
title = { Parallel and Distributed Code Clone Detection using Sequential Pattern Mining },
journal = { International Journal of Computer Applications },
issue_date = { January 2013 },
volume = { 62 },
number = { 10 },
month = { January },
year = { 2013 },
issn = { 0975-8887 },
pages = { 25-31 },
numpages = {9},
url = { https://ijcaonline.org/archives/volume62/number10/10118-4792/ },
doi = { 10.5120/10118-4792 },
publisher = {Foundation of Computer Science (FCS), NY, USA},
address = {New York, USA}
}
%0 Journal Article
%1 2024-02-06T21:11:27.053408+05:30
%A Ali El-matarawy
%A Mohammad El-ramly
%A Reem Bahgat
%T Parallel and Distributed Code Clone Detection using Sequential Pattern Mining
%J International Journal of Computer Applications
%@ 0975-8887
%V 62
%N 10
%P 25-31
%D 2013
%I Foundation of Computer Science (FCS), NY, USA
Abstract

This research presents a parallel and distributed data mining approach to code clone detection. It aims to demonstrate the value of parallel and distributed computing for real-time, large-scale code clone detection. The approach is implemented in a family of clone detectors called PD EgyCD (Parallel and Distributed Egypt Clone Detector). It builds on the authors' earlier work on code clone and plagiarism detection using sequential pattern mining by adding parallelism and distribution to the earlier tool, EgyCD. The approach detects clones with a tailored Apriori-based data mining algorithm and uses parallelization and distribution to scale to very large systems. It is implemented as a database application that leverages the capabilities of modern database tools. Two versions of the distributed technique have been developed. The first uses a client-server technique in which all clients and the server work with a single database. The second uses agents: each client acts as a separate agent with its own database and, after working on a sub-problem, submits its partial solution to the server, which assembles the complete solution (the set of code clones). Experiments show that the agent technique is faster than the client-server one. Distribution improves performance substantially, and the speed improvement is a function of the number of clients or agents used. The conclusion is that data mining, combined with parallel and distributed computing, can be deployed efficiently for code clone detection on very large systems.
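
The agent-based idea described in the abstract can be illustrated with a minimal Python sketch. It is not the PD EgyCD implementation, which the abstract describes as a database application. The sketch assumes that source lines are already normalized, that a clone is a contiguous run of lines occurring at least `min_support` times, and that threads stand in for distributed agents. The names `count_chunk` and `find_clones` are hypothetical. Each agent counts candidate windows for its share of start positions, and the server merges the partial counts level by level, using the Apriori property to prune windows whose shorter sub-windows are not frequent.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def count_chunk(lines, starts, k, frequent_prev):
    """Agent step (hypothetical): count length-k windows at the given start positions."""
    counts = Counter()
    for i in starts:
        if i + k > len(lines):
            continue
        window = tuple(lines[i:i + k])
        # Apriori pruning: both (k-1)-length sub-windows must already be frequent.
        if k > 1 and (window[:-1] not in frequent_prev
                      or window[1:] not in frequent_prev):
            continue
        counts[window] += 1
    return counts


def find_clones(lines, min_support=2, n_agents=2):
    """Server step (hypothetical): merge partial counts, return maximal repeated runs."""
    positions = list(range(len(lines)))
    # Assumption: start positions are split round-robin among the agents.
    chunks = [positions[a::n_agents] for a in range(n_agents)]
    levels, frequent_prev, k = [], set(), 1
    with ThreadPoolExecutor(max_workers=n_agents) as pool:
        while True:
            futures = [pool.submit(count_chunk, lines, chunk, k, frequent_prev)
                       for chunk in chunks]
            total = Counter()
            for future in futures:
                total.update(future.result())  # merge partial solutions
            frequent = {w: n for w, n in total.items() if n >= min_support}
            if not frequent:
                break
            levels.append(frequent)
            frequent_prev, k = set(frequent), k + 1
    # Keep only runs that are not the prefix or suffix of a longer frequent run.
    maximal = {}
    for level, longer in zip(levels, levels[1:] + [{}]):
        covered = {w[:-1] for w in longer} | {w[1:] for w in longer}
        maximal.update({w: n for w, n in level.items() if w not in covered})
    return maximal
```

For a list of eight lines in which the three-line run `for i in r:`, `x += i`, `print(x)` appears twice, `find_clones` returns that run with a count of 2, regardless of the number of agents. Overlapping occurrences are counted separately in this sketch, which a real detector would likely need to handle.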

Index Terms

Computer Science
Information Sciences

Keywords

Code clones, textual approach, lexical approach, syntactic approach, clone types, parallel code clone detector, distributed code clone detector, clone relation terminologies, data mining, apriori property, sequential pattern mining