Research Article

Code Clone Detection using Sequential Pattern Mining

by Ali El-Matarawy, Mohammad El-Ramly, Reem Bahgat
International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 127 - Number 2
Year of Publication: 2015
Authors: Ali El-Matarawy, Mohammad El-Ramly, Reem Bahgat
10.5120/ijca2015906324

Ali El-Matarawy, Mohammad El-Ramly, Reem Bahgat. Code Clone Detection using Sequential Pattern Mining. International Journal of Computer Applications. 127, 2 (October 2015), 10-18. DOI=10.5120/ijca2015906324

@article{ 10.5120/ijca2015906324,
author = { Ali El-Matarawy and Mohammad El-Ramly and Reem Bahgat },
title = { Code Clone Detection using Sequential Pattern Mining },
journal = { International Journal of Computer Applications },
issue_date = { October 2015 },
volume = { 127 },
number = { 2 },
month = { October },
year = { 2015 },
issn = { 0975-8887 },
pages = { 10-18 },
numpages = {9},
url = { https://ijcaonline.org/archives/volume127/number2/22700-2015906324/ },
doi = { 10.5120/ijca2015906324 },
publisher = {Foundation of Computer Science (FCS), NY, USA},
address = {New York, USA}
}
%0 Journal Article
%1 2024-02-06T23:18:48.938669+05:30
%A Ali El-Matarawy
%A Mohammad El-Ramly
%A Reem Bahgat
%T Code Clone Detection using Sequential Pattern Mining
%J International Journal of Computer Applications
%@ 0975-8887
%V 127
%N 2
%P 10-18
%D 2015
%I Foundation of Computer Science (FCS), NY, USA
Abstract

This paper presents EgyCD, a new technique for code clone detection based on sequential pattern mining (SPM). Over the last decade, many techniques and tools for software clone detection have been proposed, including textual, lexical, syntactic, and semantic approaches. In this paper, we explore the potential of data mining techniques in clone detection. The source code is treated as a sequence of transactions, which the SPM algorithm processes to find frequent itemsets. We ran three experiments to discover code clones of Type I, Type II, and Type III, and to detect plagiarism. We compared the results with those of other established code clone detectors. Our technique discovers all code clones in the source code and is therefore slower than the compared detectors, which discover fewer clones than EgyCD.
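
The general idea of treating source code as a sequence of items and mining it for repeated runs can be illustrated with a minimal sketch. This is not the EgyCD algorithm, whose details the abstract does not give. It assumes that each non-blank source line becomes one item after a simple normalization (identifiers and numbers replaced by placeholders, so renamed copies still match) and that a clone candidate is any contiguous run of items occurring at least `min_support` times. The names `KEYWORDS`, `normalize`, and `frequent_sequences` are hypothetical and chosen only for this illustration.

```python
import re
from collections import defaultdict

# Assumption: a tiny keyword set for a C-like toy language; real tools use a lexer.
KEYWORDS = {"int", "for", "if", "else", "return", "while"}


def normalize(line):
    """Turn one source line into an item: identifiers become ID, numbers NUM."""
    def repl(match):
        word = match.group(0)
        if word in KEYWORDS:
            return word
        return "NUM" if word[0].isdigit() else "ID"

    line = re.sub(r"[A-Za-z_]\w*|\d+", repl, line.strip())
    return re.sub(r"\s+", "", line)  # drop all remaining whitespace


def frequent_sequences(items, min_len=2, min_support=2):
    """Return {sequence: start positions} for contiguous runs of items that
    occur at least min_support times, growing the run length level by level."""
    result = {}
    length = min_len
    while True:
        positions = defaultdict(list)
        for i in range(len(items) - length + 1):
            positions[tuple(items[i:i + length])].append(i)
        frequent = {seq: pos for seq, pos in positions.items()
                    if len(pos) >= min_support}
        if not frequent:
            # No frequent run of this length, so no longer run can be frequent.
            break
        result.update(frequent)
        length += 1
    return result


source = """
int a = 1;
int b = a + 2;
print(b);
x = 0;
int c = 1;
int d = c + 2;
print(d);
"""

items = [normalize(line) for line in source.splitlines() if line.strip()]
clones = frequent_sequences(items)
longest = max(clones, key=len)
print(longest, "starts at items", clones[longest])
```

In this toy input the second three-line block is a renamed copy of the first (a Type II clone), and the normalization makes the two blocks map to the same item sequence. The level-wise loop stops as soon as no run of the current length is frequent, which is the usual pruning argument in frequent pattern mining.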

Index Terms

Computer Science
Information Sciences

Keywords

Sequential Pattern Mining, Clone Detection, Data Mining