Open Access

CCDive: A Deep Dive into Code Clone Detection Using Local Sequence Alignment

School of Software, Tsinghua University, Beijing 100084, China
Department of Computer Engineering, Bilkent University, Ankara 06800, Turkey

Abstract

The rapid evolution of software development has exposed the deficiencies of prevailing code clone detection techniques. As modern applications grow more complex, traditional clone detection tools often struggle to detect general and large-gap clones that undergo regular modification. Such shortcomings threaten software integrity and underline the need for improved clone detection techniques. To close this gap, we propose code clone dive (CCDive), a code clone detection technique designed to detect an extensive range of clones, from direct clones to the often challenging large-gap clones, covering categories such as very strongly Type-III, strongly Type-III, and moderately Type-III clones. In CCDive, the combination of level-by-level abstraction and a new similarity matching algorithm enables the recognition of clones even when nearly half of the original code in the chunk has been modified. Furthermore, integrating Smith–Waterman local sequence alignment enhances the ability of CCDive to pinpoint the exact locations of code transformations. In a comprehensive evaluation, CCDive was compared with well-known clone detection techniques, and its efficacy was measured using precision, recall, F1-score, accuracy, and efficiency. CCDive consistently surpassed the other techniques in precision, recall, F1-score, and accuracy for both file-based and function-based clone detection. This robust performance demonstrates the effectiveness, reliability, accuracy, and efficiency of CCDive, making it well suited for practical, real-world applications.
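To illustrate how local sequence alignment can expose the exact location of a code transformation, the following is a minimal sketch of Smith–Waterman alignment over two abstracted token sequences. It is not the CCDive implementation: the function name `smith_waterman`, the token abstraction (`id`, `num`), and the scoring values (`match=2`, `mismatch=-1`, `gap=-1`) are assumptions chosen only for illustration.

```python
from typing import List, Optional, Sequence, Tuple

Pair = Tuple[Optional[str], Optional[str]]


def smith_waterman(
    a: Sequence[str],
    b: Sequence[str],
    match: int = 2,      # assumed reward for identical tokens
    mismatch: int = -1,  # assumed penalty for substituted tokens
    gap: int = -1,       # assumed penalty for an inserted/deleted token
) -> Tuple[int, List[Pair]]:
    """Return the best local alignment score and the aligned token pairs."""
    rows, cols = len(a) + 1, len(b) + 1
    # H[i][j] holds the best score of a local alignment ending at a[i-1], b[j-1].
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0, H[i - 1][j - 1] + sub, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the highest-scoring cell until the score drops to zero.
    i, j = best_pos
    pairs: List[Pair] = []
    while i > 0 and j > 0 and H[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + sub:
            pairs.append((a[i - 1], b[j - 1]))
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            pairs.append((a[i - 1], None))  # token present only in a
            i -= 1
        else:
            pairs.append((None, b[j - 1]))  # token present only in b
            j -= 1
    pairs.reverse()
    return best, pairs


# Two hypothetical abstracted statements that differ in a single operator.
original = ["int", "id", "=", "id", "+", "num", ";"]
modified = ["int", "id", "=", "id", "*", "num", ";"]
score, aligned = smith_waterman(original, modified)
changes = [pair for pair in aligned if pair[0] != pair[1]]
print(score, changes)
```

In this sketch, the aligned pairs whose two sides differ mark the positions where one fragment was transformed relative to the other; a gap (`None`) on either side would indicate an inserted or deleted token.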

Tsinghua Science and Technology
Pages 1435-1456
Cite this article:
Glani Y, Ping L, Shah SA, et al. CCDive: A Deep Dive into Code Clone Detection Using Local Sequence Alignment. Tsinghua Science and Technology, 2025, 30(4): 1435-1456. https://doi.org/10.26599/TST.2024.9010075