Developers often tangle multiple concerns into a single commit, producing composite commits. This paper studies how to detect and untangle composite commits in order to improve the maintainability and understandability of software. Our approach builds on the observation that both the textual content of code statements and the dependencies between them help in comprehending a commit. Accordingly, we first construct an attributed graph for each commit, in which code statements are modeled as nodes, the various code dependencies between them as edges, and the textual bodies of the statements as node attributes. On top of this attributed graph, we propose graph-based learning algorithms that first detect whether a given commit is composite and then untangle a composite commit into atomic ones. We evaluate our approach on nine C# projects, and the results demonstrate its effectiveness and efficiency.
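The attributed graph described above can be illustrated with a minimal Python sketch. Everything named here (`CommitGraph`, `connected_groups`, `is_composite`, the sample statements, the `"data"` dependency label) is hypothetical and chosen only for illustration. In particular, grouping statements by connected component is a deliberately naive stand-in, not the graph-based learning algorithms proposed in the paper, and it ignores the node text that the paper's approach relies on.

```python
from dataclasses import dataclass, field


@dataclass
class CommitGraph:
    """Attributed graph for one commit: statements are nodes, dependencies are edges."""
    text: dict = field(default_factory=dict)   # node id -> statement text (node attribute)
    edges: dict = field(default_factory=dict)  # node id -> set of (neighbor id, dependency kind)

    def add_statement(self, node_id, statement_text):
        self.text[node_id] = statement_text
        self.edges.setdefault(node_id, set())

    def add_dependency(self, src, dst, kind):
        # Stored in both directions because the stand-in grouping below ignores direction.
        self.edges.setdefault(src, set()).add((dst, kind))
        self.edges.setdefault(dst, set()).add((src, kind))


def connected_groups(graph):
    """Naive stand-in for untangling: one group per connected component."""
    seen, groups = set(), []
    for start in graph.text:
        if start in seen:
            continue
        seen.add(start)
        stack, group = [start], []
        while stack:
            node = stack.pop()
            group.append(node)
            for neighbor, _kind in graph.edges[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)
        groups.append(sorted(group))
    return groups


def is_composite(graph):
    """Naive stand-in for detection: more than one group means composite."""
    return len(connected_groups(graph)) > 1


# Hypothetical commit mixing two unrelated changes.
g = CommitGraph()
g.add_statement(0, "int total = Sum(items);")
g.add_statement(1, "return total / items.Count;")
g.add_statement(2, 'logger.Info("cache cleared");')
g.add_dependency(0, 1, "data")  # statement 1 uses the value defined by statement 0
groups = connected_groups(g)
```

In this toy commit, statements 0 and 1 are linked by a data dependency while statement 2 is isolated, so the sketch reports two groups. A learned model would instead combine the statement text with the dependency structure to decide which statements belong to the same concern.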