Regular Paper

Detecting and Untangling Composite Commits via Attributed Graph Modeling

State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210023, China

Abstract

During software development, developers often tangle multiple concerns into a single commit, producing composite commits. This paper studies the problem of detecting and untangling composite commits in order to improve the maintainability and understandability of software. Our approach builds on the observation that both the textual content of code statements and the dependencies between them help in comprehending a commit. We therefore first construct an attributed graph for each commit, in which code statements are modeled as nodes, the various code dependencies as edges, and the textual bodies of the statements as node attributes. On this attributed graph, we propose graph-based learning algorithms that first detect whether a given commit is composite and then untangle a composite commit into atomic ones. We evaluate our approach on nine C# projects, and the results demonstrate its effectiveness and efficiency.
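As a rough illustration of the attributed-graph idea described in the abstract, the following Python sketch stores code statements as nodes, typed dependencies as edges, and statement text as the node attribute. It is not the paper's method: the class and function names (`AttributedGraph`, `connected_components`, `build_example`) are hypothetical, the sample statements are invented, and plain connected components are used only as a naive stand-in for the graph-based learning algorithms the paper proposes.

```python
from dataclasses import dataclass, field


@dataclass
class AttributedGraph:
    """Hypothetical container: statements as nodes, dependencies as typed edges."""
    text: dict[int, str] = field(default_factory=dict)                # node id -> statement body
    edges: list[tuple[int, int, str]] = field(default_factory=list)   # (src, dst, dependency kind)

    def add_statement(self, node_id: int, body: str) -> None:
        self.text[node_id] = body

    def add_dependency(self, src: int, dst: int, kind: str) -> None:
        # Both endpoints must already be registered as statements.
        if src not in self.text or dst not in self.text:
            raise KeyError("dependency refers to an unknown statement")
        self.edges.append((src, dst, kind))


def connected_components(graph: AttributedGraph) -> list[set[int]]:
    """Naive stand-in for untangling: group statements linked by any dependency.

    Assumption: this ignores the node text entirely, unlike the learned
    approach in the paper, which uses both text and dependencies.
    """
    neighbours: dict[int, set[int]] = {n: set() for n in graph.text}
    for src, dst, _kind in graph.edges:
        neighbours[src].add(dst)
        neighbours[dst].add(src)

    seen: set[int] = set()
    groups: list[set[int]] = []
    for start in sorted(graph.text):
        if start in seen:
            continue
        group: set[int] = set()
        stack = [start]
        while stack:
            node = stack.pop()
            if node in group:
                continue
            group.add(node)
            stack.extend(neighbours[node] - group)
        seen |= group
        groups.append(group)
    return groups


def build_example() -> AttributedGraph:
    """Invented three-statement commit: two related statements and one unrelated."""
    g = AttributedGraph()
    g.add_statement(0, "int total = Sum(items);")
    g.add_statement(1, "Console.WriteLine(total);")
    g.add_statement(2, "logger.Level = LogLevel.Debug;")
    g.add_dependency(0, 1, "data")  # statement 1 reads the value defined in 0
    return g


groups = connected_components(build_example())
is_composite = len(groups) > 1  # more than one group suggests tangled concerns
```

In this toy setting, a commit is flagged as composite when its statements fall into more than one dependency-connected group; a learned model would instead combine the statement text with the edge structure to make both the detection and the untangling decision.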

Electronic Supplementary Material

JCST-2211-12943-Highlights.pdf (286 KB)

Journal of Computer Science and Technology
Pages 119-137
Cite this article:
Xu S-B, Chen S-Y, Yao Y, et al. Detecting and Untangling Composite Commits via Attributed Graph Modeling. Journal of Computer Science and Technology, 2025, 40(1): 119-137. https://doi.org/10.1007/s11390-024-2943-9