In medical X-ray images, multiple abnormalities frequently appear together. However, existing report generation methods cannot effectively extract all abnormal features, which leads to incomplete disease diagnoses in the generated reports. In real clinical scenarios, co-occurrence relations exist among multiple diseases; if these relations are mined and integrated into the feature extraction process, the problem of missed abnormal features can be mitigated. Inspired by this observation, we propose a novel method that improves the extraction of abnormal image features through joint probability graph reasoning. Specifically, to reveal the co-occurrence relations among multiple diseases, we conduct a statistical analysis of the dataset and encode the disease relationships into a probability graph. Subsequently, we devise a graph reasoning network that performs correlation-based reasoning over medical image features, enabling the model to capture more abnormal features. Furthermore, we introduce a gating mechanism for cross-modal feature fusion into the text generation model. This substantially improves the model's ability to learn and fuse information from the two modalities, medical images and text. Experimental results on the IU-X-Ray and MIMIC-CXR datasets demonstrate that our approach outperforms previous state-of-the-art methods and generates higher-quality medical image reports.
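As a rough illustration of the pipeline the abstract describes, the following PyTorch sketch builds a disease co-occurrence probability graph from binary label statistics, applies one step of graph reasoning over per-disease image features, and gates the fusion of visual and textual features. All module names, tensor shapes, and the 14-disease setup are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of the three components named in the abstract: (1) a disease
# co-occurrence probability graph estimated from dataset label statistics,
# (2) one graph-reasoning step that propagates image features along that graph,
# and (3) a gated cross-modal fusion of visual and textual features.
import torch
import torch.nn as nn


def cooccurrence_graph(labels: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Estimate P(disease j | disease i) from a binary label matrix.

    labels: (num_samples, num_diseases), 1 if the disease is present.
    Returns a (num_diseases, num_diseases) conditional-probability graph.
    """
    counts = labels.T @ labels          # pairwise co-occurrence counts
    occurrences = labels.sum(dim=0)     # per-disease occurrence counts
    return counts / (occurrences.unsqueeze(1) + eps)


class GraphReasoning(nn.Module):
    """One GCN-style propagation step over per-disease node features."""

    def __init__(self, dim: int, adj: torch.Tensor):
        super().__init__()
        self.register_buffer("adj", adj)  # fixed co-occurrence graph
        self.proj = nn.Linear(dim, dim)

    def forward(self, node_feats: torch.Tensor) -> torch.Tensor:
        # node_feats: (batch, num_diseases, dim); propagate along the graph,
        # transform, and keep a residual connection to the input features.
        propagated = torch.einsum("ij,bjd->bid", self.adj, node_feats)
        return torch.relu(self.proj(propagated)) + node_feats


class GatedFusion(nn.Module):
    """Gate that weighs visual against textual features before decoding."""

    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, visual: torch.Tensor, textual: torch.Tensor) -> torch.Tensor:
        g = torch.sigmoid(self.gate(torch.cat([visual, textual], dim=-1)))
        return g * visual + (1 - g) * textual


if __name__ == "__main__":
    labels = (torch.rand(1000, 14) > 0.8).float()  # fake 14-disease label set
    adj = cooccurrence_graph(labels)
    reason = GraphReasoning(dim=256, adj=adj)
    fuse = GatedFusion(dim=256)
    nodes = torch.randn(2, 14, 256)                # per-disease image features
    fused = fuse(reason(nodes), torch.randn(2, 14, 256))
    print(fused.shape)                             # torch.Size([2, 14, 256])
```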
L. Zhang, K. Zhang, and H. Pan, SUNet++: A deep network with channel attention for small-scale object segmentation on 3D medical images, Tsinghua Science and Technology, vol. 28, no. 4, pp. 628–638, 2023.
B. Hui, Y. Liu, J. Qiu, L. Cao, L. Ji, and Z. He, Study of texture segmentation and classification for grading small hepatocellular carcinoma based on CT images, Tsinghua Science and Technology, vol. 26, no. 2, pp. 199–207, 2021.
X. Fan, M. Dai, C. Liu, F. Wu, X. Yan, Y. Feng, Y. Feng, and B. Su, Effect of image noise on the classification of skin lesions using deep convolutional neural networks, Tsinghua Science and Technology, vol. 25, no. 3, pp. 425–434, 2020.
M. Li, R. Liu, F. Wang, X. Chang, and X. Liang, Auxiliary signal-guided knowledge encoder-decoder for medical report generation, World Wide Web, vol. 26, no. 1, pp. 253–270, 2023.
O. Alfarghaly, R. Khaled, A. Elkorany, M. Helal, and A. Fahmy, Automated radiology report generation using conditioned transformers, Inform. Med. Unlocked, vol. 24, p. 100557, 2021.
Y. Liu, X. Feng, and Z. Zhou, Multimodal video classification with stacked contractive autoencoders, Signal Process., vol. 120, pp. 761–766, 2016.
A. Habibian, T. Mensink, and C. G. M. Snoek, Video2vec embeddings recognize events when examples are scarce, IEEE Trans. Pattern Anal. Mach. Intell., vol. 39, no. 10, pp. 2089–2103, 2017.
Z. Z. Lan, L. Bao, S. I. Yu, W. Liu, and A. G. Hauptmann, Multimedia classification and event detection using double fusion, Multimed. Tools Appl., vol. 71, pp. 333–347, 2014.
D. Demner-Fushman, M. D. Kohli, M. B. Rosenman, S. E. Shooshan, L. Rodriguez, S. Antani, G. R. Thoma, and C. J. McDonald, Preparing a collection of radiology examinations for distribution and retrieval, J. Am. Med. Inform. Assoc., vol. 23, no. 2, pp. 304–310, 2016.
A. E. W. Johnson, T. J. Pollard, S. J. Berkowitz, N. R. Greenbaum, M. P. Lungren, C. Y. Deng, R. G. Mark, and S. Horng, MIMIC-CXR, a de-identified publicly available database of chest radiographs with free-text reports, Sci. Data, vol. 6, no. 1, p. 317, 2019.
Z. Lin, D. Zhang, D. Shi, R. Xu, Q. Tao, L. Wu, M. He, and Z. Ge, Contrastive pre-training and linear interaction attention-based transformer for universal medical reports generation, J. Biomed. Inform., vol. 138, p. 104281, 2023.