Ensuring the general efficacy and benefit of medical Large Language Models (LLMs) for human beings before real-world deployment is crucial. However, a widely accepted and accessible evaluation process for medical LLMs, especially in the Chinese context, remains to be established. In this work, we introduce “MedBench”, a comprehensive, standardized, and reliable benchmarking system for Chinese medical LLMs. First, MedBench assembles the largest evaluation dataset to date (300,901 questions), covering 43 clinical specialties, and performs multi-faceted evaluation of medical LLMs. Second, MedBench provides a standardized and fully automated cloud-based evaluation infrastructure, with physical separation between questions and ground truths. Third, MedBench implements dynamic evaluation mechanisms to prevent shortcut learning and answer memorization. Applying MedBench to popular general-purpose and medical LLMs, we observe unbiased, reproducible evaluation results that largely align with medical professionals’ perspectives. This study lays an important foundation for preparing Chinese medical LLMs for practical application. MedBench is publicly accessible at https://medbench.opencompass.org.cn.
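As a rough illustration of what a dynamic evaluation mechanism of this kind can look like, the sketch below shuffles answer options and draws a fresh question subset for each evaluation run, so that memorized option positions or fixed question papers give no advantage. It is a minimal sketch under assumed data structures; the names (MCQuestion, dynamize, sample_run) are illustrative and do not reflect MedBench's actual interface.

```python
import random
from dataclasses import dataclass

@dataclass
class MCQuestion:
    stem: str
    options: list[str]       # option texts from the question bank
    correct_index: int = 0   # index of the correct option within `options`

def dynamize(question: MCQuestion, rng: random.Random) -> tuple[str, str]:
    """Return a freshly shuffled prompt and the letter of the correct option."""
    order = list(range(len(question.options)))
    rng.shuffle(order)                       # new option order every call
    letters = "ABCDE"
    lines = [question.stem]
    correct_letter = ""
    for slot, original_idx in enumerate(order):
        lines.append(f"{letters[slot]}. {question.options[original_idx]}")
        if original_idx == question.correct_index:
            correct_letter = letters[slot]
    return "\n".join(lines), correct_letter

def sample_run(bank: list[MCQuestion], k: int, seed: int) -> list[tuple[str, str]]:
    """Draw a per-run question subset so no two runs see an identical paper."""
    rng = random.Random(seed)
    return [dynamize(q, rng) for q in rng.sample(bank, k)]
```

Keeping the seed on the evaluation server, together with the physical separation of questions from ground truths, is one way such a design can make answer memorization ineffective while keeping results reproducible for a given run.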