
MedBench: A Comprehensive, Standardized, and Reliable Benchmarking System for Evaluating Chinese Medical Large Language Models

Shanghai Artificial Intelligence Laboratory, Shanghai 200232, China
Ruijin Hospital Affiliated to Shanghai Jiao Tong University School of Medicine, Shanghai 200025, China
Chinese University of Hong Kong, Shenzhen 518172, China
Shanghai Artificial Intelligence Research Institute, Shanghai 200240, and also with Shanghai Jiao Tong University, Shanghai 200240, China
School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University, Shanghai 200240, China
Qing Yuan Research Institute, Shanghai Jiao Tong University, Shanghai 200240, China
Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen 518055, China
West China Hospital, Sichuan University, Chengdu 610041, China
School of Design and Innovation, Tongji University, Shanghai 200092, China
Department of Computer Science and Technology, East China University of Science and Technology, Shanghai 200237, China
School of Computer Science, Fudan University, Shanghai 200433, China
Xinhua Hospital Affiliated to Shanghai Jiao Tong University School of Medicine, Shanghai 200092, China


Abstract

Ensuring that medical Large Language Models (LLMs) are effective and beneficial to humans before real-world deployment is crucial. However, a widely accepted and accessible evaluation process for medical LLMs, especially in the Chinese context, remains to be established. In this work, we introduce “MedBench”, a comprehensive, standardized, and reliable benchmarking system for Chinese medical LLMs. First, MedBench assembles the currently largest evaluation dataset (300,901 questions), covering 43 clinical specialties, and performs multi-faceted evaluation of medical LLMs. Second, MedBench provides a standardized and fully automatic cloud-based evaluation infrastructure, with physical separation between questions and ground truth. Third, MedBench implements dynamic evaluation mechanisms to prevent shortcut learning and answer memorization. Applying MedBench to popular general and medical LLMs, we observe unbiased, reproducible evaluation results that largely align with medical professionals’ perspectives. This study lays a solid foundation for the practical application of Chinese medical LLMs. MedBench is publicly accessible at https://medbench.opencompass.org.cn.
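
The second and third design points above (keeping ground truth physically separate from the questions served to candidate models, and re-drawing the question set on each run to defeat answer memorization) can be sketched in a few lines of Python. The snippet below is purely illustrative and is not the authors' implementation: QuestionBank, sample, and grade are hypothetical names, and simple exact-match grading stands in for MedBench's multi-faceted metrics.

    import hashlib
    import random

    class QuestionBank:
        """Illustrative server-side store: reference answers are held
        apart from the question texts that leave the server."""

        def __init__(self, items):
            # items: iterable of (question_text, reference_answer) pairs
            self._questions = [q for q, _ in items]
            self._answers = {self._key(q): a for q, a in items}

        @staticmethod
        def _key(question):
            return hashlib.sha256(question.encode("utf-8")).hexdigest()

        def sample(self, n, seed=None):
            # Dynamic evaluation: each run draws a fresh subset, so a model
            # cannot score well by memorizing a fixed, published test set.
            rng = random.Random(seed)
            return rng.sample(self._questions, n)

        def grade(self, question, model_answer):
            # The reference answer never leaves the server; only the verdict does.
            return model_answer.strip() == self._answers[self._key(question)]

    bank = QuestionBank([("青霉素皮试阳性提示什么？", "禁用青霉素类药物")])
    for q in bank.sample(1, seed=2024):
        print(bank.grade(q, "禁用青霉素类药物"))  # True

In a real deployment the client would only ever see the output of sample() and would submit answers back for grading, which is the sense in which questions and ground truth are physically separated.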

Electronic Supplementary Material

BDMA-2024-0079_ESM.pdf (245.8 KB)

Cite this article:
Liu M, Hu W, Ding J, et al. MedBench: A Comprehensive, Standardized, and Reliable Benchmarking System for Evaluating Chinese Medical Large Language Models. Big Data Mining and Analytics, 2024, 7(4): 1116-1128. https://doi.org/10.26599/BDMA.2024.9020044