Ensuring the general efficacy and benefit of medical Large Language Models (LLMs) for human beings before real-world deployment is crucial. However, a widely accepted and accessible evaluation process for medical LLMs, especially in the Chinese context, remains to be established. In this work, we introduce “MedBench”, a comprehensive, standardized, and reliable benchmarking system for Chinese medical LLMs. First, MedBench assembles the largest evaluation dataset to date (300,901 questions), covering 43 clinical specialties, and performs multi-faceted evaluation of medical LLMs. Second, MedBench provides a standardized and fully automated cloud-based evaluation infrastructure, with physical separation between questions and ground truths. Third, MedBench implements dynamic evaluation mechanisms to prevent shortcut learning and answer memorization. Applying MedBench to popular general-purpose and medical LLMs, we observe unbiased, reproducible evaluation results that largely align with medical professionals’ perspectives. This study lays an important foundation for preparing Chinese medical LLMs for practical application. MedBench is publicly accessible at https://medbench.opencompass.org.cn.
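As a rough illustration of what a dynamic evaluation mechanism of this kind can look like, the sketch below shuffles answer options and draws a fresh question subset for each evaluation run, so that memorized option positions or fixed question papers give no advantage. It is a minimal sketch under assumed data structures; the names (MCQuestion, dynamize, sample_run) are illustrative and do not reflect MedBench's actual interface.

```python
import random
from dataclasses import dataclass

@dataclass
class MCQuestion:
    stem: str
    options: list[str]       # option texts from the question bank
    correct_index: int = 0   # index of the correct option within `options`

def dynamize(question: MCQuestion, rng: random.Random) -> tuple[str, str]:
    """Return a freshly shuffled prompt and the letter of the correct option."""
    order = list(range(len(question.options)))
    rng.shuffle(order)                       # new option order every call
    letters = "ABCDE"
    lines = [question.stem]
    correct_letter = ""
    for slot, original_idx in enumerate(order):
        lines.append(f"{letters[slot]}. {question.options[original_idx]}")
        if original_idx == question.correct_index:
            correct_letter = letters[slot]
    return "\n".join(lines), correct_letter

def sample_run(bank: list[MCQuestion], k: int, seed: int) -> list[tuple[str, str]]:
    """Draw a per-run question subset so no two runs see an identical paper."""
    rng = random.Random(seed)
    return [dynamize(q, rng) for q in rng.sample(bank, k)]
```

Keeping the seed on the evaluation server, together with the physical separation of questions from ground truths, is one way such a design can make answer memorization ineffective while keeping results reproducible for a given run.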