Open Access | Just Accepted

MedBench: A Comprehensive, Standardized, and Reliable Benchmarking System for Evaluating Chinese Medical Large Language Models

Mianxin Liu1, Weiguo Hu2, Jinru Ding1, Jie Xu1, Xiaoyang Li2, Lifeng Zhu2, Zhian Bai2, Xiaoming Shi1, Benyou Wang3, Haitao Song4, Pengfei Liu5, Xiaofan Zhang6, Shanshan Wang7, Kang Li8, Haofen Wang9, Tong Ruan10, Xuanjing Huang11, Xin Sun12, Shaoting Zhang1 (✉)

1 Shanghai Artificial Intelligence Laboratory, Shanghai 200232, China

2 Ruijin Hospital Affiliated to Shanghai Jiaotong University School of Medicine, Shanghai 200025, China

3 Chinese University of Hong Kong, Shenzhen 518172, China

4 Shanghai Artificial Intelligence Research Institute, and Shanghai Jiao Tong University, Shanghai 200240, China

5 School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University, Shanghai 200240, China

6 Qing Yuan Research Institute, Shanghai Jiao Tong University, Shanghai 200240, China

7 Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen 518055, China

8 Sichuan University West China Hospital, Chengdu 610041, China

9 College of Design and Innovation, Tongji University, Shanghai 200092, China

10 Department of Computer Science and Technology, East China University of Science and Technology, Shanghai 200237, China

11 School of Computer Science, Fudan University, Shanghai 200433, China

12 Xinhua Hospital Affiliated to Shanghai Jiaotong University School of Medicine, Shanghai 200092, China

Mianxin Liu and Weiguo Hu contributed equally.


Abstract

Ensuring the general efficacy and benefit for human beings of medical large language models (LLMs) before real-world deployment is crucial. However, a widely accepted and accessible evaluation process for medical LLMs, especially in the Chinese context, remains to be established. In this work, we introduce “MedBench”, a comprehensive, standardized, and reliable benchmarking system for Chinese medical LLMs. First, MedBench assembles the largest evaluation dataset to date (300 901 questions), covering 43 clinical specialties, and performs multi-faceted evaluation of medical LLMs. Second, MedBench provides a standardized and fully automatic cloud-based evaluation infrastructure, with physical separation of questions and ground truth. Third, MedBench implements dynamic evaluation mechanisms to prevent shortcut learning and answer memorization. Applying MedBench to popular general and medical LLMs, we observe unbiased, reproducible evaluation results that largely align with medical professionals’ perspectives. This study establishes a significant foundation for preparing the practical applications of Chinese medical LLMs. MedBench is publicly accessible at https://medbench.opencompass.org.cn.
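
As an illustration of the evaluation workflow sketched in the abstract (questions stored separately from the ground truth, with a fresh subset sampled dynamically for each run), the following minimal Python sketch shows one way such a round could look. The names QUESTION_POOL, GROUND_TRUTH, and evaluate, and the toy questions, are hypothetical and are not taken from the MedBench implementation.

import random

# Hypothetical sketch: the question pool and the ground-truth answers live in
# separate stores, and each evaluation run samples a fresh subset of questions,
# so a model cannot score well by simply memorizing a fixed test set.
QUESTION_POOL = {
    "q1": "Which of the following is a first-line drug for hypertension? (A/B/C/D)",
    "q2": "What is the normal fasting blood glucose range in adults? (A/B/C/D)",
}
GROUND_TRUTH = {"q1": "A", "q2": "B"}  # kept separate from the questions


def evaluate(model_answer_fn, sample_size=2, seed=None):
    """Score a model on a randomly sampled subset of the question pool."""
    rng = random.Random(seed)
    sampled_ids = rng.sample(list(QUESTION_POOL), k=sample_size)
    correct = sum(
        model_answer_fn(QUESTION_POOL[qid]).strip().upper() == GROUND_TRUTH[qid]
        for qid in sampled_ids
    )
    return correct / len(sampled_ids)


# Example: a trivial stand-in "model" that always answers "A" scores 0.5 here.
print(evaluate(lambda question: "A", seed=0))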

Big Data Mining and Analytics
Cite this article:
Liu M, Hu W, Ding J, et al. MedBench: A Comprehensive, Standardized, and Reliable Benchmarking System for Evaluating Chinese Medical Large Language Models. Big Data Mining and Analytics, 2024, https://doi.org/10.26599/BDMA.2024.9020044


Received: 06 February 2024
Revised: 09 May 2024
Accepted: 11 June 2024
Available online: 01 July 2024

© The author(s) 2024.

The articles published in this open access journal are distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/).
