Large Language Model for Medical Images: A Survey of Taxonomy, Systematic Review, and Future Trends

School of Computer Science and Engineering, Central South University, Changsha 410083, China
Key Laboratory of Computing Power Network and Information Security Affiliated with Ministry of Education, Qilu University of Technology (Shandong Academy of Sciences), Jinan 250353, China


Abstract

The advent of Large Language Models (LLMs) has sparked considerable interest in the medical image domain, as they generalize across multiple tasks and deliver outstanding performance. Although LLMs achieve promising results, the field still lacks a comprehensive summary of LLMs for medical images, making it challenging for researchers to track progress in this domain. To fill this gap, we present the first comprehensive survey of LLMs for medical images. To summarize the current progress more systematically, we further introduce a novel x-stage tuning paradigm, comprising zero-stage tuning, one-stage tuning, and multi-stage tuning, which offers a unified perspective on LLMs for medical images. Finally, we discuss challenges and future directions in this domain, aiming to spur further breakthroughs. We hope this work paves the way for the broad application of LLMs to medical images and serves as a valuable resource for the field.
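
The x-stage tuning paradigm is only named above; as a rough illustration, the following minimal Python sketch assumes that the stage count refers to the number of parameter-update phases used to adapt a general-purpose (vision-)language model to medical images. The class and the example recipes and stage labels (e.g., TuningRecipe, "image-text alignment") are hypothetical and are not taken from the paper.

from dataclasses import dataclass, field
from typing import List


@dataclass
class TuningRecipe:
    """One possible reading of how a general (V)LLM is adapted to medical images."""
    name: str
    stages: List[str] = field(default_factory=list)  # parameter-update phases

    @property
    def paradigm(self) -> str:
        n = len(self.stages)
        if n == 0:
            return "zero-stage tuning"   # prompting only, model weights stay frozen
        if n == 1:
            return "one-stage tuning"    # a single medical fine-tuning pass
        return "multi-stage tuning"      # e.g., alignment pre-training then instruction tuning


# Hypothetical examples for illustration only.
prompt_only = TuningRecipe("prompting a general VLLM", stages=[])
single_pass = TuningRecipe("single medical fine-tune", stages=["medical VQA fine-tuning"])
two_pass = TuningRecipe("align-then-instruct", stages=["image-text alignment", "medical instruction tuning"])

for recipe in (prompt_only, single_pass, two_pass):
    print(f"{recipe.name}: {recipe.paradigm}")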


Cite this article:
Wang P, Lu W, Lu C, et al. Large Language Model for Medical Images: A Survey of Taxonomy, Systematic Review, and Future Trends. Big Data Mining and Analytics, 2025, 8(2): 496-517. https://doi.org/10.26599/BDMA.2024.9020090