[2]
H. P. Chan, R. K. Samala, L. M. Hadjiiski, and C. Zhou, Deep learning in medical image analysis, in Deep Learning in Medical Image Analysis: Challenges and Applications, G. Lee and H. Fujita, eds. Cham, Switzerland: Springer, 2020, pp. 3–21.
[4]
W. X. Zhao, K. Zhou, J. Li, T. Tang, X. Wang, Y. Hou, Y. Min, B. Zhang, J. Zhang, Z. Dong, et al., A survey of large language models, arXiv preprint arXiv: 2303.18223, 2024.
[5]
L. Qin, Q. Chen, X. Feng, Y. Wu, Y. Zhang, Y. Li, M. Li, W. Che, and P. S. Yu, Large language models meet NLP: A survey, arXiv preprint arXiv: 2405.12819, 2024.
[6]
L. Qin, Q. Chen, Y. Zhou, Z. Chen, Y. Li, L. Liao, M. Li, W. Che, and P. S. Yu, Multilingual large language model: A survey of resources, taxonomy and frontiers, arXiv preprint arXiv: 2404.04925, 2024.
[9]
L. Qin, F. Wei, Q. Chen, J. Zhou, S. Huang, J. Si, W. Lu, and W. Che, CroPrompt: Cross-task interactive prompting for zero-shot spoken language understanding, arXiv preprint arXiv: 2406.10505, 2024.
[11]
H. Zhou, F. Liu, B. Gu, X. Zou, J. Huang, J. Wu, Y. Li, S. S. Chen, P. Zhou, J. Liu, et al., A survey of large language models in medicine: Progress, application, and challenge, arXiv preprint arXiv: 2311.05112, 2024.
[12]
M. Moor, Q. Huang, S. Wu, M. Yasunaga, Y. Dalmia, J. Leskovec, C. Zakka, E. P. Reis, and P. Rajpurkar, Med-Flamingo: A multimodal medical few-shot learner, in Proc. 3rd Machine Learning for Health Symp., New Orleans, LA, USA, 2023, pp. 353–367.
[14]
Y. Li, Y. Liu, Z. Wang, X. Liang, L. Wang, L. Liu, L. Cui, Z. Tu, L. Wang, and L. Zhou, A systematic evaluation of GPT-4V’s multimodal capability for medical image analysis, arXiv preprint arXiv: 2310.20381, 2024.
[15]
Z. Liu, H. Jiang, T. Zhong, Z. Wu, C. Ma, Y. Li, X. Yu, Y. Zhang, Y. Pan, P. Shu, et al., Holistic evaluation of GPT-4V for biomedical imaging, arXiv preprint arXiv: 2312.05256, 2023.
[16]
M. Li, B. Lin, Z. Chen, H. Lin, X. Liang, and X. Chang, Dynamic graph enhanced contrastive learning for chest X-ray report generation, in Proc. IEEE/CVF Conf. Computer Vision and Pattern Recognition, Vancouver, Canada, 2023, pp. 3334–3343.
[17]
J. Irvin, P. Rajpurkar, M. Ko, Y. Yu, S. Ciurea-Ilcus, C. Chute, H. Marklund, B. Haghgoo, R. Ball, K. Shpanskaya, et al., CheXpert: A large chest radiograph dataset with uncertainty labels and expert comparison, in Proc. 33rd AAAI Conf. Artificial Intelligence, Honolulu, HI, USA, 2019, pp. 590–597.
[18]
Y. Zhang, X. Wang, Z. Xu, Q. Yu, A. Yuille, and D. Xu, When radiology report generation meets knowledge graph, in Proc. 34th AAAI Conf. Artificial Intelligence, New York, NY, USA, 2020, pp. 12910–12917.
[20]
B. Boecking, N. Usuyama, S. Bannur, D. C. Castro, A. Schwaighofer, S. Hyland, M. Wetscherek, T. Naumann, A. Nori, J. Alvarez-Valle, et al., Making the most of text semantics to improve biomedical vision-language processing, in Proc. 17th European Conf. Computer Vision, Tel Aviv, Israel, 2022, pp. 1–21.
[21]
S. Wang, Z. Zhao, X. Ouyang, Q. Wang, and D. Shen, ChatCAD: Interactive computer-aided diagnosis on medical image using large language models, arXiv preprint arXiv: 2302.07257, 2023.
[22]
Z. Zhao, S. Wang, J. Gu, Y. Zhu, L. Mei, Z. Zhuang, Z. Cui, Q. Wang, and D. Shen, ChatCAD+: Towards a universal and reliable interactive CAD using LLMs, arXiv preprint arXiv: 2305.15964, 2024.
[23]
Z. Yang, L. Li, K. Lin, J. Wang, C. C. Lin, Z. Liu, and L. Wang, The dawn of LMMs: Preliminary explorations with GPT-4V(ision), arXiv preprint arXiv: 2309.17421, 2023.
[24]
A. Pal and M. Sankarasubbu, Gemini goes to med school: Exploring the capabilities of multimodal large language models on medical challenge problems & hallucinations, arXiv preprint arXiv: 2402.07023, 2024.
[25]
T. Han, L. C. Adams, S. Nebelung, J. N. Kather, K. K. Bressem, and D. Truhn, Multimodal large language models are generalist medical image interpreters, medRxiv preprint medRxiv: 10.1101/2023.12.21.23300146, 2023.
[26]
W. Gao, Z. Deng, Z. Niu, F. Rong, C. Chen, Z. Gong, W. Zhang, D. Xiao, F. Li, Z. Cao, et al., OphGLM: Training an ophthalmology large language-and-vision assistant based on instructions and dialogue, arXiv preprint arXiv: 2306.12174, 2023.
[27]
C. Shu, B. Chen, F. Liu, Z. Fu, E. Shareghi, and N. Collier, Visual Med-Alpaca: A parameter-efficient biomedical LLM with visual capabilities, https://cambridgeltl.github.io/visual-med-alpaca, 2023.
[28]
D. Shi, X. Chen, W. Zhang, P. Xu, Z. Zhao, Y. Zheng, and M. He, FFA-GPT: An interactive visual question answering system for fundus fluorescein angiography, https://doi.org/10.21203/rs.3.rs-3307492/v1, 2023.
[29]
Q. Chen, X. Hu, Z. Wang, and Y. Hong, MedBLIP: Bootstrapping language-image pre-training from 3D medical images and texts, arXiv preprint arXiv: 2305.10799, 2023.
[30]
C. Wu, X. Zhang, Y. Zhang, Y. Wang, and W. Xie, Towards generalist foundation model for radiology by leveraging web-scale 2D&3D medical data, arXiv preprint arXiv: 2308.02463, 2023.
[31]
B. N. Zhao, X. Jiang, X. Luo, Y. Yang, B. Li, Z. Wang, J. Alvarez-Valle, M. P. Lungren, D. Li, and L. Qiu, Large multimodal model for real-world radiology report generation, https://openreview.net/forum?id=3Jl0sjmZx9, 2023.
[32]
S. L. Hyland, S. Bannur, K. Bouzid, D. C. Castro, M. Ranjit, A. Schwaighofer, F. Pérez-García, V. Salvatelli, S. Srivastav, A. Thieme, et al., MAIRA-1: A specialised large multimodal model for radiology report generation, arXiv preprint arXiv: 2311.13668, 2024.
[33]
T. Tu, S. Azizi, D. Driess, M. Schaekermann, M. Amin, P. C. Chang, A. Carroll, C. Lau, R. Tanno, I. Ktena, et al., Towards generalist biomedical AI, arXiv preprint arXiv: 2307.14334, 2023.
[34]
B. Yang, A. Raza, Y. Zou, and T. Zhang, Customizing general-purpose foundation models for medical report generation, arXiv preprint arXiv: 2306.05642, 2023.
[35]
X. Zhang, C. Wu, Z. Zhao, W. Lin, Y. Zhang, Y. Wang, and W. Xie, PMC-VQA: Visual instruction tuning for medical visual question answering, arXiv preprint arXiv: 2305.10415, 2024.
[36]
T. van Sonsbeek, M. M. Derakhshani, I. Najdenkoska, C. G. M. Snoek, and M. Worring, Open-ended medical visual question answering through prefix tuning of language models, arXiv preprint arXiv: 2303.05977, 2023.
[37]
Z. Wang, L. Liu, L. Wang, and L. Zhou, R2GenGPT: Radiology report generation with frozen LLMs, Meta-Radiology, vol. 1, no. 3, p. 100033, 2023.
[38]
L. Yang, Z. Wang, and L. Zhou, MedXChat: Bridging CXR modalities with a unified multimodal large model, arXiv preprint arXiv: 2312.02233, 2024.
[40]
J. He, P. Li, G. Liu, Z. Zhao, and S. Zhong, PeFoMed: Parameter efficient fine-tuning on multimodal large language models for medical visual question answering, arXiv preprint arXiv: 2401.02797, 2024.
[41]
J. Zhou, X. He, L. Sun, J. Xu, X. Chen, Y. Chu, L. Zhou, X. Liao, B. Zhang, and X. Gao, SkinGPT-4: An interactive dermatology diagnostic system with visual large language model, arXiv preprint arXiv: 2304.10691, 2023.
[42]
L. Ma, J. Han, Z. Wang, and D. Zhang, CephGPT-4: An interactive multimodal cephalometric measurement and diagnostic system with visual large language model, arXiv preprint arXiv: 2307.07518, 2023.
[43]
Y. Sun, C. Zhu, S. Zheng, K. Zhang, L. Sun, Z. Shui, Y. Zhang, H. Li, and L. Yang, PathAsst: Redefining pathology through generative foundation AI assistant for pathology, arXiv preprint arXiv: 2305.15072, 2024.
[44]
S. Lee, J. Youn, M. Kim, and S. H. Yoon, CXR-LLaVA: A multimodal large language model for interpreting chest X-ray images, arXiv preprint arXiv: 2310.18341, 2024.
[45]
C. Li, C. Wong, S. Zhang, N. Usuyama, H. Liu, J. Yang, T. Naumann, H. Poon, and J. Gao, LLaVA-Med: Training a large language-and-vision assistant for biomedicine in one day, arXiv preprint arXiv: 2306.00890, 2023.
[46]
S. Xu, L. Yang, C. Kelly, M. Sieniek, T. Kohlberger, M. Ma, W. H. Weng, A. Kiraly, S. Kazemzadeh, Z. Melamed, et al., ELIXR: Towards a general purpose X-ray artificial intelligence system through alignment of large language models and radiology vision encoders, arXiv preprint arXiv: 2308.01317, 2023.
[47]
K. Tian, Towards automated healthcare: Deep vision and large language models for radiology report generation, PhD dissertation, Harvard University, Cambridge, MA, USA, 2023.
[48]
W. Zhou, Z. Ye, Y. Yang, S. Wang, H. Huang, R. Wang, and D. Yang, Transferring pre-trained large language-image model for medical image captioning, in Proc. CLEF 2023: Conf. and Labs of the Evaluation Forum, Thessaloniki, Greece, 2023, http://sunsite.informatik.rwth-aachen.de/Publications/CEUR-WS/Vol-3497/paper-148.pdf
[50]
O. Thawkar, A. Shaker, S. S. Mullappilly, H. Cholakkal, R. M. Anwer, S. Khan, J. Laaksonen, and F. S. Khan, XrayGPT: Chest radiographs summarization using medical vision-language models, arXiv preprint arXiv: 2306.07971, 2023.
[51]
J. Liu, Z. Wang, Q. Ye, D. Chong, P. Zhou, and Y. Hua, Qilin-Med-VL: Towards Chinese large vision-language model for general healthcare, arXiv preprint arXiv: 2310.17956, 2023.
[52]
S. Lee, W. J. Kim, and J. C. Ye, LLM itself can read and generate CXR images, arXiv preprint arXiv: 2305.11490, 2024.
[53]
Y. Lu, S. Hong, Y. Shah, and P. Xu, Effectively fine-tune to improve large multimodal models for radiology report generation, arXiv preprint arXiv: 2312.01504, 2023.
[55]
K. Le-Duc, R. Zhang, N. S. Nguyen, T. H. Pham, A. Dao, B. H. Ngo, A. T. Nguyen, and T. S. Hy, LiteGPT: Large vision-language model for joint chest X-ray localization and classification task, arXiv preprint arXiv: 2407.12064, 2024.
[56]
A. Alkhaldi, R. Alnajim, L. Alabdullatef, R. Alyahya, J. Chen, D. Zhu, A. Alsinan, and M. Elhoseiny, MiniGPT-Med: Large language model as a general interface for radiology diagnosis, arXiv preprint arXiv: 2407.04106, 2024.
[57]
Gemini Team, Google, Gemini: A family of highly capable multimodal models, arXiv preprint arXiv: 2312.11805, 2024.
[58]
J. B. Alayrac, J. Donahue, P. Luc, A. Miech, I. Barr, Y. Hasson, K. Lenc, A. Mensch, K. Millican, M. Reynolds, et al., Flamingo: A visual language model for few-shot learning, in Proc. 36th Int. Conf. Neural Information Processing Systems, New Orleans, LA, USA, 2022, pp. 23716–23736.
[59]
A. Zeng, X. Liu, Z. Du, Z. Wang, H. Lai, M. Ding, Z. Yang, Y. Xu, W. Zheng, X. Xia, et al., GLM-130B: An open bilingual pre-trained model, arXiv preprint arXiv: 2210.02414, 2023.
[60]
H. Touvron, T. Lavril, G. Izacard, X. Martinet, M. A. Lachaux, T. Lacroix, B. Roziere, N. Goyal, E. Hambro, F. Azhar, et al., LLaMA: Open and efficient foundation language models, arXiv preprint arXiv: 2302.13971, 2023.
[61]
J. Li, D. Li, C. Xiong, and S. Hoi, BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation, in Proc. 39th Int. Conf. Machine Learning, Baltimore, MD, USA, 2022, pp. 12888–12900.
[62]
H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale, et al., Llama 2: Open foundation and fine-tuned chat models, arXiv preprint arXiv: 2307.09288, 2023.
[63]
A. Awadalla, I. Gao, J. Gardner, J. Hessel, Y. Hanafy, W. Zhu, K. Marathe, Y. Bitton, S. Gadre, S. Sagawa, et al., OpenFlamingo: An open-source framework for training large autoregressive vision-language models, arXiv preprint arXiv: 2308.01390, 2023.
[64]
H. Liu, C. Li, Y. Li, and Y. J. Lee, Improved baselines with visual instruction tuning, arXiv preprint arXiv: 2310.03744, 2024.
[65]
D. Driess, F. Xia, M. S. M. Sajjadi, C. Lynch, A. Chowdhery, B. Ichter, A. Wahid, J. Tompson, Q. Vuong, T. Yu, et al., PaLM-E: An embodied multimodal language model, arXiv preprint arXiv: 2303.03378, 2023.
[66]
J. Li, D. Li, S. Savarese, and S. Hoi, BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models, arXiv preprint arXiv: 2301.12597, 2023.
[67]
A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al., Learning transferable visual models from natural language supervision, in Proc. 38th Int. Conf. Machine Learning, Virtual Event, 2021, pp. 8748–8763.
[68]
R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer, High-resolution image synthesis with latent diffusion models, in Proc. IEEE/CVF Conf. Computer Vision and Pattern Recognition, New Orleans, LA, USA, 2022, pp. 10674–10685.
[69]
D. Zhu, J. Chen, X. Shen, X. Li, and M. Elhoseiny, MiniGPT-4: Enhancing vision-language understanding with advanced large language models, arXiv preprint arXiv: 2304.10592, 2023.
[70]
J. Chen, D. Zhu, X. Shen, X. Li, Z. Liu, P. Zhang, R. Krishnamoorthi, V. Chandra, Y. Xiong, and M. Elhoseiny, MiniGPT-v2: Large language model as a unified interface for vision-language multi-task learning, arXiv preprint arXiv: 2310.09478, 2023.
[71]
H. Liu, C. Li, Q. Wu, and Y. J. Lee, Visual instruction tuning, in Proc. 37th Int. Conf. Neural Information Processing Systems, New Orleans, LA, USA, 2024, pp. 34892–34916.
[72]
A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, et al., An image is worth 16×16 words: Transformers for image recognition at scale, arXiv preprint arXiv: 2010.11929, 2021.
[73]
S. Zhang, S. Roller, N. Goyal, M. Artetxe, M. Chen, S. Chen, C. Dewan, M. Diab, X. Li, X. V. Lin, et al., OPT: Open pre-trained transformer language models, arXiv preprint arXiv: 2205.01068, 2022.
[74]
W. Dai, J. Li, D. Li, A. M. H. Tiong, J. Zhao, W. Wang, B. Li, P. Fung, and S. Hoi, InstructBLIP: Towards general-purpose vision-language models with instruction tuning, arXiv preprint arXiv: 2305.06500, 2023.
[75]
Z. Wang, Z. Wu, D. Agarwal, and J. Sun, MedCLIP: Contrastive learning from unpaired medical images and text, arXiv preprint arXiv: 2210.10163, 2022.
[76]
W. L. Chiang, Z. Li, Z. Lin, Y. Sheng, Z. Wu, H. Zhang, L. Zheng, S. Zhuang, Y. Zhuang, J. E. Gonzalez, I. Stoica, and E. P. Xing, Vicuna: An open-source chatbot impressing GPT-4 with 90%* ChatGPT quality, https://lmsys.org/blog/2023-03-30-vicuna, 2023.
[77]
P. Li, G. Liu, J. He, Z. Zhao, and S. Zhong, Masked vision and language pre-training with unimodal and multimodal contrastive losses for medical visual question answering, in Proc. 26th Int. Conf. Medical Image Computing and Computer Assisted Intervention, Vancouver, Canada, 2023, pp. 374–383.
[78]
Y. Liu, Z. Wang, D. Xu, and L. Zhou, Q2ATransformer: Improving medical VQA via an answer querying decoder, in Proc. 28th Int. Conf. Information Processing in Medical Imaging, San Carlos de Bariloche, Argentina, 2023, pp. 445–456.
[79]
S. Gunasekar, Y. Zhang, J. Aneja, C. C. T. Mendes, A. Del Giorno, S. Gopi, M. Javaheripi, P. Kauffmann, G. de Rosa, O. Saarikivi, et al., Textbooks are all you need, arXiv preprint arXiv: 2306.11644, 2023.
[80]
A. Mitra, L. Del Corro, G. Zheng, S. Mahajan, D. Rouhana, A. Codas, Y. Lu, W. G. Chen, O. Vrousgos, C. Rosset, et al., AgentInstruct: Toward generative teaching with agentic flows, arXiv preprint arXiv: 2407.03502, 2024.
[81]
P. Wang, N. Zhang, B. Tian, Z. Xi, Y. Yao, Z. Xu, M. Wang, S. Mao, X. Wang, S. Cheng, et al., EasyEdit: An easy-to-use knowledge editing framework for large language models, arXiv preprint arXiv: 2308.07269, 2024.
[82]
S. Wang, Y. Zhu, H. Liu, Z. Zheng, C. Chen, and J. Li, Knowledge editing for large language models: A survey, arXiv preprint arXiv: 2310.16218, 2024.
[83]
Y. Zhang, Y. Li, L. Cui, D. Cai, L. Liu, T. Fu, X. Huang, E. Zhao, Y. Zhang, Y. Chen, et al., Siren’s song in the AI ocean: A survey on hallucination in large language models, arXiv preprint arXiv: 2309.01219, 2023.
[84]
M. Nasr, N. Carlini, J. Hayase, M. Jagielski, A. F. Cooper, D. Ippolito, C. A. Choquette-Choo, E. Wallace, F. Tramèr, and K. Lee, Scalable extraction of training data from (production) language models, arXiv preprint arXiv: 2311.17035, 2023.
[85]
J. Konečný, H. B. McMahan, D. Ramage, and P. Richtárik, Federated optimization: Distributed machine learning for on-device intelligence, arXiv preprint arXiv: 1610.02527, 2016.
[87]
L. Qin, Q. Chen, F. Wei, S. Huang, and W. Che, Cross-lingual prompting: Improving zero-shot chain-of-thought reasoning across languages, arXiv preprint arXiv: 2310.14799, 2023.
[88]
D. Yoon, J. Jang, S. Kim, S. Kim, S. Shafayat, and M. Seo, LangBridge: Multilingual reasoning without multilingual supervision, arXiv preprint arXiv: 2401.10695, 2024.
[89]
X. Tang, A. Zou, Z. Zhang, Z. Li, Y. Zhao, X. Zhang, A. Cohan, and M. Gerstein, MedAgents: Large language models as collaborators for zero-shot medical reasoning, arXiv preprint arXiv: 2311.10537, 2024.
[90]
J. Wei, X. Wang, D. Schuurmans, M. Bosma, B. Ichter, F. Xia, E. H. Chi, Q. V. Le, and D. Zhou, Chain-of-thought prompting elicits reasoning in large language models, in Proc. 36th Int. Conf. Neural Information Processing Systems, New Orleans, LA, USA, 2022, pp. 24824–24837.
[91]
Z. Zhang, A. Zhang, M. Li, and A. Smola, Automatic chain of thought prompting in large language models, arXiv preprint arXiv: 2210.03493, 2022.
[92]
W. Chen, X. Ma, X. Wang, and W. W. Cohen, Program of thoughts prompting: Disentangling computation from reasoning for numerical reasoning tasks, arXiv preprint arXiv: 2211.12588, 2023.
[93]
J. Long, Large language model guided tree-of-thought, arXiv preprint arXiv: 2305.08291, 2023.
[94]
Y. Zhang, Q. Chen, J. Zhou, P. Wang, J. Si, J. Wang, W. Lu, and L. Qin, Wrong-of-thought: An integrated reasoning framework with multi-perspective verification and wrong information, in Proc. Findings of the Association for Computational Linguistics: EMNLP 2024, Miami, FL, USA, 2024, pp. 6644–6653.
[96]
S. Wu, H. Fei, L. Qu, W. Ji, and T. S. Chua, NExT-GPT: Any-to-any multimodal LLM, in Proc. 41st Int. Conf. Machine Learning, Vienna, Austria, 2024, pp. 53366–53397.
[106]
S. P. Singh, L. Wang, S. Gupta, H. Goli, P. Padmanabhan, and B. Gulyás, 3D deep learning on medical images: A review, Sensors, vol. 20, no. 18, p. 5097, 2020.
[109]
K. Papineni, S. Roukos, T. Ward, and W. J. Zhu, BLEU: A method for automatic evaluation of machine translation, in Proc. 40th Annu. Meeting of the Association for Computational Linguistics, Philadelphia, PA, USA, 2002, pp. 311–318.
[110]
C. Y. Lin, ROUGE: A package for automatic evaluation of summaries, in Proc. Text Summarization Branches Out, Barcelona, Spain, 2004, pp. 74–81.
[111]
R. Vedantam, C. L. Zitnick, and D. Parikh, CIDEr: Consensus-based image description evaluation, in Proc. IEEE Conf. Computer Vision and Pattern Recognition, Boston, MA, USA, 2015, pp. 4566–4575.
[112]
S. Banerjee and A. Lavie, METEOR: An automatic metric for MT evaluation with improved correlation with human judgments, in Proc. ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, Ann Arbor, MI, USA, 2005, pp. 65–72.