
TV-SAM: Increasing Zero-Shot Segmentation Performance on Multimodal Medical Images Using GPT-4 Generated Descriptive Prompts Without Human Annotation

College of Computer Science, Sichuan University, Chengdu 610000, China
West China Biomedical Big Data Center, West China Hospital, Sichuan University, Chengdu 610000, China
School of Engineering, Case Western Reserve University, Cleveland, OH 44106, USA
School of Artificial Intelligence, Beijing University of Posts and Telecommunications, Beijing 100876, China
Urgench State University, Urgench 220100, Uzbekistan
College of Animal Science and Technology, Gansu Agricultural University, Lanzhou 730000, China
College of Information Science and Technology, Gansu Agricultural University, Lanzhou 730000, China

Abstract

This study presents the text-visual-prompt segment anything model (TV-SAM), a novel zero-shot segmentation algorithm for multimodal medical images that requires no manual annotation. TV-SAM integrates the large language model GPT-4, the vision-language model GLIP, and SAM to autonomously generate descriptive text prompts and visual bounding-box prompts from medical images, thereby enhancing SAM's zero-shot segmentation capability. Comprehensive evaluations on seven public datasets spanning eight imaging modalities demonstrate that TV-SAM can segment unseen targets across modalities without additional training. TV-SAM significantly outperforms SAM AUTO (p < 0.01) and GSAM (p < 0.05), closely matches SAM BBOX with gold-standard bounding-box prompts (p = 0.07), and surpasses state-of-the-art methods on specific datasets such as ISIC (0.853 versus 0.802) and WBC (0.968 versus 0.883). These results indicate that TV-SAM is an effective zero-shot segmentation algorithm for multimodal medical images and highlight the substantial contribution of GPT-4 to zero-shot segmentation. Integrating foundation models such as GPT-4, GLIP, and SAM can enhance the ability to address complex problems in specialized domains.
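The abstract describes a three-stage prompt cascade: GPT-4 produces a descriptive text prompt for the segmentation target, GLIP grounds that description into bounding boxes, and SAM converts the boxes into masks. The Python sketch below illustrates only this control flow; describe_target, ground_description, and segment_with_boxes are hypothetical placeholders for the GPT-4, GLIP, and SAM calls, not functions from the authors' implementation.

from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)

def describe_target(image_path: str, modality: str) -> str:
    # Stage 1 (assumed): query GPT-4 for a descriptive text prompt of the
    # target, e.g. "an irregular dark-brown skin lesion".
    raise NotImplementedError("call a GPT-4 endpoint here")

def ground_description(image_path: str, description: str) -> List[Box]:
    # Stage 2 (assumed): GLIP grounds the description into candidate boxes.
    raise NotImplementedError("run GLIP open-vocabulary detection here")

def segment_with_boxes(image_path: str, boxes: List[Box]):
    # Stage 3 (assumed): SAM predicts masks from the box prompts.
    raise NotImplementedError("run SAM with box prompts here")

def tv_sam(image_path: str, modality: str):
    # Zero-shot pipeline without manual annotation:
    # text prompt -> visual (box) prompt -> segmentation mask.
    description = describe_target(image_path, modality)
    boxes = ground_description(image_path, description)
    return segment_with_boxes(image_path, boxes)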


Big Data Mining and Analytics
Pages 1199-1211
Cite this article:
Jiang Z, Cheng D, Qin Z, et al. TV-SAM: Increasing Zero-Shot Segmentation Performance on Multimodal Medical Images Using GPT-4 Generated Descriptive Prompts Without Human Annotation. Big Data Mining and Analytics, 2024, 7(4): 1199-1211. https://doi.org/10.26599/BDMA.2024.9020058