College of Computer Science, Nankai University, Tianjin 300350, China
Shanghai Artificial Intelligence Laboratory, Shanghai 200232, China
Computer Vision Lab, Inception Institute of Artificial Intelligence, Abu Dhabi, United Arab Emirates
Institute of High Performance Computing, Agency for Science, Technology and Research, Singapore 138632, Singapore
UCAS-Terminus AI Lab, Terminus Group, Chongqing 400042, China
Abstract
Most polyp segmentation methods use convolutional neural networks (CNNs) as their backbone, leading to two key issues when exchanging information between the encoder and decoder: (1) how to account for the differing contributions of features at different levels, and (2) how to design an effective mechanism for fusing these features. Unlike existing CNN-based methods, we adopt a transformer encoder, which learns more powerful and robust representations. In addition, considering the image acquisition influence and elusive properties of polyps, we introduce three standard modules, including a cascaded fusion module (CFM), a camouflage identification module (CIM), and a similarity aggregation module (SAM). Among these, the CFM is used to collect the semantic and location information of polyps from high-level features; the CIM is applied to capture polyp information disguised in low-level features; and the SAM extends the pixel features of the polyp area with high-level semantic position information to the entire polyp area, thereby effectively fusing cross-level features. The proposed model, named Polyp-PVT, effectively suppresses noise in the features and significantly improves their expressive capabilities. Extensive experiments on five widely adopted datasets show that the proposed model is more robust to various challenging situations (e.g., appearance changes, small objects, and rotation) than existing representative methods. The proposed model is available at https://github.com/DengPingFan/Polyp-PVT.
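The encoder–decoder pipeline described in the abstract can be sketched as follows. This is a minimal NumPy illustration of the data flow only: the strided "encoder", the sigmoid gate, and the resize-and-multiply fusion are simplifying assumptions standing in for the PVT stages, CIM, CFM, and SAM, and the function name `polyp_pvt_sketch` is ours, not from the released repository.

```python
import numpy as np

def polyp_pvt_sketch(image):
    """Illustrative Polyp-PVT data flow (shapes only; not the real code).

    A pyramid encoder yields four feature maps at 1/4, 1/8, 1/16, and
    1/32 resolution. The CIM refines the low-level map, the CFM fuses
    the three high-level maps, and the SAM combines both branches.
    """
    # Stand-in "encoder": strided subsampling in place of PVT stages.
    feats = [image[::s, ::s] for s in (4, 8, 16, 32)]
    low, highs = feats[0], feats[1:]

    # CIM (sketch): a sigmoid gate suppresses background in low-level cues.
    gate = 1.0 / (1.0 + np.exp(-low))
    low = low * gate

    # CFM (sketch): bring the high-level maps to one size and accumulate.
    target = highs[0].shape
    cfm = sum(np.resize(f, target) for f in highs)

    # SAM (sketch): spread the high-level semantic cue over the
    # low-level map to cover the entire polyp region.
    return low * np.resize(cfm, low.shape)

out = polyp_pvt_sketch(np.zeros((352, 352)))  # coarse map at 1/4 scale
```

For the actual module definitions (attention weights, graph convolution, learned upsampling), see the linked repository.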
References
M. Fiori, P. Musé, and G. Sapiro, A complete system for candidate polyps detection in virtual colonoscopy, Int. J. Patt. Recogn. Artif. Intell., vol. 28, no. 7, p. 1460014, 2014.
A. V. Mamonov, I. N. Figueiredo, P. N. Figueiredo, and Y. H. Richard Tsai, Automated polyp detection in colon capsule endoscopy, IEEE Trans. Med. Imag., vol. 33, no. 7, pp. 1488–1502, 2014.
O. H. Maghsoudi, Superpixel based segmentation and classification of polyps in wireless capsule endoscopy, in Proc. 2017 IEEE Signal Processing in Medicine and Biology Symposium (SPMB), Philadelphia, PA, USA, 2017, pp. 1–4.
O. Ronneberger, P. Fischer, and T. Brox, U-net: Convolutional networks for biomedical image segmentation, in Proc. 18th Int. Conf. Medical Image Computing and Computer Assisted Intervention, Munich, Germany, 2015, pp. 234–241.
D. P. Fan, G. P. Ji, T. Zhou, G. Chen, H. Fu, J. Shen, and L. Shao, PraNet: Parallel reverse attention network for polyp segmentation, in Proc. 23rd Int. Conf. Medical Image Computing and Computer Assisted Intervention, Lima, Peru, 2020, pp. 263–273.
X. Guo, C. Yang, Y. Liu, and Y. Yuan, Learn to threshold: ThresholdNet with confidence-guided manifold mixup for polyp segmentation, IEEE Trans. Med. Imag., vol. 40, no. 4, pp. 1134–1146, 2021.
J. Wei, Y. Hu, R. Zhang, Z. Li, S. K. Zhou, and S. Cui, Shallow attention network for polyp segmentation, arXiv preprint arXiv: 2108.00882, 2021.
J. Bernal, F. J. Sánchez, G. Fernández-Esparrach, D. Gil, C. Rodríguez, and F. Vilariño, WM-DOVA maps for accurate polyp highlighting in colonoscopy: Validation vs. saliency maps from physicians, Comput. Med. Imag. Graph., vol. 43, pp. 99–111, 2015.
J. Silva, A. Histace, O. Romain, X. Dray, and B. Granado, Toward embedded detection of polyps in WCE images for early diagnosis of colorectal cancer, Int. J. Comput. Assist. Radiol. Surg., vol. 9, no. 2, pp. 283–293, 2014.
N. Tajbakhsh, S. R. Gurudu, and J. Liang, Automated polyp detection in colonoscopy videos using shape and context information, IEEE Trans. Med. Imag., vol. 35, no. 2, pp. 630–644, 2016.
D. P. Fan, G. P. Ji, M. M. Cheng, and L. Shao, Concealed object detection, in Proc. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 2020, pp. 2774–2784.
D. Jha, P. H. Smedsrud, M. A. Riegler, P. Halvorsen, T. de Lange, D. Johansen, and H. D. Johansen, Kvasir-SEG: A segmented polyp dataset, in Proc. 26th Int. Conf. Multimedia Modeling, Daejeon, Korea, 2020, pp. 451–462.
D. Vázquez, J. Bernal, F. J. Sánchez, G. Fernández-Esparrach, A. M. López, A. Romero, M. Drozdzal, and A. Courville, A benchmark for endoluminal scene segmentation of colonoscopy images, J. Healthc. Eng., vol. 2017, pp. 1–9, 2017.
T. Rahim, M. A. Usman, and S. Y. Shin, A survey on contemporary computer-aided tumor, polyp, and ulcer detection methods in wireless capsule endoscopy imaging, Comput. Med. Imag. Graph., vol. 85, p. 101767, 2020.
K. Simonyan and A. Zisserman, Very deep convolutional networks for large-scale image recognition, in Proc. 3rd International Conference on Learning Representations, San Diego, CA, USA, 2015.
X. Li, W. Wang, X. Hu, and J. Yang, Selective kernel networks, in Proc. 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 2019, pp. 510–519.
W. Wang, X. Li, J. Yang, and T. Lu, Mixed link networks, in Proc. 27th Int. Joint Conf. Artificial Intelligence, Stockholm, Sweden, 2018, pp. 2819–2825.
J. Long, E. Shelhamer, and T. Darrell, Fully convolutional networks for semantic segmentation, in Proc. 2015 IEEE Conf. Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 2015, pp. 3431–3440.
L. Cai, M. Wu, L. Chen, W. Bai, M. Yang, S. Lyu, and Q. Zhao, Using guided self-attention with local information for polyp segmentation, in Proc. 25th Int. Conf. Medical Image Computing and Computer Assisted Intervention, Singapore, 2022, pp. 629–638.
N. K. Tomar, D. Jha, U. Bagci, and S. Ali, TGANet: Text-guided attention for improved polyp segmentation, in Proc. 25th Int. Conf. Medical Image Computing and Computer Assisted Intervention, Singapore, 2022, pp. 151–160.
R. Zhang, P. Lai, X. Wan, D. J. Fan, F. Gao, X. J. Wu, and G. Li, Lesion-aware dynamic kernel for polyp segmentation, in Proc. 25th Int. Conf. Medical Image Computing and Computer Assisted Intervention, Singapore, 2022, pp. 99–109.
J. H. Shi, Q. Zhang, Y. H. Tang, and Z. Q. Zhang, Polyp-mixer: An efficient context-aware MLP-based paradigm for polyp segmentation, IEEE Trans. Circuits Syst. Video Technol., vol. 33, no. 1, pp. 30–42, 2023.
X. Zhao, Z. Wu, S. Tan, D. J. Fan, Z. Li, X. Wan, and G. Li, Semi-supervised spatial temporal attention network for video polyp segmentation, in Proc. 25th Int. Conf. Medical Image Computing and Computer Assisted Intervention, Singapore, 2022, pp. 456–466.
M. Akbari, M. Mohrekesh, E. Nasr-Esfahani, S. M. Reza Soroushmehr, N. Karimi, S. Samavi, and K. Najarian, Polyp segmentation in colonoscopy images using fully convolutional network, in Proc. 2018 40th Annual Int. Conf. IEEE Engineering in Medicine and Biology Society (EMBC), Honolulu, HI, USA, 2018, pp. 69–72.
P. Brandao, O. Zisimopoulos, E. Mazomenos, G. Ciuti, J. Bernal, M. Visentini-Scarzanella, A. Menciassi, P. Dario, A. Koulaouzidis, A. Arezzo, et al., Towards a computed-aided diagnosis system in colonoscopy: Automatic polyp segmentation using convolution neural networks, J. Med. Robot. Res., vol. 3, no. 2, p. 1840002, 2018.
Z. Zhou, M. M. R. Siddiquee, N. Tajbakhsh, and J. Liang, UNet++: A nested U-net architecture for medical image segmentation, in Proc. 4th International Workshop, DLMIA 2018, and 8th International Workshop, ML-CDS 2018, held in conjunction with MICCAI 2018, Granada, Spain, 2018, pp. 3–11.
D. Jha, P. H. Smedsrud, M. A. Riegler, D. Johansen, T. De Lange, P. Halvorsen, and H. D. Johansen, ResUNet++: An advanced architecture for medical image segmentation, in Proc. 2019 IEEE Int. Symp. on Multimedia (ISM), San Diego, CA, USA, 2019, pp. 225–230.
X. Sun, P. Zhang, D. Wang, Y. Cao, and B. Liu, Colorectal polyp segmentation by U-net with dilation convolution, in Proc. 2019 18th IEEE Int. Conf. Machine Learning and Applications (ICMLA), Boca Raton, FL, USA, 2019, pp. 851–858.
B. Murugesan, K. Sarveswaran, S. M. Shankaranarayana, K. Ram, J. Joseph, and M. Sivaprakasam, Psi-Net: Shape and boundary aware joint multi-task deep network for medical image segmentation, in Proc. 2019 41st Annual Int. Conf. IEEE Engineering in Medicine and Biology Society (EMBC), Berlin, Germany, 2019, pp. 7223–7226.
H. Ali Qadir, Y. Shin, J. Solhusvik, J. Bergsland, L. Aabakken, and I. Balasingham, Polyp detection and segmentation using mask R-CNN: Does a deeper feature extractor CNN always perform better? in Proc. 2019 13th Int. Symp. on Medical Information and Communication Technology (ISMICT), Oslo, Norway, 2019, pp. 1–6.
K. He, G. Gkioxari, P. Dollár, and R. Girshick, Mask R-CNN, in Proc. 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 2017, pp. 2980–2988.
S. Alam, N. K. Tomar, A. Thakur, D. Jha, and A. Rauniyar, Automatic polyp segmentation using U-net-ResNet50, in Proc. MediaEval 2020 Workshop, virtual, 2020.
D. Banik, K. Roy, D. Bhattacharjee, M. Nasipuri, and O. Krejcar, Polyp-net: A multimodel fusion network for polyp segmentation, IEEE Trans. Instrum. Meas., vol. 70, pp. 1–12, 2021.
T. Rahim, S. Ali Hassan, and S. Y. Shin, A deep convolutional neural network for the detection of polyps in colonoscopy images, Biomed. Signal Process. Contr., vol. 68, p. 102654, 2021.
D. Jha, S. Ali, N. K. Tomar, H. D. Johansen, D. Johansen, J. Rittscher, M. A. Riegler, and P. Halvorsen, Real-time polyp detection, localization and segmentation in colonoscopy using deep learning, IEEE Access, vol. 9, pp. 40496–40510, 2021.
A. M. A. Ahmed, Generative adversarial networks for automatic polyp segmentation, in Proc. MediaEval 2020 Workshop, virtual, 2020.
V. Thambawita, S. Hicks, P. Halvorsen, and M. A. Riegler, Pyramid-focus-augmentation: Medical image segmentation with step-wise focus, in Proc. MediaEval 2020 Workshop, virtual, 2020.
N. K. Tomar, D. Jha, S. Ali, H. D. Johansen, D. Johansen, M. A. Riegler, and P. Halvorsen, DDANet: Dual decoder attention network for automatic polyp segmentation, in Proc. 2021 Int. Conf. Pattern Recognition, virtual, 2021, pp. 307–314.
C. H. Huang, H. Y. Wu, and Y. L. Lin, HarDNet-MSEG: A simple encoder-decoder polyp segmentation neural network that achieves over 0.9 mean dice and 86 FPS, arXiv preprint arXiv: 2101.07172, 2021.
P. Chao, C. Y. Kao, Y. Ruan, C. H. Huang, and Y. L. Lin, HarDNet: A low memory traffic network, in Proc. 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 2019, pp. 3551–3560.
Y. Zhang, H. Liu, and Q. Hu, Transfuse: Fusing transformers and CNNs for medical image segmentation, in Proc. 24th Int. Conf. Medical Image Computing and Computer Assisted Intervention, Strasbourg, France, 2021, pp. 14–24.
Z. Yin, K. Liang, Z. Ma, and J. Guo, Duplex contextual relation network for polyp segmentation, in Proc. 2022 IEEE 19th Int. Symp. on Biomedical Imaging (ISBI), Kolkata, India, 2022, pp. 1–5.
X. Zhao, L. Zhang, and H. Lu, Automatic polyp segmentation via multi-scale subtraction network, in Proc. 24th Int. Conf. Medical Image Computing and Computer Assisted Intervention, Strasbourg, France, 2021, pp. 120–130.
Z. Zhou, J. Shin, L. Zhang, S. Gurudu, M. Gotway, and J. Liang, Fine-tuning convolutional neural networks for biomedical image analysis: Actively and incrementally, in Proc. 2017 IEEE Conf. Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 2017, pp. 4761–4772.
N. Tajbakhsh, J. Y. Shin, S. R. Gurudu, R. T. Hurst, C. B. Kendall, M. B. Gotway, and J. Liang, Convolutional neural networks for medical image analysis: Full training or fine tuning? IEEE Trans. Med. Imaging, vol. 35, no. 5, pp. 1299–1312, 2016.
X. Xie, J. Chen, Y. Li, L. Shen, K. Ma, and Y. Zheng, MI2GAN: Generative adversarial network for medical image domain adaptation using mutual information constraint, in Proc. 23rd Int. Conf. Medical Image Computing and Computer Assisted Intervention, Lima, Peru, 2020, pp. 516–525.
R. Zhang, G. Li, Z. Li, S. Cui, D. Qian, and Y. Yu, Adaptive context selection for polyp segmentation, in Proc. 23rd Int. Conf. Medical Image Computing and Computer Assisted Intervention, Lima, Peru, 2020, pp. 253–262.
N. K. Tomar, Automatic polyp segmentation using fully convolutional neural network, in Proc. MediaEval 2020 Workshop, virtual, 2020.
D. Jha, S. Hicks, K. Emanuelsen, H. D. Johansen, D. Johansen, T. Lange, M. Riegler, and P. Halvorsen, Medico multimedia task at MediaEval 2020: Automatic polyp segmentation, in Proc. MediaEval 2020 Workshop, virtual, 2020.
K. Patel, A. M. Bur, and G. Wang, Enhanced U-net: A feature enhancement network for polyp segmentation, in Proc. 2021 18th Conf. Robots and Vision (CRV), Burnaby, Canada, 2021, pp. 181–188.
A. Lumini, L. Nanni, and G. Maguolo, Deep ensembles based on stochastic activation selection for polyp segmentation, in Proc. 2021 Medical Imaging with Deep Learning, Lübeck, Germany, 2021.
M. V. L. Branch and A. S. Carvalho, Polyp segmentation in colonoscopy images using U-net-MobileNetV2, arXiv preprint arXiv: 2103.15715, 2021.
R. Khadga, D. Jha, S. Ali, S. Hicks, V. Thambawita, M. A. Riegler, and P. Halvorsen, Meta-learning with implicit gradients in a few-shot setting for medical image segmentation, arXiv preprint arXiv: 2106.03223, 2021.
D. V. Sang, T. Q. Chung, P. N. Lan, D. V. Hang, D. V. Long, and N. T. Thuy, Ag-CUResNeSt: A novel method for colon polyp segmentation, arXiv preprint arXiv: 2105.00402, 2021.
C. Yang, X. Guo, M. Zhu, B. Ibragimov, and Y. Yuan, Mutual-prototype adaptation for cross-domain polyp segmentation, IEEE J. Biomed. Health Inform., vol. 25, no. 10, pp. 3886–3897, 2021.
D. Jha, P. H. Smedsrud, D. Johansen, T. de Lange, H. D. Johansen, P. Halvorsen, and M. A. Riegler, A comprehensive study on colorectal polyp segmentation with ResUNet++, conditional random field and test-time augmentation, IEEE J. Biomed. Health Inform., vol. 25, no. 6, pp. 2029–2040, 2021.
D. Jha, N. K. Tomar, S. Ali, M. A. Riegler, H. D. Johansen, D. Johansen, T. de Lange, and P. Halvorsen, NanoNet: Real-time polyp segmentation in video capsule endoscopy and colonoscopy, in Proc. 2021 IEEE 34th Int. Symp. on Computer-Based Medical Systems (CBMS), Aveiro, Portugal, 2021, pp. 37–43.
S. Li, X. Sui, X. Luo, X. Xu, Y. Liu, and R. Goh, Medical image segmentation using squeeze-and-expansion transformers, in Proc. 30th Int. Joint Conf. Artificial Intelligence, virtual, 2021, pp. 807–815.
T. Kim, H. Lee, and D. Kim, UACANet: Uncertainty augmented context attention for polyp segmentation, in Proc. 29th ACM Int. Conf. Multimedia, virtual, 2021, pp. 2167–2175.
V. L. Thambawita, S. Hicks, P. Halvorsen, and M. Riegler, DivergentNets: Medical image segmentation by network ensemble, in Proc. 3rd Int. Workshop and Challenge on Computer Vision in Endoscopy (EndoCV2021) in conjunction with the 18th IEEE Int. Symp. Biomedical Imaging (ISBI2021), Nice, France, 2021, pp. 27–38.
X. Guo, C. Yang, and Y. Yuan, Dynamic-weighting hierarchical segmentation network for medical images, Med. Image Anal., vol. 73, p. 102196, 2021.
G. P. Ji, Y. C. Chou, D. P. Fan, G. Chen, H. Fu, D. Jha, and L. Shao, Progressively normalized self-attention network for video polyp segmentation, in Proc. 24th Int. Conf. Medical Image Computing and Computer Assisted Intervention, Strasbourg, France, 2021, pp. 142–152.
A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, Attention is all you need, in Proc. 31st Conf. Neural Information Processing Systems, Long Beach, CA, USA, 2017, pp. 5998–6008.
A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, et al., An image is worth 16×16 words: Transformers for image recognition at scale, in Proc. 9th Int. Conf. Learning Representations, Vienna, Austria, 2021.
Z. Pan, B. Zhuang, J. Liu, H. He, and J. Cai, Scalable vision transformers with hierarchical pooling, in Proc. 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, Canada, 2021, pp. 367–376.
B. Heo, S. Yun, D. Han, S. Chun, J. Choe, and S. J. Oh, Rethinking spatial dimensions of vision transformers, in Proc. 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, Canada, 2021, pp. 11916–11925.
L. Yuan, Y. Chen, T. Wang, W. Yu, Y. Shi, Z. Jiang, F. E. H. Tay, J. Feng, and S. Yan, Tokens-to-token ViT: Training vision transformers from scratch on ImageNet, in Proc. 2021 IEEE/CVF Int. Conf. Computer Vision (ICCV), Montreal, Canada, 2021, pp. 538–547.
K. Han, A. Xiao, E. Wu, J. Guo, C. Xu, and Y. Wang, Transformer in transformer, in Proc. 35th Conf. Neural Information Processing Systems, virtual, 2021, pp. 15908–15919.
W. Wang, E. Xie, X. Li, D. P. Fan, K. Song, D. Liang, T. Lu, P. Luo, and L. Shao, Pyramid vision transformer: A versatile backbone for dense prediction without convolutions, in Proc. 2021 IEEE/CVF Int. Conf. Computer Vision (ICCV), Montreal, Canada, 2021, pp. 548–558.
W. Wang, E. Xie, X. Li, D. P. Fan, K. Song, D. Liang, T. Lu, P. Luo, and L. Shao, PVT v2: Improved baselines with pyramid vision transformer, Computational Visual Media, vol. 8, no. 3, pp. 415–424, 2022.
Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo, Swin transformer: Hierarchical vision transformer using shifted windows, in Proc. 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, Canada, 2021, pp. 9992–10002.
H. Wu, B. Xiao, N. Codella, M. Liu, X. Dai, L. Yuan, and L. Zhang, CvT: Introducing convolutions to vision transformers, in Proc. 2021 IEEE/CVF Int. Conf. Computer Vision (ICCV), Montreal, Canada, 2021, pp. 22–31.
W. Xu, Y. Xu, T. Chang, and Z. Tu, Co-scale conv-attentional image transformers, in Proc. 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, Canada, 2021, pp. 9961–9970.
X. Chu, Z. Tian, Y. Wang, B. Zhang, H. Ren, X. Wei, H. Xia, and C. Shen, Twins: Revisiting the design of spatial attention in vision transformers, in Proc. 34th Conf. Neural Information Processing Systems, virtual, 2021, pp. 9355–9366.
B. Graham, A. El-Nouby, H. Touvron, P. Stock, A. Joulin, H. Jegou, and M. Douze, LeViT: A vision transformer in ConvNet's clothing for faster inference, in Proc. 2021 IEEE/CVF Int. Conf. Computer Vision (ICCV), Montreal, Canada, 2021, pp. 12239–12249.
S. Bhojanapalli, A. Chakrabarti, D. Glasner, D. Li, T. Unterthiner, and A. Veit, Understanding robustness of transformers for image classification, in Proc. 2021 IEEE/CVF Int. Conf. Computer Vision (ICCV), Montreal, Canada, 2021, pp. 10211–10221.
E. Xie, W. Wang, Z. Yu, A. Anandkumar, J. M. Alvarez, and P. Luo, SegFormer: Simple and efficient design for semantic segmentation with transformers, in Proc. 34th Conf. Neural Information Processing Systems, virtual, 2021, pp. 12077–12090.
Z. Wu, L. Su, and Q. Huang, Cascaded partial decoder for fast and accurate salient object detection, in Proc. 2019 IEEE/CVF Conf. Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 2019, pp. 3902–3911.
S. Ioffe and C. Szegedy, Batch normalization: Accelerating deep network training by reducing internal covariate shift, in Proc. 32nd Int. Conf. Machine Learning (ICML), Lille, France, 2015, pp. 448–456.
X. Glorot, A. Bordes, and Y. Bengio, Deep sparse rectifier neural networks, in Proc. 14th Int. Conf. Artificial Intelligence and Statistics, Fort Lauderdale, FL, USA, 2011, pp. 315–323.
S. Woo, J. Park, J. Y. Lee, and I. S. Kweon, CBAM: Convolutional block attention module, in Proc. 15th European Conf. Computer Vision, Munich, Germany, 2018, pp. 3–19.
J. Hu, L. Shen, and G. Sun, Squeeze-and-excitation networks, in Proc. 2018 IEEE Conf. Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 2018, pp. 7132–7141.
X. Wang, R. B. Girshick, A. Gupta, and K. He, Non-local neural networks, in Proc. 2018 IEEE Conf. Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 2018, pp. 7794–7803.
G. Te, Y. Liu, W. Hu, H. Shi, and T. Mei, Edge-aware graph representation learning and reasoning for face parsing, in Proc. 16th European Conf. Computer Vision, Glasgow, UK, 2020, pp. 258–274.
Y. Lu, Y. Chen, D. Zhao, and J. Chen, Graph-FCN for image semantic segmentation, in Proc. 16th Int. Symp. Neural Networks, Moscow, Russia, 2019, pp. 97–105.
J. Wei, S. Wang, and Q. Huang, F³Net: Fusion, feedback and focus for salient object detection, in Proc. 34th AAAI Conf. Artificial Intelligence (2020), 32nd Innovative Applications of Artificial Intelligence Conf. (IAAI), 10th AAAI Symp. Educational Advances in Artificial Intelligence (EAAI), New York, NY, USA, 2020, pp. 12321–12328.
I. Loshchilov and F. Hutter, Decoupled weight decay regularization, in Proc. 7th Int. Conf. Learning Representations (ICLR), New Orleans, LA, USA, 2019.
F. Milletari, N. Navab, and S. A. Ahmadi, V-net: Fully convolutional neural networks for volumetric medical image segmentation, in Proc. 2016 Fourth Int. Conf. 3D Vision (3DV), Stanford, CA, USA, 2016, pp. 565–571.
R. Margolin, L. Zelnik-Manor, and A. Tal, How to evaluate foreground maps, in Proc. 2014 IEEE Conf. Computer Vision and Pattern Recognition, Columbus, OH, USA, 2014, pp. 248–255.
D. P. Fan, G. P. Ji, X. B. Qin, and M. M. Cheng, Cognitive vision inspired object segmentation metric and loss function, (in Chinese), SCIENTIA SINICA Informat., vol. 51, no. 9, pp. 1475–1489, 2021.
D. P. Fan, C. Gong, Y. Cao, B. Ren, M. M. Cheng, and A. Borji, Enhanced-alignment measure for binary foreground map evaluation, in Proc. 27th Int. Joint Conf. Artificial Intelligence, Stockholm, Sweden, 2018, pp. 698–704.
Y. Fang, C. Chen, Y. Yuan, and K.-Y. Tong, Selective feature aggregation network with area-boundary constraints for polyp segmentation, in Proc. 22nd Int. Conf. Medical Image Computing and Computer Assisted Intervention, Shenzhen, China, 2019, pp. 302–310.
G. P. Ji, G. Xiao, Y. C. Chou, D. P. Fan, K. Zhao, G. Chen, and L. Van Gool, Video polyp segmentation: A deep learning perspective, Mach. Intell. Res., vol. 19, no. 6, pp. 531–549, 2022.
J. Bernal, J. Sánchez, and F. Vilariño, Towards automatic polyp detection with a polyp appearance model, Pattern Recognit., vol. 45, no. 9, pp. 3166–3182, 2012.
The articles published in this open access journal are distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/).
Segmentation examples of our model and SANet[7] with different challenging cases, e.g., camouflage (1st and 2nd rows) and image acquisition influence (3rd row). The images from top to bottom are from ClinicDB[8], ETIS[9], and ColonDB[10], which show that our model has better generalization ability.
Framework of our Polyp-PVT, which consists of (a) a pyramid vision transformer (PVT) as the encoder network, (b) a cascaded fusion module (CFM) for fusing the high-level features, (c) a camouflage identification module (CIM) for capturing polyp cues from the low-level features, and (d) a similarity aggregation module (SAM) for integrating the high- and low-level features into the final output.
Details of the introduced SAM. It is composed of a GCN and a non-local block, which extend the pixel features of polyp regions carrying high-level semantic location cues to the entire polyp region.
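The non-local part of the SAM can be illustrated with a few lines of NumPy: every pixel attends to every other pixel, so a strong high-level cue at one polyp pixel propagates to similar-looking pixels across the image. This is a generic non-local (self-attention) sketch under our own simplifications, not the paper's exact SAM (which additionally uses a GCN and learned projections).

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def non_local(x):
    """Non-local block (sketch): each of the N pixels attends to all others.

    x: (N, C) array of pixel features. The output of each pixel is a
    similarity-weighted mixture of all pixel features, so cues spread
    from strongly activated polyp pixels to similar pixels elsewhere.
    """
    attn = softmax(x @ x.T / np.sqrt(x.shape[1]))  # (N, N) similarities
    return attn @ x                                # propagated features

# Pixels 0 and 1 look alike (strong channel 0); pixel 2 does not.
feats = np.array([[1.0, 0.0], [1.0, 0.1], [0.0, 1.0]])
out = non_local(feats)
```

After attention, pixel 0 keeps a much stronger channel-0 response than the dissimilar pixel 2, illustrating how the cue stays concentrated on mutually similar (polyp-like) pixels.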
Loss curves under different training parameter settings.
Visualization results compared with current models. Green indicates a correctly predicted polyp region, yellow a missed polyp region, and red an incorrect prediction. As can be seen, the proposed model accurately locates and segments polyps regardless of their size.
Visualization results with the current models.
Evaluation of model generalization ability. We provide the max Dice results on ColonDB and ETIS.
FROC curves of different methods on ColonDB.
Visualization of the ablation study results, converted from the outputs into heat maps. As can be seen, removing any module leads to missed or incorrect detections.
Visualization of the feature map in the CIM module.
Visualization of the P1 and P2 predictions.
Visualization of some failure cases. Green indicates a correctly predicted polyp region, yellow a missed polyp region, and red an incorrect prediction.
A survey on polyp segmentation. (The meanings of the abbreviations are as follows. CL: CVC-CLINIC, EL: ETIS-Larib, C6: CVC-612, AM: ASU-Mayo[46, 47], ES: EndoScene, DB: ColonDB, CV: CVC-VideoClinicDB, C: Colon, ED: Endotect 2020, KS: Kvasir-SEG, KCS: Kvasir Capsule-SEG, PraNet: same as the datasets used in PraNet[5], IS: image segmentation, VS: video segmentation, CF: classification, OD: object detection, Own: private data.)
| No. | Model | Publication | Code | Type | Dataset | Core component | Reference |
|---|---|---|---|---|---|---|---|
| 1 | CSCPD | IJPRAI | − | IS | Own | Adaptive-scale candidate | [1] |
| 2 | APD | TMI | − | IS | Own | Geometrical analysis, binary classifier | [2] |
| 3 | SBCP | SPMB | − | IS | Own | Superpixel | [3] |
| 4 | FCN | EMBC | − | IS | DB | FCN and patch selection | [26] |
| 5 | D-FCN | JMRR | − | IS | CL, EL, AM, and DB | FCN and Shape-from-Shading (SfS) | [27] |
| 6 | UNet++ | DLMIA | PyTorch | IS | AM | Skip pathways and deep supervision | [28] |
| 7 | Psi-Net | EMBC | PyTorch | IS | Endovis | Shape and boundary aware | [31] |
| 8 | Mask R-CNN | ISMICT | − | IS | C6, EL, and DB | Deep feature extractors | [32] |
| 9 | UDC | ICMLA | − | IS | C6 and EL | Dilation convolution | [30] |
| 10 | ThresholdNet | TMI | PyTorch | IS | ES and WCE | Learn to threshold, confidence-guided manifold mixup | [6] |
| 11 | MI2GAN | MICCAI | − | IS | C6 and EL | GAN-based model | [48] |
| 12 | ACSNet | MICCAI | PyTorch | IS | ES and KS | Adaptive context selection | [49] |
| 13 | PraNet | MICCAI | PyTorch | IS | PraNet | Parallel partial decoder attention | [5] |
| 14 | GAN | MediaEval | − | IS | KS | Image-to-image translation | [38] |
| 15 | APS | MediaEval | − | IS | KS | Variants of U-shaped structure | [50] |
| 16 | PFA | MediaEval | PyTorch | IS | KS | Pyramid focus augmentation | [39] |
| 17 | MMT | MediaEval | − | IS | KS | Competition introduction | [51] |
| 18 | U-Net-ResNet50 | MediaEval | − | IS | KS | Variants of U-shaped structure | [34] |
| 19 | Survey | CMIG | − | CF | Own | Classification | [15] |
| 20 | Polyp-Net | TIM | − | IS | DB and CV | Multimodel fusion network | [35] |
| 21 | Deep CNN | BSPC | − | OD | EL | Convolutional neural network | [36] |
| 22 | EU-Net | CRV | PyTorch | IS | PraNet | Semantic information enhancement | [52] |
| 23 | DSAS | MIDL | Matlab | IS | KS | Stochastic activation selection | [53] |
| 24 | U-Net-MobileNetV2 | arXiv | − | IS | KS | Variants of U-shaped structure | [54] |
| 25 | DCRNet | ISBI | PyTorch | IS | ES, KS, and PICCOLO | Within-image and cross-image contextual relations | [44] |
| 26 | MSEG | arXiv | PyTorch | IS | PraNet | HarDNet and partial decoder | [41] |
| 27 | FSSNet | arXiv | − | IS | C6 and KS | Meta-learning | [55] |
| 28 | AG-CUResNeSt | RIVF | − | IS | PraNet | ResNeSt, attention gates | [56] |
| 29 | MPAPS | JBHI | PyTorch | IS | DB, KS, and EL | Mutual-prototype adaptation network | [57] |
| 30 | ResUNet++ | JBHI | PyTorch | IS, VS | PraNet and AM | ResUNet++, CRF and TTA | [58] |
| 31 | NanoNet | CBMS | PyTorch | IS, VS | ED, KS, and KCS | Real-time polyp segmentation | [59] |
| 32 | ColonSegNet | Access | PyTorch | IS | KS | Residual block and SENet | [37] |
| 33 | Segtran | IJCAI | PyTorch | IS | C6 and KS | Transformer | [60] |
| 34 | DDANet | ICPR | PyTorch | IS | KS | Dual decoder attention network | [40] |
| 35 | UACANet | ACM MM | PyTorch | IS | PraNet | Uncertainty augmented context attention network | [61] |
| 36 | DivergentNet | ISBI | PyTorch | IS | EndoCV 2021 | Combine multiple models | [62] |
| 37 | DWHieraSeg | MIA | PyTorch | IS | ES | Dynamic-weighting | [63] |
| 38 | Transfuse | MICCAI | PyTorch | IS | PraNet | Transformer and CNN | [43] |
| 39 | SANet | MICCAI | PyTorch | IS | PraNet | Shallow attention network | [7] |
| 40 | PNS-Net | MICCAI | PyTorch | VS | C6, KS, ES, and AM | Progressively normalized self-attention network | [64] |
Parameter setting during the training stage.
| Optimizer | Learning rate | Multi-scale | Clip | Decay rate | Weight decay | Number of epochs | Input size |
|---|---|---|---|---|---|---|---|
| AdamW | 10^-4 | [0.75, 1, 1.25] | 0.5 | 0.1 | 10^-4 | 100 | 352×352 |
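The training settings above can be collected into a small configuration sketch. The dictionary keys and the `scaled_input_size` helper are illustrative names of our own (the released code is PyTorch; we only mirror the table's values here, e.g., multi-scale training resizes the 352×352 input by a factor of 0.75, 1, or 1.25).

```python
# Training configuration mirroring the table above (illustrative names).
TRAIN_CFG = {
    "optimizer": "AdamW",        # optimizer family from the table
    "lr": 1e-4,                  # learning rate 10^-4
    "weight_decay": 1e-4,        # weight decay 10^-4
    "grad_clip": 0.5,            # gradient clipping threshold
    "lr_decay_rate": 0.1,        # learning-rate decay rate
    "epochs": 100,               # number of training epochs
    "input_size": (352, 352),    # base input resolution
    "multi_scale": [0.75, 1.0, 1.25],  # per-iteration rescale factors
}

def scaled_input_size(cfg, scale):
    """Multi-scale training: rescale the base crop by one of the
    factors in cfg["multi_scale"] before feeding it to the network."""
    h, w = cfg["input_size"]
    return int(h * scale), int(w * scale)

# e.g., scale 1.25 -> (440, 440); scale 0.75 -> (264, 264)
```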
Network parameters of each module. Note that the encoder parameters are identical to those of PVT, without any changes. BasicConv2d and Conv2d take the parameters [in_channel, out_channel, kernel_size, padding], and GCN takes [num_state, num_node].
| Module | Parameter | Value |
|---|---|---|
| Encoder | patch_size | [4] |
| | embed_dims | [64, 128, 320, 512] |
| | num_heads | [1, 2, 5, 8] |
| | mlp_ratios | [8, 8, 4, 4] |
| | depths | [3, 4, 18, 3] |
| | sr_ratios | [8, 4, 2, 1] |
| | drop_rate | [0] |
| | drop_path_rate | [0.1] |
| SAM | AvgPool2d | [6] |
| | Conv2d | [32, 16, 1, 1] |
| | Conv2d | [32, 16, 1, 1] |
| | Conv2d | [16, 32, 1, 1] |
| | GCN | [16, 16] |
| | BasicConv2d | [64, 32, 1, 0] |
| CFM | BasicConv2d | [32, 32, 3, 1] |
| | BasicConv2d | [32, 32, 3, 1] |
| | BasicConv2d | [32, 32, 3, 1] |
| | BasicConv2d | [32, 32, 3, 1] |
| | BasicConv2d | [64, 64, 3, 1] |
| | BasicConv2d | [64, 64, 3, 1] |
| | BasicConv2d | [96, 96, 3, 1] |
| | BasicConv2d | [96, 32, 3, 1] |
| CIM | AvgPool2d | [1] |
| | AvgPool2d | [1] |
| | Conv2d | [64, 4, 1, 0] |
| | ReLU | − |
| | Conv2d | [4, 64, 1, 0] |
| | Sigmoid | − |
| | Conv2d | [2, 1, 7, 3] |
| | Sigmoid | − |
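The CIM rows in the table (average pooling, a 64→4→64 bottleneck with ReLU and Sigmoid, then a 7×7 convolution over two pooled maps) follow the familiar channel-then-spatial attention pattern of CBAM. Below is a NumPy sketch of that pattern under our own simplifications: the 1×1 convolutions are stand-in random projections, and the 7×7 spatial convolution is replaced by a pointwise combination of the channel-max and channel-mean maps.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cim_sketch(x):
    """CBAM-style attention as listed for the CIM (illustrative only).

    x: (C, H, W) low-level feature map with C = 64 channels.
    Channel attention: global average pool -> 64->4 -> ReLU -> 4->64
    -> Sigmoid. Spatial attention: combine channel max/mean maps ->
    Sigmoid (standing in for the 7x7 conv in the table).
    """
    c, h, w = x.shape
    rng = np.random.default_rng(0)  # stand-in for learned 1x1 convs
    w_down = rng.standard_normal((4, c))   # 64 -> 4 bottleneck
    w_up = rng.standard_normal((c, 4))     # 4 -> 64 expansion

    # Channel attention: squeeze, excite, rescale channels.
    squeeze = x.mean(axis=(1, 2))                       # (64,)
    excite = sigmoid(w_up @ np.maximum(w_down @ squeeze, 0))
    x = x * excite[:, None, None]

    # Spatial attention over the two pooled maps.
    sp = sigmoid((x.max(axis=0) + x.mean(axis=0)) / 2)  # (H, W)
    return x * sp[None]

out = cim_sketch(np.ones((64, 8, 8)))
```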
Quantitative results of the test datasets, i.e., Kvasir-SEG and ClinicDB.
Kvasir-SEG[13]:

| Model | mDic | mIoU | F^w_β | S_α | E^mean_ξ | E^max_ξ | MAE |
|---|---|---|---|---|---|---|---|
| U-Net | 0.818 | 0.746 | 0.794 | 0.858 | 0.881 | 0.893 | 0.055 |
| UNet++ | 0.821 | 0.743 | 0.808 | 0.862 | 0.886 | 0.909 | 0.048 |
| SFA | 0.723 | 0.611 | 0.670 | 0.782 | 0.834 | 0.849 | 0.075 |
| MSEG | 0.897 | 0.839 | 0.885 | 0.912 | 0.942 | 0.948 | 0.028 |
| DCRNet | 0.886 | 0.825 | 0.868 | 0.911 | 0.933 | 0.941 | 0.035 |
| ACSNet | 0.898 | 0.838 | 0.882 | 0.920 | 0.941 | 0.952 | 0.032 |
| PraNet | 0.898 | 0.840 | 0.885 | 0.915 | 0.944 | 0.948 | 0.030 |
| EU-Net | 0.908 | 0.854 | 0.893 | 0.917 | 0.951 | 0.954 | 0.028 |
| SANet | 0.904 | 0.847 | 0.892 | 0.915 | 0.949 | 0.953 | 0.028 |
| Polyp-PVT (Ours) | 0.917 | 0.864 | 0.911 | 0.925 | 0.956 | 0.962 | 0.023 |

ClinicDB[8]:

| Model | mDic | mIoU | F^w_β | S_α | E^mean_ξ | E^max_ξ | MAE |
|---|---|---|---|---|---|---|---|
| U-Net | 0.823 | 0.755 | 0.811 | 0.889 | 0.913 | 0.954 | 0.019 |
| UNet++ | 0.794 | 0.729 | 0.785 | 0.873 | 0.891 | 0.931 | 0.022 |
| SFA | 0.700 | 0.607 | 0.647 | 0.793 | 0.840 | 0.885 | 0.042 |
| MSEG | 0.909 | 0.864 | 0.907 | 0.938 | 0.961 | 0.969 | 0.007 |
| DCRNet | 0.896 | 0.844 | 0.890 | 0.933 | 0.964 | 0.978 | 0.010 |
| ACSNet | 0.882 | 0.826 | 0.873 | 0.927 | 0.947 | 0.959 | 0.011 |
| PraNet | 0.899 | 0.849 | 0.896 | 0.936 | 0.963 | 0.979 | 0.009 |
| EU-Net | 0.902 | 0.846 | 0.891 | 0.936 | 0.959 | 0.965 | 0.011 |
| SANet | 0.916 | 0.859 | 0.909 | 0.939 | 0.971 | 0.976 | 0.012 |
| Polyp-PVT (Ours) | 0.937 | 0.889 | 0.936 | 0.949 | 0.985 | 0.989 | 0.006 |
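The mDic, mIoU, and MAE columns reported throughout the quantitative tables can be computed per image from binary masks as below. This is a plain-Python sketch of the standard definitions; the tables report these values averaged over each test set.

```python
def dice_iou_mae(pred, gt):
    """Per-image Dice, IoU, and MAE for binary masks given as flat
    0/1 sequences of equal length."""
    inter = sum(p * g for p, g in zip(pred, gt))
    ps, gs = sum(pred), sum(gt)
    # Dice = 2|P∩G| / (|P| + |G|); empty-vs-empty counts as perfect.
    dice = 2 * inter / (ps + gs) if ps + gs else 1.0
    # IoU = |P∩G| / |P∪G|
    union = ps + gs - inter
    iou = inter / union if union else 1.0
    # MAE = mean absolute pixel-wise difference
    mae = sum(abs(p - g) for p, g in zip(pred, gt)) / len(gt)
    return dice, iou, mae

d, i, m = dice_iou_mae([1, 1, 0, 0], [1, 0, 1, 0])
# d = 0.5, i = 1/3, m = 0.5
```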
Quantitative results of the test dataset Endoscene. The SFA result is generated using the published code.
| Model | mDic | mIoU | F^w_β | S_α | E^mean_ξ | E^max_ξ | MAE |
|---|---|---|---|---|---|---|---|
| U-Net | 0.710 | 0.627 | 0.684 | 0.843 | 0.847 | 0.875 | 0.022 |
| UNet++ | 0.707 | 0.624 | 0.687 | 0.839 | 0.834 | 0.898 | 0.018 |
| SFA | 0.467 | 0.329 | 0.341 | 0.640 | 0.644 | 0.817 | 0.065 |
| MSEG | 0.874 | 0.804 | 0.852 | 0.924 | 0.948 | 0.957 | 0.009 |
| ACSNet | 0.863 | 0.787 | 0.825 | 0.923 | 0.939 | 0.968 | 0.013 |
| DCRNet | 0.856 | 0.788 | 0.830 | 0.921 | 0.943 | 0.960 | 0.010 |
| PraNet | 0.871 | 0.797 | 0.843 | 0.925 | 0.950 | 0.972 | 0.010 |
| EU-Net | 0.837 | 0.765 | 0.805 | 0.904 | 0.919 | 0.933 | 0.015 |
| SANet | 0.888 | 0.815 | 0.859 | 0.928 | 0.962 | 0.972 | 0.008 |
| Polyp-PVT (Ours) | 0.900 | 0.833 | 0.884 | 0.935 | 0.973 | 0.981 | 0.007 |
Quantitative results of the test datasets ColonDB and ETIS. The SFA result is generated using the published code.
ColonDB[10]:

| Model | mDic | mIoU | F^w_β | S_α | E^mean_ξ | E^max_ξ | MAE |
|---|---|---|---|---|---|---|---|
| U-Net | 0.512 | 0.444 | 0.498 | 0.712 | 0.696 | 0.776 | 0.061 |
| UNet++ | 0.483 | 0.410 | 0.467 | 0.691 | 0.680 | 0.760 | 0.064 |
| SFA | 0.469 | 0.347 | 0.379 | 0.634 | 0.675 | 0.764 | 0.094 |
| ACSNet | 0.716 | 0.649 | 0.697 | 0.829 | 0.839 | 0.851 | 0.039 |
| MSEG | 0.735 | 0.666 | 0.724 | 0.834 | 0.859 | 0.875 | 0.038 |
| DCRNet | 0.704 | 0.631 | 0.684 | 0.821 | 0.840 | 0.848 | 0.052 |
| PraNet | 0.712 | 0.640 | 0.699 | 0.820 | 0.847 | 0.872 | 0.043 |
| EU-Net | 0.756 | 0.681 | 0.730 | 0.831 | 0.863 | 0.872 | 0.045 |
| SANet | 0.753 | 0.670 | 0.726 | 0.837 | 0.869 | 0.878 | 0.043 |
| Polyp-PVT (Ours) | 0.808 | 0.727 | 0.795 | 0.865 | 0.913 | 0.919 | 0.031 |

ETIS[9]:

| Model | mDic | mIoU | F^w_β | S_α | E^mean_ξ | E^max_ξ | MAE |
|---|---|---|---|---|---|---|---|
| U-Net | 0.398 | 0.335 | 0.366 | 0.684 | 0.643 | 0.740 | 0.036 |
| UNet++ | 0.401 | 0.344 | 0.390 | 0.683 | 0.629 | 0.776 | 0.035 |
| SFA | 0.297 | 0.217 | 0.231 | 0.557 | 0.531 | 0.632 | 0.109 |
| ACSNet | 0.578 | 0.509 | 0.530 | 0.754 | 0.737 | 0.764 | 0.059 |
| MSEG | 0.700 | 0.630 | 0.671 | 0.828 | 0.854 | 0.890 | 0.015 |
| DCRNet | 0.556 | 0.496 | 0.506 | 0.736 | 0.742 | 0.773 | 0.096 |
| PraNet | 0.628 | 0.567 | 0.600 | 0.794 | 0.808 | 0.841 | 0.031 |
| EU-Net | 0.687 | 0.609 | 0.636 | 0.793 | 0.807 | 0.841 | 0.067 |
| SANet | 0.750 | 0.654 | 0.685 | 0.849 | 0.881 | 0.897 | 0.015 |
| Polyp-PVT (Ours) | 0.787 | 0.706 | 0.750 | 0.871 | 0.906 | 0.910 | 0.013 |
Standard deviation (SD) of the mDic of our model and the comparison models.
| Model | Kvasir-SEG | ClinicDB | ColonDB | ETIS | Endoscene |
|---|---|---|---|---|---|
| U-Net | 0.818±0.039 | 0.823±0.047 | 0.483±0.034 | 0.398±0.033 | 0.710±0.049 |
| UNet++ | 0.821±0.040 | 0.794±0.044 | 0.456±0.037 | 0.401±0.057 | 0.707±0.053 |
| SFA | 0.723±0.052 | 0.701±0.054 | 0.444±0.037 | 0.297±0.025 | 0.468±0.050 |
| MSEG | 0.897±0.041 | 0.910±0.048 | 0.735±0.039 | 0.700±0.039 | 0.874±0.051 |
| ACSNet | 0.898±0.045 | 0.882±0.048 | 0.716±0.040 | 0.578±0.035 | 0.863±0.055 |
| DCRNet | 0.886±0.043 | 0.896±0.049 | 0.704±0.039 | 0.556±0.039 | 0.857±0.052 |
| PraNet | 0.898±0.041 | 0.899±0.048 | 0.712±0.038 | 0.628±0.036 | 0.871±0.051 |
| EU-Net | 0.908±0.042 | 0.902±0.048 | 0.756±0.040 | 0.687±0.039 | 0.837±0.049 |
| SANet | 0.904±0.042 | 0.916±0.049 | 0.752±0.040 | 0.750±0.047 | 0.888±0.054 |
| Polyp-PVT (Ours) | 0.917±0.042 | 0.937±0.050 | 0.808±0.043 | 0.787±0.044 | 0.900±0.052 |
Quantitative results for ablation studies.
| Dataset | Metric | Baseline | Without CFM | Without CIM | Without SAM | Final |
|---|---|---|---|---|---|---|
| Endoscene | mDic | 0.869 | 0.892 | 0.882 | 0.874 | 0.900 |
| | mIoU | 0.792 | 0.826 | 0.808 | 0.801 | 0.833 |
| ClinicDB | mDic | 0.903 | 0.915 | 0.930 | 0.930 | 0.937 |
| | mIoU | 0.847 | 0.865 | 0.881 | 0.877 | 0.889 |
| ColonDB | mDic | 0.796 | 0.802 | 0.805 | 0.779 | 0.808 |
| | mIoU | 0.707 | 0.721 | 0.724 | 0.696 | 0.727 |
| ETIS | mDic | 0.759 | 0.771 | 0.785 | 0.778 | 0.787 |
| | mIoU | 0.668 | 0.690 | 0.711 | 0.693 | 0.706 |
| Kvasir-SEG | mDic | 0.910 | 0.922 | 0.910 | 0.910 | 0.917 |
| | mIoU | 0.856 | 0.872 | 0.858 | 0.853 | 0.864 |
Ablation study of GCN in the SAM module. The mDic scores are provided.
| Setting | Endoscene | ClinicDB | ColonDB | ETIS | Kvasir-SEG |
|---|---|---|---|---|---|
| Without GCN | 0.876 | 0.928 | 0.784 | 0.725 | 0.894 |
| With Conv | 0.894 | 0.919 | 0.787 | 0.742 | 0.909 |
| With GCN | 0.900 | 0.937 | 0.808 | 0.787 | 0.917 |
Ablation experiments on rotation robustness. All experiments use a large rotation (15 degrees). The mDic scores are provided.

| Setting | Endoscene | ClinicDB | ColonDB | ETIS | Kvasir-SEG |
|---|---|---|---|---|---|
| Without GCN | 0.857 | 0.909 | 0.756 | 0.667 | 0.894 |
| With Conv | 0.865 | 0.898 | 0.789 | 0.719 | 0.893 |
| With GCN | 0.874 | 0.929 | 0.806 | 0.744 | 0.915 |
Video polyp segmentation results on the CVC-300-TV[96].
| Model | mDic | mIoU | F^w_β | S_α | E^mean_ξ | E^max_ξ | MAE |
|---|---|---|---|---|---|---|---|
| U-Net | 0.631 | 0.516 | 0.567 | 0.793 | 0.826 | 0.849 | 0.027 |
| UNet++ | 0.638 | 0.527 | 0.581 | 0.796 | 0.831 | 0.847 | 0.024 |
| ResUNet++ | 0.533 | 0.410 | 0.469 | 0.703 | 0.718 | 0.720 | 0.052 |
| ACSNet | 0.732 | 0.627 | 0.703 | 0.837 | 0.871 | 0.875 | 0.016 |
| PraNet | 0.716 | 0.624 | 0.700 | 0.833 | 0.852 | 0.904 | 0.016 |
| PNS-Net | 0.813 | 0.710 | 0.778 | 0.909 | 0.921 | 0.942 | 0.013 |
| Polyp-PVT (Ours) | 0.880 | 0.802 | 0.869 | 0.915 | 0.961 | 0.965 | 0.011 |
Results of video polyp segmentation on CVC-612-T and CVC-612-V.