S4Net: Single stage salient-instance segmentation

Ruochen Fan; Ming-Ming Cheng; Qibin Hou; Tai-Jiang Mu; Jingdong Wang; Shi-Min Hu

doi:10.1007/s41095-020-0173-9

Research Article | Open Access

Ruochen Fan¹, Ming-Ming Cheng², Qibin Hou², Tai-Jiang Mu¹, Jingdong Wang³, Shi-Min Hu¹

1 BNRist, Tsinghua University, Beijing 100086, China.

2 Nankai University, Tianjin 300071, China.

3 MSRA, Beijing 100086, China.

Abstract

In this paper, we consider salient instance segmentation. In addition to bounding boxes, our network outputs high-quality instance-level segments that serve as initial selections indicating the regions of interest. Exploiting the category-independent nature of each target, we design a single stage salient instance segmentation framework with a novel segmentation branch. The new branch considers not only the local context inside each detection window but also its surrounding context, enabling us to distinguish instances in the same scope even under partial occlusion. Our network is end-to-end trainable and fast (running at 40 fps on images of resolution $320 \times 320$). We evaluate our approach on a publicly available benchmark and show that it outperforms alternative solutions. We also provide a thorough analysis of our design choices to help readers better understand the function of each part of our network. Source code can be found at https://github.com/RuochenFan/S4Net.
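The abstract does not say how the segmentation branch combines the detection window with its surroundings. As a rough illustration only, the NumPy sketch below builds a mask that marks a detection window with +1 and a band around it with -1, then multiplies a feature map by that mask. The function names, the sign convention, and the expansion factor of one third are assumptions made for this example and are not taken from the abstract.

```python
import numpy as np


def roi_context_mask(height, width, box, expand=1.0 / 3.0):
    """Build a (height, width) mask for one detection window.

    Hypothetical helper: +1 inside the box, -1 in a band around it, 0 elsewhere.
    box is (x0, y0, x1, y1) in feature-map cells, with x1 and y1 exclusive.
    """
    x0, y0, x1, y1 = box
    # Grow the box by a fraction of its size on every side, clipped to the map.
    dx = int(round((x1 - x0) * expand))
    dy = int(round((y1 - y0) * expand))
    ex0, ey0 = max(x0 - dx, 0), max(y0 - dy, 0)
    ex1, ey1 = min(x1 + dx, width), min(y1 + dy, height)

    mask = np.zeros((height, width), dtype=np.float32)
    mask[ey0:ey1, ex0:ex1] = -1.0  # surrounding context band (assumed sign)
    mask[y0:y1, x0:x1] = 1.0       # the detection window itself
    return mask


def apply_mask(features, mask):
    """Multiply a (channels, height, width) feature map by the mask."""
    return features * mask[None, :, :]


# Toy usage: a constant 2-channel feature map and one 3x3 window.
features = np.ones((2, 8, 8), dtype=np.float32)
mask = roi_context_mask(8, 8, box=(3, 3, 6, 6))
masked = apply_mask(features, mask)
```

In this sketch, features inside the window keep their sign, features in the surrounding band are negated, and everything else is zeroed, so a later layer could in principle tell the window and its context apart.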

Keywords

salient-instance segmentation; salient object detection; single stage; region-of-interest masking

Computational Visual Media

Volume 6, Issue 2, June 2020

Pages 191-204

DOI: 10.1007/s41095-020-0173-9

Cite this article:

Fan R, Cheng M-M, Hou Q, et al. S4Net: Single stage salient-instance segmentation. Computational Visual Media, 2020, 6(2): 191-204. https://doi.org/10.1007/s41095-020-0173-9