Open Access

CasNet: A Cascade Coarse-to-Fine Network for Semantic Segmentation

Zhenyang Wang, Zhidong Deng, and Shiyao Wang
Department of Computer Science, Tsinghua University, Beijing 100084, China.

Abstract

Semantic segmentation is a fundamental topic in computer vision. Since it requires dense predictions over an entire image, a single network can hardly achieve good performance across various kinds of scenes. In this paper, we propose a cascade coarse-to-fine network called CasNet, which focuses on regions that are difficult to label at the pixel level. CasNet comprises three branches. The first branch is designed to produce coarse predictions for easy-to-label pixel regions. The second one learns to distinguish the relatively difficult-to-label pixels from the rest of the image. Finally, the last branch generates final predictions by combining the coarse and fine prediction results through a weighting coefficient estimated by the second branch. The three branches each focus on their own objective and collaboratively learn to refine predictions from coarse to fine. To evaluate the performance of the proposed network, we conduct experiments on two public datasets: SIFT Flow and Stanford Background. We show that the three branches can be trained in an end-to-end manner, and the experimental results demonstrate that the proposed CasNet outperforms existing state-of-the-art models, achieving prediction accuracies of 91.6% and 89.7% on SIFT Flow and Stanford Background, respectively.
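The fusion step described above can be sketched as a per-pixel convex combination of the coarse and fine class-score maps, gated by the weighting map from the second branch. The exact fusion rule used by CasNet is not given in the abstract, so the formula below (weight times fine plus one-minus-weight times coarse) and all function and variable names are illustrative assumptions:

```python
import numpy as np

def fuse_predictions(coarse, fine, weight):
    """Combine coarse and fine per-pixel class scores using a
    per-pixel weighting map (assumed fusion rule, not the paper's
    exact formulation).

    coarse, fine: (H, W, C) class-score maps from the first and
                  third branches
    weight:       (H, W) map in [0, 1], high where the second
                  branch judges a pixel difficult to label
    """
    w = weight[..., np.newaxis]           # broadcast over the class axis
    return w * fine + (1.0 - w) * coarse

# Toy example: a 2x2 image with 3 classes.
coarse = np.array([[[0.90, 0.05, 0.05], [0.60, 0.30, 0.10]],
                   [[0.20, 0.70, 0.10], [0.10, 0.10, 0.80]]])
fine   = np.array([[[0.10, 0.80, 0.10], [0.20, 0.20, 0.60]],
                   [[0.30, 0.40, 0.30], [0.05, 0.05, 0.90]]])
weight = np.array([[0.0, 1.0],
                   [0.5, 0.2]])

fused = fuse_predictions(coarse, fine, weight)
labels = fused.argmax(axis=-1)            # final per-pixel labels
```

With weight 0 a pixel keeps its coarse scores, and with weight 1 it takes the fine scores entirely, so easy regions pass through the cheap coarse path while difficult regions are handled by the fine branch.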

Tsinghua Science and Technology
Pages 207-215
Cite this article:
Wang Z, Deng Z, Wang S. CasNet: A Cascade Coarse-to-Fine Network for Semantic Segmentation. Tsinghua Science and Technology, 2019, 24(2): 207-215. https://doi.org/10.26599/TST.2018.9010044

Received: 25 October 2017
Accepted: 18 December 2017
Published: 31 December 2018
© The author(s) 2019