Open Access

CasNet: A Cascade Coarse-to-Fine Network for Semantic Segmentation

Zhenyang Wang, Zhidong Deng, and Shiyao Wang
Department of Computer Science, Tsinghua University, Beijing 100084, China.

Abstract

Semantic segmentation is a fundamental topic in computer vision. Since it requires dense predictions over an entire image, a single network can hardly achieve good performance across various kinds of scenes. In this paper, we propose a cascade coarse-to-fine network called CasNet, which focuses on regions that are difficult to label at the pixel level. CasNet comprises three branches. The first branch is designed to produce coarse predictions for easy-to-label pixel regions. The second one learns to distinguish the relatively difficult-to-label pixels from the rest of the image. Finally, the last branch generates final predictions by combining the coarse and fine prediction results through a weighting coefficient estimated by the second branch. The three branches each focus on their own objective and collaboratively learn to refine predictions from coarse to fine. To evaluate the performance of the proposed network, we conduct experiments on two public datasets: SIFT Flow and Stanford Background. We show that the three branches can be trained in an end-to-end manner, and the experimental results demonstrate that the proposed CasNet outperforms existing state-of-the-art models, achieving prediction accuracies of 91.6% and 89.7% on SIFT Flow and Stanford Background, respectively.
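The fusion step described above can be sketched as a per-pixel convex combination of the coarse and fine class-score maps, gated by the weighting map from the second branch. The exact fusion rule used by CasNet is not given in the abstract, so the formula below (weight times fine plus one-minus-weight times coarse) and all function and variable names are illustrative assumptions:

```python
import numpy as np

def fuse_predictions(coarse, fine, weight):
    """Combine coarse and fine per-pixel class scores using a
    per-pixel weighting map (assumed fusion rule, not the paper's
    exact formulation).

    coarse, fine: (H, W, C) class-score maps from the first and
                  third branches
    weight:       (H, W) map in [0, 1], high where the second
                  branch judges a pixel difficult to label
    """
    w = weight[..., np.newaxis]           # broadcast over the class axis
    return w * fine + (1.0 - w) * coarse

# Toy example: a 2x2 image with 3 classes.
coarse = np.array([[[0.90, 0.05, 0.05], [0.60, 0.30, 0.10]],
                   [[0.20, 0.70, 0.10], [0.10, 0.10, 0.80]]])
fine   = np.array([[[0.10, 0.80, 0.10], [0.20, 0.20, 0.60]],
                   [[0.30, 0.40, 0.30], [0.05, 0.05, 0.90]]])
weight = np.array([[0.0, 1.0],
                   [0.5, 0.2]])

fused = fuse_predictions(coarse, fine, weight)
labels = fused.argmax(axis=-1)            # final per-pixel labels
```

With weight 0 a pixel keeps its coarse scores, and with weight 1 it takes the fine scores entirely, so easy regions pass through the cheap coarse path while difficult regions are handled by the fine branch.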

Tsinghua Science and Technology
Pages 207-215
Cite this article:
Wang Z, Deng Z, Wang S. CasNet: A Cascade Coarse-to-Fine Network for Semantic Segmentation. Tsinghua Science and Technology, 2019, 24(2): 207-215. https://doi.org/10.26599/TST.2018.9010044

Received: 25 October 2017
Accepted: 18 December 2017
Published: 31 December 2018
© The author(s) 2019