A Pixel–Channel Hybrid Attention Model for Image Processing

Qiang Hua; Liyou Chen; Pan Li; Shipeng Zhao; Yan Li

doi:10.26599/TST.2021.9010054

| Sign up

PDF (6.6 MB)

Cite

EndNote(RIS) BibTeX

Collect

Submit Manuscript

Show Outline

Figures (5)

Fig. 1

Fig. 2

Fig. 3

Fig. 4

Fig. 5

Tables (10)

Table 1

Table 2

Table 3

Table 4

Table 5

Open Access

A Pixel–Channel Hybrid Attention Model for Image Processing

Qiang Hua, Liyou Chen(), Pan Li(), Shipeng Zhao(), Yan Li()

College of Mathematics and Information Science, Hebei University, Baoding 071002, China

Research Center for Applied Mathematics and Interdisciplinary Sciences, Beijing Normal University, Zhuhai 519087, China

Show Author Information

Abstract

In the field of image processing, better results can often be achieved through the deepening of neural network layers involving considerably more parameters. In image classification, improving classification accuracy without introducing too many parameters remains a challenge. As for image conversion, the use of the conversion model of the generative adversarial network often produces semantic artifacts, resulting in images with lower quality. Thus, to address the above problems, a new type of attention module is proposed in this paper for the first time. This proposed approach uses the pixel–channel hybrid attention (PCHA) mechanism, which combines the attention information of the pixel and channel domains. The comparative results of using different attention modules on multiple-image data verify the superiority of the PCHA module in performing classification tasks. For image conversion, we propose a skip structure (S-PCHA model) in the up- and down-sampling processes based on the PCHA model. The proposed model can help the algorithm identify the most distinctive semantic object in a given image, as this structure effectively realizes the intercommunication of encoder and decoder information. Furthermore, the results showed that the attention model could establish a more realistic mapping from the source domain to the target domain in the image conversion algorithm, thus improving the quality of the image generated by the conversion model.

Keywords

deep learning attention mechanism image classification image processing convolutional neural networks

References

[1]

Bahdanau

, K.

Cho

, and Y.

Bengio

, Neural machine translation by jointly learning to align and translate, presented at the 2015 International Conference on Learning Representations, San Diego, CA, USA, 2015.

[2]

, J.

, R.

Kiros

, K.

Cho

, A.

Courville

, R.

Salakhudinov

, R.

Zemel

, and Y.

Bengio

, Show, attend and tell: Neural image caption generation with visual attention, in Proc. of the 32nd International Conference on Machine Learning, Lile, France, 2015, pp. 2048–2057.

[3]

, J.

Yang

, D.

Batra

, and D.

Parikh

, Hierarchical question-image co-attention for visual question answering, in Proc. of the 30th International Conference on Neural Information Processing Systems, Barcelona, Spain, 2016, 289–297.

[4]

LeCun

, L.

Bottou

, Y.

Bengio

, and P.

Haffner

, Gradient-based learning applied to document recognition, Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.

Crossref Google Scholar

[5]

, L.

Shen

, S.

Albanie

, G.

Sun

, and E.

, Squeeze-and-excitation networks, In Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit., Silver Spring, MD, USA, 2018, pp. 7132–7141.

Crossref

[6]

Vaswani

, N.

Shazeer

, N.

Parmar

, J.

Uszkoreit

, L.

Jones

, A. N.

Gomez

, Ł

Kaiser

, and I.

Polosukhin

, Attention is all you need, in Proc. of the 31st International Conference on Neural Information Processing Systems, Long Beach, CA, USA, 2017, pp. 6000–6010.

[7]

Wang

, M.

Jiang

, C.

Qian

, S.

Yang

, C.

, H.

Zhang

, X.

Wang

, and X.

Tang

, Residual attention network for image classification, in Proc. of the 2017 IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 2017, pp. 3156–3164.

Crossref

[8]

Woo

, J.

Park

, J. Y.

Lee

, and I. S.

Kweon

, Cbam: Convolutional block attention module, in Proc. of the 15th European Conference on Computer Vision, Munich, Germany, 2018, pp. 3–19.

Crossref

[9]

Wang

, R.

Girshick

, A.

Gupta

, and K.

, Non-local neural networks, in Proc. of the 2018 IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 2018, pp. 7794–7803.

Crossref

[10]

, J.

Liu

, H.

Tian

, Y.

Bao

, Z.

Fang

, and H.

, Dual attention network for scene segmentation, in Proc. of the 2019 IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 2019, pp. 3146–3154.

Crossref

[11]

Jaderberg

, K.

Simonyan

, A.

Zisserman

, and K.

Kavukcuoglu

, Spatial transformer networks, in Proc. of the 28th International Conference on Neural Information Processing Systems, Montreal, Canada, 2015, pp. 2017–2025.

[12]

W. T.

Chan

, F. Y. L.

Chin

, D.

, G.

Zhang

, and Y.

Zhang

, On-line scheduling of parallel jobs on two machines, Journal of Discrete Algorithms, vol. 6, no. 1, pp. 3–10, 2008.

Crossref Google Scholar

[13]

Xin

, J.

Zhang

, and Y.

Shao

, Complex network classification with convolutional neural network, Tsinghua Science and Technology, vol. 25, no. 4, pp. 447–457, 2020.

Crossref Google Scholar

[14]

W. T.

Chan

, Y.

Zhang

, S. P. Y.

Fung

, D.

, and H.

Zhu

, Efficient algorithms for finding a longest common increasing subsequence, Journal of Combinatorial Optimization, vol. 13, no. 3, pp. 277–288, 2007.

Crossref Google Scholar

[15]

Goodfellow

, J.

Pouget-Abadie

, M.

Mirza

, B.

, D.

Warde-Farley

, S.

Ozair

, A.

Courville

, and Y.

Bengio

, Generative adversarial nets, Advances in Neural Information Processing Systems, vol. 27, 2672–2680, 2014.

Google Scholar

[16]

Isola

, J. Y.

Zhu

, T.

Zhou

, and A. A.

Efros

, Image-to-imagetranslation with conditional adversarial networks, in Proc. of the 2017 IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 2017, pp. 1125–1134.

Crossref

[17]

J. Y.

Zhu

, R.

Zhang

, D.

Pathak

, T.

Darrell

, A. A.

Efros

, O.

Wang

, and E.

Shechtman

, Multimodal image-to-image translation by enforcing bi-cycle consistency, in Proc. of the 31st International Conference on Neural Information Processing Systems, Long Beach, CA, USA, 2017, pp. 465–476.

[18]

Wang

, M.

Liu

, J.

Zhu

, A.

Tao

, J.

Kautz

, and B.

Catanzaro

, High-resolution image synthesis and semantic manipulation with conditional GANs, in Proc. of the 2018 IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 2018, pp. 8798–8807.

Crossref

[19]

Liang

, H.

Zhang

, L.

Lin

, and E.

Xing

, Generative semantic manipulation with mask-contrasting GAN, in Proc. of the 15th European Conference on Computer Vision, Munich, Germany, 2018, pp. 574–590.

Crossref

[20]

Chen

, C.

, X.

Yang

, and D.

Tao

, Attention-gan for object transfiguration in wild images, in Proc. of the 15th European Conference on Computer Vision, Munich, Germany, 2018, pp. 167–184.

Crossref

[21]

Y. A.

Mejjati

, C.

Richardt

, J.

Tompkin

, D.

Cosker

, and K. I.

Kim

, Unsupervised attention-guided image to image translation, in Proc. of the 32nd International Conference on Neural Information Processing Systems, Montréal, Canada, 2018, pp. 3697–3707.

[22]

J. Y.

Zhu

, T.

Park

, P.

Isola

, and A. A.

Efros

, Unpaired image-to-image translation using cycle-consistent adversarial networks, in Proc. of the 2017 IEEE International Conference on Computer Vision, Venice, Italy, 2017, pp. 2223–2232.

Crossref

[23]

Mirza

and S.

Osindero

, Conditional generative adversarial nets, arXiv preprint arXiv:1411.1784, 2014.

Google Scholar

[24]

Shelhamer

, J.

Long

, and T.

Darrell

, Fully convolutional networks for semantic segmentation, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 4, pp. 640–651, 2017.

Crossref Google Scholar

[25]

Heusel

, H.

Ramsauer

, T.

Unterthiner

, B.

Nessler

, and S.

Hochreiter

, GANs trained by a two time-scale update rule converge to a local Nash equilibrium, in Proc. of the 31st International Conference on Neural Information Processing Systems, Long Beach, CA, USA, 2017, pp. 6629–6640.

[26]

Bińkowski

, D. J.

Sutherland

, M.

Arbel

, and A.

Gretton

, Demystifying MMD GANs, presented at International Conference on Learning Representations, Vancouver, Cananda, 2018.

[27]

Krizhevsky

, Learning multiple layers of features from tiny images, Technical report, University of Toronto, Toronto, Canada, 2009.

[28]

Simonyan

and A.

Zisserman

, Very deep convolutional networks for large-scale image recognition, arXiv preprint arXiv:1409.1556, 2014.

Google Scholar

[29]

Szegedy

, W.

Liu

, Y.

Jia

, P.

Sermanet

, S.

Reed

, D.

Anguelov

, D.

Erhan

, V.

Vanhoucke

, and A.

Rabinovich

, Going deeper with convolutions, in Proc. of the 2015 IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 2015, pp. 1–9.

Crossref

[30]

, X.

Zhang

, S.

Ren

, and J.

Sun

, Deep residual learning for image recognition, in Proc. of the 2016 IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 2016, pp. 770–778.

Crossref

[31]

Huang

, Z.

Liu

, L.

van der Maaten

, and K. Q.

Weinberger

, Densely connected convolutional networks, in Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 2017, pp. 4700–4708.

Crossref

[32]

Szegedy

, V.

Vanhoucke

, S.

Ioffe

, J.

Shlens

, and Z.

Wojna

, Rethinking the inception architecture for computer vision, in Proc. of the 2016 IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 2016, pp. 2818–2826.

Crossref

[33]

Ioffe

and C.

Szegedy

, Batch normalization: Accelerating deep network training by reducing internal covariate shift, in Proc. of the 32nd International Conference on Machine Learning, Lile, France, 2015, pp. 448–456.

[34]

, H.

Zhang

, P.

Tan

, and M.

Gong

, Dualgan: Unsupervised dual learning for image-to-image translation, in Proc. of the 2017 IEEE International Conference on Computer Vision, Venice, Italy, 2017, pp. 2849–2857.

Crossref

[35]

Kim

, M.

Cha

, H.

Kim

, J. K.

Lee

, and J.

Kim

, Learning to discover cross-domain relations with generative adversarial networks, in Proc. of the 34th International Conference on Machine Learning, Sydney, Australia, 2017, pp. 1857–1865.

[36]

LeCun

, The MNIST database of handwritten digits, http://yann.lecun.com/exdb/mnist/, 1998.

[37]

Hui

, Y.

Liu

, J.

Qiu

, L.

Cao

, L.

, and Z.

, Study of texture segmentation and classification for grading small hepatocellular carcinoma based on CT images, Tsinghua Science and Technology, vol. 26, no. 2, pp. 199–207, 2020.

Crossref Google Scholar

Tsinghua Science and Technology

Volume 27 Issue 5,
October 2022

Pages 804-816

DOI: 10.26599/TST.2021.9010054

Cite this article:

Hua Q, Chen L, Li P, et al. A Pixel–Channel Hybrid Attention Model for Image Processing. Tsinghua Science and Technology, 2022, 27(5): 804-816. https://doi.org/10.26599/TST.2021.9010054

Return

10.26599/TST.2021.9010054.T001Table 1Qualitative analysis of different attention models.

Model	Parameter	Computation load	Representation ability
SENet	Minimum	Least	Weak
CBAM	Less	Less	Relative weak
Non-Local	More	More	Middle
DANet	More	Maximum	Relative strong
PCHA	Maximum	Middle	Strong

10.26599/TST.2021.9010054.T002Table 2CNN structure for image classification. conv represents the convolutional layer used in the encoder or down-sampling layer, convT is the transposed convolution used in the decoder or up-sampling layer; s is the convolution step size, p represents whether to fill in zeros, and out refers to the number of output channels.

Input size	Network	Output size
3 $\times$ 32 $\times$ 32	conv(3 $\times$ 3, $s = 1, p = 1, 𝑜𝑢𝑡 = 64)$	64 $\times$ 32 $\times$ 32
64 $\times$ 32 $\times$ 32	conv(3 $\times$ 3, $s = 1$ , $p = 1, 𝑜𝑢𝑡 = 64)$	64 $\times$ 32 $\times$ 32
64 $\times$ 32 $\times$ 32	max pooling (2 $\times$ 2, $s = 2$ )	64 $\times$ 16 $\times$ 16
64 $\times$ 16 $\times$ 16	Dropout-1	64 $\times$ 16 $\times$ 16
64 $\times$ 16 $\times$ 16	conv(3 $\times$ 3, $s = 1$ , $p = 1, 𝑜𝑢𝑡 = 128)$	128 $\times$ 16 $\times$ 16
128 $\times$ 16 $\times$ 16	conv(3 $\times$ 3, $s = 1, p = 1,$ $𝑜𝑢𝑡 = 128)$	128 $\times$ 16 $\times$ 16
128 $\times$ 16 $\times$ 16	avg pooling (2 $\times 2, s = 2)$	128 $\times$ 8 $\times$ 8
128 $\times$ 8 $\times$ 8	Dropout-2	128 $\times$ 8 $\times$ 8
128 $\times$ 8 $\times$ 8	conv(3 $\times$ 3, $s = 1, p = 1, 𝑜𝑢𝑡 = 256)$	128 $\times$ 8 $\times$ 8
128 $\times$ 8 $\times$ 8	conv(3 $\times$ 3, $s = 1,$ $p = 1, 𝑜𝑢𝑡 = 256)$	128 $\times$ 8 $\times$ 8
128 $\times$ 8 $\times$ 8	global average pooling	256 $\times$ 1 $\times$ 1
256	Full-connection layer	10

10.26599/TST.2021.9010054.T003Table 3Generator network structure for image conversion.

Input size	Network	Output size
3 $\times$ 128 $\times$ 128	conv(7 $\times$ 7, $s = 1, p = 1, 𝑜𝑢𝑡 = 64)$	64 $\times$ 128 $\times$ 128
64 $\times$ 128 $\times$ 128	conv(3 $\times$ 3, $s = 2, p = 1, 𝑜𝑢𝑡 = 128)$	128 $\times$ 64 $\times$ 64
128 $\times$ 64 $\times$ 64	conv(3 $\times$ 3, $s = 2, p = 1, 𝑜𝑢𝑡 = 256)$	256 $\times$ 32 $\times$ 32
256 $\times$ 32 $\times$ 32	Res [conv(3 $\times$ 3, $s = 2,$ $p = 1,$ $𝑜𝑢𝑡 = 256)]$ $\times$ 6	256 $\times$ 32 $\times$ 32
256 $\times$ 32 $\times$ 32	convT(3 $\times$ 3, $s = 2, p = 1, 𝑜𝑢𝑡 = 128)$	128 $\times$ 64 $\times$ 64
128 $\times$ 64 $\times$ 64	convT(3 $\times$ 3, $s =, p = 1, 𝑜𝑢𝑡 = 64)$	64 $\times$ 128 $\times$ 128
64 $\times$ 128 $\times$ 128	convT(7 $\times$ 7, $s = 1, p = 0, 𝑜𝑢𝑡 = 64)$	3 $\times$ 128 $\times$ 128

10.26599/TST.2021.9010054.T004Table 4Discriminator network structure in image conversion.

Input size	Network	Output size
3 $\times$ 128 $\times$ 128	conv(4 $\times$ 4, $s = 2, p = 1, 𝑜𝑢𝑡 = 64)$	64 $\times$ 64 $\times$ 64
64 $\times$ 64 $\times$ 64	conv(4 $\times$ 4, $s = 2, p = 1, 𝑜𝑢𝑡 = 128)$ conv(4 $\times$ 4, $s = 2, p = 1, 𝑜𝑢𝑡 = 256)$	256 $\times$ 16 $\times$ 16
256 $\times$ 16 $\times$ 16	conv(4 $\times$ 4, $s = 2, p = 1, 𝑜𝑢𝑡 = 512)$	256 $\times$ 15 $\times$ 15
512 $\times$ 15 $\times$ 15	conv(4 $\times$ 4, $s = 2, p = 1, 𝑜𝑢𝑡 = 1)$	1 $\times$ 14 $\times$ 14

10.26599/TST.2021.9010054.T005Table 5Classification results of different attention models using the CNN on the cifar-10 dataset.

Model	Storage of parameter (bit)	FLOP	5-epoch accuracy (%)	8-epoch accuracy (%)	10-epoch accuracy (%)
Original CNN	1 148 874	153 466 112	83.26	84.16	83.36
SENet	1 153 066	153 613 696	82.32	84.36	84.26
Non-local network	1 154 074	164 115 712	82.82	84.39	84.99
CBAM	1 151 020	153 806 208	83.14	84.06	84.36
DANet	1 154 074	164 115 712	80.95	82.66	83.66
PCHA	1 154 106	155 743 616	84.23	84.88	85.1

10.26599/TST.2021.9010054.T006Table 6The classification performance of different models on Cifar-100.

Model	Storage of parameters ( $10^{6}$ bit)	Top-1 Error rate (%)	Top-5 Error rate (%)
VGG-16^[33]	34.01	27.17	8.83
GoogleNet^[34]	6.26	21.97	5.98
ResNet-34^[25]	21.32	23.34	6.73
SE-ResNet-34^[5]	21.65	22.07	6.12
Res-Attention59^[7]	55.75	33.62	12.98
DenseNet121^[26]	7.04	22.89	6.54
MobileNet^[35]	3.34	34.12	10.57
Inception-V3^[27]	22.37	22.82	6.41
VGG-16 + PCHA	34.04	26.97	8.72
GoogleNet + PCHA	6.31	21.7	5.54

10.26599/TST.2021.9010054.T007Table 7Autoencoder model experiment.

Model	Storage of parameter (bit)	FLOP	Time required to run 10 epoch (s)
Original autoencoder	65 503	1 801 632	784.5
SENet	66 063	1 808 960	959.9
Non-local network	66 183	1 934 912	1976.2
CBAM	65 955	1 846 864	1684.7
DANet	66 183	1 934 912	2315.7
PCHA	66 199	1 835 616	1788.9

10.26599/TST.2021.9010054.T008Table 8Network structure of the autoencoder.

Input size	Network	Output size
1 $\times$ 28 $\times$ 28	conv(4 $\times$ 4, $s = 2, p = 1, 𝑜𝑢𝑡 = 16)$	16 $\times$ 14 $\times$ 14
16 $\times$ 14 $\times$ 14	conv(4 $\times$ 4, $s = 2, p = 1, 𝑜𝑢𝑡 = 32)$	32 $\times$ 7 $\times$ 7
32 $\times$ 7 $\times$ 7	conv(4 $\times$ 4, $s = 2, p = 1, 𝑜𝑢𝑡 = 16)$	16 $\times$ 7 $\times$ 7
16 $\times$ 7 $\times$ 7	Fully connected layer 1	30
30	Fully connected layer 2	16 $\times$ 7 $\times$ 7
16 $\times$ 7 $\times$ 7	convT(4 $\times$ 4, $s = 2, p = 1, 𝑜𝑢𝑡 = 16)$	16 $\times$ 14 $\times$ 14
16 $\times$ 14 $\times$ 14	convT(4 $\times$ 4, $s = 2, p = 1, 𝑜𝑢𝑡 = 1)$	1 $\times$ 28 $\times$ 28

10.26599/TST.2021.9010054.T009Table 9FID and KID values of the images generated by the different models.

Model	zebra $\to$ horse		horse $\to$ zebra
Model	FID	KID	FID	KID
DiscoGAN	227.86	16.72 $\pm$ 0.60	201.88	14.21 $\pm$ 0.32
DualGAN	216.06	12.42 $\pm$ 0.62	157.03	10.43 $\pm$ 0.36
CycleGAN	198.71	11.71 $\pm$ 0.55	173.28	10.28 $\pm$ 0.26
Attention-GAN	181.05	10.05 $\pm$ 0.42	156.25	9.25 $\pm$ 0.41
Attention-Guided GAN	167.83	8.83 $\pm$ 0.32	141.15	7.15 $\pm$ 0.29
S-PCHA+CycleGAN	167.01	6.07 $\pm$ 0.54	118.02	5.26 $\pm$ 0.31

10.26599/TST.2021.9010054.T010Table 10FID and KID values of the S-PCHA ablation experiment.

Model	zebra $\to$ horse		horse $\to$ zebra
Model	FID	KID	FID	KID
CycleGAN	181.05	11.71 $\pm$ 0.55	156.26	10.28 $\pm$ 0.26
PAM	192.43	11.43 $\pm$ 0.68	167.35	10.39 $\pm$ 1.35
CAM	177.42	10.47 $\pm$ 0.74	147.78	9.63 $\pm$ 0.68
PCHA	170.84	9.09 $\pm$ 1.10	116.25	8.76 $\pm$ 0.77
S-PCHA	167.01	6.07 $\pm$ 0.54	118.02	5.26 $\pm$ 0.31