Multi-granularity sequence generation for hierarchical image classification

Xinda Liu; Lili Wang

doi:10.1007/s41095-022-0332-2


Research Article | Open Access

Xinda Liu¹, Lili Wang¹,²

¹ State Key Laboratory of Virtual Reality Technology and Systems, Beihang University, Beijing 100191, China

² Peng Cheng Laboratory, Shenzhen 518000, China


Abstract

Hierarchical multi-granularity image classification is a challenging task that aims to tag each given image with multiple granularity labels simultaneously. Existing methods tend to overlook the fact that different image regions contribute differently to label prediction at different granularities, and they give insufficient consideration to the relationships between hierarchical multi-granularity labels. We introduce a sequence-to-sequence mechanism to overcome these two problems and propose a multi-granularity sequence generation (MGSG) approach for the hierarchical multi-granularity image classification task. Specifically, we introduce a transformer architecture to encode the image into visual representation sequences. Next, we traverse the taxonomic tree to organize the multi-granularity labels into sequences, then vectorize them and add positional information. The proposed multi-granularity sequence generation method builds a decoder that takes the visual representation sequences and semantic label embeddings as inputs, and outputs the predicted multi-granularity label sequence. The decoder models dependencies and correlations between multi-granularity labels through a masked multi-head self-attention mechanism, and relates visual information to the semantic label information through a cross-modality attention mechanism. In this way, the proposed method preserves the relationships between labels at different granularity levels and takes into account the influence of different image regions on labels with different granularities. Evaluations on six public benchmarks qualitatively and quantitatively demonstrate the advantages of the proposed method. Our project is available at https://github.com/liuxindazz/mgsg.
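To make the label-sequence step more concrete, the following is a minimal sketch, not the authors' implementation. It assumes a toy taxonomy stored as a child-to-parent map, a hypothetical `<bos>` start token, and the usual shift-right convention for decoder inputs. The names `PARENT`, `label_sequence`, `causal_mask`, and `decoder_pair` are illustrative only. It shows how one leaf label can be expanded into a coarse-to-fine sequence, and which positions a masked self-attention layer would be allowed to see.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical toy taxonomy: child label -> parent label (None marks the top level).
PARENT: Dict[str, Optional[str]] = {
    "Passeriformes": None,
    "Corvidae": "Passeriformes",
    "American Crow": "Corvidae",
    "Fish Crow": "Corvidae",
}


def label_sequence(leaf: str, parent: Dict[str, Optional[str]]) -> List[str]:
    """Walk from the leaf up to the top level, then reverse to coarse-to-fine order."""
    path: List[str] = []
    node: Optional[str] = leaf
    while node is not None:
        path.append(node)
        node = parent[node]
    return path[::-1]


def causal_mask(length: int) -> List[List[bool]]:
    """mask[i][j] is True when position i may attend to position j (only j <= i)."""
    return [[j <= i for j in range(length)] for i in range(length)]


def decoder_pair(seq: List[str], bos: str = "<bos>") -> Tuple[List[str], List[str]]:
    """Assumed shift-right convention: the decoder input is the target delayed by one step."""
    return [bos] + seq[:-1], seq


if __name__ == "__main__":
    target = label_sequence("Fish Crow", PARENT)
    decoder_input, decoder_target = decoder_pair(target)
    print(decoder_input)
    print(decoder_target)
    for row in causal_mask(len(target)):
        print(row)
```

Under this convention, the prediction at each granularity can depend only on the coarser labels before it, which is one simple way to respect the hierarchy. A full model would additionally embed the tokens, add positional information, and attend over the encoded image sequence.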

Keywords

hierarchical multi-granularity classification; vision and text transformer; sequence generation; fine-grained image recognition; cross-modality attention


Computational Visual Media

Volume 10, Issue 2, April 2024

Pages 243–260

DOI: 10.1007/s41095-022-0332-2

Cite this article:

Liu X, Wang L. Multi-granularity sequence generation for hierarchical image classification. Computational Visual Media, 2024, 10(2): 243-260. https://doi.org/10.1007/s41095-022-0332-2