The prevalence of long-tailed distributions in real-world data often causes classification models to favor the dominant classes and neglect the less frequent ones. Current approaches to long-tailed image classification address this issue by rebalancing data, optimizing weights, or augmenting information. However, these methods often struggle to balance performance between dominant and minority classes because representation learning for the latter is inadequate. To address these problems, we introduce descriptional words associated with images as cross-modal privileged information and propose a cross-modal enhanced method for long-tailed image classification, referred to as CMLTNet. CMLTNet improves the intra-class similarity of tail-class representations through cross-modal alignment and captures the differences between head and tail classes in the semantic space through cross-modal inference. By fusing this information, CMLTNet achieves overall performance better than that of benchmark long-tailed and cross-modal learning methods on the long-tailed cross-modal datasets NUS-WIDE and VireoFood-172. The effectiveness of the proposed modules is further verified through ablation experiments. A case study of feature distributions shows that the proposed model learns better representations of tail classes, and the model-attention experiments indicate that CMLTNet can help learn rare tail-class concepts by mapping them to the semantic space.
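As a rough illustration of the cross-modal alignment and fusion described above, the sketch below pairs image features with word embeddings of the descriptional text, applies a cosine alignment loss to pull image representations toward the paired semantic representation, and fuses both modalities for classification. This is a minimal sketch under assumed settings; all module names, feature dimensions, and the loss weight are illustrative and not the authors' implementation.

```python
# Minimal sketch (not CMLTNet itself): descriptional words serve as privileged
# information, an alignment loss draws image features toward their textual
# (semantic) counterparts, and both modalities are fused for classification.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModalSketch(nn.Module):
    def __init__(self, img_dim=2048, txt_dim=300, embed_dim=512, num_classes=172):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, embed_dim)   # projects CNN image features
        self.txt_proj = nn.Linear(txt_dim, embed_dim)   # projects word-vector descriptions
        self.classifier = nn.Linear(embed_dim * 2, num_classes)  # classifies the fused representation

    def forward(self, img_feat, txt_feat):
        z_img = F.normalize(self.img_proj(img_feat), dim=-1)
        z_txt = F.normalize(self.txt_proj(txt_feat), dim=-1)
        # Cross-modal alignment: encourage each image feature to match the
        # semantic embedding of its paired description (cosine similarity),
        # which mainly benefits classes with few visual samples.
        align_loss = 1.0 - (z_img * z_txt).sum(dim=-1).mean()
        # Fuse visual and semantic information for the final prediction.
        logits = self.classifier(torch.cat([z_img, z_txt], dim=-1))
        return logits, align_loss

# Usage: combine the alignment loss with a standard classification loss.
model = CrossModalSketch()
img_feat = torch.randn(8, 2048)   # e.g., pooled ResNet features (assumed)
txt_feat = torch.randn(8, 300)    # e.g., averaged word vectors of descriptional words (assumed)
labels = torch.randint(0, 172, (8,))
logits, align_loss = model(img_feat, txt_feat)
loss = F.cross_entropy(logits, labels) + 0.5 * align_loss  # 0.5 is an assumed weight
```

The balance between the classification and alignment terms, and how semantic information is handled when descriptions are unavailable at test time, are design choices of the actual method and are not specified in this sketch.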