Regular Paper

Visual Topic Semantic Enhanced Machine Translation for Multi-Modal Data Efficiency

School of Computer Science and Engineering, Southeast University, Nanjing 210096, China
School of Architecture, Southeast University, Nanjing 210096, China

Abstract

The scarcity of bilingual parallel corpora limits the exploitation of state-of-the-art supervised translation technology. One research direction is to employ relations among multi-modal data to enhance translation performance. However, the reliance on manually annotated multi-modal datasets results in a high cost of data labeling. In this paper, the topic semantics of images is proposed to alleviate the above problem. First, topic-related images can be automatically collected from the Internet by search engines. Second, topic semantics is sufficient to encode the relations between multi-modal data such as texts and images. Specifically, we propose a visual topic semantic enhanced translation (VTSE) model that utilizes topic-related images to construct a cross-lingual and cross-modal semantic space, allowing the VTSE model to simultaneously integrate syntactic structure and semantic features. In this process, topic-similar texts and images are wrapped into groups so that the model can extract more robust topic semantics from a set of similar images and then further optimize the feature integration. The results show that our model outperforms competitive baselines by a large margin on the Multi30k and Ambiguous COCO datasets. Our model can use external images to bring gains to translation, improving data efficiency.
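The following is a minimal sketch, not the authors' implementation, of the grouping idea described in the abstract: features from a set of topic-similar images are pooled into a single topic-semantic vector, which is then fused into the per-token text features. The function names, feature dimensions, mean pooling, and sigmoid-gated residual fusion are all illustrative assumptions; the VTSE model's actual architecture is defined in the paper itself.

    # Sketch only: pool topic-similar image features and gate them into text features.
    # Pooling strategy, gating, and dimensions are assumptions for illustration.
    import numpy as np

    def topic_semantic(image_feats: np.ndarray) -> np.ndarray:
        """Pool a group of topic-similar image features (n_images x d) into one vector.

        Averaging over several similar images is assumed to suppress image-specific
        noise while preserving the shared topic semantics.
        """
        return image_feats.mean(axis=0)

    def fuse(text_feats: np.ndarray, topic_vec: np.ndarray) -> np.ndarray:
        """Gate the topic vector into each token feature (n_tokens x d).

        A sigmoid gate decides, per dimension, how much visual topic information
        each token representation absorbs; fusion is residual.
        """
        gate = 1.0 / (1.0 + np.exp(-(text_feats * topic_vec)))  # element-wise gate
        return text_feats + gate * topic_vec

    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        images = rng.normal(size=(5, 512))   # e.g. 5 images retrieved for one topic
        tokens = rng.normal(size=(12, 512))  # source-sentence token features
        print(fuse(tokens, topic_semantic(images)).shape)  # (12, 512)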

Electronic Supplementary Material

Download File(s)
JCST-2101-11302-Highlights.pdf (151.3 KB)

Journal of Computer Science and Technology
Pages 1223-1236
Cite this article:
Wang C, Cai S-J, Shi B-X, et al. Visual Topic Semantic Enhanced Machine Translation for Multi-Modal Data Efficiency. Journal of Computer Science and Technology, 2023, 38(6): 1223-1236. https://doi.org/10.1007/s11390-023-1302-6

Metrics: 189 Views · 1 Crossref · 0 Web of Science · 2 Scopus · 0 CSCD

Received: 19 January 2021
Accepted: 18 November 2023
Published: 15 November 2023
© Institute of Computing Technology, Chinese Academy of Sciences 2023