Research Article | Open Access

Brain-inspired multimodal learning based on neural networks

Chang Liu, Fuchun Sun, Bo Zhang
Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China

Abstract

Modern computational models increasingly draw on advances in human brain research. This study addresses the problem of multimodal learning with the help of brain-inspired models. Specifically, a unified multimodal learning architecture based on deep neural networks is proposed, inspired by the biology of the visual cortex of the human brain. The unified framework is validated on two practical multimodal learning tasks: image captioning, which involves visual and natural-language signals, and visual-haptic fusion, which involves haptic and visual signals. Extensive experiments conducted under the framework achieve competitive results.
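The core idea of the framework is to encode each modality separately and then fuse the representations for a downstream task. The following is a minimal, illustrative sketch of such encode-then-fuse processing using toy linear encoders and concatenation fusion; the dimensions, weights, and fusion-by-concatenation choice are assumptions for illustration, not the architecture proposed in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(x, W):
    # Toy modality-specific encoder: linear projection followed by ReLU.
    return np.maximum(x @ W, 0.0)

# Hypothetical inputs: a batch of 4 samples with 64-d visual
# features and 32-d haptic features (dimensions are assumed).
x_visual = rng.normal(size=(4, 64))
x_haptic = rng.normal(size=(4, 32))

# Each modality gets its own encoder weights.
W_v = rng.normal(size=(64, 16))
W_h = rng.normal(size=(32, 16))

# Encode each modality separately, then fuse by concatenation
# into a single 32-d joint representation per sample.
z = np.concatenate([encode(x_visual, W_v), encode(x_haptic, W_h)], axis=1)

# A shared head maps the fused representation to task outputs
# (here, scores for 5 hypothetical classes).
W_out = rng.normal(size=(32, 5))
scores = z @ W_out
print(scores.shape)  # (4, 5)
```

In a trained deep network the linear encoders would be replaced by modality-appropriate subnetworks (e.g., convolutional layers for vision), but the fusion pattern is the same: per-modality encoding, a joint representation, and a shared output head.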

Brain Science Advances
Pages 61-72
Cite this article:
Liu C, Sun F, Zhang B. Brain-inspired multimodal learning based on neural networks. Brain Science Advances, 2018, 4(1): 61-72. https://doi.org/10.26599/BSA.2018.9050004