Article | Open Access

A Comprehensive Survey on Embodied Intelligence: Advancements, Challenges, and Future Perspectives

Fuchun Sun, Runfa Chen, Tianying Ji, Yu Luo, Huaidong Zhou, and Huaping Liu
Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China

Abstract

Embodied Intelligence, which integrates physical interaction capabilities with cognitive computation in real-world scenarios, provides a promising path toward Artificial General Intelligence (AGI). Recently, the landscape of embodied intelligence has grown substantially, empowering applications such as robotics, autonomous driving, and intelligent manufacturing. This paper presents a comprehensive survey of the evolution of embodied intelligence, tracing its journey from philosophical roots to contemporary advancements. We emphasize significant progress in the integration of perceptual, cognitive, and behavioral components, rather than treating these elements in isolation. Despite these advancements, several challenges remain, including hardware limitations, model generalization, physical-world understanding, multimodal integration, and ethical considerations, all of which are critical for the development of robust and reliable embodied intelligence systems. To address these challenges, we outline future research directions, emphasizing Large Perception-Cognition-Behavior (PCB) models, physical intelligence, and morphological intelligence. Central to these perspectives is a general agent framework termed Bcent, which integrates perception, cognition, and behavior dynamics. Bcent aims to enhance the adaptability, robustness, and intelligence of embodied systems, in step with ongoing progress in robotics, autonomous systems, healthcare, and beyond.

CAAI Artificial Intelligence Research
Article number: 9150042
Cite this article:
Sun F, Chen R, Ji T, et al. A Comprehensive Survey on Embodied Intelligence: Advancements, Challenges, and Future Perspectives. CAAI Artificial Intelligence Research, 2024, 3: 9150042. https://doi.org/10.26599/AIR.2024.9150042

Received: 23 June 2024
Accepted: 26 August 2024
Published: 10 December 2024
© The author(s) 2024.

The articles published in this open access journal are distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/).
