Article | Open Access

Object-to-Manipulation Graph for Affordance Navigation

Xinhang Song1,2, Bohan Wang1,2, Liye Dong1,2, Gongwei Chen1,2, Xinyun Hu1,2, Shuqiang Jiang1,2 (✉)
1 Key Lab of Intelligent Information Processing of Chinese Academy of Sciences (CAS), Institute of Computing Technology, CAS, Beijing 100190, China
2 University of Chinese Academy of Sciences, Beijing 100049, China

Abstract

Object navigation, whose goal is to guide an agent to reach specified places (or objects), has been a popular topic in embodied Artificial Intelligence (AI) research. In real-world applications, however, it is often more practical to find targets that serve a particular purpose, which raises the new requirement of locating places where a particular function can be performed. In this paper, we define a new task of affordance navigation, whose goal is to find places where the required functions can be accomplished to achieve particular effects. We first introduce a new dataset for affordance navigation, collected with the proposed affordance algorithm. To avoid high labor costs, the ground truth of each episode is annotated using the interaction data provided by the AI2-THOR simulator. In addition, we propose an affordance navigation framework in which an Object-to-Manipulation Graph (OMG) is constructed and optimized to emphasize the relevant nodes (both object nodes and manipulation nodes). Finally, a navigation policy trained by reinforcement learning guides the agent to the target places. Experimental results on the AI2-THOR simulator illustrate the effectiveness of the proposed approach, which achieves significant gains of 14.0% and 11.7% in success rate and Success weighted by Path Length (SPL), respectively, over the baseline model.
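The abstract describes two components: an Object-to-Manipulation Graph whose object and manipulation nodes are jointly optimized, and a reinforcement-learning navigation policy conditioned on that graph. The sketch below is only a rough illustration of how such a graph-conditioned policy could be wired together, using a plain graph convolution and an actor-critic head; all class names, dimensions, the action set, and the random adjacency construction are assumptions made for illustration and do not reproduce the paper's actual implementation.

```python
# Minimal sketch of an Object-to-Manipulation Graph (OMG) feeding a navigation policy.
# Hypothetical names and shapes; not the authors' implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class GraphConv(nn.Module):
    """Single graph-convolution layer: H' = ReLU(A_hat @ H @ W)."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.weight = nn.Linear(in_dim, out_dim, bias=False)

    def forward(self, node_feats, adj_norm):
        # adj_norm: row-normalized adjacency over object + manipulation nodes
        return F.relu(adj_norm @ self.weight(node_feats))


class OMGPolicy(nn.Module):
    """Toy actor-critic policy conditioned on a pooled OMG embedding."""
    def __init__(self, feat_dim, hidden_dim, num_actions):
        super().__init__()
        self.gc1 = GraphConv(feat_dim, hidden_dim)
        self.gc2 = GraphConv(hidden_dim, hidden_dim)
        self.actor = nn.Linear(hidden_dim, num_actions)   # e.g., move/rotate/done actions
        self.critic = nn.Linear(hidden_dim, 1)

    def forward(self, node_feats, adj_norm):
        h = self.gc2(self.gc1(node_feats, adj_norm), adj_norm)
        graph_emb = h.mean(dim=0)                          # pool all nodes into one state vector
        return self.actor(graph_emb), self.critic(graph_emb)


if __name__ == "__main__":
    num_obj, num_manip, feat_dim = 20, 8, 64
    num_nodes = num_obj + num_manip
    # Adjacency linking objects to the manipulations they afford, plus self-loops;
    # filled randomly here purely for illustration.
    adj = (torch.rand(num_nodes, num_nodes) > 0.8).float() + torch.eye(num_nodes)
    adj_norm = adj / adj.sum(dim=1, keepdim=True)

    feats = torch.randn(num_nodes, feat_dim)               # stand-in for visual/word features
    policy = OMGPolicy(feat_dim, hidden_dim=128, num_actions=6)
    logits, value = policy(feats, adj_norm)
    action = torch.distributions.Categorical(logits=logits).sample()
    print(action.item(), value.item())
```

For reference, the SPL metric cited in the abstract is commonly computed as SPL = (1/N) Σ_i S_i · l_i / max(p_i, l_i), where S_i indicates success of episode i, l_i is the shortest-path length to the target, and p_i is the path length the agent actually traveled.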

CAAI Artificial Intelligence Research
Article number: 9150032
Cite this article:
Song X, Wang B, Dong L, et al. Object-to-Manipulation Graph for Affordance Navigation. CAAI Artificial Intelligence Research, 2024, 3: 9150032. https://doi.org/10.26599/AIR.2024.9150032

Received: 29 May 2023
Revised: 26 December 2023
Accepted: 06 February 2024
Published: 04 July 2024
© The author(s) 2024.

The articles published in this open access journal are distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/).
