Article | Open Access

Object-to-Manipulation Graph for Affordance Navigation

Xinhang Song1,2, Bohan Wang1,2, Liye Dong1,2, Gongwei Chen1,2, Xinyun Hu1,2, Shuqiang Jiang1,2 (✉)
1 Key Lab of Intelligent Information Processing of Chinese Academy of Sciences (CAS), Institute of Computing Technology, CAS, Beijing 100190, China
2 University of Chinese Academy of Sciences, Beijing 100049, China

Abstract

Object navigation, whose goal is to guide an agent to reach specified places (or objects), has been a popular topic in embodied Artificial Intelligence (AI) research. In real-world applications, however, it is often more practical to find targets that serve a particular purpose, which raises the new requirement of locating places where a particular function can be performed. In this paper, we define a new task of affordance navigation, whose goal is to find places where the required functions can be accomplished to achieve particular effects. We first introduce a new dataset for affordance navigation, collected with the proposed affordance algorithm. To avoid high labor costs, the ground truth of each episode is annotated using the interaction data provided by the AI2-THOR simulator. In addition, we propose an affordance navigation framework in which an Object-to-Manipulation Graph (OMG) is constructed and optimized to emphasize the relevant nodes (both object nodes and manipulation nodes). Finally, a navigation policy trained by reinforcement learning guides the agent to the target places. Experimental results on the AI2-THOR simulator illustrate the effectiveness of the proposed approach, which achieves significant gains of 14.0% and 11.7% in success rate and Success weighted by Path Length (SPL), respectively, over the baseline model.
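The abstract describes two components: an Object-to-Manipulation Graph whose object and manipulation nodes are jointly optimized, and a reinforcement-learning navigation policy conditioned on that graph. The sketch below is only a rough illustration of how such a graph-conditioned policy could be wired together, using a plain graph convolution and an actor-critic head; all class names, dimensions, the action set, and the random adjacency construction are assumptions made for illustration and do not reproduce the paper's actual implementation.

```python
# Minimal sketch of an Object-to-Manipulation Graph (OMG) feeding a navigation policy.
# Hypothetical names and shapes; not the authors' implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class GraphConv(nn.Module):
    """Single graph-convolution layer: H' = ReLU(A_hat @ H @ W)."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.weight = nn.Linear(in_dim, out_dim, bias=False)

    def forward(self, node_feats, adj_norm):
        # adj_norm: row-normalized adjacency over object + manipulation nodes
        return F.relu(adj_norm @ self.weight(node_feats))


class OMGPolicy(nn.Module):
    """Toy actor-critic policy conditioned on a pooled OMG embedding."""
    def __init__(self, feat_dim, hidden_dim, num_actions):
        super().__init__()
        self.gc1 = GraphConv(feat_dim, hidden_dim)
        self.gc2 = GraphConv(hidden_dim, hidden_dim)
        self.actor = nn.Linear(hidden_dim, num_actions)   # e.g., move/rotate/done actions
        self.critic = nn.Linear(hidden_dim, 1)

    def forward(self, node_feats, adj_norm):
        h = self.gc2(self.gc1(node_feats, adj_norm), adj_norm)
        graph_emb = h.mean(dim=0)                          # pool all nodes into one state vector
        return self.actor(graph_emb), self.critic(graph_emb)


if __name__ == "__main__":
    num_obj, num_manip, feat_dim = 20, 8, 64
    num_nodes = num_obj + num_manip
    # Adjacency linking objects to the manipulations they afford, plus self-loops;
    # filled randomly here purely for illustration.
    adj = (torch.rand(num_nodes, num_nodes) > 0.8).float() + torch.eye(num_nodes)
    adj_norm = adj / adj.sum(dim=1, keepdim=True)

    feats = torch.randn(num_nodes, feat_dim)               # stand-in for visual/word features
    policy = OMGPolicy(feat_dim, hidden_dim=128, num_actions=6)
    logits, value = policy(feats, adj_norm)
    action = torch.distributions.Categorical(logits=logits).sample()
    print(action.item(), value.item())
```

For reference, the SPL metric cited in the abstract is commonly computed as SPL = (1/N) Σ_i S_i · l_i / max(p_i, l_i), where S_i indicates success of episode i, l_i is the shortest-path length to the target, and p_i is the path length the agent actually traveled.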

CAAI Artificial Intelligence Research
Article number: 9150032
Cite this article:
Song X, Wang B, Dong L, et al. Object-to-Manipulation Graph for Affordance Navigation. CAAI Artificial Intelligence Research, 2024, 3: 9150032. https://doi.org/10.26599/AIR.2024.9150032

Received: 29 May 2023
Revised: 26 December 2023
Accepted: 06 February 2024
Published: 04 July 2024
© The author(s) 2024.

The articles published in this open access journal are distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/).
