Research Article | Open Access

Detecting human-object interaction with multi-level pairwise feature network

Key Laboratory of Pervasive Computing, Ministry of Education, BNRist, Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China
College of Information Sciences and Technology, Pennsylvania State University, University Park, PA 16802, USA

Abstract

Human-object interaction (HOI) detection, which aims to infer ⟨human, action, object⟩ triplets within an image, is crucial for human-centric image understanding. Recent studies often exploit the visual features and spatial configuration of a human-object pair to learn the action linking the human and the object. We argue that this paradigm of pairwise feature extraction and action inference can be applied not only at the instance level of the whole human and object, but also at the part level, at which a body part interacts with an object, and at the semantic level, by considering the semantic label of an object along with human appearance and the human-object spatial configuration. We thus propose a multi-level pairwise feature network (PFNet) for detecting human-object interactions. The network consists of three parallel streams that characterize the HOI using pairwise features at the three levels above; the outputs of the three streams are fused to give the final action prediction. Extensive experiments show that our proposed PFNet outperforms other state-of-the-art methods on the V-COCO dataset and achieves results comparable to the state-of-the-art on the HICO-DET dataset.
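
The three-stream design described above can be made concrete with a short sketch. Below is a minimal PyTorch-style illustration of the fusion idea, assuming each stream is a small MLP over its level's pairwise features. All module names, feature dimensions, and the sum-then-sigmoid fusion rule are illustrative assumptions, not the authors' actual architecture; only the three-streams-then-fusion structure follows the abstract (26 output actions matches V-COCO's action classes).

# Minimal sketch of the three-stream fusion idea from the abstract.
# All module names, dimensions, and the fusion rule are assumptions,
# not the paper's actual implementation.
import torch
import torch.nn as nn

class PairwiseStream(nn.Module):
    """Scores actions from one level of pairwise features
    (instance level, part level, or semantic level)."""
    def __init__(self, in_dim: int, num_actions: int):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, 512), nn.ReLU(),
            nn.Linear(512, num_actions),
        )

    def forward(self, pair_feat: torch.Tensor) -> torch.Tensor:
        return self.mlp(pair_feat)  # per-action logits

class PFNetSketch(nn.Module):
    def __init__(self, inst_dim=1024, part_dim=1024, sem_dim=300, num_actions=26):
        super().__init__()
        self.instance_stream = PairwiseStream(inst_dim, num_actions)
        self.part_stream = PairwiseStream(part_dim, num_actions)
        self.semantic_stream = PairwiseStream(sem_dim, num_actions)

    def forward(self, inst_feat, part_feat, sem_feat):
        # Each stream scores the same human-object pair from its own
        # pairwise features; the three predictions are then fused
        # (here by summing logits before a sigmoid).
        logits = (self.instance_stream(inst_feat)
                  + self.part_stream(part_feat)
                  + self.semantic_stream(sem_feat))
        return torch.sigmoid(logits)

# Usage: one human-object pair per row of each feature batch.
model = PFNetSketch()
scores = model(torch.randn(8, 1024), torch.randn(8, 1024), torch.randn(8, 300))
print(scores.shape)  # torch.Size([8, 26]) -- per-action probabilities

Summing logits before a sigmoid treats HOI as a multi-label problem (one pair can support several simultaneous actions), which is common practice in HOI detection; the paper's exact fusion may differ.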

Cite this article:
Liu H, Mu T-J, Huang X. Detecting human-object interaction with multi-level pairwise feature network. Computational Visual Media, 2021, 7(2): 229-239. https://doi.org/10.1007/s41095-020-0188-2


Received: 26 June 2020
Accepted: 20 July 2020
Published: 19 October 2020
© The Author(s) 2020

This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made.

The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Other papers from this open access journal are available free of charge from http://www.springer.com/journal/41095. To submit a manuscript, please go to https://www.editorialmanager.com/cvmj.
