Hierarchical attention based spatial-temporal graph-to-sequence learning for grounded video description

Kai Shen; Lingfei Wu; Fangli Xu; Siliang Tang; Jun Xiao; Yueting Zhuang

doi:10.24963/ijcai.2020/131

IJCAI 2020

Conference paper

07 Jan 2021

Abstract

The task of Grounded Video Description (GVD) is to generate sentences whose objects can be grounded to bounding boxes in the video frames. Existing works often fail to exploit structural information, both in modeling the relationships among region proposals and in attending to them for text generation. To address these issues, we cast the GVD task as a spatial-temporal Graph-to-Sequence learning problem, in which we model video frames as a spatial-temporal sequence graph in order to better capture implicit structural relationships. In particular, we explore two ways of constructing a sequence graph that captures spatial-temporal correlations among the objects in each frame, and we further present a novel graph topology refinement technique to discover the optimal underlying graph structure. In addition, we present a hierarchical attention mechanism that attends to the sequence graph at different resolution levels to better generate the sentences. Our extensive experiments demonstrate the effectiveness of the proposed method compared to state-of-the-art methods.
