Storyboards comprising key illustrations and images help filmmakers outline ideas, key moments, and story events when filming movies. Inspired by this, we introduce Script-to-Storyboard (Sc2St), the first contextual benchmark dataset composed of storyboards that explicitly express story structures in the movie domain, and propose a contextual retrieval task to facilitate movie story understanding. Unlike existing movie datasets, the Sc2St dataset contains fine-grained and diverse texts, annotated semantic keyframes, and coherent storylines in storyboards. The contextual retrieval task takes as input a multi-sentence movie script summary together with the keyframe history, and aims to retrieve the future keyframe described by the corresponding sentence in order to form the storyboard. Compared to classic text-based visual retrieval tasks, this requires capturing context from both the description (script) and the keyframe history. We benchmark existing text-based visual retrieval methods on the new dataset and propose a recurrence-based framework with three variants for effective context encoding. Comprehensive experiments demonstrate that our methods compare favourably to existing methods; ablation studies validate the effectiveness of the proposed context encoding approaches.

Computational Visual Media 2025, 11(1): 103-122
Published: 28 February 2025