Research Article | Open Access

Script-to-Storyboard: A new contextual retrieval dataset and benchmark

Department of Computer Science, University of Bath, Bath BA2 7AY, UK
Australian Institute for Machine Learning, School of Computer Science, The University of Adelaide, Adelaide, SA 5005, Australia

Abstract

Storyboards, composed of key illustrations and images, help filmmakers outline ideas, key moments, and story events before filming. Inspired by this, we introduce Script-to-Storyboard (Sc2St), the first contextual benchmark dataset composed of storyboards that explicitly express story structure in the movie domain, and propose a contextual retrieval task to facilitate movie story understanding. Unlike existing movie datasets, Sc2St contains fine-grained and diverse texts, annotated semantic keyframes, and coherent storylines in its storyboards. The contextual retrieval task takes as input a multi-sentence movie script summary together with the keyframe history, and aims to retrieve the future keyframe described by the corresponding sentence so as to build up the storyboard. Compared with classic text-based visual retrieval tasks, this requires capturing context from both the description (script) and the keyframe history. We benchmark existing text-based visual retrieval methods on the new dataset and propose a recurrent framework with three variants for effective context encoding. Comprehensive experiments demonstrate that our methods compare favourably to existing ones, and ablation studies validate the effectiveness of the proposed context encoding approaches.
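To make the task setup concrete, the following is a minimal, illustrative sketch (not the authors' implementation) of contextual keyframe retrieval: a query sentence embedding is fused with a recurrent (GRU) summary of the keyframe history, and candidate keyframes are ranked by cosine similarity against the fused context. The module names, feature dimensions, and the choice of PyTorch are assumptions for illustration only.

# Illustrative sketch of contextual keyframe retrieval (assumed design, not the paper's model):
# fuse the current script sentence with a GRU summary of the keyframe history,
# then rank candidate keyframes by cosine similarity.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RecurrentContextRetriever(nn.Module):
    def __init__(self, text_dim=768, image_dim=2048, embed_dim=512):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, embed_dim)    # project sentence features
        self.image_proj = nn.Linear(image_dim, embed_dim)  # project keyframe features
        self.history_rnn = nn.GRU(embed_dim, embed_dim, batch_first=True)
        self.fuse = nn.Linear(2 * embed_dim, embed_dim)    # combine sentence + history context

    def forward(self, sentence_feat, history_feats, candidate_feats):
        # sentence_feat:   (B, text_dim)        current script sentence
        # history_feats:   (B, T, image_dim)    previously selected keyframes
        # candidate_feats: (B, N, image_dim)    candidate keyframes to rank
        query = self.text_proj(sentence_feat)                           # (B, D)
        hist = self.image_proj(history_feats)                           # (B, T, D)
        _, h_n = self.history_rnn(hist)                                 # (1, B, D)
        context = torch.cat([query, h_n.squeeze(0)], dim=-1)            # (B, 2D)
        context = F.normalize(self.fuse(context), dim=-1)               # (B, D)
        cands = F.normalize(self.image_proj(candidate_feats), dim=-1)   # (B, N, D)
        scores = torch.bmm(cands, context.unsqueeze(-1)).squeeze(-1)    # (B, N)
        return scores  # higher score = better match for the next keyframe

# Example: rank 100 candidate keyframes given a sentence and a 4-frame history.
model = RecurrentContextRetriever()
scores = model(torch.randn(2, 768), torch.randn(2, 4, 2048), torch.randn(2, 100, 2048))
print(scores.argmax(dim=-1))  # index of the top-ranked candidate per sample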

Computational Visual Media
Pages 103-122
Cite this article:
Tian X, Yang Y-L, Wu Q. Script-to-Storyboard: A new contextual retrieval dataset and benchmark. Computational Visual Media, 2025, 11(1): 103-122. https://doi.org/10.26599/CVM.2025.9450322