Regular Paper Issue
Emotion-Aware Music Driven Movie Montage
Journal of Computer Science and Technology 2023, 38 (3): 540-553
Published: 30 May 2023
Abstract

In this paper, we present Emotion-Aware Music Driven Movie Montage, a novel paradigm for the challenging task of generating movie montages. Specifically, given a movie and a piece of music as the guidance, our method aims to generate a montage out of the movie that is emotionally consistent with the music. Unlike previous work such as video summarization, this task requires not only video content understanding, but also emotion analysis of both the input movie and music. To this end, we propose a two-stage framework, including a learning-based module for the prediction of emotion similarity and an optimization-based module for the selection and composition of candidate movie shots. The core of our method is to align and estimate emotional similarity between music clips and movie shots in a multi-modal latent space via contrastive learning. Subsequently, the montage generation is modeled as a joint optimization of emotion similarity and additional constraints such as scene-level story completeness and shot-level rhythm synchronization. We conduct both qualitative and quantitative evaluations to demonstrate that our method can generate emotionally consistent montages and outperforms alternative baselines.
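
To make the cross-modal alignment step concrete, the following is a minimal, hypothetical sketch (not the authors' code) of a symmetric contrastive (InfoNCE-style) loss that pulls emotionally matched music-clip and movie-shot embeddings together in a shared latent space; the encoders, batch pairing, and temperature value are assumptions.

```python
# Illustrative sketch: symmetric contrastive alignment of music-clip and
# movie-shot embeddings in a shared latent space. Matched pairs sit on the
# diagonal of the similarity matrix; all other pairs act as negatives.
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(music_emb, shot_emb, temperature=0.07):
    """music_emb, shot_emb: (N, D) embeddings of N paired music clips and shots."""
    music = F.normalize(music_emb, dim=-1)
    shot = F.normalize(shot_emb, dim=-1)
    logits = music @ shot.t() / temperature          # (N, N) similarity matrix
    targets = torch.arange(music.size(0), device=logits.device)
    loss_m2s = F.cross_entropy(logits, targets)      # music -> shot direction
    loss_s2m = F.cross_entropy(logits.t(), targets)  # shot -> music direction
    return 0.5 * (loss_m2s + loss_s2m)
```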

Regular Paper Issue
A Comparative Study of CNN- and Transformer-Based Visual Style Transfer
Journal of Computer Science and Technology 2022, 37 (3): 601-614
Published: 31 May 2022
Abstract

Vision Transformers have shown impressive performance on image classification tasks. Observing that most existing visual style transfer (VST) algorithms are based on texture-biased convolutional neural networks (CNNs), we ask whether the shape-biased Vision Transformer can perform style transfer as well as CNNs. In this work, we focus on comparing and analyzing the shape bias of CNN- and transformer-based models from the perspective of VST tasks. For comprehensive comparisons, we propose three kinds of transformer-based visual style transfer (Tr-VST) methods (Tr-NST for optimization-based VST, Tr-WCT for reconstruction-based VST, and Tr-AdaIN for perceptual-based VST). By engaging three mainstream VST methods in the transformer pipeline, we show that transformer-based models pre-trained on ImageNet are not well suited to style transfer: due to their strong shape bias, these Tr-VST methods cannot render style patterns. We further analyze the shape bias by considering the influence of the learned parameters and the structure design. The results show that, with proper style supervision, the transformer can learn texture-biased features similar to those of a CNN. With the reduced shape bias in the transformer encoder, Tr-VST methods can generate higher-quality results than state-of-the-art VST methods.
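
For context, the perceptual-based variant (Tr-AdaIN) builds on the AdaIN operation; a minimal sketch of that operation is given below. The transformer encoder that replaces the CNN in Tr-AdaIN is not shown, and the (N, C, H, W) feature layout is an assumption.

```python
# Minimal sketch of the AdaIN operation underlying perceptual-based VST:
# align the channel-wise mean/std of content features to those of the style.
import torch

def adain(content_feat, style_feat, eps=1e-5):
    """content_feat, style_feat: (N, C, H, W) feature maps from an encoder."""
    c_mean = content_feat.mean(dim=(2, 3), keepdim=True)
    c_std = content_feat.std(dim=(2, 3), keepdim=True) + eps
    s_mean = style_feat.mean(dim=(2, 3), keepdim=True)
    s_std = style_feat.std(dim=(2, 3), keepdim=True) + eps
    return s_std * (content_feat - c_mean) / c_std + s_mean
```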

Open Access Research Article Issue
Non-dominated sorting based multi-page photo collage
Computational Visual Media 2022, 8 (2): 199-212
Published: 06 December 2021
Abstract

The development of social networking services (SNSs) has led to a surge in image sharing. The sharing mode of the multi-page photo collage (MPC), which posts several image collages at a time, can often be observed on many social network platforms; it enables users to upload images and arrange them in a logical order. This study focuses on constructing an MPC for an image collection and formulates it as a joint optimization problem that involves not only the arrangement within a single collage but also the arrangement among different collages. Novel balance-aware measurements, which merge graphic features and psychological findings, are introduced. A non-dominated sorting genetic algorithm is adopted to optimize the MPC guided by these measurements. Experiments demonstrate that the proposed method produces diverse, visually pleasing, and logically clear MPC results that are comparable to manually designed ones.
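
As a reference for the optimization step, here is a generic textbook sketch of the non-dominated sorting routine used by NSGA-style optimizers; objective vectors are assumed to be minimized, and this is not the paper's MPC-specific implementation.

```python
# Illustrative non-dominated sorting: partition candidate layouts into
# Pareto fronts, where fronts[0] contains the non-dominated solutions.
def non_dominated_sort(objectives):
    """objectives: list of tuples, one objective vector per candidate layout."""
    def dominates(a, b):
        return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

    remaining = list(range(len(objectives)))
    fronts = []
    while remaining:
        front = [i for i in remaining
                 if not any(dominates(objectives[j], objectives[i])
                            for j in remaining if j != i)]
        fronts.append(front)
        remaining = [i for i in remaining if i not in front]
    return fronts

# Example: (1, 2) and (2, 1) are mutually non-dominated; (3, 3) is dominated.
# non_dominated_sort([(1, 2), (2, 1), (3, 3)]) -> [[0, 1], [2]]
```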

Open Access Review Article Issue
Transformers in computational visual media: A survey
Computational Visual Media 2022, 8 (1): 33-62
Published: 27 October 2021
Abstract

Transformers, the dominant architecture for natural language processing, have also recently attracted much attention from computational visual media researchers due to their capacity for long-range representation and high performance. Transformers are sequence-to-sequence models, which use a self-attention mechanism rather than the RNN sequential structure. Thus, such models can be trained in parallel and can represent global information. This study comprehensively surveys recent visual transformer works. We categorize them according to task scenario: backbone design, high-level vision, low-level vision and generation, and multimodal learning. Their key ideas are also analyzed. Differing from previous surveys, we mainly focus on visual transformer methods in low-level vision and generation. The latest works on backbone design are also reviewed in detail. For ease of understanding, we precisely describe the main contributions of the latest works in the form of tables. As well as giving quantitative comparisons, we also present image results for low-level vision and generation tasks. Computational costs and source code links for various important works are also given in this survey to assist further development.
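
For readers unfamiliar with the mechanism the survey is organized around, the snippet below is a minimal, single-head sketch of scaled dot-product self-attention, which is what lets transformers process all tokens in parallel and capture global context; masking, dropout, and multi-head projections are omitted.

```python
# Minimal single-head self-attention sketch (no masking, no multi-head split).
import torch
import torch.nn.functional as F

def self_attention(x, w_q, w_k, w_v):
    """x: (N, L, D) token features; w_q, w_k, w_v: (D, D) projection matrices."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.transpose(-2, -1) / (x.size(-1) ** 0.5)  # (N, L, L) pairwise scores
    return F.softmax(scores, dim=-1) @ v                    # attention-weighted sum over tokens
```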

Open Access Research Article Issue
SiamCPN: Visual tracking with the Siamese center-prediction network
Computational Visual Media 2021, 7 (2): 253-265
Published: 05 April 2021
Abstract

Object detection is widely used in object tracking; anchor-free object tracking provides an end-to-end single-object tracking approach. In this study, we propose a new anchor-free network, the Siamese center-prediction network (SiamCPN). Given the referenced object's features in the initial frame, we directly predict the center point and size of the object in subsequent frames with a Siamese-structured network, without the need for per-frame post-processing operations. Unlike other anchor-free tracking approaches that are based on semantic segmentation and achieve anchor-free tracking by pixel-level prediction, SiamCPN directly obtains all the information required for tracking, greatly simplifying the model. A center-prediction sub-network is applied to multiple stages of the backbone to adaptively learn from the different branches of the Siamese network. The model can accurately predict the object location, apply appropriate corrections, and regress the size of the target bounding box. Compared with other leading Siamese networks, SiamCPN is simpler, faster, and more efficient as it uses fewer hyperparameters. Experiments demonstrate that our method outperforms other leading Siamese networks on the GOT-10K and UAV123 benchmarks, and is comparable to other excellent trackers on LaSOT, VOT2016, and OTB-100, while improving inference speed by 1.5 to 2 times.
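
To illustrate what anchor-free center prediction means in practice, the following is a hypothetical decoding sketch: pick the peak of a center heatmap and read the predicted box size at that location. The output layouts, the stride value, and the absence of an offset-refinement head are assumptions, not SiamCPN's exact design.

```python
# Hypothetical anchor-free box decoding from a center heatmap and a size map.
import torch

def decode_center_prediction(heatmap, size_map, stride=8):
    """heatmap: (1, H, W) center scores; size_map: (2, H, W) predicted (w, h) per cell."""
    h, w = heatmap.shape[-2:]
    idx = torch.argmax(heatmap.reshape(-1))
    cy, cx = divmod(idx.item(), w)                     # grid cell with the highest center score
    bw, bh = size_map[0, cy, cx].item(), size_map[1, cy, cx].item()
    center_x, center_y = cx * stride, cy * stride      # map grid cell back to image coordinates
    return center_x - bw / 2, center_y - bh / 2, bw, bh  # (x, y, w, h) box
```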

Open Access Research Article Issue
Learning to assess visual aesthetics of food images
Computational Visual Media 2021, 7 (1): 139-152
Published: 28 November 2020
Abstract

Distinguishing aesthetically pleasing food photos from others is an important visual analysis task for social media and ranking systems related to food. Nevertheless, aesthetic assessment of food images remains a challenging and relatively unexplored task, largely due to the lack of related food image datasets and practical knowledge. Thus, we present the Gourmet Photography Dataset (GPD), the first large-scale dataset for aesthetic assessment of food photos. It contains 24,000 images with corresponding binary aesthetic labels, covering a large variety of foods and scenes. We also provide a non-stationary regularization method to combat over-fitting and enhance the ability of tuned models to generalize. Quantitative results from extensive experiments, including a generalization ability test, verify that neural networks trained on the GPD achieve comparable performance to human experts on the task of aesthetic assessment. We reveal several valuable findings to support further research and applications related to visual aesthetic analysis of food images. To encourage further research, we have made the GPD publicly available at https://github.com/Openning07/GPA.

Regular Paper Issue
Facial Image Attributes Transformation via Conditional Recycle Generative Adversarial Networks
Journal of Computer Science and Technology 2018, 33 (3): 511-521
Published: 11 May 2018
Abstract

This study introduces a novel conditional recycle generative adversarial network for facial attribute transformation, which can transform high-level semantic facial attributes without changing the identity. In our approach, we feed a source facial image into the conditional generator with a target attribute condition to generate a face with the target attribute. We then recycle the generated face back through the same conditional generator with the source attribute condition, producing a face that should match the source face in both personal identity and facial attributes. Hence, we introduce a recycle reconstruction loss to enforce that the final generated facial image and the source facial image are identical. Evaluations on the CelebA dataset demonstrate the effectiveness of our approach. Qualitative results show that our approach can learn to generate high-quality identity-preserving facial images with specified attributes.
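
The recycle reconstruction idea can be summarized with a small sketch: translate the face to the target attribute, translate it back with the source attribute, and penalize deviation from the original image. The generator interface G(image, attribute) and the use of an L1 penalty are assumptions for illustration, not the paper's exact formulation.

```python
# Illustrative recycle reconstruction loss for a conditional generator G.
import torch
import torch.nn.functional as F

def recycle_reconstruction_loss(G, x_src, attr_src, attr_tgt):
    x_fake = G(x_src, attr_tgt)          # source face translated to the target attribute
    x_recycled = G(x_fake, attr_src)     # translated back with the source attribute
    return F.l1_loss(x_recycled, x_src)  # recycled face should match the source face
```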
