This paper explores the Vision Transformer (ViT) backbone for Unsupervised Domain Adaptive (UDA) person Re-Identification (Re-ID). While some recent studies have validated ViT for supervised Re-ID, no study has yet used ViT for UDA Re-ID. We observe that the ViT structure provides a unique advantage for UDA Re-ID: it has a prompt (the learnable class token) at its bottom layer that can be used to efficiently condition the deep model on the underlying domain. To exploit this advantage, we propose a novel two-stage UDA pipeline named Prompting And Tuning (PAT), which consists of a prompt learning stage and a subsequent fine-tuning stage. In the first stage, PAT roughly adapts the model from the source to the target domain by learning prompts for the two domains, while in the second stage, PAT fine-tunes the entire backbone for further adaptation to increase accuracy. Although both stages adopt pseudo labels for training, we show that they have different data preferences. With these two preferences respected, prompt learning and fine-tuning integrate well with each other and jointly yield a competitive PAT method for UDA Re-ID.
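As a rough illustration of the prompting idea (a minimal sketch, not the authors' released implementation), the PyTorch snippet below shows how a learnable per-domain prompt token could be prepended to ViT patch embeddings alongside the class token; the module name `DomainPromptedViT` and its interface are hypothetical.

```python
import torch
import torch.nn as nn

class DomainPromptedViT(nn.Module):
    """Minimal sketch: a ViT encoder conditioned on a per-domain prompt token."""

    def __init__(self, encoder, embed_dim=768, num_domains=2):
        super().__init__()
        self.encoder = encoder  # any stack of transformer blocks mapping (B, N, D) -> (B, N, D)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        # One learnable prompt per domain (e.g., 0 = source, 1 = target),
        # injected at the bottom layer as an extra token.
        self.domain_prompts = nn.Parameter(torch.zeros(num_domains, 1, embed_dim))

    def forward(self, patch_embeddings, domain_id):
        # patch_embeddings: (B, N, D) tokens from the patch-embedding layer
        b = patch_embeddings.size(0)
        cls = self.cls_token.expand(b, -1, -1)
        prompt = self.domain_prompts[domain_id].expand(b, -1, -1)
        tokens = torch.cat([cls, prompt, patch_embeddings], dim=1)  # (B, N + 2, D)
        tokens = self.encoder(tokens)
        return tokens[:, 0]  # class-token output as the person descriptor
```

In a two-stage scheme like PAT, stage one would update only `domain_prompts` (with the backbone frozen) to roughly adapt the model, and stage two would unfreeze the encoder for full fine-tuning.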
The person re-identification (re-ID) community has witnessed an explosion in the scale of data it has to handle. On the one hand, from the viewpoint of efficiency, it is important for large-scale re-ID to provide constant or sublinear search time and to dramatically reduce the storage cost of data points. On the other hand, the semantic affinity existing in the original space should be preserved, because it greatly boosts re-ID accuracy. To this end, we adopt deep hashing, which utilizes pairwise similarities and classification labels to learn deep hash mapping functions that provide discriminative representations. More importantly, considering the great advantage of asymmetric hashing over its symmetric counterpart, we propose an asymmetric deep hashing (ADH) method for large-scale re-ID. Specifically, a two-stream asymmetric convolutional neural network is constructed to learn the similarity between image pairs. An asymmetric pairwise loss is then formulated to capture the similarity between the binary hash codes and the real-valued representations derived from the deep hash mapping functions, so that the binary hash codes in the Hamming space preserve the semantic structure of the original space. The image labels are further exploited to directly influence hash function learning through a classification loss. Furthermore, an efficient alternating algorithm is carefully designed to jointly optimize the asymmetric deep hash functions and high-quality binary codes, by optimizing one set of parameters with the others fixed. Experiments on four benchmarks, i.e., DukeMTMC-reID, Market-1501, Market-1501+500k, and CUHK03, substantiate the competitive accuracy and superior efficiency of the proposed ADH over state-of-the-art methods for large-scale re-ID.
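To make the asymmetric formulation concrete, here is a minimal sketch of one plausible asymmetric pairwise loss between real-valued network outputs and binary database codes; the exact loss used by ADH may differ, and all names here are illustrative.

```python
import torch

def asymmetric_pairwise_loss(u, b, s, code_len):
    """Sketch of an asymmetric pairwise loss (illustrative, not ADH's exact form).

    u: (m, k) real-valued, tanh-activated outputs of the deep hash function
    b: (n, k) binary codes in {-1, +1} kept for the database images
    s: (m, n) pairwise similarity matrix, +1 for same identity, -1 otherwise
    """
    # Push the inner product between relaxed and binary codes toward
    # code_len * s so that Hamming distances mirror semantic similarity.
    inner = u @ b.t()  # (m, n)
    return ((inner - code_len * s) ** 2).mean()
```

An alternating scheme like the one described above would then update the network with `b` fixed, and update the binary codes (e.g., via a sign step) with the network fixed.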
A Brain-Computer Interface (BCI) aims to provide a new way for people to communicate with computers. Brain signal classification is a challenging problem owing to high-dimensional data and a low Signal-to-Noise Ratio (SNR). In this paper, a novel method is proposed to cope with this problem through sparse representation for the P300 speller paradigm. This work makes two key contributions. First, we investigate sparse coding and its feasibility for brain signal classification: training signals are used to learn the dictionaries, and test signals are classified according to their sparse representations and reconstruction errors. Second, sample selection and a channel-aware dictionary are proposed to reduce the effect of noise, which improves performance and computational efficiency simultaneously. A novel classification method from the sample-set perspective is proposed to exploit channel correlations: the brain signal of each channel is classified jointly with its spatially neighboring channels, and a novel weighted regulation strategy is proposed to suppress outliers within the group. Experimental results demonstrate that our methods are highly effective: we achieve state-of-the-art recognition rates of 72.5%, 88.5%, and 98.5% at 5, 10, and 15 epochs, respectively, on BCI Competition III Dataset II.
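The first contribution follows the classic sparse-representation-based classification recipe: code a test signal over each class dictionary and pick the class with the smallest reconstruction residual. A minimal sketch using scikit-learn's orthogonal matching pursuit is shown below; the dictionary layout and parameter values are assumptions, not the paper's exact configuration.

```python
import numpy as np
from sklearn.linear_model import OrthogonalMatchingPursuit

def classify_by_residual(signal, dictionaries, n_nonzero=10):
    """Sketch: classify a brain signal by its sparse reconstruction error.

    signal: (d,) preprocessed single-trial signal
    dictionaries: {label: (d, n_atoms) array} learned from training signals
    """
    residuals = {}
    for label, dictionary in dictionaries.items():
        omp = OrthogonalMatchingPursuit(n_nonzero_coefs=n_nonzero, fit_intercept=False)
        omp.fit(dictionary, signal)            # sparse code of the signal over this dictionary
        reconstruction = dictionary @ omp.coef_
        residuals[label] = np.linalg.norm(signal - reconstruction)
    return min(residuals, key=residuals.get)   # label with the smallest residual wins
```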
In this study, we address the problems encountered in incremental face clustering. Without the benefit of having observed the entire data distribution, incremental face clustering is more challenging than clustering a static dataset. Conventional methods rely on statistical information from previous clusters to improve the efficiency of incremental clustering; thus, errors may accumulate. Therefore, this study proposes to predict summaries of previous data directly from the data distribution via supervised learning. Moreover, an efficient framework that clusters previous summaries together with new data is explored. Although learning summaries from the original data costs more than deriving them from previous clusters, the entire framework consumes only slightly more time because clustering the current data and generating summaries for new data share most of the computation. Experiments show that the proposed approach significantly outperforms existing incremental face clustering methods, improving the average F-score from 0.644 to 0.762. Compared with state-of-the-art static face clustering methods, our method yields comparable accuracy while consuming much less time.
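A minimal sketch of the overall loop (clustering previous summaries together with newly arrived features, then producing summaries for the next round) might look like the following; here summaries are simple cluster means, whereas the paper predicts them with a supervised model, and the clustering algorithm and threshold are assumptions.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def incremental_step(summaries, new_feats, dist_threshold=1.0):
    """Sketch: one round of incremental clustering over summaries + new data.

    summaries: (m, d) summary vectors of previously seen data (may be empty)
    new_feats: (n, d) face features arriving in the current batch
    """
    pool = np.vstack([summaries, new_feats]) if len(summaries) else new_feats
    labels = AgglomerativeClustering(
        n_clusters=None, distance_threshold=dist_threshold
    ).fit_predict(pool)
    # One summary per cluster; a mean stands in for the learned summary model.
    next_summaries = np.stack(
        [pool[labels == c].mean(axis=0) for c in np.unique(labels)]
    )
    return labels, next_summaries
```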
Person re-IDentification (re-ID) is an important research topic in the computer vision community, with significance for a range of applications. Pedestrians are well-structured objects that can be partitioned into parts, although detection errors cause slightly misaligned bounding boxes, which lead to mismatches. In this paper, we study person re-identification performance using variously designed pedestrian parts instead of the horizontal partitioning routinely applied in previous hand-crafted part-based works, and thereby obtain more effective feature descriptors. Specifically, we benchmark the accuracy of individual part matching with discriminatively trained Convolutional Neural Network (CNN) descriptors on the Market-1501 dataset. We also investigate the complementarity among different parts through combination and ablation studies, and provide novel insights into this issue. Compared with the state-of-the-art, our method yields competitive accuracy when the best part combination is used, on two large-scale datasets (Market-1501 and CUHK03) and one small-scale dataset (VIPeR).
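Conceptually, the part-based descriptor amounts to cropping predefined regions, encoding each with its own discriminatively trained CNN, and concatenating the selected complementary parts. The PyTorch sketch below illustrates this; the crop boxes and network interfaces are placeholders rather than the paper's exact design.

```python
import torch

def part_descriptor(image, part_boxes, part_cnns):
    """Sketch: concatenated CNN descriptor over designed pedestrian parts.

    image: (3, H, W) pedestrian image tensor
    part_boxes: list of (top, bottom, left, right) crops, one per part design
    part_cnns: list of part-specific CNNs, each returning a (1, d) embedding
    """
    feats = []
    for (t, b, l, r), net in zip(part_boxes, part_cnns):
        crop = image[:, t:b, l:r].unsqueeze(0)  # (1, 3, h, w) part crop
        feats.append(net(crop).squeeze(0))
    return torch.cat(feats)  # matching is then a distance between descriptors
```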
Gender classification is an important task in automated face analysis. Most existing approaches for gender classification use only raw or aligned face images after face detection as input. These methods exhibit fair classification ability under constrained conditions, in which face images are acquired under similar illumination and with similar poses. Their performance may deteriorate when face images exhibit drastic variations in pose and occlusion, as routinely encountered in real-world data. This reduction in performance may be attributed to the sensitivity of features to image translations. This work proposes to alleviate this sensitivity by introducing a majority voting procedure that involves multiple face patches. Specifically, this work utilizes a deep learning method based on multiple large patches. Several Convolutional Neural Networks (CNNs) are trained on individual, predefined patches that reflect various image resolutions and partial cropping. The decisions of the individual CNNs are aggregated through majority voting to obtain an accurate final gender classification. Extensive experiments are conducted on four gender classification databases: Labeled Faces in the Wild (LFW), CelebA, ColorFeret, and the All-Age Faces database, a novel database collected by our group. Each individual patch is evaluated, and complementary patches are selected for voting. We show that the classification accuracy of our method is comparable with that of state-of-the-art systems, validating the effectiveness of the proposed method.
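The aggregation step itself is simple majority voting over per-patch CNN decisions, as in the sketch below (function and variable names are illustrative):

```python
import torch

def vote_gender(patches, patch_cnns):
    """Sketch: majority vote over patch-specific CNN gender decisions.

    patches: list of face-patch tensors, aligned index-wise with patch_cnns
    patch_cnns: list of trained CNNs, each returning logits over {female, male}
    """
    votes = [net(patch.unsqueeze(0)).argmax(dim=1).item()
             for patch, net in zip(patches, patch_cnns)]
    # The final label is the class predicted by the majority of the networks.
    return max(set(votes), key=votes.count)
```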
Person re-identification (person re-id) aims to match observations of pedestrians across different cameras. It is a challenging task in real-world surveillance systems and draws extensive attention from the community. Most existing methods are based on supervised learning, which requires a large amount of labeled data. In this paper, we develop a robust unsupervised learning approach for person re-id. We propose an improved Bag-of-Words (iBoW) model to describe and match pedestrians under different camera views. The proposed descriptor does not require any re-id labels and is robust against pedestrian variations. Experiments show the proposed iBoW descriptor outperforms other unsupervised methods. Combined with efficient metric learning algorithms, it achieves accuracy competitive with existing state-of-the-art methods on person re-identification benchmarks, including VIPeR, PRID450S, and Market-1501.
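At its core, a Bag-of-Words descriptor quantizes local features against a learned codebook and pools them into a normalized histogram; the sketch below shows this baseline step (codebook size and feature choice are assumptions, and the iBoW improvements are not reproduced here).

```python
import numpy as np
from sklearn.cluster import KMeans

def bow_descriptor(local_feats, codebook):
    """Sketch: Bag-of-Words image descriptor from local pedestrian features.

    local_feats: (n, d) local features (e.g., color/texture patches) of one image
    codebook: a fitted KMeans model whose centers act as visual words
    """
    words = codebook.predict(local_feats)  # assign each local feature to a word
    hist, _ = np.histogram(words, bins=np.arange(codebook.n_clusters + 1))
    hist = hist.astype(np.float32)
    return hist / (np.linalg.norm(hist) + 1e-12)  # L2-normalized histogram

# Codebook learned once from unlabeled training patches (size is illustrative):
# codebook = KMeans(n_clusters=350, n_init=10).fit(all_training_patches)
```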