Abstract
This paper explores the Vision Transformer (ViT) backbone for Unsupervised Domain Adaptive (UDA) person Re-Identification (Re-ID). While some recent studies have validated ViT for supervised Re-ID, no study has yet to use ViT for UDA Re-ID. We observe that the ViT structure provides a unique advantage for UDA Re-ID, i.e., it has a prompt (the learnable class token) at its bottom layer, that can be used to efficiently condition the deep model for the underlying domain. To utilize this advantage, we propose a novel two-stage UDA pipeline named Prompting And Tuning (PAT) which consists of a prompt learning stage and a subsequent fine-tuning stage. In the first stage, PAT roughly adapts the model from source to target domain by learning the prompts for two domains, while in the second stage, PAT fine-tunes the entire backbone for further adaption to increase the accuracy. Although these two stages both adopt the pseudo labels for training, we show that they have different data preferences. With these two preferences, prompt learning and fine-tuning integrated well with each other and jointly facilitated a competitive PAT method for UDA Re-ID.