Abstract: Person re-identification (Re-ID) can be regarded as a fine-grained visual classification task. Existing unsupervised person Re-ID methods typically focus solely on global features of the human body and fail to capture accurate fine-grained local features, which limits recognition accuracy. To address this issue, we propose a ViT-based fine-grained feature enhancement network. The network leverages a vision-language model to generate masks for local regions of the human body in images. Then, based on distinct interaction strategies between learnable tokens and image patches in the self-attention mechanism, the class token and the introduced learnable local tokens learn global and local fine-grained feature representations, respectively. To further strengthen feature representation, a spatial information enhancement module is designed that mines spatial contextual relationships among representative image patches within local body regions. Finally, online and offline camera-aware contrastive losses are computed separately on the extracted global and local fine-grained features to improve the model's robustness to person identities in an unsupervised setting. Experimental results on the Market-1501, MSMT17, and PersonX datasets validate the effectiveness of the proposed method, which achieves mAP/Rank-1 accuracies of 90.3%/95.9%, 59.2%/83.5%, and 91.3%/96.1%, respectively.
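The core token-interaction idea in the abstract can be illustrated with a minimal sketch: the class token attends to all patch embeddings, while each learnable local token attends only to the patches covered by its body-region mask. This is a simplified single-head NumPy illustration under assumed shapes, not the authors' implementation; the function name `token_patch_attention` and all dimensions are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def token_patch_attention(tokens, patches, region_masks):
    """Sketch of the distinct token-patch interaction strategies.

    tokens:       (1+K, D) array; row 0 is the class token, rows 1..K are
                  the learnable local tokens (K body regions).
    patches:      (N, D) array of patch embeddings.
    region_masks: (K, N) boolean array; region_masks[k, n] is True if
                  patch n lies inside body region k (e.g. from a
                  vision-language-model-generated mask).

    Returns (updated_tokens, attn): the class token aggregates all
    patches (global feature); local token k aggregates only its
    region's patches (local fine-grained feature).
    """
    N, D = patches.shape
    # Row 0 (class token) may attend everywhere; local tokens are restricted.
    allow = np.vstack([np.ones((1, N), dtype=bool), region_masks])  # (1+K, N)
    scores = tokens @ patches.T / np.sqrt(D)                        # (1+K, N)
    scores = np.where(allow, scores, -1e9)  # mask out-of-region patches
    attn = softmax(scores, axis=-1)
    return attn @ patches, attn

# Toy usage: 6 patches, 2 body regions (upper/lower), 8-dim embeddings.
rng = np.random.default_rng(0)
D, N, K = 8, 6, 2
tokens = rng.standard_normal((1 + K, D))
patches = rng.standard_normal((N, D))
masks = np.zeros((K, N), dtype=bool)
masks[0, :3] = True   # region 0 covers patches 0-2
masks[1, 3:] = True   # region 1 covers patches 3-5
out, attn = token_patch_attention(tokens, patches, masks)
```

In a full model this masking would sit inside each transformer block, and the updated class/local token outputs would feed the global and local camera-aware contrastive losses, respectively.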