Abstract: Target perception in complex scenes is one of the most important research topics for deep learning in computer vision, and vehicle detection in complex traffic scenes is studied by many researchers today. In video object detection, the temporal feature information of moving objects is often under-utilized, so temporal dependencies across long frame sequences are easily ignored. This paper proposes a spatio-temporally consistent video vehicle detection and tracking algorithm built on a two-branch network structure. The first branch consists of Transformer modules based on spatial correlation; it determines the correlation between preceding and subsequent frames, perceives the consistency between adjacent frames, and predicts the spatio-temporal consistency of the target vehicle. The second branch consists of a cross-feature pyramid fusion module; it extracts local information of the detected object by combining shallow spatial edge information with high-level semantic features, thereby capturing the spatial position of the object. By combining the Transformer mechanism with the cross-feature pyramid module, the network exploits the Transformer's sensitivity to temporal correlations across long sequences and the feature pyramid module's sensitivity to edge information, so that the long-range correlation between neighboring frames is deeply integrated with both shallow edge features and deep semantic features when detecting and tracking objects across video frames. Experimental results show that the dual-branch network structure designed in this paper achieves higher accuracy and faster convergence in video target detection and tracking. Further experiments on salient video object detection demonstrate the effectiveness and generalization ability of the algorithm.
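To make the two-branch structure concrete, the following is a minimal PyTorch sketch of the idea described above: a temporal Transformer branch that attends over per-frame features, and a cross-feature pyramid branch that fuses shallow edge features with deeper semantic features before a shared detection head. The module names (TemporalTransformerBranch, CrossPyramidBranch, DualBranchDetector), feature dimensions, and the sigmoid-gated fusion are illustrative assumptions, not the paper's exact architecture.

```python
# Minimal sketch of the dual-branch idea; names, sizes, and the fusion
# strategy are illustrative assumptions, not the authors' exact design.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TemporalTransformerBranch(nn.Module):
    """Models correlations across a short clip of per-frame feature maps."""

    def __init__(self, dim=256, heads=8, layers=2):
        super().__init__()
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=layers)

    def forward(self, clip_feats):                 # (B, T, C, H, W)
        b, t, c, h, w = clip_feats.shape
        tokens = clip_feats.flatten(3).mean(-1)    # global pool -> (B, T, C)
        tokens = self.encoder(tokens)              # temporal self-attention
        # Temporally-aware token for the current (last) frame.
        return tokens[:, -1].view(b, c, 1, 1)      # (B, C, 1, 1)


class CrossPyramidBranch(nn.Module):
    """Fuses shallow edge features with deeper semantic features (FPN-style)."""

    def __init__(self, in_dims=(64, 128, 256), dim=256):
        super().__init__()
        self.laterals = nn.ModuleList(nn.Conv2d(d, dim, 1) for d in in_dims)

    def forward(self, pyramid):                    # list of (B, C_i, H_i, W_i), fine -> coarse
        feats = [lat(p) for lat, p in zip(self.laterals, pyramid)]
        out = feats[-1]
        for f in reversed(feats[:-1]):             # top-down: upsample coarse, add fine
            out = f + F.interpolate(out, size=f.shape[-2:], mode="nearest")
        return out                                 # fused map at the finest resolution


class DualBranchDetector(nn.Module):
    def __init__(self, dim=256, num_classes=2):
        super().__init__()
        self.temporal = TemporalTransformerBranch(dim)
        self.spatial = CrossPyramidBranch(dim=dim)
        self.head = nn.Conv2d(dim, num_classes + 4, 1)  # class logits + box offsets

    def forward(self, clip_feats, pyramid):
        t = self.temporal(clip_feats)              # temporal context (B, C, 1, 1)
        s = self.spatial(pyramid)                  # spatially fused map (B, C, H, W)
        return self.head(s * torch.sigmoid(t))     # modulate spatial map by temporal cue


if __name__ == "__main__":
    model = DualBranchDetector()
    clip = torch.randn(1, 4, 256, 16, 16)          # features of a 4-frame clip
    pyr = [torch.randn(1, 64, 64, 64),             # shallow (edge-rich) level
           torch.randn(1, 128, 32, 32),
           torch.randn(1, 256, 16, 16)]            # deep (semantic) level
    print(model(clip, pyr).shape)                  # torch.Size([1, 6, 64, 64])
```

The sigmoid gating is just one plausible way to let the temporal branch modulate the spatially fused map; concatenation or attention-based fusion would serve the same illustrative purpose.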