Abstract: Video action recognition is widely used in fields such as video surveillance and autonomous driving, and the SlowFast network is a common choice for this task. Current SlowFast-based networks use attention to enhance relevant information, but they typically embed the attention mechanism between the convolutional blocks of the network. Embedding attention more deeply, inside specific convolutional layers of a convolutional block, can further strengthen the information extraction capability of the SlowFast network. First, a deeply nested attention mechanism is proposed. It contains an attention module, SCTM, that extracts spatial, temporal, and channel information, strengthening the SlowFast network's ability to extract all three kinds of information. Second, existing multi-stream networks do not let the streams fully interact when their information is fused. A multi-stream spatio-temporal information fusion network based on cross-attention and ConvLSTM is therefore proposed so that the information of each stream fully interacts. The improved SlowFast network achieves 98.5% Top-1 accuracy on UCF101 and 80.1% accuracy on HMDB51. Compared with the original SlowFast network, the SlowFast spatio-temporal information fusion network with deeply nested attention performs better in both information extraction and fusion.
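To make the idea of cross-attention between streams concrete, the following is a minimal sketch rather than the proposed network. It assumes each stream has already been reduced to a list of feature vectors of equal dimension, uses plain scaled dot-product attention with no learned projections, and omits the ConvLSTM stage entirely. The names `cross_attention`, `fuse_streams`, `slow_tokens`, and `fast_tokens` are hypothetical and chosen only for illustration.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product attention: every query attends over all keys.

    Assumption: no learned query/key/value projections are applied;
    a real model would include them.
    """
    d = len(queries[0])
    out_dim = len(values[0])
    output = []
    for q in queries:
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys
        ]
        weights = softmax(scores)
        output.append(
            [sum(w * v[j] for w, v in zip(weights, values)) for j in range(out_dim)]
        )
    return output


def fuse_streams(slow_tokens, fast_tokens):
    """Hypothetical two-way exchange: each stream queries the other one."""
    slow_enriched = cross_attention(slow_tokens, fast_tokens, fast_tokens)
    fast_enriched = cross_attention(fast_tokens, slow_tokens, slow_tokens)
    return slow_enriched, fast_enriched
```

In this sketch each stream keeps its own number of feature vectors and only their content is updated with information from the other stream, which is one simple way to let two pathways interact before any further fusion step.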