Journal of Electronic Measurement and Instrumentation

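Zhang Qiyao, Sang Haifeng. SlowFast information fusion action recognition network based on deeply nested attention mechanism[J]. Journal of Electronic Measurement and Instrumentation, 2024, 38(3): 159-166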

SlowFast information fusion action recognition network based on deeply nested attention mechanism

DOI:

Keywords: video action recognition; SlowFast; deep nesting of attention; information fusion network; spatial-channel-temporal attention

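Funding: Supported by the National Natural Science Foundation of China (62173078) and the Natural Science Foundation of Liaoning Province (2022-MS-268)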

Author	Institution
Zhang Qiyao	School of Information Science and Engineering, Shenyang University of Technology, Shenyang 110870, China
Sang Haifeng	School of Information Science and Engineering, Shenyang University of Technology, Shenyang 110870, China

Abstract:

Video action recognition is widely used in fields such as video surveillance and autonomous driving, and SlowFast is one of the most commonly used networks for the task. Existing SlowFast-based networks use attention to enhance relevant information, but they insert the attention modules between the convolutional blocks of the network. Embedding attention more deeply, inside the individual convolutional layers of each block, can further improve the information extraction capability of SlowFast. This paper first proposes a deeply nested attention mechanism. It contains an attention module, SCTM, that extracts spatio-temporal and channel information, strengthening the network's extraction of all three kinds of information (spatial, temporal, and channel). In addition, the information fused by current multi-stream networks does not interact sufficiently and is not fully processed. To address this, a multi-stream spatio-temporal information fusion network based on cross-attention and ConvLSTM is proposed, so that the information in each stream interacts fully. The improved SlowFast network reaches 98.5% Top-1 accuracy on UCF101 and 80.1% accuracy on HMDB51, outperforming existing models on both datasets and exceeding the original SlowFast network by 2.64%. These results indicate that the deeply nested attention SlowFast spatio-temporal information fusion network performs well in both information extraction and information fusion.
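
The abstract names cross-attention as the mechanism that lets the streams exchange information but does not give its exact form. As a rough illustration only, the sketch below applies generic scaled dot-product cross-attention between two token sets standing in for Slow and Fast pathway features. Everything in it is an assumption made for illustration and is not taken from the paper: the function names (`cross_attention`, `fuse_streams`), the absence of learned projections, the residual addition, and the two-way attention layout. The ConvLSTM stage is omitted entirely.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Generic scaled dot-product cross-attention (no learned projections).

    queries: list of d-dim vectors from one stream
    keys, values: lists of vectors from the other stream
    Returns one attended vector per query.
    """
    scale = 1.0 / math.sqrt(len(queries[0]))
    attended = []
    for q in queries:
        # Similarity of this query to every key in the other stream.
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
        weights = softmax(scores)
        # Weighted sum of the other stream's value vectors.
        out = [
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ]
        attended.append(out)
    return attended


def fuse_streams(slow_tokens, fast_tokens):
    """Hypothetical two-way fusion: each stream attends to the other,
    and the attended context is added back as a residual."""
    slow_ctx = cross_attention(slow_tokens, fast_tokens, fast_tokens)
    fast_ctx = cross_attention(fast_tokens, slow_tokens, slow_tokens)
    slow_out = [[a + b for a, b in zip(t, c)] for t, c in zip(slow_tokens, slow_ctx)]
    fast_out = [[a + b for a, b in zip(t, c)] for t, c in zip(fast_tokens, fast_ctx)]
    return slow_out, fast_out
```

Calling `fuse_streams` with two Slow tokens and four Fast tokens of the same dimension returns two and four fused tokens respectively, so each stream keeps its own temporal length while carrying context from the other. A real implementation would operate on feature maps with learned query, key, and value projections.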
