Journal of Electronic Measurement and Instrumentation

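Liu Tie, Duan Yong. Robot indoor scene recognition based on fusion of CNN and Transformer[J]. Journal of Electronic Measurement and Instrumentation, 2023, 37(5): 223-229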

Robot indoor scene recognition based on fusion of CNN and Transformer

DOI：

Keywords: CNN; Transformer; robot; scene recognition; local feature

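Funding: Supported by the Liaoning Province Support Program for Outstanding Science and Technology Talents in Higher Education Institutions (LR15045) and the General Scientific Research Project of the Education Department of Liaoning Province (LJKZ0139)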

Author	Institution
Liu Tie	1. School of Information Science and Engineering, Shenyang University of Technology
Duan Yong	1. School of Information Science and Engineering, Shenyang University of Technology

Abstract views: 1194

Full-text downloads: 2511

Abstract:

To improve the accuracy of robot scene recognition in complex indoor environments, this paper proposes a robot indoor scene recognition model that fuses a convolutional neural network (CNN) with a visual Transformer structure. The model uses the CNN to extract local features of the scene and then uses the visual Transformer structure to capture long-range dependencies among those features. The proposed visual Transformer structure consists of three parts: a feature encoding structure (Attention Embedding), an Encoder structure, and a structure that converts high-level semantic features into pixel-level features (Attention Project). In this model, the CNN strengthens the visual Transformer's ability to describe local detail features, while the visual Transformer helps the CNN build dependencies between distant features, so the model can effectively represent and exploit the visual features of images of the robot's working scenes. Finally, experiments on a dataset collected by the robot in its actual working environment and on the open-source COLD dataset verify the effectiveness of the model, which achieves higher scene recognition accuracy.
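
The data flow described in the abstract (CNN local features, then Attention Embedding, Encoder, and Attention Project) can be illustrated with a minimal pure-Python sketch. Everything below is an assumption made for illustration: the abstract does not specify layer sizes, projections, or how the three Transformer parts are implemented, so a single 3x3 convolution stands in for the CNN, the attention uses identity query/key/value projections, and all function names are hypothetical rather than the authors' code.

```python
import math
import random


def conv3x3(image, kernel):
    """Stand-in for the CNN stage: one zero-padded 3x3 convolution followed by ReLU."""
    h, w = len(image), len(image[0])
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            acc = 0.0
            for di in (-1, 0, 1):
                for dj in (-1, 0, 1):
                    y, x = i + di, j + dj
                    if 0 <= y < h and 0 <= x < w:
                        acc += image[y][x] * kernel[di + 1][dj + 1]
            out[i][j] = max(acc, 0.0)
    return out


def attention_embedding(feature_maps):
    """Hypothetical 'Attention Embedding': C maps of size HxW -> H*W tokens of length C."""
    h, w = len(feature_maps[0]), len(feature_maps[0][0])
    return [[fm[i][j] for fm in feature_maps] for i in range(h) for j in range(w)]


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def encoder(tokens):
    """Assumed Encoder: single-head self-attention (identity Q/K/V) plus a residual."""
    d = len(tokens[0])
    out = []
    for q in tokens:
        # Every token attends to every other token, which is what links distant positions.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        weights = softmax(scores)
        mixed = [sum(wt * v[c] for wt, v in zip(weights, tokens)) for c in range(d)]
        out.append([a + b for a, b in zip(q, mixed)])
    return out


def attention_project(tokens, h, w):
    """Hypothetical 'Attention Project': tokens -> C pixel-level maps of size HxW."""
    c = len(tokens[0])
    return [[[tokens[i * w + j][ch] for j in range(w)] for i in range(h)] for ch in range(c)]


def hybrid_features(image, kernels):
    """Chain the four stages: local features, tokens, global mixing, back to maps."""
    h, w = len(image), len(image[0])
    local_maps = [conv3x3(image, k) for k in kernels]
    tokens = encoder(attention_embedding(local_maps))
    return attention_project(tokens, h, w)


random.seed(0)
image = [[random.random() for _ in range(6)] for _ in range(6)]  # toy 6x6 grayscale input
kernels = [[[random.uniform(-1, 1) for _ in range(3)] for _ in range(3)] for _ in range(4)]
maps = hybrid_features(image, kernels)
print(len(maps), len(maps[0]), len(maps[0][0]))  # channels, height, width
```

The output keeps the spatial layout of the CNN feature maps, so in a full model it could feed further convolutional layers or a classification head; that downstream part is not described in the abstract and is omitted here.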
