Abstract: In monocular 3D detection, two problems must be addressed: the complexity of the network structure and the inaccuracy of the target depth obtained from depth estimation. To address these issues, we propose an end-to-end monocular 3D object detection network with joint multi-attention depth estimation, named CDCN-3D. First, to extract the salient features of the target, we introduce an adaptive spatial attention mechanism that aggregates pixel features, enhancing local features and improving the network's representation ability. Second, we use an improved C-ASPP module to address the loss of local information in depth estimation, capturing more accurate direction-aware and position-sensitive information for each depth estimate. Finally, an accurate P-BEV projection maps the 3D information of the target onto a 2D plane, and a single-stage detector completes the detection and output task. Experiments on the KITTI dataset show that the proposed CDCN-3D network achieves higher accuracy than other networks while running at the same FPS as existing monocular 3D detection networks. Specifically, the detection accuracy of CDCN-3D improves by 2.31%, 1.48%, and 1.14% for the Car, Pedestrian, and Cyclist classes, respectively, demonstrating that it can effectively complete the 3D object detection task.
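The abstract does not specify the internals of the improved C-ASPP module; as a rough illustration of the general ASPP-style idea it builds on (parallel dilated convolutions aggregating multi-scale context over depth features), the following PyTorch sketch may help. The class name ASPPSketch, the dilation rates, and the channel sizes are assumptions for illustration only, not the paper's C-ASPP configuration.

```python
import torch
import torch.nn as nn

class ASPPSketch(nn.Module):
    """Minimal ASPP-style block: parallel dilated 3x3 convolutions whose
    outputs are concatenated and fused by a 1x1 convolution.
    Dilation rates and channel counts are illustrative assumptions."""
    def __init__(self, in_ch=256, out_ch=256, rates=(1, 6, 12, 18)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(in_ch, out_ch, kernel_size=3,
                          padding=r, dilation=r, bias=False),
                nn.BatchNorm2d(out_ch),
                nn.ReLU(inplace=True),
            )
            for r in rates
        ])
        # Fuse the multi-scale context gathered by all branches.
        self.fuse = nn.Conv2d(out_ch * len(rates), out_ch, kernel_size=1)

    def forward(self, x):
        return self.fuse(torch.cat([b(x) for b in self.branches], dim=1))

# Usage on a dummy depth-feature map (batch 1, 256 channels, 48x160).
feats = torch.randn(1, 256, 48, 160)
print(ASPPSketch()(feats).shape)  # torch.Size([1, 256, 48, 160])
```

Because each 3x3 branch pads by its own dilation rate, the spatial resolution of the depth-feature map is preserved, which is what allows the fused output to feed a per-pixel depth or BEV projection stage.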