Abstract: Single-modal semantic segmentation of visible (RGB) images performs poorly at night or under changing illumination, often yielding blurred target edges, and existing cross-modal segmentation networks still fall short in capturing global context information and fusing cross-modal features. To address these problems, this paper proposes a cross-modal semantic segmentation algorithm based on dual-branch multi-scale feature fusion. Segformer is used as the backbone network to extract features and capture long-range dependencies. A feature enhancement module improves the contrast of shallow feature maps and the discriminability of edge information. An attention enhancement module and a cross-modal feature fusion module model the relationships between pixels across the two modalities' feature maps and aggregate complementary information, fully exploiting the strengths of cross-modal features. Finally, a lightweight All-MLP decoder reconstructs the image and predicts the segmentation result. Compared with mainstream algorithms in the existing literature, the proposed algorithm achieves the best evaluation metrics on the MFNet urban street-scene dataset, with mAcc and mIoU reaching 76.9% and 59.8%, respectively. Experimental results show that the proposed algorithm effectively alleviates unclear segmentation of target edge contours and improves segmentation accuracy in complex scenes.
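To make the pipeline summarized above concrete, the following is a minimal PyTorch sketch of the overall data flow: two parallel encoder branches, a per-scale cross-modal fusion step, and an All-MLP-style decoder that projects and upsamples the multi-scale fused features. Everything here is an illustrative stand-in, not the paper's implementation: the encoder stub replaces the actual Segformer stages, the fusion module is a generic channel-attention fusion rather than the paper's specific modules, and the class count, channel sizes, and the assumption that the second branch takes a (channel-replicated) thermal image, as in MFNet, are all assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class EncoderStage(nn.Module):
    """Stand-in for one backbone stage: 2x downsample + channel expansion."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(c_in, c_out, 3, stride=2, padding=1),
            nn.BatchNorm2d(c_out),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.conv(x)


class CrossModalFusion(nn.Module):
    """Illustrative fusion: channel attention over concatenated RGB/thermal features."""
    def __init__(self, c):
        super().__init__()
        self.attn = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(2 * c, c, 1), nn.ReLU(inplace=True),
            nn.Conv2d(c, 2 * c, 1), nn.Sigmoid(),
        )
        self.proj = nn.Conv2d(2 * c, c, 1)

    def forward(self, f_rgb, f_t):
        f = torch.cat([f_rgb, f_t], dim=1)
        return self.proj(f * self.attn(f))  # reweight channels, then reduce back to c


class DualBranchSegNet(nn.Module):
    def __init__(self, channels=(32, 64, 128, 256), n_classes=9, embed=128):
        super().__init__()
        dims = (3,) + channels
        self.rgb = nn.ModuleList(EncoderStage(dims[i], dims[i + 1]) for i in range(4))
        self.th = nn.ModuleList(EncoderStage(dims[i], dims[i + 1]) for i in range(4))
        self.fuse = nn.ModuleList(CrossModalFusion(c) for c in channels)
        # All-MLP-style decoder: per-scale 1x1 projection, upsample, concat, classify
        self.mlps = nn.ModuleList(nn.Conv2d(c, embed, 1) for c in channels)
        self.head = nn.Conv2d(4 * embed, n_classes, 1)

    def forward(self, rgb, thermal):
        fused, x_r, x_t = [], rgb, thermal
        for enc_r, enc_t, fuse in zip(self.rgb, self.th, self.fuse):
            x_r, x_t = enc_r(x_r), enc_t(x_t)
            fused.append(fuse(x_r, x_t))  # one fused feature map per scale
        size = fused[0].shape[-2:]
        up = [F.interpolate(mlp(f), size=size, mode="bilinear", align_corners=False)
              for mlp, f in zip(self.mlps, fused)]
        logits = self.head(torch.cat(up, dim=1))
        # restore input resolution for per-pixel prediction
        return F.interpolate(logits, size=rgb.shape[-2:], mode="bilinear",
                             align_corners=False)


if __name__ == "__main__":
    net = DualBranchSegNet()
    rgb = torch.randn(1, 3, 480, 640)      # MFNet images are 480x640
    thermal = torch.randn(1, 3, 480, 640)  # single-channel thermal replicated to 3 (assumption)
    print(net(rgb, thermal).shape)         # torch.Size([1, 9, 480, 640])
```

The per-scale fusion mirrors the dual-branch multi-scale design described in the abstract: each encoder stage produces modality-specific features that are merged before decoding, so complementary cues from both modalities survive at every resolution.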