Abstract: To address the low accuracy caused by single-branch extraction of deep features in current bird sound recognition methods, this study proposed a DenseNet-based bird sound recognition method with feature fusion. First, the Mel-spectrogram was extracted from the bird sound signal as the network input. Then, DenseNet was used as the base network, and a convolutional block attention module (CBAM) was integrated into the standard convolutional layers of all dense blocks. The CBAM learns feature representations from the training set, determines the importance and correlation of bird song feature information at different levels, and weights and fuses them along the channel and spatial dimensions, making the network pay more attention to the important feature channels and spatial positions in bird song features. Next, a dropout block algorithm was added after the standard convolutional layers of the dense blocks, which encourages the network to learn features from different regions in a more balanced manner, improves its adaptability to new bird song data, and enables it to better capture common features in the data. Subsequently, a deep feature extraction branch using a transformer encoder was established alongside DenseNet to enhance the network's ability to capture global information and long-range dependencies in bird song features. Finally, the deep features extracted by the two branches were fused to enrich the information content of the deep features. The method was evaluated in seven experiment sets on the Xeno-Canto dataset. Results on the test set show that the proposed method achieves an average accuracy of 88.65%, which is 10.83% higher than EMSCNN, 20.14% higher than AlexNet, 16.3% higher than VGGNet, and 4.28% higher than DenseNet. The experiments demonstrate the effectiveness and superiority of the proposed method, which outperforms the compared deep learning methods in recognition performance.
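For concreteness, a minimal sketch of the Mel-spectrogram front end is given below, using librosa; the sampling rate, FFT size, hop length, and number of Mel bands are illustrative assumptions rather than parameters taken from this work.

```python
import librosa
import numpy as np

def mel_spectrogram(path, sr=22050, n_fft=2048, hop_length=512, n_mels=128):
    """Load a bird sound recording and compute a log-scaled Mel-spectrogram."""
    y, sr = librosa.load(path, sr=sr)  # resample to a fixed rate
    S = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft,
                                       hop_length=hop_length, n_mels=n_mels)
    return librosa.power_to_db(S, ref=np.max)  # power spectrogram to dB scale
```

The resulting (n_mels x frames) matrix is what the networks below consume, typically replicated to three channels to match an ImageNet-style input.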
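The channel and spatial weighting described above follows the standard CBAM formulation. The PyTorch sketch below is a generic CBAM block of the kind inserted into the dense blocks; the reduction ratio and spatial kernel size are common defaults, assumed here because the abstract does not specify them.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CBAM(nn.Module):
    """Convolutional block attention module: channel attention followed by
    spatial attention, each applied as a multiplicative weighting."""
    def __init__(self, channels, reduction=16, kernel_size=7):
        super().__init__()
        # Channel attention: shared MLP over avg- and max-pooled descriptors
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1, bias=False),
        )
        # Spatial attention: conv over channel-wise avg and max maps
        self.spatial = nn.Conv2d(2, 1, kernel_size,
                                 padding=kernel_size // 2, bias=False)

    def forward(self, x):
        # Weight channels by importance learned from pooled descriptors
        avg = self.mlp(F.adaptive_avg_pool2d(x, 1))
        mx = self.mlp(F.adaptive_max_pool2d(x, 1))
        x = x * torch.sigmoid(avg + mx)
        # Weight spatial positions by importance
        s = torch.cat([x.mean(dim=1, keepdim=True),
                       x.max(dim=1, keepdim=True).values], dim=1)
        return x * torch.sigmoid(self.spatial(s))
```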
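The "dropout block" regularization reads like DropBlock, which zeroes contiguous spatial regions of a feature map instead of independent units, forcing the network to rely on features from other regions. A simplified sketch under that assumption follows; the drop probability and block size are illustrative, and the seed-probability computation omits the boundary correction from the original DropBlock formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DropBlock2d(nn.Module):
    """Simplified DropBlock: drops contiguous block_size x block_size regions
    (odd block_size assumed so padding preserves the spatial shape)."""
    def __init__(self, drop_prob=0.1, block_size=5):
        super().__init__()
        self.drop_prob = drop_prob
        self.block_size = block_size

    def forward(self, x):
        if not self.training or self.drop_prob == 0.0:
            return x
        # Seed probability chosen so the expected dropped area ~ drop_prob
        gamma = self.drop_prob / (self.block_size ** 2)
        seeds = (torch.rand_like(x[:, :1]) < gamma).float()
        # Grow each seed into a block_size x block_size dropped region
        mask = F.max_pool2d(seeds, self.block_size, stride=1,
                            padding=self.block_size // 2)
        keep = 1.0 - mask
        # Rescale activations to keep the expected magnitude unchanged
        return x * keep * keep.numel() / keep.sum().clamp(min=1.0)
```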
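The abstract does not detail how the transformer-encoder branch is wired, so the following sketch is one plausible arrangement: tokens are taken from the projected DenseNet feature map, encoded, pooled, and concatenated with the globally pooled CNN features before classification. All names and hyperparameters (DualBranchNet, d_model, head and layer counts) are hypothetical, and the CBAM and DropBlock insertions shown above are omitted for brevity.

```python
import torch
import torch.nn as nn
from torchvision.models import densenet121

class DualBranchNet(nn.Module):
    """Two-branch classifier: DenseNet features plus a transformer-encoder
    branch, with the two pooled deep feature vectors fused by concatenation."""
    def __init__(self, num_classes, d_model=256):
        super().__init__()
        self.cnn = densenet121(weights=None).features   # DenseNet-121 trunk
        self.proj = nn.Conv2d(1024, d_model, 1)         # match token width
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=8,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.fc = nn.Linear(d_model * 2, num_classes)

    def forward(self, x):                      # x: (B, 3, H, W) spectrogram
        f = self.proj(self.cnn(x))             # (B, d_model, h, w)
        cnn_feat = f.mean(dim=(2, 3))          # global average pooling
        tokens = f.flatten(2).transpose(1, 2)          # (B, h*w, d_model)
        trans_feat = self.encoder(tokens).mean(dim=1)  # pooled token features
        return self.fc(torch.cat([cnn_feat, trans_feat], dim=1))
```

Concatenation is the simplest fusion of the two deep feature vectors; the paper's exact fusion operator is not stated in the abstract.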