Abstract: In heavy industries such as non-ferrous metallurgy, detecting hazardous gas leaks is crucial for employee safety and stable production. Traditional single-modal detection methods often lose accuracy in complex industrial environments because they handle disturbances poorly, particularly under noisy conditions. To address this, this paper introduces a multimodal gas leak detection model designed for industrial environments. The model integrates smoke sensor data and infrared image data, leveraging the complementary strengths of the two sources to improve detection accuracy and robustness. First, a gMLP architecture captures complex patterns in the sensor data, while a Swin Transformer extracts local and global features from the infrared images. A fusion strategy based on multi-head attention then combines the latent representations of the two modalities to detect hazardous gas. Experiments on multimodal gas datasets in both normal and noisy environments show that the model achieves a detection accuracy of 97.92%. Compared with single-modal methods, the multimodal approach significantly improves detection accuracy and robustness in complex industrial gas leak detection scenarios.
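The fusion step can be illustrated with a minimal sketch. It assumes that each encoder has already produced one fixed-length embedding per modality, that the two embeddings are treated as a two-token sequence, and that fusion is self-attention over those tokens followed by mean pooling. The abstract does not specify the exact layout, so the function names, the embedding size of 8, the two heads, and the random (untrained) weights are all hypothetical placeholders rather than the paper's configuration.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def random_matrix(rows, cols, rng):
    """Random weights standing in for trained parameters (assumption)."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def multi_head_fusion(tokens, num_heads, rng):
    """Self-attention across modality tokens, then mean pooling.

    tokens: list of modality embeddings, each of length d_model.
    Returns (fused_vector, attention_weights_per_head).
    """
    d_model = len(tokens[0])
    assert d_model % num_heads == 0, "d_model must divide evenly across heads"
    d_head = d_model // num_heads

    # Query, key, value, and output projections (hypothetical, untrained).
    w_q = random_matrix(d_model, d_model, rng)
    w_k = random_matrix(d_model, d_model, rng)
    w_v = random_matrix(d_model, d_model, rng)
    w_o = random_matrix(d_model, d_model, rng)

    q, k, v = matmul(tokens, w_q), matmul(tokens, w_k), matmul(tokens, w_v)

    head_outputs, all_weights = [], []
    for h in range(num_heads):
        lo, hi = h * d_head, (h + 1) * d_head
        q_h = [row[lo:hi] for row in q]
        k_h = [row[lo:hi] for row in k]
        v_h = [row[lo:hi] for row in v]
        k_t = [list(col) for col in zip(*k_h)]  # transpose keys
        scores = matmul(q_h, k_t)               # tokens x tokens
        weights = [softmax([s / math.sqrt(d_head) for s in row]) for row in scores]
        head_outputs.append(matmul(weights, v_h))
        all_weights.append(weights)

    # Concatenate the heads for each token, then apply the output projection.
    concat = [sum((head[i] for head in head_outputs), []) for i in range(len(tokens))]
    projected = matmul(concat, w_o)

    # Mean-pool over the modality tokens to get one fused vector.
    fused = [sum(col) / len(col) for col in zip(*projected)]
    return fused, all_weights


rng = random.Random(0)
# Placeholder embeddings standing in for the gMLP and Swin Transformer outputs.
sensor_embedding = [rng.uniform(-1, 1) for _ in range(8)]
image_embedding = [rng.uniform(-1, 1) for _ in range(8)]

fused, attn = multi_head_fusion([sensor_embedding, image_embedding], num_heads=2, rng=rng)
print(len(fused), len(attn))
```

In a trained model the fused vector would feed a classification head, and the per-head attention weights indicate how much each modality draws on the other for a given sample.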