Abstract: Industrial gas leak detection remains a critical challenge, and existing methods rely predominantly on single-modality data. This reliance neglects the complementary nature of different modalities and limits accurate, robust leak detection in complex environments. To address these limitations, this study proposes the Multimodal Fusion Transformer (MFT), a gas leak detection model that integrates data from multiple industrial modalities. MFT employs two distinct feature encoders, each tailored to the characteristics of its modality, to extract modality-specific features. A multi-head attention mechanism then fuses the resulting latent representations so that the complementary information from each modality is combined. Experimental results show that the proposed method significantly improves the accuracy and robustness of gas leak detection: MFT achieves 98.05% accuracy on the publicly available MultimodalGasData dataset. These results indicate that fusing complementary modalities can improve the reliability and performance of industrial gas leak detection systems.
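
To make the fusion step concrete, the following is a minimal sketch of multi-head attention applied to the latent representations of two modalities. It is an illustration under stated assumptions, not the MFT implementation: the abstract does not specify the encoders, the dimensions, or how attention is wired between modalities. Here the two sets of latent tokens are simply concatenated and passed through multi-head self-attention, then mean-pooled. The names `fuse`, `latent_a`, `latent_b`, and `num_heads` are hypothetical, and the projection weights are random (untrained), so only the data flow is meaningful.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def random_matrix(rows, cols, rng):
    """Random projection matrix; stands in for learned weights (assumption)."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def multi_head_attention(tokens, num_heads, rng):
    """Scaled dot-product self-attention with several heads.

    tokens: list of n vectors, each of length d_model.
    Returns (output tokens of shape n x d_model, per-head attention weights).
    """
    d_model = len(tokens[0])
    assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
    d_head = d_model // num_heads
    head_outputs, head_weights = [], []
    for _ in range(num_heads):
        # Each head has its own query, key, and value projections.
        q = matmul(tokens, random_matrix(d_model, d_head, rng))
        k = matmul(tokens, random_matrix(d_model, d_head, rng))
        v = matmul(tokens, random_matrix(d_model, d_head, rng))
        k_t = [list(col) for col in zip(*k)]  # transpose of k
        scores = matmul(q, k_t)               # n x n similarity matrix
        weights = [softmax([s / math.sqrt(d_head) for s in row]) for row in scores]
        head_outputs.append(matmul(weights, v))
        head_weights.append(weights)
    # Concatenate the heads per token, then apply the output projection.
    concat = [sum((h[i] for h in head_outputs), []) for i in range(len(tokens))]
    return matmul(concat, random_matrix(d_model, d_model, rng)), head_weights


def fuse(latent_a, latent_b, num_heads=2, seed=0):
    """Fuse two modalities' latent tokens (hypothetical wiring, see lead-in)."""
    rng = random.Random(seed)
    tokens = latent_a + latent_b  # concatenate along the token axis
    attended, weights = multi_head_attention(tokens, num_heads, rng)
    d_model = len(attended[0])
    # Mean-pool over tokens to obtain one fused vector for a classifier head.
    pooled = [sum(row[j] for row in attended) / len(attended) for j in range(d_model)]
    return pooled, weights


# Illustrative latent tokens: one from modality A, two from modality B.
latent_a = [[0.1, 0.2, 0.3, 0.4]]
latent_b = [[0.5, 0.1, 0.0, 0.2], [0.3, 0.3, 0.1, 0.9]]
fused, attn = fuse(latent_a, latent_b)
```

Because every token attends to tokens from both modalities, each row of the attention matrix mixes information across them, which is the sense in which attention can combine complementary information. A trained model would learn the projection matrices and would typically feed the fused vector to a classification layer.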