Abstract: To address the challenges posed by variability and missing samples in multimodal data, we introduce a Multi-modal Information Fusion (MIF) technique that leverages both vibration and infrared image data, enabling effective and rapid assessment of power transformer fault states. First, a bidirectional gated recurrent unit (BGRU) extracts features from the vibration signal data, the frequency-domain images derived from the vibrations, and the infrared images captured from the power transformer, yielding feature vectors for each modality. Next, a cross-attention mechanism establishes relationships among these modalities and fuses their feature vectors. Finally, a combination of convolutional and fully connected layers determines the fault status of the transformer. The experimental data come from a 10 kV power transformer and comprise vibration signals and infrared images. Comparative analysis shows that the MIF method outperforms benchmark techniques on four evaluation metrics, achieving a fault diagnosis accuracy of 96%. Furthermore, MIF demonstrates its robustness by delivering highly reliable diagnostic results under varying voltage and current conditions, offering a promising solution for fault detection from multimodal transformer data.
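The pipeline outlined in the abstract (per-modality BGRU encoders, cross-attention fusion, then a convolutional and fully connected classifier) can be sketched as follows. This is a minimal illustration, not the paper's implementation: all layer sizes, the number of attention heads, and the choice of which modality supplies the queries are assumptions, since the abstract does not specify them.

```python
import torch
import torch.nn as nn

# Hypothetical dimensions -- the paper's actual layer sizes are not given.
SEQ_LEN, FEAT_DIM, HIDDEN, N_CLASSES = 32, 16, 64, 4

class MIFSketch(nn.Module):
    """Sketch of the described pipeline: per-modality BGRU encoders,
    cross-attention fusion, then a convolutional + fully connected head."""
    def __init__(self):
        super().__init__()
        # One bidirectional GRU per modality (vibration signal, vibration
        # frequency image, infrared image); each input is treated here as a
        # feature sequence of shape (batch, SEQ_LEN, FEAT_DIM).
        self.encoders = nn.ModuleList(
            [nn.GRU(FEAT_DIM, HIDDEN, batch_first=True, bidirectional=True)
             for _ in range(3)]
        )
        # Cross-attention: one modality's features attend to the others.
        self.cross_attn = nn.MultiheadAttention(2 * HIDDEN, num_heads=4,
                                                batch_first=True)
        self.conv = nn.Conv1d(2 * HIDDEN, HIDDEN, kernel_size=3, padding=1)
        self.fc = nn.Linear(HIDDEN, N_CLASSES)

    def forward(self, vib_sig, vib_freq, ir_img):
        feats = [enc(x)[0] for enc, x in
                 zip(self.encoders, (vib_sig, vib_freq, ir_img))]
        # One plausible wiring: vibration-signal features act as queries,
        # the concatenated other modalities as keys/values.
        kv = torch.cat(feats[1:], dim=1)
        fused, _ = self.cross_attn(feats[0], kv, kv)
        # Convolutional layer over the fused sequence, pooled, then FC head.
        h = torch.relu(self.conv(fused.transpose(1, 2))).mean(dim=2)
        return self.fc(h)  # fault-class logits

model = MIFSketch()
x = torch.randn(2, SEQ_LEN, FEAT_DIM)  # stand-in features for each modality
logits = model(x, x, x)
print(logits.shape)  # torch.Size([2, 4])
```

In practice each modality would pass through its own preprocessing (e.g. a CNN backbone for the infrared images) before the BGRU stage; the shared input shape here is purely for brevity.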