Journal of Electronic Measurement and Instrumentation (电子测量与仪器学报)

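Li Ke, Wang Yajing, Zan Zhihui, Qi Ruijie. Speech enhancement based on residual dilated convolution and gated codec networks[J]. Journal of Electronic Measurement and Instrumentation, 2025, 39(4): 74-83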

Speech enhancement based on residual dilated convolution and gated codec networks

DOI:

Keywords: speech enhancement; deep learning; codec network; dilated convolution; gating mechanism

Funding: Supported by the Natural Science Foundation of Shandong Province (ZR2024MD031)

Author	Institution
Li Ke	School of Computer Science and Technology, Shandong University of Technology, Zibo 255049, China
Wang Yajing	School of Electrical and Electronic Engineering, Shandong University of Technology, Zibo 255049, China
Zan Zhihui	School of Computer Science and Technology, Shandong University of Technology, Zibo 255049, China
Qi Ruijie	School of Computer Science and Technology, Shandong University of Technology, Zibo 255049, China

Abstract views: 44

Full-text downloads: 81

Abstract:

The temporal dependencies and contextual information of speech signals are crucial in speech enhancement. To address the poor enhancement that results when encoder-decoder (codec) networks capture these features insufficiently, an asymmetric residual dilated convolution and gated codec network (RD-EGN) is constructed, comprising an encoder, an intermediate layer, and a decoder. The encoder uses a causal convolutional layer structure to model temporal features, capturing features at different layers of the speech sequence while preserving the causality of the speech signal. The intermediate layer is a residual dilated convolutional network (RDCN) that combines dilated convolution, residual connections, and cascaded dilation blocks to give the network a larger receptive field, passing information across layers and extracting long-term dependency features of speech. The RDCN is further combined with a long short-term memory (LSTM) network to capture broader contextual information. The decoder introduces a gating mechanism that dynamically adjusts how much information flows through, obtaining richer global features and reconstructing the enhanced speech. Ablation and performance comparison experiments were conducted on the TIMIT, UrbanSound8k, VoiceBank, and NOISE92 datasets. The results show that, compared with the convolutional recurrent network (CRN), the autoencoder convolutional neural network (AECNN), and the dilated dense autoencoder (DDAEC), among others, RD-EGN has fewer training parameters and achieves higher SSNR scores and higher scores on the subjective evaluation metrics (CSIG, CBAK, and COVL). On objective evaluation metrics, the perceptual evaluation of speech quality (PESQ) improves by 2.5% to 7.1% and the short-time objective intelligibility (STOI) improves by 1% to 5.3%, demonstrating outstanding enhancement performance and generalization ability.
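
The three building blocks named in the abstract (causal convolution, residual dilated convolution, and gating) can be illustrated with a minimal pure-Python sketch. This is not the authors' RD-EGN implementation: the kernel weights, the doubling dilation schedule, and the function names below are hypothetical choices made only to show why stacked dilations enlarge the receptive field while causality is preserved.

```python
import math


def causal_dilated_conv(x, weights, dilation):
    """1-D causal convolution: y[t] depends only on x[t], x[t-d], x[t-2d], ...

    Samples before the start of the sequence are treated as zeros.
    """
    k = len(weights)
    y = []
    for t in range(len(x)):
        acc = 0.0
        for i, w in enumerate(weights):
            idx = t - (k - 1 - i) * dilation  # never looks ahead of t
            if idx >= 0:
                acc += w * x[idx]
        y.append(acc)
    return y


def residual_dilated_stack(x, weights, dilations):
    """Cascade of dilated causal convolutions, each with a skip connection."""
    h = list(x)
    for d in dilations:
        conv = causal_dilated_conv(h, weights, d)
        h = [a + b for a, b in zip(h, conv)]  # residual connection
    return h


def receptive_field(kernel_size, dilations):
    """Number of input samples that can influence one output sample."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)


def gated_unit(features, gates):
    """Element-wise gate: tanh(feature) scaled by sigmoid(gate)."""
    return [math.tanh(f) / (1.0 + math.exp(-g)) for f, g in zip(features, gates)]


# Hypothetical toy input and settings, chosen only for illustration.
signal = [0.1 * n for n in range(16)]
kernel = [0.5, -0.25, 0.125]
dilations = [1, 2, 4, 8]

hidden = residual_dilated_stack(signal, kernel, dilations)
gated = gated_unit(hidden, hidden)
print(receptive_field(len(kernel), dilations))  # 31
print(receptive_field(len(kernel), [1, 1, 1, 1]))  # 9
```

With a kernel of size 3, four layers with dilations 1, 2, 4, 8 see 31 past samples, whereas four undilated layers see only 9, which is the sense in which dilation gives a larger receptive field at the same depth. In a trained network the weights would be learned and the gate would typically come from a separate convolution branch rather than from the features themselves.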
