Speech enhancement based on residual dilatation convolutional and gated codec networks

Affiliation: 1. School of Computer Science and Technology, Shandong University of Technology, Zibo 255049, China; 2. School of Electrical and Electronic Engineering, Shandong University of Technology, Zibo 255049, China
CLC number: TN912.35
Funding: Supported by the Natural Science Foundation of Shandong Province (ZR2024MD031)

Abstract:

Temporal dependencies and contextual information in speech signals are crucial for speech enhancement. To address the poor enhancement that results when encoder-decoder (codec) networks capture these features insufficiently, an asymmetric residual dilated convolution and gated codec network (RD-EGN) is constructed, comprising an encoder, an intermediate layer, and a decoder. The encoder uses a causal convolutional layer structure to model temporal features, capturing features at different layers of the speech sequence while preserving the causality of the speech signal. The intermediate layer is a residual dilated convolutional network (RDCN) that combines dilated convolution, residual connections, and cascaded dilation blocks to give the network a larger receptive field, passing information across layers and extracting long-term dependencies in speech. The RDCN is further combined with a long short-term memory (LSTM) network to capture broader contextual information. The decoder introduces a gating mechanism that dynamically adjusts how much information flows through, obtaining richer global features and reconstructing the enhanced speech. Ablation and comparison experiments were conducted on the TIMIT, UrbanSound8k, VoiceBank, and NOISE92 datasets. The results show that, compared with models such as the convolutional recurrent network (CRN), the autoencoder convolutional neural network (AECNN), and the dilated dense autoencoder (DDAEC), RD-EGN has fewer trainable parameters and achieves higher segmental signal-to-noise ratio (SSNR) scores and higher scores on the subjective evaluation metrics (CSIG, CBAK, and COVL). On the objective metrics, the perceptual evaluation of speech quality (PESQ) improves by 2.5% to 7.1% and the short-time objective intelligibility (STOI) improves by 1% to 5.3%, indicating strong enhancement performance and generalization ability.
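The building blocks named in the abstract (causal convolution, dilated convolution with residual connections, and a gating mechanism) can be illustrated with a minimal pure-Python sketch. This is not the authors' RD-EGN implementation: the kernel size, the dilation schedule, the weights, and all function names below are assumptions chosen for illustration, since the abstract does not give these details.

```python
import math


def causal_dilated_conv(x, kernel, dilation):
    """1-D causal convolution: y[t] uses only x[t], x[t-d], x[t-2d], ..."""
    y = []
    for t in range(len(x)):
        acc = 0.0
        for k, w in enumerate(kernel):
            idx = t - k * dilation
            if idx >= 0:  # samples before the start count as zeros (left padding)
                acc += w * x[idx]
        y.append(acc)
    return y


def residual_dilated_block(x, kernel, dilation):
    """Dilated causal convolution plus a residual (skip) connection."""
    conv = causal_dilated_conv(x, kernel, dilation)
    return [xi + ci for xi, ci in zip(x, conv)]


def stack_blocks(x, kernel, dilations):
    """Cascade residual blocks, one per dilation rate."""
    for d in dilations:
        x = residual_dilated_block(x, kernel, d)
    return x


def receptive_field(kernel_size, dilations):
    """Number of input samples that can influence one output sample."""
    return 1 + (kernel_size - 1) * sum(dilations)


def gated_unit(features, gate_logits):
    """Scale each feature by a sigmoid gate in (0, 1)."""
    return [f / (1.0 + math.exp(-g)) for f, g in zip(features, gate_logits)]


# Assumed toy configuration (hypothetical values, not taken from the paper).
KERNEL = [0.5, 0.5, 0.5]   # kernel size 3
DILATIONS = [1, 2, 4, 8]   # dilation doubles in each block

impulse = [0.0] * 40
impulse[5] = 1.0
response = stack_blocks(impulse, KERNEL, DILATIONS)
```

Under these assumed values, `receptive_field(3, DILATIONS)` is 31 samples, so the impulse at index 5 influences outputs 5 through 35 and nothing before index 5. This shows how stacking dilated blocks widens the temporal context without breaking causality. In `gated_unit`, a gate logit of 0 passes half of the feature value, and large positive or negative logits pass nearly all or nearly none of it.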

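Cite this article

李珂, 王雅静, 昝志辉, 齐瑞洁. Speech enhancement based on residual dilatation convolutional and gated codec networks [J]. Journal of Electronic Measurement and Instrumentation (电子测量与仪器学报), 2025, 39(4): 74-83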

History

Published online: 2025-06-10
