Speech enhancement based on residual dilated convolution and gated codec networks
Affiliation: Shandong University of Technology, West Campus
CLC number: TN912.35; TH701
Funding: Supported by the Natural Science Foundation of Shandong Province

    Abstract:

    Temporal dependencies and contextual information in speech signals are crucial for speech enhancement. To address the poor enhancement performance that results when codec (encoder-decoder) networks capture them insufficiently, an asymmetric residual dilated convolution and gated codec network (RD-EGN) is constructed. The network comprises three parts: an encoder, an intermediate layer, and a decoder. The encoder uses a causal convolution layer structure to model temporal features, capturing features at different levels of the speech sequence while preserving the causality of the speech signal. The intermediate layer is a residual dilated convolutional network (RDCN) that combines dilated convolution, residual connections, and cascaded dilation blocks to enlarge the receptive field, pass information across layers, and extract long-term dependencies in speech. On this basis, the RDCN is combined with a long short-term memory (LSTM) network to capture broader contextual information. The decoder introduces a gating mechanism that dynamically adjusts how strongly the information flow is gated, obtaining richer global features and reconstructing the enhanced speech. Ablation and performance comparison experiments were conducted on the TIMIT, UrbanSound8k, VoiceBank, and NOISE92 datasets. The results show that, compared with CRN, AECNN, and DDAEC, RD-EGN has fewer trainable parameters and achieves higher scores in segmental signal-to-noise ratio (SSNR) and in the subjective evaluation metrics (CSIG, CBAK, and COVL). On objective evaluation metrics, the perceptual evaluation of speech quality (PESQ) improves by 2.5% to 7.1% and the short-time objective intelligibility (STOI) improves by 1% to 5.3%, indicating strong enhancement performance and generalization ability.
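
    The three building blocks named in the abstract (causal convolution, residual dilated convolution, and a gated unit) can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the abstract gives no layer sizes, so the kernel size, dilation rates, and weights below are hypothetical, and the sigmoid gate is only one common way to realize a "gating mechanism". The sketch shows why stacking dilated layers enlarges the receptive field and why the output never depends on future samples.

```python
import math


def receptive_field(kernel_size, dilations):
    """Frames of context seen by a stack of causal dilated 1-D convolutions."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)


def causal_dilated_conv(x, weights, dilation):
    """Compute y[t] = sum_k weights[k] * x[t - k * dilation], with x[t < 0] = 0."""
    y = []
    for t in range(len(x)):
        acc = 0.0
        for k, w in enumerate(weights):
            idx = t - k * dilation
            if idx >= 0:  # causal: only current and past samples contribute
                acc += w * x[idx]
        y.append(acc)
    return y


def residual_dilated_block(x, weights, dilation):
    """Add the block input back onto the dilated convolution output (skip connection)."""
    conv = causal_dilated_conv(x, weights, dilation)
    return [xi + ci for xi, ci in zip(x, conv)]


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def gated_unit(x, main_weights, gate_weights):
    """Scale a main branch element-wise by a sigmoid gate computed from the same input."""
    main = causal_dilated_conv(x, main_weights, 1)
    gate = causal_dilated_conv(x, gate_weights, 1)
    return [m * sigmoid(g) for m, g in zip(main, gate)]


# Hypothetical settings, chosen only for illustration (not taken from the paper).
DILATIONS = [1, 2, 4]
WEIGHTS = [0.5, 0.5]  # kernel size 2

# Push a unit impulse through the cascaded blocks to see how far it spreads.
impulse = [1.0] + [0.0] * 9
h = impulse
for d in DILATIONS:
    h = residual_dilated_block(h, WEIGHTS, d)

print(receptive_field(len(WEIGHTS), DILATIONS))  # 8
print(h)  # impulse response; non-zero only at indices 0..7
```

    With kernel size 2 and dilations 1, 2, 4, three layers already cover 8 frames, whereas three undilated layers of the same kernel size would cover only 4. In a trained network the weights are learned and each layer has many channels; the arithmetic of the receptive field is the same.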

History
  • Received: 2024-09-13
  • Last revised: 2025-02-14
  • Accepted: 2025-02-17