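Luo Qingyu, Zhang Tianqi, Fang Rong, Zhang Huizhi. Collaborative speech enhancement method combining spectral mapping and masking estimation[J]. Journal of Electronic Measurement and Instrumentation (电子测量与仪器学报), 2023, 37(10): 14-23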
Collaborative speech enhancement method combining spectral mapping and masking estimation
  
Keywords: speech enhancement  complex spectral mapping  complex masking  multi-scale fusion Transformer  lightweight network
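Funding: Supported by the National Natural Science Foundation of China (61671095, 61771085), the Natural Science Foundation of Chongqing (cstc2021jcyj-msxmX0836), and the Scientific Research Project of the Chongqing Municipal Education Commission (KJ1600427, KJ1600429)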
Author  Institution
Luo Qingyu  1. School of Communication and Information Engineering, Chongqing University of Posts and Telecommunications
Zhang Tianqi  1. School of Communication and Information Engineering, Chongqing University of Posts and Telecommunications
Fang Rong  1. School of Communication and Information Engineering, Chongqing University of Posts and Telecommunications
Zhang Huizhi  1. School of Communication and Information Engineering, Chongqing University of Posts and Telecommunications
Abstract:
      To raise the performance upper bound of current masking-based and spectral-mapping-based speech enhancement methods and to improve their generalization in complex environments, a collaborative monaural speech enhancement method is proposed within a joint complex-spectrum and complex-masking learning framework. The method adopts an encoder and dual-branch decoder structure. In the encoder and decoder, an interactive cooperative learning unit (ICU) is designed to supervise the speech information flow exchanged between branches and to provide an effective latent feature space. In the middle layer, a multi-scale fusion Transformer is designed that, with a small number of parameters, extracts detail information at multiple scales along the spatial and channel dimensions and fuses it for output, while modeling both sub-band and full-band speech information. Experiments on large and small datasets and under 115 noise environments show that the proposed method, with only 0.57 M parameters, achieves better subjective and objective metrics than most advanced related methods, demonstrating good robustness and effectiveness.
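The abstract contrasts two ways of estimating the clean complex spectrum: predicting a complex mask that is multiplied with the noisy spectrum, and predicting the spectrum directly. The following is a minimal NumPy sketch of that distinction, not the authors' network. The function names, the toy random spectra, and in particular `fuse_branches` (a plain weighted average) are illustrative assumptions; the abstract does not state how the two decoder branches are combined.

```python
import numpy as np


def ideal_complex_ratio_mask(clean_spec, noisy_spec, eps=1e-8):
    """Oracle complex mask M = S / Y, computed as S * conj(Y) / |Y|^2."""
    return clean_spec * np.conj(noisy_spec) / (np.abs(noisy_spec) ** 2 + eps)


def apply_complex_mask(noisy_spec, mask):
    """Masking branch: complex multiplication of mask and noisy spectrum."""
    return noisy_spec * mask


def fuse_branches(mapped_spec, masked_spec, weight=0.5):
    """Hypothetical fusion (assumption): weighted average of both estimates."""
    return weight * mapped_spec + (1.0 - weight) * masked_spec


# Toy complex spectrograms with shape (frequency bins, frames).
rng = np.random.default_rng(0)
shape = (4, 5)
clean = rng.standard_normal(shape) + 1j * rng.standard_normal(shape)
noise = 0.3 * (rng.standard_normal(shape) + 1j * rng.standard_normal(shape))
noisy = clean + noise

# Masking branch with an oracle mask recovers the clean spectrum.
mask = ideal_complex_ratio_mask(clean, noisy)
masked = apply_complex_mask(noisy, mask)

# Mapping branch: a stand-in for a network that outputs the spectrum directly.
mapped = clean.copy()

fused = fuse_branches(mapped, masked)
```

With an oracle mask both branches reach the same target, so the difference lies in what the network must learn: a bounded multiplicative correction in the masking branch, or the real and imaginary parts themselves in the mapping branch.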