  • FullSubNet: A Full-Band and Sub-Band Fusion Model for Real-Time Single-Channel Speech Enhancement

    Xiang Hao, Xiangdong Su, Radu Horaud, Xiaofei Li

    ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
    This paper proposes a full-band and sub-band fusion model, named FullSubNet, for single-channel real-time speech enhancement. Full-band and sub-band refer to models that take full-band and sub-band noisy spectral features as input and output full-band and sub-band speech targets, respectively. The sub-band model processes each frequency independently: its input consists of one frequency and several context frequencies, and its output is the prediction of the clean speech target for the corresponding frequency. These two types of models have distinct characteristics. The full-band model can capture the global spectral context and long-distance cross-band dependencies, but it lacks the ability to model signal stationarity and attend to local spectral patterns. The sub-band model is just the opposite. In the proposed FullSubNet, we connect a pure full-band model and a pure sub-band model sequentially and use practical joint training to integrate the advantages of these two types of models. We conducted experiments on the DNS Challenge (INTERSPEECH 2020) dataset to evaluate the proposed method. The experimental results show that full-band and sub-band information are complementary, and that FullSubNet can effectively integrate them. Moreover, the performance of FullSubNet exceeds that of the top-ranked methods in the DNS Challenge (INTERSPEECH 2020).
    [PDF] [Code] [Demo]
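    The full-band/sub-band cascade can be sketched in a few lines of PyTorch. This is a minimal illustration, not the released implementation (see [Code] for that): the layer sizes, the 15-bin context width on each side, and the two-channel (real/imaginary) mask output are assumptions chosen for brevity.

```python
# Minimal sketch of a full-band model feeding a shared sub-band model.
# All hyperparameters here are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FullSubSketch(nn.Module):
    def __init__(self, n_freq=257, context=15, hidden=384):
        super().__init__()
        self.context = context
        # Full-band model: an LSTM over time that sees all frequencies at once,
        # capturing global spectral context and cross-band dependencies.
        self.full_band = nn.LSTM(n_freq, hidden, num_layers=2, batch_first=True)
        self.full_proj = nn.Linear(hidden, n_freq)
        # Sub-band model: shared across frequencies; its input is one frequency,
        # its 2*context neighbours, plus the full-band output for that bin.
        self.sub_band = nn.LSTM(2 * context + 2, hidden, num_layers=2, batch_first=True)
        self.sub_proj = nn.Linear(hidden, 2)  # real/imag mask per bin

    def forward(self, mag):                              # mag: (B, F, T)
        B, Fq, T = mag.shape
        fb = self.full_band(mag.transpose(1, 2))[0]      # (B, T, hidden)
        fb = self.full_proj(fb).transpose(1, 2)          # (B, F, T)
        # Unfold the frequency axis into (centre + neighbours) windows.
        pad = F.pad(mag.unsqueeze(1), (0, 0, self.context, self.context),
                    mode="reflect").squeeze(1)           # (B, F+2c, T)
        sub_in = pad.unfold(1, 2 * self.context + 1, 1)  # (B, F, T, 2c+1)
        sub_in = torch.cat([sub_in, fb.unsqueeze(-1)], dim=-1)
        # Fold frequencies into the batch so one model serves every bin.
        sub_in = sub_in.reshape(B * Fq, T, -1)
        out = self.sub_proj(self.sub_band(sub_in)[0])    # (B*F, T, 2)
        return out.reshape(B, Fq, T, 2)

mask = FullSubSketch()(torch.randn(2, 257, 100))
print(mask.shape)  # torch.Size([2, 257, 100, 2])
```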
  • Sub-Band Knowledge Distillation Framework for Speech Enhancement

    Xiang Hao, Shixue Wen, Xiangdong Su, Yun Liu, Guanglai Gao, Xiaofei Li

    INTERSPEECH 2020 - The 21st Annual Conference of the International Speech Communication Association
    In single-channel speech enhancement, methods based on full-band spectral features have been widely studied, while only a few methods pay attention to sub-band spectral features. In this paper, we explore a knowledge distillation framework based on sub-band spectral mapping for single-channel speech enhancement. First, we divide the full frequency band into multiple sub-bands and pre-train an elite-level sub-band enhancement model (teacher model) for each sub-band. Each teacher model is dedicated to processing its own sub-band. Next, under the teacher models' guidance, we train a general sub-band enhancement model (student model) that works for all sub-bands. Without increasing the number of model parameters or the computational complexity, the student model's performance is further improved. To evaluate the proposed method, we conducted a large number of experiments on an open-source dataset. The final experimental results show that the guidance from the elite-level teacher models dramatically improves the student model's performance, which exceeds that of the full-band model while employing fewer parameters.
    [PDF]
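    A minimal sketch of the distillation objective the abstract describes, assuming frozen per-band teachers and an MSE guidance term; the band split, model interfaces, and the weight alpha are illustrative assumptions, not the paper's exact configuration.

```python
# Sketch: one student learns from the clean target and from each band's
# dedicated (frozen) teacher. Interfaces and weighting are assumptions.
import torch
import torch.nn.functional as F

def distill_loss(student, teachers, noisy_bands, clean_bands, alpha=0.5):
    """noisy_bands/clean_bands: lists of (B, F_band, T) magnitude tensors,
    one per sub-band; teachers[i] is the frozen elite model for band i."""
    loss = 0.0
    for band, (noisy, clean) in enumerate(zip(noisy_bands, clean_bands)):
        pred = student(noisy)                  # one student serves all bands
        with torch.no_grad():
            guide = teachers[band](noisy)      # specialist teacher's output
        loss = loss + alpha * F.mse_loss(pred, clean) \
                    + (1 - alpha) * F.mse_loss(pred, guide)
    return loss / len(noisy_bands)
```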
  • SNR-Based Teachers-Student Technique for Speech Enhancement

    Xiang Hao, Xiangdong Su, Zhiyu Wang, Qiang Zhang, Huali Xu, Guanglai Gao

    ICME 2020 - 2020 IEEE International Conference on Multimedia and Expo (ICME)
    It is very challenging for speech enhancement methods to achieve robust performance under both high and low signal-to-noise ratio (SNR) conditions simultaneously. In this paper, we propose a method that integrates an SNR-based teachers-student technique with a time-domain U-Net to deal with this problem. Specifically, the method consists of multiple teacher models and a student model. We first train the teacher models on multiple small, non-overlapping SNR ranges so that each teacher performs speech enhancement well within its specific SNR range. Then, we choose different teacher models to supervise the training of the student model according to the SNR of the training data. Eventually, the student model can perform speech enhancement under both high and low SNR. To evaluate the proposed method, we constructed a dataset with SNRs ranging from -20 dB to 20 dB based on a public dataset. We experimentally analyzed the effectiveness of the SNR-based teachers-student technique and compared the proposed method with several state-of-the-art methods.
    [PDF]
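    The routing idea can be sketched as below, assuming non-overlapping 10 dB ranges and an MSE loss on waveforms; the ranges, the loss mix, and the model interfaces are illustrative assumptions.

```python
# Sketch: route each training example to the teacher whose SNR range covers
# it, then let that specialist supervise the student. Ranges are assumptions.
import torch
import torch.nn.functional as F

SNR_RANGES = [(-20, -10), (-10, 0), (0, 10), (10, 20)]  # dB, non-overlapping

def pick_teacher(snr_db, teachers):
    for (lo, hi), teacher in zip(SNR_RANGES, teachers):
        if lo <= snr_db < hi:
            return teacher
    return teachers[-1]                        # catch the upper boundary

def student_step(student, teachers, noisy_wav, clean_wav, snr_db):
    pred = student(noisy_wav)                  # time-domain U-Net student
    with torch.no_grad():
        guide = pick_teacher(snr_db, teachers)(noisy_wav)
    # Learn from the clean target and from the specialist teacher's output.
    return F.mse_loss(pred, clean_wav) + F.mse_loss(pred, guide)
```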
  • Masking and Inpainting: A Two-Stage Speech Enhancement Approach for Low SNR and Non-Stationary Noise

    Xiang Hao, Xiangdong Su, Shixue Wen, Zhiyu Wang, Yiqian Pan, Feilong Bao, Wei Chen

    ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
    Currently, low signal-to-noise ratio (SNR) and non-stationary noise cause severe performance degradation for most speech enhancement models. For better speech enhancement in these scenarios, this paper proposes a two-stage approach that consists of binary masking and spectrogram inpainting. In the binary masking stage, we first obtain a binary mask by hardening a soft mask and then use it to remove time-frequency points that are dominated by severe noise. In the spectrogram inpainting stage, we use a CNN with partial convolution to inpaint the masked spectrogram from the previous stage. We compared our approach with two strong baselines, Wave-U-Net and CRN, on a low-SNR dataset containing many non-stationary noises. The experimental results show that our approach outperformed the baselines and achieved state-of-the-art performance.
    [PDF]
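    A minimal sketch of the two stages, assuming single-channel magnitude spectrograms, a 0.5 hardening threshold, and a generic partial convolution in the style of Liu et al. (2018); none of these specific choices are taken from the paper.

```python
# Sketch: stage 1 hardens the soft mask and blanks noise-dominated points;
# stage 2 inpaints the holes with partial convolutions. Details assumed.
import torch
import torch.nn as nn
import torch.nn.functional as F

def binary_masking(noisy_spec, soft_mask, threshold=0.5):
    """Stage 1. noisy_spec, soft_mask: (B, 1, F, T). Returns the holed
    spectrogram and the binary mask that marks where the holes are."""
    binary_mask = (soft_mask >= threshold).float()   # harden the soft mask
    return noisy_spec * binary_mask, binary_mask

class PartialConv2d(nn.Conv2d):
    """Stage 2 building block: convolve only valid (unmasked) points and
    renormalise each output by the number of valid inputs under its window;
    the holes shrink layer by layer as the mask is updated."""
    def forward(self, x, mask):                      # mask: (B, 1, F, T)
        kernel = torch.ones(1, 1, *self.kernel_size, device=x.device)
        with torch.no_grad():
            valid = F.conv2d(mask, kernel, stride=self.stride,
                             padding=self.padding)   # valid points per window
        out = F.conv2d(x * mask, self.weight, None, self.stride, self.padding)
        out = out * (kernel.numel() / valid.clamp(min=1.0))
        if self.bias is not None:
            out = out + self.bias.view(1, -1, 1, 1)
        return out, (valid > 0).float()              # updated (shrunken) mask

spec = torch.randn(2, 1, 257, 100).abs()
holed, m = binary_masking(spec, torch.rand(2, 1, 257, 100))
out, m2 = PartialConv2d(1, 16, 3, padding=1)(holed, m)
print(out.shape, m2.shape)  # (2, 16, 257, 100), (2, 1, 257, 100)
```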
  • Learning an Adversarial Network for Speech Enhancement Under Extremely Low Signal-to-Noise Ratio Condition

    Xiangdong Su, Xiang Hao, Zhiyu Wang, Yun Liu, Huali Xu, Tongyang Liu, Guanglai Gao

    ICONIP 2019 - International Conference on Neural Information Processing 2019
    Speech enhancement under low signal-to-noise ratio (SNR) conditions is a challenging task. This paper formulates speech enhancement as a spectrogram mapping problem that converts the noisy speech spectrogram into the clean speech spectrogram. On this basis, we propose a robust speech enhancement approach based on deep adversarial learning for extremely low SNR conditions. The deep adversarial network is trained on a few paired spectrograms of noisy and clean speech, and several strategies are applied to optimize it: skip connections, PatchGAN, and spectral normalization. Our approach is evaluated under extremely low SNR conditions (the lowest SNR is -20 dB), and the results demonstrate that it significantly improves speech quality and substantially outperforms representative deep learning models, including DNN, SEGAN, and Bidirectional LSTM with a phase-sensitive spectrum approximation cost function (PSA-BLSTM), in terms of Short-Time Objective Intelligibility (STOI) and Perceptual Evaluation of Speech Quality (PESQ).
    [PDF]
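    The discriminator side of the recipe (PatchGAN plus spectral normalization) can be sketched as below; the channel counts, depth, and the paired noisy/enhanced input are illustrative assumptions, not the paper's exact architecture.

```python
# Sketch: a PatchGAN-style discriminator with spectral normalisation that
# judges (noisy, enhanced) spectrogram pairs patch by patch.
import torch
import torch.nn as nn
from torch.nn.utils import spectral_norm

def sn_conv(cin, cout, stride):
    # Spectral normalisation stabilises adversarial training by bounding
    # the Lipschitz constant of each discriminator layer.
    return spectral_norm(nn.Conv2d(cin, cout, 4, stride, 1))

class PatchDiscriminator(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            sn_conv(2, 64, 2), nn.LeakyReLU(0.2),   # input: noisy + enhanced
            sn_conv(64, 128, 2), nn.LeakyReLU(0.2),
            sn_conv(128, 256, 2), nn.LeakyReLU(0.2),
            sn_conv(256, 1, 1),                     # one real/fake score per patch
        )

    def forward(self, noisy_spec, enhanced_spec):
        return self.net(torch.cat([noisy_spec, enhanced_spec], dim=1))

scores = PatchDiscriminator()(torch.randn(1, 1, 256, 256),
                              torch.randn(1, 1, 256, 256))
print(scores.shape)  # torch.Size([1, 1, 31, 31]): a grid of patch scores
```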
  • UNetGAN: A Robust Speech Enhancement Approach in Time Domain for Extremely Low Signal-to-noise Ratio Condition

    Xiang Hao, Xiangdong Su, Zhiyu Wang, Hui Zhang, Batushiren

    INTERSPEECH 2019 - The 20th Annual Conference of the International Speech Communication Association
    Speech enhancement under extremely low signal-to-noise ratio (SNR) conditions is a very challenging problem that was rarely investigated in previous work. This paper proposes a robust speech enhancement approach (UNetGAN) based on U-Net and generative adversarial learning to deal with this problem. The approach consists of a generator network and a discriminator network, both of which operate directly in the time domain. The generator network adopts a U-Net-like structure and employs dilated convolution in its bottleneck. We evaluate the performance of UNetGAN at low SNR conditions (as low as -20 dB) on a public benchmark. The results demonstrate that it significantly improves speech quality and substantially outperforms representative deep learning models, including SEGAN, cGAN for SE, Bidirectional LSTM with a phase-sensitive spectrum approximation cost function (PSA-BLSTM), and Wave-U-Net, in terms of Short-Time Objective Intelligibility (STOI) and Perceptual Evaluation of Speech Quality (PESQ).
    [PDF]
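    A minimal sketch of a time-domain U-Net generator with a dilated-convolution bottleneck, as the abstract describes; the depth, widths, kernel sizes, and dilation rates are illustrative assumptions.

```python
# Sketch: waveform-in, waveform-out U-Net with skip connections and a
# dilated bottleneck. All hyperparameters here are assumptions.
import torch
import torch.nn as nn

class UNet1dSketch(nn.Module):
    def __init__(self, ch=32):
        super().__init__()
        self.enc1 = nn.Sequential(nn.Conv1d(1, ch, 15, 2, 7), nn.PReLU())
        self.enc2 = nn.Sequential(nn.Conv1d(ch, ch * 2, 15, 2, 7), nn.PReLU())
        # Dilated convolutions widen the receptive field in the bottleneck
        # without further downsampling.
        self.bottleneck = nn.Sequential(
            nn.Conv1d(ch * 2, ch * 2, 3, padding=2, dilation=2), nn.PReLU(),
            nn.Conv1d(ch * 2, ch * 2, 3, padding=4, dilation=4), nn.PReLU(),
        )
        self.dec2 = nn.Sequential(
            nn.ConvTranspose1d(ch * 4, ch, 16, 2, 7), nn.PReLU())
        self.dec1 = nn.ConvTranspose1d(ch * 2, 1, 16, 2, 7)

    def forward(self, wav):                        # wav: (B, 1, T)
        e1 = self.enc1(wav)
        e2 = self.enc2(e1)
        b = self.bottleneck(e2)
        d2 = self.dec2(torch.cat([b, e2], dim=1))  # skip connection
        return torch.tanh(self.dec1(torch.cat([d2, e1], dim=1)))

out = UNet1dSketch()(torch.randn(2, 1, 16384))
print(out.shape)  # torch.Size([2, 1, 16384]): same length as the input
```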