Kohei Saijo, Wangyou Zhang, Zhong-Qiu Wang, Shinji Watanabe, Tetsunori Kobayashi, Tetsuji Ogawa
Abstract: We propose a multi-task universal speech enhancement (MUSE) model that can perform five speech enhancement (SE) tasks: dereverberation, denoising, speech separation (SS), target speaker extraction (TSE), and speaker counting. This is achieved by integrating two modules into an SE model: 1) an internal separation module that performs both speaker counting and separation; and 2) a TSE module that extracts the target speech from the internal separation outputs using target speaker cues. The model is trained to perform TSE if a target speaker cue is given and SS otherwise. Because the model is also trained to remove noise and reverberation, it can tackle all five tasks with a single model, which has not been accomplished before. Evaluation results demonstrate that the proposed MUSE model can successfully handle multiple tasks with a single model.
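The task switching described in the abstract (TSE when a target speaker cue is given, SS otherwise) can be illustrated with a minimal sketch. Everything here is a hypothetical stand-in, not the authors' implementation: `muse_forward`, `toy_separate`, and `toy_extract` are invented names, signals are plain lists of floats, and the toy functions only mimic the roles of the internal separation module and the TSE module.

```python
from typing import Callable, List, Optional

Signal = List[float]


def muse_forward(
    mixture: Signal,
    separate: Callable[[Signal], List[Signal]],
    extract: Callable[[List[Signal], Signal], Signal],
    speaker_cue: Optional[Signal] = None,
) -> List[Signal]:
    """Hypothetical routing: TSE if a cue is given, SS otherwise."""
    # Internal separation; the number of returned streams plays the
    # role of the speaker count estimate.
    streams = separate(mixture)
    if speaker_cue is None:
        return streams  # speech separation: return every stream
    # Target speaker extraction: one output chosen using the cue.
    return [extract(streams, speaker_cue)]


def toy_separate(mixture: Signal) -> List[Signal]:
    # Assumption for illustration only: two "speakers" are interleaved
    # sample by sample, so de-interleaving separates them.
    return [mixture[0::2], mixture[1::2]]


def toy_extract(streams: List[Signal], cue: Signal) -> Signal:
    # Assumption for illustration only: pick the stream whose dot
    # product with the cue is largest.
    def score(stream: Signal) -> float:
        return sum(a * b for a, b in zip(stream, cue))

    return max(streams, key=score)


mixture = [1.0, -1.0, 1.0, -1.0]
ss_out = muse_forward(mixture, toy_separate, toy_extract)
tse_out = muse_forward(
    mixture, toy_separate, toy_extract, speaker_cue=[-1.0, -1.0]
)
```

Without a cue the call returns all separated streams; with a cue it returns a single stream, which mirrors the "Separation output" and "TSE output" rows in the audio examples on this page.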
Contents:
Audio examples on anechoic condition
Audio examples on noisy reverberant condition
Audio examples on anechoic condition
Example 2-mix
Input mixture | |
Clean speech 1 |
Clean speech 2 |
Separation output 1 |
Separation output 2 |
TSE output 1 |
TSE output 2 |
Example 3-mix
Input mixture | ||
Clean speech 1 |
Clean speech 2 |
Clean speech 3 |
Separation output 1 |
Separation output 2 |
Separation output 3 |
TSE output 1 |
TSE output 2 |
TSE output 3 |
Example 4-mix
Input mixture | |||
Clean speech 1 |
Clean speech 2 |
Clean speech 3 |
Clean speech 4 |
Separation output 1 |
Separation output 2 |
Separation output 3 |
Separation output 4 |
TSE output 1 |
TSE output 2 |
TSE output 3 |
TSE output 4 |
Example 5-mix
Input mixture | ||||
Clean speech 1 |
Clean speech 2 |
Clean speech 3 |
Clean speech 4 |
Clean speech 5 |
Separation output 1 |
Separation output 2 |
Separation output 3 |
Separation output 4 |
Separation output 5 |
TSE output 1 |
TSE output 2 |
TSE output 3 |
TSE output 4 |
TSE output 5 |
Audio examples on noisy reverberant condition
Example 1-mix
Input mixture |
Reference 1 |
Separation output 1 |
TSE output 1 |
Example 2-mix
Input mixture | |
Clean speech 1 |
Clean speech 2 |
Separation output 1 |
Separation output 2 |
TSE output 1 |
TSE output 2 |
Example 3-mix
Input mixture | ||
Reference 1 |
Reference 2 |
Reference 3 |
Separation output 1 |
Separation output 2 |
Separation output 3 |
TSE output 1 |
TSE output 2 |
TSE output 3 |
Example 4-mix
Input mixture | |||
Reference 1 |
Reference 2 |
Reference 3 |
Reference 4 |
Separation output 1 |
Separation output 2 |
Separation output 3 |
Separation output 4 |
TSE output 1 |
TSE output 2 |
TSE output 3 |
TSE output 4 |
Example 5-mix
Input mixture | ||||
Reference 1 |
Reference 2 |
Reference 3 |
Reference 4 |
Reference 5 |
Separation output 1 |
Separation output 2 |
Separation output 3 |
Separation output 4 |
Separation output 5 |
TSE output 1 |
TSE output 2 |
TSE output 3 |
TSE output 4 |
TSE output 5 |