Kohei Saijo, Wangyou Zhang, Samuele Cornell, Robin Scheibler, Chenda Li, Zhaoheng Ni, Anurag Kumar, Marvin Sach, Yihui Fu, Wei Wang, Tim Fingscheidt, Shinji Watanabe
Abstract: There has been a growing effort to develop universal speech enhancement (SE) to handle inputs with various speech distortions and recording conditions. The URGENT Challenge series aims to foster such universal SE by embracing a broad range of distortion types, increasing data diversity, and incorporating extensive evaluation metrics. This work introduces the Interspeech 2025 URGENT Challenge, the second edition of the series, to explore several aspects that have received limited attention so far: language dependency, universality for more distortion types, data scalability, and the effectiveness of using noisy training data. We received 32 submissions, where the best system uses a discriminative model, while most other competitive ones are hybrid methods. Analysis reveals some key findings: (i) some generative or hybrid approaches are preferred in subjective evaluations over the top discriminative model, and (ii) purely generative SE models can exhibit language dependency.
Contents:
English audio example 1
English audio example 2
English audio example 3
Japanese audio example 1
Japanese audio example 2
English audio example 1
Distortion types: additive noise (SNR: 0.1 dB) and reverberation.
This is an example under a low SNR condition.
All the systems remove noise very well.
Although the spectrogram of T13 looks very clean, some words are mispronounced or unclear.
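For readers unfamiliar with the SNR figures quoted for each example, the following is a minimal sketch of how additive noise can be mixed with speech at a target SNR, using the standard power-ratio definition. It is not the challenge's actual simulation pipeline, which is not described on this page; the helper names `power`, `mix_at_snr`, and `measure_snr_db` and the synthetic sine-wave "speech" are assumptions made purely for illustration.

```python
import math
import random


def power(signal):
    """Mean power of a signal given as a list of samples."""
    return sum(v * v for v in signal) / len(signal)


def mix_at_snr(speech, noise, target_snr_db):
    """Scale `noise` so that the speech-to-noise power ratio equals target_snr_db.

    Hypothetical helper for illustration only. Returns (noisy, scaled_noise).
    """
    # Solve 10*log10(P_speech / (gain^2 * P_noise)) = target_snr_db for gain.
    gain = math.sqrt(power(speech) / (power(noise) * 10 ** (target_snr_db / 10)))
    scaled_noise = [gain * n for n in noise]
    noisy = [s + n for s, n in zip(speech, scaled_noise)]
    return noisy, scaled_noise


def measure_snr_db(speech, noise):
    """SNR in dB between a speech signal and a noise signal."""
    return 10 * math.log10(power(speech) / power(noise))


# Synthetic stand-ins: a 220 Hz tone as "speech" and Gaussian noise, 1 s at 16 kHz.
random.seed(0)
sample_rate = 16000
speech = [0.5 * math.sin(2 * math.pi * 220 * i / sample_rate) for i in range(sample_rate)]
noise = [random.gauss(0.0, 1.0) for _ in range(sample_rate)]

# Mix at the SNR quoted for this example (0.1 dB).
noisy, scaled_noise = mix_at_snr(speech, noise, 0.1)
achieved_snr_db = measure_snr_db(speech, scaled_noise)
```

At an SNR near 0 dB the noise carries almost as much power as the speech, which is why this sample counts as a low-SNR condition.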
Noisy speech ![]() |
Clean speech ![]() |
|||
T1 (discriminative) ![]() |
T2 (hybrid) ![]() |
T3 (hybrid) ![]() |
T10 (discriminative) ![]() |
T13 (generative) ![]() |
English audio example 2
Distortion types: additive noise (SNR: 6.1 dB), clipping, bandwidth limitation, and MP3 codec.
This is an example degraded by multiple distortions.
The outputs from the discriminative models (T1 and T10) contain noticeable artifacts above 4 kHz, while those from
the generative or hybrid approaches sound clean.
Among them, the purely generative model, T13, restores the speech very well.
Noisy speech ![]() |
Clean speech ![]() |
|||
T1 (discriminative) ![]() |
T2 (hybrid) ![]() |
T3 (hybrid) ![]() |
T10 (discriminative) ![]() |
T13 (generative) ![]() |
English audio example 3
Distortion types: additive noise (SNR: 10.2 dB).
This is an example of an easy sample.
All the systems work very well.
Noisy speech ![]() |
Clean speech ![]() |
|||
T1 (discriminative) ![]() |
T2 (hybrid) ![]() |
T3 (hybrid) ![]() |
T10 (discriminative) ![]() |
T13 (generative) ![]() |
Japanese audio example 1
Distortion types: additive noise (SNR: 3.5 dB), reverberation, clipping, DAC codec, and bandwidth
limitation.
This is a harsh example degraded by five distortions.
Still, it is not very difficult for a native Japanese speaker to recognize the speech content.
Although not perfect, the outputs from the first four systems (T1, T2, T3, and T10) sound okay.
Interestingly, however, the output from T13 does not sound Japanese but rather resembles English or other
European languages, which made up the majority of the training data.
Noisy speech ![]() |
Clean speech ![]() |
|||
T1 (discriminative) ![]() |
T2 (hybrid) ![]() |
T3 (hybrid) ![]() |
T10 (discriminative) ![]() |
T13 (generative) ![]() |
Japanese audio example 2
Distortion types: additive noise and packet loss.
This is an example of real-recorded noisy speech with additional packet loss.
Since we do not have the corresponding clean speech, we provide the original noisy speech without packet
loss instead.
As in English audio example 2, which includes bandwidth limitation, the outputs
from the discriminative models (T1 and T10) contain noticeable artifacts in the frames affected by packet loss.
In contrast, the hybrid and generative models (T2, T3, and T13) do not produce such artifacts, although the
lost frames are not necessarily inpainted very well.
Interestingly, the inpainted frames of T13's output sound like Japanese spoken by native English
speakers as a second language (e.g., "kudasai" -> "kudosai").
Noisy speech ![]() |
Original noisy speech (w/o packet loss) ![]() |
|||
T1 (discriminative) ![]() |
T2 (hybrid) ![]() |
T3 (hybrid) ![]() |
T10 (discriminative) ![]() |
T13 (generative) ![]() |