Analysis of RemixIT and Self-Remixing when training from scratch

NOTE: The output order of the three separated signals is aligned with the ground-truth audio (two speech signals and one noise signal).
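
For readers who want to reproduce this kind of alignment, the following is a minimal sketch of one common approach: try every permutation of the outputs and keep the one with the highest total SI-SDR against the references. This page does not state how its alignment was done, so the metric choice and the function names `si_sdr` and `align_outputs` are assumptions made for illustration.

```python
import itertools

import numpy as np


def si_sdr(estimate, reference, eps=1e-8):
    """Scale-invariant SDR in dB between two 1-D signals."""
    # Remove the mean so that a DC offset does not affect the score.
    reference = reference - reference.mean()
    estimate = estimate - estimate.mean()
    # Project the estimate onto the reference to get the target component.
    scale = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = scale * reference
    residual = estimate - target
    return 10.0 * np.log10((np.sum(target**2) + eps) / (np.sum(residual**2) + eps))


def align_outputs(estimates, references):
    """Reorder estimates (n_src, n_samples) to best match references.

    Returns the reordered estimates and the chosen permutation, where
    perm[i] is the index of the estimate assigned to reference i.
    Assumption: exhaustive search, which is cheap for three sources.
    """
    n_src = len(references)
    best_perm = max(
        itertools.permutations(range(n_src)),
        key=lambda perm: sum(
            si_sdr(estimates[p], references[i]) for i, p in enumerate(perm)
        ),
    )
    return estimates[list(best_perm)], best_perm
```

With three sources there are only six permutations, so the exhaustive search costs little. For many more sources, an assignment solver would be the more usual choice.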

Input mixture and clean speech signals

Mixture

Clean speech1

Clean speech2

Conformer Self-Remixing

Epoch 0 Output 1 (speech1)

Epoch 0 Output 2 (speech2)

Epoch 0 Output 3 (noise)

Epoch 5 Output 1 (speech1)

Epoch 5 Output 2 (speech2)

Epoch 5 Output 3 (noise)

Epoch 10 Output 1 (speech1)

Epoch 10 Output 2 (speech2)

Epoch 10 Output 3 (noise)

Epoch 15 Output 1 (speech1)

Epoch 15 Output 2 (speech2)

Epoch 15 Output 3 (noise)

Conformer RemixIT

Epoch 0 Output 1 (speech1)

Epoch 0 Output 2 (speech2)

Epoch 0 Output 3 (noise)

Epoch 5 Output 1 (speech1)

Epoch 5 Output 2 (speech2)

Epoch 5 Output 3 (noise)

Epoch 10 Output 1 (speech1)

Epoch 10 Output 2 (speech2)

Epoch 10 Output 3 (noise)

Epoch 15 Output 1 (speech1)

Epoch 15 Output 2 (speech2)

Epoch 15 Output 3 (noise)

TF-GridNet Self-Remixing

Epoch 0 Output 1 (speech1)

Epoch 0 Output 2 (speech2)

Epoch 0 Output 3 (noise)

Epoch 5 Output 1 (speech1)

Epoch 5 Output 2 (speech2)

Epoch 5 Output 3 (noise)

Epoch 10 Output 1 (speech1)

Epoch 10 Output 2 (speech2)

Epoch 10 Output 3 (noise)

Epoch 15 Output 1 (speech1)

Epoch 15 Output 2 (speech2)

Epoch 15 Output 3 (noise)

TF-GridNet RemixIT

Epoch 0 Output 1 (speech1)

Epoch 0 Output 2 (speech2)

Epoch 0 Output 3 (noise)

Epoch 5 Output 1 (speech1)

Epoch 5 Output 2 (speech2)

Epoch 5 Output 3 (noise)

Epoch 10 Output 1 (speech1)

Epoch 10 Output 2 (speech2)

Epoch 10 Output 3 (noise)

Epoch 15 Output 1 (speech1)

Epoch 15 Output 2 (speech2)

Epoch 15 Output 3 (noise)