STABLE TRAINING OF DNN FOR SPEECH ENHANCEMENT BASED ON
PERCEPTUALLY-MOTIVATED BLACK-BOX COST FUNCTION

Masaki Kawanaka, Yuma Koizumi, Ryoichi Miyazaki, and Kohei Yatabe

Algorithms:
- Noisy: Input speech degraded by background noise.
- Clean: Clean reference speech.
- SEGAN: Output of SEGAN (Pascual et al., 2017).
- Deep Feature Loss: Output of the deep feature loss method (Germain et al., 2019).
- Ours: Output of the proposed method.

The samples are grouped into four tranches (Tranche 1 to Tranche 4).
- Tranche 1 contains the samples most affected by background noise.

Audio samples from Tranche 1
File name | Noisy | Clean | SEGAN | Deep Feature Loss | Ours
p232_036.wav
p232_170.wav
p232_415.wav
p257_070.wav
p257_395.wav

Audio samples from Tranche 2
File name | Noisy | Clean | SEGAN | Deep Feature Loss | Ours
p232_080.wav
p257_170.wav
p257_306.wav
p257_322.wav
p257_382.wav

Audio samples from Tranche 3
File name | Noisy | Clean | SEGAN | Deep Feature Loss | Ours
p232_145.wav
p232_189.wav
p232_261.wav
p257_243.wav
p257_272.wav

Audio samples from Tranche 4
File name | Noisy | Clean | SEGAN | Deep Feature Loss | Ours
p232_075.wav
p232_153.wav
p232_176.wav
p257_205.wav
p257_250.wav


Test data: VoiceBank-DEMAND dataset constructed by Valentini-Botinhao et al. (2016).
Output speech files of SEGAN and Deep Feature Loss: URL


References

S. Pascual, A. Bonafonte, and J. Serrà, "SEGAN: Speech Enhancement Generative Adversarial Network," Proc. of Interspeech, 2017.
F. G. Germain, Q. Chen, and V. Koltun, "Speech Denoising with Deep Feature Losses," Proc. of Interspeech, 2019.
C. Valentini-Botinhao, X. Wang, S. Takaki, and J. Yamagishi, "Investigating RNN-based Speech Enhancement Methods for Noise-Robust Text-to-Speech," Proc. of 9th ISCA Speech Synthesis Workshop (SSW), 2016.