Unofficial Parallel WaveGAN implementation demo

This is the demonstration page for the UNOFFICIAL implementations of the models listed below.

GitHub: https://github.com/kan-bayashi/ParallelWaveGAN

Audio samples (English)

Here is a comparison in the analysis-synthesis condition using the LJSpeech dataset.
Note that we limit the frequency range to 80-7600 Hz in the Mel spectrogram calculation.
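
As a rough illustration of what this band limit means, the sketch below places Mel filter edges between 80 Hz and 7600 Hz. It is not the repository's feature extraction code: the HTK-style Mel formula and the choice of 80 Mel bands are assumptions made for the example, and the helper names are hypothetical.

```python
import math


def hz_to_mel(hz):
    """HTK-style Hz -> mel conversion (an assumed convention, not taken from this page)."""
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_band_edges(fmin, fmax, n_mels):
    """Return n_mels + 2 edge frequencies in Hz, equally spaced on the mel scale.

    Triangular filter i spans edges[i] .. edges[i + 2], so no filter
    covers energy below fmin or above fmax.
    """
    lo, hi = hz_to_mel(fmin), hz_to_mel(fmax)
    step = (hi - lo) / (n_mels + 1)
    return [mel_to_hz(lo + i * step) for i in range(n_mels + 2)]


# fmin and fmax come from the note above; n_mels=80 is an assumption.
edges = mel_band_edges(fmin=80.0, fmax=7600.0, n_mels=80)
print(f"lowest edge:  {edges[0]:.1f} Hz")
print(f"highest edge: {edges[-1]:.1f} Hz")
```

Because the filters stop at 7600 Hz, the vocoder input carries no information above that frequency, even though the waveform itself may be sampled at a higher rate.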

Groundtruth: Target speech.
Parallel WaveGAN (official): Official samples provided on the official demo page.
Parallel WaveGAN (ours): Our samples based on this config.
MelGAN + STFT-loss (ours): Our samples based on this config.
FB-MelGAN (ours): Our samples based on this config.
MB-MelGAN (ours): Our samples based on this config.
HiFi-GAN (ours): Our samples based on this config.
StyleMelGAN (ours): Our samples based on this config.


Audio players (five utterances, each presented for the following systems):

Groundtruth	Parallel WaveGAN (official)
Parallel WaveGAN (ours)	MelGAN + STFT-loss (ours)
FB-MelGAN (ours)	MB-MelGAN (ours)
HiFi-GAN (ours)	StyleMelGAN (ours)

Audio samples (Japanese)

Audio samples from a model trained on the JSUT dataset.
Note that the groundtruth samples are 48 kHz; we downsampled them to 24 kHz and limit the frequency range to 80-7600 Hz in the Mel spectrogram calculation.
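
The page does not say which resampler was used for the 48 kHz to 24 kHz conversion. As a minimal sketch of the general idea only, the example below low-pass filters with a windowed-sinc FIR filter and then keeps every second sample. The function names, tap count, and cutoff are hypothetical choices for illustration; in practice a library resampler would normally be used.

```python
import math


def lowpass_taps(num_taps, cutoff):
    """Hann-windowed sinc low-pass FIR filter.

    cutoff is a fraction of the input sampling rate (0 < cutoff < 0.5).
    """
    mid = (num_taps - 1) / 2
    taps = []
    for n in range(num_taps):
        x = n - mid
        if x == 0:
            sinc = 2 * cutoff
        else:
            sinc = math.sin(2 * math.pi * cutoff * x) / (math.pi * x)
        hann = 0.5 - 0.5 * math.cos(2 * math.pi * n / (num_taps - 1))
        taps.append(sinc * hann)
    total = sum(taps)
    return [t / total for t in taps]  # normalize to unity gain at DC


def downsample_by_2(signal, num_taps=63):
    """Halve the sampling rate: anti-alias filter, then keep every second sample."""
    # Cutoff slightly below the new Nyquist frequency (0.25 of the input rate).
    taps = lowpass_taps(num_taps, cutoff=0.23)
    half = num_taps // 2
    out = []
    for i in range(0, len(signal), 2):
        acc = 0.0
        for k, t in enumerate(taps):
            j = i + k - half  # filter centered on sample i (zero delay)
            if 0 <= j < len(signal):
                acc += t * signal[j]
        out.append(acc)
    return out


# Example: 0.1 s of a 440 Hz tone at 48 kHz becomes 0.1 s at 24 kHz.
tone_48k = [math.sin(2 * math.pi * 440 * n / 48000) for n in range(4800)]
tone_24k = downsample_by_2(tone_48k)
print(len(tone_48k), "->", len(tone_24k), "samples")
```

The low-pass step matters because simply dropping every second sample would fold content above 12 kHz back into the audible band.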

Groundtruth: Target speech.
Parallel WaveGAN (ours): Our samples based on this config.


Audio players (three utterances, each presented for the following systems):

Groundtruth	Parallel WaveGAN (ours)

Audio samples (Mandarin)

Audio samples from a model trained on the CSMSC dataset.
Note that the groundtruth samples are 48 kHz; we downsampled them to 24 kHz and limit the frequency range to 80-7600 Hz in the Mel spectrogram calculation.

Groundtruth: Target speech.
Parallel WaveGAN (ours): Our samples based on this config.


Audio players (three utterances, each presented for the following systems):

Groundtruth	Parallel WaveGAN (ours)

Author

Tomoki Hayashi
e-mail: hayashi.tomoki@g.sp.m.is.nagoya-u.ac.jp