ESPnet2-TTS realtime demonstration¶
This notebook provides a demonstration of the realtime E2E-TTS using ESPnet2-TTS and ParallelWaveGAN repo.
ESPnet2-TTS: https://github.com/espnet/espnet/tree/master/egs2/TEMPLATE/tts1
ParallelWaveGAN: https://github.com/kan-bayashi/ParallelWaveGAN
Author: Tomoki Hayashi ([@kan-bayashi](https://github.com/kan-bayashi))
Installation¶
[ ]:
# NOTE: pip shows imcompatible errors due to preinstalled libraries but you do not need to care
!pip install -q espnet==0.10.6 pyopenjtalk==0.2 pypinyin==0.44.0 parallel_wavegan==0.5.4 gdown==4.4.0 typeguard==2.13.3 espnet_model_zoo
Single speaker model demo¶
Model Selection¶
Please select model: English, Japanese, and Mandarin are supported.
Text2wav models: - VITS
Text2mel models: - Tacotron2 - Transformer-TTS - (Conformer) FastSpeech - (Conformer) FastSpeech2
Vocoders: - Parallel WaveGAN - Multi-band MelGAN - HiFiGAN - Style MelGAN.
The terms of use follow that of each corpus. We use the following corpora: -
ljspeech_*
: LJSpeech dataset - https://keithito.com/LJ-Speech-Dataset/ -jsut_*
: JSUT corpus - https://sites.google.com/site/shinnosuketakamichi/publication/jsut -jvs_*
: JVS corpus + JSUT corpus - https://sites.google.com/site/shinnosuketakamichi/research-topics/jvs_corpus - https://sites.google.com/site/shinnosuketakamichi/publication/jsut -tsukuyomi_*
: つくよみちゃんコーパス + JSUT corpus - https://tyc.rei-yumesaki.net/material/corpus/ - https://sites.google.com/site/shinnosuketakamichi/publication/jsut -csmsc_*
: Chinese Standard Mandarin Speech Corpus - https://www.data-baker.com/open_source.html
[ ]:
#@title Choose English model { run: "auto" }
lang = 'English'
tag = 'kan-bayashi/ljspeech_vits' #@param ["kan-bayashi/ljspeech_tacotron2", "kan-bayashi/ljspeech_fastspeech", "kan-bayashi/ljspeech_fastspeech2", "kan-bayashi/ljspeech_conformer_fastspeech2", "kan-bayashi/ljspeech_joint_finetune_conformer_fastspeech2_hifigan", "kan-bayashi/ljspeech_joint_train_conformer_fastspeech2_hifigan", "kan-bayashi/ljspeech_vits"] {type:"string"}
vocoder_tag = "none" #@param ["none", "parallel_wavegan/ljspeech_parallel_wavegan.v1", "parallel_wavegan/ljspeech_full_band_melgan.v2", "parallel_wavegan/ljspeech_multi_band_melgan.v2", "parallel_wavegan/ljspeech_hifigan.v1", "parallel_wavegan/ljspeech_style_melgan.v1"] {type:"string"}
[ ]:
#@title Choose Japanese model { run: "auto" }
lang = 'Japanese'
tag = 'kan-bayashi/jsut_full_band_vits_prosody' #@param ["kan-bayashi/jsut_tacotron2", "kan-bayashi/jsut_transformer", "kan-bayashi/jsut_fastspeech", "kan-bayashi/jsut_fastspeech2", "kan-bayashi/jsut_conformer_fastspeech2", "kan-bayashi/jsut_conformer_fastspeech2_accent", "kan-bayashi/jsut_conformer_fastspeech2_accent_with_pause", "kan-bayashi/jsut_vits_accent_with_pause", "kan-bayashi/jsut_full_band_vits_accent_with_pause", "kan-bayashi/jsut_tacotron2_prosody", "kan-bayashi/jsut_transformer_prosody", "kan-bayashi/jsut_conformer_fastspeech2_tacotron2_prosody", "kan-bayashi/jsut_vits_prosody", "kan-bayashi/jsut_full_band_vits_prosody", "kan-bayashi/jvs_jvs010_vits_prosody", "kan-bayashi/tsukuyomi_full_band_vits_prosody"] {type:"string"}
vocoder_tag = 'none' #@param ["none", "parallel_wavegan/jsut_parallel_wavegan.v1", "parallel_wavegan/jsut_multi_band_melgan.v2", "parallel_wavegan/jsut_style_melgan.v1", "parallel_wavegan/jsut_hifigan.v1"] {type:"string"}
[ ]:
#@title Choose Mandarin model { run: "auto" }
lang = 'Mandarin'
tag = 'kan-bayashi/csmsc_full_band_vits' #@param ["kan-bayashi/csmsc_tacotron2", "kan-bayashi/csmsc_transformer", "kan-bayashi/csmsc_fastspeech", "kan-bayashi/csmsc_fastspeech2", "kan-bayashi/csmsc_conformer_fastspeech2", "kan-bayashi/csmsc_vits", "kan-bayashi/csmsc_full_band_vits"] {type: "string"}
vocoder_tag = "none" #@param ["none", "parallel_wavegan/csmsc_parallel_wavegan.v1", "parallel_wavegan/csmsc_multi_band_melgan.v2", "parallel_wavegan/csmsc_hifigan.v1", "parallel_wavegan/csmsc_style_melgan.v1"] {type:"string"}
Model Setup¶
[ ]:
from espnet2.bin.tts_inference import Text2Speech
from espnet2.utils.types import str_or_none
text2speech = Text2Speech.from_pretrained(
model_tag=str_or_none(tag),
vocoder_tag=str_or_none(vocoder_tag),
device="cuda",
# Only for Tacotron 2 & Transformer
threshold=0.5,
# Only for Tacotron 2
minlenratio=0.0,
maxlenratio=10.0,
use_att_constraint=False,
backward_window=1,
forward_window=3,
# Only for FastSpeech & FastSpeech2 & VITS
speed_control_alpha=1.0,
# Only for VITS
noise_scale=0.333,
noise_scale_dur=0.333,
)
Synthesis¶
[ ]:
import time
import torch
# decide the input sentence by yourself
print(f"Input your favorite sentence in {lang}.")
x = input()
# synthesis
with torch.no_grad():
start = time.time()
wav = text2speech(x)["wav"]
rtf = (time.time() - start) / (len(wav) / text2speech.fs)
print(f"RTF = {rtf:5f}")
# let us listen to generated samples
from IPython.display import display, Audio
display(Audio(wav.view(-1).cpu().numpy(), rate=text2speech.fs))
Multi-speaker Model Demo¶
Model Selection¶
Now we provide only English multi-speaker pretrained model.
The terms of use follow that of each corpus. We use the following corpora: -
libritts_*
: LibriTTS corpus - http://www.openslr.org/60 -vctk_*
: English Multi-speaker Corpus for CSTR Voice Cloning Toolkit - http://www.udialogue.org/download/cstr-vctk-corpus.html
[ ]:
#@title English multi-speaker pretrained model { run: "auto" }
lang = 'English'
tag = 'kan-bayashi/vctk_full_band_multi_spk_vits' #@param ["kan-bayashi/vctk_gst_tacotron2", "kan-bayashi/vctk_gst_transformer", "kan-bayashi/vctk_xvector_tacotron2", "kan-bayashi/vctk_xvector_transformer", "kan-bayashi/vctk_xvector_conformer_fastspeech2", "kan-bayashi/vctk_gst+xvector_tacotron2", "kan-bayashi/vctk_gst+xvector_transformer", "kan-bayashi/vctk_gst+xvector_conformer_fastspeech2", "kan-bayashi/vctk_multi_spk_vits", "kan-bayashi/vctk_full_band_multi_spk_vits", "kan-bayashi/libritts_xvector_transformer", "kan-bayashi/libritts_xvector_conformer_fastspeech2", "kan-bayashi/libritts_gst+xvector_transformer", "kan-bayashi/libritts_gst+xvector_conformer_fastspeech2", "kan-bayashi/libritts_xvector_vits"] {type:"string"}
vocoder_tag = "none" #@param ["none", "parallel_wavegan/vctk_parallel_wavegan.v1.long", "parallel_wavegan/vctk_multi_band_melgan.v2", "parallel_wavegan/vctk_style_melgan.v1", "parallel_wavegan/vctk_hifigan.v1", "parallel_wavegan/libritts_parallel_wavegan.v1.long", "parallel_wavegan/libritts_multi_band_melgan.v2", "parallel_wavegan/libritts_hifigan.v1", "parallel_wavegan/libritts_style_melgan.v1"] {type:"string"}
Model Setup¶
[ ]:
from espnet2.bin.tts_inference import Text2Speech
from espnet2.utils.types import str_or_none
text2speech = Text2Speech.from_pretrained(
model_tag=str_or_none(tag),
vocoder_tag=str_or_none(vocoder_tag),
device="cuda",
# Only for Tacotron 2 & Transformer
threshold=0.5,
# Only for Tacotron 2
minlenratio=0.0,
maxlenratio=10.0,
use_att_constraint=False,
backward_window=1,
forward_window=3,
# Only for FastSpeech & FastSpeech2 & VITS
speed_control_alpha=1.0,
# Only for VITS
noise_scale=0.333,
noise_scale_dur=0.333,
)
Speaker selection¶
[ ]:
import glob
import os
import numpy as np
import kaldiio
# Get model directory path
from espnet_model_zoo.downloader import ModelDownloader
d = ModelDownloader()
model_dir = os.path.dirname(d.download_and_unpack(tag)["train_config"])
# X-vector selection
spembs = None
if text2speech.use_spembs:
xvector_ark = [p for p in glob.glob(f"{model_dir}/../../dump/**/spk_xvector.ark", recursive=True) if "tr" in p][0]
xvectors = {k: v for k, v in kaldiio.load_ark(xvector_ark)}
spks = list(xvectors.keys())
# randomly select speaker
random_spk_idx = np.random.randint(0, len(spks))
spk = spks[random_spk_idx]
spembs = xvectors[spk]
print(f"selected spk: {spk}")
# Speaker ID selection
sids = None
if text2speech.use_sids:
spk2sid = glob.glob(f"{model_dir}/../../dump/**/spk2sid", recursive=True)[0]
with open(spk2sid) as f:
lines = [line.strip() for line in f.readlines()]
sid2spk = {int(line.split()[1]): line.split()[0] for line in lines}
# randomly select speaker
sids = np.array(np.random.randint(1, len(sid2spk)))
spk = sid2spk[int(sids)]
print(f"selected spk: {spk}")
# Reference speech selection for GST
speech = None
if text2speech.use_speech:
# you can change here to load your own reference speech
# e.g.
# import soundfile as sf
# speech, fs = sf.read("/path/to/reference.wav")
# speech = torch.from_numpy(speech).float()
speech = torch.randn(50000,) * 0.01
Synthesis¶
[ ]:
import time
import torch
# decide the input sentence by yourself
print(f"Input your favorite sentence in {lang}.")
x = input()
# synthesis
with torch.no_grad():
start = time.time()
wav = text2speech(x, speech=speech, spembs=spembs, sids=sids)["wav"]
rtf = (time.time() - start) / (len(wav) / text2speech.fs)
print(f"RTF = {rtf:5f}")
# let us listen to generated samples
from IPython.display import display, Audio
display(Audio(wav.view(-1).cpu().numpy(), rate=text2speech.fs))