Published: 2019-07-01

Abstract

We propose an end-to-end text-to-speech (TTS) synthesis model that explicitly uses information from pre-trained embeddings of the text. Recent work in natural language processing has developed self-supervised representations of text that have proven very effective as pre-training for language understanding tasks. We propose using one such pre-trained representation (BERT) to encode input phrases, as an additional input to a Tacotron2-based sequence-to-sequence TTS model. We hypothesize that the text embeddings contain information about the semantics of the phrase and the importance of each word, which should help TTS systems produce more natural prosody and pronunciation. We conduct subjective listening tests of our proposed models using the 24-hour LJSpeech corpus, finding that they improve mean opinion scores modestly but significantly over a baseline TTS model without pre-trained text embedding input.
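The page itself contains no code, but as a rough illustration of the idea described in the abstract, the sketch below shows one way frozen BERT subword embeddings could be supplied to a Tacotron2-style sequence-to-sequence TTS model as an additional input. It uses the Hugging Face transformers library; the class name, the projection sizes, and the choice to concatenate the projected BERT states with the text-encoder outputs along the time axis are illustrative assumptions, not the authors' exact conditioning mechanism.

# Illustrative sketch only (not the authors' code): extract frozen BERT subword
# embeddings for an input phrase and expose them, alongside the usual
# Tacotron2-style encoder states, to the decoder's attention. The class name,
# dimensions, and concatenation-over-time conditioning are assumptions.
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer


class BertConditionedEncoder(nn.Module):
    def __init__(self, encoder_dim=512, bert_dim=768):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        for p in self.bert.parameters():   # keep the pre-trained BERT frozen
            p.requires_grad = False
        # Project 768-d BERT states down to the TTS encoder dimension.
        self.proj = nn.Linear(bert_dim, encoder_dim)

    def forward(self, encoder_outputs, input_ids, attention_mask):
        # encoder_outputs: (B, T_text, encoder_dim) from the Tacotron2 text encoder
        with torch.no_grad():
            bert_states = self.bert(input_ids=input_ids,
                                    attention_mask=attention_mask).last_hidden_state
        bert_states = self.proj(bert_states)         # (B, T_subword, encoder_dim)
        # Concatenate along time so the decoder's attention can also attend over
        # subword-level BERT states (one way to add them as an "extra input").
        return torch.cat([encoder_outputs, bert_states], dim=1)


if __name__ == "__main__":
    tok = BertTokenizer.from_pretrained("bert-base-uncased")
    batch = tok(["it has obtained the services of outside consultants"],
                return_tensors="pt")
    dummy_encoder_out = torch.randn(1, 60, 512)      # stand-in Tacotron2 encoder states
    module = BertConditionedEncoder()
    out = module(dummy_encoder_out, batch["input_ids"], batch["attention_mask"])
    print(out.shape)                                 # (1, 60 + T_subword, 512)

The "phrase-level model" on this page presumably conditions on a single utterance-level embedding (for example a pooled BERT vector) rather than the per-subword sequence; that variant would replace the time-axis concatenation with a broadcast of one pooled vector, though this detail is likewise an assumption here rather than something stated on the page.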

Generated examples

Baseline: [audio sample]
Phrase-level model: [audio sample]
Subword-level model: [audio sample]

Examples from the A/B forced-choice test with high agreement

Examples where the subword-level model was preferred

result in some degree of interference with the personal liberty of those involved.

Subword: [audio sample]
Baseline: [audio sample]

In June 1964, the Secret Service sent to a number of Federal law enforcement and intelligence agencies

Subword: [audio sample]
Baseline: [audio sample]

determination to use a means, other than legal or peaceful, to satisfy his grievance, end quote, within the meaning of the new criteria.

Subword: [audio sample]
Baseline: [audio sample]

Examples where the baseline model was preferred

it has obtained the services of outside consultants, such as the Rand Corporation,

Subword: [audio sample]
Baseline: [audio sample]

and from a specialist in psychiatric prognostication at Walter Reed Hospital.

Subword: [audio sample]
Baseline: [audio sample]

Citation

@inproceedings{hayashi2019pretrained,
  title={Pre-trained Text Embeddings for Enhanced Text-to-Speech Synthesis},
  author={Hayashi, Tomoki and Watanabe, Shinji and Toda, Tomoki and Takeda, Kazuya and Toshniwal, Shubham and Livescu, Karen},
  booktitle={Interspeech 2019 (Accepted)},
  year={2019}
}