Yuxuan Wang - Tacotron: Towards End-to-End Speech Synthesis (2017)
Created: January 20, 2017 / Updated: November 2, 2024 / Status: finished / 2 min read (~364 words)
- A text-to-speech synthesis system typically consists of multiple stages, such as
    - a text analysis frontend
    - an acoustic model
    - an audio synthesis module
- It is common for statistical parametric TTS to have a text frontend that extracts various linguistic features, a duration model, an acoustic feature prediction model, and a complex signal-processing-based vocoder
- TTS is a large-scale inverse problem: a highly compressed source (text) is "decompressed" into audio
- Tacotron is an end-to-end generative TTS model based on the sequence-to-sequence (seq2seq) with attention paradigm
- The model takes characters as input and outputs a raw spectrogram, using several techniques to improve the capability of a vanilla seq2seq model (a minimal sketch of the backbone is given below)
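To make the seq2seq-with-attention backbone concrete, here is a toy character-to-spectrogram model in PyTorch. This is a hypothetical minimal sketch, not Tacotron's actual architecture (which adds CBHG modules, pre-nets, and predicts multiple frames per decoder step); all names and dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ToySeq2SeqTTS(nn.Module):
    """Toy character-to-spectrogram model: embedding -> GRU encoder ->
    additive-attention GRU decoder emitting one frame per step.
    Illustrative only; Tacotron itself uses CBHG modules, pre-nets,
    and predicts r frames per decoder step."""

    def __init__(self, n_chars=64, emb_dim=128, hid=256, n_mels=80):
        super().__init__()
        self.n_mels = n_mels
        self.embed = nn.Embedding(n_chars, emb_dim)
        self.encoder = nn.GRU(emb_dim, hid, batch_first=True)
        self.score = nn.Linear(2 * hid, 1)            # additive attention score
        self.decoder = nn.GRUCell(n_mels + hid, hid)
        self.frame_out = nn.Linear(hid, n_mels)

    def forward(self, chars, n_frames):
        enc_out, h = self.encoder(self.embed(chars))  # enc_out: (B, T, hid)
        h = h.squeeze(0)                              # decoder state: (B, hid)
        frame = enc_out.new_zeros(chars.size(0), self.n_mels)  # <GO> frame
        frames = []
        for _ in range(n_frames):
            # Score every encoder step against the current decoder state,
            # then take the attention-weighted context vector.
            query = h.unsqueeze(1).expand_as(enc_out)
            weights = torch.softmax(
                self.score(torch.cat([enc_out, query], dim=-1)), dim=1)
            context = (weights * enc_out).sum(dim=1)  # (B, hid)
            h = self.decoder(torch.cat([frame, context], dim=-1), h)
            frame = self.frame_out(h)
            frames.append(frame)
        return torch.stack(frames, dim=1)             # (B, n_frames, n_mels)

# Two dummy character sequences of length 17 -> 50 spectrogram frames each.
model = ToySeq2SeqTTS()
spec = model(torch.randint(0, 64, (2, 17)), n_frames=50)  # (2, 50, 80)
```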
- WaveNet
    - Slow due to its sample-level autoregressive nature
    - Requires conditioning on linguistic features from an existing TTS frontend (thus it is not end-to-end: it only replaces the vocoder and the acoustic model)
- DeepVoice
    - Replaces every component in a typical TTS pipeline with a corresponding neural network
    - Each component is trained independently, and it is nontrivial to change the system to train in an end-to-end fashion
- Wang et al.
    - The earliest work exploring end-to-end TTS using seq2seq with attention
    - It requires a pre-trained hidden Markov model (HMM) aligner to help the seq2seq model learn the alignment
    - A few tricks are needed to get the model to train, which the authors note hurt prosody
    - It predicts vocoder parameters and hence needs a vocoder
    - The model is trained on phoneme inputs and the experimental results seem somewhat limited
- Char2Wav
    - Predicts vocoder parameters that are then fed to a SampleRNN neural vocoder, whereas Tacotron directly predicts a raw spectrogram
    - The seq2seq and SampleRNN models need to be pre-trained separately, while Tacotron can be trained from scratch
- The backbone of Tacotron is a seq2seq model with attention
- The model includes an encoder, an attention-based decoder, and a post-processing net
- The model takes characters as input and produces spectrogram frames, which are then converted to waveforms using the Griffin-Lim algorithm (sketched below)
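As a minimal sketch of that spectrogram-to-waveform step, here is Griffin-Lim reconstruction using librosa's `griffinlim`; the synthetic sine-wave spectrogram is a stand-in assumption for a model prediction.

```python
import numpy as np
import librosa

# Stand-in for a magnitude spectrogram predicted by the model
# (derived here from a synthetic sine wave so the example runs).
sr = 22050
t = np.linspace(0, 1.0, sr, endpoint=False)
y = 0.5 * np.sin(2 * np.pi * 440.0 * t)
S = np.abs(librosa.stft(y, n_fft=1024, hop_length=256))

# Griffin-Lim iteratively estimates the phase that the magnitude
# spectrogram discards, then inverts the STFT back to a waveform.
y_rec = librosa.griffinlim(S, n_iter=60, hop_length=256, win_length=1024)
```

The paper treats Griffin-Lim as a simple placeholder, noting that a trainable spectrogram-to-waveform inverter is left as future work.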
- In mean opinion score (MOS) evaluations, Tacotron outperforms the parametric system but is outperformed by the concatenative system
- Wang, Yuxuan, et al. "Tacotron: Towards End-to-End Speech Synthesis." arXiv preprint arXiv:1703.10135 (2017).