VALL-E
Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers
Paper: https://arxiv.org/abs/2301.02111

Abstract. We introduce a language modeling approach for text to speech synthesis (TTS). Specifically, we train a neural codec language model (called VALL-E) using discrete codes derived from an off-the-shelf neural audio codec model, and regard TTS as a conditional language modeling task rather than continuous signal regression as in previous work. During the pre-training stage, we scale up the TTS training data to 60K hours of English speech, which is hundreds of times larger than existing systems. VALL-E exhibits in-context learning capabilities and can be used to synthesize high-quality personalized speech with only a 3-second enrolled recording of an unseen speaker as an acoustic prompt. Experimental results show that VALL-E significantly outperforms the state-of-the-art zero-shot TTS system in terms of speech naturalness and speaker similarity. In addition, we find that VALL-E can preserve the speaker's emotion and the acoustic environment of the acoustic prompt in synthesis.
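To make the "TTS as conditional language modeling" idea concrete, below is a minimal, hypothetical sketch of the zero-shot inference flow the abstract describes: the 3-second enrollment is encoded into discrete codec tokens (EnCodec-style, roughly 75 frames per second at 24 kHz), the target text is phonemized, and an autoregressive LM predicts first-quantizer codec tokens conditioned on both prompts. This is not the official implementation; `TinyAR`, its dimensions, and the fixed-length generation loop are stand-ins, and the non-autoregressive stage that fills the remaining quantizers before codec decoding is omitted.

```python
# Hypothetical sketch of VALL-E-style zero-shot inference (not the official code).
# Assumes an EnCodec-style codec (discrete residual-quantizer codes) and a
# phonemizer; both are replaced by random tensors here to keep the sketch runnable.
import torch
import torch.nn as nn

CODE_VOCAB = 1024   # codec codebook size per quantizer (assumed)
PHONE_VOCAB = 256   # phoneme vocabulary size (assumed)
D_MODEL = 256

class TinyAR(nn.Module):
    """Decoder-only LM over [phoneme prompt ; acoustic prompt ; generated codes]."""
    def __init__(self):
        super().__init__()
        self.phone_emb = nn.Embedding(PHONE_VOCAB, D_MODEL)
        self.code_emb = nn.Embedding(CODE_VOCAB, D_MODEL)
        layer = nn.TransformerEncoderLayer(D_MODEL, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(D_MODEL, CODE_VOCAB)

    def forward(self, phones, codes):
        # Concatenate phoneme and acoustic-token embeddings into one sequence,
        # apply a causal mask, and predict the next first-quantizer code.
        x = torch.cat([self.phone_emb(phones), self.code_emb(codes)], dim=1)
        mask = nn.Transformer.generate_square_subsequent_mask(x.size(1))
        h = self.backbone(x, mask=mask)
        return self.head(h[:, -1])

@torch.no_grad()
def synthesize(phones, prompt_codes, ar_model, max_frames=200):
    """phones: (1, Lp) phoneme ids for the target text.
    prompt_codes: (1, Tp) first-quantizer codes of the 3-second enrollment.
    Returns generated first-quantizer codes; a NAR model (omitted) would fill
    the remaining quantizers before the codec decoder renders the waveform."""
    codes = prompt_codes
    for _ in range(max_frames):
        logits = ar_model(phones, codes)
        next_code = torch.multinomial(logits.softmax(-1), 1)  # sample, not argmax
        codes = torch.cat([codes, next_code], dim=1)
    return codes[:, prompt_codes.size(1):]

# Toy usage with randomly initialized weights, only to show the data flow.
ar = TinyAR()
phones = torch.randint(0, PHONE_VOCAB, (1, 20))   # phonemized target text
prompt = torch.randint(0, CODE_VOCAB, (1, 225))   # ~3 s at ~75 codec frames/s
generated = synthesize(phones, prompt, ar)
print(generated.shape)                            # (1, 200) first-quantizer codes
```

Because the acoustic prompt sits in the conditioning context exactly like ordinary history tokens, speaker identity, emotion, and acoustic environment can be carried over by in-context learning rather than by an explicit speaker encoder.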
LibriSpeech Samples
Text | Speaker Prompt | Ground Truth | VALL-E | LibriTTS feiteng | LibriTTS ours | LibriTTS-R ours |
---|---|---|---|---|---|---|
They moved thereafter cautiously about the hut groping before and about them to find something to show that Warrenton had fulfilled his mission. | ||||||
And lay me down in thy cold bed and leave my shining lot. | ||||||
Number ten, fresh nelly is waiting on you, good night husband. | ||||||
Yea, his honourable worship is within, but he hath a godly minister or two with him, and likewise a leech. | ||||||