VALL-E

Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers

Paper: https://arxiv.org/abs/2301.02111

Abstract. We introduce a language modeling approach for text to speech synthesis (TTS). Specifically, we train a neural codec language model (called VALL-E) using discrete codes derived from an off-the-shelf neural audio codec model, and regard TTS as a conditional language modeling task rather than continuous signal regression as in previous work. During the pre-training stage, we scale up the TTS training data to 60K hours of English speech which is hundreds of times larger than existing systems. VALL-E emerges in-context learning capabilities and can be used to synthesize high-quality personalized speech with only a 3-second enrolled recording of an unseen speaker as an acoustic prompt. Experiment results show that VALL-E significantly outperforms the state-of-the-art zero-shot TTS system in terms of speech naturalness and speaker similarity. In addition, we find VALL-E could preserve the speaker's emotion and acoustic environment of the acoustic prompt in synthesis.
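
To give a rough sense of what "discrete codes" means for a 3-second prompt, here is a minimal back-of-the-envelope sketch. The codec settings are assumptions that do not appear in the abstract above: an EnCodec-style 24 kHz codec producing 75 frames per second with 8 codebooks of 1024 entries each. The helper names are hypothetical and only for illustration.

```python
import math

# Assumed codec settings (NOT stated in the abstract): an EnCodec-style
# 24 kHz neural codec with residual vector quantization.
FRAME_RATE_HZ = 75      # codec frames per second (assumption)
NUM_CODEBOOKS = 8       # quantizer levels per frame (assumption)
CODEBOOK_SIZE = 1024    # entries per codebook (assumption)


def prompt_token_count(seconds, frame_rate_hz=FRAME_RATE_HZ,
                       num_codebooks=NUM_CODEBOOKS):
    """Return (frames, total discrete codes) for an audio clip."""
    frames = round(seconds * frame_rate_hz)
    return frames, frames * num_codebooks


def bitrate_bps(frame_rate_hz=FRAME_RATE_HZ, num_codebooks=NUM_CODEBOOKS,
                codebook_size=CODEBOOK_SIZE):
    """Bits per second implied by the codec settings."""
    return frame_rate_hz * num_codebooks * math.log2(codebook_size)


frames, tokens = prompt_token_count(3.0)
print(frames, tokens)   # 225 1800
print(bitrate_bps())    # 6000.0
```

Under these assumed settings, the 3-second acoustic prompt becomes a 225 x 8 grid of integer codes, which is the kind of sequence the language model conditions on.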

LibriSpeech Samples

Text Speaker Prompt Ground Truth VALL-E LibriTTS feiteng LibriTTS ours LibriTTS-R ours
They moved thereafter cautiously about the hut groping before and about them to find something to show that Warrenton had fulfilled his mission.
And lay me down in thy cold bed and leave my shining lot.
Number ten, fresh nelly is waiting on you, good night husband.
Yea, his honourable worship is within, but he hath a godly minister or two with him, and likewise a leech.