Speak Foreign Languages with Your Own Voice: Cross-Lingual Neural Codec Language Modeling


Ziqiang Zhang*,   Long Zhou*,   Chengyi Wang,   Sanyuan Chen,   Yu Wu,   Shujie Liu,  
Zhuo Chen,   Yanqing Liu,   Huaming Wang,   Jinyu Li,   Lei He,   Sheng Zhao,   Furu Wei


Abstract. We propose VALL-E X, a cross-lingual neural codec language model for cross-lingual speech synthesis. Specifically, we extend VALL-E and train a multilingual conditional codec language model to predict the acoustic token sequences of target-language speech, using both the source-language speech and the target-language text as prompts. VALL-E X inherits strong in-context learning capabilities and can be applied to zero-shot cross-lingual text-to-speech synthesis and zero-shot speech-to-speech translation. Experimental results show that it can generate high-quality speech in the target language from just one source-language utterance as a prompt, while preserving the unseen speaker's voice, emotion, and acoustic environment. Moreover, VALL-E X effectively alleviates the foreign accent problem, and the accent can be controlled by a language ID.

This page is for research demonstration purposes only.


Model Overview

Figure. The overall framework of VALL-E X, which can synthesize personalized speech in another language for a monolingual speaker. Taking as prompts the phoneme sequences derived from the source and target text, together with the source acoustic tokens derived from an audio codec model, VALL-E X produces the acoustic tokens in the target language, which can then be decompressed into the target speech waveform. Thanks to its powerful in-context learning capabilities, VALL-E X does not require cross-lingual speech data from the same speakers for training, and can perform various zero-shot cross-lingual speech generation tasks, such as cross-lingual text-to-speech synthesis and speech-to-speech translation.
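
The conditioning inputs named in the caption can be sketched as a small data structure. This is only an illustration, not the authors' implementation: the names `CodecLMPrompt`, `build_prompt`, and `switch_language_id` are hypothetical, the concatenation order of source and target phonemes is an assumption, and the phoneme and token values are placeholders rather than real model inputs.

```python
from dataclasses import dataclass, replace
from typing import List, Sequence


@dataclass(frozen=True)
class CodecLMPrompt:
    """Hypothetical container for the prompts described in the caption."""
    phonemes: List[str]         # source phonemes followed by target phonemes (assumed order)
    acoustic_prefix: List[int]  # codec tokens of the source utterance
    language_id: str            # language ID used to control the accent


def build_prompt(
    src_phonemes: Sequence[str],
    tgt_phonemes: Sequence[str],
    src_acoustic_tokens: Sequence[int],
    language_id: str,
) -> CodecLMPrompt:
    """Assemble the text and acoustic prompts for one cross-lingual request."""
    return CodecLMPrompt(
        phonemes=list(src_phonemes) + list(tgt_phonemes),
        acoustic_prefix=list(src_acoustic_tokens),
        language_id=language_id,
    )


def switch_language_id(prompt: CodecLMPrompt, language_id: str) -> CodecLMPrompt:
    """Return a copy of the prompt that differs only in its language ID."""
    return replace(prompt, language_id=language_id)


# Placeholder inputs: a Chinese source utterance and an English target text.
prompt = build_prompt(
    src_phonemes=["n", "i", "h", "ao"],
    tgt_phonemes=["HH", "AH", "L", "OW"],
    src_acoustic_tokens=[17, 403, 403, 88],
    language_id="en",
)
accented = switch_language_id(prompt, "zh")
```

Under these assumptions, the two prompts share the same phonemes and acoustic prefix and differ only in the language ID, which mirrors the paired "English LID" and "Chinese LID" conditions in the Foreign Accent Control samples on this page.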

Zero-Shot Cross-Lingual Text to Speech

1) English TTS with Chinese prompts (English samples are from LibriSpeech, Chinese samples are from the EMIME and AISHELL-3 datasets).

English Text | Chinese Speaker Prompt | Baseline | VALL-E X
Look a little closer while our guide lets the light of his lamp fall upon the black wall at your side.
He honours whatever he recognizes in himself, such morality equals self-glorification.
One dark night at the head of a score of his tribe, he fell upon Wabigoon's camp, his object being the abduction of the princess.
There could be little art in this last and final round of fencing.
It's the first time Hilda has been to our house and Tom introduces her around.
It was youth and poverty and proximity and everything was young and kindly.

2) Chinese TTS with English prompts (Chinese samples are from EMIME and the AISHELL-3 test set, English samples are from LibriSpeech).

Chinese Text | English Speaker Prompt | VALL-E X

Zero-Shot Speech-to-Speech Translation

1) Chinese to English Translation on the EMIME dataset.

Chinese Speech | English Ground Truth | Baseline | VALL-E X Trans

2) English to Chinese Translation on the EMIME dataset.

English Speech | Chinese Ground Truth | VALL-E X Trans

3) Chinese to English Translation using the AISHELL-3 test set.

Chinese Text | Chinese Speech | Baseline | VALL-E X Trans

4) English to Chinese Translation using LibriSpeech dev-clean.

English Text | English Speech | VALL-E X Trans
His instant of panic was followed by a small sharp blow high on his chest.
The last two days of the voyage Bartley found almost intolerable.
She merely brushed his cheek with her lips and put a hand lightly and joyously on either shoulder.
But in this awful moment of the danger of the church, their vow was superseded by a more sublime and indispensable duty.
We've lost the key of the cellar and there's nothing out except water and I don't think you'd care for that.
He had been late he had offered no excuse no explanation.

Foreign Accent Control

1) English to Chinese on the EMIME dataset.

English Speech (Prompt) | Chinese Speech (Ground Truth) | VALL-E X with English LID | VALL-E X with Chinese LID

2) Chinese to English on the EMIME dataset.

Chinese Speech (Prompt) | English Speech (Ground Truth) | VALL-E X with English LID | VALL-E X with Chinese LID

Voice Emotion Maintenance

VALL-E X Trans can synthesize personalized target speech while maintaining the emotion of the source speech. The source audio samples are drawn from the Emotional Voices Database (EmoV-DB).

Emotion | English Speech | VALL-E X Trans

Code-Switch Speech Synthesis

Code-Switch Text | Prompts | VALL-E X

Ethics Statement

Since VALL-E X can synthesize speech that maintains speaker identity, the model carries potential risks of misuse, such as spoofing voice identification or impersonating a specific speaker. We conducted the experiments under the assumption that the user agrees to be the target speaker in speech synthesis. If the model is generalized to unseen speakers in the real world, it should be accompanied by a protocol to ensure that the speaker approves the use of their voice, as well as a synthesized speech detection model.