VALL-E X

Speak Foreign Languages with Your Own Voice: Cross-Lingual Neural Codec Language Modeling

[Paper]


Ziqiang Zhang*,   Long Zhou*,   Chengyi Wang,   Sanyuan Chen,   Yu Wu,   Shujie Liu,  
Zhuo Chen,   Yanqing Liu,   Huaming Wang,   Jinyu Li,   Lei He,   Sheng Zhao,   Furu Wei

Microsoft

Abstract. We propose a cross-lingual neural codec language model, VALL-E X, for cross-lingual speech synthesis. Specifically, we extend VALL-E and train a multi-lingual conditional codec language model to predict the acoustic token sequences of the target language speech by using both the source language speech and the target language text as prompts. VALL-E X inherits strong in-context learning capabilities and can be applied for zero-shot cross-lingual text-to-speech synthesis and zero-shot speech-to-speech translation tasks. Experimental results show that it can generate high-quality speech in the target language via just one speech utterance in the source language as a prompt while preserving the unseen speaker's voice, emotion, and acoustic environment. Moreover, VALL-E X effectively alleviates the foreign accent problems, which can be controlled by a language ID.

This page is for research demonstration purposes only.

Contents

Model Overview

Figure. The overall framework of VALL-E X, which can synthesize personalized speech in another language for a monolingual speaker. Taking the phoneme sequences derived from the source and target text, and the source acoustic tokens derived from an audio codec model as prompts, VALL-E X is able to produce the acoustic tokens in the target language, which can be then decompressed to the target speech waveform. Thanks to its powerful in-context learning capabilities, VALL-E X does not require cross-lingual speech data of the same speakers for training, and can perform various zero-shot cross-lingual speech generation tasks, such as cross-lingual text-to-speech synthesis and speech-to-speech translation.

Zero-Shot Cross-Lingual Text to Speech

1) English TTS with Chinese prompts (English samples are from LibriSpeech, Chinese samples are from EMIME and AISHELL-3 dataset).

English Text Chinese Speaker Prompt Baseline VALL-E X
Look a little closer while our guide lets the light of his lamp fall upon the black wall at your side.
He honours whatever he recognizes in himself, such morality equals self-glorification.
One dark night at the head of a score of his tribe, he fell upon Wabigoon's camp, his object being the abduction of the princess.
There could be little art in this last and final round of fencing.
It's the first time Hilda has been to our house and Tom introduces her around.
It was youth and poverty and proximity and everything was young and kindly.

2) Chinese TTS with English prompts (Chinese samples are from EMIME and AISHELL-3 test, English samples are from LibriSpeech).

Chinese Text English Speaker Prompt VALL-E X
坚持房地产调控政策不动摇。
值得关注的是从二零一零年到二零一四年。
两千六百四十八万二千五百四十六。
汇聚部分全球领先品牌的下一代技术创新。
商品房销售情况也传递出了更多暖意。
最低首付款比例为百分之一。

Zero-Shot Speech-to-Speech Translation

1) Chinese to English Translation on EMIME dataset.

Chinese Speech English Ground Truth Baseline VALL-E X Trans

2) English to Chinese Translation on EMIME dataset.

English Speech Chinese Ground Truth VALL-E X Trans

3) Chinese to English Translation using AISHELL-3 test.

Chinese Text Chinese Speech Baseline VALL-E X Trans
我们就坐在他的位于半山坡的办公室里。
是巴西女选手在热身泳道中违规使用脚扑影响他人。
更在于这项运动本身具有着极其丰富的精神内涵。
美国并没有统一的全国高中联赛。
本市还要打造慈善捐赠事业的阳光工程。
达茜兮会称呼我为瓦齐里先生。

4) English to Chinese Translation using LibriSpeech dev-clean.

English Text English Speech VALL-E X Trans
His instant of panic was followed by a small sharp blow high on his chest.
The last two days of the voyage Bartley found almost intolerable.
She merely brushed his cheek with her lips and put a hand lightly and joyously on either shoulder.
But in this awful moment of the danger of the church. their vow was superseded by a more sublime and indispensable duty.
We've lost the key of the cellar and there's nothing out except water and i don't think you'd care for that.
He had been late he had offered no excuse no explanation.

Foreign Accent Control

1) English to Chinese on EMIME dataset.

English Speech (Prompt) Chinese Speech (Ground Truth) VALL-E X with English LID VALL-E X with Chinese LID

2) Chinese to English on EMIME dataset.

Chinese Speech (Prompt) English Speech (Ground Truth) VALL-E X with English LID VALL-E X with Chinese LID

Voice Emotion Maintenance

VALL-E X Trans can synthesize personalized target speech while maintaining the emotion in the source speech. The source audio are sampled from the Emotional Voices Database EmoV-DB.

Emotion English Speech VALL-E X Trans
Neutral
Amused
Sleepiness
Anger
Disgust

Code-Switch Speech Synthesis

Code-Switch Text Prompts VALL-E X
播放歌曲 BEST FRIEND。
我想去travel一下,放松一下自己。
他是一个funny的人,总是讲笑话。
这个restaurant很有名,很多人都来吃。
这个pizza很好吃,你要不要try一下。

Ethics Statement

Since VALL-E X could synthesize speech that maintains speaker identity, it may carry potential risks in misuse of the model, such as spoofing voice identification or impersonating a specific speaker. We conducted the experiments under the assumption that the user agree to be the target speaker in speech synthesis. If the model is generalized to unseen speakers in the real world, it should include a protocol to ensure that the speaker approves the use of their voice and a synthesized speech detection model.