Guided-TTS 2: A Diffusion Model for High-quality Adaptive Text-to-Speech with Untranscribed Data
Authors
- Sungwon Kim (*Equal contribution) ksw0306@snu.ac.kr
- Heeseung Kim (*Equal contribution) gmltmd789@snu.ac.kr
- Sungroh Yoon (Corresponding author) sryoon@snu.ac.kr
Abstract
We propose Guided-TTS 2, a diffusion-based generative model for high-quality adaptive TTS using untranscribed data. Guided-TTS 2 combines a speaker-conditional diffusion model with a speaker-dependent phoneme classifier for adaptive text-to-speech. We train the speaker-conditional diffusion model on large-scale untranscribed datasets with a classifier-free guidance method, and then fine-tune the diffusion model on the reference speech of the target speaker for adaptation, which takes only 40 seconds. We demonstrate that Guided-TTS 2 achieves performance comparable to high-quality single-speaker TTS baselines in terms of speech quality and speaker similarity using only ten seconds of untranscribed data. We further show that Guided-TTS 2 outperforms adaptive TTS baselines on multi-speaker datasets even in a zero-shot adaptation setting. Guided-TTS 2 can adapt to a wide range of voices using only untranscribed speech, which enables adaptive TTS with the voices of non-human characters such as Gollum in “The Lord of the Rings”.
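The classifier-free guidance mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `classifier_free_guidance`, the parameter `scale`, and the exact formula are assumptions that follow one common formulation, in which the speaker-conditional score is pushed further away from the unconditional score. The score arrays here are toy placeholders standing in for the outputs of a diffusion model.

```python
import numpy as np


def classifier_free_guidance(cond_score, uncond_score, scale):
    """Combine conditional and unconditional score estimates.

    Assumed formulation (hypothetical, for illustration only):
        guided = cond + scale * (cond - uncond)
    With scale = 0 this returns the conditional score unchanged;
    larger values emphasize the conditioning (here, the speaker).
    """
    cond = np.asarray(cond_score, dtype=float)
    uncond = np.asarray(uncond_score, dtype=float)
    return cond + scale * (cond - uncond)


# Toy score estimates for a two-dimensional sample.
speaker_conditional = np.array([1.0, 2.0])
unconditional = np.array([0.0, 1.0])

guided = classifier_free_guidance(speaker_conditional, unconditional, scale=1.0)
```

In this sketch a single model would typically produce both estimates, with the speaker embedding dropped to obtain the unconditional one; how Guided-TTS 2 does this in practice is described in the paper, not here.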
Real-world Data
Transcript: This audio was generated by a text-to-speech model for Steve Jobs. We use ten second untranscribed speech from Steve Jobs’ Stanford Commencement Address.
Reference | Guided-TTS 2 | Guided-TTS 2 (zero-shot) |
---|---|---|
Steve Jobs (00:55 ~ 01:05) |
Transcript: This audio was generated by a text-to-speech model for Sonny. We propose Guided Text-to-Speech 2, a diffusion based generative model for high-quality adaptive text-to-speech using untranscribed data.
Reference | Guided-TTS 2 | Guided-TTS 2 (zero-shot) |
---|---|---|
Heung-min Son (00:04 ~ 00:14) |
Transcript: This audio was generated by a text-to-speech model for Emma Watson. We found that the gap between the zero shot approach and the finetune approach is quite large in the case of real world data.
Reference | Guided-TTS 2 | Guided-TTS 2 (zero-shot) |
---|---|---|
Emma Watson (03:30 ~ 03:40) |
Transcript: This audio was generated by a text-to-speech model for Kobe Bryant. I respect his mamba mentality.
Reference | Guided-TTS 2 |
---|---|
Kobe Bryant (08:37 ~ 09:01) |
Non-human Character
Transcript: This audio was generated by a text-to-speech model for Gollum, which can adapt to non-human characters using untranscribed data.
Reference | Guided-TTS 2 | Guided-TTS 2 (zero-shot) |
---|---|---|
Gollum (00:30 ~ 00:40) |
Transcript: This audio was generated by a text-to-speech for Glados.
Reference | Guided-TTS 2 |
---|---|
Glados (00:00 ~ 00:10) |
Failure Case
Transcript: This audio was generated by a text-to-speech for Moonlight Sonata.
Reference | Guided-TTS 2 |
---|---|
Moonlight Sonata (00:02 ~ 00:12) |
LJSpeech
Transcript: Nor did the methods by which they were perpetrated greatly vary from those in times past.
Sampling Rate | Reference | GT | GT Mel+HiFi-GAN | Grad-TTS | Guided-TTS | Guided-TTS 2 | Guided-TTS 2 (zero-shot) | YourTTS | Meta-StyleSpeech |
---|---|---|---|---|---|---|---|---|---|
22,050Hz | |||||||||
16,000Hz |
Transcript: He was struck with the appearance of the corpse, which was not emaciated, as after a long disease ending in death;
Sampling Rate | Reference | GT | GT Mel+HiFi-GAN | Grad-TTS | Guided-TTS | Guided-TTS 2 | Guided-TTS 2 (zero-shot) | YourTTS | Meta-StyleSpeech |
---|---|---|---|---|---|---|---|---|---|
22,050Hz | |||||||||
16,000Hz |
Transcript: There was unrestrained association of untried and convicted, juvenile with adult prisoners, vagrants, misdemeanants, felons.
Sampling Rate | Reference | GT | GT Mel+HiFi-GAN | Grad-TTS | Guided-TTS | Guided-TTS 2 | Guided-TTS 2 (zero-shot) | YourTTS | Meta-StyleSpeech |
---|---|---|---|---|---|---|---|---|---|
22,050Hz | |||||||||
16,000Hz |
LibriTTS
Transcript: In this connection it should be mentioned that the Association of Edison Illuminating Companies in the same year adopted resolutions unanimously to the effect that the Edison meter was accurate, and that its use was not expensive for stations above one thousand lights; and that the best financial results were invariably secured in a station selling current by meter.
Sampling Rate | Reference | GT | GT Mel+HiFi-GAN | Guided-TTS 2 | Guided-TTS 2 (zero-shot) | YourTTS | Meta-StyleSpeech |
---|---|---|---|---|---|---|---|
22,050Hz | |||||||
16,000Hz |
Transcript: She wandered in the land of clouds thro’ valleys dark, listning Dolors and lamentations: waiting oft beside the dewy grave She stood in silence, listning to the voices of the ground, Till to her own grave plot she came, and there she sat down. And heard this voice of sorrow breathed from the hollow pit.
Sampling Rate | Reference | GT | GT Mel+HiFi-GAN | Guided-TTS 2 | Guided-TTS 2 (zero-shot) | YourTTS | Meta-StyleSpeech |
---|---|---|---|---|---|---|---|
22,050Hz | |||||||
16,000Hz |
Transcript: A transferable ticket to the Haul of Fame.
Sampling Rate | Reference | GT | GT Mel+HiFi-GAN | Guided-TTS 2 | Guided-TTS 2 (zero-shot) | YourTTS | Meta-StyleSpeech |
---|---|---|---|---|---|---|---|
22,050Hz | |||||||
16,000Hz |
VCTK
Transcript: Like last month, it is simply too early to make a call.
Sampling Rate | Reference | GT | GT Mel+HiFi-GAN | Guided-TTS 2 | Guided-TTS 2 (zero-shot) | YourTTS | Meta-StyleSpeech |
---|---|---|---|---|---|---|---|
22,050Hz | |||||||
16,000Hz |
Transcript: It is just part of modern day life.
Sampling Rate | Reference | GT | GT Mel+HiFi-GAN | Guided-TTS 2 | Guided-TTS 2 (zero-shot) | YourTTS | Meta-StyleSpeech |
---|---|---|---|---|---|---|---|
22,050Hz | |||||||
16,000Hz |
Transcript: However, within five minutes they were able to celebrate.
Sampling Rate | Reference | GT | GT Mel+HiFi-GAN | Guided-TTS 2 | Guided-TTS 2 (zero-shot) | YourTTS | Meta-StyleSpeech |
---|---|---|---|---|---|---|---|
22,050Hz | |||||||
16,000Hz |