Guided-TTS: A Diffusion Model for Text-to-Speech via Classifier Guidance (ICML 2022)
Authors
- Heeseung Kim (*Equal contribution) gmltmd789@snu.ac.kr
- Sungwon Kim (*Equal contribution) ksw0306@snu.ac.kr
- Sungroh Yoon (Corresponding author) sryoon@snu.ac.kr
Abstract
We propose Guided-TTS, a high-quality text-to-speech (TTS) model that uses classifier guidance and therefore does not require any transcript of the target speaker. Guided-TTS combines an unconditional diffusion probabilistic model with a separately trained phoneme classifier that provides the guidance. Our unconditional diffusion model learns to generate speech without any context from untranscribed speech data. For TTS synthesis, we guide the generative process of the diffusion model with a phoneme classifier trained on a large-scale speech recognition dataset. We present a norm-based scaling method that reduces the pronunciation errors of classifier guidance in Guided-TTS. We show that Guided-TTS achieves performance comparable to that of the state-of-the-art TTS model, Grad-TTS, without using any LJSpeech transcripts. We further demonstrate that Guided-TTS performs well on diverse datasets, including a long-form untranscribed dataset.
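As a brief illustration of why an unconditional model and a separately trained classifier can be combined, the conditional score decomposes by Bayes' rule. The notation here is an assumption introduced for illustration and does not appear on this page: $x_t$ is the noisy mel-spectrogram at diffusion step $t$, $y$ is the phoneme sequence, and $s$ is the gradient scale.

```latex
\nabla_{x_t} \log p(x_t \mid y)
  = \nabla_{x_t} \log p(x_t) + \nabla_{x_t} \log p(y \mid x_t),
\qquad
\text{guided score} \approx \nabla_{x_t} \log p(x_t) + s \, \nabla_{x_t} \log p(y \mid x_t)
```

The first term comes from the unconditional diffusion model and the second from the phoneme classifier, so the two models can be trained on different data.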
Model Comparison (LJSpeech)
Guided-TTS does NOT use the transcripts of LJSpeech.
Transcript: Nor did the methods by which they were perpetrated greatly vary from those in times past.
GT | GT Mel+HiFi-GAN | Guided-TTS | Glow-TTS | Grad-TTS |
---|---|---|---|---|
Transcript: He was struck with the appearance of the corpse, which was not emaciated, as after a long disease ending in death;
GT | GT Mel+HiFi-GAN | Guided-TTS | Glow-TTS | Grad-TTS |
---|---|---|---|---|
Transcript: There was unrestrained association of untried and convicted, juvenile with adult prisoners, vagrants, misdemeanants, felons.
GT | GT Mel+HiFi-GAN | Guided-TTS | Glow-TTS | Grad-TTS |
---|---|---|---|---|
Generalization to Diverse Datasets
Guided-TTS does NOT use any transcripts of the datasets below, which are treated as untranscribed.
Grad-TTS-ASR: Grad-TTS trained on a paired dataset whose transcripts are generated by ASR.
1. Untranscribed speech data: LJSpeech
Transcript: Nor did the methods by which they were perpetrated greatly vary from those in times past.
GT | GT Mel+HiFi-GAN | Guided-TTS | Grad-TTS-ASR |
---|---|---|---|
Transcript: He was struck with the appearance of the corpse, which was not emaciated, as after a long disease ending in death;
GT | GT Mel+HiFi-GAN | Guided-TTS | Grad-TTS-ASR |
---|---|---|---|
Transcript: There was unrestrained association of untried and convicted, juvenile with adult prisoners, vagrants, misdemeanants, felons.
GT | GT Mel+HiFi-GAN | Guided-TTS | Grad-TTS-ASR |
---|---|---|---|
2. Untranscribed speech data: Hi-Fi TTS (ID: 92)
Transcript: The other, without flinching, lowered and raised his head slowly.
GT | GT Mel+HiFi-GAN | Guided-TTS | Grad-TTS-ASR |
---|---|---|---|
Transcript: And he repeated, as if reconsidering the suggestion conscientiously:
GT | GT Mel+HiFi-GAN | Guided-TTS | Grad-TTS-ASR |
---|---|---|---|
Transcript: For him the plain duty is to fasten the guilt upon as many prominent anarchists as he can on some slight indications he had picked up in the course of his investigation on the spot;
GT | GT Mel+HiFi-GAN | Guided-TTS | Grad-TTS-ASR |
---|---|---|---|
3. Untranscribed speech data: Hi-Fi TTS (ID: 6097)
Transcript: Any river is deep enough to drown a fool
GT | GT Mel+HiFi-GAN | Guided-TTS | Grad-TTS-ASR |
---|---|---|---|
Transcript: The plain was grown over with grass, but he could see no tree therein:
GT | GT Mel+HiFi-GAN | Guided-TTS | Grad-TTS-ASR |
---|---|---|---|
Transcript: The reduction of expense which would result from this appointment would be much more than adequate to the increased expense incurred by the appointment and remuneration of a gentleman of probity and respectability to this office.
GT | GT Mel+HiFi-GAN | Guided-TTS | Grad-TTS-ASR |
---|---|---|---|
4. Untranscribed speech data: Hi-Fi TTS (ID: 9017)
Transcript: Who is to be master of the world?
GT | GT Mel+HiFi-GAN | Guided-TTS | Grad-TTS-ASR |
---|---|---|---|
Transcript: The result of these reflections was that d’Artagnan, without asking information of any kind, alighted, commended the horses to the care of his lackey, entered a small room destined to receive those who wished to be alone, and desired the host to bring him a bottle of his best wine and as good a breakfast as possible
GT | GT Mel+HiFi-GAN | Guided-TTS | Grad-TTS-ASR |
---|---|---|---|
Transcript: No, but I have just met with a terrible adventure!
GT | GT Mel+HiFi-GAN | Guided-TTS | Grad-TTS-ASR |
---|---|---|---|
5. Untranscribed speech data: Blizzard 2013
Transcript: Crawford’s is no common attachment he perseveres, with the hope of creating that regard, which had not been created before.
GT | GT Mel+HiFi-GAN | Guided-TTS |
---|---|---|
Transcript: He was now the Mister Crawford who was addressing herself with ardent, disinterested love whose feelings were apparently become all that was honorable and upright, whose views of happiness were all fixed on a marriage of attachment who was pouring out his sense of her merits, describing and describing again his affection, proving as far as words could prove it, and in the language, tone, and spirit of a man of talent too, that he sought her for her gentleness, and her goodness.
GT | GT Mel+HiFi-GAN | Guided-TTS |
---|---|---|
Transcript: There may be some old woman at Thornton Lacey to be converted.
GT | GT Mel+HiFi-GAN | Guided-TTS |
---|---|---|
Analysis on the effect of norm-based guidance
Classifier guidance vs. norm-based guidance
The norm-based guidance method with an appropriate gradient scale (s = 0.3 to 0.4) helps generate samples that accurately follow the given text.
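The difference between the two guidance variants can be sketched in a few lines of NumPy. This is a minimal sketch, not the authors' implementation: it assumes that norm-based guidance rescales the classifier gradient so that its norm is proportional to the norm of the unconditional score, and the function names, array shapes, and random stand-in tensors are all hypothetical.

```python
import numpy as np


def classifier_guided_score(score, clf_grad, s):
    """Plain classifier guidance: add the classifier gradient scaled by s."""
    return score + s * clf_grad


def norm_based_guided_score(score, clf_grad, s, eps=1e-12):
    """Norm-based guidance (assumed form, see lead-in).

    The classifier gradient is rescaled so that the added term has a norm
    of about s times the norm of the unconditional score.
    """
    ratio = np.linalg.norm(score) / (np.linalg.norm(clf_grad) + eps)
    return score + s * ratio * clf_grad


# Toy stand-ins (hypothetical): an 80-bin mel-spectrogram with 100 frames.
rng = np.random.default_rng(0)
score = rng.standard_normal((80, 100))            # unconditional score
clf_grad = 1e-3 * rng.standard_normal((80, 100))  # much smaller classifier gradient

s = 0.3  # gradient scale in the range reported on this page
plain = classifier_guided_score(score, clf_grad, s)
normed = norm_based_guided_score(score, clf_grad, s)

# Relative size of the guidance term under norm-based scaling.
guidance_ratio = np.linalg.norm(normed - score) / np.linalg.norm(score)
print(f"norm-based guidance term / score norm: {guidance_ratio:.3f}")
```

In this toy setup the plain guidance term is negligible next to the score because the classifier gradient is small, whereas the norm-based term keeps a fixed relative size set by s regardless of the gradient's raw magnitude.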
Transcript: Nor did the methods by which they were perpetrated greatly vary from those in times past.
Classifier guidance (s=0.5) | Classifier guidance (s=1.5) | Classifier guidance (s=3.0) | Classifier guidance (s=4.5) | Norm-based guidance (s=0.1) | Norm-based guidance (s=0.3) (Ours) | Norm-based guidance (s=0.6) | Norm-based guidance (s=1.0) |
---|---|---|---|---|---|---|---|
Transcript: There was unrestrained association of untried and convicted, juvenile with adult prisoners, vagrants, misdemeanants, felons.
Classifier guidance (s=0.5) | Classifier guidance (s=1.5) | Classifier guidance (s=3.0) | Classifier guidance (s=4.5) | Norm-based guidance (s=0.1) | Norm-based guidance (s=0.3) (Ours) | Norm-based guidance (s=0.6) | Norm-based guidance (s=1.0) |
---|---|---|---|---|---|---|---|
Unconditional Generation
Unconditional DDPM (LJSpeech) | Unconditional DDPM (Hi-Fi TTS ID: 92) | Unconditional DDPM (Hi-Fi TTS ID: 6097) |
---|---|---|
Mel-spectrogram Inpainting Results
Dataset | LJSpeech | Hi-Fi TTS ID: 92 | Hi-Fi TTS ID: 6097 |
---|---|---|---|
Mel-spectrogram | | | |
Ground Truth | | | |
Inpainting | | | |