Supervised and Unsupervised Approaches for Controlling Narrow Lexical Focus in Sequence-to-Sequence Speech Synthesis

Slava Shechtman (1), Raul Fernandez (2) and David Haws (2)
(1) IBM Haifa Research Lab, Haifa, Israel
(2) IBM TJ Watson Research Lab, Yorktown Heights, NY, USA

Accepted to SLT 2021 (Full paper)

Audio Samples


Transplant condition: NO labeled word-emphasis training data available for the target synthesis voice
Base (NoEmph) | Base (Sup) | PC-Unsup | PC-Hybrid | Emphasized word
WALKED
COMPLEX
ONLY
AGAIN
MILDER
LITERALLY
FIVE
THOUGHT
ONLY
PROMPTLY
EXIT
BEGINNER'S
PATTERN
ANYONE
MUCH


Matched condition: WITH labeled word-emphasis training data available for the target synthesis voice
Base (NoEmph) | Base (Sup) | PC-Unsup | PC-Hybrid | Emphasized word
WALKED
COMPLEX
ONLY
AGAIN
MILDER
LITERALLY
FIVE
THOUGHT
ONLY
PROMPTLY
EXIT
BEGINNER'S
PATTERN
ANYONE
MUCH

Sample screen from the MOS test
