Supervised and Unsupervised Approaches for Controlling Narrow Lexical Focus in Sequence-to-Sequence Speech Synthesis

Slava Shechtman¹, Raul Fernandez² and David Haws²
¹IBM Haifa Research Lab, Haifa – Israel
²IBM TJ Watson Research Lab, Yorktown Heights, NY – USA

Accepted to SLT 2021 (Full paper)

Audio Samples

(1) Base (NoEmph) - No emphasis: A baseline S2S system, which uses global (sentence-level) prosodic controls.
(2) Base (Sup) - Classic supervision: A baseline S2S system (which uses global prosodic controls) that is exposed to emphatic data during training (for a single speaker only) and has an explicit binary feature encoding the location of emphasis.
(3) PC-Unsup - Prosodic Control (PC) of word emphasis with no supervision: A fully unsupervised system trained with both sentence-level and word-level prosodic controls. Word emphasis is realized at inference time by a set of fixed additive word-level prosodic controls.
(4) PC-Hybrid - Hybrid system: A system combining classic supervision (for a single speaker only) with the multi-level prosodic controls. Word emphasis is learned from the emphatic data and boosted at inference time by a set of fixed word-level prosodic controls.
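
As a rough illustration of what "fixed additive word-level prosodic controls" could look like at inference time, the sketch below adds a constant offset vector to the control vector of each emphasized word and leaves all other words unchanged. This is a minimal sketch under stated assumptions, not the authors' implementation: the function name `apply_emphasis`, the three-dimensional control vectors, and the offset values are all hypothetical, since this page does not specify which prosodic dimensions or magnitudes the systems use.

```python
from typing import List, Sequence


def apply_emphasis(
    word_controls: Sequence[Sequence[float]],
    emphasized: Sequence[int],
    offset: Sequence[float],
) -> List[List[float]]:
    """Return a copy of word_controls with `offset` added at emphasized words.

    word_controls: one prosodic control vector per word (hypothetical layout).
    emphasized:    indices of the words to emphasize.
    offset:        fixed additive offset, same length as each control vector.
    """
    targets = set(emphasized)
    for i in targets:
        if not 0 <= i < len(word_controls):
            raise IndexError(f"emphasized index {i} is out of range")

    boosted = []
    for i, controls in enumerate(word_controls):
        if len(controls) != len(offset):
            raise ValueError("offset length must match control vector length")
        if i in targets:
            # Emphasized word: shift every control dimension by the fixed offset.
            boosted.append([c + o for c, o in zip(controls, offset)])
        else:
            # Non-emphasized word: keep the neutral controls unchanged.
            boosted.append(list(controls))
    return boosted


# Hypothetical example: five words, neutral (all-zero) word-level controls,
# and an assumed 3-dimensional offset applied to the third word ("not").
words = ["I", "did", "not", "say", "that"]
neutral = [[0.0, 0.0, 0.0] for _ in words]
emphasis_offset = [0.5, 1.0, 0.25]  # assumed values, not taken from the paper
boosted = apply_emphasis(neutral, emphasized=[2], offset=emphasis_offset)
```

In a hybrid setup of the kind described in (4), the same additive step could in principle be applied on top of controls that already reflect trained emphasis; how the two are actually combined is not stated here.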