(1) Base (NoEmph) - No emphasis: A baseline S2S system, which uses global (sentence-level) prosodic controls.
(2) Base (Sup) - Classic supervision: A baseline S2S system with global prosodic controls that is exposed to emphatic data during training (for a single speaker only) and has an explicit binary feature encoding the location of emphasis.
(3) PC-Unsup - Prosodic Control (PC) of word emphasis with no supervision: A fully unsupervised system trained with both the sentence-level and the word-level prosodic controls. Word emphasis is realized at inference time by a set of fixed additive word-level prosodic controls.
(4) PC-Hybrid - Hybrid system: A system combining the classic supervision (for a single speaker only) with the multi-level prosodic controls. Word emphasis is learned from the emphatic data and boosted at inference time by a set of fixed word-level prosodic controls.
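The "fixed additive word-level prosodic controls" mentioned for PC-Unsup and PC-Hybrid can be pictured with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the systems above: the function name `apply_emphasis`, the three control dimensions, and the offset values are all hypothetical. The sketch only shows the general idea of adding a constant offset vector to the controls of the emphasized word while leaving the other words untouched.

```python
from typing import List, Sequence

# Hypothetical fixed offsets, one per word-level prosodic control dimension.
# Both the number of dimensions and the values are illustrative assumptions.
EMPHASIS_OFFSETS = (0.5, 0.25, 0.25)


def apply_emphasis(
    word_controls: Sequence[Sequence[float]],
    emphasized_index: int,
    offsets: Sequence[float] = EMPHASIS_OFFSETS,
) -> List[List[float]]:
    """Return a copy of word_controls with offsets added to one word."""
    if not 0 <= emphasized_index < len(word_controls):
        raise IndexError("emphasized_index is outside the sentence")
    # Copy every row so the caller's controls are not modified.
    boosted = [list(controls) for controls in word_controls]
    target = boosted[emphasized_index]
    if len(target) != len(offsets):
        raise ValueError("offsets must match the control dimensionality")
    # Additive control: shift only the emphasized word's controls.
    boosted[emphasized_index] = [c + o for c, o in zip(target, offsets)]
    return boosted


# One row of controls per word in a three-word sentence.
controls = [[0.0, 0.0, 0.0], [0.25, 0.0, -0.25], [0.0, 0.5, 0.0]]
boosted = apply_emphasis(controls, emphasized_index=1)
```

In this sketch only the second word's controls change; in a hybrid setting the same kind of offset would be applied on top of whatever emphasis the model already learned from labeled data.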
Transplant condition: NO labeled word-emphasis training data available for the target synthesis voice
Columns: Base (NoEmph) | Base (Sup) | PC-Unsup | PC-Hybrid | Emphasized word
WALKED
COMPLEX
ONLY
AGAIN
MILDER
LITERALLY
FIVE
THOUGHT
ONLY
PROMPTLY
EXIT
BEGINNER'S
PATTERN
ANYONE
MUCH
Matched condition: WITH labeled word-emphasis training data available for the target synthesis voice
Columns: Base (NoEmph) | Base (Sup) | PC-Unsup | PC-Hybrid | Emphasized word
WALKED
COMPLEX
ONLY
AGAIN
MILDER
LITERALLY
FIVE
THOUGHT
ONLY
PROMPTLY
EXIT
BEGINNER'S
PATTERN
ANYONE
MUCH
A sample screen from the MOS test experiment