"Duration Prediction" existed in early TTS systems as HMM-based models, and even in Deep Voice 2 (2017) as a dedicated duration model.


Seq-to-seq models w/ an attention mechanism removed the need for explicit duration prediction.

Tacotron 2 (2018)


seq-to-seq model

auto-regressive decoder

attention mechanism (location-sensitive)
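A minimal numpy sketch of one location-sensitive attention step (Chorowski et al., 2015), the mechanism Tacotron 2 uses: the previous alignment is convolved with learned filters so the energies can "see" where attention was last step. All weights, shapes, and the bias-free energy form here are illustrative simplifications, not Tacotron 2's actual hyperparameters:

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def location_sensitive_attention(query, memory, prev_alpha, Wq, Wm, Wf, filt, v):
    """One attention step.

    query:      (d_q,)   current decoder state
    memory:     (T, d_m) encoder outputs
    prev_alpha: (T,)     previous attention weights
    filt:       (n_filters, width) 1-D conv filters applied to prev_alpha
    """
    T = memory.shape[0]
    pad = filt.shape[1] // 2
    padded = np.pad(prev_alpha, pad)
    # location features: convolve the previous alignment with each filter
    f = np.stack([np.convolve(padded, k, mode="valid")[:T] for k in filt], axis=1)
    # additive energies combine decoder state, encoder outputs, location features
    energies = np.tanh(query @ Wq + memory @ Wm + f @ Wf) @ v   # (T,)
    alpha = softmax(energies)
    context = alpha @ memory                                    # (d_m,)
    return alpha, context

# toy usage with random placeholder weights
rng = np.random.default_rng(0)
T, d_q, d_m, d_a, n_f = 8, 4, 6, 5, 2
alpha, ctx = location_sensitive_attention(
    rng.normal(size=d_q), rng.normal(size=(T, d_m)),
    np.eye(1, T, 0).ravel(),  # initial alignment: all mass on first encoder step
    rng.normal(size=(d_q, d_a)), rng.normal(size=(d_m, d_a)),
    rng.normal(size=(n_f, d_a)), rng.normal(size=(n_f, 3)), rng.normal(size=d_a),
)
```

Because the energies depend on `prev_alpha`, the model is nudged toward moving forward consistently, but nothing hard-constrains monotonicity, which is why the failure modes below still occur.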

Problems with auto-regressive attention-based models

early cutoff

word skipping and repetition


Efforts to improve the robustness of auto-regressive attention-based models

adversarial training (Guo et al., 2019)

regularization to encourage the forward and backward attention to be consistent (Zheng et al., 2019)

Gaussian mixture model attention(Graves, 2013; Skerry-Ryan et al., 2018)

forward attention(Zhang et al., 2018)

stepwise monotonic attention(He et al., 2019)
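Of these, Gaussian mixture model attention is the easiest to sketch: the mixture means can only move forward, so the alignment is monotonic by construction. A minimal numpy sketch of one step, with the parameter-predicting network elided (the `*_hat` inputs stand in for its outputs from the decoder state):

```python
import numpy as np

def gmm_attention_step(kappa_prev, omega_hat, delta_hat, sigma_hat, num_encoder_steps):
    """One step of Gaussian mixture model attention (Graves, 2013).

    kappa_prev: (K,) mixture means from the previous decoder step.
    omega_hat, delta_hat, sigma_hat: (K,) unconstrained parameters
    predicted from the decoder state (that network is elided here).
    """
    omega = np.exp(omega_hat)    # mixture weights > 0
    delta = np.exp(delta_hat)    # step sizes > 0
    sigma = np.exp(sigma_hat)    # component widths > 0
    kappa = kappa_prev + delta   # means only move forward -> monotonic alignment
    j = np.arange(num_encoder_steps)[None, :]                      # positions (1, T)
    phi = omega[:, None] * np.exp(-((j - kappa[:, None]) ** 2)
                                  / (2.0 * sigma[:, None] ** 2))   # (K, T)
    alpha = phi.sum(axis=0)      # attention weights over encoder steps (T,)
    return alpha, kappa

# toy usage: two components over a 10-step encoder sequence;
# kappa advances from [0, 0] to [1, 2]
alpha, kappa = gmm_attention_step(
    kappa_prev=np.zeros(2),
    omega_hat=np.zeros(2),
    delta_hat=np.log(np.array([1.0, 2.0])),
    sigma_hat=np.zeros(2),
    num_encoder_steps=10,
)
```

Since `kappa` is a running sum of positive increments, the attention window can stall or overshoot but never jump backward, which is what makes this family more robust than unconstrained content-based attention.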