"Duration prediction" existed in early TTS systems such as HMM-based models, and even in Deep Voice 2 (2017).
Seq-to-seq models with an attention mechanism removed the need for explicit duration prediction, since the attention learns the text-to-frame alignment itself.
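To make the contrast concrete, here is a minimal sketch of the two ways of aligning text to output frames. The function names `expand_by_duration` and `attention_context` are hypothetical, and the dot-product scoring is a simplification chosen for illustration, not the scoring used by any particular model named in these notes.

```python
import math


def expand_by_duration(encoder_states, durations):
    """Duration-based alignment: repeat each encoder state for its
    (predicted) number of output frames."""
    if len(encoder_states) != len(durations):
        raise ValueError("need exactly one duration per encoder state")
    frames = []
    for state, n_frames in zip(encoder_states, durations):
        frames.extend([state] * n_frames)
    return frames


def attention_context(query, encoder_states):
    """Attention-based alignment: softmax over dot-product scores,
    then a weighted sum of the encoder states."""
    scores = [sum(q * s for q, s in zip(query, state)) for state in encoder_states]
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(score - peak) for score in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(encoder_states[0])
    context = [
        sum(w * state[d] for w, state in zip(weights, encoder_states))
        for d in range(dim)
    ]
    return weights, context
```

With durations, the alignment is hard and fixed before decoding. With attention, each decoder step recomputes a soft alignment, so no durations are needed, but nothing forces the alignment to move monotonically through the text.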
attention mechanism (location-sensitive)
adversarial training (Guo et al., 2019)
regularization to encourage the forward and backward attention to be consistent (Zheng et al., 2019)
Gaussian mixture model attention (Graves, 2013; Skerry-Ryan et al., 2018)
forward attention (Zhang et al., 2018)
stepwise monotonic attention (He et al., 2019)
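As one illustration from the list above, here is a rough sketch of a single decoder step of Gaussian mixture model attention in the style of Graves (2013). The function name `gmm_attention_step` and its argument names are hypothetical. In a real model the raw parameters would come from a learned layer on the decoder state; here they are passed in directly as an assumption to keep the example self-contained.

```python
import math


def gmm_attention_step(prev_kappa, alpha_hat, beta_hat, kappa_hat, num_positions):
    """One step of Graves-style GMM attention over input positions 0..num_positions-1.

    prev_kappa: mixture means from the previous decoder step.
    alpha_hat, beta_hat, kappa_hat: raw (unconstrained) parameters, one per
    mixture component; assumed to be produced by the network.
    """
    alpha = [math.exp(a) for a in alpha_hat]  # component weights, > 0
    beta = [math.exp(b) for b in beta_hat]    # inverse widths, > 0
    # The means only ever move forward because exp(.) is positive.
    kappa = [k_prev + math.exp(k) for k_prev, k in zip(prev_kappa, kappa_hat)]
    weights = [
        sum(a * math.exp(-b * (k - u) ** 2) for a, b, k in zip(alpha, beta, kappa))
        for u in range(num_positions)
    ]
    return weights, kappa


# Two consecutive steps with a single mixture component.
w1, k1 = gmm_attention_step([0.0], [0.0], [0.0], [0.0], num_positions=4)
w2, k2 = gmm_attention_step(k1, [0.0], [0.0], [0.0], num_positions=4)
```

The attention weights depend only on position, not on the content of the encoder states, and the window can only advance. In this formulation the weights are left unnormalized; other variants may normalize them.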