A Comparison Of Discrete Latent Variable Models For Speech Representation Learning

Summary

The paper presents a comparative study of VQVAEs and VQ-wav2vec, evaluated on phone recognition.
VQ-wav2vec achieves better performance.

The problems

Auto-encoding and predictive coding are the two dominant frameworks for speech representation learning. Evaluating both on a common task provides a useful point of reference.
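
Both model families compared here produce discrete latents by mapping continuous encoder outputs onto entries of a codebook. As a rough illustration only, the sketch below shows nearest-neighbor codebook lookup, the quantization step commonly associated with VQ-VAE; VQ-wav2vec may select its codes differently. The function names, the toy codebook, and the toy frames are hypothetical and are not taken from the paper.

```python
from typing import List, Sequence, Tuple


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(
    frames: Sequence[Sequence[float]],
    codebook: Sequence[Sequence[float]],
) -> Tuple[List[int], List[List[float]]]:
    """Map each frame to the index and vector of its nearest codebook entry."""
    indices: List[int] = []
    quantized: List[List[float]] = []
    for frame in frames:
        # Pick the codebook entry with the smallest distance to this frame.
        best = min(
            range(len(codebook)),
            key=lambda k: squared_distance(frame, codebook[k]),
        )
        indices.append(best)
        quantized.append(list(codebook[best]))
    return indices, quantized


# Toy data (assumed for illustration): three 2-D "encoder outputs"
# and a two-entry codebook.
codebook = [[0.0, 0.0], [1.0, 1.0]]
frames = [[0.1, -0.2], [0.9, 1.2], [0.4, 0.3]]
indices, quantized = quantize(frames, codebook)
print(indices)  # discrete code sequence for the three frames
```

The resulting index sequence is the discrete representation that a downstream phone recognizer would consume. In a trained model the codebook is learned jointly with the encoder rather than fixed by hand as it is here.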

Thoughts

VQVAEs would mostly learn phonetic information, while CPC-based methods are more flexible. Combining the two may be a promising way to further promote disentanglement.