Summary

The problems

Auto-encoding and predictive coding are the two dominant frameworks for speech representation learning. A side-by-side evaluation of the two gives a useful point of reference.
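
To make the two frameworks concrete, here is a minimal NumPy sketch of the core operation in each: a vector-quantization bottleneck of the kind used in VQ-VAE-style auto-encoders, and an InfoNCE contrastive loss of the kind used in CPC. It is an illustration under stated assumptions, not the setup evaluated in the paper. The function names `vq_quantize` and `info_nce` are hypothetical, the contrastive score is a plain dot product (CPC additionally learns a projection per prediction step), and the negatives are simply the other rows in the batch.

```python
import numpy as np

def vq_quantize(z, codebook):
    """Auto-encoding side: snap each frame in z (T, D) to its nearest
    codebook vector (K, D). Returns the quantized frames and code indices."""
    # Squared Euclidean distance between every frame and every code: (T, K)
    dist = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    idx = dist.argmin(axis=1)
    return codebook[idx], idx

def info_nce(pred, targets):
    """Predictive-coding side: InfoNCE loss where pred[i] should score
    highest against targets[i]; all other rows act as negatives.
    pred, targets: arrays of shape (N, D)."""
    logits = pred @ targets.T                              # (N, N) scores
    logits = logits - logits.max(axis=1, keepdims=True)    # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))                    # positives on the diagonal

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    frames = rng.normal(size=(5, 4))      # toy encoder outputs (assumed shapes)
    codes = rng.normal(size=(8, 4))       # toy codebook with 8 entries
    quantized, idx = vq_quantize(frames, codes)
    print("code indices:", idx)
    print("InfoNCE on random vectors:", info_nce(frames, rng.normal(size=(5, 4))))
```

The contrast is visible in the return values: the quantizer forces every frame onto one of a small set of discrete codes, while the contrastive loss only asks that a prediction rank the true future frame above the negatives and places no discrete constraint on the representation.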

Thoughts

VQ-VAEs tend to learn mostly phonetic information, while CPC-based methods are more flexible. Combining the two may be a promising way to further promote disentanglement.