Auto-encoding and predictive coding are two dominant frameworks for speech representation learning, so a comparative evaluation of the two is a useful reference.
VQ-VAEs tend to learn mostly phonetic information, while CPC-based methods are more flexible. Combining the two may be a promising way to further promote disentanglement.
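
To make the contrast concrete, the following is a minimal sketch of the core operation in each framework, not the implementation evaluated in the work. It assumes a VQ-VAE bottleneck that maps each frame to its nearest codebook vector, and a CPC-style InfoNCE loss that scores a predicted future frame against one positive and several negatives using a plain dot product. The names `vq_quantize` and `info_nce` and the toy vectors are hypothetical and chosen only for illustration.

```python
import math


def vq_quantize(frame, codebook):
    """Return (index, code) of the codebook vector nearest to `frame`.

    Distance is squared Euclidean. This is only the discrete bottleneck
    of a VQ-VAE; the encoder, decoder and training losses are omitted.
    """
    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    index = min(range(len(codebook)), key=lambda k: sqdist(frame, codebook[k]))
    return index, codebook[index]


def info_nce(prediction, positive, negatives):
    """InfoNCE loss: negative log-softmax score of the positive sample.

    Scores are dot products between the prediction and each candidate
    (an assumption; CPC variants may use a learned bilinear score).
    """
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    scores = [dot(prediction, positive)]
    scores += [dot(prediction, n) for n in negatives]
    # Numerically stable log-sum-exp over all candidate scores.
    m = max(scores)
    log_denominator = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_denominator - scores[0]


# Toy usage with hypothetical 2-dimensional "frames".
codebook = [[0.0, 0.0], [1.0, 1.0]]
index, code = vq_quantize([0.9, 0.8], codebook)

loss = info_nce(
    prediction=[1.0, 0.0],
    positive=[1.0, 0.0],
    negatives=[[0.0, 1.0], [-1.0, 0.0]],
)
```

The quantizer forces every frame into one of a small number of discrete codes, which is one plausible reason such models keep mostly phonetic content. The contrastive loss only requires the true future frame to be distinguishable from the negatives, so what the representation retains depends on how the negatives are sampled.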