Contrastive Predictive Coding Supported Factorized Variational Autoencoder For Unsupervised Learning Of Disentangled Speech Representations

Summary

Apply contrastive predictive coding (CPC) to both the style and the content representations to further promote speech disentanglement.
The speaker representation is encouraged to stay consistent throughout the sequence, while an adversarial CPC term is applied to the content representation to keep speaker information out of it.

The problems

Existing unsupervised speaker-content disentanglement frameworks usually bias learning through global average pooling for the speaker representation and instance normalisation for the content representation. This work seeks to incorporate CPC to further improve performance.
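
To make the two biases concrete, here is a minimal NumPy sketch. It is not taken from the paper: the function names, the `(T, D)` frame layout, and the `eps` value are assumptions chosen for illustration. Global average pooling collapses the time axis into one utterance-level vector, while instance normalisation removes per-channel statistics that stay constant over the utterance.

```python
import numpy as np

def global_average_pool(frames):
    """Collapse a (T, D) frame sequence into one (D,) utterance-level vector."""
    return frames.mean(axis=0)

def instance_norm(frames, eps=1e-5):
    """Normalise each channel over time for a single (T, D) utterance.

    Per-utterance mean and variance are removed, so static (speaker-like)
    offsets and scales do not survive in the output.
    """
    mean = frames.mean(axis=0, keepdims=True)
    var = frames.var(axis=0, keepdims=True)
    return (frames - mean) / np.sqrt(var + eps)

# Toy utterance: 50 frames, 4 channels, plus a constant per-channel offset
# standing in for a static speaker characteristic (an illustrative assumption).
rng = np.random.default_rng(0)
x = rng.normal(size=(50, 4))
offset = np.array([3.0, -1.0, 0.5, 2.0])

pooled_shift = global_average_pool(x + offset) - global_average_pool(x)
normed_diff = np.abs(instance_norm(x + offset) - instance_norm(x)).max()
```

In this toy case the pooled vector shifts by exactly the offset, whereas the instance-normalised frames are unchanged up to floating-point error, which is the intuition behind using the former for speaker and the latter for content.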

The solution

The CPC loss encourages the speaker representations to be consistent throughout the sequence, while an adversarial CPC loss discourages the content encoder from encoding speaker information.
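
The two roles of the CPC loss can be sketched with a generic InfoNCE objective. This is an illustration under assumptions, not the paper's implementation: the cosine-similarity scoring, the `temperature` value, the use of other batch rows as negatives, and the simple sign flip for the adversarial term are all choices made here. In practice the adversarial part would typically be realised with alternating updates or gradient reversal.

```python
import numpy as np

def info_nce(anchors, positives, temperature=0.1):
    """InfoNCE over a batch of (B, D) embeddings.

    Row i of `positives` is the positive for row i of `anchors`
    (e.g. the same utterance at another time step); all other rows
    serve as negatives (e.g. other utterances).
    """
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = (a @ p.T) / temperature                  # (B, B) scaled cosine similarities
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_softmax = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_softmax))

def encoder_losses(style_t, style_t_k, content_t, content_t_k, adv_weight=1.0):
    """Hypothetical helper combining both uses of the CPC loss."""
    # The style encoder minimises this: embeddings of one utterance should match.
    style_loss = info_nce(style_t, style_t_k)
    # The content encoder maximises the same kind of loss (sign flipped), so
    # content embeddings cannot be used to tell utterances (speakers) apart.
    content_adv_loss = -adv_weight * info_nce(content_t, content_t_k)
    return style_loss, content_adv_loss

rng = np.random.default_rng(0)
style_t = rng.normal(size=(8, 16))
style_t_k = style_t + 0.01 * rng.normal(size=(8, 16))  # consistent over time
loss_consistent = info_nce(style_t, style_t_k)
loss_mismatched = info_nce(style_t, np.roll(style_t_k, 1, axis=0))
```

With time-consistent embeddings the loss is close to zero, while mismatched pairings are penalised heavily. When all embeddings in the batch are identical, the loss equals log(B), which is the chance level the adversarial term pushes the content encoder towards.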

Thoughts

Existing work on sequential disentanglement, e.g., FHVAE and DSAE, relies on the factors of variation operating on different time scales. The fact that CPC promotes disentanglement might suggest that it exploits the same inductive bias.
It seems that learning content or phonetic representations is the predominant goal of CPC-based representation learners, e.g., "Similarity Analysis of Self-Supervised Speech Representations" and "A Comparison of Discrete Latent Variable Models for Speech Representation Learning". Using the identical mapping as the training objective instead, this work uses CPC to learn speaker representations.