Existing unsupervised speaker-content disentanglement frameworks typically bias learning through two architectural choices: global average pooling for the speaker representation, and instance normalisation for the content representation. This work incorporates contrastive predictive coding (CPC) to further improve performance.
The CPC loss encourages the speaker representations to be consistent throughout the sequence, while an adversarial CPC loss discourages the content encoder from encoding speaker information.
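The three ingredients named above can be illustrated with a minimal sketch in plain Python. It is not the implementation used in this work. The function names, the dot-product scoring, and the choice of positives and negatives (another segment of the same utterance versus segments of other utterances) are assumptions made for illustration, and the contrastive objective shown is the standard InfoNCE form commonly associated with CPC.

```python
import math

def global_average_pool(frames):
    """Average a T x D list of frame vectors over time into one D-dim vector.
    Time-varying detail is averaged out, leaving utterance-level (speaker-like) information."""
    T = len(frames)
    return [sum(f[d] for f in frames) / T for d in range(len(frames[0]))]

def instance_norm(frames, eps=1e-5):
    """Normalise each channel over time to zero mean and unit variance.
    Per-utterance channel statistics, which carry static speaker cues, are removed."""
    T, D = len(frames), len(frames[0])
    means = [sum(f[d] for f in frames) / T for d in range(D)]
    variances = [sum((f[d] - means[d]) ** 2 for f in frames) / T for d in range(D)]
    return [[(f[d] - means[d]) / math.sqrt(variances[d] + eps) for d in range(D)]
            for f in frames]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def info_nce(anchor, positive, negatives):
    """InfoNCE loss: negative log-softmax score of the positive among all candidates.
    Dot-product scoring is an assumption of this sketch."""
    scores = [dot(anchor, positive)] + [dot(anchor, n) for n in negatives]
    m = max(scores)  # subtract the max for a numerically stable log-sum-exp
    log_denominator = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_denominator - scores[0]

# Hypothetical toy data: two segments of one utterance and one segment of another.
segment_a = [[1.0, 0.1], [0.9, -0.1], [1.1, 0.0]]
segment_b = [[1.0, 0.2], [1.0, -0.2]]
other_utterance = [[0.0, 1.0], [0.1, 0.9]]

speaker_a = global_average_pool(segment_a)
speaker_b = global_average_pool(segment_b)
speaker_other = global_average_pool(other_utterance)

# Low when embeddings from the same utterance agree and differ from other utterances.
consistency_loss = info_nce(speaker_a, speaker_b, [speaker_other])
```

Minimising `consistency_loss` pulls together speaker embeddings taken from different parts of the same sequence. For the adversarial term, one plausible reading is that the same kind of loss is computed on content representations and the content encoder is trained to increase it, for example through gradient reversal; that step is an assumption and is not shown here.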