Summary

The problem

There is a wide variety of design choices available when building a self-supervised model for speech representation, and a comparative study may shed light on how the resulting models differ from one another.

The solution

The paper proposes measuring the similarity between speech representations derived from self-supervised models trained with different configurations.
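One common way to compare two learned representations of the same inputs is linear Centered Kernel Alignment (CKA); the specific similarity measure used in the paper is not stated here, so this is only an illustrative sketch of the general idea, not the paper's method.

```python
import numpy as np

def linear_cka(X, Y):
    """Linear CKA between representation matrices X (n x d1) and
    Y (n x d2) computed on the same n inputs. Returns a value in
    [0, 1]; 1 means the representations are linearly equivalent."""
    # Center each feature dimension over the examples.
    X = X - X.mean(axis=0, keepdims=True)
    Y = Y - Y.mean(axis=0, keepdims=True)
    # CKA(X, Y) = ||X^T Y||_F^2 / (||X^T X||_F * ||Y^T Y||_F)
    cross = np.linalg.norm(X.T @ Y, "fro") ** 2
    self_x = np.linalg.norm(X.T @ X, "fro")
    self_y = np.linalg.norm(Y.T @ Y, "fro")
    return cross / (self_x * self_y)

rng = np.random.default_rng(0)
A = rng.standard_normal((100, 32))      # hypothetical model-1 features
B = A @ rng.standard_normal((32, 16))   # linear transform of A
C = rng.standard_normal((100, 16))      # unrelated features
print(linear_cka(A, A))  # 1.0: identical representations
print(linear_cka(A, B))  # high: B is a linear map of A
print(linear_cka(A, C))  # low: unrelated representations
```

A metric like this is invariant to rotation and isotropic scaling of the feature space, which matters because two models can encode the same information in differently oriented bases.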

Thoughts

Depending on how the positive and negative samples are defined, along with the objective function, the considered models capture either more speaker information or more phonetic information. Exploiting this inductive bias could promote disentanglement, as shown in Contrastive Predictive Coding Supported Factorized Variational Autoencoder For Unsupervised Learning Of Disentangled Speech Representations.
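The role of positive/negative sampling can be made concrete with a generic InfoNCE-style contrastive loss: the anchor is pulled toward its positive and pushed away from negatives, so choosing positives from the same speaker (vs. the same phone) steers what the representation encodes. This is a minimal sketch of that loss on plain NumPy vectors, not the exact objective of any model discussed above.

```python
import numpy as np

def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE loss for a single anchor vector. The definition of
    'positive' (same speaker, same phone, a future frame, ...) is the
    inductive bias that shapes the learned representation."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    # Logit 0 is the positive; the rest are negatives.
    logits = np.array([cos(anchor, positive)] +
                      [cos(anchor, n) for n in negatives]) / temperature
    logits -= logits.max()  # numerical stability
    return -np.log(np.exp(logits[0]) / np.exp(logits).sum())

rng = np.random.default_rng(0)
anchor = rng.standard_normal(8)
negatives = [rng.standard_normal(8) for _ in range(5)]
good_pos = anchor + 0.01 * rng.standard_normal(8)  # close to anchor
bad_pos = rng.standard_normal(8)                   # unrelated
print(info_nce(anchor, good_pos, negatives))  # small loss
print(info_nce(anchor, bad_pos, negatives))   # larger loss
```

Minimizing this loss makes representations of whatever-is-declared-positive similar, which is exactly the lever a factorized model can use to route speaker and phonetic information into separate latent variables.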