Summary

The problem

There is a wide variety of design choices available when building a self-supervised model for speech representation, and a comparative study may shed light on how the resulting models differ from one another.

The solution

The paper proposes measuring the similarity between speech representations derived from self-supervised models trained with different configurations.
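One common way to compare two learned representations of the same inputs is linear Centered Kernel Alignment (CKA); the specific similarity measure used in the paper is not stated here, so this is only an illustrative sketch of the general idea, not the paper's method.

```python
import numpy as np

def linear_cka(X, Y):
    """Linear CKA between representation matrices X (n x d1) and
    Y (n x d2) computed on the same n inputs. Returns a value in
    [0, 1]; 1 means the representations are linearly equivalent."""
    # Center each feature dimension over the examples.
    X = X - X.mean(axis=0, keepdims=True)
    Y = Y - Y.mean(axis=0, keepdims=True)
    # CKA(X, Y) = ||X^T Y||_F^2 / (||X^T X||_F * ||Y^T Y||_F)
    cross = np.linalg.norm(X.T @ Y, "fro") ** 2
    self_x = np.linalg.norm(X.T @ X, "fro")
    self_y = np.linalg.norm(Y.T @ Y, "fro")
    return cross / (self_x * self_y)

rng = np.random.default_rng(0)
A = rng.standard_normal((100, 32))      # hypothetical model-1 features
B = A @ rng.standard_normal((32, 16))   # linear transform of A
C = rng.standard_normal((100, 16))      # unrelated features
print(linear_cka(A, A))  # 1.0: identical representations
print(linear_cka(A, B))  # high: B is a linear map of A
print(linear_cka(A, C))  # low: unrelated representations
```

A metric like this is invariant to rotation and isotropic scaling of the feature space, which matters because two models can encode the same information in differently oriented bases.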

Thoughts

Depending on how the positive and negative samples are defined, along with the objective function, the considered models capture either more speaker information or more phonetic information. Exploiting this inductive bias could promote disentanglement, as shown in Contrastive Predictive Coding Supported Factorized Variational Autoencoder For Unsupervised Learning Of Disentangled Speech Representations.
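The role of positive/negative sampling can be made concrete with a generic InfoNCE-style contrastive loss: the anchor is pulled toward its positive and pushed away from negatives, so choosing positives from the same speaker (vs. the same phone) steers what the representation encodes. This is a minimal sketch of that loss on plain NumPy vectors, not the exact objective of any model discussed above.

```python
import numpy as np

def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE loss for a single anchor vector. The definition of
    'positive' (same speaker, same phone, a future frame, ...) is the
    inductive bias that shapes the learned representation."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    # Logit 0 is the positive; the rest are negatives.
    logits = np.array([cos(anchor, positive)] +
                      [cos(anchor, n) for n in negatives]) / temperature
    logits -= logits.max()  # numerical stability
    return -np.log(np.exp(logits[0]) / np.exp(logits).sum())

rng = np.random.default_rng(0)
anchor = rng.standard_normal(8)
negatives = [rng.standard_normal(8) for _ in range(5)]
good_pos = anchor + 0.01 * rng.standard_normal(8)  # close to anchor
bad_pos = rng.standard_normal(8)                   # unrelated
print(info_nce(anchor, good_pos, negatives))  # small loss
print(info_nce(anchor, bad_pos, negatives))   # larger loss
```

Minimizing this loss makes representations of whatever-is-declared-positive similar, which is exactly the lever a factorized model can use to route speaker and phonetic information into separate latent variables.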