Summary

The problems

Contrastive learning algorithms usually treat the other samples in the same batch as the negative samples, which naturally introduces labelling noise: those samples are not necessarily true negatives, since some of them may belong to the same (unknown) class as the anchor.
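
A minimal sketch of this in-batch negative sampling, using a standard InfoNCE-style loss in PyTorch (tensor names are hypothetical): the positive for each anchor is its other augmented view, and every remaining sample in the batch is treated as a negative, whether or not it actually is one.

```python
import torch
import torch.nn.functional as F

def info_nce(z1, z2, temperature=0.1):
    """z1, z2: (N, d) embeddings of two augmented views of the same batch."""
    z1 = F.normalize(z1, dim=1)
    z2 = F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / temperature  # (N, N) pairwise similarities
    # Diagonal entries are the positives; every off-diagonal entry is used
    # as a negative, even if it comes from the same underlying class.
    targets = torch.arange(z1.size(0), device=z1.device)
    return F.cross_entropy(logits, targets)
```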

The solution

The framework applies K-means to cluster the learned representations, which produces new pseudo labels. The self-supervised model then learns from these pseudo labels to refine its output representations.
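
A rough sketch of that loop, assuming a hypothetical `encoder` and linear `classifier` head and reusing `info_nce` from above: embeddings are clustered with scikit-learn's K-means, the cluster assignments become pseudo labels, and a cross-entropy term on those labels is added to the contrastive loss.

```python
import torch
import torch.nn.functional as F
from sklearn.cluster import KMeans

@torch.no_grad()
def compute_pseudo_labels(encoder, loader, num_clusters, device):
    """Cluster the current embeddings; cluster ids become pseudo labels."""
    encoder.eval()
    feats = torch.cat([encoder(x.to(device)).cpu() for x in loader])
    kmeans = KMeans(n_clusters=num_clusters, n_init=10).fit(feats.numpy())
    return torch.as_tensor(kmeans.labels_, dtype=torch.long)

def refinement_loss(z1, z2, logits, pseudo_labels, lam=1.0):
    """Contrastive (InfoNCE) term plus cross-entropy on the pseudo labels."""
    return info_nce(z1, z2) + lam * F.cross_entropy(logits, pseudo_labels)
```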

Thoughts

It seems that the self-supervised model utilises the pseudo labels by imposing a cross-entropy loss on top of the contrastive loss. What would be the issue with redefining the positive and negative samples based on the pseudo labels and continuing to train the model with the contrastive loss alone?
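
For concreteness, the alternative raised above could look like a supervised-contrastive-style loss driven by the pseudo labels (a hedged sketch, not the paper's method): samples sharing a cluster id are treated as positives, everything else as negatives, with no cross-entropy head.

```python
import torch
import torch.nn.functional as F

def pseudo_label_contrastive(z, pseudo_labels, temperature=0.1):
    """z: (N, d) embeddings; pseudo_labels: (N,) K-means cluster ids."""
    z = F.normalize(z, dim=1)
    sim = z @ z.t() / temperature  # (N, N) pairwise similarities
    n = z.size(0)
    self_mask = torch.eye(n, dtype=torch.bool, device=z.device)
    # Positives: other samples with the same pseudo label as the anchor.
    pos_mask = (pseudo_labels[:, None] == pseudo_labels[None, :]) & ~self_mask
    # Log-softmax over all other samples, averaged over the positives.
    log_prob = sim - torch.logsumexp(
        sim.masked_fill(self_mask, float('-inf')), dim=1, keepdim=True)
    pos_count = pos_mask.sum(1).clamp(min=1)
    loss = -(log_prob * pos_mask).sum(1) / pos_count
    return loss.mean()
```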