
Summary
- Unlike conventional contrastive predictive coding methods, which predict nearby, future, or missing embeddings, this work maximises the mutual information between global speaker embeddings and the learned representations.
- It does not need negative samples as conventional methods do; however, it requires weak supervision, i.e., knowledge of which target speakers are present in the input mixtures.
- A cross-utterance attention mechanism is used to derive the global speaker embeddings.
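A minimal sketch of how such a negative-free, weakly supervised objective might look. This is my own illustration, not the paper's exact loss: I use a log-sigmoid of cosine similarities as a crude stand-in for a mutual-information lower bound between a global speaker embedding and the frame-level representations of a mixture known (via weak supervision) to contain that speaker.

```python
import numpy as np

def mi_style_loss(frame_reprs, speaker_emb):
    """Hypothetical negative-free objective: pull frame-level representations
    of a mixture towards the global embedding of a speaker known to be
    present. No negative samples are used."""
    # l2-normalise frames and the global speaker embedding
    f = frame_reprs / np.linalg.norm(frame_reprs, axis=-1, keepdims=True)
    s = speaker_emb / np.linalg.norm(speaker_emb)
    # average negative log-sigmoid of cosine similarities: a simple
    # stand-in for an MI lower-bound term (illustrative only)
    sims = f @ s
    return -np.mean(np.log(1.0 / (1.0 + np.exp(-sims))))
```

Frames aligned with the speaker embedding yield a lower loss than unrelated frames, which is the behaviour the MI-maximisation view predicts.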
The problems
- Conventional contrastive learning methods for speech representation mostly work for clean utterances, assuming a single speaker per utterance.
- How to learn robust speaker representations from interfered utterances with multiple speakers remains unsolved.
The solution
- Assume a weakly supervised scenario, where the presence of the target speaker in each mixture is known.
- Propose a novel separative contrastive learning loss rather than the usual predictive ones.
- A novel cross-attention mechanism is proposed to learn cross-utterance global speaker representations.
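A rough sketch of cross-utterance attention pooling, under my own assumptions (the paper's exact mechanism may differ): frames from all utterances of a speaker are scored against a query vector, and the softmax-weighted average forms a single global speaker embedding.

```python
import numpy as np

def cross_utterance_pool(utts, query):
    """Pool frame representations from several utterances of one speaker
    into a single global embedding via attention (illustrative sketch)."""
    # concatenate frame representations across utterances: (T_total, D)
    frames = np.concatenate(utts, axis=0)
    # attention scores against a (learned, here given) query vector
    scores = frames @ query
    # numerically stable softmax over all frames of all utterances
    w = np.exp(scores - scores.max())
    w /= w.sum()
    # weighted average -> global speaker embedding of shape (D,)
    return w @ frames
```

Because the softmax runs over frames of all utterances jointly, frames that match the query dominate the pooled embedding regardless of which utterance they came from.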
Thoughts
- The proposed contrastive loss is formulated probabilistically, which could provide an elegant integration with frameworks such as VAE.
- Modifying the contrastive loss to learn a global representation shared across utterances, one that simultaneously maximises between-class distance and minimises within-class distance, seems promising.
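The last thought can be sketched as a simple centroid-based objective (my own illustration, not from the paper): penalise within-class scatter around each speaker centroid, and penalise centroids that fall within a margin of each other.

```python
import numpy as np

def class_separation_loss(embs, labels, margin=1.0):
    """Illustrative loss: minimise within-class variance, push class
    centroids at least `margin` apart (hinge on centroid distance)."""
    labels = np.asarray(labels)
    within = 0.0
    centroids = []
    for c in np.unique(labels):
        X = embs[labels == c]
        mu = X.mean(axis=0)
        centroids.append(mu)
        # mean squared distance of class members to their centroid
        within += np.mean(np.sum((X - mu) ** 2, axis=1))
    # hinge penalty on pairs of centroids closer than the margin
    between = 0.0
    n_pairs = 0
    for i in range(len(centroids)):
        for j in range(i + 1, len(centroids)):
            d = np.linalg.norm(centroids[i] - centroids[j])
            between += max(0.0, margin - d) ** 2
            n_pairs += 1
    return within + between / max(n_pairs, 1)
```

Tight, well-separated speaker clusters receive a lower loss than overlapping ones, which matches the between-class/within-class intuition in the note above.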