
Summary
- Unlike conventional contrastive predictive coding methods, which predict nearby, future, or missing embeddings, this work maximises the mutual information between global speaker embeddings and the learned representations.
- It does not need negative samples as conventional methods do; however, it requires weak supervision, i.e., knowledge of which target speakers are present in the input mixtures.
- A cross-utterance attention mechanism is used to derive the global speaker embeddings.
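A minimal sketch of how such a negative-free, weakly supervised objective might look. This is my own illustration, not the paper's exact loss: I use a log-sigmoid of cosine similarities as a crude stand-in for a mutual-information lower bound between a global speaker embedding and the frame-level representations of a mixture known (via weak supervision) to contain that speaker.

```python
import numpy as np

def mi_style_loss(frame_reprs, speaker_emb):
    """Hypothetical negative-free objective: pull frame-level representations
    of a mixture towards the global embedding of a speaker known to be
    present. No negative samples are used."""
    # l2-normalise frames and the global speaker embedding
    f = frame_reprs / np.linalg.norm(frame_reprs, axis=-1, keepdims=True)
    s = speaker_emb / np.linalg.norm(speaker_emb)
    # average negative log-sigmoid of cosine similarities: a simple
    # stand-in for an MI lower-bound term (illustrative only)
    sims = f @ s
    return -np.mean(np.log(1.0 / (1.0 + np.exp(-sims))))
```

Frames aligned with the speaker embedding yield a lower loss than unrelated frames, which is the behaviour the MI-maximisation view predicts.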
The problems
- Conventional contrastive learning methods for speech representation mostly work for clean utterances, assuming a single speaker per utterance.
- How to learn robust speaker representations from interfered utterances with multiple speakers remains unsolved.
The solution
- Assume a weakly supervised scenario, where the presence of the target speaker in each mixture is known.
- Propose a novel separative contrastive learning loss rather than the usual predictive ones.
- A novel cross-attention mechanism is proposed to learn cross-utterance global speaker representations.
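A rough sketch of cross-utterance attention pooling, under my own assumptions (the paper's exact mechanism may differ): frames from all utterances of a speaker are scored against a query vector, and the softmax-weighted average forms a single global speaker embedding.

```python
import numpy as np

def cross_utterance_pool(utts, query):
    """Pool frame representations from several utterances of one speaker
    into a single global embedding via attention (illustrative sketch)."""
    # concatenate frame representations across utterances: (T_total, D)
    frames = np.concatenate(utts, axis=0)
    # attention scores against a (learned, here given) query vector
    scores = frames @ query
    # numerically stable softmax over all frames of all utterances
    w = np.exp(scores - scores.max())
    w /= w.sum()
    # weighted average -> global speaker embedding of shape (D,)
    return w @ frames
```

Because the softmax runs over frames of all utterances jointly, frames that match the query dominate the pooled embedding regardless of which utterance they came from.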
Thoughts
- The proposed contrastive loss is formulated probabilistically, which could provide an elegant integration with frameworks such as VAE.
- Modifying the contrastive loss to learn a global representation shared across utterances, one that simultaneously maximises between-class distance and minimises within-class distance, seems promising.
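The last thought can be sketched as a simple centroid-based objective (my own illustration, not from the paper): penalise within-class scatter around each speaker centroid, and penalise centroids that fall within a margin of each other.

```python
import numpy as np

def class_separation_loss(embs, labels, margin=1.0):
    """Illustrative loss: minimise within-class variance, push class
    centroids at least `margin` apart (hinge on centroid distance)."""
    labels = np.asarray(labels)
    within = 0.0
    centroids = []
    for c in np.unique(labels):
        X = embs[labels == c]
        mu = X.mean(axis=0)
        centroids.append(mu)
        # mean squared distance of class members to their centroid
        within += np.mean(np.sum((X - mu) ** 2, axis=1))
    # hinge penalty on pairs of centroids closer than the margin
    between = 0.0
    n_pairs = 0
    for i in range(len(centroids)):
        for j in range(i + 1, len(centroids)):
            d = np.linalg.norm(centroids[i] - centroids[j])
            between += max(0.0, margin - d) ** 2
            n_pairs += 1
    return within + between / max(n_pairs, 1)
```

Tight, well-separated speaker clusters receive a lower loss than overlapping ones, which matches the between-class/within-class intuition in the note above.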