
Summary
- Contrastive learning helps structure the latent space so that it can generalise to data with attributes unseen during training.
- Achieves pitch-timbre disentanglement through weak supervision, requiring only the information of whether samples share a common pitch/instrument.
The problems
- Existing models for pitch-timbre disentanglement may not generalise to samples with unseen pitches and instruments.
The solution
- Impose a contrastive loss that pulls latent representations of samples sharing the same pitch/instrument closer together and pushes them apart otherwise (see the sketch after this list).
- Unlike the Gaussian-mixture VAE, the method is not limited to a pre-defined set of known pitches and instruments, which improves generalisability to unseen attributes.
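A minimal sketch of what such a weakly supervised contrastive term could look like; the function name, the SupCon-style averaging over positives, and the temperature value are my assumptions, not the paper's exact formulation:

```python
import torch
import torch.nn.functional as F

def weak_contrastive_loss(z, pos_mask, temperature=0.1):
    """Pairwise contrastive loss over a batch of latent codes.

    z        : (B, D) latent codes (e.g. the pitch or timbre part of the latent).
    pos_mask : (B, B) boolean matrix; pos_mask[i, j] is True when samples i and j
               share the attribute (same pitch for the pitch latent, same
               instrument for the timbre latent). Only this pairwise information
               is required, not the attribute values themselves.
    """
    z = F.normalize(z, dim=1)                      # work in cosine-similarity space
    sim = z @ z.t() / temperature                  # (B, B) similarity logits

    # exclude self-similarity from both the positives and the normaliser
    eye = torch.eye(len(z), dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(eye, float('-inf'))
    pos_mask = pos_mask & ~eye

    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)

    # average log-probability over each anchor's positives;
    # anchors with no positive in the batch are skipped
    pos_log_prob = log_prob.masked_fill(~pos_mask, 0.0).sum(dim=1)
    pos_counts = pos_mask.sum(dim=1)
    valid = pos_counts > 0
    return (-pos_log_prob[valid] / pos_counts[valid]).mean()
```

The same loss would presumably be applied separately to the pitch and timbre latents, with the mask built pairwise (e.g. `pitch_ids[:, None] == pitch_ids[None, :]` and the analogous instrument comparison); this pairwise mask is exactly the weak supervision the paper requires.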
Thoughts
- The idea of incorporating a contrastive loss with VAEs to learn a more discriminative latent space is promising. A direction going forward could be a more principled integration based on a probabilistic interpretation of the contrastive loss, as shown in Contrastive Separative Coding for Self-Supervised Representation Learning.
- For inferring representations, the GM-VAE is in principle not limited to the pre-defined dictionary and clusters, so it would be interesting to evaluate its generalisability as well.
- The claim of modelling perceptually related latent spaces is not trivial to verify from the reported metrics. After all, the proposed training objective might push apart latents derived from perceptually similar samples (e.g. the same note played by two different instruments is separated in the timbre space).
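As a rough sketch of how the two objectives could be combined (the symbols z_p and z_t for the pitch/timbre latents, the weights beta, lambda_p and lambda_t, and the additive form are my assumptions, not the paper's stated objective):

```latex
\mathcal{L}
  = \underbrace{-\,\mathbb{E}_{q_\phi(z_p, z_t \mid x)}\!\left[\log p_\theta(x \mid z_p, z_t)\right]
    + \beta\, D_{\mathrm{KL}}\!\left(q_\phi(z_p, z_t \mid x) \,\|\, p(z_p, z_t)\right)}_{\text{negative ELBO}}
  \;+\; \lambda_p\, \mathcal{L}_{\mathrm{con}}(z_p)
  \;+\; \lambda_t\, \mathcal{L}_{\mathrm{con}}(z_t)
```

A more principled integration, in the spirit of contrastive separative coding, would derive the contrastive terms from the probabilistic model itself (e.g. as a likelihood ratio between positive and negative pairs) rather than adding them as separate penalties.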