
Summary
- Applies the vector-quantised paradigm from speech disentanglement to music, learning a discrete representation for content (pitch) and a continuous one for style (timbre).
- Training is also self-supervised: the model is fed pairs of samples that share an instrument but differ in pitch content.
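The pair-sampling strategy can be sketched minimally as below. The dataset layout and field names are assumptions for illustration; the real pipeline would operate on audio clips tagged with instrument and pitch metadata.

```python
import random

# Toy dataset: (instrument, pitch) tuples standing in for labelled audio clips.
dataset = [(inst, pitch)
           for inst in ["piano", "violin", "flute"]
           for pitch in range(60, 72)]  # MIDI pitches C4..B4

def sample_pair(dataset):
    """Draw two clips that share an instrument but have different pitches,
    i.e. same style (timbre), different content (pitch)."""
    a = random.choice(dataset)
    candidates = [x for x in dataset if x[0] == a[0] and x[1] != a[1]]
    b = random.choice(candidates)
    return a, b

a, b = sample_pair(dataset)
```

Each such pair lets the model tie the shared style factor to the common instrument while the content factor must explain the pitch difference.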
The problems
- Most prior work only handles single-note datasets.
The solution
- Two main inductive biases are imposed on the model: the strong bottleneck created by the discrete content representation, and the augmented input pairs that share the same instrument.
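The discrete bottleneck can be illustrated with a minimal nearest-neighbour vector-quantisation step in NumPy. Shapes, codebook size, and names here are toy assumptions, not the paper's actual configuration, and the straight-through gradient trick used during training is omitted:

```python
import numpy as np

def vq_bottleneck(z, codebook):
    """Quantise each content frame to its nearest codebook entry.

    z:        (T, D) continuous encoder outputs (content stream)
    codebook: (K, D) learned discrete embeddings
    Returns the quantised vectors and the chosen code indices.
    """
    # Squared Euclidean distance between every frame and every code: (T, K)
    d = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    idx = d.argmin(axis=1)          # (T,) nearest-code index per frame
    return codebook[idx], idx

rng = np.random.default_rng(0)
codebook = rng.normal(size=(8, 4))  # K=8 codes, D=4 dims (toy sizes)
z = rng.normal(size=(16, 4))        # 16 frames of content features
zq, idx = vq_bottleneck(z, codebook)
```

Because every frame must pass through one of only K codes, the content branch cannot smuggle fine-grained timbre information, which is what makes the bottleneck an effective disentanglement bias.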
Thoughts