
Summary
- Applies the vector-quantised paradigm from speech disentanglement to music, learning a discrete representation for content (pitch) and a continuous one for style (timbre).
- Training is also self-supervised: the model is fed pairs of samples that share an instrument but differ in pitch content.
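The pair-sampling strategy can be sketched minimally as below. The dataset layout and field names are assumptions for illustration; the real pipeline would operate on audio clips tagged with instrument and pitch metadata.

```python
import random

# Toy dataset: (instrument, pitch) tuples standing in for labelled audio clips.
dataset = [(inst, pitch)
           for inst in ["piano", "violin", "flute"]
           for pitch in range(60, 72)]  # MIDI pitches C4..B4

def sample_pair(dataset):
    """Draw two clips that share an instrument but have different pitches,
    i.e. same style (timbre), different content (pitch)."""
    a = random.choice(dataset)
    candidates = [x for x in dataset if x[0] == a[0] and x[1] != a[1]]
    b = random.choice(candidates)
    return a, b

a, b = sample_pair(dataset)
```

Each such pair lets the model tie the shared style factor to the common instrument while the content factor must explain the pitch difference.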
The problems
- Most prior work only handles single-note datasets.
The solution
- Two main inductive biases are imposed on the model: the strong bottleneck created by the discrete content representation, and the augmented input pairs that share the same instrument.
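The discrete bottleneck can be illustrated with a minimal nearest-neighbour vector-quantisation step in NumPy. Shapes, codebook size, and names here are toy assumptions, not the paper's actual configuration, and the straight-through gradient trick used during training is omitted:

```python
import numpy as np

def vq_bottleneck(z, codebook):
    """Quantise each content frame to its nearest codebook entry.

    z:        (T, D) continuous encoder outputs (content stream)
    codebook: (K, D) learned discrete embeddings
    Returns the quantised vectors and the chosen code indices.
    """
    # Squared Euclidean distance between every frame and every code: (T, K)
    d = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    idx = d.argmin(axis=1)          # (T,) nearest-code index per frame
    return codebook[idx], idx

rng = np.random.default_rng(0)
codebook = rng.normal(size=(8, 4))  # K=8 codes, D=4 dims (toy sizes)
z = rng.normal(size=(16, 4))        # 16 frames of content features
zq, idx = vq_bottleneck(z, codebook)
```

Because every frame must pass through one of only K codes, the content branch cannot smuggle fine-grained timbre information, which is what makes the bottleneck an effective disentanglement bias.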
Thoughts