Abstract

We present a method for fine-grained control over music generation through inference-time interventions on MusicGen, an autoregressive generative music transformer. Our approach enables timbre transfer, style transfer, and genre fusion by steering the residual stream using the weights of linear probes trained on it, or by steering the attention-layer activations in a similar manner. We observe that framing the probing task as a regression problem improves performance, and hypothesize that the mean-squared-error objective better preserves meaningful directional information in the activation space. Combined with the global conditioning offered by MusicGen's text prompts, our method provides both global and local control over music generation.
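To make the idea concrete, below is a minimal sketch (not the exact implementation from the paper) of steering a transformer layer's output along the weight direction of a linear probe, using a PyTorch forward hook. The names `model`, `layer_index`, `probe_weight`, and `alpha` are illustrative placeholders; the hook simply adds a scaled unit vector derived from the probe weights to the layer's hidden states.

```python
import torch

def make_steering_hook(probe_weight: torch.Tensor, alpha: float):
    """Return a forward hook that nudges hidden states along the probe's weight direction.

    probe_weight: weight vector of a linear probe trained on this layer's activations
                  (shape: [d_model]); alpha: steering strength. Both are assumptions
    for illustration, not values from the paper.
    """
    direction = probe_weight / probe_weight.norm()  # unit-norm steering direction

    def hook(module, inputs, output):
        # Some decoder layers return a tuple; the hidden states are the first element.
        hidden = output[0] if isinstance(output, tuple) else output
        steered = hidden + alpha * direction.to(hidden.dtype).to(hidden.device)
        return (steered, *output[1:]) if isinstance(output, tuple) else steered

    return hook

# Illustrative usage: attach the hook to one decoder layer of a MusicGen-style
# model before generation, then remove it afterwards.
# handle = model.decoder.layers[layer_index].register_forward_hook(
#     make_steering_hook(probe_weight, alpha=4.0))
# ... run generation ...
# handle.remove()
```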

Check out our paper:

Fine-Grained Control over Music Generation with Activation Steering

Samples

To demonstrate our method, we start with a sample of a trumpet playing a western classical piece.

og_audio.wav

And we were able to change the style/genre:

To electronic:

classical_elec.wav

To rock:

elec_Rock.wav

Or to add a tabla accompaniment at the tempo of the melodic sample:

rock_indian (1).wav

And we were also able to change instruments:

To violin:

to_violin.wav

To xylophone:

to_xylophone.wav

Our method intervenes at inference time. For example, in the following audio sample we intervene in the activations at the 5th second, and the change in the audio is evident.

download (1).wav
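A time-localized intervention like this can be sketched by enabling the steering hook only after a chosen number of decoding steps. The snippet below is an assumption-laden illustration: it presumes incremental decoding (one token per forward call) and a frame rate of roughly 50 audio tokens per second, as in MusicGen's 32 kHz EnCodec codec; adjust both for your setup.

```python
import torch

FRAME_RATE = 50     # assumed generated-audio tokens per second
START_SECOND = 5.0  # begin steering at the 5th second of audio
START_STEP = int(START_SECOND * FRAME_RATE)

class TimedSteering:
    """Forward hook that applies the steering direction only after START_STEP decoding steps."""

    def __init__(self, direction: torch.Tensor, alpha: float):
        self.direction = direction / direction.norm()
        self.alpha = alpha
        self.step = 0  # counts autoregressive decoding steps (assumes one token per call)

    def __call__(self, module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        if self.step >= START_STEP:
            hidden = hidden + self.alpha * self.direction.to(hidden.dtype).to(hidden.device)
        self.step += 1
        return (hidden, *output[1:]) if isinstance(output, tuple) else hidden
```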

Additional Samples

Original (rock):

rock_first10s.wav

After steering (adding pop):

rock_to_pop_5.wav

Original (pop):

pop_first10s.wav