This post presents our motivation for working on model diffing, some of our first results using sparse dictionary methods, and our next steps. This work was done as part of the MATS 7 extension. We'd like to thank Cameron Holmes and Bart Bussman for their useful feedback.

Could OpenAI have avoided releasing an absurdly sycophantic model update? Could mechanistic interpretability have caught it? Maybe with model diffing!

Model diffing is the study of mechanistic changes introduced during fine-tuning - essentially, understanding what makes a fine-tuned model different from its base model internally. Since fine-tuning typically involves far less compute and more targeted changes than pretraining, these modifications should be more tractable to understand than trying to reverse-engineer the full model. At the same time, many concerning behaviors (reward hacking, sycophancy, deceptive alignment) emerge during fine-tuning[^1], making model diffing potentially valuable for catching problems before deployment. We investigate the efficacy of model diffing techniques by exploring the internal differences between base and chat models.

RLHF is often visualized as a "mask" applied on top of the base LLM's raw capabilities (the "shoggoth"). One application of model diffing is studying this mask specifically, rather than the entire shoggoth+mask system.

TL;DR

Background

Sparse dictionary methods like SAEs decompose neural network activations into interpretable components, finding the “concepts” a model uses. Crosscoders are a clever variant: they learn a single set of concepts shared between a base model and its fine-tuned version, but with separate representations for each model. Concretely, if a concept is detected, it’s forced to be reconstructed in both models, but the way it’s reconstructed can be different for each model.
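
To make the setup concrete, here is a minimal, illustrative sketch of the crosscoder idea in PyTorch. The class name, dimensions, and the single concatenated encoder are our own simplifications, not the exact architecture or training objective from the paper.

```python
import torch
import torch.nn as nn

class CrossCoder(nn.Module):
    """Minimal crosscoder sketch: one shared latent space, one decoder per model."""

    def __init__(self, d_model: int, n_latents: int):
        super().__init__()
        # A single encoder reads both models' activations and produces shared latents.
        self.encoder = nn.Linear(2 * d_model, n_latents)
        # Separate decoders: latent j has a representation d_j^base and a d_j^chat.
        self.decoder_base = nn.Linear(n_latents, d_model, bias=False)
        self.decoder_chat = nn.Linear(n_latents, d_model, bias=False)

    def forward(self, act_base: torch.Tensor, act_chat: torch.Tensor):
        # A latent fires once, based on both models' activations...
        f = torch.relu(self.encoder(torch.cat([act_base, act_chat], dim=-1)))
        # ...but is reconstructed separately in each model.
        return f, self.decoder_base(f), self.decoder_chat(f)
```

Training (omitted here) jointly minimizes both models' reconstruction errors plus a sparsity penalty on the latents.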

The key insight is that if a concept only exists in the fine-tuned model, its base model representation should have zero norm (since the base model never needs to reconstruct that concept). By comparing representation norms, we can identify concepts that are unique to the base model, unique to the fine-tuned model, or shared between them. This seemed like a principled way to understand what fine-tuning adds to a model.
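
As an illustration of that norm comparison, here is a sketch building on the `CrossCoder` stub above; the helper's name and the classification thresholds are ours, purely for exposition.

```python
def classify_latents(crosscoder: CrossCoder, eps: float = 0.1) -> dict[str, torch.Tensor]:
    """Bucket latents by the relative norm of their per-model representations."""
    # Column j of each decoder weight matrix is latent j's representation in that model.
    norm_base = crosscoder.decoder_base.weight.norm(dim=0)  # shape: (n_latents,)
    norm_chat = crosscoder.decoder_chat.weight.norm(dim=0)

    # Relative norm in [0, 1]: ~0 => base-only, ~1 => chat-only, ~0.5 => shared.
    rel = norm_chat / (norm_base + norm_chat + 1e-8)

    return {
        "base_only": (rel < eps).nonzero().squeeze(-1),
        "chat_only": (rel > 1 - eps).nonzero().squeeze(-1),
        "shared": ((rel >= eps) & (rel <= 1 - eps)).nonzero().squeeze(-1),
    }
```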

The Problem: Most “Model-Specific” Latents Aren’t

In our latest paper, we first trained a crosscoder on the middle layer of the Gemma-2 2B base and chat models, following Anthropic’s setup. The representation norm comparison revealed thousands of apparent “chat-only” latents. But when we looked at these latents, most weren’t interpretable - they seemed like noise rather than meaningful concepts.

This led us to develop Latent Scaling, a technique that measures how much each latent actually explains the activations in each model. For a latent $j$ with activation $f_j(x)$ on input $x$ and per-model representation (decoder vector) $d_j^m$, we compute

$$ \beta_j^m = \arg\min_{\beta} \sum_{x} \left\lVert \beta\, f_j(x)\, d_j^m - a^m(x) \right\rVert_2^2, $$

where $a^m(x)$ is model $m$’s activation on input $x$. This coefficient measures the importance of latent $j$ for reconstructing model $m$’s activations[^CF].
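
For a single latent this is a scalar least-squares problem, so it has a closed form; the sketch below computes it for all latents at once. The function name and tensor layout are our own illustrative choices, not the paper's code.

```python
import torch

def latent_scaling(f: torch.Tensor, acts: torch.Tensor, decoder: torch.Tensor) -> torch.Tensor:
    """Closed-form least-squares solution for beta_j^m, for all latents j at once.

    f:       (n_samples, n_latents)  latent activations f_j(x)
    acts:    (n_samples, d_model)    model m's activations a^m(x)
    decoder: (n_latents, d_model)    latent representations d_j^m for model m
    returns: (n_latents,)            beta_j^m for each latent
    """
    # numerator: sum_x f_j(x) * <d_j^m, a^m(x)>
    num = (f * (acts @ decoder.T)).sum(dim=0)
    # denominator: sum_x f_j(x)^2 * ||d_j^m||^2
    denom = (f ** 2).sum(dim=0) * decoder.norm(dim=1).pow(2)
    return num / (denom + 1e-8)  # epsilon guards against dead latents
```

With the `CrossCoder` sketch above, one would call this once per model (e.g. passing `crosscoder.decoder_chat.weight.T` as `decoder`) and compare the resulting coefficients across the two models.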

We can then measure the fine-tuning specificity of a latent with: