What We Learned Trying to Diff Base and Chat Models (And Why It Matters)

This post presents some motivation on why we work on model diffing, some of our first results using sparse dictionary methods and our next steps. This work was done as part of the MATS 7 extension. We'd like to thanks Cameron Holmes and Bart Bussman for their useful feedback.

Could OpenAI have avoided releasing an absurdly sycophantic model update? Could mechanistic interpretability have caught it? Maybe with model diffing!

Model diffing is the study of mechanistic changes introduced during fine-tuning - essentially, understanding what makes a fine-tuned model different from its base model internally. Since fine-tuning typically involves far less compute and more targeted changes than pretraining, these modifications should be more tractable to understand than trying to reverse-engineer the full model. At the same time, many concerning behaviors (reward hacking, sycophancy, deceptive alignment) emerge during fine-tuning1, making model diffing potentially valuable for catching problems before deployment. We investigate the efficacy of model diffing techniques by exploring the internal differences between base and chat models.

RLHF is often visualized as a "mask" applied on top of the base LLM's raw capabilities (the "shoggoth"). One application of model diffing is studying this mask specifically, rather than the entire shoggoth+mask system.

TL;DR

We demonstrate that model diffing techniques offer valuable insights into the effects of chat-tuning, highlighting this is a promising research direction deserving more attention.
We investigated Crosscoders, a sparse dictionary diffing method developed by Anthropic, which learns a shared dictionary between the base and chat model activations allowing us to identify latents specific to the chat model ("chat-only latents").
Using Latent Scaling, a new metric that quantifies how model-specific a latent is, we discovered fundamental issues in Crosscoders causing hallucinations of chat-only latents.
We show that you can resolve these issues with a simple architectural adjustment: using a BatchTopK instead of an L1 sparsity loss.
This reveals interesting patterns created by chat fine-tuning like the importance of template tokens like <end_of_turn>.
However, we show that combining Latent Scaling with an SAE trained on the chat model or chat - base activations might work even better.
We’re working on a toolkit to run model diffing methods and on an evaluation framework to properly compare the different methods in more controlled settings (where we know what the diff is).

Background

Sparse dictionary methods like SAEs decompose neural network activations into interpretable components, finding the “concepts” a model uses. Crosscoders are a clever variant: they learn a single set of concepts shared between a base model and its fine-tuned version, but with separate representations for each model. Concretely, if a concept is detected, it’s forced to be reconstructed in both models, but the way it’s reconstructed can be different for each model.

The key insight is that if a concept only exists in the fine-tuned model, its base model representation should have zero norm (since the base model never needs to reconstruct that concept). By comparing representation norms, we can identify concepts that are unique to the base model, unique to the fine-tuned model, or shared between them. This seemed like a principled way to understand what fine-tuning adds to a model.

The Problem: Most “Model-Specific” Latents Aren’t

In our latest paper, we first trained a crosscoder on the middle-layer of Gemma-2 2B base and chat models, following Anthropic’s setup. The representation norm comparison revealed thousands of apparent “chat-only” latents. But when we looked at these latents, most weren’t interpretable - they seemed like noise rather than meaningful concepts.

This led us to develop Latent Scaling, a technique that measures how much each latent actually explains the activations in each model. For a latent $j$, with representation $d_j$, we compute

$$ ⁍ $$

which measure the importance of latent $j$ for reconstructing model $m$’s activations[^CF]. We can then measure the fine-tuning specificity of a latent with: