Imagine standing before a vast, humming machine whose inner workings are hidden behind opaque panels. You can see the inputs going in and watch the outputs coming out, but what happens inside remains a mystery. This is how most of us experience modern deep learning models - astonishing, but ultimately black boxes. Mechanistic Interpretability (MI) is the research direction that seeks to pry open those panels and understand every cog and gear that drives a neural network’s decisions.

Why go beyond “What” to “How”?

Traditional interpretability tools - saliency maps, feature importance scores, LIME, SHAP - offer real value. They tell us which input features influenced a prediction. But they stop short of revealing how the network actually computes its answer. MI takes us from correlation to causation. It’s not enough to know that “the model looked at these pixels”; we want to know which neurons lit up, which circuits processed those activations, and in what order. We aim to rebuild the network’s computation in human-readable form, almost like pseudocode for a piece of software.
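
To make the contrast concrete, here is a minimal sketch of the “what”-style attribution those tools provide: a vanilla gradient saliency map over a pretrained torchvision ResNet-18. The image path is a hypothetical placeholder; the point is that the result tells you which pixels mattered, not how the network used them.

```python
import torch
from torchvision import models, transforms
from PIL import Image

# "What"-style attribution: a vanilla gradient saliency map.
# It highlights which pixels influenced the prediction, but says nothing
# about the internal circuits that computed it.
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT).eval()

preprocess = transforms.Compose([
    transforms.Resize(224),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
])

img = preprocess(Image.open("cat.jpg")).unsqueeze(0)  # hypothetical input image
img.requires_grad_(True)

logits = model(img)
logits[0, logits.argmax()].backward()          # gradient of the top class score
saliency = img.grad.abs().max(dim=1).values    # per-pixel importance, shape (1, 224, 224)
```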

The building blocks of understanding

Features

At the lowest level, a feature is a pattern the network learns to recognize - edges in an image, parts of speech in text, or textures in a scene. Early vision models revealed neurons that detect horizontal lines or the colour red. In language models, some neurons spike in response to quotation marks or specific grammatical structures.
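
One simple way to probe such a feature is to look for the inputs that drive a given unit hardest. The sketch below does this for an arbitrary channel in a pretrained ResNet-18; the layer and channel indices are placeholders for illustration, not claims about what that network actually encodes.

```python
import torch
from torchvision import models

# Record one channel's activation over a batch of inputs and rank the inputs
# by how strongly they drive it. Layer and channel choices are arbitrary.
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT).eval()

acts = {}
def hook(module, inputs, output):
    acts["layer3"] = output.detach()

model.layer3.register_forward_hook(hook)

images = torch.randn(16, 3, 224, 224)   # stand-in for a real image batch
with torch.no_grad():
    model(images)

channel = 42                                              # hypothetical unit of interest
strength = acts["layer3"][:, channel].mean(dim=(1, 2))    # mean activation per image
top_images = strength.topk(3).indices                     # images that excite this unit most
print(top_images)
```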

Polysemantic Neurons and Superposition

Reality quickly gets messy: neurons often become “polysemantic,” meaning they respond to multiple, seemingly unrelated features. A single neuron might fire for both cat faces and car fronts. The superposition hypothesis suggests that networks pack more features into their finite set of neurons by overlapping representations. This means we can’t always point to one neuron and say, “That’s the cat detector.”
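
A toy numerical sketch of the idea (purely illustrative, not drawn from any particular model): give each of many sparse features a random direction in a smaller neuron space, and every neuron ends up carrying a mixture of features.

```python
import numpy as np

# Toy superposition: embed more sparse "features" than there are neurons by
# assigning each feature a random direction in neuron space. Individual
# neurons then respond to many features at once, i.e. they look polysemantic.
rng = np.random.default_rng(0)
n_neurons, n_features = 32, 128            # 4x more features than neurons

# Random, roughly unit-norm feature directions.
directions = rng.normal(size=(n_features, n_neurons))
directions /= np.linalg.norm(directions, axis=1, keepdims=True)

# A sparse input: only a handful of features are active at once.
feature_values = np.zeros(n_features)
feature_values[rng.choice(n_features, size=5, replace=False)] = 1.0

# Neuron activations are a sum of the active features' directions,
# so no single neuron cleanly corresponds to one feature.
neuron_acts = feature_values @ directions

overlaps = directions @ directions.T - np.eye(n_features)
print("mean |cosine| between distinct features:", np.abs(overlaps).mean())
```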

Circuits

These are groups of neurons that collaborate to perform a function. A low-level circuit might detect edges and textures; a mid‑level one might combine those into “cat ears” or “wheel spokes”; a high‑level circuit might aggregate parts into entire object representations. By mapping these circuits, we start to see the network’s hierarchical processing pipeline.
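
One crude but useful way to start mapping such connections is to read them straight off the weights: for a chosen downstream channel, ask which upstream channels are wired to it most strongly. The sketch below does this for an arbitrary convolution in a pretrained ResNet-18; the layer and channel indices are placeholders, not a claim about what that network computes.

```python
import torch
from torchvision import models

# Weight-based circuit inspection: for one mid-level channel, find which
# channels in the previous layer feed it most strongly.
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

conv = model.layer2[0].conv1          # a conv connecting layer1 outputs to layer2
w = conv.weight.detach()              # shape: (out_channels, in_channels, kH, kW)

target = 7                                    # hypothetical downstream channel of interest
influence = w[target].abs().sum(dim=(1, 2))   # total |weight| from each upstream channel
strongest_inputs = influence.topk(5).indices
print("upstream channels most strongly wired to channel", target, ":", strongest_inputs)
```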

Causal Interventions

To move from association to causation, researchers employ activation patching and causal tracing: they record activations on one input, splice them into a forward pass on a different input, and measure how the output changes, revealing which components causally drive a behaviour.

These interventions are how researchers uncovered circuits like induction heads in transformers, which enable in-context learning by spotting repeated patterns in a sequence, and the “indirect object identification” circuit in language models, which reliably picks out the right noun when completing a sentence.
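
Here is a minimal sketch of the activation-patching recipe on a deliberately tiny toy network (a stand-in, not a real language model). In practice the same hook-based splice is applied to attention heads or MLP blocks at specific token positions.

```python
import torch
import torch.nn as nn

# Activation patching on a toy two-layer network: cache hidden activations
# from a "clean" input, splice a few of them into a run on a "corrupted"
# input, and see how far the output moves back toward the clean answer.
torch.manual_seed(0)
model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 2)).eval()

clean, corrupted = torch.randn(1, 8), torch.randn(1, 8)

# 1. Cache the hidden activation from the clean run.
cache = {}
handle = model[1].register_forward_hook(lambda m, i, o: cache.update(hidden=o.detach()))
with torch.no_grad():
    clean_out = model(clean)
handle.remove()

with torch.no_grad():
    corrupted_out = model(corrupted)          # baseline, no patching

# 2. Patch a subset of hidden units (a candidate "circuit") into the corrupted run.
def patch_hook(module, inputs, output):
    patched = output.clone()
    patched[:, :4] = cache["hidden"][:, :4]   # splice in units 0-3 from the clean run
    return patched                            # returning a value overrides the output

handle = model[1].register_forward_hook(patch_hook)
with torch.no_grad():
    patched_out = model(corrupted)
handle.remove()

# 3. If those units matter causally, patched_out should move toward clean_out.
print(clean_out, corrupted_out, patched_out, sep="\n")
```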

Interpreting Vision Models and Vision–Language Models

While much of early MI work focused on language and synthetic tasks, vision has its own rich interpretability story - and it only gets more intriguing when you blend pixels with prose.

CNNs and Early Vision Models