Imagine standing before a vast, humming machine whose inner workings are hidden behind opaque panels. You can see the inputs going in and watch the outputs coming out, but what happens inside remains a mystery. This is how most of us experience modern deep learning models - astonishing, but ultimately black boxes. Mechanistic Interpretability (MI) is the research direction that seeks to pry open those panels and understand every cog and gear that drives a neural network’s decisions.

Why go beyond “What” to “How”?

Traditional interpretability tools - saliency maps, feature importance scores, LIME, SHAP - offer real value. They tell us which input features influenced a prediction. But they stop short of revealing how the network actually computes its answer. MI takes us from correlation to causation. It’s not enough to know that “the model looked at these pixels”; we want to know which neurons lit up, which circuits processed those activations, and in what order. We aim to rebuild the network’s computation in human-readable form, almost like pseudocode for a piece of software.
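
To make the contrast concrete, here is a minimal sketch of the “what”-style attribution those tools provide: a vanilla gradient saliency map over a pretrained torchvision ResNet-18. The image path is a hypothetical placeholder; the point is that the result tells you which pixels mattered, not how the network used them.

```python
import torch
from torchvision import models, transforms
from PIL import Image

# "What"-style attribution: a vanilla gradient saliency map.
# It highlights which pixels influenced the prediction, but says nothing
# about the internal circuits that computed it.
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT).eval()

preprocess = transforms.Compose([
    transforms.Resize(224),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
])

img = preprocess(Image.open("cat.jpg")).unsqueeze(0)  # hypothetical input image
img.requires_grad_(True)

logits = model(img)
logits[0, logits.argmax()].backward()          # gradient of the top class score
saliency = img.grad.abs().max(dim=1).values    # per-pixel importance, shape (1, 224, 224)
```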

The building blocks of understanding

Features

At the lowest level, a feature is a pattern the network learns to recognize - edges in an image, parts of speech in text, or textures in a scene. Early vision models revealed neurons that detect horizontal lines or the colour red. In language models, some neurons spike in response to quotation marks or specific grammatical structures.
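
One simple way to probe such a feature is to look for the inputs that drive a given unit hardest. The sketch below does this for an arbitrary channel in a pretrained ResNet-18; the layer and channel indices are placeholders for illustration, not claims about what that network actually encodes.

```python
import torch
from torchvision import models

# Record one channel's activation over a batch of inputs and rank the inputs
# by how strongly they drive it. Layer and channel choices are arbitrary.
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT).eval()

acts = {}
def hook(module, inputs, output):
    acts["layer3"] = output.detach()

model.layer3.register_forward_hook(hook)

images = torch.randn(16, 3, 224, 224)   # stand-in for a real image batch
with torch.no_grad():
    model(images)

channel = 42                                              # hypothetical unit of interest
strength = acts["layer3"][:, channel].mean(dim=(1, 2))    # mean activation per image
top_images = strength.topk(3).indices                     # images that excite this unit most
print(top_images)
```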

Polysemantic Neurons and Superposition

Reality quickly gets messy: neurons often become “polysemantic,” meaning they respond to multiple, seemingly unrelated features. A single neuron might fire for both cat faces and car fronts. The superposition hypothesis suggests that networks pack more features into their finite set of neurons by overlapping representations. This means we can’t always point to one neuron and say, “That’s the cat detector.”
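
A toy numerical sketch of the idea (purely illustrative, not drawn from any particular model): give each of many sparse features a random direction in a smaller neuron space, and every neuron ends up carrying a mixture of features.

```python
import numpy as np

# Toy superposition: embed more sparse "features" than there are neurons by
# assigning each feature a random direction in neuron space. Individual
# neurons then respond to many features at once, i.e. they look polysemantic.
rng = np.random.default_rng(0)
n_neurons, n_features = 32, 128            # 4x more features than neurons

# Random, roughly unit-norm feature directions.
directions = rng.normal(size=(n_features, n_neurons))
directions /= np.linalg.norm(directions, axis=1, keepdims=True)

# A sparse input: only a handful of features are active at once.
feature_values = np.zeros(n_features)
feature_values[rng.choice(n_features, size=5, replace=False)] = 1.0

# Neuron activations are a sum of the active features' directions,
# so no single neuron cleanly corresponds to one feature.
neuron_acts = feature_values @ directions

overlaps = directions @ directions.T - np.eye(n_features)
print("mean |cosine| between distinct features:", np.abs(overlaps).mean())
```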

Circuits

These are groups of neurons that collaborate to perform a function. A low-level circuit might detect edges and textures; a mid‑level one might combine those into “cat ears” or “wheel spokes”; a high‑level circuit might aggregate parts into entire object representations. By mapping these circuits, we start to see the network’s hierarchical processing pipeline.
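
One crude but useful way to start mapping such connections is to read them straight off the weights: for a chosen downstream channel, ask which upstream channels are wired to it most strongly. The sketch below does this for an arbitrary convolution in a pretrained ResNet-18; the layer and channel indices are placeholders, not a claim about what that network computes.

```python
import torch
from torchvision import models

# Weight-based circuit inspection: for one mid-level channel, find which
# channels in the previous layer feed it most strongly.
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

conv = model.layer2[0].conv1          # a conv connecting layer1 outputs to layer2
w = conv.weight.detach()              # shape: (out_channels, in_channels, kH, kW)

target = 7                                    # hypothetical downstream channel of interest
influence = w[target].abs().sum(dim=(1, 2))   # total |weight| from each upstream channel
strongest_inputs = influence.topk(5).indices
print("upstream channels most strongly wired to channel", target, ":", strongest_inputs)
```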

Causal Interventions

To move from association to causation, researchers employ activation patching and causal tracing: they record activations on one input, splice them into a forward pass on a different input, and measure how the output changes, revealing which components causally drive a behaviour.

These interventions are how researchers uncovered circuits like induction heads in transformers, which enable in-context learning by spotting repeated patterns in a sequence, and the “indirect object identification” circuit in language models, which reliably picks out the right noun when completing a sentence.
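
Here is a minimal sketch of the activation-patching recipe on a deliberately tiny toy network (a stand-in, not a real language model). In practice the same hook-based splice is applied to attention heads or MLP blocks at specific token positions.

```python
import torch
import torch.nn as nn

# Activation patching on a toy two-layer network: cache hidden activations
# from a "clean" input, splice a few of them into a run on a "corrupted"
# input, and see how far the output moves back toward the clean answer.
torch.manual_seed(0)
model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 2)).eval()

clean, corrupted = torch.randn(1, 8), torch.randn(1, 8)

# 1. Cache the hidden activation from the clean run.
cache = {}
handle = model[1].register_forward_hook(lambda m, i, o: cache.update(hidden=o.detach()))
with torch.no_grad():
    clean_out = model(clean)
handle.remove()

with torch.no_grad():
    corrupted_out = model(corrupted)          # baseline, no patching

# 2. Patch a subset of hidden units (a candidate "circuit") into the corrupted run.
def patch_hook(module, inputs, output):
    patched = output.clone()
    patched[:, :4] = cache["hidden"][:, :4]   # splice in units 0-3 from the clean run
    return patched                            # returning a value overrides the output

handle = model[1].register_forward_hook(patch_hook)
with torch.no_grad():
    patched_out = model(corrupted)
handle.remove()

# 3. If those units matter causally, patched_out should move toward clean_out.
print(clean_out, corrupted_out, patched_out, sep="\n")
```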

Interpreting Vision Models and Vision–Language Models

While much of early MI work focused on language and synthetic tasks, vision has its own rich interpretability story - and it only gets more intriguing when you blend pixels with prose.

CNNs and Early Vision Models