Science of LLMs - A research primer

Mechanistic Interpretability

Mechanistic interpretability is quite helpful for unlocking the black box of LLMs, VLMs, MLLMs, etc. and understanding their internals. This is typically done by attaching sparse autoencoders to the model (recent trends show increased use of transcoders/crosscoders) and monitoring layer-wise activations (and/or heatmaps).
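
As a quick illustration of what "monitoring layer-wise activations" looks like in practice, here is a minimal sketch using forward hooks on a Hugging Face causal LM; the model name ("gpt2") and the `model.transformer.h` block path are assumptions chosen for the example, not a prescribed setup.

```python
# Minimal sketch: capture layer-wise activations with forward hooks.
# Model name and module path are illustrative; adapt to the model you inspect.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; any causal LM works
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

activations = {}

def make_hook(layer_idx):
    def hook(module, inputs, output):
        # For GPT-2 blocks, output[0] is the hidden state: (batch, seq, d_model)
        hidden = output[0] if isinstance(output, tuple) else output
        activations[layer_idx] = hidden.detach()
    return hook

handles = [block.register_forward_hook(make_hook(i))
           for i, block in enumerate(model.transformer.h)]

with torch.no_grad():
    batch = tok("Interpretability starts with activations.", return_tensors="pt")
    model(**batch)

for h in handles:
    h.remove()

print({k: v.shape for k, v in activations.items()})
```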

Pioneers include teams led by Neel Nanda and companies like Anthropic. It is a niche, cutting-edge research topic that is well worth taking up.

Good starting paper on circuit tracing and attribution graphs - https://transformer-circuits.pub/2025/attribution-graphs/methods.html

How to do Interp?

Let’s start with this statement - “Superposition is a common phenomenon that leads to polysemantic neurons.”

Superposition → multiple (often unrelated) features are superposed onto a single neuron, since there are far more features than neurons available; polysemanticity → a single neuron captures many semantics of the dataset. This makes model editing harder. Conversely, a single feature can also be spread across a large set of neurons due to superposition. How do we break up this superposition to edit (not necessarily remove) targeted concepts/features? Sparse autoencoders!
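
A toy numerical sketch of the "more features than neurons" setup: 16 hypothetical features are written into 8 neurons via a random projection, so each neuron ends up carrying several features at once. All numbers here are arbitrary assumptions made for illustration.

```python
# Toy illustration of superposition: 16 sparse "features" squeezed into an
# 8-neuron layer via a random projection, so individual neurons respond to
# several unrelated features (polysemanticity).
import numpy as np

rng = np.random.default_rng(0)
n_features, n_neurons = 16, 8

# Each column is the direction a feature gets written into neuron space.
W = rng.normal(size=(n_neurons, n_features))
W /= np.linalg.norm(W, axis=0, keepdims=True)

# A sparse input: only features 3 and 11 are active.
f = np.zeros(n_features)
f[[3, 11]] = 1.0

neuron_acts = W @ f  # every neuron carries a mix of both active features

# Rough measure of entanglement: how many features load on each neuron.
features_per_neuron = (np.abs(W) > 0.2).sum(axis=1)
print("neuron activations:", np.round(neuron_acts, 2))
print("features influencing each neuron:", features_per_neuron)
```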

Sparse Autoencoders’ Role

They run in parallel to an LLM, say $\mathit{M}$, to help convert its complex internal connections into more understandable/interpretable terms. All of the information present in the polysemantic neurons when an input is passed through is now captured by monosemantic SAE feature neurons (this is achievable because the feature neurons lie in a higher-dimensional space and the encoding is learnt so that their activations are sparse). That is just the encoding; to ensure the interpretation is faithful, SAEs must also translate/decode the feature-neuron outputs back into the outputs of the neurons of $\mathit{M}$. Conventional literature applies the term “dictionary” only to this final translation step, but intuitively every pair of interconnected hidden layers (in a multilayer SAE) acts as a dictionary too. The whole process would be much simpler if ReLU and positional encodings did not exist; alas, both are too important in current models. Also, all current works simplify the SAE to have just one feature layer.
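
To make the encode/decode picture concrete, here is a minimal single-feature-layer SAE sketch in PyTorch. The ReLU encoder and linear decoder ("dictionary") follow the description above, while the class name and dimensions are placeholders of my own choosing.

```python
# A minimal single-feature-layer SAE: encode a layer's activations into a
# wider, sparse feature space, then decode back to the model's space.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_features: int):
        super().__init__()
        self.enc = nn.Linear(d_model, d_features)   # encoder weight matrix
        self.dec = nn.Linear(d_features, d_model)   # decoder ("dictionary")

    def forward(self, x):
        # ReLU keeps feature activations non-negative and mostly zero (sparse).
        feats = F.relu(self.enc(x))
        recon = self.dec(feats)
        return recon, feats
```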

As part of the interp process, we need one SAE for every layer of $\mathit{M}$; all current SAEs have just two layers (an encoder weight matrix and a decoder weight matrix, for encoding and decoding respectively). They map the output activations of a layer of $\mathit{M}$ to the feature activations of the SAE, say $\mathit{S}$, followed by a translation back from $\mathit{S}$ to reconstructed activations $\mathit{M'}$. They are trained so that $\mathit{M'}$ matches $\mathit{M}$. While SAEs have spurred interest in sparse coding, several of their key drawbacks have been addressed by improved techniques that use crosscoders/transcoders.
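
A rough sketch of that training objective, reusing the `SparseAutoencoder` class from the previous snippet: the reconstruction $\mathit{M'}$ is pushed toward $\mathit{M}$ with an MSE term, while an L1 penalty keeps the feature activations sparse. The dimensions, learning rate, and L1 coefficient are illustrative assumptions.

```python
# Sketch of the training objective: make the reconstruction M' match the
# original activations M, with an L1 penalty that keeps features sparse.
import torch
import torch.nn.functional as F

sae = SparseAutoencoder(d_model=768, d_features=8 * 768)
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)
l1_coeff = 1e-3

def sae_step(m_acts):
    # m_acts: (batch, d_model) activations captured from one layer of M
    recon, feats = sae(m_acts)
    loss = F.mse_loss(recon, m_acts) + l1_coeff * feats.abs().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```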

Manifold Learning

The intuition is that high-dimensional input representations reside on a characteristic low-dimensional manifold (which can be estimated using diffusion kernels, etc.). The low-dimensional features play an important role in alignment across modalities (in the latent space), which is key for any model to achieve good utility. Exploiting this manifold for various tasks is a promising research area. If we couple this concept with careful latent-space traversals, the outputs of any model can be made far more useful and well-aligned when dealing with long-context inputs. Manifold estimation can learn a good characteristic space for the intrinsics of the data, and latent-space traversals can enhance output quality.
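
For intuition, here is a bare-bones diffusion-map sketch that embeds high-dimensional latent vectors into a few low-dimensional diffusion coordinates. The kernel bandwidth, coordinate count, and random stand-in data are assumptions for the example, not a recipe from any particular paper.

```python
# Rough sketch of manifold estimation with a diffusion kernel: embed
# high-dimensional latent vectors into a few "diffusion coordinates".
import numpy as np

def diffusion_map(X, n_coords=2, eps=1.0):
    # Pairwise squared distances -> Gaussian affinity kernel.
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    K = np.exp(-sq / eps)
    # Row-normalize to get a Markov transition matrix over the data points.
    P = K / K.sum(axis=1, keepdims=True)
    # Top non-trivial eigenvectors of P give the low-dimensional coordinates
    # (the leading constant eigenvector is skipped).
    vals, vecs = np.linalg.eig(P)
    order = np.argsort(-vals.real)
    keep = order[1:n_coords + 1]
    return vecs[:, keep].real * vals[keep].real

# Example: 200 latent vectors of dimension 64 (random stand-in data).
X = np.random.default_rng(0).normal(size=(200, 64))
coords = diffusion_map(X)
print(coords.shape)  # (200, 2)
```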

Jailbreak Evaluation Metrics

While building solid foundation models with large datasets is essential, ensuring they are robust against adversarial and backdoor attacks is just as crucial. Validating models against such vulnerabilities is an area of great interest, and I see real potential to contribute here.

Neat example - https://transformer-circuits.pub/2025/attribution-graphs/biology.html#dives-jailbreak
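
As one very simple metric in this space, here is a sketch of computing an attack success rate (ASR) over a set of adversarial prompts using a keyword-based refusal check. The refusal markers and the `generate` callable are placeholder assumptions; real evaluations typically rely on an LLM judge or human review instead.

```python
# Bare-bones jailbreak metric: fraction of adversarial prompts for which the
# model does NOT refuse (higher ASR = weaker defenses).
REFUSAL_MARKERS = ("i can't", "i cannot", "i'm sorry", "as an ai")

def attack_success_rate(adversarial_prompts, generate):
    successes = 0
    for prompt in adversarial_prompts:
        response = generate(prompt).lower()
        refused = any(marker in response for marker in REFUSAL_MARKERS)
        successes += 0 if refused else 1
    return successes / max(len(adversarial_prompts), 1)
```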


Convergence with Unlearning

Unlearning is an emerging challenge that I consider both highly impactful and essential for the future of multimodal machine learning systems.

In the unlearning paradigm, it is well established that removing knowledge across modalities without much harm to utility is almost as hard as open-heart surgery. I feel that the learnings from interpretability (activations, manifolds, etc.) can be used to perform better knowledge removal while preserving model quality.
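
As a purely hypothetical sketch of that idea: identify SAE features that fire mostly on the forget data and zero them out before decoding, so the forget concept's contribution is suppressed while retain behaviour is left largely untouched. The `sae` object refers to the `SparseAutoencoder` sketch above, and the selection ratio is an arbitrary assumption.

```python
# Hypothetical interp-informed unlearning sketch: suppress SAE features that
# are far more active on the forget set than on the retain set.
import torch

@torch.no_grad()
def find_forget_features(sae, forget_acts, retain_acts, ratio=5.0):
    _, f_forget = sae(forget_acts)   # (N_forget, d_features)
    _, f_retain = sae(retain_acts)   # (N_retain, d_features)
    forget_mean = f_forget.mean(0)
    retain_mean = f_retain.mean(0) + 1e-6
    # Features that fire much more strongly on forget data than retain data.
    return torch.nonzero(forget_mean / retain_mean > ratio).squeeze(-1)

@torch.no_grad()
def ablate_features(sae, acts, feature_ids):
    _, feats = sae(acts)
    feats[:, feature_ids] = 0.0       # knock out forget-linked features
    return sae.dec(feats)             # decode back to the model's space
```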

If we can truly and completely work in the manifold spaces of the retain and forget data, then unlearning might become a touch easier than the current, admittedly expensive fine-tuning methodologies (or maybe we can strike a balance between the two). Based on an initial review, this thought experiment seems to be implemented in this NeurIPS paper.