(Unrelated to previous work at X & xAI; all references are public information I’m studying to enrich my, um, taste. Any feedback is appreciated. This is the n00b version; v2 is in the works after talking to a lot more people & refining my views!)
There’s a giggle in every conversation when someone brings up “taste”. Alas, the overloading of one atomic word is just another necessary evil of the English language. My own definition of taste has rapidly evolved from “good judgment of vibes” when I started on posttraining to “scalable subjective supervision”. In a world where people are the “limiting factor” in the loop (by raw intelligence, efficiency, decisiveness…), figuring out how to scale human taste proportionally with AI seems pretty important for keeping us in that loop, especially when it comes to reward signal selection, self-improving systems, and the tension between exploration and human preferences.
I think subjective capabilities require more than creating pseudo-verification to plug back into the RLVR approaches that worked for math/coding. They require deciding what/when/how to introduce human supervision, creative ways to generate and insert it, and learning good representations of human data to maximally utilize it.

the four horsemen of non-verifiability 😛 (source)
In college, I worked on continual interactive learning, where a small group of human experts supervises a large number of agents (robots), and since then I’ve been interested in the interface between self-improving AI and people. And people are non-verifiable, though “non-verifiable domains” is really shorthand: truly non-verifiable problems are rare. Most are just hard to verify because the proxies are subjective or noisy, and rewards may be spread across a bunch of rollouts, a myriad of things you have to define, or a long stretch of time. This makes unsupervised or self-improving post-training challenging, though there have been workarounds (WritingZero - self-verification, RARO - no verification).
For subjective domains, outcome supervision is challenging because success is ill-defined, and any attempt at estimating outcomes is necessarily high-variance, though you can reduce variance by weakly ensembling multiple different outcome estimates. I’m particularly curious about how distillation can increase learning efficiency via dense supervision. Since distillation is really a special case of SFT, it inherits the tradeoff in explicitness: with verifiable rewards or reward models, you pay the cost of learning a scoring function but credit assignment is clear; with distillation, supervision is implicit and relative, which makes credit assignment harder. Perhaps because the implicit approach aligns better with softer objectives, distillation is widely applied now: in simulation, with privileged information containing desirable behaviors, and as self-distillation (an earlier checkpoint as teacher can recover behaviors; the same model as teacher can enhance compression, stability, and smoothing of existing behaviors).
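As a concrete sketch of what “dense supervision” buys you (toy numpy, not any particular lab’s recipe): the reverse KL between student and teacher next-token distributions gives a per-token training signal, so credit lands on individual positions instead of a whole trajectory.

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def reverse_kl_per_token(student_logits, teacher_logits):
    """Dense distillation signal: KL(student || teacher) at every token
    position, so credit assignment is per-token rather than per-rollout.
    Shapes: (seq_len, vocab)."""
    p = softmax(student_logits)   # student distribution (on-policy)
    q = softmax(teacher_logits)   # teacher distribution
    return (p * (np.log(p + 1e-12) - np.log(q + 1e-12))).sum(axis=-1)

# toy example: 4 positions, vocab of 5
rng = np.random.default_rng(0)
s = rng.normal(size=(4, 5))
kl = reverse_kl_per_token(s, s)   # identical distributions
print(kl)                         # exactly zero at every position
```

Computed on student-sampled tokens this is the on-policy flavor; minimizing the forward direction KL(teacher‖student) instead recovers the SFT-like objective.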
The downside is that dense signals are more hackable. Process reward models are hard to evaluate and tend to reward spurious things that don’t contribute to final quality. Self-play agents can collude and collapse onto easy, unrealistic examples. And without proper outcome-based credit assignment, distillation may favor subsequences that look okay but stemmed from uncaught earlier failures.
Self-play can create an echo chamber of error reinforcement, which is especially bad when human judgment is an integral part of goodness. Take the contrived example of self-play for memes. Left to its own devices, the judge model would likely zero in over time on superficial patterns like absurdity and random word coincidences. Because these patterns are easy for the generator to reproduce and score reliably, the system biases toward producing nonsensical or noisy memes, and the judge that co-evolved with the generator keeps prioritizing those outputs. Eventually, the system converges to self-consistent memes that are highly misaligned with people’s humor.
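To make the collapse concrete, here’s a toy simulation (features, numbers, and update rules all invented): memes have a cheap “absurdity” feature and a rare “actual humor” feature, the judge drifts toward whatever the generator’s top-scored outputs look like, and the generator learns to crank the cheap feature.

```python
import numpy as np

rng = np.random.default_rng(0)

# Each meme = (absurdity, actual_humor). Absurdity is cheap for the
# generator to crank up; genuine humor is rare and capped.
def generate(bias, n=64):
    absurdity = rng.uniform(0, 1 + bias, n)                  # learned shortcut
    humor = rng.uniform(0, 0.5, n) * rng.binomial(1, 0.1, n)  # rare hits
    return np.stack([absurdity, humor], axis=1)

judge_w = np.array([0.5, 0.5])   # judge starts out balanced
gen_bias = 0.0

for step in range(50):
    memes = generate(gen_bias)
    scores = memes @ judge_w
    top = memes[np.argsort(scores)[-8:]]     # generator's "best" memes
    # co-evolution: the judge drifts toward whatever the top memes look
    # like, and the generator learns to produce more of the cheap feature
    judge_w += 0.1 * (top.mean(axis=0) - judge_w)
    judge_w /= judge_w.sum()
    gen_bias += 0.05 * judge_w[0]

print(judge_w)   # the judge's weight has drifted toward absurdity
```

In this toy, the judge’s weight ends up concentrated on the shallow feature: the echo chamber in miniature.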
One full-circle idea is to think back to interactive supervision. Analogous to robots asking for help, receiving it, and updating all agents’ parameters at once, one could periodically fine-tune on a held-out annotated dataset of what humans find funny. The agent can proactively initiate this when it encounters high uncertainty, which in turn may improve calibration. An ideal policy can still creatively come up with surprising, provocative memes while preserving human humor. This echoes the on-policy distillation blog’s example of alternating between forward SFT and reverse distillation to preserve earlier capabilities, and indeed a number of works have explored KL-regularized self-play for preserving alignment.
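One cheap proxy for “high uncertainty, go ask the human” is ensemble disagreement. A toy sketch (bootstrapped linear judges and a stand-in annotator, all hypothetical):

```python
import numpy as np

rng = np.random.default_rng(1)

# an ensemble of bootstrapped linear judges scores items; the agent
# defers to a human only when the judges disagree
def fit_judge(X, y):
    idx = rng.integers(0, len(X), len(X))   # bootstrap resample
    Xb, yb = X[idx], y[idx]
    # ridge regression in closed form
    return np.linalg.solve(Xb.T @ Xb + 0.1 * np.eye(X.shape[1]), Xb.T @ yb)

def human_label(x):
    # stand-in for a real annotator: humans weight the "deep" feature
    return x @ np.array([0.2, 0.8])

X = rng.uniform(0, 1, (20, 2))              # small seed dataset
y = np.array([human_label(x) for x in X])
queries = 0

for x in rng.uniform(0, 2, (200, 2)):       # shifted distribution
    preds = np.array([x @ fit_judge(X, y) for _ in range(5)])
    if preds.std() > 0.05:                  # high epistemic uncertainty
        X = np.vstack([X, x])               # ask the human, then refit
        y = np.append(y, human_label(x))
        queries += 1

print(f"asked the human on {queries} of 200 items")
```

The agent only pays for labels where its judges disagree, which is the calibration-driven deferral I have in mind.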
On-policy distillation is like implicitly fitting a value function over trajectories induced by the teacher. Though trajectory-level scoring is still the norm, when would we actually want a VF or critic? When we need to judge the merit of prefixes and leverage an estimate of future quality. This somehow evokes emotional reading abilities, in the sense that I am constantly changing my actions based on how I think you’d receive them, and there are old social-robot papers on this. VFs might also be useful for interactive supervision in calibrating epistemic uncertainty, helping to decide when to defer to humans or collect more data.
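A minimal picture of a prefix critic, with a made-up quality function: estimate a prefix’s value as the average quality of Monte Carlo continuations, so an early misstep shows up in the estimate long before the draft is finished.

```python
import numpy as np

rng = np.random.default_rng(2)

# toy draft process: a draft is 10 steps in [0, 1]; final quality is the
# mean step quality, zeroed out if any single step falls below 0.2
def rollout(prefix, horizon=10):
    steps = list(prefix) + list(rng.uniform(0, 1, horizon - len(prefix)))
    return np.mean(steps) if min(steps) >= 0.2 else 0.0

def prefix_value(prefix, n_rollouts=200):
    """Monte Carlo value of a prefix: expected final quality over sampled
    continuations, i.e. a poor man's critic over partial drafts."""
    return float(np.mean([rollout(prefix) for _ in range(n_rollouts)]))

good_prefix = [0.9, 0.8, 0.7]
bad_prefix = [0.9, 0.1, 0.7]   # one early misstep dooms every rollout
print(prefix_value(good_prefix), prefix_value(bad_prefix))
```

A low prefix value is exactly the kind of signal you could use to defer to a human mid-trajectory instead of after the fact.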
Similar to world models in robotics, I think a deeper understanding of the society/population AIs are conversing with enables more efficient utilization of human data. For personalization, I’ve heard that industry still relies on prompting/RAG, but the subfield is shifting in a way similar to the shift from explicit CoT reasoning to latent approaches like CODI. Here, instead of compressing thinking processes, you compress user preferences/memory and come up with efficient ways to access that memory, like semantic IDs in recommender systems. I’ve actually worked on recsys and am excited to see how this unlocks so-called “EQ” behaviors like mirroring and scales up to social networks. Recsys people have a powerful transformer-based arsenal for turning dense user signals, including ones in natural language, into actionable predictions. Can we “reverse distill” this stuff into LLM personalization?
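A sketch of the semantic-ID idea, using a toy residual quantizer in place of the learned quantizers real recommenders use: a dense user-preference embedding becomes a short tuple of discrete tokens that a model could treat like any other vocabulary.

```python
import numpy as np

rng = np.random.default_rng(3)

def build_codebooks(embs, levels=2, k=8):
    """Fit one tiny k-means codebook per level on successive residuals."""
    books, resid = [], embs.copy()
    for _ in range(levels):
        centers = resid[rng.choice(len(resid), k, replace=False)]
        for _ in range(10):   # a few Lloyd iterations
            d = ((resid[:, None, :] - centers[None]) ** 2).sum(-1)
            assign = d.argmin(1)
            for j in range(k):
                if (assign == j).any():
                    centers[j] = resid[assign == j].mean(0)
        books.append(centers)
        resid = resid - centers[assign]   # quantize, keep the residual
    return books

def semantic_id(emb, books):
    """Compress one embedding into a tuple of discrete code indices."""
    code, resid = [], emb.copy()
    for centers in books:
        j = int(((centers - resid) ** 2).sum(-1).argmin())
        code.append(j)
        resid = resid - centers[j]
    return tuple(code)

users = rng.normal(size=(100, 16))        # fake user embeddings
books = build_codebooks(users)
print(semantic_id(users[0], books))       # a short discrete ID
```

The point is the interface: preferences stop being an opaque float vector and become a handful of reusable, shareable tokens.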
Another potential learning from recsys is the sweet spot between “the average user” and “this particular user”. Surely there are generalizable things you can apply to similar or in-network people? Search-based reasoning for math feels somewhat transferable to social graphs, but with a less clear causal relationship. Another parallel from math: anticipating queries could likely extend to user understanding with shared social context, which in turn suggests composing weights to share that context. Online human feedback calls for further decisions in quality filtering, representation, and matching/adding it to memory.
The choice of whether to explicitly learn representations of human behavior says a lot about our assumptions. Compared to joint/latent generative modeling like matrix factorization, generative recommenders and LLMs don’t do this and treat user actions as observed, which seems good enough pragmatically. Even though my intuition says we cannot achieve true understanding without some degree of explicit modeling (does a historian treat history like a college freshman pattern-matching on their history breadth exam?), it is hard. One challenge, for instance, is that people can express the same idea in many different ways, and naively fitting them is a recipe for collapse. In that robotics work, we explored implicit modeling of diverse human experts via EBMs, but that doesn’t scale to LLM traffic. (Although “same idea in many different ways” evokes graphs of thought. Can we do something like graphs of intent? I don’t really know inverse modeling yet, but it seems intuitive that LLMs can use the same reasoning tools to deduce how people think and act.)
Maybe functional user understanding sits somewhere between joint modeling and direct action prediction, just like how modern reasoning sits between scratchpad thoughts and correct answers. Empathy isn’t a full simulation of the other’s experience, but a deliberate and thoughtful attempt at aligning with that experience. Things that may explain real variation in behavior, like intent, seem like good candidates for latent variable sampling. Intent is like the currency between interactive intelligences; The 48 Laws of Power devotes many chapters to hiding one’s intent to be socially successful (playing with fire: better intent intelligence might mean better deception). I think this inductive bias mostly benefits settings where user signals are more principled. Recommenders need it less, because most signals are extremely noisy and you’ll tie yourself into knots trying to overthink what they mean; conversations, by contrast, are by definition designed to share information, so intent prediction may enhance chatbots’ ability to express internal states, kind of like increasing the likelihood of helpful rationales. But unless there’s ground-truth intent, we have to pull from the non-verifiable toolbox again for soft labels like future-feedback distillation. (I’m visualizing a synergetic training course for corgi racing, where latent variables introduce progressive hurdles and teacher forcing provides shortcuts.)
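The latent-intent inductive bias, in toy form (the intent labels, context features, and style distributions are all invented): marginalize the response distribution over a discrete intent inferred from context.

```python
import numpy as np

# toy mixture: p(action | context) marginalizes over a discrete latent intent
INTENTS = ["vent", "ask_advice", "share_news"]

def p_intent(context):
    # stand-in posterior over intents given conversation context,
    # keyed off crude surface cues
    scores = np.array([context.count("ugh"), context.count("?"),
                       context.count("!")], dtype=float) + 0.1
    return scores / scores.sum()

def p_action_given_intent(intent):
    # each intent prefers a different response style
    styles = {"vent":       [0.7, 0.2, 0.1],   # [empathize, advise, react]
              "ask_advice": [0.1, 0.8, 0.1],
              "share_news": [0.2, 0.1, 0.7]}
    return np.array(styles[intent])

def p_action(context):
    """p(action | context) = sum_z p(z | context) * p(action | z, context)"""
    pz = p_intent(context)
    return sum(w * p_action_given_intent(z) for w, z in zip(pz, INTENTS))

probs = p_action("ugh, my day was awful")
print(probs)   # mass shifts toward empathizing
```

Sampling z instead of marginalizing it gives you the “latent variable sampling” version, where the same context can yield differently motivated responses.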
To write about next time: what is the effect of having many of these corgi policies updated together and supervising each other?