(Sharing my own study-notes-like curiosity, unrelated to previous work at X & xAI. All references are public information. n00b, so all feedback/responses welcome!)
There's a giggle in every conversation when someone brings up "taste". Alas, the overloading of one atomic word is just another necessary evil of the English language. My own definition of taste has rapidly evolved over the past few months from "good judgment on vibes" to "scalable subjective supervision". The following brain dump is focused on RL because although many capabilities can be seeded during SFT if the dataset already contains the behaviors you want, that's seldom the case for subjectivity (demonstrations are hard to collect, and we want to do better than imitation) or multi-turn interaction (which requires shaping rewards over trajectories and on-policy updates so the model learns from its own mistakes). Therefore, scaling subjectivity is an integral part of scaling RL, especially when it comes to reward signal quality, long-horizon continual learning, and the tension between exploration and human preferences.
I think subjective capabilities require more than creating good verifiers and plugging them back into the RLVR approaches that worked for math/coding. They require deciding what/when/how to introduce human supervision, creative ways to generate and insert it, and learning good representations of human data to use it maximally.

the four horsemen of non-verifiability (source)
In college, I worked on continual interactive learning, where a small group of human experts supervises a large number of agents (robots), and since then I've been interested in the interface between self-improving AI and people. And people are non-verifiable, though "non-verifiable domains" serves as shorthand because truly non-verifiable problems are rare. Most are just hard to verify: the proxies are subjective or noisy, and rewards may be spread across a bunch of rollouts, myriad things you have to define, or a ton of time. This makes unsupervised or self-improving post-training challenging, though there have been workarounds (WritingZero with self-verification, RARO with no verification).
For subjective domains, creating outcome supervision is challenging because success is ill-defined, and any attempt at estimating outcomes is necessarily high-variance, though you can reduce variance by supervising multiple different outcomes, creating an ensemble reward. I'm particularly curious about the potential of distillation and process supervision for increasing learning efficiency via dense supervision. Since distillation is really a special case of SFT, it inherits the tradeoff in explicitness: with verifiable rewards or reward models, you pay the cost of learning a scoring function, but credit assignment is clear; with on-policy distillation, supervision is implicit and relative, which makes credit assignment harder. Distillation seems widely applied now, including in simulation, in privileged information containing desirable behaviors, and in self-distillation (using an earlier checkpoint as teacher can recover behaviors; using the same model can enhance compression, stability, and smoothing of existing behaviors).
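As a toy illustration of the ensemble-reward idea above, here is a minimal sketch of averaging several noisy outcome proxies into one lower-variance reward. The judge heuristics are made up for illustration; real judges would be reward models or LLM graders:

```python
import statistics

def ensemble_reward(response, judges):
    """Score one response with several judges and combine.

    Averaging independent, noisy subjective scores reduces the variance
    of the reward estimate; the spread is a cheap disagreement signal.
    """
    scores = [judge(response) for judge in judges]
    return statistics.mean(scores), statistics.pstdev(scores)

# toy judges standing in for separate outcome proxies (hypothetical)
judges = [
    lambda r: float(len(r) > 10),    # "substantive enough"
    lambda r: float("?" not in r),   # "no hedging"
    lambda r: float(r == r.strip()), # "clean formatting"
]

mean, spread = ensemble_reward("a short witty reply", judges)
```

The spread is returned alongside the mean because disagreement among judges is itself useful later (e.g. as an uncertainty trigger).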
The downside is that dense signals are more hackable. If you use process reward models, they're hard to evaluate and may reward spurious things that don't contribute to final quality. If you do self-play, how do you prevent agents from colluding and collapsing to easy or unrealistic examples? And without proper outcome-based credit assignment, distillation may favor subsequences that look fine on their own but stemmed from uncaught earlier failures. I'm just beginning to think about this and have a lot to catch up on.
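For the distillation side of this tradeoff, the implicit per-token supervision can be sketched as a reverse KL between student and teacher token distributions. This is just an illustrative toy over a two-token vocabulary, not any particular system's implementation:

```python
import math

def reverse_kl(student, teacher):
    """D_KL(student || teacher) over one token position's distribution.

    On-policy distillation scores every sampled token this way, giving a
    dense signal without learning an explicit scoring function, at the
    cost of supervision being relative to the teacher rather than to an
    outcome.
    """
    return sum(s * math.log(s / t)
               for s, t in zip(student, teacher) if s > 0)

# matching the teacher incurs zero loss...
same = reverse_kl([0.5, 0.5], [0.5, 0.5])
# ...while drifting from it is penalized at every token
drift = reverse_kl([0.7, 0.3], [0.5, 0.5])
```

Note the hackability concern from above shows up here too: the loss is zero whenever the student matches the teacher, regardless of whether the teacher's trajectory was itself downstream of an uncaught failure.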
todo: to verify or not to verify: vs demonstrations
todo: value functions
I know that self-play can create an echo chamber of error reinforcement, which is especially bad when human judgment is an integral part of "goodness". Take the contrived example of self-play for memes. Left to its own devices, the judge model would likely zero in over time on superficial patterns like absurdity and random word coincidences. Because these patterns are easy for the generator to reproduce and score reliably, the system biases toward producing nonsensical or noisy memes. The judge, having co-evolved with the generator, continues to prioritize these outputs. Eventually, the system converges to self-consistent memes that are highly misaligned with people's humor.
One full-circle idea is to think back to interactive supervision. Analogous to robots asking for help, receiving it, and updating all agents' parameters at once, one could periodically fine-tune on a held-out annotated dataset of what humans find funny. The agent can proactively initiate this when it encounters high uncertainty, which in turn may improve calibration and combat overconfidence. An ideal policy can still creatively come up with surprising, provocative memes while preserving human taste. This echoes the on-policy distillation blog's example of alternating between forward SFT and reverse distillation to preserve earlier capabilities, and indeed a number of works have explored KL-regularized self-play for preserving alignment.
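A minimal sketch of the uncertainty-gated version of that loop, using judge-ensemble disagreement as a stand-in for the policy's uncertainty. The threshold, identifiers, and scores are all made up:

```python
import statistics

def should_ask_human(judge_scores, threshold=0.3):
    """Trigger human annotation when the judge ensemble disagrees.

    High spread across judges is a cheap proxy for the generator being
    in territory where the co-evolved judge can't be trusted; the
    threshold here is an illustrative choice, not a recommendation.
    """
    return statistics.pstdev(judge_scores) > threshold

# toy batch: (meme id, scores from three judges)
batch = [("m1", [0.90, 0.85, 0.88]),  # judges agree -> trust self-play
         ("m2", [0.10, 0.90, 0.50])]  # judges disagree -> ask humans

annotation_queue = [meme_id for meme_id, scores in batch
                    if should_ask_human(scores)]
```

The queued items would then feed the periodic fine-tune on human-labeled humor, closing the interactive-supervision loop.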
Similar to world models in robotics (perhaps an abuse of terminology?), I think a deeper understanding of the population AIs are conversing with enables more efficient use of human data. I'm in the very early stages of learning about user understanding and simulation, and have heard that industry still leans heavily on in-context learning, but the subfield is shifting in a way similar to the shift from explicit CoT reasoning to latent approaches like CODI. Here, instead of compressing thinking processes, you compress user preferences/memory and come up with efficient ways to access that memory. I'm excited to see how this unlocks so-called "EQ" behaviors like mirroring, and how it scales up to social networks, given my past work on ML for social media. Generative recommenders are all the rage now. Recsys people have a powerful transformer-based arsenal for turning dense user signals (even ones in natural language) into actionable predictions. Can we "reverse distill" this into LLM personalization?
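To make the compression idea concrete, a deliberately crude sketch: mean-pool a user's liked-item embeddings into one preference vector and score candidates by dot product, in the spirit of a two-tower recommender. All vectors are toy data, and real systems would learn this compression rather than average:

```python
def compress_preferences(liked_item_vecs):
    """Mean-pool liked-item embeddings into a single user vector,
    a crude stand-in for learned preference/memory compression."""
    dim = len(liked_item_vecs[0])
    n = len(liked_item_vecs)
    return [sum(v[i] for v in liked_item_vecs) / n for i in range(dim)]

def affinity(user_vec, item_vec):
    """Dot-product scoring against the compressed user state."""
    return sum(u * i for u, i in zip(user_vec, item_vec))

# toy 2-d history: user mostly likes items along the first axis
user = compress_preferences([[1.0, 0.0], [0.8, 0.2]])
aligned = affinity(user, [1.0, 0.0])     # matches the history
orthogonal = affinity(user, [0.0, 1.0])  # doesn't
```

The "efficient access" question above is then about what replaces the mean-pool: how to write new signals into this state and read it back cheaply at inference time.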
Another potential learning from recsys is the sweet spot between "the average user" and "this particular user". Surely there are generalizable things you can apply to similar or in-network people? Search-based reasoning for math feels somewhat transferable to social graphs, though with a less clear causal relationship. Another parallel from math: anticipating queries could likely extend to user understanding with shared social context, which in turn suggests composing weights to share that context. Online human feedback calls for further decisions in quality filtering, representation, and matching/adding it to memory.
The choice of whether to explicitly learn representations of human behavior says a lot about our assumptions. Compared to joint/latent generative modeling like matrix factorization, generative recommenders (and of course LLMs) skip this and treat user actions as observed, which seems good enough pragmatically. Even though my intuition says we cannot achieve true understanding without some degree of explicit modeling (does a historian treat history like a college freshman pattern-matching on their history breadth exam?), it is hard. One challenge, for instance, is that people can express the same idea in many different ways, and naively fitting them may cause collapse. In that robotics work, we explored implicit modeling of diverse human experts via EBMs, but that doesn't scale to LLM traffic.
Maybe functional user understanding sits somewhere between joint modeling and direct action prediction. Empathy isn't a full simulation of the other's experience, but a deliberate and thoughtful attempt at aligning with that experience. We can introduce intermediate, latent structures that we think explain real variation in behavior, like intent. Intent is like the currency between interactive intelligences; The 48 Laws of Power devotes many chapters to hiding one's intent to be socially successful (playing with fire: better intent intelligence might mean better deception, so we need to be careful). I think this kind of inductive bias mostly benefits settings where user signals are more principled: recommenders need it less because most signals are extremely noisy and you'll tie yourself into knots trying to overthink what they mean, but conversations are by definition designed to share information, so intent prediction may enhance chatbots' ability to express internal states, kind of like rationalization (if there are ground-truth labels, reward matches; if not, it still encourages better representations).
Next question: what is the effect of having many of these policies updated together and supervising each other?