<aside> ⚠️ You’re reading our results report. The report will be officially published by OpenAI November 20th. Please only share this cautiously before the public launch.

</aside>

Executive Summary

We report on the first run of “Democratic Fine-Tuning” (DFT), funded by OpenAI. DFT is a democratic process that surfaces the “wisest” moral intuitions of a large population, compiled into a structure we call the “moral graph”, which can be used for LLM alignment.

In addition to this report, we're releasing a visual explorer for the moral graph, and open data about our participants, their experience, and their contributions.

Intro

We received an OpenAI grant to build a democratic process called Democratic Fine-Tuning (DFT) and create the first moral graph. Here, we present our early results.

Our goal with DFT is to make one fine-tuned model that works for Republicans, for Democrats, and in general across ideological groups and cultures; one model that people around the world can consider “wise”, because it's tuned by values we have broad consensus on. We hope this can help avoid a proliferation of differently-tuned, amoral models fighting a race to the bottom in marketing, politics, etc. For more on these motivations, read our introduction post.

To achieve this goal, we use two novel techniques. First, we align towards values rather than preferences: a chatbot elicits which values the model should use when it responds, gathered from a large, diverse population. Second, we combine these values into a “moral graph” to find which values are most broadly considered wise.
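To make the second technique concrete, here is a minimal sketch of how a moral graph could be represented, assuming values are nodes and a directed edge records a judgment that one value is wiser than another in some context. The names below (`ValuesCard`, `MoralGraph`, `broad_endorsement`) and the simple edge-counting score are illustrative assumptions, not our actual data model or aggregation method; the process and results described later show how the real graph is built.

```python
from dataclasses import dataclass, field

@dataclass
class ValuesCard:
    """One articulated value: what the model should pay attention to when
    responding. (Illustrative fields, not our actual schema.)"""
    card_id: str
    title: str
    attention_policies: list[str]

@dataclass
class MoralGraph:
    """Values as nodes; a directed edge (a, b) records a judgment that
    value b is wiser than value a for some context."""
    cards: dict[str, ValuesCard] = field(default_factory=dict)
    wiser_than: set[tuple[str, str]] = field(default_factory=set)

    def add_card(self, card: ValuesCard) -> None:
        self.cards[card.card_id] = card

    def add_wiser_than(self, less_wise_id: str, wiser_id: str) -> None:
        self.wiser_than.add((less_wise_id, wiser_id))

    def broad_endorsement(self) -> dict[str, int]:
        """Toy score: count incoming "wiser than" edges per value.
        A stand-in for whatever aggregation is used downstream."""
        scores = {card_id: 0 for card_id in self.cards}
        for _, wiser_id in self.wiser_than:
            scores[wiser_id] = scores.get(wiser_id, 0) + 1
        return scores
```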

Here, we will present the first moral graph, based on convergent values identified from a representative sample of US citizens. Later work will explore gathering values globally, and fine-tuning an LLM based on these values.

We’ll start with our two novel techniques, contextualize them with a tour of the process, then share the results and what they mean for AI alignment.

Values, not Preferences

We align the model with values$^1$ from a diverse population. Note that this is not the same as aligning with preferences. People will always disagree about how exactly models should respond. But, as we’ll show below, people can agree much more if we ask a slightly more abstract question: “what should the model take into account when responding to this?”

In other words: What values should the model operate by, when responding to a particular question, or in a particular dialogue?
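As a rough illustration of the distinction (not our actual data model), a preference picks between concrete responses, while a value answers the more abstract question above: what the model should take into account. The field names and the example prompt below are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Preference:
    """A preference: which of two candidate responses a person likes more.
    People tend to disagree at this level."""
    prompt: str
    response_a: str
    response_b: str
    chosen: str  # "a" or "b"

@dataclass
class Value:
    """A value: what the model should take into account when responding
    to a prompt like this. (Hypothetical structure for illustration.)"""
    prompt: str
    title: str
    considerations: list[str]  # things to attend to, not a specific answer

# Hypothetical example: the same prompt expressed as a value rather than
# a choice between two specific responses.
example = Value(
    prompt="Should I tell my parents I've stopped going to church?",
    title="Space for honest reflection",
    considerations=[
        "What the person themselves feels is right, apart from outside pressure",
        "The relationships at stake and how to care for them",
        "Information or perspectives that could help them decide",
    ],
)
```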

Participants in our process could choose to weigh in on one of these three questions.