<aside> ⚠️ You’re reading our results report. The report will be officially published by OpenAI November 20th. Please only share this cautiously before the public launch.

</aside>

Executive Summary

We report on the first run of “Democratic Fine-Tuning” (DFT), funded by OpenAI. DFT is a democratic process that surfaces the “wisest” moral intuitions of a large population, compiled into a structure we call the “moral graph”, which can be used for LLM alignment.

In addition to this report, we're releasing a visual explorer for the moral graph, and open data about our participants, their experience, and their contributions.

Intro

We received an OpenAI grant to build a democratic process called Democratic Fine-Tuning (DFT) and create the first moral graph. Here, we present our early results.

Our goal with DFT is to make one fine-tuned model that works for Republicans, for Democrats, and in general across ideological groups and cultures; one model that people around the world can consider “wise”, because it's tuned by values we have broad consensus on. We hope this can help avoid a proliferation of differently-tuned, amoral models fighting a race to the bottom in marketing, politics, etc. For more on these motivations, read our introduction post.

To achieve this goal, we use two novel techniques. First, we align towards values rather than preferences: a chatbot elicits which values the model should use when it responds, gathered from a large, diverse population. Second, we combine these values into a “moral graph” to find which values are most broadly considered wise.
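To make the second technique concrete, here is a minimal sketch of how a moral graph could be represented, assuming values are nodes and a directed edge records a judgment that one value is wiser than another in some context. The names below (`ValuesCard`, `MoralGraph`, `broad_endorsement`) and the simple edge-counting score are illustrative assumptions, not our actual data model or aggregation method; the process and results described later show how the real graph is built.

```python
from dataclasses import dataclass, field

@dataclass
class ValuesCard:
    """One articulated value: what the model should pay attention to when
    responding. (Illustrative fields, not our actual schema.)"""
    card_id: str
    title: str
    attention_policies: list[str]

@dataclass
class MoralGraph:
    """Values as nodes; a directed edge (a, b) records a judgment that
    value b is wiser than value a for some context."""
    cards: dict[str, ValuesCard] = field(default_factory=dict)
    wiser_than: set[tuple[str, str]] = field(default_factory=set)

    def add_card(self, card: ValuesCard) -> None:
        self.cards[card.card_id] = card

    def add_wiser_than(self, less_wise_id: str, wiser_id: str) -> None:
        self.wiser_than.add((less_wise_id, wiser_id))

    def broad_endorsement(self) -> dict[str, int]:
        """Toy score: count incoming "wiser than" edges per value.
        A stand-in for whatever aggregation is used downstream."""
        scores = {card_id: 0 for card_id in self.cards}
        for _, wiser_id in self.wiser_than:
            scores[wiser_id] = scores.get(wiser_id, 0) + 1
        return scores
```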

Here, we will present the first moral graph, based on convergent values identified from a representative sample of US citizens. Later work will explore gathering values globally, and fine-tuning an LLM based on these values.

We’ll start with our two novel techniques, contextualize them with a tour of the process, then share the results and what they mean for AI alignment.

Values, not Preferences

We align the model with values$^1$ from a diverse population. Note that this is not the same as aligning with preferences. People will always disagree about how exactly models should respond. But, as we’ll show below, people can agree much more if we ask a slightly more abstract question: “what should the model take into account when responding to this?”

In other words: What values should the model operate by, when responding to a particular question, or in a particular dialogue?
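As a rough illustration of the distinction (not our actual data model), a preference picks between concrete responses, while a value answers the more abstract question above: what the model should take into account. The field names and the example prompt below are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Preference:
    """A preference: which of two candidate responses a person likes more.
    People tend to disagree at this level."""
    prompt: str
    response_a: str
    response_b: str
    chosen: str  # "a" or "b"

@dataclass
class Value:
    """A value: what the model should take into account when responding
    to a prompt like this. (Hypothetical structure for illustration.)"""
    prompt: str
    title: str
    considerations: list[str]  # things to attend to, not a specific answer

# Hypothetical example: the same prompt expressed as a value rather than
# a choice between two specific responses.
example = Value(
    prompt="Should I tell my parents I've stopped going to church?",
    title="Space for honest reflection",
    considerations=[
        "What the person themselves feels is right, apart from outside pressure",
        "The relationships at stake and how to care for them",
        "Information or perspectives that could help them decide",
    ],
)
```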

Participants in our process could choose to weigh in on one of these three questions.