Abstract
- Starting with a set of labeler-written prompts and prompts submitted through the OpenAI API, we collect a dataset of labeler demonstrations of the desired model behavior, which we use to fine-tune GPT-3 using supervised learning.
- We then collect a dataset of rankings of model outputs, which we use to further fine-tune this supervised model using reinforcement learning from human feedback.
- In human evaluations on our prompt distribution, outputs from the 1.3B parameter InstructGPT model are preferred to outputs from the 175B parameter GPT-3, despite InstructGPT having 100x fewer parameters.
- InstructGPT models show improvements in truthfulness and reductions in toxic output generation while having minimal performance regressions on public NLP datasets.

1 Introduction
- Large language models (LMs) can be “prompted” to perform a range of natural language processing (NLP) tasks, given some examples of the task as input.
- However, these models often express unintended behaviors such as making up facts, generating biased or toxic text, or simply not following user instructions.
- This is because the language modeling objective used for many recent large LMs—predicting the next token on a webpage from the internet—is different from the objective “follow the user’s instructions helpfully and safely” (Radford et al., 2019; Brown et al., 2020; Fedus et al., 2021; Rae et al., 2021; Thoppilan et al., 2022).
- Thus, we say that the language modeling objective is misaligned.
- Using the language of Askell et al. (2021), we want language models to be helpful (they should help the user solve their task), honest (they shouldn’t fabricate information or mislead the user), and harmless (they should not cause physical, psychological, or social harm to people or the environment).
Overview of the procedure
- We focus on fine-tuning approaches to aligning language models. Specifically, we use reinforcement learning from human feedback (RLHF; Christiano et al., 2017; Stiennon et al., 2020) to fine-tune GPT-3 to follow a broad class of written instructions (see Figure 2). This technique uses human preferences as a reward signal to fine-tune our models.
- We first hire a team of 40 contractors to label our data, based on their performance on a screening test (see Section 3.4 and Appendix B.1 for more details).
- We then collect a dataset of human-written demonstrations of the desired output behavior on (mostly English) prompts submitted to the OpenAI API and some labeler-written prompts, and use this to train our supervised learning baselines (a minimal code sketch of this SFT step appears after this list).
- Next, we collect a dataset of human-labeled comparisons between outputs from our models on a larger set of API prompts.
- We then train a reward model (RM) on this dataset to predict which model output our labelers would prefer (see the reward-model sketch after this list).
- Finally, we use this RM as a reward function and fine-tune our supervised learning baseline to maximize this reward using the PPO algorithm (Schulman et al., 2017); a simplified sketch of this RL step also follows the list. We illustrate this process in Figure 2.
- This procedure aligns the behavior of GPT-3 to the stated preferences of a specific group of people (mostly our labelers and researchers), rather than any broader notion of “human values”; we discuss this further in Section 5.2. We call the resulting models InstructGPT.
- We mainly evaluate our models by having our labelers rate the quality of model outputs on our test set, consisting of prompts from held-out customers (who are not represented in the training data). We also conduct automatic evaluations on a range of public NLP datasets.
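As a rough illustration of the first stage, here is a minimal sketch of supervised fine-tuning (SFT): a causal language model trained with next-token cross-entropy on (prompt, demonstration) sequences. The tiny GRU "language model", the vocabulary size, and the random token batches are placeholders standing in for GPT-3 and the labeler demonstrations; only the training objective is the point.

```python
# Minimal SFT sketch: next-token cross-entropy on demonstration sequences.
import torch
import torch.nn as nn

VOCAB, DIM, SEQ_LEN, BATCH = 1000, 64, 32, 8

class TinyCausalLM(nn.Module):
    """Toy stand-in for a large causal LM such as GPT-3."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, DIM)
        self.rnn = nn.GRU(DIM, DIM, batch_first=True)
        self.head = nn.Linear(DIM, VOCAB)

    def forward(self, tokens):                 # (B, T) -> (B, T, VOCAB)
        h, _ = self.rnn(self.embed(tokens))
        return self.head(h)

model = TinyCausalLM()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()

for step in range(100):
    # Placeholder batch: prompt + human-written demonstration as one sequence.
    # (In a real setup the loss is often masked to the demonstration tokens only.)
    tokens = torch.randint(0, VOCAB, (BATCH, SEQ_LEN))
    logits = model(tokens[:, :-1])             # predict each next token
    loss = loss_fn(logits.reshape(-1, VOCAB), tokens[:, 1:].reshape(-1))
    opt.zero_grad()
    loss.backward()
    opt.step()
```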
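Next, a sketch of the reward-model stage, reusing the toy dimensions above. The essential idea is a pairwise ranking loss, -log σ(r(x, y_w) - r(x, y_l)), that pushes the score of the labeler-preferred completion above the rejected one. The TinyRewardModel class and the random comparison batches are placeholders (in the paper the RM is initialized from the SFT model), so this is a sketch of the objective, not the paper's implementation.

```python
# Reward-model sketch: pairwise ranking loss over labeler comparisons.
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB, DIM, SEQ_LEN, BATCH = 1000, 64, 32, 8

class TinyRewardModel(nn.Module):
    """Toy encoder that maps a (prompt + completion) sequence to a scalar reward."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, DIM)
        self.rnn = nn.GRU(DIM, DIM, batch_first=True)
        self.scalar_head = nn.Linear(DIM, 1)

    def forward(self, tokens):                 # (B, T) -> (B,)
        h, _ = self.rnn(self.embed(tokens))
        return self.scalar_head(h[:, -1]).squeeze(-1)

rm = TinyRewardModel()
opt = torch.optim.Adam(rm.parameters(), lr=1e-4)

for step in range(100):
    # Placeholder comparison pair: labelers preferred `chosen` over `rejected`.
    chosen = torch.randint(0, VOCAB, (BATCH, SEQ_LEN))
    rejected = torch.randint(0, VOCAB, (BATCH, SEQ_LEN))
    # -log sigmoid(r_chosen - r_rejected): raise the preferred output's score.
    loss = -F.logsigmoid(rm(chosen) - rm(rejected)).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```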
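Finally, a heavily simplified sketch of the RL stage. It reuses TinyCausalLM, TinyRewardModel, and the toy constants from the sketches above, scores whole sequences rather than individual tokens, and performs a single clipped-surrogate update per batch, so it is a toy stand-in for the full per-token PPO setup in the paper. The part it is meant to illustrate is the KL-shaped reward: the RM score minus a penalty for drifting away from the SFT policy. Names such as seq_logprob, BETA, and CLIP are illustrative, not from the paper.

```python
# Simplified RL sketch: RM reward with a KL penalty toward the SFT policy,
# optimized with a PPO-style clipped surrogate objective.
import torch
import torch.nn.functional as F

policy = TinyCausalLM()                        # initialized from the SFT model in the paper
sft_ref = TinyCausalLM()                       # frozen reference (SFT) policy
sft_ref.load_state_dict(policy.state_dict())
for p in sft_ref.parameters():
    p.requires_grad_(False)
rm = TinyRewardModel()
opt = torch.optim.Adam(policy.parameters(), lr=1e-5)
BETA, CLIP = 0.02, 0.2                         # KL coefficient, PPO clip range

def seq_logprob(model, tokens):
    """Sum of per-token log-probabilities of `tokens` under `model`."""
    logp = F.log_softmax(model(tokens[:, :-1]), dim=-1)
    return logp.gather(-1, tokens[:, 1:, None]).squeeze(-1).sum(dim=1)

for step in range(100):
    # Placeholder rollout: in practice, completions are sampled from the policy
    # for API prompts; real PPO also runs several optimization epochs per batch.
    tokens = torch.randint(0, VOCAB, (BATCH, SEQ_LEN))
    with torch.no_grad():
        old_logp = seq_logprob(policy, tokens)
        ref_logp = seq_logprob(sft_ref, tokens)
        # Reward = RM score minus a KL-style penalty toward the SFT policy.
        reward = rm(tokens) - BETA * (old_logp - ref_logp)
        adv = reward - reward.mean()           # crude baseline
    new_logp = seq_logprob(policy, tokens)
    ratio = torch.exp(new_logp - old_logp)
    # Clipped surrogate objective (maximize, hence the negative sign).
    loss = -torch.min(ratio * adv, torch.clamp(ratio, 1 - CLIP, 1 + CLIP) * adv).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```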

Findings