What is RL?

RL is easiest to grasp in a physical setting. Imagine the agent is a dog navigating an environment, say a garden, with the goal of picking up a ball. The agent works in episodes: picking up the ball once is one episode. Each episode is split into steps. At each time step t the agent takes an action $a_t$ (such as moving), the environment transitions to a new state $s_t$ (such as a new position), and the environment also provides a reward $r_t$. The trajectory is the record of all the states the agent visited and all the actions it took.
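To make the loop concrete, here is a toy sketch in Python. `GardenEnv`, `run_episode`, and the reward values are all made up for illustration; this is not a real RL library API, just the episode/step/action/state/reward/trajectory vocabulary turned into code.

```python
# Toy RL episode: a "dog" agent walks along a 1-D garden to pick up a ball.
# Hypothetical minimal environment, invented for illustration.

class GardenEnv:
    """1-D garden with positions 0..size-1; the ball sits at the last position."""
    def __init__(self, size=5):
        self.size = size

    def reset(self):
        self.pos = 0                 # initial state: the dog's starting position
        return self.pos

    def step(self, action):
        # action a_t: -1 = move left, +1 = move right
        self.pos = max(0, min(self.size - 1, self.pos + action))
        done = self.pos == self.size - 1        # reached the ball -> episode ends
        reward = 1.0 if done else -0.1          # reward r_t: small cost per step
        return self.pos, reward, done           # new state s_t, reward, episode end

def run_episode(env, policy, max_steps=50):
    """Collect one trajectory: a list of (state, action, reward) tuples."""
    state, trajectory = env.reset(), []
    for _ in range(max_steps):
        action = policy(state)
        next_state, reward, done = env.step(action)
        trajectory.append((state, action, reward))
        state = next_state
        if done:
            break
    return trajectory

# A trivially good policy: always walk toward the ball.
traj = run_episode(GardenEnv(), policy=lambda s: 1)
print(len(traj), traj[-1])   # → 4 (3, 1, 1.0)
```

The episode here takes four steps, each paying the small per-step cost, and ends with the +1 reward for picking up the ball; the returned trajectory is exactly the (state, action, reward) history described above.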

So, instead of a dog, we have an LLM (the agent), and instead of you, the trainer, we have an environment that gives feedback.

Important terms:

- Episode: one complete attempt at the task (e.g., picking up the ball once).
- Step: one interaction within an episode, indexed by time step t.
- Action $a_t$: what the agent does at step t.
- State $s_t$: the situation of the environment after the action.
- Reward $r_t$: the feedback the environment gives at step t.
- Trajectory: the sequence of states visited and actions taken.

This is better illustrated in the figure below:

image.png

RL is easy to understand if you relate it to coaching, playing a game, or training a dog or a kid.

The Role of RL in LLMs

RL in LLMs is mainly used for fine-tuning, although RL-based pre-training now exists too (see https://github.com/tokenbender/avataRL). Still, RL is mostly applied after supervised fine-tuning.

Pre-training and supervised fine-tuning on their own build a strong model, but to push the LLM's capabilities further we use RL.

RL gives us a way to fine-tune these pre-trained LLMs to better achieve the qualities we want. It’s like giving our LLM dog extra training to become a well-behaved and helpful companion, not just a dog that knows how to bark fluently!
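The core mechanic behind RL fine-tuning can be shown in a few lines. Below is a minimal REINFORCE-style sketch; the two "responses", their rewards, and the learning rate are all invented for illustration, and this is not how any specific RLHF library implements training. The point is only the direction of the update: outputs that earn reward get more probability.

```python
# Toy policy-gradient sketch: a "policy" over two candidate responses,
# where only response 1 earns reward. All numbers here are made up.
import math, random

random.seed(0)
logits = [0.0, 0.0]          # toy policy parameters: preferences over responses
reward = [0.0, 1.0]          # the environment rewards response 1 (the helpful one)

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

lr = 0.5
for _ in range(200):
    probs = softmax(logits)
    a = random.choices([0, 1], weights=probs)[0]   # sample a response
    r = reward[a]                                   # environment feedback
    # REINFORCE update: grad of log pi(a) w.r.t. logit i is (1[i==a] - probs[i])
    for i in range(2):
        logits[i] += lr * r * ((1.0 if i == a else 0.0) - probs[i])

print(softmax(logits))   # probability mass shifts toward the rewarded response
```

After a couple hundred updates the policy puts nearly all its probability on the rewarded response, which is the same shift RL fine-tuning produces in an LLM, just over a vastly larger action space of token sequences.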