What is RL?

RL is easiest to grasp in a physical setting. Imagine the agent is a dog navigating an environment, say a garden, with the goal of picking up a ball. The agent works in episodes: picking up the ball once is one episode. Each episode is split into steps. At each time step t the agent takes an action $a_t$ (such as moving), the environment transitions to a new state $s_t$ (such as a new position), and the environment also provides a reward $r_t$. The trajectory is the record of all the states the agent visited and all the actions it took.
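To make the loop concrete, here is a toy sketch in Python. `GardenEnv`, `run_episode`, and the reward values are all made up for illustration; this is not a real RL library API, just the episode/step/action/state/reward/trajectory vocabulary turned into code.

```python
# Toy RL episode: a "dog" agent walks along a 1-D garden to pick up a ball.
# Hypothetical minimal environment, invented for illustration.

class GardenEnv:
    """1-D garden with positions 0..size-1; the ball sits at the last position."""
    def __init__(self, size=5):
        self.size = size

    def reset(self):
        self.pos = 0                 # initial state: the dog's starting position
        return self.pos

    def step(self, action):
        # action a_t: -1 = move left, +1 = move right
        self.pos = max(0, min(self.size - 1, self.pos + action))
        done = self.pos == self.size - 1        # reached the ball -> episode ends
        reward = 1.0 if done else -0.1          # reward r_t: small cost per step
        return self.pos, reward, done           # new state s_t, reward, episode end

def run_episode(env, policy, max_steps=50):
    """Collect one trajectory: a list of (state, action, reward) tuples."""
    state, trajectory = env.reset(), []
    for _ in range(max_steps):
        action = policy(state)
        next_state, reward, done = env.step(action)
        trajectory.append((state, action, reward))
        state = next_state
        if done:
            break
    return trajectory

# A trivially good policy: always walk toward the ball.
traj = run_episode(GardenEnv(), policy=lambda s: 1)
print(len(traj), traj[-1])   # → 4 (3, 1, 1.0)
```

The episode here takes four steps, each paying the small per-step cost, and ends with the +1 reward for picking up the ball; the returned trajectory is exactly the (state, action, reward) history described above.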

So, instead of a dog, we have an LLM (the agent), and instead of you, the trainer, we have an environment that gives feedback.

Important terms:

- Episode: one complete attempt at the task (e.g., picking up the ball once).
- Step: one interaction within an episode, indexed by time step t.
- Action $a_t$: what the agent does at step t.
- State $s_t$: the situation of the environment after the action.
- Reward $r_t$: the feedback the environment gives at step t.
- Trajectory: the sequence of states visited and actions taken.

This is better illustrated in the figure below:

image.png

RL is easy to understand if you relate it to coaching, playing a game, or training a dog or a kid.

The Role of RL in LLMs

RL in LLMs is mainly used for fine-tuning, although RL-based pre-training now exists too (see https://github.com/tokenbender/avataRL). Still, RL is mostly applied after supervised fine-tuning.

Pre-training and supervised fine-tuning on their own build a strong model, but to push the LLM's capabilities further we use RL.

RL gives us a way to fine-tune these pre-trained LLMs to better achieve the qualities we want. It’s like giving our LLM dog extra training to become a well-behaved and helpful companion, not just a dog that knows how to bark fluently!
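The core mechanic behind RL fine-tuning can be shown in a few lines. Below is a minimal REINFORCE-style sketch; the two "responses", their rewards, and the learning rate are all invented for illustration, and this is not how any specific RLHF library implements training. The point is only the direction of the update: outputs that earn reward get more probability.

```python
# Toy policy-gradient sketch: a "policy" over two candidate responses,
# where only response 1 earns reward. All numbers here are made up.
import math, random

random.seed(0)
logits = [0.0, 0.0]          # toy policy parameters: preferences over responses
reward = [0.0, 1.0]          # the environment rewards response 1 (the helpful one)

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

lr = 0.5
for _ in range(200):
    probs = softmax(logits)
    a = random.choices([0, 1], weights=probs)[0]   # sample a response
    r = reward[a]                                   # environment feedback
    # REINFORCE update: grad of log pi(a) w.r.t. logit i is (1[i==a] - probs[i])
    for i in range(2):
        logits[i] += lr * r * ((1.0 if i == a else 0.0) - probs[i])

print(softmax(logits))   # probability mass shifts toward the rewarded response
```

After a couple hundred updates the policy puts nearly all its probability on the rewarded response, which is the same shift RL fine-tuning produces in an LLM, just over a vastly larger action space of token sequences.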