Most LLM environments are either too toy-like or too clean. The model gets a neatly packaged state, the reward is obvious, and there is no real adversary trying to exploit mistakes. I wanted to build something harder: an environment with hidden information, delayed rewards, legal action constraints, and another agent actively pushing back. Competitive Pokemon ended up being a much better fit for that than I expected.
On the surface, Pokemon does not sound like a serious benchmark for language-model reasoning. But competitive Pokemon is really a game about uncertainty, tempo, resource preservation, and managing long-term consequences. Rock-paper-scissors already gives a tiny example of cyclic matchups, where the correct action depends on what the other side is likely to do. Pokemon scales that idea up dramatically: imagine a rock-paper-scissors game with 300+ type matchups, switching, partial information, setup turns, status effects, and situations where the move that looks best right now is exactly the move that loses the game later. That makes it a surprisingly rich environment for long-horizon planning and situational awareness.
The project I built, WolfeClick, wraps Pokemon Showdown as an OpenEnv-compatible environment. Instead of treating the battle simulator as something external that a model pokes at through brittle prompt engineering, I wanted to make it behave like a proper reinforcement learning loop. At each turn, the environment produces an observation, the model chooses one action, the simulator advances, and the environment returns the next observation and a reward.
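That loop can be sketched as a minimal gym-style interface. This is an illustrative skeleton, not WolfeClick's actual code; all class and method names here are my own placeholders, and the real environment forwards actions to Showdown via poke-env instead of the dummy logic below.

```python
from dataclasses import dataclass

@dataclass
class StepResult:
    observation: str   # structured text view of the battle
    reward: float      # computed by the environment after each step
    done: bool         # battle finished?

class ShowdownEnv:
    """Skeleton of the turn-level loop. Names are illustrative; the real
    environment talks to the Pokemon Showdown engine through poke-env."""

    def reset(self) -> str:
        self.turn = 0
        return self._observe()

    def step(self, action: dict) -> StepResult:
        # In the real environment the action is forwarded to the simulator,
        # which advances the battle by one turn and reports the new state.
        self.turn += 1
        done = self.turn >= 3  # placeholder termination condition
        return StepResult(self._observe(), 0.0, done)

    def _observe(self) -> str:
        return f"turn {self.turn}: <structured battle state>"
```

The point of the shape is that nothing battle-specific leaks into the training loop: the trainer only ever sees observation, action, reward, done.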
Under the hood, Showdown runs as the actual battle engine, poke-env manages the interaction with that engine, and my environment wrapper converts the battle state into something a language model can use. The wrapper also enforces legality, keeps track of revealed opponent information, and computes the reward after each step. That makes the system feel much less like a one-off demo and much more like a reusable environment that can actually support training.
One of the most important design decisions was what information the model should have access to. I did not want the model to see hidden state it would never know in a real battle, but I also did not want the observation to be so sparse that it became unusable. The final state representation is a structured text view of the battle that includes the active field, the model’s own full team, the opponent information revealed so far, and the exact legal actions available on that turn.
In practice, that means the model sees things like the current active Pokemon on both sides, their HP, status, item or ability if known, its own available team members and moves, and the running history of what has been revealed about the opponent. That last part matters because Pokemon is partially observable. Good play is often less about reacting to what is visible right now and more about updating your beliefs based on what the opponent has already shown.
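A renderer for that kind of observation might look like the sketch below. The field names and layout are illustrative, not the exact schema WolfeClick uses; the key property is that the opponent side only ever contains what has been revealed.

```python
def render_observation(own_active, own_team, opp_active, opp_revealed, legal_actions):
    """Format the partially observable battle state as structured text.

    own_active / opp_active: dicts with at least 'name' and 'hp' (percent);
    opp_revealed: names the opponent has shown so far (hidden info stays out).
    """
    lines = [
        f"YOUR ACTIVE: {own_active['name']} {own_active['hp']}% "
        f"status={own_active.get('status', 'none')}",
        f"OPPONENT ACTIVE: {opp_active['name']} {opp_active['hp']}%",
        "YOUR TEAM: " + ", ".join(p["name"] for p in own_team),
        "OPPONENT REVEALED SO FAR: " + (", ".join(opp_revealed) or "nothing yet"),
        "LEGAL ACTIONS: " + ", ".join(legal_actions),
    ]
    return "\n".join(lines)
```

Because the revealed-opponent line accumulates across turns, the model's belief updates can live in plain text instead of a hidden feature vector.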
The environment also gives the model the exact list of actions it is legally allowed to take. That makes the task much clearer: the model is not being asked to write an essay about the battle. It is being asked to choose one valid decision.
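Enumerating those legal actions is mechanical once you have the battle state. A minimal sketch, with illustrative field names (in practice poke-env exposes this as available moves and switches):

```python
def legal_actions(active_moves, bench):
    """Enumerate the exact legal choices for this turn.

    active_moves: dicts with 'name', 'pp', optional 'disabled';
    bench: dicts with 'name', 'hp' for non-active team members.
    """
    actions = [
        {"action": "move", "choice": m["name"]}
        for m in active_moves
        if m["pp"] > 0 and not m.get("disabled", False)
    ]
    actions += [
        {"action": "switch", "choice": p["name"]}
        for p in bench
        if p["hp"] > 0  # fainted Pokemon cannot come back in
    ]
    return actions
```

Handing the model this finite menu turns an open-ended generation problem into a constrained selection problem.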
To keep the action space constrained, I made the model output exactly one JSON object. Early on, I found that this was not something to leave to prompting alone, so I first did a short SFT warmup to make the model reliably follow the schema. That helped a lot. Once the model consistently stayed in format, the RL loop could focus on choosing better actions instead of wasting rollout budget on malformed outputs.
```json
{"action": "move" | "switch", "choice": "Exact Name of Move or Pokemon"}
```
That small amount of supervised tuning ended up being one of the highest-leverage parts of the pipeline. Without it, too much of the training signal gets burned on syntax and legality problems instead of actual decision-making.
This turns out to be important for two reasons. First, it forces the model to commit to a concrete decision instead of hiding behind vague reasoning. Second, it makes action validation straightforward. If the model tries to output something malformed, hallucinated, or illegal, the environment can detect it immediately and penalize it.
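The validation step can be sketched as a single function that either returns a clean action or a penalty. The penalty magnitude and function name are illustrative, not WolfeClick's actual values:

```python
import json

ILLEGAL_PENALTY = -1.0  # illustrative magnitude

def parse_and_validate(raw_output, legal_actions):
    """Return (action, penalty). Malformed, hallucinated, or illegal
    outputs are rejected immediately instead of reaching the simulator."""
    try:
        action = json.loads(raw_output)
    except json.JSONDecodeError:
        return None, ILLEGAL_PENALTY  # not valid JSON at all
    if not isinstance(action, dict) or set(action) != {"action", "choice"}:
        return None, ILLEGAL_PENALTY  # wrong schema
    if action["action"] not in ("move", "switch") or action not in legal_actions:
        return None, ILLEGAL_PENALTY  # well-formed but illegal choice
    return action, 0.0
```

Note the three failure tiers: invalid JSON, invalid schema, and valid-but-illegal. Distinguishing them is useful for debugging which failure mode the model is stuck in.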
This also made training much cleaner, because the task becomes a verifiable reward problem. Early on, a large part of the challenge is simply getting the model to reliably stay in schema and choose legal actions. Once that behavior is established, the more interesting question becomes whether the model is learning to choose better legal actions rather than just valid ones.
The reward design is where this environment stops being a formatting task and starts becoming an actual strategic training setup. A pure win-or-loss reward is too sparse for short experiments. If the only meaningful signal arrives at the end of a long battle, training becomes painfully inefficient, especially when the model is still learning basic action selection and legality. So instead of waiting until the very end of the game, I shaped the reward around intermediate battle events while still keeping it tied to the real objective of winning good battles.
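A shaped reward of that kind might look like the sketch below. The weights and event choices are illustrative, not the values WolfeClick uses; the structural point is that intermediate events (knockouts, damage) provide dense signal while the terminal outcome still dominates.

```python
def shaped_reward(prev, curr, won=None):
    """Dense per-turn reward from intermediate battle events.

    prev/curr: snapshots with 'opp_alive', 'own_alive' (counts) and
    'opp_dmg_frac' (cumulative fraction of opponent HP removed).
    won: None mid-battle, True/False on the terminal step.
    """
    r = 0.0
    r += 1.0 * (prev["opp_alive"] - curr["opp_alive"])        # opponent KOs
    r -= 1.0 * (prev["own_alive"] - curr["own_alive"])        # own losses
    r += 0.1 * (curr["opp_dmg_frac"] - prev["opp_dmg_frac"])  # chip damage
    if won is not None:
        r += 5.0 if won else -5.0  # terminal outcome outweighs shaping
    return r
```

The sign structure matters as much as the magnitudes: trading one of your own Pokemon for one of theirs nets roughly zero, so the model is not rewarded for reckless sacrifices.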
Being obsessed with Pokemon actually helped a lot here. It made me realize that building scalable RL environments is less about abstract RL theory and more about domain understanding. If you know what good progress looks like in a battle, you can encode much better signals.
The reward is not arbitrary, and it is not a synthetic preference score invented after the fact. It is a structured attempt to reflect real battle progress in a denser form so the model can learn from shorter trajectories. That makes the environment more trainable and more verifiable while still preserving the strategic shape of the task.