Introduction#
This is the story of how I spent my winter break teaching an LLM to play no-press Diplomacy from scratch. It’s a deep dive into the engineering, research decisions, and hard-won lessons from building an end-to-end RL system - from scalable rollout infrastructure to custom constrained generation to league-based training.
Fair warning: this is a long post. But if you’re interested in what it actually takes to do applied RL research - the debugging, the infrastructure, the failed experiments - hopefully it’s a useful read.
TL;DR - The Results
What I built: A GRPO-based RL system that trains Qwen3-14B (with LoRA) to play no-press Diplomacy, running on Modal’s serverless infrastructure.
The headline numbers:
- 80% win rate vs DumbBot (up from 72% base model) - exceeding DipNet’s 75% benchmark
- +77 Elo improvement (~11 percentage point increase in expected win rate)
- 10x inference speedup from custom logits processor
Key takeaways:
- Constrained generation via trie-based logits processing is essential - without it, valid move accuracy is ~50%
- Per-token reward weighting for credit assignment was the single biggest improvement to model quality
- Measuring progress in league training is genuinely hard - your Elo can drop while you’re improving
Acknowledgments#
Before we get started, I want to thank Modal for providing generous compute credit for this project. This is not a sponsored post. I was going to use Modal either way, but once they found out I was planning on spending a bunch of money out of pocket, they were kind enough to give me some credit to cover the cost.
It sucks that AI research is so expensive, because it’s really fun, the tech has become so accessible, and you can learn so much just by doing it. I really hope more people get the opportunity to do projects like this, and I’d love to support folks however I can - and I hope I keep getting the support and opportunity to do projects like this myself.
Why Diplomacy?#
Diplomacy is the antithesis of Chess or Go. Where those games have alternating turns and deterministic outcomes, Diplomacy forces all seven players to submit orders simultaneously - you can’t react (or tree search) to your opponent’s move, you have to predict it. In practice, Diplomacy is a mixed-strategy, rock-paper-scissors kind of game. Aggression might beat passivity, passivity might beat betrayal, and so on. Everything that makes Diplomacy hard is fundamentally a human problem: Can I trust this alliance? Will they stab me in the back? Should I betray them first?
This makes Diplomacy a fascinating testbed for LLM research. Once we get to full-press Diplomacy (with natural language negotiation), we get to ask some genuinely interesting questions:
- How much will an LLM lie or betray if it serves its strategic goals?
- Do these behaviors emerge naturally from optimizing to win, or do they need to be explicitly trained?
- What can we learn about LLM decision-making in adversarial multi-agent settings that transfers to real-world applications?
You know what else is a simultaneous move, mixed strategy game? Business, international relations, wargaming, etc.
There’s also the engineering challenge. Meta’s Cicero "solved" Diplomacy back in 2022, achieving human-level play in full-press games. I was under no illusions this project would come anywhere close to Cicero’s performance. But Cicero relied heavily on explicit planning and search - a complex pipeline of strategic reasoning, dialogue generation, and intent prediction. Since then, LLMs have improved dramatically. I wanted to see how far we could push a much simpler approach: pure LLM + RL, no search, no explicit planning, no value function. Just throw tokens at the problem and let gradient descent figure it out.
A post hoc justification for this approach: can chain of thought reasoning in LLMs approximate a value function? What about planning and search?
The building blocks of RL#
Before we even start thinking about machine learning, we need to lay a solid foundation to build on. An RL system - in its simplest form - consists of 3 components:
- We need a way to simulate lots of games (rollouts) of Diplomacy as quickly as possible
- We need an agent that can make decisions in the game
  - If the agent is an LLM, we need an inference API to get the agent’s actions
  - Aside: I tend to treat using LLMs as a special case of whatever problem I’m solving. There’s nothing special about using an LLM - it’s just slower and more expensive than other options. A good rule of thumb for your codebase: anything you use an LLM to do, you should also be able to do without it (see the sketch after this list).
- We need to aggregate everything that happens in the game and compute a reward for each agent
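To make that aside concrete, here’s a minimal sketch of an agent interface where an LLM-backed agent and a rule-based bot are interchangeable. The class and helper names (`Agent`, `RandomAgent`, `build_prompt`, `parse_orders`) are illustrative stand-ins, not the actual code from this project:

```python
import random
from typing import Protocol

class Agent(Protocol):
    """Anything that can turn a game state into orders for one power."""

    def get_orders(self, game_state: dict, power_name: str) -> list[str]:
        ...

class RandomAgent:
    """Rule-based baseline: picks a random valid order for each unit."""

    def get_orders(self, game_state: dict, power_name: str) -> list[str]:
        valid = game_state["valid_moves"][power_name]  # {unit: [order, ...]}
        return [random.choice(moves) for moves in valid.values()]

class LLMAgent:
    """LLM-backed agent: same interface, just slower and more expensive."""

    def __init__(self, inference_client):
        self.client = inference_client

    def get_orders(self, game_state: dict, power_name: str) -> list[str]:
        prompt = build_prompt(game_state, power_name)   # hypothetical helper
        completion = self.client.generate(prompt)       # hypothetical client
        return parse_orders(completion)                 # hypothetical helper
```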
As tempting as it is to start with the modeling problem, I’d argue the most important first step of any RL project is to build a highly scalable rollout engine and test the daylights out of it.
The rollout engine#
Thanks to the power of open source, there’s a pip-installable package that runs the actual Diplomacy game engine. We can run this in a Modal image on a CPU for near-infinite horizontal scaling. We just need to write a small wrapper around the engine to make it easy to represent the game state and compute rewards for our agents, and do a couple of tricks in our rollout function to eventually support GRPO-style training.
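Here’s roughly what that looks like - a simplified sketch using the open-source `diplomacy` package’s `Game` API inside a Modal function. The year cap, random baseline agent, and supply-center "reward" are placeholders for illustration, not the project’s actual wrapper or reward function:

```python
import modal

image = modal.Image.debian_slim().pip_install("diplomacy")
app = modal.App("diplomacy-rollouts")

@app.function(image=image, cpu=1)
def rollout(max_years: int = 7) -> dict:
    """Play one game with random agents and return per-power scores."""
    import random
    from diplomacy import Game

    game = Game()
    while not game.is_game_done and int(game.get_current_phase()[1:5]) < 1901 + max_years:
        possible = game.get_all_possible_orders()
        for power_name in game.powers:
            locs = game.get_orderable_locations(power_name)
            orders = [random.choice(possible[loc]) for loc in locs if possible[loc]]
            game.set_orders(power_name, orders)
        game.process()

    # Placeholder reward: supply-center count at game end
    return {p: len(game.get_centers(p)) for p in game.powers}

# Fan out horizontally: rollout.map([7] * 1000) runs 1000 games in parallel on Modal
```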
Benchmarking the rollout engine#
Again, before we even start doing ML, we need to collect some baseline throughput data. We’re interested in a couple things:
- How efficiently can we scale horizontally?
- How does game length affect throughput?
- Where are the bottlenecks?
Feel free to skim the following graphs - they’re not particularly interesting with just the baseline bots. My point is more to emphasize the importance of building these instrumentation tools early, because once you start adding LLMs for inference and training, it becomes a game of whack-a-mole to eliminate the long pole in your pipeline and maximize throughput.
Scaling horizontally#
Thanks to Modal, we can scale the throughput of the game engine pretty much linearly.

Game length#
We can also see that game length affects throughput. Interestingly, there seems to be a sweet spot around 7 years. I suspect this is because the later into the game you get in Diplomacy, the more complex order resolution becomes.

Latency waterfall#
Finally, we can compute a waterfall view of rollout latency. This isn’t so interesting for baseline bots since each step is pretty simple, but it will be useful when we add LLM powers.

Summary#
Cool, we’ve built our rollout engine and proven we can scale it on Modal. Now we can be confident building the rest of our RL pipeline, knowing that any future latency we observe is probably due to inference, training, or I/O.
The agent#
We glossed over it in the rollout engine section, but we actually need an Agent interface to power even the baseline bots. Everything becomes much more complicated when we add an LLM into the rollout loop, however. This section will cover two things:
- The LLM inference logic that makes learning a game as complex as Diplomacy feasible
- The inference engine we need to add to our rollout infrastructure to support it
The LLM Agent#
Conceptually an LLM Diplomacy agent is remarkably simple. Consider the classic (s,a) -> r RL problem. Well, since we’re using an LLM, s is just a representation of the game state in text, maybe with some extra instructions. It could look something like this:
MOVEMENT_PREFIX = """\
### DIPLOMACY MOVEMENT PHASE ###
You are playing Diplomacy. Your task: output exactly one order per unit.
RULES:
- Copy-paste EXACT move strings from the valid moves list below
- One order per line inside <orders> tags
- Movement: "A PAR - BUR" means Army Paris moves to Burgundy
- Hold: "A PAR H" means Army Paris holds
- Support: "A PAR S A BUR - MAR" means Paris supports Burgundy to Marseilles
- Convoy: "F NTH C A LON - NWY" means Fleet convoys Army from London to Norway
"""
prompt = (
    f"{MOVEMENT_PREFIX}"
    f"GAME STATE:\n"
    f"Power: {power_name}\n"
    f"Phase: {phase}\n"
    f"You have {unit_count} units.\n\n"
    f"VALID MOVES:\n{moves_display}\n\n"
    f"Output {unit_count} orders (one per unit):\n"
    "<orders>\n"
)
For the action a - again, we’re using an LLM, so it’s more text. Notice in the example prompt we’re seeding the LLM generation with <orders>, so we can teach the LLM to output XML and parse the orders from a predictable format.
Easy peasy, time to start training, right? Not so fast - here are a couple of reasons the example above is a very bad idea.
- Token bloat: Including all the VALID_MOVES in the prompt might seem like a good idea - hey, this way we can force the model to only choose valid actions, right? True until you get to the mid game, when you can easily have 10+ units (N) with 15-25 valid moves (M) each. Listing these moves consumes O(N × M) tokens in the prompt. Multiply that by 7 game powers over a dozen turns/game and 100 games/step and that’s a lot of extra tokens. Worse still, the action space explodes combinatorially to M^N possible actions per turn. With typical mid-game values, this easily exceeds most context windows, making reasoning or search infeasible, which brings us to the next point...
- Logits processing: With an action space that large, learning to take good moves is hard enough. We don’t want to waste cycles first learning how to take legal actions. To make things worse, with a naive implementation we’d risk the agent only outputting garbage moves, resulting in the default behavior of all units holding, thus generating no advantage and no gradient to learn from. To prevent this, we need to implement a custom logits processor that runs at every token generation step to constrain model generation to only valid Diplomacy moves. Luckily, vLLM supports custom logits processors. Let’s dive in, since this is one of the cooler parts of the project.
Custom logits processor#
The rough idea for the logits processor is this:
- Before generation: We use the game engine to compute a trie over token ID sequences for all valid moves for the current game state + power
- At generation start: The model can generate freely. This allows us to output reasoning traces, call tools, etc.
- During generation: We "listen" for the opening <orders> tag - once we see it, we’re in an active phase and we constrain outputs to only token sequences that can form valid moves.
- Active phase: At each token, we advance a pointer in the trie of valid moves. When we reach a leaf node (end-of-move), we allow newlines and reset to the root of the trie to begin the next order.
- Active phase, cont.: After each completed order, we extract the unit identifier (e.g., "A PAR" from "A PAR - BUR"), mark it as used, and rebuild the trie excluding all moves for that unit. This prevents duplicate orders per unit.
- Completion: We listen for the closing </orders> tag - once we see it, we’re done and can return the generated orders. If we’ve emitted enough orders for all units, we force-complete the closing tag. A hard-fought learning here: we can’t just match on token IDs for the closing tag. BPE tokenization is context-dependent, so </orders> might tokenize differently depending on what comes before it (e.g., > and \n can merge into a single token). Instead, we maintain a rolling window of decoded text - each token gets decoded back to characters and appended to a buffer. Tag detection happens via substring search on this text buffer, not token ID matching. This adds overhead, but it’s the only reliable way to handle tokenization quirks. (A simplified sketch of the trie logic follows this list.)
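To make the trie mechanics concrete, here’s a stripped-down sketch of just the active-phase constraint, written in the shape of a vLLM-style logits processor (a callable taking the generated token IDs and the next-token logits). It omits the tag detection, per-unit trie rebuilding, and decoded-text buffer described above, so treat it as an illustration rather than the project’s actual implementation:

```python
import torch

class TrieLogitsProcessor:
    """Constrain generation to token sequences that spell valid orders.

    A trie node is a dict {token_id: child_node}; the special key "END"
    marks that the tokens consumed so far form a complete order.
    """

    def __init__(self, valid_orders: list[str], tokenizer, newline_id: int):
        self.root = self._build_trie(valid_orders, tokenizer)
        self.node = self.root
        self.newline_id = newline_id

    def _build_trie(self, orders, tokenizer):
        root: dict = {}
        for order in orders:
            node = root
            for tok in tokenizer.encode(order, add_special_tokens=False):
                node = node.setdefault(tok, {})
            node["END"] = True  # leaf: a complete order ends here
        return root

    def __call__(self, generated_token_ids: list[int], logits: torch.Tensor) -> torch.Tensor:
        # Advance the pointer using the last emitted token (if any).
        if generated_token_ids:
            last = generated_token_ids[-1]
            if last == self.newline_id:
                self.node = self.root          # start the next order
            elif last in self.node:
                self.node = self.node[last]    # walk down the trie

        # Allowed next tokens: children of the current node,
        # plus a newline if we are sitting on a complete order.
        allowed = [t for t in self.node if t != "END"]
        if self.node.get("END"):
            allowed.append(self.newline_id)

        mask = torch.full_like(logits, float("-inf"))
        mask[allowed] = 0.0
        return logits + mask
```

The key property is that each step is just a dictionary walk plus one vectorized mask over the vocabulary, which is why the per-token overhead stays small (more on that below).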
Note
This was the first time I’d coded a trie since grinding LeetCode for my first job. Turns out that stuff is useful - who knew.
Interactive Demo#
To see how this works in practice, try the interactive demo below. It walks through the key stages of constrained generation:
We can look at the impact on valid move accuracy across a variety of prompt configurations:
| Prompt | Description |
|---|---|
| Verbose | The original prompt implementation with a verbose description of the game task, state, and full valid moves |
| Full moves (compact) | Include the entire list of valid moves in the prompt. Compact the JSON to minimize token usage and tighten the task description |
| Minimal | Reduce task description. Don’t include valid moves or any game state context. |
| Minimal with windows | Reduce task description. Don’t include valid moves. Represent relevant parts of the board as a linked list to provide strategic context |
| Minimal full context | Reduce task description. Don’t include valid moves, but include # of valid moves for each unit. Represent relevant parts of the board as a linked list to provide strategic context |

We can see that even including the full moves list, the naive implementation only achieves around 50% valid move accuracy. When you remove the full list, the model is basically randomly guessing and accuracy tanks to single digit percentages.
When we include the logits processor, we see a dramatic improvement across the board - around 75%. This makes it much easier for the model to take impactful moves in the games that result in advantage and a gradient to learn strategy from.
A natural next question is whether this approach increases latency. Improving accuracy 10-20x wouldn’t do us any good if it’s 20-30x slower to generate orders.
Runtime complexity and performance impact#
One of the key advantages of this approach is that the runtime overhead is actually minimal:
Trie construction (one-time per request):
- Building the trie requires encoding each valid move into token IDs
- For NN units with MM moves each, we encode O(N×M)O(N \times M) moves
- Each move tokenizes to roughly TT tokens on average
- Trie construction: O(N×M×T)O(N \times M \times T) time, O(N×M×T)O(N \times M \times T) space
- In practice with N=10N=10 units, M=20M=20 moves, and T≈5T \approx 5 tokens/move, this is ~1000 tokens to process
- This happens once at the start of the request, not per generation step
Per-token overhead during generation:
- Trie lookup: O(1) dictionary lookup in the current node’s children (typically < 100 options)
- Tag detection: O(k) where k is the tag length (constant, ~10 characters)
- Logits masking: O(V) where V is vocab size (~50k-100k tokens), but we only mask when active. This is also vectorized, so the wall-clock impact is minimal.
- Total per-token: O(V) worst case, but only when inside <orders> tags
Empirically, the logits processor actually improves generation speed ~10x, cutting per-request latency from 900ms to 80ms. My theory is the logits processor not only constrains the model to valid outputs, it also constrains the model’s total token output by forcing it to generate the closing </orders> tag once it’s generated all the orders for all units.

Benchmark conditions: Qwen3-14B on 1x A100-40GB, vLLM v1.x, batch size 1, prefix-cache warmup, ~750 token prompts, max 256 new tokens. The 900ms baseline includes runaway generation where the model produces reasoning traces before eventually outputting orders; the 80ms constrained version generates only the structured orders.
Testing our training pipeline#
At this point we have the ability to run a bunch of rollouts and use an LLM as the agent. We’re ready to test our infrastructure with a minimal training run.
Cooperative self-play
Diplomacy is different from traditional RL environments like Chess or Go in two important ways:
- The lack of transitivity: If you’re better than me at Chess, and Alice is better than you, Alice is almost certainly better than me. This is not the case in Diplomacy. Diplomacy is more like rock-paper-scissors - an honorable player who never backstabs might win amongst other honorable players, but if I’m willing to betray you all to win, then everything changes.
- Diplomacy is not strictly zero sum: Chess and Go are zero-sum games - one player wins and the other loses. If I blunder my queen and my win probability drops, yours necessarily goes up. Diplomacy is not zero sum in the short term - we can both capture a supply center or increase our win probability on the same turn.

For this initial training run, we’re going to use pure self-play (think AlphaZero) to train a model. Given the above, with one policy controlling all 7 powers, we should expect the model to learn to maximize the total score of all players. This may result in a "peace treaty" equilibrium, where each power expands into its surrounding neutral territories but never shows hostility, since it "knows" any hostility it shows will be met with equal hostility in return.
Despite the disclaimer above, this training setting will be a perfect test for our infra and core learning pipeline. Even if we can’t expect to come away with a competitive human level player, we should prove to ourselves that the model is capable of learning to maximize its reward in this environment.
The training loop#
We’re going to use GRPO to train the model.
On our own explore vs exploit loop
Note that GRPO is probably not the correct training algorithm for this problem - Diplomacy would benefit from having an explicit state value function to help with credit assignment.
However, I like to think of the design decisions we make as running in our own explore-exploit trade-off algorithm. In this case, I made a conscious choice to maximize learning (I like how elegant and simple GRPO is - how far can we push it?), not necessarily to maximize the odds of training a human-level Diplomacy bot. This framework has served me well in the past, both at work when designing and shipping A/B experiments (is this a pure learning experiment, or an experiment we intend to ship?) and when approaching projects like this.
For GRPO, we need to make two major design decisions:
- How do we calculate the rewards for each sample in a group?
- How do we group samples together for training?
Reward calculation#
There are two obvious paradigms we can use and combine to calculate a reward for a game of Diplomacy:
- Outcome level rewards: reward the agent based on its final position (sparse, credit assignment is hard)
- Turn level rewards: reward the agent based on the outcome of each turn (dense, credit assignment is easy, but may encourage greedy/myopic behavior)
I experimented with both and ended up heavily weighting outcome-level rewards (~90%) with a small turn-level shaping signal. Here’s a concrete example of a 4-turn trajectory for France:
Turn-level rewards (shaping signal):
Turn 1 (S1901M): A PAR - BUR, A MAR - SPA, F BRE - MAO
+0.3 (move to non-SC) + 2.0 (capture SPA) + 0.3 (move to sea)
Turn reward: +2.6
Turn 2 (F1901M): A BUR - MUN (bounced), A SPA - POR, F MAO - WES
-0.3 (bounce) + 2.0 (capture POR) + 0.3 (move)
Turn reward: +2.0
Turn 3 (S1902M): A BUR S A MUN - RUH (critical support), A POR H, F WES - TUN
+1.5 (critical support enabled attack) + 0.1 (hold) + 2.0 (capture TUN)
Turn reward: +3.6
Turn 4 (F1902M): A MUN - BER (with support), A POR - SPA, F TUN H
+2.0 (capture SC) + 0.5 (had friendly support) + 0.3 (move) + 0.1 (hold)
Turn reward: +2.9
Outcome-level reward (the real signal): At game end, France has 7 supply centers, placing 2nd out of 7 powers:
- Base: 7 SCs × 2.0 + 6 units × 0.2 + 0.5 survival = 15.7
- Position bonus (2nd place): +25.0
- Final outcome reward: 40.7
The per-order rewards are particularly important for teaching the model to coordinate. Notice in Turn 3: the support for Munich’s attack on Ruhr gets +1.5 because it was critical - the attack would have bounced without it. Compare this to a "void" support (supporting a move that doesn’t exist), which gets -1.5. This 3-point swing helps the model learn that supports need to reference actual moves.
Reward broadcasting: Rather than giving each turn its own reward, we sum the turn rewards and add the outcome reward, then broadcast this total to every turn in the trajectory. Why? In a game as strategic as Diplomacy, a "bad" local move might be part of a brilliant long-term strategy. It’s safer to assume all moves in a winning trajectory are better than moves in a losing trajectory. The per-order weighting within each turn handles the fine-grained credit assignment.
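In code, the broadcasting step is only a few lines. A minimal sketch, assuming a trajectory is just a list of per-turn shaping rewards plus one outcome reward (names are illustrative):

```python
def trajectory_rewards(turn_rewards: list[float], outcome_reward: float) -> list[float]:
    """Broadcast (sum of turn rewards + outcome reward) to every turn.

    Every turn in the trajectory gets the same scalar; fine-grained credit
    assignment is handled separately by per-order token weighting.
    """
    total = sum(turn_rewards) + outcome_reward
    return [total] * len(turn_rewards)

# Using the France example above:
turn_rewards = [2.6, 2.0, 3.6, 2.9]
outcome_reward = 40.7
print(trajectory_rewards(turn_rewards, outcome_reward))
# -> [51.8, 51.8, 51.8, 51.8]
```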
Grouping#
Remember that a core constraint of GRPO is that all rewards in a group must be directly comparable. For a simpler environment, like math, this is easy: Let’s say num groups = 1, and samples per group = 4.
- prompt: what is 1 + 1?
- s1: 2, s2: 3, s3: as a large language model, i cannot do math, s4: 2
- G: {s1, s2, s3, s4}
- Rewards: {1, 0, 0, 1}
- Mean reward: 0.5
- Advantages: {+0.5, -0.5, -0.5, +0.5} (reward - mean, then normalize)
The gradient update reinforces s1 and s4 (correct answers) while discouraging s2 and s3. Simple enough. In Diplomacy, this quickly falls apart. For the first step after the fork, turns across games are directly comparable. But as the game progresses, the board states diverge and the rewards for each turn are no longer directly comparable.
For Diplomacy, our grouping strategy is:
- Pick a hero power - one power per rollout is controlled by the LLM we’re training
- Warmup - play some random number of turns with baseline bots to get diverse starting states
- Fork - clone the game state samples_per_group times (e.g., 4 forks)
- Play out - each fork plays to completion independently (see the sketch after this list)
- Assign rewards - based on final outcome + turn level rewards
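Putting the steps together, here’s a rough sketch of the grouping logic. The helpers (play_baseline_turns, play_to_completion, compute_reward) are hypothetical stand-ins, and the real engine has its own state save/restore rather than a plain deepcopy:

```python
import copy
import random

def build_group(game, hero_power: str, samples_per_group: int = 4):
    """Create one GRPO group: a shared prefix, then several independent continuations."""
    # Warmup: play a random number of turns with baseline bots for state diversity
    warmup_turns = random.randint(0, 6)
    play_baseline_turns(game, warmup_turns)                         # hypothetical helper

    # Fork: clone the game state samples_per_group times
    forks = [copy.deepcopy(game) for _ in range(samples_per_group)]

    # Play out: each fork runs to completion; the hero power is controlled by
    # the policy being trained, everyone else by baseline/pool opponents
    samples = [play_to_completion(f, hero_power) for f in forks]    # hypothetical helper

    # Assign rewards and compute group-relative advantages
    rewards = [compute_reward(s, hero_power) for s in samples]      # hypothetical helper
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5 or 1.0
    advantages = [(r - mean) / std for r in rewards]
    return samples, advantages
```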
Gradient update#
With advantages computed, we can compute the gradient update for each sample in the group. Of course, since we’re broadcasting the same reward to each turn in the trajectory, we’ve lost some credit assignment information. To compensate for that, I introduced a per-order/token loss weighting scheme. I compute a set of heuristics to figure out how important each order in the output was to the turn based reward, then associate each token in the output with that order. This is particularly important for teaching the model correct support/convoy chains. Without this approach, the model would often order invalid supports of armies that were attacking a different target. For example:
France's orders:
A PAR - BUR (Paris moves to Burgundy)
A MAR S A PAR - PIC (Marseilles supports Paris to Picardy - WRONG!)
The support is syntactically valid (it’s a legal Diplomacy order), so the logits processor allows it. But it’s strategically nonsensical - Marseilles is trying to support a move to Picardy when Paris is actually moving to Burgundy. The game engine will mark this support as "void" and it won’t help anyone. The model wasted an order.
By assigning a harsh -1.5 reward to void supports (vs +1.5 for critical supports that enable successful attacks), we create a 3-point swing that teaches the model to coordinate its units. Since the support order itself was technically valid, we couldn’t catch this with the logits processor - we only found out it was void after the game engine tried to resolve the order. (Note: looking at your data is so important. I only caught this by manually inspecting Weave traces.)
The interactive demo below shows how this works. Toggle between "Good Support" (where Marseilles correctly supports Paris → Burgundy) and "Void Support" (the broken example). Notice how each token inherits the loss weight of its order:
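In code, that inheritance can be implemented by mapping each order’s weight onto the tokens that spell it. A sketch assuming a HuggingFace fast tokenizer with offset mappings (the function and the example weights are illustrative, not the project’s exact heuristics):

```python
def per_token_weights(completion: str, order_weights: dict[str, float], tokenizer) -> list[float]:
    """Every token inherits the loss weight of the order it belongs to."""
    enc = tokenizer(completion, return_offsets_mapping=True, add_special_tokens=False)
    weights = [1.0] * len(enc["input_ids"])  # default weight for non-order tokens

    for order, weight in order_weights.items():
        start = completion.find(order)
        if start == -1:
            continue
        end = start + len(order)
        for i, (tok_start, tok_end) in enumerate(enc["offset_mapping"]):
            if tok_start < end and tok_end > start:  # token overlaps this order's span
                weights[i] = weight
    return weights

# e.g. upweight the tokens of a critical (or void) support so the gradient
# focuses on the orders that actually drove the turn reward:
# per_token_weights(completion, {"A PAR - BUR": 1.0, "A MAR S A PAR - PIC": 1.5}, tokenizer)
```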
Self-play proof of concept#
Before optimizing for competitive play, we need to validate our training pipeline works at all. In the pure self-play setting, the same LLM policy controls all 7 powers. This is simpler to implement (no league pool, no opponent sampling) and gives us a clean signal: if the model can’t improve against itself, something is broken.
Here’s a game from the self-play training showing the resulting model controlling France vs baseline bots. It’s able to take over the world pretty easily against the weak baseline bots. Cool!
This is enough to convince me that the training pipeline works - it’s clearly being rewarded for and reinforcing the type of behavior we expect.
The self-play model learns to play reasonable Diplomacy, but as we’ll discuss in the league play section, it has fundamental limitations. For now, it validates our core loop: rollouts → rewards → gradients → better play.
Results: Training vs DumbBot#
Self-play validates our training loop, but it doesn’t tell us if we’re actually learning good Diplomacy. For that, we need to benchmark against a fixed opponent and prior work. My goal for this project became to match or exceed DipNet’s performance with a pure LLM approach. DipNet - a non-LLM RL project - achieved a 75% win rate against DumbBot, which, despite the name, is a sophisticated rule-based agent that understands support chains, defensive positioning, and basic tactical patterns. It plays solid no-press Diplomacy.
Quantitative Results#
Here are the results of training the model against DumbBot:
| Metric | Base Model (Qwen3-14B) | After Training | Improvement |
|---|---|---|---|
| Win Rate vs DumbBot | 72% | 80% | +8pp |
| Elo Rating | 1564 | 1641 | +77 |
| Games Evaluated | 100 | 100 | - |
The trained model shows improvement over the base model across all metrics. The Elo gain of +77 points translates to roughly an 11 percentage point increase in expected win rate (from 50% to 61% against an equal opponent). And more importantly we exceed our win rate target of 75% vs DumbBot.
Statistical note
With 100 games, the 95% confidence interval for an 80% win rate is approximately [71%, 87%] (Wilson score interval). The base model’s 72% falls just inside this range, so the improvement is suggestive but not statistically conclusive at the p=0.05 level. More games would tighten these bounds - but for a solo winter break project, 100 games per evaluation was the practical limit. The directional improvement was consistent across multiple evaluation runs.
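For reference, that interval comes from the standard Wilson score formula. A quick sketch to reproduce it (statsmodels’ proportion_confint with method="wilson" gives the same answer):

```python
import math

def wilson_interval(wins: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score confidence interval for a binomial proportion."""
    p = wins / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

print(wilson_interval(80, 100))  # ≈ (0.71, 0.87)
```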
Per-Power Performance#
Interestingly, the model shows varying performance depending on which power it plays:
| Power | Win Rate | Avg SCs | Notes |
|---|---|---|---|
| Russia | 100% | 7.9 | Easiest - corner position, lots of expansion room |
| Turkey | 100% | 7.1 | Corner position advantage |
| Germany | 86% | 6.1 | Central position is risky but powerful |
| Italy | 86% | 5.4 | Challenging start, model learned to navigate |
| England | 87% | 6.1 | Island advantage |
| Austria | 64% | 5.2 | Hardest - surrounded by enemies |
| France | 40% | 3.9 | Surprisingly struggled here |
The France result is interesting - it suggests the model learned strategies that work well from certain positions but struggles with the specific opening dynamics France faces. I suspect this could be fixed in training by making sure each batch is stratified by country. As it was implemented for this experiment, starting powers were randomly selected - this could result in certain update steps exploiting easy starting country strategies and learning overall degenerate strategies.
Sample Game Replay#
Here’s a game from the evaluation showing the trained model (playing as Italy) securing a solo victory against the dumbbot opponents:
One thing to note is I ran the evaluation games for many more turns than the training games (a limitation of train time compute). I think this explains why the model appears to start acting randomly in the second half of this eval game.
Training Dynamics#
The most important metric during training was mean reward vs DumbBot - shown below steadily increasing for the first ~30 steps of training.

Key observations:
- Reward mean increased from ~15 to ~34 over training (the smoothed line shows the trend through the noise)
- Advantage clipping stayed minimal throughout, indicating stable gradients (see chart below)
- Entropy bonus instead of KL penalty: I disabled the standard KL divergence penalty in favor of an entropy coefficient bonus. Diplomacy is such a domain-specific problem that drifting from the base model’s distribution isn’t inherently bad - we want the policy to specialize. The entropy bonus encourages exploration while being easier to tune than KL.

What happened around step 30?
You probably noticed the reward peaks around step 28-30 (~34), then becomes highly volatile, with a significant drop to ~9 at step 35 before partially recovering. Even with all the stability work I did, we clearly still experienced policy instability. A few theories:
1. Overfitting to DumbBot’s patterns: The model found a strategy that exploits specific DumbBot behaviors, but this strategy is brittle. When the policy drifts slightly, it loses the exploit and performance collapses.
2. Non-stratified batches: As mentioned in the training loop section, starting powers were randomly selected. This could result in certain update steps exploiting easy starting country strategies and learning overall degenerate strategies.
3. Reward hacking: The model may have discovered a degenerate strategy (like always holding) that scores well on our intermediate reward shaping but doesn’t actually win games against DumbBot.
This instability when training against a single fixed opponent is one motivation for league play - diverse opponents provide a more robust training signal and prevent overfitting to any single opponent’s exploitable patterns.
It’s also a good reminder about the importance of saving checkpoints frequently during training - either to resume training or to evaluate the best version of the model so far.
DumbBot is just the beginning
These results show our model can beat a heuristic baseline, but DumbBot has exploitable patterns. A model trained only against DumbBot might overfit to those patterns. The real test is league play - training against a diverse pool of opponents including previous versions of itself. That’s where things get interesting.
From self-play to league play#
While cooperative self-play and training against a single fixed opponent gave us a working training loop and demonstrated that GRPO can improve Diplomacy performance, both approaches have fundamental limitations. When all seven powers are controlled by the same policy, several problems emerge:
Policy collapse: During self-play, the model can converge to a degenerate "peace treaty" equilibrium where all powers grab neutral SCs but are discouraged from aggressive play. The resulting policy is obviously brittle to any competitive player that violates the peace treaty.
Generalization failure: A model trained only against itself develops blind spots. It learns to exploit its own specific weaknesses and defend against its own attack patterns, but real opponents play differently. Deploy a self-play model against a human or different AI, and it falls apart.
The league architecture#
Drawing inspiration from AlphaStar, I implemented a league training system with three types of opponents in the pool:
| Opponent Type | Role | Examples |
|---|---|---|
| Recent checkpoints | Peer-level opponents for competitive learning | Last 5-10 training checkpoints |
| Base model | Exploitable anchor that prevents forgetting | Original pre-trained Qwen3-14B |
| Hard-coded bots | Diverse, exploitable strategies | DumbBot, territorial bot, defensive bot |
The diversity allows the model to easily hill-climb against exploitable opponents early in training, while still being challenged by stronger peers as it improves.
AlphaStar took things even further by adding "exploiter" bots that were specifically trained and kept in the league to beat the current challenger. This was out of scope for my project.
Inference challenges: LoRA hot-swapping#
League training introduces a significant engineering challenge: we need to run inference for multiple different models simultaneously. Each game has 7 powers, potentially controlled by 7 different checkpoints from the league pool.
The naive solution - dedicating separate GPU workers to each model - doesn’t work with my limited GPU capacity.
Fortunately, vLLM supports LoRA adapter hot-swapping with batched inference.
# Different powers in the same game can use different LoRA adapters
# vLLM batches them together efficiently
requests = [
    {"prompt": france_prompt, "lora_adapter": "adapter_v15"},
    {"prompt": england_prompt, "lora_adapter": "adapter_v10"},
    {"prompt": germany_prompt, "lora_adapter": "base_model"},
    # ...
]
# Single batched inference call handles all of them
The adapters share the base model weights (95%+ of parameters), so this is memory-efficient. We can serve the entire league from a small pool of GPU workers, dynamically loading whichever adapters each game requires.
This approach introduces a hard constraint on the type of training we can do: we’re limited to LoRA fine-tuning rather than full model weight updates. I think this is fine for my project - both because of my compute and budget limitations, and because LoRA is generally a good fit for changing model behavior given existing knowledge. A full fine-tune would be more essential if the base model knew nothing about Diplomacy. I do think this is an interesting example of how infrastructure/engineering constraints can guide the design of the training process.
Matchmaking: PFSP beats pure TrueSkill#
With a pool of opponents, we need to decide which opponents the hero (training) model plays against. This is the matchmaking problem.
I initially implemented a TrueSkill-style rating system that prioritizes opponents close to the hero’s current skill level, the theory being that playing against similarly-skilled opponents maximizes learning signal. In practice, pure TrueSkill led to a subtle failure mode. As the hero improved, it would mostly play against the 2-3 other checkpoints at its skill level. The pool effectively collapsed back to something resembling self-play, producing brittle policies and catastrophic collapses during training.
Prioritized Fictitious Self-Play (PFSP) worked better. Instead of pure skill matching, PFSP explicitly maintains exploration by:
- Reserving a fraction of games for exploitable opponents (base model, bots) regardless of skill gap
- Prioritizing opponents the hero has struggled against historically
- Gradually increasing the challenge as the hero improves
The result is a training curriculum that maintains diversity throughout training. The hero learns to beat exploitable opponents (without forgetting how to beat them throughout training) while still being challenged by harder peers.
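A sketch of the PFSP-style sampling idea, in the spirit of AlphaStar’s weighting of opponents by how hard they currently are for the hero. The weighting function and exploit fraction below are illustrative choices, not the exact values I used:

```python
import random

def sample_opponent(pool: list[str], hero_win_rate: dict[str, float],
                    exploit_set: list[str], exploit_frac: float = 0.2) -> str:
    """Pick an opponent for the next game.

    - With probability exploit_frac, play an exploitable anchor (base model, bots)
      so the hero never forgets how to beat them.
    - Otherwise, sample peers weighted by (1 - win_rate)^2, prioritizing
      opponents the hero has struggled against.
    """
    if random.random() < exploit_frac:
        return random.choice(exploit_set)

    weights = [(1.0 - hero_win_rate.get(o, 0.5)) ** 2 for o in pool]
    if sum(weights) == 0:
        return random.choice(pool)
    return random.choices(pool, weights=weights, k=1)[0]
```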
Why diversity matters more than difficulty
Remember I compared Diplomacy to a more complicated version of rock-paper-scissors? When you overly limit your training pool, it’s like training against a player who only plays rock. You’ll get really good at playing paper, but you’ll be caught completely off guard by a player who plays scissors.
The pure TrueSkill approach was kind of like this - the model got really good at beating former versions of itself, but then forgot how to beat the much "worse" base model.
The measurement problem#
One unexpected challenge I encountered with league play: how do you know if training is working?
When training against a fixed opponent (like DumbBot), the answer is simple: track win rate. But in league play, the opponents change as training progresses. Your Elo might stay flat or even decrease while you’re actually improving - because the league is getting harder at the same pace you’re getting better.
This creates a measurement paradox:
- In-league metrics (Elo, win rate vs pool) don’t reliably indicate absolute improvement
- Static benchmarks (win rate vs DumbBot) quickly saturate as the model surpasses them
- Human evaluation is expensive and doesn’t scale
My solution was periodic evaluation against a fixed benchmark suite - the same DumbBot and baseline bots used in the self-play results above. This gives a stable reference point, even if the model quickly reaches ~90% win rate and the metric becomes less informative.
As someone who’s maybe a little too obsessed with observability, this was a big point of discomfort for me - something I’m glad I encountered, and an area where I’d love to learn more about best practices. It was hard staring at a wandb dashboard and not being particularly confident whether the training run was working or not.
If I did it again, I’d invest more in measurement infrastructure:
- Frozen evaluation pool: A fixed set of checkpoint snapshots that never change, evaluated periodically throughout training
- Exploitability metrics: Approximate best-response testing - how easily can a targeted policy beat the current checkpoint?
- Crossplay matrices: Periodic round-robin tournaments between all checkpoints to track relative progress
- Behavioral metrics: Track strategic indicators beyond win rate - support coordination rate, backstab frequency, territorial expansion patterns
Scaling inference#
With league play technically working, the next bottleneck was inference throughput. Each training step requires hundreds of games, each game has ~20 turns, and each turn needs 7 LLM calls. That’s thousands of inference requests per step. Two optimizations made this tractable.
Prefix caching#
Diplomacy prompts have a common structure: game rules, current board state, valid moves. The rules section is identical across all requests. The board state is shared across all 7 powers in a turn. Only the power-specific context (which country you are, your valid moves) varies.
vLLM’s automatic prefix caching exploits this. When multiple requests share a common prefix, the KV cache for that prefix is computed once and reused. I restructured prompts to maximize prefix sharing:
[SHARED ACROSS ALL REQUESTS]
Game rules, format instructions (~500 tokens)
[SHARED ACROSS TURN]
Current board state, all unit positions (~200 tokens)
[POWER-SPECIFIC]
Your country, your units, your valid moves (~100 tokens)
This ordering means 7 requests in the same turn share ~700 tokens of cached computation. Across a batch of 100 games, the cache hit rate is substantial.
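The change itself is just ordering the prompt pieces from most-shared to least-shared; a minimal sketch:

```python
def build_prompt(rules: str, board_state: str, power_context: str) -> str:
    """Order prompt sections from most-shared to least-shared so vLLM's
    automatic prefix caching can reuse the KV cache across requests."""
    return (
        f"{rules}\n"          # identical for every request in the run
        f"{board_state}\n"    # identical for all 7 powers in a turn
        f"{power_context}\n"  # unique per power: country, units, valid moves
    )
```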
In hindsight this is a pretty obvious optimization, but when I was building the MVP and put together the initial system prompt without much thought, it was not something that occurred to me. I think the lesson here is to always do several passes revisiting all parts of your code (especially prompts, which are easy to write once and forget about).
The importance sampling correction#
This one was my constant nemesis. Training would occasionally produce NaN losses (or grad norm would explode), but only sometimes, and only after many steps. The problem ended up being the infamous inference/training logprob mismatch between vLLM and HuggingFace. This was something I was aware of and had read about before this project, and yet it still caused constant headaches and required lots of debugging.
GRPO requires computing the ratio between the current policy’s probability and the rollout policy’s probability for each token. We generate completions with vLLM (fast, batched inference) but compute logprobs for training with HuggingFace (needed for gradient computation). The assumption was that these would match for identical inputs.
They don’t. vLLM applies various numerical optimizations (fused softmax, different precision handling) that produce slightly different logprobs. Usually the difference is negligible, but occasionally you’d get a token where vLLM says probability 0.001 and HuggingFace says 0.0001. The importance ratio explodes, gradients explode, NaN.
The fix: compute all logprobs (both rollout and reference) using the same implementation. We now compute rollout logprobs in the trainer using HuggingFace, discarding vLLM’s logprobs entirely. Slower, but numerically stable.
Trust but verify
When mixing inference frameworks, don’t assume numerical equivalence. Log everything, compare distributions, and add sanity checks for ratio bounds.
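Here’s the kind of check that would have caught this earlier - comparing the two logprob sources and bounding the ratio before it hits the loss. A sketch with illustrative thresholds:

```python
import torch

def check_importance_ratios(logp_train: torch.Tensor, logp_rollout: torch.Tensor,
                            clip: float = 10.0, warn_diff: float = 1.0) -> torch.Tensor:
    """Compare training-framework logprobs against rollout-engine logprobs.

    Warns when the two implementations disagree badly, and clamps the ratio
    so a single mismatched token can't blow up the gradient.
    """
    diff = (logp_train - logp_rollout).abs()
    if diff.max() > warn_diff:
        print(f"[warn] max logprob mismatch {diff.max().item():.3f} "
              f"at token index {int(diff.argmax())}")

    ratio = torch.exp(logp_train - logp_rollout)
    return ratio.clamp(max=clip)
```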
Training stability#
Getting a training run to complete without crashes or divergence required several hard-won fixes.
EMA reference model#
Standard GRPO uses a frozen reference model - typically the initial checkpoint. But this creates a problem: as training progresses, the policy drifts further from the reference, making the KL penalty increasingly meaningless.
Instead, I use an exponential moving average (EMA) of the policy weights as the reference. Every N steps, the reference model is updated:
ref_weights = ema_decay * ref_weights + (1 - ema_decay) * policy_weights
This keeps the reference "tracking" the policy, so the KL penalty remains meaningful throughout training. It’s a form of implicit trust region that prevents catastrophic forgetting without being too restrictive.
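In code, that update is a few lines over the (LoRA) parameters; a minimal sketch, ignoring which parameters are trainable and how often the update runs:

```python
import torch

@torch.no_grad()
def update_ema_reference(ref_model, policy_model, ema_decay: float = 0.99) -> None:
    """ref_weights = decay * ref_weights + (1 - decay) * policy_weights."""
    for ref_param, pol_param in zip(ref_model.parameters(), policy_model.parameters()):
        ref_param.mul_(ema_decay).add_(pol_param, alpha=1.0 - ema_decay)
```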
Gradient accumulation gotchas#
HuggingFace’s trl library has a subtle issue with padding tokens. When computing loss, it masks padding tokens by setting their target to -100. But if your tokenizer uses -1 as the padding token ID (some do!), you get silent corruption: real tokens get masked, loss is computed incorrectly, gradients are wrong.
The symptom: training appears to work, metrics improve, but the model produces garbage. It took days of staring at tensor shapes to find this.
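The defensive fix is to derive the label mask from the attention mask instead of trusting the padding token ID; a small sketch:

```python
def build_labels(input_ids, attention_mask, ignore_index: int = -100):
    """Mask out padding positions explicitly instead of matching pad_token_id."""
    labels = input_ids.clone()
    labels[attention_mask == 0] = ignore_index  # padding never contributes to loss
    return labels
```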
Fault tolerance on Modal#
Training crashes can happen for a variety of reasons: pre-emption, hardware failures, acts of god, etc. A 12-hour training run interrupted at hour 11 is painful. This is why we checkpoint aggressively and make everything resumable.
Every training step saves:
- Model weights and optimizer state
- Training metrics and step count
- League pool state (all checkpoints, ratings)
- RNG state for reproducibility
On restart, I detect the last valid checkpoint and resume from there.
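A sketch of what that checkpoint payload and resume lookup can look like (paths and field names are illustrative; on Modal the checkpoint directory would live on a persistent Volume):

```python
import glob
import os
import random

import numpy as np
import torch

def save_checkpoint(step, model, optimizer, league_state, ckpt_dir="/checkpoints"):
    torch.save({
        "step": step,
        "model": model.state_dict(),
        "optimizer": optimizer.state_dict(),
        "league": league_state,                  # pool members + ratings
        "rng": {
            "python": random.getstate(),
            "numpy": np.random.get_state(),
            "torch": torch.get_rng_state(),
        },
    }, os.path.join(ckpt_dir, f"step_{step:06d}.pt"))

def latest_checkpoint(ckpt_dir="/checkpoints"):
    ckpts = sorted(glob.glob(os.path.join(ckpt_dir, "step_*.pt")))
    return torch.load(ckpts[-1]) if ckpts else None  # resume from here if present
```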
Random learnings#
Below are a few meta-lessons from this project and random things I’m proud of.
Look at your traces#
I’m embarrassed to admit that I actually didn’t set Weave up for the first couple of days of this project. Don’t be like me - actually look at what your model is doing, early and often. I use Weave, which is built into wandb. Use whatever tool you want, but you need to be looking at traces.
I discovered the void support problem by manually inspecting game traces in Weave. The model was generating syntactically valid orders that were strategically nonsensical. Aggregate metrics looked fine - valid move rate was high, rewards were increasing. But the model was wasting moves on supports that didn’t help any of their attacking units.
Without trace inspection, I never would have found this. The fix (per-order reward weighting for void vs critical supports) was the single biggest improvement to model quality.
Claude Code as research assistant#
I built this entire project with Claude Code as my pair programmer. I developed a couple of custom skills that made running experiments way faster and easier:
- I created a custom Claude Code skill and an ablation script that allow Claude to generate experiment configs from a hypothesis and launch the experiments. The skill:
- Designs sweeps: Given a hypothesis, generates a sweep config with appropriate hyperparameter ranges
- Launches experiments: Handles the Modal deployment, WandB logging setup, and git state tracking
- Analyzes results: Pulls metrics from WandB, generates comparison charts, and summarizes findings
This made running targeted, well-defined experiments super frictionless. I could just tell Claude Code "run a sweep testing our 3 different reward functions" and it set up the config, launched the experiments, and was able to analyze the results in the context of the hypotheses. The ablation framework ensures reproducibility - every experiment is tied to a specific git commit, and the skill automatically handles the bookkeeping.
It’s almost a cliche at this point, but it’s true - the most important thing you can do as a researcher is make sure you always have an experiment running. Going out to dinner? Launch an experiment. Watching a movie? Launch an experiment. Going to sleep? You better wake up to experiment results.
This framework I built meant I had no excuses - even if I only had a couple free minutes I could still launch and learn something.
- An analyze experiment skill:
  - Updated with expected training dynamics, metrics, results, and learnings as I went
  - Created CLIs to interact with WandB and Axiom to pull data and logs
  - Integrated with the experiment tracker to close the loop
What’s next#
This post covered no-press Diplomacy - the version without negotiation. The model learned to play by observing board states and outcomes, never communicating with other players. For a solo project done over winter break, paid for thanks to some generous compute credits from Modal (thanks guys!) plus a couple hundred bucks out of pocket, I think this was a reasonable scope to tackle.
But the real game is full-press Diplomacy, where players negotiate, form alliances, and - crucially - lie and betray each other. I already discussed some of the interesting questions this full-press version opens. I would love an opportunity to pursue this setting with more compute and time.
I’m also curious to apply everything I learned from this project to other similarly strategic settings. If I had my choice, I’d love to work on RL for wargaming in general. If that sounds interesting to you, reach out!
The code for this project is available at github.com/bglick13/diplomacy-v2. If you’re interested in collaborating or have questions, reach out on Twitter @benglickenhaus.