In this installment of Playtesting, Alex Duffy shows why games might be the smartest approach to AI training right now. As the cofounder and CEO of Good Start Labs, he’s been exploring how game environments can improve AI capabilities across unexpected domains. His latest finding is surprising: Fine-tuning a model on the strategy game Diplomacy improved its performance on customer support and industrial operations benchmarks. Read on to learn why games generate the kind of data and behaviors that make AI better at the serious stuff, and what the Every team has learned from classics like StarCraft.—Kate Lee
Was this newsletter forwarded to you? Sign up to get it in your inbox.
It’s my job to make AI play games. One board game we’ve focused on at Good Start Labs is Diplomacy, a World War I simulation reportedly played by John F. Kennedy and Henry Kissinger. There are no dice and no luck. As everything shifts around you, all you can rely on are persuasion and strategy.
When we fine-tuned the Qwen3-235B model—an open-source model developed by the team at Chinese cloud computing company Alibaba Cloud—on thousands of rounds of Diplomacy, we found an improvement of more than 10 percent on other games, such as the card game Hanabi and the word game Wordle. More encouraging still, these improvements translated to other realms. The fine-tuned model also did better on Tau2, a benchmark that tests how well AI agents handle customer support conversations, and AssetOpsBench, IBM’s benchmark for industrial operations like equipment monitoring and maintenance.
It’s not a big leap to believe that improvement in one game could boost the model’s performance on others. But how does understanding WWI strategy make a model better at helping someone change their airline reservation or monitor equipment? Simple: Games reward specific behaviors. When you get good at those behaviors, they show up elsewhere.
When I asked my colleagues at Every what games had taught them, everyone had a similar story. “StarCraft taught me how to cook,” Every’s head of platform Willie Williams tells me, recalling the fast-paced, chess-like strategy game. “You have things that take different amounts of time, and you want them to land at the same time.” Our senior designer, Daniel Rodrigues, learned English from Pokémon before any classroom. AI editorial lead Katie Parrott became a more systematic thinker through board game mechanics and now applies that thinking to designing AI workflows.
This transfer of skills from games to other domains works for AI, too—and we can measure it. Diplomacy trains context-tracking, shifting priorities, and strategic communication. Customer support, where information is often incomplete and requests shift, needs the same capabilities.
We trained our model on Diplomacy in a reinforcement learning environment where you can clearly score whether the AI did something right. Labs are racing to build these kinds of environments because they do something that feeding the models static data can’t: They give models feedback on their decisions, teaching them to strategize, not just recall facts.
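To make that concrete, here’s a minimal sketch of what a scored, verifiable environment can look like. The game (a Wordle-style guesser), the class name, and the reward scheme are illustrative assumptions, not Good Start Labs’ actual setup:

```python
from dataclasses import dataclass, field

@dataclass
class WordleEnv:
    """A Wordle-style environment: every guess gets an unambiguous score."""
    secret: str
    max_turns: int = 6
    history: list = field(default_factory=list)

    def step(self, guess: str) -> tuple[str, float, bool]:
        # Mark each letter: "g" = right letter, right spot; "y" = in the
        # word elsewhere; "." = absent. (Duplicate letters are simplified.)
        feedback = "".join(
            "g" if g == s else ("y" if g in self.secret else ".")
            for g, s in zip(guess, self.secret)
        )
        self.history.append((guess, feedback))
        solved = guess == self.secret
        done = solved or len(self.history) >= self.max_turns
        # The reward is a computation, not a judgment: 1.0 for solving, else 0.0.
        return feedback, 1.0 if solved else 0.0, done

env = WordleEnv(secret="crane")
print(env.step("slate"))  # ("..g.g", 0.0, False)
```

The details of the game don’t matter. What matters is that the reward is computable, so the model can play millions of rounds and get scored on every single decision.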
When you train a model on text from the internet, it learns to predict words. If you train it in an environment with goals and feedback, the model starts to develop skills that look remarkably like strategy. It’s a glimpse of where AI training is headed: less scraping the web, more learning by doing.
When fine-tuned in the Diplomacy learning environment, the Qwen 235B model improved significantly on certain benchmarks unrelated to gameplay. (Graph courtesy of Alex Duffy.)
Write at the speed of thought
That gap between your brain and your fingers kills momentum. Monologue lets you speak naturally and get perfect text 3x faster, with your tone, vocabulary, and style kept intact. It auto-learns proper nouns, handles multilingual code-switching mid-sentence, and edits for accuracy. Free 1,000 words to start.
**The game is the curriculum**
“You become good at whatever the system rewards,” Every’s AI & I producer Rachel Braun tells me. Diplomacy rewards tracking context, planning responses, and navigating shifting alliances—exactly the capabilities with which labs like Anthropic, OpenAI, and DeepMind are trying to imbue their models.
It’s also why Arcee, a U.S.-based AI lab that develops open-source models, is using our Diplomacy environment to train its Trinity models. That includes its 400-billion-parameter flagship, Trinity Large, part of one of the largest open-source model families from an American lab. Because it’s open-source, people can build on top of it, adapt it to their problems, and make it better for everyone else.
What Arcee and other labs are betting on is a second way to improve AI—not by making models bigger, but by training them differently after they’re built. Instead of just feeding models more text to read, they put them in game-like situations where they practice tasks, get feedback on what worked, and develop skills they can apply elsewhere. The next big leap will come from combining learning by doing with ingesting more data.
AI researcher **Andrej Karpathy** put it this way: By training models in multiple games and tasks where you can score success, what are known as verifiable tasks, “the LLMs spontaneously develop strategies that look like ‘reasoning’ to humans.” The environment becomes the models’ curriculum, and whoever designs that curriculum shapes what the model becomes good at and how.
**The game is also the exam**
But games don’t just train models; they generate data no one else has. Our AI agents have played hundreds of thousands of rounds of the party game Bad Cards alongside 2 million real users. In the game, players get a prompt—something like, “What’s the secret ingredient in Grandma’s cookies?”—and compete to submit the funniest answer. Our agents pick punchlines and learn from the votes, generating data that shows people’s preferences for humor shift over time. That’s data that can’t be scraped from anywhere on the internet.
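It’s easy to picture how that data becomes training material. Here’s a hedged sketch of how one round’s votes might be turned into preference pairs, the format that preference-tuning methods such as DPO consume. The round structure and field names are assumptions for illustration, not the actual Bad Cards data format:

```python
from itertools import combinations

# One round of votes; structure and vote counts are illustrative.
round_data = {
    "prompt": "What's the secret ingredient in Grandma's cookies?",
    "answers": {"love": 14, "plutonium": 31, "a strongly worded letter": 9},
}

def preference_pairs(round_data: dict) -> list[dict]:
    """Turn vote counts into (chosen, rejected) pairs for preference tuning."""
    pairs = []
    for (ans_a, votes_a), (ans_b, votes_b) in combinations(
        round_data["answers"].items(), 2
    ):
        if votes_a == votes_b:
            continue  # a tie carries no preference signal
        chosen, rejected = (ans_a, ans_b) if votes_a > votes_b else (ans_b, ans_a)
        pairs.append(
            {"prompt": round_data["prompt"], "chosen": chosen, "rejected": rejected}
        )
    return pairs

for pair in preference_pairs(round_data):
    print(pair)
```

Multiply that by hundreds of thousands of rounds, with a vote signal that keeps shifting, and you get a living dataset that no static scrape can match.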
What users want from AI shifts faster than tests can measure, so static benchmarks become outdated quickly. Crowdsourced benchmarking project LM Arena just raised $150 million on this premise: The team is building an open platform for anyone to evaluate AI models by collecting feedback from human beings at scale.
Games are a natural fit for this kind of continuous evaluation. They generate large amounts of data about real preferences, continuously refreshed. As more people interact with AI through play, they learn how these tools work, and their feedback—on what’s funny, for example—makes the next model better.
**From StarCraft to the frying pan**
Willie didn’t set out to learn cooking from StarCraft—he was trying to win. But the skills he learned showed up in his kitchen anyway.
AI development is exhibiting the same pattern. If you set a clear goal, the skills to reach it will follow.
Only people can define what those goals should be: what counts as a good decision, what’s funny, and what matters. That’s subjective, inherently human work. Games are where we focus because they turn fuzzy goals into scorable outcomes—exactly what models need to learn. Diplomacy is just one game among thousands. Each one teaches something different, and we’re just beginning to discover what translates—how war strategy can improve customer support, or when skills from a science-fiction video game will show up in the kitchen.
We’re off to a good start.
*Alex Duffy is the cofounder and CEO of Good Start Labs, and a contributing writer.*
To read more essays like this, subscribe to Every, and follow us on X at @every and on LinkedIn.
We build AI tools for readers like you. Write brilliantly with **Spiral**. Organize files automatically with **Sparkle**. Deliver yourself from email with **Cora**. Dictate effortlessly with **Monologue**.
We also do AI training, adoption, and innovation for companies. Work with us to bring AI into your organization.
Get paid for sharing Every with your friends. Join our referral program.
For sponsorship opportunities, reach out to sponsorships@every.to.
Help us scale the only subscription you need to stay at the edge of AI. Explore open roles at Every.