“Your fleet will burn in the Black Sea tonight.”
As the message from DeepSeek’s new R1 model flashed across the screen, my eyes widened, and I watched my teammates do the same. An AI had just decided, unprompted, that aggression was the best course of action.
Today we are launching (and open-sourcing!) AI Diplomacy, which I built in part to evaluate how well different LLMs could negotiate, form alliances, and, yes, betray each other in an attempt to take over the world (or at least Europe in 1901). But watching R1 lean into role-play, OpenAI’s o3 scheme and manipulate other models, and Anthropic’s Claude often stubbornly opt for peace over victory revealed new layers to their personalities, and spoke volumes about the depth of their sophistication. Placed in an open-ended battle of wits, these models collaborated, bickered, threatened, and even outright lied to one another.
AI Diplomacy is more than just a game. It’s an experiment that I hope will become a new benchmark for evaluating the latest AI models. Everyone we talk to, from colleagues to Every’s clients to my barber, has the same questions on their mind: “Can I trust AI?” and “What’s my role when AI can do so much?” The answer to both is hiding in great benchmarks. They help us learn about AI and build our intuition, so we can wield this extremely powerful tool with precision.
We are what we measure
Most benchmarks are failing us. Models have progressed so rapidly that they now routinely ace the rigid, quantitative tests that were once considered gold-standard challenges. AI infrastructure company Hugging Face, for example, acknowledged this when it recently took down its popular LLM Leaderboard. “As model capabilities change, benchmarks need to follow!” an employee wrote. Researchers and builders throughout AI have taken note: When Claude 4 launched last month, one prominent researcher tweeted, “I officially no longer care about current benchmarks.”