The First Roblox Studio–native Evaluation Framework and Benchmark for Assessing AI Assistant Performance
The Challenge
Creators leverage Roblox Studio’s AI Assistant to accelerate Roblox experience development, but evaluating how well the AI Assistant and its underlying large language models (LLMs) perform on interactive development tasks remains a challenge. While traditional coding and agentic benchmarks focus on isolated, stateless tasks, Roblox development workflows demand purpose-built evaluation methods that measure performance on tasks such as reasoning across 3D hierarchies, managing multiplayer client-server interactions, and making changes to a stateful world.
To address this challenge, we’re introducing OpenGameEval, an open-source evaluation framework and native benchmark dataset that evaluates LLM-based AI Assistant performance in a reproducible Roblox Studio environment. We hope that OpenGameEval, along with its public leaderboard, will offer a unique testing ground for the broader AI research community to evaluate core model capabilities related to tool use, agentic reasoning, and long-horizon task solving.
*OpenGameEval’s leaderboard provides a current snapshot of model effectiveness for Roblox development.*
The Solution
The OpenGameEval evaluation framework is engineered to replicate the Roblox development environment. Each evaluation executes in an environment that simulates edit-time and play-time behavior in Roblox Studio, ensuring that observed behavior, such as physics, networking, and multiplayer interaction, is identical to what a creator or player would experience.
The framework incorporates input simulation, allowing us to programmatically mimic the complex player interactions necessary for evaluating development tasks that require user actions (e.g., button clicks, keyboard inputs, and camera manipulation).
The entire evaluation architecture is encapsulated behind a unified, simple-to-use API. This abstraction allows research partners to benchmark diverse LLM-based agentic systems performing identical benchmark tasks without modifying the underlying environment harness.
The OpenGameEval Benchmark Dataset
The OpenGameEval benchmark dataset is an open-source, manually curated suite of 47 test cases built on top of this framework through a rigorous, iterative, and fully human-verified process. We collect prompts from domain experts, build tailored Roblox experience environments to provide necessary context to AI models, manually create evaluations and authoritative solutions, and subject all scenarios to extensive human review to guarantee comprehensiveness, generalizability, and stability.
The initial release contains scenarios derived from common Roblox development tasks, including game mechanics, environment building, character animation, interface design, and sound design. The OpenGameEval benchmark utilizes executable unit tests, aligning its scoring methodology with industry-standard metrics like pass@k, cons@k, and all@k to quantify a model’s performance on the dataset. Research partners can replicate these metrics on their own after gathering evaluation results from OpenGameEval runs.
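For readers less familiar with these metrics, pass@k is conventionally reported with the standard unbiased estimator shown below; this is the common formulation rather than a definition taken from this post, so the OpenGameEval repository remains the authority on the exact scoring used for the leaderboard. With n sampled attempts per task, of which c pass the executable tests:

$$
\text{pass@}k \;=\; \mathbb{E}_{\text{tasks}}\!\left[\,1 - \frac{\binom{n-c}{k}}{\binom{n}{k}}\,\right]
$$

cons@k typically scores the consensus (majority-vote) outcome across k samples, and all@k requires all k samples to pass; both can be computed from the same per-task attempt records.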
Unlike typical function-level coding challenges, OpenGameEval enables end-to-end testing of core components. A successful model must master several distinct skills, such as navigating the instance hierarchy, analyzing object state, and deriving the user’s intent from context within the environment.
Multistep Tasks and Contextual Variation
Roblox coding tasks often require multiple steps: a model must navigate the existing context of an experience and investigate several intertwined scripts and instances to achieve the desired outcome. In the example below, OpenGameEval verifies multiple factors within a sandbox representing a real game instance environment to ensure that a model correctly accounts for the related scripts, client/server interaction, and the original intent of the prompt.
**User prompt:** Implement a health regeneration system that starts two seconds after taking damage and regenerates at 10 health per second.

**Placefile context:** A laser tag experience with weapons, teams, and core play mechanics already set up.

**Expected reasoning steps:**
1. Contextualize: Explore the experience with different search tools, which often requires multiple search steps with adjusted scopes:
   - Identify the existing scripts for damage and player health, and understand their logic.
   - Reason about the best location for the health regeneration script (e.g., on the server or the client? As a section in the core game script or as a separate player script?).
2. Implementation: Write Luau code using the appropriate APIs to manipulate player health. The script needs to:
   - Capture the right timing for when regeneration should start and how it should proceed.
   - Generalize to all damage sources, not just a particular damage script.

**Verifiable evaluation:** The executable test (run in the sandboxed game instance) triggers a damage event on the test player and verifies that:
1. Health regeneration is correctly handled on the server and made visible on the client.
2. Regeneration does not begin before the two-second delay.
3. Health regenerates at a rate of 10 health per second.
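To make the task concrete, here is a minimal sketch of the kind of server-side Luau a model might produce for this prompt. It is illustrative only, not OpenGameEval’s reference solution, and it assumes the placefile applies damage through the standard Humanoid health APIs; a complete solution would also account for the default character health regeneration.

```lua
-- Illustrative sketch only; not the benchmark's reference solution.
-- Assumes damage is applied through Humanoid.Health, regardless of which script deals it.
local Players = game:GetService("Players")

local REGEN_DELAY = 2  -- seconds to wait after the most recent damage
local REGEN_RATE = 10  -- health restored per second

local function watchHumanoid(humanoid)
	local lastDamageTime = -math.huge
	local lastHealth = humanoid.Health

	-- Record the time of the most recent damage, from any source.
	humanoid.HealthChanged:Connect(function(health)
		if health < lastHealth then
			lastDamageTime = os.clock()
		end
		lastHealth = health
	end)

	-- Regenerate on the server so the change replicates to every client.
	task.spawn(function()
		while humanoid.Parent do
			task.wait(0.1)
			if os.clock() - lastDamageTime >= REGEN_DELAY
				and humanoid.Health > 0
				and humanoid.Health < humanoid.MaxHealth then
				humanoid.Health = math.min(humanoid.MaxHealth, humanoid.Health + REGEN_RATE * 0.1)
			end
		end
	end)
end

Players.PlayerAdded:Connect(function(player)
	player.CharacterAdded:Connect(function(character)
		watchHumanoid(character:WaitForChild("Humanoid"))
	end)
end)
```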
To effectively test an AI model’s robustness and contextual understanding, tasks are presented under diverse environmental conditions. For instance, the “scripting a four-way traffic light” task includes three contextual variations based on the beginning state of the development environment.
**User prompt:** Write me a script for a simple four-way traffic light.

**Variation 1:** An empty placefile containing only a baseplate. A traffic light model named TrafficLight is available without a script. The model needs to explore the different parts within the TrafficLight model and find a way to toggle their on/off states.

**Variation 2:** A placefile with a suburban setup. Multiple traffic light models named Traffic Signal are available without scripts. The model needs to first search the experience to correctly identify the traffic lights among the other instances. The traffic light models are structured with different logic than in Variation 1, so the model needs to implement a solution unique to this experience.

**Variation 3:** A placefile with a suburban setup. Multiple traffic light and pedestrian signal models are available. While the scripts for the traffic lights are removed, the scripts for the pedestrian signals remain. The model needs to identify the difference between traffic lights and pedestrian signals and make changes to the correct objects. Does the existence of pedestrian signals confuse the model or help it?
Traffic light in a baseplate.
Traffic light in an experience with assets and scripts.
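As a rough illustration of Variation 1, a minimal Luau cycle script might look like the sketch below. The part names (Red, Yellow, Green) and the use of Material to toggle the on/off state are assumptions about the TrafficLight model’s structure, which the benchmark intentionally varies, and a complete four-way solution would additionally coordinate the opposing pairs of directions.

```lua
-- Hypothetical sketch for Variation 1; assumes the TrafficLight model exposes
-- parts named "Red", "Yellow", and "Green" whose Material indicates on/off state.
local trafficLight = workspace:WaitForChild("TrafficLight")

local lights = {
	Red = trafficLight:WaitForChild("Red"),
	Yellow = trafficLight:WaitForChild("Yellow"),
	Green = trafficLight:WaitForChild("Green"),
}

local function setActive(activeName)
	for name, part in pairs(lights) do
		-- Neon reads as "on"; SmoothPlastic reads as "off".
		part.Material = (name == activeName) and Enum.Material.Neon or Enum.Material.SmoothPlastic
	end
end

-- Simple fixed-duration cycle for a single direction: green -> yellow -> red.
while true do
	setActive("Green")
	task.wait(8)
	setActive("Yellow")
	task.wait(2)
	setActive("Red")
	task.wait(8)
end
```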
We’re interested in understanding models’ behavior on seemingly similar tasks in different environments with varying levels of context and complexity.
Early Results
The OpenGameEval benchmark offers empirical data to diagnose the current state of AI assistants in interactive development. Test cases are designed to differentiate between capabilities in atomic operations and in operations that require multistep contextual reasoning.
Our initial testing revealed that models generally excel at atomic operations but struggle with contextual reasoning. They achieve the highest success rates on tasks requiring a single, direct instance manipulation, like setting a particle emitter or modifying a player’s jump power. Leading models come close to perfect scores on these tasks, demonstrating proficiency in syntactic code generation and basic API knowledge.
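For a sense of scale, the atomic end of the spectrum often reduces to one or two property assignments, as in the short Luau sketch below; the instance paths are hypothetical examples rather than tasks taken verbatim from the benchmark.

```lua
-- Illustrative single-step edits of the kind models already handle reliably.
-- The instance paths here are hypothetical, not taken from the benchmark.
local emitter = workspace.Fountain.Spray  -- a ParticleEmitter
emitter.Rate = 50                         -- particles per second
emitter.Color = ColorSequence.new(Color3.fromRGB(0, 170, 255))

-- Raise a spawned character's jump strength on the server.
local humanoid = workspace.PlayerCharacter:FindFirstChildOfClass("Humanoid")
if humanoid then
	humanoid.UseJumpPower = true  -- honor JumpPower rather than JumpHeight
	humanoid.JumpPower = 100
end
```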
In sharp contrast, a substantial gap persists in tasks demanding coordinated action, contextual filtering, and deep API integration. Examples like the health regeneration system and the four-way traffic light, above, continue to yield very low pass@k scores across all models.
Rapid Evolution
As models continue to evolve, we expect to see these gaps close, but we’ve already seen interesting developments. In one evaluation task that prompts a model to “change the Roblox logo like a cube to be green,” we initially saw models universally fail because the target object’s name did not explicitly contain the word logo or Roblox.
More recent evaluations show that some models are now successfully solving this case by moving beyond simple keyword matching to structural reasoning, utilizing close instance inspection (including properties, not just the name) and coordinated inference to identify the object most likely to represent the “Roblox logo.”
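A hedged sketch of what that kind of property-based inspection could look like in Luau appears below; the selection heuristics and instance properties are assumptions for illustration, since the post does not publish the task’s placefile.

```lua
-- Hypothetical sketch of structural reasoning: instead of matching on the name
-- "logo", look for a cube-like part that carries a decal, then recolor it green.
local candidates = {}
for _, inst in ipairs(workspace:GetDescendants()) do
	if inst:IsA("BasePart") then
		local size = inst.Size
		local isCubeLike = math.abs(size.X - size.Y) < 0.1 and math.abs(size.Y - size.Z) < 0.1
		local hasDecal = inst:FindFirstChildWhichIsA("Decal") ~= nil
		if isCubeLike and hasDecal then
			table.insert(candidates, inst)
		end
	end
end

-- Recolor the most likely candidate; a real agent would rank candidates more carefully.
if #candidates > 0 then
	candidates[1].Color = Color3.fromRGB(0, 255, 0)
end
```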
What’s next?
We’re committed to continually expanding and maintaining OpenGameEval to track rapid advancements in the field of AI. The current OpenGameEval framework and benchmark are just the foundation. Our strategic roadmap focuses on three core goals to ensure that the platform remains the standard for evaluating agentic AI assistants in Roblox Studio:
Empower Creators Through Performance Transparency: We will routinely update the leaderboard and benchmark dataset while offering clear, transparent summaries that help creators compare models and understand performance across code generation, asset insertion, and tool orchestration.
Accelerate Research and Development: We will maintain and expand the API adapter to standardize evaluation, enabling research partners to run fast, frictionless, reproducible benchmarks for developing next-generation AI assistants.
Take a Community-Driven Approach: We will continue to integrate real-world creator intents and actively solicit community contributions to ensure the benchmark remains representative of cutting-edge Roblox development and advancing AI capabilities.
Together, the framework, dataset, and public leaderboard make OpenGameEval a transparent, collaborative foundation for evaluating AI-powered creation in Roblox development, helping the entire creator community measure progress, share insights, and build better assistants.
***Acknowledgments:*** The OpenGameEval project is the result of a significant collaborative effort across teams at Roblox. Special thanks to Vlad Shcherban, Sean Dunigan, and Jack Lu, who helped build the evaluation harness, and Isabella Ting and Brent Vincent, whose insights were instrumental in shaping this release. We’re deeply grateful to our partner teams and former team members, as this work reflects their collective expertise and commitment.