Data science is a field dedicated to transforming raw data into meaningful, actionable insights, playing an essential role in solving real-world challenges. Businesses often depend on data-driven insights to make pivotal strategic decisions. However, the data science process is frequently complex, demanding a high level of expertise in fields like computer science and statistics. This workflow consists of many time-intensive activities, from interpreting various documents to performing complex data processing and statistical analysis.
To streamline this complex workflow, recent research has focused on using off-the-shelf large language models (LLMs) to create autonomous data science agents. The goal of these agents is to convert natural language questions into executable code for a desired task. Despite significant progress, however, current data science agents have several limitations that hinder their practical use. A major issue is their heavy reliance on well-structured data, like CSV files or relational databases. This limited focus ignores the valuable information contained in the diverse, heterogeneous data formats, such as JSON, unstructured text, and markdown files, that are common in real-world applications. Another challenge is that many data science problems are open-ended and lack ground-truth labels, making it difficult to verify whether an agent's reasoning is correct.
To address these challenges, we present DS-STAR, a new agent designed to solve data science problems. DS-STAR introduces three key innovations: (1) a data file analysis module that automatically extracts context from varied data formats, including unstructured ones; (2) a verification stage in which an LLM-based judge assesses the plan's sufficiency at each step; and (3) a sequential planning process that iteratively refines the initial plan based on that feedback. This iterative refinement allows DS-STAR to handle complex analyses that draw verifiable insights from multiple data sources. We demonstrate that DS-STAR achieves state-of-the-art performance on challenging benchmarks like DABStep, KramaBench, and DA-Code, and it especially excels at tasks involving diverse, heterogeneous data files.
DS-STAR
The DS-STAR framework operates in two main stages. First, it automatically examines all files in a directory and creates a textual summary of their structure and contents. This summary becomes a vital source of context for tackling the task at hand.
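The post describes this analysis stage only at a high level, so the following is a minimal, hypothetical sketch of the idea in Python. It assumes an `llm(prompt)` helper (not part of DS-STAR's public description) and simply asks the model to describe a sample of each file, whatever its format:

```python
from pathlib import Path

def analyze_data_files(directory: str, llm) -> str:
    """Summarize the structure and contents of every file in `directory`."""
    summaries = []
    for path in sorted(Path(directory).rglob("*")):
        if not path.is_file():
            continue
        # Decode leniently and truncate so binary or very large files
        # still yield a manageable prompt.
        sample = path.read_bytes()[:2000].decode("utf-8", errors="replace")
        prompt = (
            f"Describe the structure and contents of `{path.name}` "
            f"(format: {path.suffix or 'unknown'}) based on this sample:\n{sample}"
        )
        summaries.append(f"File {path.name}:\n{llm(prompt)}")
    return "\n\n".join(summaries)
```

The concatenated descriptions can then serve as shared context for the downstream planning agents.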
Second, DS-STAR engages in a primary loop of planning, implementing, and verifying. The Planner agent first creates a high-level plan, which the Coder agent then transforms into a code script. Subsequently, the Verifier agent, an LLM-based judge prompted to determine whether the current plan is adequate, evaluates the code's effectiveness in solving the problem. If the judge finds the plan insufficient, DS-STAR refines it by altering or adding steps (as determined by the Router agent) and then repeats the cycle. Importantly, DS-STAR mimics how an expert analyst uses tools like Google Colab to build a plan sequentially, reviewing intermediate results before proceeding. This iterative cycle continues until a plan is deemed satisfactory or the maximum number of rounds (10) is reached, at which point the final code is delivered as the solution.
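The sketch below captures only the control flow of this loop; the agent names (`planner`, `coder`, `verifier`, `router`) and the `run` executor are hypothetical callables standing in for the LLM-backed agents, whose actual prompts are not reproduced here:

```python
MAX_ROUNDS = 10  # cap on refinement rounds, per the description above

def ds_star(task: str, data_context: str, planner, coder, verifier, router, run):
    plan = planner(task, data_context)    # Planner: initial high-level plan
    code = ""
    for _ in range(MAX_ROUNDS):
        code = coder(plan, data_context)  # Coder: plan -> executable script
        result = run(code)                # execute and capture intermediate output
        if verifier(task, plan, result):  # Verifier: LLM judge on plan sufficiency
            break
        # Router: decide whether to fix a faulty step or append a new one,
        # then repeat the cycle with the refined plan.
        plan = router(task, plan, result)
    return code  # final code after a sufficient plan or the round limit
```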
Evaluation
To evaluate DS-STAR's effectiveness, we compared its performance against existing state-of-the-art methods (AutoGen, DA-Agent) on three well-regarded data science benchmarks: DABStep, KramaBench, and DA-Code. These benchmarks evaluate performance on complex tasks, such as data wrangling, machine learning, and visualization, that draw on multiple data sources and formats.
The results show that DS-STAR substantially outperforms AutoGen and DA-Agent in all test scenarios. Compared to the best alternative, DS-STAR raised accuracy from 41.0% to 45.2% on DABStep, from 39.8% to 44.7% on KramaBench, and from 37.0% to 38.5% on DA-Code. Notably, DS-STAR also secured the top rank on the public leaderboard for the DABStep benchmark (as of 9/18/2025). On both easy tasks (where the answer lies in a single file) and hard tasks (requiring multiple files), DS-STAR consistently surpasses competing baselines, demonstrating its superior ability to work with multiple, heterogeneous data sources.
In-depth analysis of DS-STAR
Next, we conducted ablation studies to verify the effectiveness of DS-STAR's individual components and to analyze the impact of the number of refinement rounds, measured as the iterations required to generate a sufficient plan.
Data File Analyzer: This agent is essential for high performance. Without the descriptions it generates (Variant 1), DS-STAR’s accuracy on difficult tasks within the DABStep benchmark sharply dropped to 26.98%, underscoring the importance of rich data context for effective planning and implementation.
Router: The Router agent's ability to decide whether a new step is needed or an existing step should be corrected is vital. When we removed it (Variant 2), DS-STAR could only append new steps sequentially, leading to worse performance on both easy and hard tasks. This demonstrates that correcting mistakes in a plan is more effective than continually adding potentially flawed steps; a simplified sketch of the routing decision is shown below.
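As a hedged illustration of that routing decision, the following sketch again assumes a hypothetical `llm(prompt)` helper and an invented reply format; DS-STAR's actual prompt and parsing are not public:

```python
def route(task: str, plan: list[str], feedback: str, llm) -> list[str]:
    """Append a new step or correct a faulty one, based on verifier feedback."""
    decision = llm(
        f"Task: {task}\nPlan: {plan}\nVerifier feedback: {feedback}\n"
        "Reply 'ADD: <new step>' to append a step, or "
        "'FIX <index>: <revised step>' to correct an existing one."
    )
    if decision.startswith("ADD:"):
        return plan + [decision[len("ADD:"):].strip()]
    # Parse "FIX <index>: <revised step>" and replace the step in place.
    head, _, step = decision.partition(":")
    revised = list(plan)
    revised[int(head.split()[1])] = step.strip()
    return revised
```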
Generalizability Across LLMs: We also tested DS-STAR’s adaptability by using GPT-5 as the base model. This yielded promising results on the DABStep benchmark, indicating the framework’s generalizability. Interestingly, DS-STAR with GPT-5 performed better on easy tasks, while the Gemini-2.5-Pro version performed better on hard tasks.
An analysis of the refinement process: The figure below shows that difficult tasks naturally require more iterations. On the DABStep benchmark, hard tasks needed an average of 5.6 rounds to solve, whereas easy tasks required only 3.0 rounds. Furthermore, over half of the easy tasks were completed in just a single round.
Conclusion
In this work, we introduced DS-STAR, a new agent that can autonomously solve data science problems. The framework is defined by two core innovations: the automatic analysis of diverse file formats and an iterative, sequential planning process that uses a novel LLM-based verification system. DS-STAR establishes a new state-of-the-art on the DABStep, KramaBench, and DA-Code benchmarks, outperforming the best alternative. By automating complex data science tasks, DS-STAR has the potential to make data science more accessible for individuals and organizations, helping to drive innovation across many different fields.
Acknowledgements
We would like to thank Jiefeng Chen, Jinwoo Shin, Raj Sinha, Mihir Parmar, George Lee, Vishy Tirumalashetty, Tomas Pfister and Burak Gokturk for their valuable contributions to this work.