Microsoft-Tsinghua team trains 7B coding model that beats 14B rivals using only synthetic data

Source: the-decoder.com

Researchers show that an AI model trained on synthetic programming tasks alone can beat larger competitors. A key finding: task variety matters more than the number of solutions.

The research group’s experiments show a clear link between data volume and benchmark results: with 32,000 synthetic programming tasks, the model hits a pass rate of 43.7 percent. At 64,000 tasks, that climbs to 51.3 percent, then 57.2 percent at 128,000 tasks, and finally 62.7 percent at 192,000 tasks.
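The reported numbers suggest roughly log-linear scaling: each doubling of the task count buys a few points of pass rate. A quick back-of-the-envelope check, using only the figures quoted above:

```python
import math

# Pass rates reported in the article: task count -> pass rate (%)
results = {32_000: 43.7, 64_000: 51.3, 128_000: 57.2, 192_000: 62.7}

# Gain per doubling of tasks: delta(pass rate) / delta(log2(task count))
points = sorted(results.items())
gains = [
    (p2 - p1) / (math.log2(n2) - math.log2(n1))
    for (n1, p1), (n2, p2) in zip(points, points[1:])
]
for (n1, _), (n2, _), g in zip(points, points[1:], gains):
    # Each step lands in the same rough band (~6-9 points per doubling)
    print(f"{n1} -> {n2} tasks: {g:.1f} points per doubling")
```

The per-doubling gains stay in a similar band across the whole range, which is consistent with the steady scaling the figure below describes, though with only four data points this is an eyeball fit, not a law.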

Model performance scales steadily with the number of synthetic tasks. | Image: Wu et al.

Given the same compute budget, task variety matters more than the number of solutions per task. A dataset with 64,000 different tasks and one solution each outperforms one with 16,000 tasks and four solutions each, or 8,000 tasks with eight solutions each.

Task diversity beats solution quantity

Building powerful code models often stalls due to limited training data. Existing collections of competition tasks get reused over and over and aren’t enough to drive further improvements. Previous synthetic approaches simply rewrote existing tasks, limiting their diversity to the original templates.

The system generates high-quality training data in four steps. After extracting and evolving programming features (1), it creates tasks and generates solutions (2) and test cases (3) using LLMs. A two-stage validation process ensures the synthetic data is correct. | Image: Wu et al.

The new pipeline, called SynthSmith, takes a different approach by generating tasks, solutions, and test cases from scratch. The process starts by pulling features relevant to coding competitions from 10,000 existing code examples, like algorithms, data structures, and optimization techniques. Through an evolution process, the system expands the pool from 27,400 to nearly 177,000 algorithm entries. The pipeline then combines these building blocks into new programming tasks in different styles.
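The evolution step can be sketched roughly as follows. The mutation and crossover rules, the `evolve_features` and `make_task` helpers, and the feature strings are all hypothetical stand-ins; the paper's actual operators are not described in the article:

```python
import itertools
import random

def evolve_features(seed, rounds=2):
    """Grow a feature pool by mutating entries and recombining pairs.

    Toy stand-in for SynthSmith's evolution step, which expanded
    ~27,400 entries to nearly 177,000.
    """
    pool = set(seed)
    for _ in range(rounds):
        # Mutate: derive a harder variant of each existing feature
        mutated = {f"constrained {f}" for f in pool}
        # Crossover: merge pairs of features into composite ones
        combined = {
            f"{a} with {b}" for a, b in itertools.combinations(sorted(pool), 2)
        }
        pool |= mutated | combined
    return pool

def make_task(pool):
    """Combine two sampled features into a task prompt for an LLM."""
    a, b = random.sample(sorted(pool), 2)
    return f"Write a competition problem combining {a} and {b}."

seed = ["binary search", "segment tree", "two pointers"]
pool = evolve_features(seed)
prompt = make_task(pool)
print(len(pool), "features;", prompt)
```

The point of the sketch is the combinatorial growth: mutation adds variants linearly, but pairwise recombination grows the pool quadratically, which is how a modest seed set can fan out into a much larger and more diverse task space.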

Quality control happens in two stages. First, the system determines correct test outputs through majority voting across multiple candidate solutions. Then it validates the best solution against a holdout test set to prevent overfitting.
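The two validation stages can be sketched as follows; `majority_outputs`, `passes_holdout`, and the toy candidate functions are illustrative stand-ins for the pipeline's LLM-generated programs:

```python
from collections import Counter

def majority_outputs(candidates, test_inputs):
    """Stage 1: fix each test input's reference output by majority vote
    across candidate solutions."""
    ref = {}
    for x in test_inputs:
        votes = Counter(c(x) for c in candidates)
        ref[x] = votes.most_common(1)[0][0]
    return ref

def passes_holdout(solution, holdout):
    """Stage 2: validate a solution against a holdout test set it never
    saw during voting, to guard against overfitting."""
    return all(solution(x) == y for x, y in holdout.items())

# Three toy candidate solutions for "absolute value"; one is buggy.
cands = [abs, lambda x: x if x >= 0 else -x, lambda x: x]

ref = majority_outputs(cands, [-3, 0, 5])
print(ref)  # buggy candidate is outvoted: {-3: 3, 0: 0, 5: 5}

print(passes_holdout(abs, {-7: 7, 2: 2}))       # True
print(passes_holdout(cands[2], {-7: 7, 2: 2}))  # False: bug caught
```

Majority voting works here because independent wrong solutions tend to disagree with each other, while correct ones agree; the holdout check then catches a candidate that merely memorized the voted outputs.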

Smaller model beats larger competitors

X-Coder, a 7-billion-parameter model, hits an average pass rate of 62.9 percent over eight attempts on LiveCodeBench v5. On the newer v6 version, it scores 55.8 percent, outperforming both DeepCoder-14B-Preview and AReal-boba2-14B, which have 14 billion parameters and run on a stronger base model.
