Synth: The New Data Frontier
pleias.fr·2h·

Since GPT-3, language models have mostly been trained on large collections of web archives. Over the past year, frontier AI labs have reconsidered this approach as they move toward reasoning and agentic models, which require large amounts of *traces* of thoughts, actions, or tool calls that are rarely written down.

We release SYNTH, a fully generalist synthetic dataset that represents a fundamental break from the common pre-training paradigm: what if we trained *for* reasoning and focused on assimilating the knowledge and skills that matter? Classic benchmarks already operate under this assumption: MMLU, GSM8K, and MATH are all ultimately based on collections of high school exercises.

SYNTH stems from 50,000 *vital* Wikipedia articles, expanded into a …
