What GPT-oss Leaks About OpenAI's Training Data
fi-le.net

19th of September 2025

OpenAI recently released their open-weights model, and here we’ll show how that inevitably leaks some information about their model training stack. On the way, we’ll show that GPT-5 was trained on phrases from adult websites.


What data does OpenAI train their models on? That is, of course, a well-protected trade secret, and one whose answer many parties have a vested interest in. Yet OpenAI inevitably leaked some information about it with the release of their open-weights model, GPT-oss.

While GPT-oss’s weights are openly available, the sources of training data are not clearly described in the model card. It is stated that GPT-oss was trained on a “text-only dataset with trillions of tokens, with a focus on STEM, coding, and gener…
