German Commons shows that big AI datasets don’t have to live in copyright limbo
the-decoder.com

German Commons is now the largest openly licensed German text dataset, offering a foundation for building legally compliant German language models.

Most large language models train on web data with unclear copyright status. German Commons takes a different approach: every text comes from institutions with clear, verifiable licensing. The project, led by the University of Kassel, the University of Leipzig, and hessian.AI, relied on the licensing information provided by these sources without additional verification. According to the accompanying study, the team collected 154.56 billion tokens from 35.78 million documents.
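For illustration only, here is a minimal sketch of how a corpus like this could be consumed if it were distributed via the Hugging Face Hub with per-record license metadata. The dataset identifier and field names below are assumptions for the sake of the example, not details confirmed by the article.

```python
# Hypothetical sketch: streaming a large openly licensed corpus from the
# Hugging Face Hub. The dataset ID, split, and column names are assumed
# for illustration; they are not the project's confirmed layout.
from datasets import load_dataset

# Stream records instead of downloading the full ~155B-token corpus.
dataset = load_dataset(
    "example-org/german-commons",  # hypothetical identifier
    split="train",
    streaming=True,
)

for i, record in enumerate(dataset):
    # Assumed fields: each record carries its text plus license metadata,
    # which is what makes downstream license-aware filtering possible.
    print(record.get("license"), record.get("text", "")[:80])
    if i >= 4:  # inspect only the first few records
        break
```

Keeping the license identifier attached to every record, rather than only documenting it at the corpus level, is what would let model builders filter or audit training data for legal compliance.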

The dataset pulls from 41 sources across seven categories: web content, political documents, legal texts, news, business, cultural, and s…
