German Commons is now the largest openly licensed German text dataset, offering a foundation for building legally compliant German language models.

Most large language models are trained on web data whose copyright status is unclear. German Commons takes a different approach: every text comes from institutions with clear, verifiable licensing. The project, led by the University of Kassel, the University of Leipzig, and hessian.AI, relied on the licensing information provided by these sources and did not verify it independently. According to the team's study, the dataset comprises 154.56 billion tokens from 35.78 million documents.

The dataset pulls from 41 sources across seven categories: web content, political documents, legal texts, news, business, cultural, and s…
