michmech/lemmatization-lists: Machine-readable lists of lemma-token pairs in 23 languages.

Lemmatization Lists

These are large-coverage, machine-readable lists of lemma/token pairs in several languages, which I have collected (legally) from various sources, mostly as part of my work on the Global Glossary project. I use them for query expansion during full-text searches: if a user searches for the lemma walk, the query is expanded to also search for the tokens walking, walked etc.
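As a minimal sketch of that expansion step (assuming a lemma-to-tokens mapping has already been built from one of the lists below; the names and data here are illustrative, not from the repository):

```python
# Hypothetical mapping built from a lemmatization list: lemma -> tokens.
lemma_to_tokens = {
    "walk": {"walking", "walked", "walks"},
}

def expand_query(term: str) -> list[str]:
    """Return the search term plus any tokens that share it as a lemma."""
    return [term, *sorted(lemma_to_tokens.get(term, set()))]

print(expand_query("walk"))  # ['walk', 'walked', 'walking', 'walks']
```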

These are plain text files (zipped). Each line contains one lemma/token pair separated by a tab character in this sequence: lemma, tab, token. The files are encoded in UTF-8 with Windows-style line breaks.
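A short sketch of loading one of the unzipped files into such a mapping, following the format described above (the filename is an assumption for illustration, not verified against the repository):

```python
from collections import defaultdict

def load_lemma_map(path: str) -> dict[str, set[str]]:
    """Build a lemma -> tokens mapping from a lemmatization list.

    Each line is "lemma<TAB>token"; the files are UTF-8 with
    Windows-style (CRLF) line breaks, which text mode normalizes.
    """
    lemma_to_tokens: dict[str, set[str]] = defaultdict(set)
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\r\n")
            if not line:
                continue
            lemma, token = line.split("\t")
            lemma_to_tokens[lemma].add(token)
    return lemma_to_tokens

# Hypothetical filename for the unzipped English list.
lemmas = load_lemma_map("lemmatization-en.txt")
print(sorted(lemmas["walk"]))
```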

  • Asturian (ast) (108,792 pairs)
  • Bulgarian (bg) (30,323 pairs)
  • Catalan (ca) (591,534 pairs)
  • Czech (cs) (36,400 pairs)
  • English (en) (41,760 pairs)
  • Estonian (et) (…
