Casho Lab December 1, 2025
To build the best multilingual systems, you must fundamentally understand what languages are. Simplicity in design is often good, but comprehensiveness is a must when developing systems that support everyone. So Casho Lab built the most comprehensive language selector libraries, which handle the cases for every language: not just every common language, or every language we know, but over 7000 languages. We open sourced all the data and components, and made a public API to serve them.
What is a language?
A simple question leads down many pathways. A language is a set of words, phrases, and rules shared by a community of people. Some languages are spoken and some are written. Within a single country there can be multiple languages, multiple dialects of the same language, or multiple individual languages grouped under one macrolanguage.
Edge cases of languages
There are many edge cases for languages:
- Two people can write the same language but speak different varieties (Cantonese Chinese, Mandarin Chinese)
- Two languages can sound very similar when spoken but be written differently (Urdu/Hindi)
- A single language can have multiple official scripts (Serbian, Acehnese)
- A language can have subregions that use different scripts, speak differently, or use different grammar (Chinese - Taiwan, Chinese - China)
- Languages can be written in different directions from one another (English, Arabic)
- The same language can be part of different regions (English, Chinese)
- A region can have many official languages (South Africa)
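These cases are why a bare language code is not enough to identify a selection. As a minimal sketch, the snippet below splits a BCP 47-style tag into language, script, and region subtags. The `ParsedTag` type and `parseTag` function are hypothetical names used for illustration, not part of the published library, and the parser only handles the simple language-script-region shape (no variants or extensions).

```typescript
// Hypothetical shape for illustration; not the library's actual data model.
interface ParsedTag {
  language: string; // ISO 639 code, e.g. "zh"
  script?: string;  // ISO 15924 code, e.g. "Hant"
  region?: string;  // ISO 3166-1 alpha-2 or UN M.49 code, e.g. "TW"
}

// Parses only the simple language[-script][-region] shape of a BCP 47 tag.
function parseTag(tag: string): ParsedTag | null {
  const match = /^([a-z]{2,3})(?:-([a-z]{4}))?(?:-([a-z]{2}|\d{3}))?$/i.exec(tag);
  if (match === null) return null;

  const language = match[1];
  const script = match[2];
  const region = match[3];
  if (language === undefined) return null;

  const parsed: ParsedTag = { language: language.toLowerCase() };
  if (script !== undefined) {
    // Script codes are title case by convention: "Hant", "Cyrl".
    parsed.script = script.charAt(0).toUpperCase() + script.slice(1).toLowerCase();
  }
  if (region !== undefined) {
    // Region codes are upper case by convention: "TW", "RS".
    parsed.region = region.toUpperCase();
  }
  return parsed;
}

const taiwan = parseTag("zh-Hant-TW");    // Chinese, Traditional script, Taiwan
const serbianLatin = parseTag("sr-Latn"); // Serbian written in Latin script
```

Keeping script and region as separate optional fields lets a selector tell apart two entries that share a language code but differ in how they are written or where they are used.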
Requirements for a comprehensive language selector
There are many requirements to make a language selector that is both comprehensive and accessible.
- We cannot assume a person knows English -> we must show each language by its own name (endonym)
- Any person must be able to find the selector -> we use icons to represent the concept of "languages"
- Someone must be able to select a language even if they don’t speak English or that language -> we use flags
- We must be able to differentiate between regions and scripts.
- We must have the data for all 7000+ languages
Additional Requirements
- We must support common frameworks, so people can use it.
- The styling must be modern, easily navigable, and aesthetically pleasing.
Data collection
We got to work collecting the proper data for these systems. What data? ISO 639-1 covers only around 200 languages (not enough), so we needed to add ISO 639-2 and ISO 639-3. We also had to interpret region codes according to BCP 47. So we gathered data on the roughly 250 "regions" of the world and their flags, along with the ISO 15924 script data.
This is what was publicly available, but it still wasn’t enough. We also needed endonyms (language names in the language itself), flag order for languages, flag order for scripts, language meta information (spacing, directionality), common language-region pairings, common language-script pairings, script names in the language itself, and region names for each specific language-region pairing. This data is not available from any common source, so we had to acquire it ourselves.
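To make that list concrete, here is a sketch of how one language entry might be typed. Every field name below is an assumption made for illustration, and the flag order in the sample is a placeholder; the published dataset may organize this differently.

```typescript
// Hypothetical record shape; field names are illustrative, not the dataset's schema.
interface LanguageEntry {
  code: string;             // ISO 639 code
  englishName: string;
  endonym: string;          // the language's name in the language itself
  direction: "ltr" | "rtl"; // writing direction
  usesWordSpaces: boolean;  // whether words are separated by spaces
  flagOrder: string[];      // region codes, most representative first
  commonScripts: string[];  // ISO 15924 codes commonly paired with the language
}

const arabic: LanguageEntry = {
  code: "ar",
  englishName: "Arabic",
  endonym: "العربية",
  direction: "rtl",
  usesWordSpaces: true,
  flagOrder: ["EG", "SA"], // placeholder order, not taken from the dataset
  commonScripts: ["Arab"],
};

// Builds the label shown in a selector row: endonym first, English name optional.
function displayLabel(entry: LanguageEntry, showEnglishName: boolean): string {
  if (!showEnglishName || entry.englishName === entry.endonym) {
    return entry.endonym;
  }
  return `${entry.endonym} (${entry.englishName})`;
}
```

Putting the endonym first means a reader who does not know English can still recognize their own language in the list.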
Data Augmentation with Agentic Systems
LLM data augmentation is how we acquired this data. We wanted to collect it for the 400 most common languages (those which have ISO 639-2 terminologic codes). So we developed a data augmentation system that used a mix of the top current LLMs from different providers. These LLMs would each lay out what they thought the correct answer was and give their reasons. Judge models would then decide which of the competing answers was correct. Our manual validation showed that this worked extremely well in our first experiment.
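The propose-then-judge flow can be sketched as follows. This is an illustration under stated assumptions, not the pipeline we ran: the `Proposal` and `Judge` types and the `resolveAnswer` function are hypothetical, the judge is passed in as a plain function rather than a model call, and skipping the judge when every proposer agrees is a shortcut assumed only for this sketch.

```typescript
// Hypothetical types; the real pipeline and its prompts are not shown in this post.
interface Proposal {
  model: string;     // which LLM produced the answer
  answer: string;    // e.g. a proposed endonym
  reasoning: string; // the model's stated justification
}

// Stand-in for a judge model: receives all proposals and returns the chosen answer.
type Judge = (candidates: Proposal[]) => string;

// Returns the shared answer when all proposers agree; otherwise defers to the judge.
function resolveAnswer(proposals: Proposal[], judge: Judge): string {
  const first = proposals[0];
  if (first === undefined) {
    throw new Error("at least one proposal is required");
  }
  const distinct = new Set(proposals.map((p) => p.answer.trim()));
  if (distinct.size === 1) {
    return first.answer.trim();
  }
  return judge(proposals);
}
```

In a real system the judge would be another model call and would likely be asynchronous; the synchronous signature here just keeps the control flow easy to read.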
Given the success of the first round, we continued with other data. Determining which flags properly represent a language is a very good use case, as this is an opinionated topic. We developed a system that analyzed which flags best represent each language and ordered them by a set of criteria: the share of the language’s total speakers who live in that country, whether the language is an official language of the country, and the percentage of that country’s population that speaks the language.
Given these criteria, we used LLMs as non-human debaters to decide on the list and order of flags for each language and script. Eventually we got back our lists, with some languages having 1 country and some having over 10.
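The ordering itself came from LLM debate rather than a fixed formula, but the three criteria can be illustrated with a simple deterministic scorer. This is only a sketch: the `CountryStats` type, the `rankFlags` function, and the weights are all assumptions chosen for the example, since the post does not state how the criteria were balanced.

```typescript
// Hypothetical inputs for one language/country pair; shares are fractions in [0, 1].
interface CountryStats {
  region: string;                  // ISO 3166-1 alpha-2 code
  shareOfLanguageSpeakers: number; // fraction of the language's speakers living here
  isOfficial: boolean;             // language has official status in this country
  shareOfPopulation: number;       // fraction of the country's population speaking it
}

// Illustrative weights only; not values used to build the dataset.
const WEIGHTS = { speakers: 0.5, official: 0.3, population: 0.2 };

function score(stats: CountryStats): number {
  return (
    WEIGHTS.speakers * stats.shareOfLanguageSpeakers +
    WEIGHTS.official * (stats.isOfficial ? 1 : 0) +
    WEIGHTS.population * stats.shareOfPopulation
  );
}

// Returns region codes ordered from most to least representative, capped at maxFlags.
function rankFlags(candidates: CountryStats[], maxFlags: number): string[] {
  return [...candidates]
    .sort((a, b) => score(b) - score(a))
    .slice(0, maxFlags)
    .map((c) => c.region);
}
```

A fixed formula like this cannot capture the judgment calls that make flag choice contentious, which is part of why a debate-style process was used instead.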
We repeated this process until the data was sufficiently complete, then cleaned and published the open source dataset and generation methods to a public GitHub data repository.
Modern Web Design
Now that we had the appropriate data, we had to turn it into something actually useful, so we started design and development. We analyzed 10 different icons and ranked them on how well they represent the concept of "languages" across regions and cultures. We then determined the display formats that would let users see and select between languages, regions, and scripts.
We wanted optionality in our language selector to make it as comprehensive as possible. Users can show or hide flags, show a single flag or up to 4 that represent a language/script, choose between a modal and a dropdown, and show or hide English names. This optionality is important because people will come to different conclusions about what is best to show, and there are multiple "correct" answers depending on the circumstances. Flexibility was key. Some design choices still had to be opinionated: whatever styling options are chosen, each element needs enough space to display all of its data.
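As a sketch of what this kind of configuration could look like, the snippet below merges caller overrides with defaults and clamps the flag count to the 1 to 4 range described above. The option names and default values are hypothetical; consult the package documentation for the real ones.

```typescript
// Hypothetical option names and defaults; not the library's actual API.
interface SelectorOptions {
  showFlags: boolean;
  maxFlags: number;             // the component supports between 1 and 4 flags
  layout: "modal" | "dropdown";
  showEnglishNames: boolean;
}

const DEFAULTS: SelectorOptions = {
  showFlags: true,
  maxFlags: 1,
  layout: "dropdown",
  showEnglishNames: true,
};

// Merges caller overrides with defaults and keeps maxFlags inside the 1..4 range.
function resolveOptions(overrides: Partial<SelectorOptions> = {}): SelectorOptions {
  const merged = { ...DEFAULTS, ...overrides };
  // Fall back to the default when the value is not a usable number.
  const requested = Number.isFinite(merged.maxFlags) ? Math.trunc(merged.maxFlags) : DEFAULTS.maxFlags;
  const maxFlags = Math.min(4, Math.max(1, requested));
  return { ...merged, maxFlags };
}
```

Clamping in one place is an example of an opinionated choice: the layout can then assume it never has to fit more than four flags per row.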
We also wanted to support environments without network access, such as certain mobile apps and offline mode, so we made it possible to load from static files as well. An API endpoint and a helper function create static files that contain all language data, flag SVG data, and optionally display parameters.
Our designers used modern styling, with proper spacing and alignment, to put as little mental load as possible on the choice. Our engineers implemented search so people can navigate large language lists. The component loads data asynchronously when the user opens it, which avoids unnecessary network traffic.
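Search over a list of thousands of languages has to match on more than one field, since a user may type a code, an English name, or an endonym. The sketch below shows one way to do that; the `SearchableLanguage` type and `searchLanguages` function are hypothetical and are not taken from the component's source.

```typescript
// Hypothetical minimal entry shape for search; not the library's actual type.
interface SearchableLanguage {
  code: string;
  englishName: string;
  endonym: string;
}

// Lowercases and strips combining diacritics so "francais" can match "français".
function normalize(text: string): string {
  return text.normalize("NFD").replace(/[\u0300-\u036f]/g, "").toLowerCase().trim();
}

// Matches the query against code, English name, and endonym; prefix hits come first.
function searchLanguages(entries: SearchableLanguage[], query: string): SearchableLanguage[] {
  const q = normalize(query);
  if (q === "") return entries;

  const prefixHits: SearchableLanguage[] = [];
  const containsHits: SearchableLanguage[] = [];
  for (const entry of entries) {
    const fields = [entry.code, entry.englishName, entry.endonym].map(normalize);
    if (fields.some((f) => f.startsWith(q))) {
      prefixHits.push(entry);
    } else if (fields.some((f) => f.includes(q))) {
      containsHits.push(entry);
    }
  }
  return [...prefixHits, ...containsHits];
}
```

Matching on the endonym as well as the English name keeps search usable for someone who only knows the language's own name.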
This component was designed to be truly comprehensive: all languages, options that suit different situations and preferences, and intelligent, efficient design.
Component Libraries
We developed this library for 3 targets: React, as it is now the most commonly used frontend framework for new projects; Svelte, as our development team believes it is the best framework for modern web apps; and an embeddable HTML snippet, so you can use it in any other website, framework, or website builder.
Try the component
Here you can try the component in the common formats.
This site was built with Svelte, so the example below uses Svelte components:
And here we use an embeddable HTML snippet:
Conclusion
We made an open source language selector with a public API and a SQLite database containing every language’s ISO codes, names, endonyms, and major regions. We support all regions, as well as scripts and variants.
Casho Lab Language Selector Package Website - ls.casholab.com
GitHub Repository - github.com/casholab/language-selector
Component Playground - ls.casholab.com/playground