Building on the conversations initiated under the Future of the Commons (overviews of which can be found here and here), Bahu Bhasa, organised by the Open Knowledge Initiative at IIIT Hyderabad in collaboration with the Language Technologies Research Centre, brought together dozens of languages, some included in the Eighth Schedule of the Constitution of India and several others that one might be encountering for the first time. Language activists and specialists from different linguistic backgrounds, disciplines, and geographies met between 06 and 08 November 2025 to talk about the languages and communities they work with, the projects they have been working on, their aspirations, and the support they would like to seek.
A word cloud representing Indian languages at Bahu Bhasa
An Overview of Select Projects at Bahu Bhasa
- Uli: a project dedicated to tracking slurs in Indian languages
- Aakhor AI: an Assamese typing tool that can be used for other Indian languages
- Tamil Vani: NLP tools for the Tamil language
- Gurtur Goth: a Chhattisgarhi web magazine
- Keashur Praw: a social media page and channel for the Kashmiri language
- Eklavya: an NGO working on pedagogy for school children
- Adivasi Janjagruti: an organisation invested in mobile journalism
- Bhasha Verse and HimangY: tools in machine translation
- PARI: People’s Archive of Rural India, reportage in multiple Indian languages
- Pratham: a non-profit publisher of children’s books with a huge body of work in Indian languages
A word cloud representing the diverse language initiatives at Bahu Bhasa
The three-day programme covered various topics under the themes of policy, technology, and people, though these themes intersected and overlapped with each other. This postcard from the programme shares ideas that put technology in conversation with the other themes.
From Gutenberg to Zuckerberg
Even among the technologically blessed languages, there is a tendency to not think beyond digitisation. Digitised texts in themselves are not enough to make a language digital. These corpora need to be made available as searchable, machine-readable data. Publishing PDFs is a relic of the Gutenberg era, whereas digitality requires one to think in terms of the Zuckerberg era, the era of easy shareability.
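To make "searchable and machine-readable" concrete, here is a minimal Python sketch, not a description of any tool shown at Bahu Bhasa. It assumes the corpus is already available as plain Unicode text; the record layout (`id`, `lang`, `text`), the sample Hindi phrases, and the function names are all hypothetical.

```python
import json
import unicodedata
from collections import defaultdict


def normalise(text):
    # NFC normalisation makes visually identical strings compare equal,
    # which matters for scripts that use combining vowel signs.
    return unicodedata.normalize("NFC", text).casefold()


def build_index(records):
    # Map every whitespace-separated token to the ids of the records containing it.
    index = defaultdict(set)
    for rec in records:
        for token in normalise(rec["text"]).split():
            index[token].add(rec["id"])
    return index


def search(index, word):
    # Return the sorted ids of all records that contain the word.
    return sorted(index.get(normalise(word), set()))


# Hypothetical records: the same text locked inside a PDF could not be queried this way.
records = [
    {"id": "doc-1", "lang": "hi", "text": "भाषा और तकनीक"},
    {"id": "doc-2", "lang": "hi", "text": "भाषा और समुदाय"},
]
index = build_index(records)

print(search(index, "भाषा"))
# ensure_ascii=False keeps the text human-readable in the exported JSON.
print(json.dumps(records[0], ensure_ascii=False))
```

Whitespace tokenisation is a deliberate simplification; a real corpus would need language-aware tokenisation, and the point here is only that text stored as data can be indexed, queried, and shared.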
From Culture to Knowledge
Working on Indian languages should not be about creative writing or culture alone. Indian languages need to figure in spaces of knowledge, including science, technology, engineering, medicine, and mathematics. New technologies such as generative AI may be able to work with or produce poetry, but they cannot do the work of knowledge creation. At the same time, several conversations at Bahu Bhasa reminded us that Indian languages carry many forms of knowledge outside formal STEM domains as well, especially the wisdom held in oral traditions, proverbs, and everyday community practices. Much of this knowledge lives through context, memory, and performance, and does not travel well when reduced to text alone. This is also where current AI systems, which operate mainly at the level of linguistic form rather than meaning, struggle the most: they can imitate patterns in language, but they cannot reproduce the lived, situated understanding that communities bring to their stories. Bringing these forms of knowledge into our larger knowledge systems therefore requires approaches grounded in people, context, and care, not automated generation.
From Products to People
Technology should not be extractive. The data that go into the making of technologies for languages tend to be created and used in extractive ways. Informed consent from the communities regarding terms of usage of their data needs to be in place before the data are shared in any form.
From Form to Function/Content/Meaning
Technologies such as LLMs need to be understood more deeply when designing and developing for Indian languages. LLMs operate on a formal understanding of words. However, communication and dialogue are more than predictions of the next set of words in a sentence. Indian languages have a unique philosophy of word and meaning in which the word cannot stand on its own.
From Colonial Standardisation to Indian Digitality
English does not have a script of its own. The Roman script it continues to use hides the fact that languages should not be equated with scripts. A language can exist without a script, and a language can have multiple scripts. The tendency to follow the trajectory of English when developing tech for Indian languages should give way to a newer imagination of technology that enables orality within digitality in creative ways. Working with one script is a relic of the standardisation and definition of languages imposed by the British administrative machinery. For instance, the Perso-Arabic script was institutionalised by the British to bring some kind of standardisation to Sindhi, a language that was otherwise written in Devanagari, Gurmukhi, and Khudabadi as well. Then there are languages like Kashmiri, which remain unstandardised despite attempts to fit them into one script or another, whether Roman, Devanagari, or Perso-Arabic. There is no reason why such impositions should continue in a decolonial present.
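One way software can respect the distinction between language and script is to record them separately, as language tags in the BCP 47 style do (for example `sd-Arab` and `sd-Deva` for the same language in two scripts). The sketch below is illustrative only: the `detect_script` and `tag` helpers are hypothetical, the script is guessed from Unicode character names, only four scripts are covered, and the samples are the word "Sindhi" written in different scripts.

```python
import unicodedata
from collections import Counter

# First word of a Unicode character name -> ISO 15924 script code.
# Only a handful of scripts are listed; this table is an assumption of the sketch.
SCRIPT_CODES = {
    "ARABIC": "Arab",
    "DEVANAGARI": "Deva",
    "GURMUKHI": "Guru",
    "LATIN": "Latn",
}


def detect_script(text):
    # Count characters per script and return the most frequent one.
    counts = Counter()
    for ch in text:
        if ch.isspace():
            continue
        prefix = unicodedata.name(ch, "").split(" ")[0]
        if prefix in SCRIPT_CODES:
            counts[SCRIPT_CODES[prefix]] += 1
    # "Zyyy" is the ISO 15924 code for an undetermined script.
    return counts.most_common(1)[0][0] if counts else "Zyyy"


def tag(language, text):
    # Keep language and script as separate subtags: one language, many scripts.
    return f"{language}-{detect_script(text)}"


# The same language name in Perso-Arabic, Devanagari, Gurmukhi, and Roman script.
samples = ["سنڌي", "सिन्धी", "ਸਿੰਧੀ", "Sindhi"]
for sample in samples:
    print(tag("sd", sample), sample)
```

A tool built this way never has to pick one "correct" script for a language; all four records carry the same language subtag and differ only in the script subtag.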
From Language to Community and Impact
For several projects, the development of language technology and communities is not an end in itself. These projects started out as support for an issue: preservation of books, awareness of rights, storytelling as pedagogy for learners, and so on. Preservation of language and development of language technology have been added benefits of these projects. Therefore, one does not always have to think in terms of natural language processing or lexicography in order to contribute to language technology. Projects and ideas that aspire to do something inevitably involve an element of language.
From Development to Annotation and Evaluation
Hands-on activities at Bahu Bhasa involved working with the ground realities of annotation. Annotators are the invisible labour behind the finished products people consume every day. Their work involves going through a huge dump of data and tagging different components in exchange for a pittance, which ranges from $1-2 per task or less than $1.5/hr (after tax), in addition to the severe toll on physical and mental health (Williams 2022). The work of building a technology or an app should not overlook those at the frontline. The idea of digital labour needs to recognise the menial labour it involves and give it due recognition and compensation.
Similarly, no output produced by a technology goes out into the community or the marketplace without supervision. The work of evaluators in checking such output, especially in translation, whether through manual checks or by designing systems and workflows for automated checks, deserves even more respect than the translation technology itself.
From Language as Pride to the Language of Shaming
Development of language apps and products begins with the premise of pride. One wants to preserve languages because they carry different worldviews and perspectives on things and phenomena, without which the world is a poorer place to live in. However, languages are also means of carrying abuse, stereotyping, and vocabularies of hatred and trolling. While one cannot wish slurs away from languages, one needs to imagine better ways of identifying and monitoring their usage with a view to creating safer online spaces, especially given that digital spaces are more often than not spaces of shaming.
The team at the Open Knowledge Initiative intends to follow up on Bahu Bhasa 2025 with the formation of working groups and the constitution of a larger body of people and projects in Indian languages. The programme has stirred participants’ sense of curiosity and collaboration. Hopefully, there will be more reports from the field in the next few years as the team facilitates innovation in Indian languages at the levels of technology and ethos. The detailed report for Bahu Bhasa 2025 can be accessed here.
References
Williams, Adrienne, “The Exploited Labor behind Artificial Intelligence,” NOEMA, October 13, 2022. https://www.noemamag.com/the-exploited-labor-behind-artificial-intelligence/.