Takeaways:
- We’re introducing Meta Omnilingual Automatic Speech Recognition (ASR), a suite of models providing automatic speech recognition capabilities for more than 1,600 languages, achieving state-of-the-art quality at an unprecedented scale.
- Omnilingual ASR was designed as a community-driven framework. People around the world can extend Omnilingual ASR to new languages by using just a few of their own samples.
- We’re also releasing the Omnilingual ASR Corpus, an extensive collection of transcribed speech in 350 underserved languages; Omnilingual wav2vec 2.0, a scaled-up massively multilingual speech representation model; and a language exploration demo where people can explore the languages covered by the model.
Automatic speech recognition (ASR) systems aim to make spoken language universally accessible by transcribing speech into text that can be searched, analyzed, and shared. Currently, most automatic speech recognition systems focus on a limited set of high-resource languages that are well represented on the internet, often relying on large amounts of labeled data and human-generated metadata to achieve good performance. This means high-quality transcriptions are often unavailable for speakers of less widely represented or low-resource languages, furthering the digital divide.
Today, Meta’s Fundamental AI Research (FAIR) team is introducing Omnilingual ASR — a groundbreaking suite of models that deliver automatic speech recognition for more than 1,600 languages, including 500 low-resource languages never before transcribed by AI. We’re also open sourcing Omnilingual wav2vec 2.0, a new self-supervised massively multilingual speech representation model scaled up to 7B parameters that can be leveraged for other downstream speech-related tasks. In addition, we’re releasing the Omnilingual ASR Corpus, a unique collection of transcribed speech in 350 underserved languages, curated in collaboration with our global partners.
This work supports our goal of building technology to help bring the world closer together. Omnilingual ASR is a significant step toward delivering a truly universal transcription system and expanding access to speech technology worldwide, ensuring that high-quality speech-to-text systems are accessible to even the most underrepresented language communities. The hope is to ultimately break down language barriers and enable communication across diverse linguistic and cultural backgrounds.
Beyond Multilinguality: Unprecedented Language Coverage and Performance
Automatic speech recognition has made strong progress in recent years, approaching near-perfect accuracy for many high-resource languages. However, expanding language coverage has been prohibitively resource-intensive, as current AI architectures are too data-demanding to scale universally.
Omnilingual ASR addresses this research blocker by introducing two architectural variants. First, we scaled our previous wav2vec 2.0 speech encoder to 7B parameters for the first time, producing rich, massively multilingual semantic representations from raw, untranscribed speech data. We then built two decoder variants that map those representations into character tokens. The first relies on a classic connectionist temporal classification (CTC) objective, while the second leverages a transformer decoder of the kind commonly used in LLMs.
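For intuition, here is a minimal PyTorch sketch contrasting the two decoding strategies on top of a frozen encoder output. The tensor shapes, vocabulary size, and layer configuration are illustrative assumptions rather than the released implementation.

```python
# Minimal sketch (not the released implementation) contrasting the two
# decoder variants on top of a frozen speech-encoder output. Shapes,
# vocabulary size, and layer sizes are illustrative assumptions.
import torch
import torch.nn as nn

VOCAB = 256            # assumed character vocabulary size
ENC_DIM = 1024         # assumed encoder hidden size
B, T, U = 2, 200, 40   # batch, encoder frames, target characters

# Stand-in for wav2vec 2.0 encoder output: (batch, frames, dim)
encoder_out = torch.randn(B, T, ENC_DIM)
targets = torch.randint(1, VOCAB, (B, U))  # character token ids (0 = blank)

# Variant 1: CTC decoder. A linear layer maps every frame to character
# logits; the CTC objective marginalizes over frame-to-character alignments.
ctc_head = nn.Linear(ENC_DIM, VOCAB)
log_probs = ctc_head(encoder_out).log_softmax(-1).transpose(0, 1)  # (T, B, V)
ctc_loss = nn.CTCLoss(blank=0)(
    log_probs,
    targets,
    input_lengths=torch.full((B,), T),
    target_lengths=torch.full((B,), U),
)

# Variant 2: transformer (LLM-style) decoder. An autoregressive decoder
# cross-attends to the encoder output and predicts the next character token.
embed = nn.Embedding(VOCAB, ENC_DIM)
decoder = nn.TransformerDecoder(
    nn.TransformerDecoderLayer(d_model=ENC_DIM, nhead=8, batch_first=True),
    num_layers=2,
)
proj = nn.Linear(ENC_DIM, VOCAB)

tgt_in = targets[:, :-1]                                   # teacher forcing
causal_mask = nn.Transformer.generate_square_subsequent_mask(tgt_in.size(1))
hidden = decoder(embed(tgt_in), memory=encoder_out, tgt_mask=causal_mask)
ce_loss = nn.CrossEntropyLoss()(proj(hidden).flatten(0, 1), targets[:, 1:].flatten())

print(f"CTC loss: {ctc_loss.item():.2f}, decoder loss: {ce_loss.item():.2f}")
```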

Dubbed LLM-ASR, the transformer-decoder approach introduces a step change in ASR performance, especially for long-tail languages. Our 7B-LLM-ASR system achieves state-of-the-art performance across more than 1,600 languages, with character error rates (CER) below 10 for 78% of those languages.

[Chart: character error rate (CER) results; lower is better]
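For reference, CER is the character-level edit distance between a hypothesis and the reference transcript, normalized by the reference length. A minimal sketch of the computation, with an invented example pair:

```python
# Minimal sketch of how character error rate (CER) is typically computed:
# Levenshtein (edit) distance over characters, normalized by the length of
# the reference transcript. The example strings are invented.
def cer(reference: str, hypothesis: str) -> float:
    prev = list(range(len(hypothesis) + 1))
    for i, r in enumerate(reference, start=1):
        curr = [i]
        for j, h in enumerate(hypothesis, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return 100.0 * prev[-1] / max(len(reference), 1)

print(cer("omnilingual speech", "omnilingual speach"))  # ~5.6, i.e., CER below 10
```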
Bring Your Own Language
Beyond expanding to more than 1,600 languages, Omnilingual ASR also shifts the paradigm for how new languages can be brought into the fold. In most existing systems, languages not included at release time can only be added through expert-driven fine-tuning — a path inaccessible to most communities. Omnilingual ASR instead introduces the first large-scale ASR framework capable of extending to entirely new languages with just a few in-context examples.
This is made possible by our LLM-inspired system, which brings in-context learning capabilities over from the field of LLMs. In practice, this means that a speaker of an unsupported language can provide only a handful of paired audio-text samples and obtain usable transcription quality, without large-scale training data, specialized expertise, or access to high-end compute. While this in-context performance cannot yet match that of fully trained systems, it offers a far more scalable path to bringing new languages into digital reach.
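As an illustration of the workflow (not the released API; every name and file below is hypothetical), a few paired audio-text samples could be packaged as conditioning context for an in-context-capable decoder:

```python
# Illustrative sketch only: how a few paired audio-text examples might be
# packaged as in-context conditioning for a new language. PairedExample,
# build_context, and the file names are hypothetical, not the released API.
from dataclasses import dataclass
from typing import Sequence

@dataclass
class PairedExample:
    audio_path: str   # a short recording in the new language
    transcript: str   # its human-written transcription

def build_context(examples: Sequence[PairedExample], target_audio: str) -> dict:
    """Assemble a conditioning request: N audio-text pairs plus the target audio."""
    return {
        "context": [(ex.audio_path, ex.transcript) for ex in examples],
        "target": target_audio,
    }

# A speaker supplies only a few of their own recordings:
few_shot = [
    PairedExample("greeting.wav", "hello, how are you"),
    PairedExample("market.wav", "the market opens at dawn"),
    PairedExample("rain.wav", "it rained all night"),
]
request = build_context(few_shot, "new_utterance.wav")
print(len(request["context"]), "in-context examples prepared")
# An in-context-capable decoder would condition on this request at inference
# time, with no fine-tuning and no gradient updates.
```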

A Suite of Models for Various Use Cases
We’re releasing a full suite of models and one dataset. Built on the foundation of FAIR’s previous research, Omnilingual ASR gives stakeholders everything they need to expand and improve speech technology for any language.
The two decoding variants are available as a versatile family of models, from lightweight 300M versions designed for low-power devices to powerful 7B models that offer top-tier accuracy for a variety of use cases. Omnilingual wav2vec 2.0, our general-purpose speech foundation model, is also available in multiple sizes and can be used by researchers and developers alike for speech-related tasks beyond ASR.
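As one hedged example of a downstream use beyond ASR, pooled encoder representations could feed a simple classifier head, such as a toy language-identification layer. The encoder output here is a stand-in random tensor, and the dimensions are assumptions.

```python
# Hedged sketch of reusing encoder representations for a task other than
# ASR (a toy language-identification head). The "encoder output" is a
# stand-in random tensor; the hidden size is an assumption.
import torch
import torch.nn as nn

ENC_DIM = 1024         # assumed hidden size of a smaller encoder variant
NUM_LANGUAGES = 1600   # rough coverage of the suite

# Pretend output of a frozen speech encoder: (batch, frames, dim)
features = torch.randn(4, 300, ENC_DIM)

# Mean-pool frames into one utterance embedding, then classify the language.
lid_head = nn.Linear(ENC_DIM, NUM_LANGUAGES)
logits = lid_head(features.mean(dim=1))
print(logits.argmax(dim=-1))  # predicted language ids for the toy batch
```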
All model assets are released under a permissive Apache 2.0 license, and the data is provided under a CC-BY license. The models are built on FAIR’s open source fairseq2 framework, empowering researchers, developers, and language advocates worldwide to advance and tailor speech solutions for their own use cases using the latest tools and technologies in the PyTorch ecosystem.
Built With Global Partners
Omnilingual ASR also advances the state of multilingual ASR along more familiar dimensions. Its training corpus is one of the largest ever assembled for ASR in both volume and linguistic diversity, integrating publicly available datasets with community-sourced speech recordings collected through multiple partnerships.
To reach languages with little or no digital presence, we worked with local organizations that recruited and compensated native speakers, often in remote or under-documented regions. We’re releasing this commissioned part of our training corpus as the Omnilingual ASR Corpus to further benefit the ASR research community. To date, it is the largest ultra-low-resource spontaneous ASR dataset ever made available, covering hundreds of languages never before seen by ASR systems. Explore the languages in the dataset here.
Beyond commissioned partnerships, collaborations through the Language Technology Partner Program have brought together linguists, researchers, and language communities from around the world, providing essential expertise and resources. We joined forces with organizations such as Mozilla Foundation’s Common Voice and Lanfrica/NaijaVoices to work directly with local communities.
These partnerships have been instrumental in infusing Omnilingual ASR with deep linguistic knowledge and cultural understanding, ensuring that the technology meets local needs and empowers diverse language communities globally.
Download Omnilingual ASR
Try the Language Exploration Demo
Try the Transcription Tool
Read the Paper