of the AI boom, the pace of technological iteration has reached an unprecedented level. Previous obstacles now seem to have viable solutions. This article serves as an “NMT 101” guide. While introducing our project, it also walks readers step by step through the process of fine-tuning an existing translation model to support a low-resource language that is not included in mainstream multilingual models.
Background: Dongxiang as a Low-Resource Language
Dongxiang is a minority language spoken in China’s Gansu Province and is classified as vulnerable by the UNESCO Atlas of the World’s Languages in Danger. Despite being widely spoken in local communities, Dongxiang lacks the institutional and digital support enjoyed by high-resource languages. Before diving into the training pipeline, it helps to briefly understand the language itself. Dongxiang, as its name suggests, is the mother tongue of the Dongxiang people. Descended from Central Asian groups who migrated to Gansu during the Yuan dynasty, the Dongxiang community has linguistic roots closely tied to Middle Mongol. From a writing-system perspective, Dongxiang has undergone a relatively recent standardization. Since the 1990s, with governmental promotion, the language has gradually adopted an official Latin-based orthography, using the 26 letters of the English alphabet and delimiting words by whitespace.
Dongxiang Language Textbook for Primary Schools (by Author)
Although Dongxiang is classified within the Mongolic language family, prolonged coexistence with Mandarin-speaking communities throughout history has left it with a trove of lexical borrowings from Chinese (Mandarin). Dongxiang exhibits no overt tense inflection or grammatical gender, which may simplify our model training.
Based on the Dongxiang dictionary, approximately 33.8% of Dongxiang vocabulary items are of Chinese origin. (by Author)
Further background on the Dongxiang language and its speakers can be found on our website, which hosts an official English-language introduction released by the Chinese government.
Our Model: How to Use the Translation System
We build our translation system on top of NLLB-200-distilled-600M, a multilingual neural machine translation model released by Meta as part of the No Language Left Behind (NLLB) project. We were inspired by the work of David Dale. However, ongoing updates to the Transformers library have made the original approach difficult to apply. In our own trials, rolling back to earlier versions (e.g., transformers ≤ 4.33) often triggered conflicts with other dependencies. In light of these constraints, we provide a full list of libraries in our project’s GitHub requirements.txt for your reference.
Two training notebooks (by Author)
Our model was fine-tuned on 42,868 Dongxiang–Chinese bilingual sentence pairs. The training corpus combines publicly available materials with internally curated resources provided by local government partners, all processed and cleaned in advance. Training was conducted using Adafactor, a memory-efficient optimizer well suited to large transformer models. With the distilled architecture, the full fine-tuning process can be completed in under 12 hours on a single NVIDIA A100 GPU. All training configurations, hyperparameters, and experimental settings are documented across two training Jupyter notebooks. Rather than relying on a single bidirectional model, we trained two direction-specific models to support Dongxiang–Chinese and Chinese–Dongxiang translation. NLLB does support bidirectional translation in a single model, and a straightforward approach is to alternate translation directions at the batch level. However, since NLLB is already pretrained on Chinese, joint training under data-imbalanced conditions tends to favor the easier or more dominant direction, so performance gains on the low-resource side (Dongxiang) are often limited.
Here are the links to our repository and website.
GitHub Repository
GitHub-hosted website
The model is also publicly available on Hugging Face.
Chinese → Dongxiang
Dongxiang → Chinese
Model Training: Step-by-Step Reproducible Pipeline
Before following this pipeline to build the model, we assume that the reader has a basic understanding of Python and fundamental concepts in natural language processing. For readers less familiar with these topics, Andrew Ng's courses are a highly recommended gateway; I also began my own journey into this field through them.
Step 1: Bilingual Dataset Processing
The first stage of model training focuses on constructing a bilingual dataset. While parallel corpora for major languages can often be obtained by leveraging existing web-scraped resources, Dongxiang–Chinese data remains difficult to acquire. To support transparency and reproducibility, and with consent from the relevant data custodians, we have released both the raw corpus and a normalized version in our GitHub repository. The normalized dataset is produced through a straightforward preprocessing pipeline that removes excessive whitespace, standardizes punctuation, and ensures a clear separation between scripts. Dongxiang text is restricted to Latin characters, while Chinese text contains only Chinese characters. Below is the code used for preprocessing:
import re
import pandas as pd

def split_lines(s: str):
    # Handle both literal "\n" escape sequences and real newlines
    if "\\n" in s and "\n" not in s:
        lines = s.split("\\n")
    else:
        lines = s.splitlines()
    lines = [ln.strip().strip("'").strip() for ln in lines if ln.strip()]
    return lines

def clean_dxg(s: str) -> str:
    # Keep only Latin letters, whitespace, and basic punctuation
    s = re.sub(r"[^A-Za-z\s,\.?]", " ", s)
    s = re.sub(r"\s+", " ", s).strip()
    s = re.sub(r"[,.?]+$", "", s)  # drop trailing punctuation
    return s

def clean_zh(s: str) -> str:
    # Keep only CJK characters and Chinese punctuation
    s = re.sub(r"[^\u4e00-\u9fff,。?]", "", s)
    s = re.sub(r"[,。?]+$", "", s)  # drop trailing punctuation
    return s

def make_pairs(raw: str) -> pd.DataFrame:
    # Input alternates lines: a Dongxiang sentence, then its Chinese translation
    lines = split_lines(raw)
    pairs = []
    for i in range(0, len(lines) - 1, 2):
        dxg = clean_dxg(lines[i])
        zh = clean_zh(lines[i + 1])
        if dxg or zh:
            pairs.append({"Dongxiang": dxg, "Chinese": zh})
    return pd.DataFrame(pairs, columns=["Dongxiang", "Chinese"])
In practice, bilingual sentence-level pairs are preferred over word-level entries, and excessively long sentences are split into shorter segments. This facilitates more reliable cross-lingual alignment and leads to more stable and efficient model training. Isolated dictionary entries should not be inserted into the training inputs: without surrounding context, the model can neither infer syntactic roles nor learn how words interact with surrounding tokens.
Bilingual dataset (by Author)
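As one illustration of this filtering step, the sketch below drops dictionary-style entries and overlong pairs by word count. The function name, thresholds, and dummy rows are our own illustrative choices, not part of the released pipeline:

```python
import pandas as pd

def filter_pairs(df: pd.DataFrame, min_words: int = 3, max_words: int = 40) -> pd.DataFrame:
    # Keep sentence-level pairs: drop isolated dictionary entries (too few
    # words on the Dongxiang side) and overlong sentences that resist alignment.
    n_words = df["Dongxiang"].str.split().str.len()
    mask = (n_words >= min_words) & (n_words <= max_words)
    return df[mask].reset_index(drop=True)

# Dummy rows for illustration only (not real Dongxiang text)
df = pd.DataFrame({
    "Dongxiang": ["apple", "one two three four five six"],
    "Chinese": ["词", "一个完整的句子"],
})
filtered = filter_pairs(df)  # only the sentence-level pair survives
```

The thresholds are deliberately loose; the right cutoffs depend on the corpus and should be tuned by inspecting what gets discarded.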
When parallel data is limited, a common alternative is to generate synthetic source sentences from monolingual target-language data and pair them with the originals to form pseudo-parallel corpora. This idea was popularized by Rico Sennrich, whose work on back-translation laid the groundwork for many NMT pipelines. LLM-generated synthetic data is another viable approach. Prior work has shown that LLM-generated synthetic data is effective in building translation systems for Purépecha, an Indigenous language spoken in Mexico.
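To make the back-translation idea concrete, here is a minimal sketch. The function names are illustrative; any target-to-source translation function, such as a Chinese → Dongxiang model, can be plugged in for `zh_to_dxg`:

```python
def back_translate(monolingual_zh, zh_to_dxg):
    # Pair each real Chinese sentence with a synthetic Dongxiang source
    # produced by any ZH -> DXG translation function. The real Chinese side
    # serves as the (clean) target during training.
    return [{"Dongxiang": zh_to_dxg(zh), "Chinese": zh} for zh in monolingual_zh]

# Example with a stand-in translator (a real pipeline would call the model)
pseudo_parallel = back_translate(["你好", "谢谢"], lambda zh: f"<synthetic:{zh}>")
```

Because the synthetic side is noisy, pseudo-parallel data is usually mixed with the genuine parallel corpus rather than used on its own.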
Step 2: Tokenizer Preparation
Before text can be digested by a neural machine translation model, it must be converted into tokens. Tokens are discrete units, typically at the subword level, that serve as the basic input symbols for neural networks. Using entire words as atomic units is impractical, as it leads to excessively large vocabularies and rapid growth in model dimensionality. Moreover, word-level representations struggle to generalize to unseen or rare words, whereas subword tokenization enables models to compose representations for novel word forms.
The official NLLB documentation already provides standard examples demonstrating how tokenization is handled. Owing to NLLB’s strong multilingual capacity, most widely used writing systems can be tokenized in a reasonable and stable manner. In our case, adopting the default NLLB multilingual tokenizer (Unigram-based) was sufficient to process Dongxiang text.
Summary statistics of tokenized Dongxiang sentences (by Author)
Whether the tokenizer should be retrained is best determined by two criteria. The first is coverage: frequent occurrences of unknown tokens (<unk>) indicate insufficient vocabulary or character handling. In our sample of 300 Dongxiang sentences, the <unk> rate is zero, suggesting full coverage under the current preprocessing. The second criterion is subword fertility, defined as the average number of subword tokens generated per whitespace-delimited word. Across the 300 samples, sentences average 6.86 words and 13.48 tokens, corresponding to a fertility of approximately 1.97. This pattern remains consistent across the distribution, with no evidence of excessive fragmentation in longer sentences.
Overall, NLLB demonstrates robust behavior even on previously unseen languages. As a result, tokenizer retraining is generally unnecessary unless the target language employs a highly unconventional writing system or even lacks Unicode support. Retraining a SentencePiece tokenizer also has implications for the embedding layer. New tokens start without pretrained embeddings and must be initialized using random values or simple averaging.
Step 3: Language ID Registration
In practical machine translation systems such as Google Translate, the source and target languages must be explicitly specified. NLLB adopts the same assumption. Translation is governed by explicit language tags, referred to as src_lang and tgt_lang, which determine how text is encoded and generated within the model. When a language falls outside NLLB's predefined scope, it must first be explicitly registered, along with a corresponding expansion of the model's embedding layer. The embedding layer maps discrete tokens into continuous vector representations, allowing the neural network to process and learn linguistic patterns in numerical form.
In our implementation, a custom language tag is added to the tokenizer as an additional special token, which assigns it a unique token ID. The model's token embedding matrix is then resized to accommodate the expanded vocabulary. The embedding vector associated with the new language tag is initialized from a zero-centered normal distribution scaled to a standard deviation of 0.02. If the newly introduced language is closely related to an existing supported language, its embedding can often be trained on top of the existing representation space. However, linguistic similarity alone does not guarantee effective transfer learning, because differences in writing systems affect tokenization. A well-known example is Moldovan, which is linguistically identical to Romanian and written in the Latin script in Moldova, while the so-called Pridnestrovian Moldavian Republic writes it in Cyrillic. Despite the close linguistic relationship, the difference in script introduces distinct tokenization patterns.
The code used to register a new language is presented here.
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/nllb-200-distilled-600M")

def fix_tokenizer(tokenizer, new_lang: str):
    # Register the new language tag as an additional special token,
    # then return its token ID
    old = list(tokenizer.additional_special_tokens)
    if new_lang not in old:
        tokenizer.add_special_tokens(
            {"additional_special_tokens": old + [new_lang]})
    return tokenizer.convert_tokens_to_ids(new_lang)

# We register Dongxiang as sce_Latn; it is appended after the last
# existing tag and receives token ID 256204
new_id = fix_tokenizer(tokenizer, "sce_Latn")

print(tokenizer.convert_ids_to_tokens([256100, 256204]))
print(tokenizer.convert_tokens_to_ids(['lao_Laoo', 'sce_Latn']))
# output:
# ['lao_Laoo', 'sce_Latn']
# [256100, 256204]

model = AutoModelForSeq2SeqLM.from_pretrained("facebook/nllb-200-distilled-600M")
model.resize_token_embeddings(len(tokenizer))

# Initialize the new tag's embedding from a zero-centered normal
# distribution with standard deviation 0.02
embed_dim = model.model.shared.weight.size(1)
model.model.shared.weight.data[new_id] = torch.randn(embed_dim) * 0.02
Step 4: Model Training
We fine-tuned the translation model using the Adafactor optimizer, a memory-efficient optimization algorithm designed for large-scale sequence-to-sequence models. The training schedule begins with 500 warmup steps, during which the learning rate is gradually increased up to 1e-4 to stabilize early optimization and avoid sudden gradient spikes. The model is then trained for a total of 8,000 optimization steps, with 64 sentence pairs per optimization step (batch). The maximum sequence length is set to 128 tokens, and gradient clipping is applied with a threshold of 1.0.
We initially planned to adopt early stopping. However, due to the limited size of the bilingual corpus, nearly all available bilingual data was used for training, leaving only a dozen-plus sentence pairs reserved for testing. Under these conditions, a validation set of sufficient size was not available. Therefore, although our GitHub codebase includes placeholders for early stopping, this mechanism was not actively used in practice.
Below is a snapshot of the key hyperparameters used in training.
from transformers.optimization import Adafactor

optimizer = Adafactor(
    [p for p in model.parameters() if p.requires_grad],
    scale_parameter=False,
    relative_step=False,
    lr=1e-4,
    clip_threshold=1.0,
    weight_decay=1e-3,
)

batch_size = 64
max_length = 128
training_steps = 8000
warmup_steps = 500
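The warmup behavior described above can be sketched as a simple function of the optimization step. This is a simplified illustration of the schedule, not the exact scheduler code from our notebooks:

```python
def lr_at_step(step: int, warmup_steps: int = 500, peak_lr: float = 1e-4) -> float:
    # Linear warmup from near zero to peak_lr over the first warmup_steps,
    # then hold the learning rate constant for the remaining steps.
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    return peak_lr
```

Warming up avoids large, destabilizing updates while the resized embedding layer and optimizer statistics are still settling.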
It is also worth noting that, in the design of the loss function, we adopt a computationally efficient training strategy. The model receives tokenized source sentences as input and generates the target sequence incrementally. At each step, the predicted token is compared against the corresponding reference token in the target sentence, and the training objective is computed using token-level cross-entropy loss.
loss = model(**x, labels=y.input_ids).loss

# Pseudocode below illustrates the underlying mechanism of the loss function
for each batch:
    x = tokenize(source_sentences)        # input: source-language tokens
    y = tokenize(target_sentences)        # target: reference translation tokens
    predictions = model.forward(x)        # predict next-token distributions
    loss = cross_entropy(predictions, y)  # compare with reference tokens
    backpropagate(loss)
    update_model_parameters()
This formulation carries an implicit assumption: that the reference translation represents the single correct answer and that the model's output must align with it token by token. Under this assumption, any deviation from the reference is treated as an error, even when a prediction conveys the same idea through different wording, synonyms, or an altered sentence structure.
The mismatch between token-level supervision and meaning-level correctness is particularly problematic in low-resource and morphologically flexible languages. At the training stage, this issue can be alleviated by relaxing strict token-level alignment and treating multiple paraphrased target sentences as equally valid references. At the inference stage, instead of selecting the highest-probability output, a set of candidate translations can be generated and re-ranked using similarity-based criteria such as chrF.
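One common way to realize such re-ranking is minimum Bayes risk (MBR) decoding: generate several candidates and pick the one that agrees most, on average, with the others under a chosen utility metric (chrF, for instance). A generic sketch, with the utility passed in as a function and a toy token-overlap utility standing in for chrF:

```python
def mbr_rerank(candidates, utility):
    # Score each candidate by its total agreement with all *other*
    # candidates; return the highest-scoring ("consensus") candidate.
    best, best_score = None, float("-inf")
    for i, c in enumerate(candidates):
        score = sum(utility(c, o) for j, o in enumerate(candidates) if j != i)
        if score > best_score:
            best, best_score = c, score
    return best

def overlap(a: str, b: str) -> float:
    # Toy utility: Jaccard overlap of whitespace tokens (stand-in for chrF)
    ta, tb = set(a.split()), set(b.split())
    return len(ta & tb) / max(len(ta | tb), 1)
```

In practice the candidates would come from sampling or beam search (e.g., `num_return_sequences` in `generate`), and the utility would be a real metric such as sentence-level chrF.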
Step 5: Model Evaluation
Once the model is built, the next step is to examine how well it translates. Translation quality is shaped not only by the model itself, but also by how the translation process is configured at inference time. Under the NLLB framework, the target language must be explicitly specified during generation. This is done through the forced_bos_token_id parameter, which anchors the output to the intended language. Output length is controlled through two parameters. The first is the minimum output allowance (a), which guarantees a baseline number of tokens that the model is allowed to generate. The second is a scaling factor (b), which determines how the maximum output length grows in proportion to the input length. The maximum number of generated tokens is set as a linear function of the input length, computed as a + b × input_length. In addition, max_input_length limits how many input tokens the model reads.
This function powers the Chinese → Dongxiang translation.
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"

MODEL_DIR3 = "/content/drive/MyDrive/my_nllb_CD_model"
tokenizer3 = AutoTokenizer.from_pretrained(MODEL_DIR3)
model3 = AutoModelForSeq2SeqLM.from_pretrained(MODEL_DIR3).to(device)
model3.eval()

def translate3(text, src_lang="zho_Hans", tgt_lang="sce_Latn",
               a=16, b=1.5, max_input_length=1024, **kwargs):
    # Tell the tokenizer which source-language tag to prepend
    tokenizer3.src_lang = src_lang
    inputs = tokenizer3(text, return_tensors="pt", padding=True,
                        truncation=True, max_length=max_input_length).to(model3.device)
    result = model3.generate(
        **inputs,
        # Force the first generated token to be the target-language tag
        forced_bos_token_id=tokenizer3.convert_tokens_to_ids(tgt_lang),
        # Output length budget grows linearly with input length: a + b * input_length
        max_new_tokens=int(a + b * inputs.input_ids.shape[1]),
        **kwargs,
    )
    outputs = tokenizer3.batch_decode(result, skip_special_tokens=True)
    return outputs
Model quality is then assessed using a combination of automatic evaluation metrics and human judgment. On the quantitative side, we report standard machine translation metrics: BLEU and chrF++. BLEU scores were computed using standard BLEU-4, which measures word-level n-gram overlap from unigrams to four-grams and combines them using a geometric mean with a brevity penalty. chrF++ was calculated over character-level n-grams, augmented with word unigrams and bigrams, and reported as an F-score. It should be noted that the current evaluation is preliminary: due to limited data availability at this early stage, BLEU and chrF++ scores were computed on only a few dozen held-out sentence pairs. Our model achieved the following results:
Dongxiang → Chinese (DX→ZH): BLEU-4 44.00, chrF++ 34.3
Chinese → Dongxiang (ZH→DX): BLEU-4 46.23, chrF++ 59.80
BLEU-4 scores above 40 are generally regarded as strong in low-resource settings, indicating that the model captures sentence structure and key lexical choices with reasonable accuracy. The lower chrF++ score in the Dongxiang → Chinese direction is expected and does not necessarily indicate poor translation quality, as Chinese permits substantial surface-level variation in word choice and sentence structure, which reduces character-level overlap with a single reference translation.
In parallel, bilingual evaluators fluent in both languages reported that the model performs reliably on simple sentences, such as those following basic subject–verb–object structures. Performance degrades on longer and more complex sentences. While these results are encouraging, they also indicate that further improvement is still required.
Step 6: Deployment
At the current stage, we deploy the project through a lightweight setup by hosting the documentation and demo interface on GitHub Pages, while releasing the trained models on Hugging Face. This approach enables public access and community engagement without incurring additional infrastructure costs. Details regarding GitHub-based deployment and Hugging Face model hosting follow the official documentation provided by GitHub Pages and the Hugging Face Hub, respectively.
This script uploads a locally trained Hugging Face–compatible model.
import os
from huggingface_hub import HfApi, HfFolder

# Load the Hugging Face access token from the environment
token = os.environ.get("HF_TOKEN")
HfFolder.save_token(token)

# Path to the local directory containing the trained model artifacts
local_dir = "/path/to/your/local_model_directory"

# Target Hugging Face Hub repository ID in the format: username/repo_name
repo_id = "your_username/your_model_name"

# Upload the entire model directory to the Hugging Face Model Hub
api = HfApi()
api.upload_folder(
    folder_path=local_dir,
    repo_id=repo_id,
    repo_type="model",
)
Following model release, a Gradio-based interface is deployed as a Hugging Face Space and embedded into the project’s GitHub Pages site. Compared to Docker-based self-deployment, using Hugging Face Spaces with Gradio avoids the cost of maintaining dedicated cloud infrastructure.
Screenshot of our translation demo (by Author)
Reflection
Throughout the project, data preparation, not model training, dominated the overall workload. The time spent cleaning, validating, and aligning Dongxiang–Chinese data far exceeded the time required to fine-tune the model itself. Without local government involvement and the support of native and bilingual speakers, completing this work would not have been possible. From a technical perspective, this imbalance highlights a broader issue of representation in multilingual NLP. Low-resource languages such as Dongxiang are underrepresented not due to inherent linguistic complexity, but because the data required to support them is expensive to obtain and relies heavily on human expertise.
At its core, this project digitizes a printed bilingual dictionary and constructs a basic translation system. For a community of fewer than one million people, these incremental steps play an outsized role in ensuring that the language is not excluded from modern language technologies. Finally, let’s take a moment to appreciate the breathtaking scenery of Dongxiang Autonomous County!
River gorge in Dongxiang Autonomous County (by Author)
Contact
This article was jointly written by Kaixuan Chen and Bo Ma, who were classmates in the Department of Statistics at the University of North Carolina at Chapel Hill. Kaixuan Chen is currently pursuing a master's degree at Northwestern University, while Bo Ma is pursuing a master's degree at the University of California, San Diego. Both authors are open to professional opportunities.
If you are interested in our work or would like to connect, feel free to reach out:
Project GitHub: https://github.com/dongxiangtranslationproject Kaixuan Chen: [email protected] Bo Ma: [email protected]