Tokenization · arxiv.org · Academic

IHUBERT: Vector-Based Semantic Deduplication and Domain-Balanced Pretraining for Persian Resources

Persian pretrained language models (PLMs) are still limited by the scarcity of large-scale, high-quality pretraining corpora and by insufficient evaluation beyond standard classification and NER tasks. We present IHUBERT, a monolingual Persian PLM trained from scratch with the RoBERTa-base encoder (125M parameters) on a 45 GB curated subset of the Sepahr-Danesh collection (about 7-8B tokens). To improve corpus quality and reduce redundancy, we...
