Efficient ASR for Low-Resource Languages: Leveraging Cross-Lingual Unlabeled Data
arxiv.org·2d

Abstract: Automatic speech recognition for low-resource languages remains fundamentally constrained by the scarcity of labeled data and computational resources required by state-of-the-art models. We present a systematic investigation into cross-lingual continual pretraining for low-resource languages, using Perso-Arabic languages (Persian, Arabic, and Urdu) as our primary case study. Our approach demonstrates that strategic utilization of unlabeled speech data can effectively bridge the resource gap without sacrificing recognition accuracy. We construct a 3,000-hour multilingual corpus through a scalable unlabeled data collection pipeline and employ targeted continual pretrain…
