Code-Mix Sentiment Analysis on Hinglish Tweets

Abstract: The effectiveness of brand monitoring in India is increasingly challenged by the rise of Hinglish, a hybrid of Hindi and English that is widely used in user-generated content on platforms like Twitter. Traditional Natural Language Processing (NLP) models, built for monolingual data, often fail to interpret the syntactic and semantic complexity of this code-mixed language, resulting in inaccurate sentiment analysis and misleading market insights. To address this gap, we propose a high-performance sentiment classification framework specifically designed for Hinglish tweets. Our approach fine-tunes mBERT (Multilingual BERT), leveraging its multilingual capabilities to better understand the linguistic diversity of Indian social media. A key component of our methodology is the use of subword tokenization, which enables the model to effectively manage spelling variations, slang, and out-of-vocabulary terms common in Romanized Hinglish. This research delivers a production-ready AI solution for brand sentiment tracking and establishes a strong benchmark for multilingual NLP in low-resource, code-mixed environments.
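
The subword tokenization mentioned in the abstract can be illustrated with a minimal sketch of greedy longest-match-first WordPiece splitting, the scheme mBERT uses. This is not the authors' code: the function name `wordpiece_tokenize` and the toy vocabulary `TOY_VOCAB` are hypothetical and invented for illustration, and the real mBERT vocabulary would split these words differently. The point is only that different Romanized spellings of the same Hindi word decompose into known pieces instead of collapsing to an unknown token.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Split one word into subword pieces by greedy longest-match-first.

    Non-initial pieces carry the '##' continuation prefix, as in BERT.
    If any position cannot be matched, the whole word maps to `unk`.
    """
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest candidate first, then shrink from the right.
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return [unk]
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary (an assumption, not taken from mBERT).
TOY_VOCAB = {"ac", "ach", "##cha", "##ha", "##a"}

# Three common Romanized spellings of the Hindi word for "good".
for spelling in ["accha", "achha", "acha"]:
    print(spelling, wordpiece_tokenize(spelling, TOY_VOCAB))
```

With a vocabulary learned from real data, the shared pieces give the model overlapping evidence across spelling variants, which is the behaviour the abstract attributes to subword tokenization.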


Comments:	Accepted at the 9th International Conference on Natural Language Processing and Information Retrieval (NLPIR 2025), Fukuoka, Japan
Subjects:	Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Cite as:	arXiv:2601.05091 [cs.CL]
	(or arXiv:2601.05091v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2601.05091 arXiv-issued DOI via DataCite (pending registration)

Submission history

From: Aneshya Das
[v1] Thu, 8 Jan 2026 16:39:26 UTC (4,542 KB)
