Preprint / Version 1

Improving Sentiment Analysis of Tamil-English Code-Mixed Sentences

Authors:

  • Mohnish Sivakumar, Irvington High School

DOI:

https://doi.org/10.58445/rars.3273

Keywords:

Artificial Intelligence, Natural Language Processing, Sentiment Analysis, Machine Learning

Abstract

This paper investigates sentiment analysis for Tamil-English code-mixed text, a common feature of social media communication in multilingual regions. Code-mixing in Romanized Tamil introduces challenges such as inconsistent spelling, transliteration, and noisy syntax that traditional models are not designed to handle. Using the FIRE-DravidianCodeMix 2020 dataset, we evaluated sentiment classification with lexicon-based methods, classical machine learning models, a deep learning model (LSTM), the multilingual transformer RemBERT, and hybrid approaches that combine lexicon-based features with machine learning models. Classical models such as Logistic Regression, Naive Bayes, and SVM achieved the most stable performance, reaching around 69% accuracy with weighted F1-scores near 0.60. Deep learning and transformer models offered no clear advantage: both LSTM and RemBERT performed slightly below the classical models, plateauing near 67% accuracy with weighted F1-scores around 0.54. These results indicate that lightweight statistical models remain the most reliable in noisy, resource-constrained code-mixed settings, while deep learning and transformer architectures require greater adaptation to succeed.

Posted

2025-10-18