Improving Sentiment Analysis of Tamil-English Code-Mixed Sentences
DOI:
https://doi.org/10.58445/rars.3273
Keywords:
Artificial Intelligence, Natural Language Processing, Sentiment Analysis, Machine Learning
Abstract
This paper investigates sentiment analysis for Tamil-English code-mixed text, a common feature of social media communication in multilingual regions. Code-mixing in Romanized Tamil introduces challenges such as inconsistent spelling, transliteration, and noisy syntax that traditional models are not designed to handle. Using the FIRE-DravidianCodeMix 2020 dataset, we evaluated sentiment classification with lexicon-based methods, classical machine learning models, deep learning (LSTM), the multilingual transformer RemBERT, and hybrid approaches combining lexicon-based features with machine learning models. Results showed that classical models such as Logistic Regression, Naive Bayes, and SVM achieved the most stable performance, reaching around 69% accuracy with weighted F1-scores near 0.60. Deep learning and transformer models offered no clear advantage, with both LSTM and RemBERT performing slightly lower than the classical models, plateauing near 67% accuracy and weighted F1-scores around 0.54. These results emphasize that lightweight statistical models remain the most reliable in noisy and resource-constrained code-mixed environments, while deep learning and transformer architectures require greater adaptation to succeed.
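The abstract reports accuracy near 69% alongside weighted F1-scores near 0.60. A gap of this kind can arise when a classifier favors the majority class; this is an assumption about the cause, not something the abstract states. The following is a minimal sketch of how the two metrics are computed. The labels and predictions are hypothetical toy data, not drawn from the FIRE-DravidianCodeMix 2020 dataset, and the function names are illustrative.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the gold label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged with weights proportional to class support."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for label, n in support.items():
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = n - tp
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        score += (n / total) * f1
    return score


# Hypothetical, imbalanced toy data: 7 positive, 2 negative, 1 mixed.
y_true = ["positive"] * 7 + ["negative"] * 2 + ["mixed"] * 1
# A degenerate classifier that always predicts the majority class.
y_pred = ["positive"] * 10

acc = accuracy(y_true, y_pred)
wf1 = weighted_f1(y_true, y_pred)
print(f"accuracy={acc:.3f} weighted_f1={wf1:.3f}")
```

On this toy data the always-positive classifier scores higher on accuracy than on weighted F1, because the minority classes contribute an F1 of zero. Reporting both metrics, as the abstract does, makes this kind of behavior visible.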
License
Copyright (c) 2025 Mohnish Sivakumar

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.