COTA 2.0: AN AUTOMATIC CORRECTOR OF TUNISIAN ARABIC SOCIAL MEDIA TEXTS


(Received: 17-Jun.-2022, Revised: 7-Sep.-2022 and 18-Oct.-2022, Accepted: 10-Nov.-2022)
Orthographic noise in written text is a common concern for NLP, especially when processing social-network comments and other raw documents. For Tunisian Arabic (TA), this noise is mainly due to the dialect's orthographic conventions and morphological ambiguity. We propose to automatically normalize social-media dialect corpora by following CODA-TA, the Conventional Orthography for TA. The existing system developed for TA, «COTA Orthography 1.0», is not able to handle all forms of TA. We therefore extend its rules and lexicons to address the peculiarities of the social-media dialect. For certain words, COTA Orthography 1.0 offers the user several possible corrections; in the new version, we incorporate a trigram language model to select the right correction automatically. Our results show that the system can reduce transcription errors by 95.72%.
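
The selection step, in which a trigram language model chooses among several candidate corrections, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the add-one smoothing, the exhaustive search over candidate combinations, and the placeholder English tokens are all assumptions made for illustration only.

```python
from collections import Counter
from itertools import product
import math


def train_trigram_counts(sentences):
    """Count trigrams and their two-word contexts over tokenized sentences."""
    tri, bi, vocab = Counter(), Counter(), set()
    for tokens in sentences:
        padded = ["<s>", "<s>"] + tokens + ["</s>"]
        vocab.update(padded)
        for i in range(2, len(padded)):
            tri[tuple(padded[i - 2:i + 1])] += 1
            bi[tuple(padded[i - 2:i])] += 1
    return tri, bi, len(vocab)


def log_prob(tokens, tri, bi, vocab_size):
    """Log-probability of a token sequence under an add-one smoothed trigram model."""
    padded = ["<s>", "<s>"] + tokens + ["</s>"]
    total = 0.0
    for i in range(2, len(padded)):
        context = tuple(padded[i - 2:i])
        trigram = tuple(padded[i - 2:i + 1])
        # Add-one smoothing is an assumption; the paper's smoothing is not stated here.
        total += math.log((tri[trigram] + 1) / (bi[context] + vocab_size))
    return total


def select_corrections(candidates, tri, bi, vocab_size):
    """Pick one correction per position by scoring every combination.

    `candidates` holds one list of candidate corrections per input word.
    Exhaustive search is only practical for short inputs with few candidates.
    """
    best = max(
        product(*candidates),
        key=lambda seq: log_prob(list(seq), tri, bi, vocab_size),
    )
    return list(best)


# Toy usage with placeholder tokens (hypothetical data, not from the paper).
corpus = [["the", "cat", "sat"], ["the", "cat", "ran"], ["a", "dog", "sat"]]
tri, bi, vocab_size = train_trigram_counts(corpus)
print(select_corrections([["the"], ["cat", "cut"], ["sat"]], tri, bi, vocab_size))
```

Scoring whole sequences rather than isolated words lets the surrounding context decide between candidates, which is the role the abstract assigns to the language model. A practical system would likely replace the exhaustive search with beam or Viterbi decoding.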
