(Received: 10-Nov.-2022, Revised: 3-Jan.-2023 and 31-Jan.-2023, Accepted: 13-Feb.-2023)
Text readability assessment is a well-developed research area for several languages, but work on the Arabic language remains highly limited. The main challenge in this area is to identify an optimal set of features that represents a text and allows its readability level to be evaluated. To address this challenge, we propose in this study various feature selection methods that can retrieve the set of discriminating features representing Arabic texts. The second aim of this paper is to evaluate different sentence-embedding approaches (Arabic-BERT, AraBERT and XLM-R) and to compare their performance with that obtained using the selected linguistic features. We performed experiments with both SVM and Random Forest classifiers on two different corpora dedicated to learning Arabic as a foreign language (L2). The results show that reducing the number of features improves the performance of the readability-prediction models by more than 25% and 16% on the two corpora, respectively. In addition, the fine-tuned Arabic-BERT model performs better than the other sentence-embedding methods, but it yields less improvement than the feature-based models. Combining these methods with the most discriminating features produced the best performance.
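
As a rough illustration of the idea of reducing a feature set before classification, the following minimal sketch ranks features by the absolute Pearson correlation between each feature and the readability level, then keeps the top k. This is a generic filter-style selection chosen for illustration only; it is not claimed to be one of the methods used in the study. The function names (`pearson`, `select_top_k`, `project`), the three toy features and all values are assumptions invented for the example.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no ranking information
    return cov / sqrt(var_x * var_y)


def select_top_k(rows, levels, k):
    """Return the indices of the k features most correlated with the level."""
    n_features = len(rows[0])
    scores = []
    for j in range(n_features):
        column = [row[j] for row in rows]
        scores.append((abs(pearson(column, levels)), j))
    # Highest score first; ties are broken by the lower feature index.
    scores.sort(key=lambda s: (-s[0], s[1]))
    return [j for _, j in scores[:k]]


def project(rows, kept):
    """Keep only the selected feature columns, in ranked order."""
    return [[row[j] for j in kept] for row in rows]


# Toy data (hypothetical): one row per text, columns are
# [average sentence length, type-token ratio, average word length].
rows = [
    [5, 0.5, 3],
    [6, 0.7, 4],
    [9, 0.6, 5],
    [10, 0.5, 5],
    [14, 0.7, 6],
    [15, 0.6, 7],
]
levels = [1, 1, 2, 2, 3, 3]  # hypothetical readability levels

kept = select_top_k(rows, levels, k=2)
reduced = project(rows, kept)
print("kept feature indices:", kept)
print("reduced rows:", reduced)
```

The reduced rows could then be passed to any classifier, such as the SVM or Random Forest models mentioned above. A filter of this kind scores each feature independently, so it cannot detect features that are only useful in combination.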

[1] M. Al-Ayyoub, A. A. Khamaiseh, Y. Jararweh and M. N. Al-Kabi, "A Comprehensive Survey of Arabic Sentiment Analysis," Inf. Processing and Management, vol. 56, no. 2, pp. 320–342, 2019.

[2] S. Berrichi and A. Mazroui, "Addressing Limited Vocabulary and Long Sentences Constraints in English–Arabic Neural Machine Translation," Arabian J. for Science and Engineering, vol. 46, pp. 8245–8259, 2021.

[3] N. Nassiri, A. Lakhouaja and V. Cavalli-Sforza, "Arabic L2 Readability Assessment: Dimensionality Reduction Study," J. of King Saud Uni., Comp. and Inf. Sci., vol. 34, pp. 3789–3799, 2022.

[4] V. Cavalli-Sforza, H. Saddiki and N. Nassiri, "Arabic Readability Research: Current State and Future Directions," Procedia Computer Science, vol. 142, pp. 38–49, 2018.

[5] T. Deutsch, M. Jasbi and S. Shieber, "Linguistic Features for Readability Assessment," Proc. of the 15th Workshop on Innovative Use of NLP for Building Educational Applications, pp. 1–17, Association for Computational Linguistics, Seattle, USA, 2020.

[6] D. D. Lewis, "Challenges in Machine Learning for Text Classification," Proc. of the 9th Annual Conference on Computational Learning Theory, pp. 1–ff, 1996.

[7] X.-D. Wang, R.-C. Chen, F. Yan, Z.-Q. Zeng and C.-Q. Hong, "Fast Adaptive K-Means Subspace Clustering for High-dimensional Data," IEEE Access, vol. 7, pp. 42639–42651, 2019.

[8] I. Guyon, S. Gunn, M. Nikravesh and L. A. Zadeh, Feature Extraction: Foundations and Applications, ISBN: 978-3540354871, Springer, 2008.

[9] J. N. Forsyth, Automatic Readability Prediction for Modern Standard Arabic, Ph.D. Thesis, Department of Linguistics and English Language, Brigham Young University, USA, 2014.

[10] V. Cavalli-Sforza, M. El Mezouar and H. Saddiki, "Matching an Arabic Text to a Learners’ Curriculum," Proc. of the 5th Int. Conf. on Arabic Lang. Process. (CITALA), p. 10, Morocco, 2014.

[11] H. Saddiki, K. Bouzoubaa and V. Cavalli-Sforza, "Text Readability for Arabic as a Foreign Language," Proc. of the 2015 IEEE/ACS 12th Int. Conf. of Computer Systems and Applications (AICCSA), pp. 1–8, Marrakech, Morocco, 2015.

[12] H. Saddiki, N. Habash, V. Cavalli-Sforza and M. Al-Khalil, "Feature Optimization for Predicting Readability of Arabic L1 and L2," Proc. of the 5th Workshop on Natural Language Processing Techniques for Educational Applications, pp. 20–29, DOI: 10.18653/v1/W18-3703, Melbourne, Australia, 2018.

[13] N. Nassiri, A. Lakhouaja and V. Cavalli-Sforza, "Modern Standard Arabic Readability Prediction," Proc. of the Int. Conf. on Arabic Language Processing (ICALP 2017), Part of the Communications in Computer and Information Science Book Series, vol. 782, pp. 120–133, 2018.

[14] N. Nassiri, A. Lakhouaja and V. Cavalli-Sforza, "Arabic Readability Assessment for Foreign Language Learners," Proc. of the Int. Conf. on Applications of Natural Language to Information Systems (NLDB 2018), vol. 10859, pp. 480–488, 2018.

[15] B. W. Lee, Y. S. Jang and J. H.-J. Lee, "Pushing on Text Readability Assessment: A Transformer Meets Handcrafted Linguistic Features," Proc. of the 2021 Conf. on Empirical Methods in Natural Lang. Process., pp. 10669–10686, DOI: 10.18653/v1/2021.emnlp-main.834, Dominican Rep., 2021.

[16] N. Khallaf and S. Sharoff, "Automatic Difficulty Classification of Arabic Sentences," Proc. of the 6th Arabic Natural Language Processing Workshop, Association for Computational Linguistics (WANLP 2021), pp. 105–114, DOI: 10.48550/arXiv.2103.04386, 2021.

[17] V. N. Vapnik, "Controlling the Generalization Ability of Learning Processes," Chapter 4 in Book: The Nature of Statistical Learning Theory, pp. 89–118, Springer, 2000.

[18] O. Al-Harbi, "A Comparative Study of Feature Selection Methods for Dialectal Arabic Sentiment Classification Using Support Vector Machine," Int. J. of Computer Science and Network Security, vol. 19, no. 1, pp. 167–176, January 2019.

[19] H. Uğuz, "A Two-stage Feature Selection Method for Text Categorization by Using Information Gain, Principal Component Analysis and Genetic Algorithm," Knowledge-based Systems, vol. 24, no. 7, pp. 1024–1032, 2011.

[20] M. A. Hall, Correlation-based Feature Selection for Machine Learning, Ph.D. Thesis, Department of Computer Science, University of Waikato, New Zealand, 1999.

[21] S. Bahassine, A. Madani, M. Al-Sarem and M. Kissi, "Feature Selection Using an Improved Chi-square for Arabic Text Classification," J. of King Saud Uni.-Comp. and Inf. Sci., vol. 32, no. 2, pp. 225–231, 2020.

[22] R. Elhassan and M. Ali, "The Impact of Feature Selection Methods for Classifying Arabic Texts," Proc. of the 2nd Int. Conf. on Comp. App. Inf. Secur. (ICCAIS), pp. 1–6, Riyadh, KSA, 2019.

[23] A. Elnahas, N. Elfishawy, M. Nour and M. Tolba, "Machine Learning and Feature Selection Approaches for Categorizing Arabic Text: Analysis, Comparison and Proposal," The Egyptian Journal of Language Engineering, vol. 7, no. 2, pp. 1–19, 2020.

[24] I. Guyon, J. Weston, S. Barnhill and V. Vapnik, "Gene Selection for Cancer Classification Using Support Vector Machines," Machine Learning, vol. 46, no. 1, pp. 389–422, 2002.

[25] B. E. Boser, I. M. Guyon and V. N. Vapnik, "A Training Algorithm for Optimal Margin Classifiers," Proc. of the 5th Annual Workshop on Computational Learning Theory, pp. 144–152, DOI: 10.1145/130385.130401, 1992.

[26] J. L. D. Clark and R. T. Clifford, "The FSI/ILR/ACTFL Proficiency Scales and Testing Techniques: Development, Current Status and Needed Research," Studies in Second Language Acquisition, vol. 10, no. 2, pp. 129–147, 1988.

[27] A. Pasha, M. Al-Badrashiny, M. T. Diab et al., "MADAMIRA: A Fast, Comprehensive Tool for Morphological Analysis and Disambiguation of Arabic," Proc. of the 9th Int. Conf. on Language Resources and Evaluation (LREC’14), vol. 14, pp. 1094–1101, Reykjavik, Iceland, 2014.

[28] I. A. El-Khair, "1.5 Billion Words Arabic Corpus," CoRR abs/1611.04033, arXiv: 1611.04033, DOI: 10.48550/arXiv.1611.04033, 2016.

[29] A. Conneau, K. Khandelwal, N. Goyal et al., "Unsupervised Cross-lingual Representation Learning at Scale," Proc. of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 8440–8451, DOI: 10.18653/v1/2020.acl-main.747, 2020.

[30] A. Safaya, M. Abdullatif and D. Yuret, "KUISAIL at SemEval-2020 Task 12: BERT-CNN for Offensive Speech Identification in Social Media," Proc. of the 14th Workshop on Semantic Evaluation, pp. 2054–2059, DOI: 10.18653/v1/2020.semeval-1.271, Barcelona, Spain, 2020.

[31] S. Al-Aqeel, N. Abanmy, A. Aldayel, H. Al-Khalifa, M. Al-yahya and M. Diab, "Readability of Written Medicine Information Materials in Arabic Language: Expert and Consumer Evaluation," BMC Health Services Research, vol. 18, DOI: 10.1186/s12913-018-2944-x, 2018.

[32] E. Halboub, M. S. Al-Ak’hali, H. M. Al-Mekhlafi and M. N. Alhajj, "Quality and Readability of Web-based Arabic Health Information on COVID-19: An Infodemiological Study," BMC Public Health, vol. 21, no. 1, pp. 1–7, 2021.

[33] Z. Jasem, Z. AlMeraj and D. Alhuwail, "Evaluating Breast Cancer Websites Targeting Arabic Speakers: Empirical Investigation of Popularity, Availability, Accessibility, Readability and Quality," BMC Medical Informatics and Decision Making, vol. 22, no. 1, pp. 1–15, 2022.

[34] W. Daelemans, J. Zavrel, K. van der Sloot and A. van den Bosch, "TiMBL: Tilburg Memory Based Learner," Technical Report, Version 6.3, ILK Research Group Technical Report.