PROCESSING TOOLS FOR CORPUS LINGUISTICS: A CASE STUDY ON ARABIC HISTORICAL CORPUS


(Received: 10-May-2024, Revised: 2-Jul.-2024, Accepted: 29-Jul.-2024)
This paper explores the design, development and reconstruction of a Historical Arabic Corpus (HAC) that covers more than 1600 years of uninterrupted language use. The study emphasizes the technical steps taken to enhance the system and provide a usable concordancer, and reports simple experiments conducted on the corpus and the concordancer. Arabic has a rich literary and cultural heritage spanning thousands of years. The availability of digital resources and advances in natural language processing (NLP) technology have made Arabic historical corpora increasingly important for researchers and learners worldwide. By integrating HAC and its tools into Arabic language learning, learners can delve deeper into vocabulary and culture and gain insights that improve their language skills and understanding of Arabic. This combination of human guidance and NLP technology makes learning engaging and enjoyable, offering a dynamic and authentic way to master the Arabic language.
