ENHANCING THE ACCURACY OF SONBOL’S ARABIC ROOT EXTRACTION ALGORITHM

(Received: 2018-09-11, Revised: 2018-10-05, Accepted: 2018-10-25)
Root extraction is an important primary step in most Arabic applications, such as information retrieval systems, text mining, text classifiers, question answering systems, data compression, indexing, spelling checkers, text summarization and machine translation. Any weakness in root extraction negatively affects the performance of these applications. Sonbol’s Arabic root extraction algorithm achieves high accuracy and introduces a new classification of Arabic letters that minimizes affix ambiguity. Comparison and testing of the existing Arabic root extraction algorithms on unified datasets show that they still need enhancement. Arabic root extraction relies mainly on patterns: the more patterns an algorithm has, the higher its accuracy. In this study, we improve Sonbol’s Arabic root extraction algorithm by enhancing its rules and increasing its number of patterns. We use 4320 patterns to extract the roots, which is the largest pattern list, extracted from Thalji’s corpus. We test the new algorithm on Thalji’s corpus, which contains 720,000 word-root pairs and was built mainly to test and compare Arabic root extraction algorithms. The new algorithm is compared with Sonbol’s original algorithm: the algorithm of Sonbol et al. achieves an accuracy of 68%, whereas the new algorithm achieves an accuracy of 92%.
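To make the pattern-based idea concrete, the following is a minimal sketch of pattern matching and accuracy measurement over word-root pairs. It is not the algorithm of Sonbol et al. or the enhanced algorithm described here. The three patterns, the toy word-root pairs and the function names (`match_pattern`, `extract_root`, `accuracy`) are assumptions made for illustration only. It assumes triliteral roots and patterns in which each of the placeholder letters ف, ع and ل appears exactly once.

```python
# Illustrative sketch only; not the algorithm of Sonbol et al. or its enhancement.
# Assumption: patterns use the letters ف, ع, ل as placeholders for the first,
# second and third root letters, and each placeholder appears exactly once.
ROOT_SLOTS = {"ف": 0, "ع": 1, "ل": 2}

# Hypothetical three-entry pattern list (the study itself uses 4320 patterns).
PATTERNS = ["مفعول", "فاعل", "استفعل"]


def match_pattern(word, pattern):
    """Return the root if `word` fits `pattern`, otherwise None."""
    if len(word) != len(pattern):
        return None
    root = [None, None, None]
    for word_char, pattern_char in zip(word, pattern):
        if pattern_char in ROOT_SLOTS:
            # Placeholder position: the word's letter belongs to the root.
            root[ROOT_SLOTS[pattern_char]] = word_char
        elif word_char != pattern_char:
            # Fixed (affix) position: the letters must be identical.
            return None
    if None in root:
        return None
    return "".join(root)


def extract_root(word, patterns=PATTERNS):
    """Try each pattern in order and return the first root found."""
    for pattern in patterns:
        root = match_pattern(word, pattern)
        if root is not None:
            return root
    return None


def accuracy(pairs, patterns=PATTERNS):
    """Fraction of (word, expected_root) pairs whose root is extracted correctly."""
    if not pairs:
        return 0.0
    correct = sum(1 for word, root in pairs if extract_root(word, patterns) == root)
    return correct / len(pairs)


if __name__ == "__main__":
    # Toy pairs for illustration; they are not taken from Thalji's corpus.
    sample = [("مكتوب", "كتب"), ("كاتب", "كتب"), ("استخرج", "خرج"), ("مدرسة", "درس")]
    for word, expected in sample:
        print(word, extract_root(word), expected)
    print(accuracy(sample))
```

In this sketch the last toy word fits none of the three patterns, so no root is returned for it. This shows, on a very small scale, why a larger pattern list can raise accuracy.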
  1. W. Abo Thuaaib, History of Semitic Languages, Lebanon: Darul Kalam for Pub. and Printing, 2016.
  2. A. Al-Taani and S. A. Al-Rub, "A Rule-based Approach for Tagging Non-vocalized Arabic Words," The International Arab Journal of Information Technology, vol. 6, no. 3, pp. 320-328, 2009.
  3. R. Sonbol, N. Ghneim and M. S. Desouki, "Arabic Morphological Analysis: A New Approach," Information and Communication Technologies: From Theory to Applications, Proc. of the IEEE 3rd International Conference, pp. 1-6, 2008.
  4. S. Khoja and R. Garside, "Stemming Arabic Text," Computing Department, Lancaster Univ., UK, 1999.
  5. E. Al-Shawakfa, A. Al-Badarneh, S. Shatnawi, K. Al-Rabab’ah and B. Bani-Ismail, "A Comparison Study of Some Arabic Root Findings," Journal of the American Society for Information Science and Technology, vol. 61, no. 5, pp. 1015-1024, 2010.
  6. N. Thalji, N. A. Hanin, Y. Yacob and S. Al-Hakeem, "Corpus for Test, Compare and Enhance Arabic Root Extraction Algorithms," International Journal of Advanced Computer Science and Applications, vol. 8, no. 5, pp. 229-236, 2017.
  7. M. Sawalha and E. Atwell, "Comparative Evaluation of Arabic Language Morphological Analyzers and Stemmers," Proc. of COLING 22nd Inter. Conference on Computational Linguistics, pp. 107-110, 2008.
  8. R. Alshalabi, "Pattern-based Stemmer for Finding Arabic Roots," Information Technology Journal, pp. 38-43, 2005.
  9. M. N. Al-Kabi and R. Al-Mustafa, "Arabic Root-based Stemmer," Proceedings of the International Arab Conference on Information Technology, 2006.
  10. S. Ghwanmeh, S. Rabab'Ah, R. Al-Shalabi and G. Kanaan, "Enhanced Algorithm for Extracting the Root of Arabic Words," Proc. of the 6th International Conference on Computer Graphics, Imaging and Visualization, pp. 388-391, 2009.
  11. Z. Kchaou and S. Kanoun, "Arabic Stemming with Two Dictionaries," IEEE International Conference on Innovations in Information Technology, pp. 688-691, 2008.
  12. M. El-Defrawy, Y. El-Sonbaty and N. Belal, "A Rule-based Subject-correlated Arabic Stemmer," Arabian Journal for Science and Engineering, vol. 41, no. 8, pp. 2883-2891, 2016.
  13. A. Ayedh and T. Guanzheng, "Building and Benchmarking Novel Arabic Stemmer for Document Classification," Journal of Computational and Theoretical Nanoscience, vol. 13, no. 3, pp. 1527-1535, 2016.
  14. A. Pasha, M. Al-Badrashiny, M. Diab, A. El Kholy, R. Eskander, N. Habash, M. Pooleery, O. Rambow and R. Roth, "MADAMIRA: A Fast, Comprehensive Tool for Morphological Analysis and Disambiguation of Arabic," Proceedings of the 9th International Conference on Language Resources and Evaluation (LREC'14), Japan, 2014.
  15. N. Habash and O. Rambow, "Arabic Tokenization, Part-of-Speech Tagging and Morphological Disambiguation in One Fell Swoop," Proceedings of the 43rd Annual Meeting of Association for Computational Linguistics, pp. 573-580, Association for Computational Linguistics, Michigan, 2005.
  16. N. Habash, O. Rambow and R. Roth, "MADA+TOKAN: A Toolkit for Arabic Tokenization, Diacritization, Morphological Disambiguation, POS Tagging, Stemming and Lemmatization," Proc. of the 2nd International Conference on Arabic Language Resources and Tools (MEDAR), Egypt, 2009.
  17. N. Habash, R. Roth, O. Rambow, R. Eskander and N. Tomeh, "Morphological Analysis and Disambiguation for Dialectal Arabic," Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, New Orleans, 2013.
  18. M. Diab, K. Hacioglu and D. Jurafsky, "Automated Methods for Processing Arabic Text: from Tokenization to Base Phrase Chunking," Arabic Computational Morphology: Knowledge-based and Empirical Methods, Kluwer/Springer, 2007.
  19. M. Boudchiche, A. Mazroui, M. Bebah, A. Lakhouaja and A. Boudlal, "Al-Khalil Morphological System 2: A Robust Arabic Morpho-syntactic Analyzer," Journal of King Saud University-Computer and Information Sciences, vol. 29, no. 2, pp. 141-146, 2017.
  20. M. Ababneh, R. Al-Shalabi, G. Kanaan and A. Al-Nobani, "Building an Effective Rule-based Light Stemmer for Arabic Language to Improve Search Effectiveness," Int. Arab Jour. of IT, vol. 9, no. 4, 2012.
  21. K. Taghva, R. Elkhoury and J. Coombs, "Arabic Stemming without a Root Dictionary," Proc. of the IEEE International Conference on Information Technology: Coding and Computing, pp. 152-157, 2005.
  22. M. Sawalha and E. Atwell, "Corpus Linguistics Resources and Tools for Arabic Lexicography," Proceedings of the Workshop on Arabic Corpus Linguistics (UCREL), 2011.
  23. K. Mezher and O. Nazlia, "A Backpropagation Neural Network to Improve Arabic Stemming," Journal of Theoretical and Applied Information Technology, vol. 82, no. 3, pp. 385-394, 2015.
  24. G. Kanaan, R. Al-Shalabi and M. Sawalha, "Full Automatic Arabic Text Tagging System," Proceedings of the International Conference on Information Technology and Natural Sciences, pp. 258-267, 2003.
  25. E. Al-Shammari and J. Lin, "A Novel Arabic Lemmatization Algorithm," Proceedings of the 2nd Workshop on Analytics for Noisy Unstructured Text Data (ACM), pp. 113-118, 2008.
  26. H. M. Al-Serhan, R. Al Shalabi and G. Kanaan, "New Approach for Extracting Arabic Roots," Proceedings of the Arab Conference on Information Technology, pp. 42-59, 2003.
  27. M. N. Al-Kabi, S. A. Kazakzeh, B. M. Abu Ata, S. A. Al-Rababah and I. M. Alsmadi, "A Novel Root-based Arabic Stemmer," Journal of King Saud University-Computer and Information Sciences, pp. 94-103, 2015.
  28. A.-K. N. Al-Kabi, "Towards Improving Khoja Rule-based Arabic Stemmer," Proc. of the IEEE Jordan Conference on Applied Electrical Engineering and Computing Technologies (AEECT), pp. 1-6, 2013.
  29. S. Al-Fedaghi and F. S. Al-Anzi, "A New Algorithm to Generate Arabic Root-pattern Forms," Proceedings of the 11th National Computer Conference and Exhibition, 1989.
  30. F. Abu Hawas and K. E. Emmert, "Rule-based Approach for Arabic Root Extraction: New Rules to Directly Extract Roots of Arabic Words," Journal of Computing and Information Technology, vol. 22, no. 1, pp. 57-68, 2014.
  31. K. Abainia, S. Ouamour and H. Sayoud, "A