(Received: 2018-09-11, Revised: 2018-10-05, Accepted: 2018-10-25)
Root extraction is an important primary step in most Arabic applications, such as information retrieval systems, text mining, text classification, question answering systems, data compression, indexing, spell checking, text summarization and machine translation. Any weakness in root extraction negatively affects the performance of these applications. Sonbol’s Arabic root extraction algorithm achieves high accuracy and introduces a new classification of Arabic letters that minimizes affix ambiguity. Comparing and testing the existing Arabic root extraction algorithms on unified datasets shows that they still need enhancement. Arabic root extraction relies mainly on patterns: the more patterns an algorithm has, the higher its accuracy. In this study, we improve Sonbol’s Arabic root extraction algorithm by enhancing its rules and increasing the number of its patterns. We use 4320 patterns to extract the roots, which is the largest pattern list extracted from Thalji’s corpus. We test the new algorithm on Thalji’s corpus, which contains 720,000 word-root pairs and was built mainly to test and compare Arabic root extraction algorithms. The new algorithm is compared with Sonbol’s Arabic root extraction algorithm: the algorithm of Sonbol et al. achieves an accuracy of 68%, whereas the new algorithm achieves an accuracy of 92%.
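
To illustrate the general idea of pattern-based root extraction mentioned above, the following is a minimal sketch only. It is not Sonbol's algorithm nor the improved algorithm of this study. It assumes the common convention that the letters ف, ع and ل in a pattern mark the root positions and that every other pattern letter must match the word literally. The function names, the three-pattern list and the small evaluation helper are hypothetical and chosen for illustration.

```python
# Illustrative sketch of pattern-based root extraction (assumption: not the
# algorithm evaluated in this study). Pattern letters ف ع ل are treated as
# placeholders for root letters; all other pattern letters are literals.
ROOT_SLOTS = "فعل"


def match_pattern(word, pattern):
    """Return the root if the word fits the pattern, otherwise None."""
    if len(word) != len(pattern):
        return None
    root = []
    for word_letter, pattern_letter in zip(word, pattern):
        if pattern_letter in ROOT_SLOTS:
            root.append(word_letter)      # slot position: collect a root letter
        elif word_letter != pattern_letter:
            return None                   # literal (affix) letter does not match
    return "".join(root)


def extract_root(word, patterns):
    """Try each pattern in order and return the first root found."""
    for pattern in patterns:
        root = match_pattern(word, pattern)
        if root is not None:
            return root
    return None


def accuracy(pairs, patterns):
    """Fraction of (word, root) pairs whose root is extracted correctly."""
    correct = sum(1 for word, root in pairs if extract_root(word, patterns) == root)
    return correct / len(pairs)


# Hypothetical toy pattern list; a real list would be far larger.
PATTERNS = ["مفعول", "فاعل", "مفعل"]

if __name__ == "__main__":
    print(extract_root("مكتوب", PATTERNS))
    print(extract_root("كاتب", PATTERNS))
```

A word that fits none of the listed patterns yields no root in this sketch, which is one simple way to see why a larger pattern list can raise accuracy. A real extractor must also handle cases this sketch ignores, such as affix letters that coincide with the placeholder letters and weak or transformed root letters.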
