https://jjcit.org/paper/106
A SCALABLE SHALLOW LEARNING APPROACH FOR TAGGING ARABIC NEWS ARTICLES
10.5455/jjcit.71-1585409230
Leen Al Qadi,Hozayfa El Rifai,Safa Obaid,Ashraf Elnagar
Arabic text classification,Single-label classification,Multi-label classification,Arabic datasets,Shallow learning classifiers
5192
1387
28-Mar.-2020
23-Jun.-2020 and 15-Jul.-2020
16-Jul.-2020
Text classification is the process of automatically tagging a textual document with the most relevant set of labels.
The aim of this work is to automatically tag an input document based on its vocabulary features. To achieve this
goal, two large datasets have been constructed from various Arabic news portals. The first dataset consists of
90k single-labeled articles from 4 domains (Business, Middle East, Technology and Sports). The second dataset
has over 290k multi-tagged articles. The datasets will be made freely available to the Arabic computational
linguistics research community. To examine the usefulness of both datasets, we implemented an array of ten
shallow learning classifiers. In addition, we implemented an ensemble model that combines the best classifiers
in a majority-voting classifier. The performance of the classifiers on the first dataset ranged between 87.7%
(AdaBoost) and 97.9% (SVM). Analyzing some of the misclassified articles confirmed the need for multi-label,
as opposed to single-label, categorization for better classification results. We used classifiers that are compatible
with multi-labeling tasks, such as Logistic Regression and XGBoost. We tested the multi-label classifiers on the
second, larger dataset. A custom accuracy metric, designed for the multi-labeling task, was developed for
performance evaluation along with the Hamming loss metric. XGBoost proved to be the best multi-labeling
classifier, scoring an accuracy of 91.3%, higher than the Logistic Regression score of 87.6%.
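
To make the multi-label evaluation concrete, the following minimal Python sketch computes the standard Hamming loss over binary label-indicator vectors, alongside exact-match (subset) accuracy for contrast. The function names and the toy label vectors are assumptions made for illustration. The abstract does not define the custom accuracy metric, so it is not reproduced here; subset accuracy is a different, standard metric shown only as a point of comparison.

```python
from typing import Sequence


def hamming_loss(y_true: Sequence[Sequence[int]],
                 y_pred: Sequence[Sequence[int]]) -> float:
    """Fraction of individual label slots that are predicted incorrectly."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and equally long")
    wrong = 0
    total = 0
    for truth, pred in zip(y_true, y_pred):
        if len(truth) != len(pred):
            raise ValueError("each pair of label vectors must have equal length")
        total += len(truth)
        wrong += sum(1 for t, p in zip(truth, pred) if t != p)
    return wrong / total


def subset_accuracy(y_true: Sequence[Sequence[int]],
                    y_pred: Sequence[Sequence[int]]) -> float:
    """Fraction of documents whose full label set is predicted exactly."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and equally long")
    exact = sum(1 for t, p in zip(y_true, y_pred) if list(t) == list(p))
    return exact / len(y_true)


# Toy example (hypothetical data): three articles, four candidate tags each.
y_true = [[1, 0, 1, 0], [0, 1, 0, 0], [1, 1, 0, 0]]
y_pred = [[1, 0, 1, 0], [0, 1, 1, 0], [1, 0, 0, 0]]

print(hamming_loss(y_true, y_pred))     # 2 wrong slots out of 12
print(subset_accuracy(y_true, y_pred))  # 1 exact match out of 3
```

In this toy case only one article out of three is tagged exactly right, yet only 2 of 12 label slots are wrong. This illustrates why a per-label measure such as Hamming loss can be more informative than exact matching when each article carries several tags.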