
		<paper>
			<loc>https://jjcit.org/paper/106</loc>
			<title>A SCALABLE SHALLOW LEARNING APPROACH FOR TAGGING ARABIC NEWS ARTICLES</title>
			<doi>10.5455/jjcit.71-1585409230</doi>
			<authors>Leen Al Qadi,Hozayfa El Rifai,Safa Obaid,Ashraf Elnagar</authors>
			<keywords>Arabic text classification,Single-label classification,Multi-label classification,Arabic datasets,Shallow learning classifiers</keywords>
			<citation>11</citation>
			<views>9936</views>
			<downloads>2463</downloads>
			<received_date>28-Mar.-2020</received_date>
			<revised_date>23-Jun.-2020 and 15-Jul.-2020</revised_date>
			<accepted_date>16-Jul.-2020</accepted_date>
			<abstract>Text classification is the process of automatically tagging a textual document with the most relevant set of labels. 
The aim of this work is to automatically tag an input document based on its vocabulary features. To achieve this 
goal, two large datasets have been constructed from various Arabic news portals. The first dataset consists of
90k single-labeled articles from 4 domains (Business, Middle East, Technology and Sports). The second dataset
has over 290k multi-tagged articles. The datasets shall be made freely available to the research community on
Arabic computational linguistics. To examine the usefulness of both datasets, we implemented an array of ten
shallow learning classifiers. In addition, we implemented an ensemble model that combines the best classifiers
in a majority-voting classifier. The performance of the classifiers on the first dataset ranged between 87.7%
(AdaBoost) and 97.9% (SVM). Analyzing some of the misclassified articles confirmed the need for multi-label,
as opposed to single-label, categorization for better classification results. We used classifiers that are compatible
with multi-labeling tasks, such as Logistic Regression and XGBoost. We tested the multi-label classifiers on the
second, larger dataset. A custom accuracy metric, designed for the multi-labeling task, was developed for
performance evaluation, along with the Hamming loss metric. XGBoost proved to be the best multi-labeling
classifier, scoring an accuracy of 91.3%, higher than the Logistic Regression score of 87.6%.</abstract>
		</paper>


