A PROPOSED MODEL OF SELECTING FEATURES FOR CLASSIFYING ARABIC TEXT

(Received: 2019-07-25, Revised: 2019-10-15, Accepted: 2019-11-02)

Authors Ahmed M. D. E. Hassanein, Mohamed Nour

Keywords #Text classification #Text clustering #Feature selection #Arabic datasets #Machine learning methods #Performance evaluation

Abstract Classification of Arabic text plays an important role in several applications. Text classification aims to assign predefined classes to text documents. Humans process unstructured Arabic text easily, but machines find it harder to interpret and understand, so pre-processing operations are needed before Arabic text or documents can be classified. This work proposes a model for selecting features from Arabic text; i.e., documents. In this work, the words ‘text’ and ‘documents’ are used interchangeably. The documents are taken from the Al-Khaleej-2004 corpus, which contains thousands of news documents in different domains, such as economics, as well as international, local and sports news. Pre-processing operations are carried out to extract the highly weighted terms that best describe the content of the documents. The proposed model selects the most relevant features in several steps, which begin once an initial set of features has been defined from the weighted words. The first step calculates the correlation between each feature and class one and, depending on a threshold value, keeps the most highly correlated features. The second step reduces the number of features again by calculating the intra-correlation between the resulting features. The third step selects the best features from those produced by the second step by applying logical operations, specifically logical AND or logical OR, which fuse the values of features depending on their structure, nature and semantics. This further reduces the number of features. The fourth step adopts the idea of document clustering: the features obtained from step three are placed in one cluster, and iterative operations then group them into two clusters. Each cluster can be further partitioned into two clusters, and so on, until the clusters' contents no longer change. The contents of each cluster are fused together using the cosine rule, which reduces the overall number of features. This work adopts four types of classifiers; namely, Naïve Bayes (NB), Decision Tree, CART and KNN. A comparative study examines the behavior of these classifiers on the selected features using four measurable criteria; namely, precision, recall, F-measure and accuracy. The work is implemented using the WEKA and MATLAB software packages. In the obtained results, the CART classifier achieves the best performance and the KNN classifier the worst.
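
The first two steps described in the abstract can be illustrated with a short Python sketch. This is a minimal, hypothetical rendering and not the authors' WEKA/MATLAB implementation. It assumes Pearson correlation, a 0/1 indicator for a single class, absolute correlation values, and thresholds of 0.5 and 0.9; none of these details are stated in the abstract. The function names, term names and term weights are invented for illustration.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def select_features(columns, labels, relevance_threshold=0.5, redundancy_threshold=0.9):
    """Return the names of the features kept after two correlation filters.

    columns: dict mapping a term to its weight in each document.
    labels:  0/1 class indicator per document (assumed one-vs-rest encoding).
    Both thresholds are illustrative assumptions.
    """
    # Step 1 (assumed form): keep features whose absolute correlation
    # with the class indicator reaches the relevance threshold.
    relevance = {name: abs(pearson(values, labels)) for name, values in columns.items()}
    candidates = [name for name, score in relevance.items() if score >= relevance_threshold]

    # Visit the most relevant features first, so that the stronger member
    # of a redundant pair is the one that survives.
    candidates.sort(key=lambda name: relevance[name], reverse=True)

    # Step 2 (assumed form): drop a feature that is strongly correlated
    # with a feature that has already been kept.
    kept = []
    for name in candidates:
        if all(abs(pearson(columns[name], columns[k])) < redundancy_threshold for k in kept):
            kept.append(name)
    return kept


# Hypothetical term weights for six documents; the first three belong to the class.
labels = [1, 1, 1, 0, 0, 0]
columns = {
    "market": [0.9, 0.8, 0.7, 0.1, 0.0, 0.2],
    "stocks": [0.8, 0.7, 0.6, 0.0, 0.0, 0.1],
    "today": [0.5, 0.1, 0.4, 0.5, 0.2, 0.3],
    "goal": [0.0, 0.1, 0.0, 0.9, 0.2, 0.8],
}
selected = select_features(columns, labels)
print(selected)
```

In this toy data, "today" is uncorrelated with the class and is filtered out by the first threshold, while "market" and "stocks" are nearly collinear, so only the more relevant of the two passes the second filter. Sorting by relevance before the redundancy check is a design choice of this sketch; the abstract does not say which member of a correlated pair the model retains.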

URL: https://jjcit.org/paper/68

ISSN: 2413-9351
Month: December
