HYBRID FEATURE SELECTION FRAMEWORK FOR SENTIMENT ANALYSIS ON LARGE CORPORA

(Received: 5-Jan.-2021, Revised: 22-Feb.-2021, Accepted: 17-Mar.-2021)

Authors *Kayode S. Adewole, Abdullateef O. Balogun, Muiz O. Raheem, Muhammed K. Jimoh, Rasheed G. Jimoh, Modinat A. Mabayoje, Fatima E. Usman-Hamza, Abimbola G. Akintola, Ayisat W. Asaju-Gbolagade

Keywords #Sentiment analysis #Opinion mining #Hybrid feature selection #Boruta #Recursive feature elimination

Abstract Sentiment analysis has drawn considerable research attention in recent years owing to its applicability in determining users’ opinions, sentiments and emotions from large collections of textual data. The goal of sentiment analysis centres on improving users’ experience by deploying robust techniques that mine opinions and emotions from large corpora. Several studies have addressed sentiment analysis and opinion mining from textual information; however, domain-specific words, such as slang and abbreviations, together with grammatical mistakes, pose serious challenges to existing sentiment analysis methods. In this paper, we focus on identifying an effective, discriminative subset of features that can aid the classification of users’ opinions from large corpora. This study proposes a hybrid feature selection framework that combines filter- and wrapper-based feature selection methods. Correlation feature selection (CFS) is hybridized with Boruta and Recursive Feature Elimination (RFE) to identify the most discriminative feature subsets for sentiment analysis. Four publicly available sentiment analysis datasets (Amazon, Yelp, IMDB and Kaggle) are used to evaluate the performance of the proposed framework. Three classification algorithms, Support Vector Machine (SVM), Naïve Bayes and Random Forest, are evaluated to ascertain the superiority of the proposed approach. Experimental results across the different contexts represented by these datasets show that CFS combined with Boruta produced promising results, especially when the selected features are passed to the Random Forest classifier. The proposed hybrid framework provides an effective way of predicting users’ opinions and emotions while giving substantial consideration to predictive accuracy. The computing time of the resulting model is also shorter as a result of the proposed hybrid feature selection framework.
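
The filter stage of such a hybrid can be illustrated with a minimal sketch. The abstract does not give implementation details, so the code below is not the authors' implementation; it assumes the standard CFS merit heuristic, merit = k * r_cf / sqrt(k + k(k-1) * r_ff), with absolute Pearson correlation as the association measure and a greedy forward search. The function names (`pearson`, `cfs_merit`, `cfs_forward_select`) and the toy term-presence data are hypothetical and chosen for illustration only.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)


def cfs_merit(subset, columns, labels):
    """CFS merit of a feature subset: rewards correlation with the class,
    penalizes correlation among the features themselves."""
    k = len(subset)
    # Mean absolute feature-class correlation.
    r_cf = sum(abs(pearson(columns[f], labels)) for f in subset) / k
    # Mean absolute feature-feature correlation over all unordered pairs.
    pairs = [(a, b) for i, a in enumerate(subset) for b in subset[i + 1:]]
    r_ff = 0.0
    if pairs:
        r_ff = sum(abs(pearson(columns[a], columns[b])) for a, b in pairs) / len(pairs)
    return k * r_cf / math.sqrt(k + k * (k - 1) * r_ff)


def cfs_forward_select(columns, labels):
    """Greedy forward search: repeatedly add the feature that raises the merit
    the most, and stop as soon as no candidate improves it."""
    selected, best = [], 0.0
    remaining = sorted(columns)
    while remaining:
        merit, feat = max(
            (cfs_merit(selected + [f], columns, labels), f) for f in remaining
        )
        if merit <= best:
            break
        selected.append(feat)
        remaining.remove(feat)
        best = merit
    return selected


# Toy data (assumption): term presence in eight reviews, 1 = positive label.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
columns = {
    "great":   [1, 1, 1, 0, 0, 0, 0, 0],  # informative
    "awesome": [1, 1, 1, 0, 0, 0, 0, 0],  # redundant copy of "great"
    "bad":     [0, 0, 0, 0, 1, 1, 0, 1],  # informative
    "the":     [1, 0, 1, 0, 1, 0, 1, 0],  # uncorrelated with the label
}

selected = cfs_forward_select(columns, labels)
print(selected)
```

On this toy data the search keeps one of the two redundant positive terms together with "bad" and discards the uninformative "the". In a hybrid setup of the kind described in the abstract, the surviving features would then be handed to a wrapper method such as Boruta or RFE; that stage is not shown here.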

URL: https://jjcit.org/paper/130

ISSN: 2413-9351
Month: June

