TAG RECOMMENDATION FOR SHORT ARABIC TEXT BY USING LATENT SEMANTIC ANALYSIS OF WIKIPEDIA

(Received: 8-Dec-2019, Revised: 22-Jan-2020 and 16-Feb-2020, Accepted: 24-Feb-2020)

Authors Iyad AlAgha, Yousef Abu-Samra

Keywords #Tag recommendation #Arabic #Short text #Latent semantic analysis #Wikipedia #Apache Spark

Abstract Text tagging has gained growing attention as a way of associating metadata that supports information retrieval and classification. To resolve the difficulties of manual tagging, tag recommendation has emerged as a solution that assists users by presenting a list of relevant tags. However, most existing approaches to tag recommendation focus on domain-specific tagging and on long-form text. Open-domain tagging is challenging because it requires comprehensive background knowledge and involves intensive computation. Tagging short text is also problematic because short text offers few statistical features to extract. In terms of language, most efforts have focused on text written in English. The tagging of Arabic text has been hindered by the difficulty of processing the Arabic language and the lack of knowledge sources in Arabic. This work proposes an approach to tag recommendation for short Arabic text. It uses the Arabic Wikipedia as background knowledge to generate tags in response to short input text. Latent semantic analysis is applied to Wikipedia content to find the articles relevant to the input text. Tags are then selected from the titles and categories of these articles and ranked according to relevance. The approach was evaluated using experts' ratings of the relevance of 993 tags. Results showed that the approach achieved 84.39% mean average precision and 96.53% mean reciprocal rank. A thorough discussion of the results highlights the limitations and the strengths of the approach.
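The two evaluation measures named in the abstract can be illustrated with a minimal sketch. It assumes each input text yields a ranked tag list whose entries experts marked as relevant or not, and that average precision is normalized by the number of relevant tags found in that list; the abstract does not state which normalization was used. The function names and the example data are hypothetical and are not taken from the paper.

```python
from typing import Sequence


def average_precision(relevance: Sequence[bool]) -> float:
    """Average of precision@k over the ranks k that hold a relevant tag.

    Assumption: normalized by the number of relevant tags in the list.
    """
    hits = 0
    precision_sum = 0.0
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


def reciprocal_rank(relevance: Sequence[bool]) -> float:
    """1 / rank of the first relevant tag, or 0.0 if none is relevant."""
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            return 1.0 / rank
    return 0.0


def mean_average_precision(runs: Sequence[Sequence[bool]]) -> float:
    """Mean of average precision over all ranked tag lists."""
    return sum(average_precision(r) for r in runs) / len(runs) if runs else 0.0


def mean_reciprocal_rank(runs: Sequence[Sequence[bool]]) -> float:
    """Mean of reciprocal rank over all ranked tag lists."""
    return sum(reciprocal_rank(r) for r in runs) / len(runs) if runs else 0.0


# Hypothetical ratings for two input texts (True = tag rated relevant).
example_runs = [
    [True, False, True],  # first relevant tag at rank 1
    [False, True],        # first relevant tag at rank 2
]

if __name__ == "__main__":
    print("MAP:", mean_average_precision(example_runs))
    print("MRR:", mean_reciprocal_rank(example_runs))
```

Under these definitions, a high mean reciprocal rank indicates that the first relevant tag usually appears at or near the top of the list, while mean average precision also rewards placing the remaining relevant tags early.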
