ACCURATE AND FAST RECURRENT NEURAL NETWORK SOLUTION FOR THE AUTOMATIC DIACRITIZATION OF ARABIC TEXT


(Received: 2-Sep-2019, Revised: 27-Oct-2019 and 21-Nov-2019, Accepted: 16-Dec-2019)
Arabic is now mostly written without its diacritics (short vowels). Adding these diacritics reduces reading ambiguity, among other benefits. This work aims to develop a fast and accurate machine learning solution that diacritizes Arabic text automatically, using long short-term memory (LSTM) recurrent neural networks. Intensive experiments are performed to evaluate alternative design and data encoding options. The experiments investigate and handle problems with sequence lengths, propose and evaluate alternative encodings of the diacritized output sequences, and tune and evaluate neural network options, including architecture, network size and hyper-parameters. This paper recommends a solution that can be trained quickly on a large dataset and uses four bidirectional LSTM layers to predict the diacritics of the input sequence of Arabic letters. This solution achieves a diacritization error rate of 2.46% on the LDC ATB3 benchmark dataset and 1.97% on the larger, new Tashkeela dataset. The latter rate is a 47% improvement over the best previously published result.
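To make the input and output sequences and the reported metric concrete, the following minimal Python sketch separates diacritized text into a letter sequence and a per-letter diacritic label sequence, then computes a diacritization error rate. It is an illustration under stated assumptions, not the paper's implementation: the function names are hypothetical, the error rate is assumed to be the percentage of characters whose diacritic label differs from the reference, and every non-diacritic character (including spaces) is counted. The paper's own encodings and counting rules may differ.

```python
# Arabic diacritic code points U+064B..U+0652
# (tanween forms, fatha, damma, kasra, shadda, sukun).
DIACRITICS = {chr(c) for c in range(0x064B, 0x0653)}


def split_diacritics(text):
    """Split text into (letters, labels).

    labels[i] holds the diacritics attached to letters[i], or "" if none.
    Hypothetical helper: this is one simple encoding, assumed for illustration.
    """
    letters, labels = [], []
    for ch in text:
        if ch in DIACRITICS:
            if letters:  # ignore a stray diacritic with no preceding letter
                labels[-1] += ch
        else:
            letters.append(ch)
            labels.append("")
    # Sort each label so that shadda+vowel and vowel+shadda compare equal.
    labels = ["".join(sorted(label)) for label in labels]
    return "".join(letters), labels


def diacritization_error_rate(reference, predicted):
    """Percentage of characters whose predicted label differs from the reference.

    Assumed definition; both strings must contain the same undiacritized letters.
    """
    ref_letters, ref_labels = split_diacritics(reference)
    pred_letters, pred_labels = split_diacritics(predicted)
    if ref_letters != pred_letters:
        raise ValueError("reference and prediction must share the same letters")
    if not ref_labels:
        return 0.0
    errors = sum(r != p for r, p in zip(ref_labels, pred_labels))
    return 100.0 * errors / len(ref_labels)
```

Under this encoding, a sequence model of the kind described above would take the letter sequence as input and emit one label per letter, and the error rate compares those labels position by position.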
