
ARABIC SOFT SPELLING CORRECTION WITH T5


(Received: 12-Nov.-2023, Revised: 3-Jan.-2024, Accepted: 20-Jan.-2024)
Spelling correction is a challenging task for resource-scarce languages. Arabic is one such language: it lacks a large spelling-correction dataset, so datasets injected with artificial errors are used to overcome this problem. In this paper, we train the Text-to-Text Transfer Transformer (T5) model on artificially injected errors to correct Arabic soft spelling mistakes. Our T5 model corrects 97.8% of the artificial errors injected into the test set and achieves a character error rate (CER) of 0.77% on a set containing real soft spelling mistakes. We achieved these results with a 4-layer T5 model trained at a 90% error-injection rate and a maximum sequence length of 300 characters.
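Since the paper's code is not reproduced here, the following is a minimal Python sketch of the two ideas the abstract rests on: injecting artificial soft spelling errors into clean text, and scoring corrections by character error rate (CER). The confusion sets (hamzated alef forms, taa marbuta vs. haa, alef maqsura vs. yaa) and the per-eligible-character reading of the 90% injection rate are assumptions about the usual definition of Arabic soft errors, not the authors' exact procedure.

```python
import random

# Assumed confusion sets for Arabic "soft" spelling mistakes; the paper's
# exact letter classes may differ.
CONFUSION_SETS = [
    list("اأإآ"),  # plain alef and hamzated alef forms
    list("ةه"),    # taa marbuta vs. haa
    list("ىي"),    # alef maqsura vs. yaa
]
CHAR_TO_SET = {c: s for s in CONFUSION_SETS for c in s}

def inject_soft_errors(text: str, rate: float = 0.9, seed=None) -> str:
    """Replace each eligible character with another member of its confusion
    set with probability `rate` (a hypothetical reading of the paper's
    90% error-injection rate)."""
    rng = random.Random(seed)
    out = []
    for ch in text:
        s = CHAR_TO_SET.get(ch)
        if s is not None and rng.random() < rate:
            out.append(rng.choice([c for c in s if c != ch]))
        else:
            out.append(ch)
    return "".join(out)

def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: Levenshtein edit distance divided by the
    reference length, computed with the standard dynamic program."""
    m, n = len(reference), len(hypothesis)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            cur[j] = min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + cost)
        prev = cur
    return prev[n] / max(m, 1)

if __name__ == "__main__":
    clean = "ذهبت إلى المكتبة"          # example clean sentence
    noisy = inject_soft_errors(clean, rate=0.9, seed=0)
    print(noisy, cer(clean, noisy))      # corrupted text and its CER vs. clean
```

In a training pipeline of this kind, `inject_soft_errors` would produce the model's input sequences while the clean text serves as the target, and `cer` would score the model's decoded output against the clean reference.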
