SEMANTIC RETRIEVAL FOR INDONESIAN QURAN AUTOCOMPLETION

(Received: 7-Dec.-2022, Revised: 1-Feb.-2023 and 22-Feb.-2023, Accepted: 5-Mar.-2023)

Authors Rian Adam Rajagede, Kholid Haryono, Rizan Qardafil

Keywords #Semantic retrieval #Quran auto-completion

Abstract Attending lectures is a common way to learn Islamic knowledge. The speaker addresses the audience while participants take notes, in notebooks or on digital devices, so that they do not forget the topics discussed. However, note-taking during a lecture can be challenging when the speaker provides no accompanying material. Lecturers differ in pace and delivery style, and participants cannot always stay focused throughout the lecture. These factors can cause problems in the note-taking process: details can be lost or the meaning can even shift. For sensitive material, such as verses from the Quran, notes must be taken carefully and without mistakes. In this study, we propose an autocomplete system for the Indonesian translation of the Quran that helps users take notes in Islamic lectures. The user writes down the words or parts of a Quran verse that he/she hears, and the system retrieves the most similar verses. With semantic retrieval, the user does not need to write down the exact words of the verse. The system can also handle the typographical errors that commonly occur in note-taking. For retrieval, we use FastText and calculate the cosine distance between the query and the verses. We also performed several optimization steps to make the system robust enough for production. The system is evaluated by comparing how close the returned verse is to the ground truth. The proposed method reached an accuracy of 70.59% for the top 5 retrieved verses and 76.47% for the top 10 retrieved verses.
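
The retrieval step described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation. It replaces the trained FastText model with a toy embedding that hashes character n-grams into a fixed-size vector, which is only meant to show why subword information tolerates typos. The names `embed`, `retrieve`, `top_k_accuracy`, the vector size `DIM`, and the verse identifiers are hypothetical. The sample sentences are placeholder Indonesian text, not the actual translation of any verse. The top-k accuracy function assumes that a query counts as correct when its ground-truth verse appears among the k retrieved verses.

```python
import math
import zlib

DIM = 256  # hypothetical vector size for this toy example


def char_ngrams(word, n_min=3, n_max=5):
    """Return character n-grams of a word padded with < and >, FastText-style."""
    padded = f"<{word}>"
    return [padded[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(padded) - n + 1)]


def embed(text):
    """Toy sentence vector: counts of hashed n-gram buckets, L2-normalized.

    A stand-in for a trained FastText model (assumption for illustration).
    """
    vec = [0.0] * DIM
    for word in text.lower().split():
        for gram in char_ngrams(word):
            # crc32 gives a hash that is stable across Python processes
            vec[zlib.crc32(gram.encode("utf-8")) % DIM] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


def cosine_distance(a, b):
    """Cosine distance between two unit-length vectors."""
    return 1.0 - sum(x * y for x, y in zip(a, b))


def retrieve(query, verses, k=5):
    """Return the ids of the k verses closest to the query."""
    q = embed(query)
    ranked = sorted(verses, key=lambda vid: cosine_distance(q, embed(verses[vid])))
    return ranked[:k]


def top_k_accuracy(queries, verses, k):
    """Share of queries whose ground-truth id is among the top-k results."""
    hits = sum(1 for query, truth in queries if truth in retrieve(query, verses, k))
    return hits / len(queries)


# Placeholder sentences, NOT actual verse translations
verses = {
    "v1": "air hujan turun dari langit",
    "v2": "orang sabar mendapat pahala besar",
    "v3": "berbuat baik kepada kedua orang tua",
}

# The first query contains a typo ("kpada") and omits a word
queries = [
    ("berbuat baik kpada orang tua", "v3"),
    ("hujan turun dari langit", "v1"),
]

if __name__ == "__main__":
    print(retrieve(queries[0][0], verses, k=2))
    print(top_k_accuracy(queries, verses, k=1))
```

In a real system, the verse vectors would be computed once and cached rather than re-embedded for every query, and `embed` would be backed by a trained (and possibly compressed) FastText model.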
