CLUSTERING VIETNAMESE CONVERSATIONS FROM FACEBOOK PAGE TO BUILD TRAINING DATASET FOR CHATBOT


(Received: 26-Sep.-2021, Revised: 10-Dec.-2021, Accepted: 28-Dec.-2021)
The biggest challenge in building chatbots is obtaining training data: the data must be realistic and large enough to train a chatbot well. We create a tool that collects real conversations from the Facebook Messenger inbox of a Facebook page. After text preprocessing, the collected data yields two datasets, FVnC and a smaller Sample dataset. We use PhoBERT, a pre-trained language model for Vietnamese, to extract features from our text data, and apply the K-Means and DBSCAN clustering algorithms to the output embeddings of PhoBERTbase. Clustering performance is evaluated with the V-measure score and the Silhouette score. We also demonstrate the efficiency of PhoBERT for feature extraction compared to other models on the Sample dataset and a wiki dataset. In addition, we propose a grid-search algorithm that combines both clustering evaluation measures to find optimal parameters. By clustering such a large number of conversations, we save considerable time and effort in building training data and storylines for chatbots.
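The pipeline described above — embed sentences, cluster with K-Means or DBSCAN, and select parameters by clustering score — can be sketched as follows. This is an illustrative sketch, not the authors' exact implementation: synthetic random vectors stand in for PhoBERT embeddings, the parameter grid and DBSCAN `eps` value are arbitrary choices for the toy data, and only the Silhouette score drives the selection here.

```python
# Illustrative sketch of the clustering/evaluation loop, using scikit-learn.
# Synthetic vectors stand in for PhoBERT sentence embeddings (768-d).
import numpy as np
from sklearn.cluster import KMeans, DBSCAN
from sklearn.metrics import silhouette_score, v_measure_score

rng = np.random.default_rng(0)
# Two well-separated synthetic groups play the role of conversation topics.
emb = np.vstack([rng.normal(0.0, 0.1, (50, 768)),
                 rng.normal(1.0, 0.1, (50, 768))])
labels_true = np.array([0] * 50 + [1] * 50)  # ground truth for V-measure

# Grid search over k, keeping the k with the best Silhouette score
# (an internal measure, so it needs no ground-truth labels).
best_k, best_sil = None, -1.0
for k in range(2, 6):
    cand = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(emb)
    sil = silhouette_score(emb, cand)
    if sil > best_sil:
        best_k, best_sil = k, sil

pred = KMeans(n_clusters=best_k, n_init=10, random_state=0).fit_predict(emb)
print("best k:", best_k)
print("V-measure vs. ground truth:", round(v_measure_score(labels_true, pred), 3))

# DBSCAN needs no preset cluster count, but eps must be tuned
# (e.g. from a k-distance plot); 5.0 is a hand-picked value for this toy data.
db = DBSCAN(eps=5.0, min_samples=5).fit_predict(emb)
print("DBSCAN clusters (noise excluded):", len(set(db) - {-1}))
```

In practice the embeddings would come from a PhoBERT feature-extraction pass over the preprocessed conversations, and the grid search would combine both scores rather than Silhouette alone.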
