CLUSTERING VIETNAMESE CONVERSATIONS FROM FACEBOOK PAGE TO BUILD TRAINING DATASET FOR CHATBOT
Trieu Hai Nguyen, Thi-Kim-Ngoan Pham, Thi-Hong-Minh Bui, Thanh-Quynh-Chau Nguyen
BERT, Clustering, Language models, Feature extraction, Word embeddings
The biggest challenge in building chatbots is obtaining training data, which must be realistic and large enough
to train them. We create a tool that collects real conversations from the Facebook Messenger inbox of a Facebook
page. After text preprocessing, we derive two datasets from the collected data: FVnC and the Sample dataset. We
use a retrained BERT model for Vietnamese (PhoBERT) to extract features from our text data. The K-Means and
DBSCAN algorithms then cluster the conversations based on the output embeddings of PhoBERTbase. We evaluate
clustering performance with the V-measure score and the Silhouette score. We also demonstrate the effectiveness
of PhoBERT for feature extraction compared to other models on the Sample dataset and a wiki dataset. In addition,
we propose a GridSearch algorithm that combines both clustering evaluation scores to find optimal parameters.
Clustering this large number of conversations saves considerable time and effort in building data and storylines
for training chatbots.
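
To make the idea of a grid search that combines both clustering evaluations concrete, the following is a minimal sketch, not the authors' implementation. It assumes that the two scores are combined by a simple average (the abstract does not say how they are combined), uses a small pure-Python DBSCAN, and replaces PhoBERT embeddings with toy 2-D points. All function names (`dbscan`, `v_measure`, `silhouette`, `grid_search`) and the parameter grid are hypothetical.

```python
import math
from collections import Counter
from itertools import product

def dist(a, b):
    # Euclidean distance between two equal-length vectors.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def dbscan(points, eps, min_samples):
    # Minimal DBSCAN: returns one label per point; -1 marks noise.
    n = len(points)
    nbrs = [[j for j in range(n) if dist(points[i], points[j]) <= eps]
            for i in range(n)]
    labels, cluster = [None] * n, -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(nbrs[i]) < min_samples:
            labels[i] = -1               # tentatively noise
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(nbrs[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster      # noise reached from a core point: border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(nbrs[j]) >= min_samples:
                queue.extend(nbrs[j])    # j is a core point: keep expanding
    return labels

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log(c / n) for c in Counter(labels).values())

def conditional_entropy(a, b):
    # H(a | b) from the joint label counts.
    n, b_counts = len(a), Counter(b)
    return -sum(c / n * math.log(c / b_counts[bv])
                for (_, bv), c in Counter(zip(a, b)).items())

def v_measure(truth, pred):
    # Harmonic mean of homogeneity and completeness (needs reference labels).
    h_c, h_k = entropy(truth), entropy(pred)
    hom = 1.0 if h_c == 0 else 1.0 - conditional_entropy(truth, pred) / h_c
    com = 1.0 if h_k == 0 else 1.0 - conditional_entropy(pred, truth) / h_k
    return 0.0 if hom + com == 0 else 2 * hom * com / (hom + com)

def silhouette(points, labels):
    # Mean silhouette coefficient; noise (-1) is treated as its own cluster.
    clusters = {}
    for p, l in zip(points, labels):
        clusters.setdefault(l, []).append(p)
    if len(clusters) < 2:
        return -1.0                      # assumption: undefined case gets the worst score
    scores = []
    for p, l in zip(points, labels):
        own = clusters[l]
        if len(own) == 1:
            scores.append(0.0)
            continue
        a = sum(dist(p, q) for q in own) / (len(own) - 1)
        b = min(sum(dist(p, q) for q in other) / len(other)
                for k, other in clusters.items() if k != l)
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)

def grid_search(points, truth, eps_values, min_samples_values):
    # Assumption: combined score = plain average of V-measure and Silhouette.
    best = None
    for eps, m in product(eps_values, min_samples_values):
        labels = dbscan(points, eps, m)
        score = 0.5 * (v_measure(truth, labels) + silhouette(points, labels))
        if best is None or score > best[0]:
            best = (score, {"eps": eps, "min_samples": m}, labels)
    return best

# Toy stand-in for sentence embeddings: two well-separated groups.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1),
          (5.0, 5.0), (5.1, 5.0), (5.0, 5.1), (5.1, 5.1)]
truth = [0, 0, 0, 0, 1, 1, 1, 1]
best_score, best_params, best_labels = grid_search(
    points, truth, eps_values=[0.05, 0.2, 10.0], min_samples_values=[2, 3])
print(best_params, round(best_score, 3))
```

Because the V-measure needs reference labels while the Silhouette score does not, a combined criterion of this kind can only be computed on a labeled subset; on unlabeled data only the Silhouette term is available. Averaging the two is one of several plausible ways to combine them.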