
		<paper>
			<loc>https://jjcit.org/paper/150</loc>
			<title>CLUSTERING VIETNAMESE CONVERSATIONS FROM FACEBOOK PAGE TO BUILD TRAINING DATASET FOR CHATBOT</title>
			<doi>10.5455/jjcit.71-1632557439</doi>
			<authors>Trieu Hai Nguyen,Thi-Kim-Ngoan Pham,Thi-Hong-Minh Bui,Thanh-Quynh-Chau Nguyen</authors>
			<keywords>BERT,Clustering,Language models,Feature extraction,Word embeddings</keywords>
			<citation>2</citation>
			<views>5149</views>
			<downloads>1192</downloads>
			<received_date>26-Sep-2021</received_date>
			<revised_date>10-Dec-2021</revised_date>
			<accepted_date>28-Dec-2021</accepted_date>
			<abstract>The biggest challenge in building chatbots is training data: the data must be realistic and large enough to train the chatbot. We create a tool to collect actual training data from the Facebook Messenger of a Facebook page. After text-preprocessing steps, the newly obtained data yields the FVnC and Sample datasets. We use the Retraining of BERT for Vietnamese (PhoBERT) to extract features from our text data. The K-Means and DBSCAN clustering algorithms are applied to the output embeddings of PhoBERTbase. We use the V-measure score and the Silhouette score to evaluate the performance of the clustering algorithms. We also demonstrate the efficiency of PhoBERT compared to other models for feature extraction on the Sample dataset and a wiki dataset. A GridSearch algorithm that combines both clustering evaluations is also proposed to find optimal parameters. Thanks to clustering such a large number of conversations, we save considerable time and effort in building data and storylines for training the chatbot.</abstract>
		</paper>