A Multiclass Classification Model Combining IndoBERT Embeddings and Long Short-Term Memory for Indonesian-Language Tweets

Published: Nov 11, 2022


Purpose: This research aims to improve on the text classification performance of previous studies by combining the IndoBERT pre-trained model with a Long Short-Term Memory (LSTM) architecture to classify Indonesian-language tweets into several categories.

Method: This research uses multiclass text classification, combining the pre-trained IndoBERT model with a Long Short-Term Memory (LSTM) architecture. The dataset was collected by crawling the Twitter API. The resulting model is then compared with a Word2Vec-LSTM model and a fine-tuned IndoBERT model.

Result: The IndoBERT-LSTM model with the best hyperparameter combination (batch size of 16, learning rate of 2e-5, and average pooling) achieved an F1-score of 98.90% on the unmodified dataset (a 0.70% increase over the Word2Vec-LSTM model and 0.40% over the fine-tuned IndoBERT model) and 92.83% on the modified dataset (a 4.51% increase over the Word2Vec-LSTM model and 0.69% over the fine-tuned IndoBERT model). However, the improvement over the fine-tuned IndoBERT model is not very significant, and the Word2Vec-LSTM model has a much faster total training time.
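The IndoBERT-LSTM pipeline described above (contextual embeddings fed to an LSTM, average-pooled over tokens, then classified) can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes IndoBERT's 768-dimensional token embeddings are already computed by a separate encoder, and the LSTM hidden size and four-class output are placeholder values.

```python
import torch
import torch.nn as nn

class IndoBertLstmClassifier(nn.Module):
    """Sketch of an embedding -> LSTM -> average-pooling -> classifier model.

    Assumes 768-dim token embeddings from an IndoBERT encoder; the hidden
    size and number of classes are illustrative placeholders, not the
    paper's exact configuration.
    """
    def __init__(self, embed_dim=768, hidden_dim=256, num_classes=4):
        super().__init__()
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, embeddings):          # (batch, seq_len, embed_dim)
        outputs, _ = self.lstm(embeddings)  # (batch, seq_len, hidden_dim)
        pooled = outputs.mean(dim=1)        # average pooling over tokens
        return self.classifier(pooled)      # (batch, num_classes) logits

# Stand-in for IndoBERT output: a batch of 2 tweets, 32 tokens each.
model = IndoBertLstmClassifier()
logits = model(torch.randn(2, 32, 768))
print(logits.shape)  # torch.Size([2, 4])
```

Training such a model would use the best-scenario hyperparameters reported above (batch size 16, learning rate 2e-5) with a standard cross-entropy loss over the class logits.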

1. Text Classification
2. Indonesian Tweets
3. IndoBERT
4. Long Short-Term Memory
1. Thariq Iskandar Zulkarnain Maulana Putra
2. Suprapto Suprapto
3. Arif Farhan Bukhori
How to Cite
Putra, T. I. Z. M., Suprapto, S., & Bukhori, A. F. (2022). Model Klasifikasi Berbasis Multiclass Classification dengan Kombinasi Indobert Embedding dan Long Short-Term Memory untuk Tweet Berbahasa Indonesia. Jurnal Ilmu Siber Dan Teknologi Digital, 1(1), 1–28.

