A Multiclass Classification Model Combining IndoBERT Embeddings and Long Short-Term Memory for Indonesian-Language Tweets
Abstract:
Purpose: This research aims to improve the performance of text classification models from previous studies by combining the pre-trained IndoBERT model with the Long Short-Term Memory (LSTM) architecture to classify Indonesian-language tweets into several categories.
Method: A multiclass text classification approach was used, combining the pre-trained IndoBERT model with a Long Short-Term Memory (LSTM) network. The dataset was collected by crawling the Twitter API. The resulting model was then compared with a Word2Vec-LSTM model and a fine-tuned IndoBERT model.
Result: The IndoBERT-LSTM model with the best hyperparameter combination (batch size of 16, learning rate of 2e-5, and average pooling) achieved an F1-score of 98.90% on the unmodified dataset (a 0.70% increase over the Word2Vec-LSTM model and 0.40% over the fine-tuned IndoBERT model) and 92.83% on the modified dataset (a 4.51% increase over the Word2Vec-LSTM model and 0.69% over the fine-tuned IndoBERT model). However, the improvement over the fine-tuned IndoBERT model is not substantial, and the Word2Vec-LSTM model has a much shorter total training time.
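The pipeline described above feeds IndoBERT contextual embeddings into an LSTM layer, averages the hidden states over the sequence (the "average pooling" setting of the best scenario), and classifies the pooled vector with a softmax layer. The sketch below illustrates this flow with plain NumPy; it is not the authors' implementation. The random vectors standing in for IndoBERT token embeddings, the 128-unit hidden size, and the 5 example classes are all illustrative assumptions (IndoBERT's base embedding dimension of 768 is real).

```python
import numpy as np

# Hypothetical sketch of the IndoBERT-LSTM head described in the abstract.
# Random vectors stand in for IndoBERT token embeddings; hidden_dim and the
# number of classes are illustrative assumptions, not values from the paper.

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_forward(x, hidden_dim):
    """Run a single-layer LSTM over x of shape (seq_len, input_dim)."""
    seq_len, input_dim = x.shape
    # One stacked weight block for the four gates: input, forget, cell, output.
    W = rng.normal(scale=0.1, size=(4 * hidden_dim, input_dim + hidden_dim))
    b = np.zeros(4 * hidden_dim)
    h = np.zeros(hidden_dim)
    c = np.zeros(hidden_dim)
    outputs = []
    for t in range(seq_len):
        z = W @ np.concatenate([x[t], h]) + b
        i = sigmoid(z[0 * hidden_dim:1 * hidden_dim])   # input gate
        f = sigmoid(z[1 * hidden_dim:2 * hidden_dim])   # forget gate
        g = np.tanh(z[2 * hidden_dim:3 * hidden_dim])   # candidate cell state
        o = sigmoid(z[3 * hidden_dim:4 * hidden_dim])   # output gate
        c = f * c + i * g
        h = o * np.tanh(c)
        outputs.append(h)
    return np.stack(outputs)  # (seq_len, hidden_dim)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Stand-in for IndoBERT output: 32 tokens of 768-dim contextual embeddings.
embeddings = rng.normal(size=(32, 768))
states = lstm_forward(embeddings, hidden_dim=128)
pooled = states.mean(axis=0)                             # average pooling
logits = rng.normal(scale=0.1, size=(5, 128)) @ pooled   # 5 example classes
probs = softmax(logits)
print(probs.shape, round(float(probs.sum()), 6))         # prints: (5,) 1.0
```

In a real implementation the embedding step would come from a pre-trained IndoBERT checkpoint (e.g. via the Hugging Face transformers library), and the LSTM and dense weights would be trained with cross-entropy loss at the batch size and learning rate reported above rather than randomly initialized.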