Comparison of Random Forest Algorithm Classifier and Naïve Bayes Algorithm in Whatsapp Message Type Classification

Abdul Hadi; Mukti Qamal; Yesy Afrillia

doi:10.29103/jreece.v5i1.21227

Abstract

This study compares the Random Forest and Naïve Bayes algorithms for classifying WhatsApp messages into three categories: normal, promotional, and fraudulent. WhatsApp has over 2.78 billion active users worldwide and is used by 90% of Indonesian internet users, and its end-to-end encryption makes automatic spam detection difficult, which motivates a machine learning approach. A dataset of 300 messages, equally distributed across the three categories, was preprocessed through cleansing, case folding, stopword removal, normalization, and stemming, then converted to numerical form using TF-IDF vectorization. In the experiments, Naïve Bayes outperformed Random Forest in accuracy (88.67% vs. 86.00%), precision (89.64% vs. 88.95%), recall (88.67% vs. 86.00%), and F1-score (88.61% vs. 85.99%). Ten-fold cross-validation further confirmed that Naïve Bayes was more consistent and stable across all evaluation metrics. Naïve Bayes was also far cheaper to train, requiring only 0.13 seconds compared with 3.65 seconds for Random Forest. Confusion matrix analysis showed that Naïve Bayes was particularly effective at distinguishing normal from fraudulent messages, which is crucial for protecting users from scams. The model identified key fraud indicators such as "claim," "account," and "verification" and remained precise in ambiguous cases. These findings support the development of more effective spam detection for encrypted messaging platforms, where traditional filtering mechanisms cannot be applied, and can thereby improve user safety and experience through automated identification of potentially harmful content.
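The Naïve Bayes side of the pipeline described in the abstract can be illustrated with a minimal, dependency-free Python sketch. It is not the authors' implementation. The six training messages, the stopword list, and all function names are invented for illustration. Normalization, stemming, the Random Forest model, and cross-validation are omitted. The multinomial variant with Laplace smoothing and the smoothed IDF formula are assumptions, since the abstract does not specify them.

```python
import math
import re
from collections import Counter, defaultdict

# Hypothetical stopword list; the study's actual list is not given in the abstract.
STOPWORDS = {"the", "a", "to", "your", "is", "for", "and", "on",
             "all", "we", "me", "are", "by", "this", "now"}

def preprocess(text):
    """Case folding, cleansing (non-letters become spaces), stopword removal."""
    text = re.sub(r"[^a-z\s]", " ", text.lower())
    return [tok for tok in text.split() if tok not in STOPWORDS]

def fit_idf(docs):
    """Smoothed inverse document frequency: ln((1 + N) / (1 + df)) + 1."""
    n = len(docs)
    df = Counter(tok for doc in docs for tok in set(doc))
    return {tok: math.log((1 + n) / (1 + c)) + 1 for tok, c in df.items()}

def tfidf(doc, idf):
    """TF-IDF weights for one tokenized document; unseen tokens are dropped."""
    tf = Counter(tok for tok in doc if tok in idf)
    return {tok: count * idf[tok] for tok, count in tf.items()}

def train_nb(vectors, labels, vocab, alpha=1.0):
    """Multinomial Naive Bayes over TF-IDF weights with Laplace smoothing."""
    weight_sums = defaultdict(Counter)
    for vec, label in zip(vectors, labels):
        weight_sums[label].update(vec)  # adds the TF-IDF weights per token
    model = {}
    for label, n_docs in Counter(labels).items():
        total = sum(weight_sums[label].values()) + alpha * len(vocab)
        log_lik = {tok: math.log((weight_sums[label][tok] + alpha) / total)
                   for tok in vocab}
        model[label] = (math.log(n_docs / len(labels)), log_lik)
    return model

def predict(model, vec):
    """Return the class with the highest log-posterior score."""
    def score(label):
        log_prior, log_lik = model[label]
        return log_prior + sum(w * log_lik[tok] for tok, w in vec.items())
    return max(model, key=score)

# Invented toy data, two messages per class (the study used 300 messages).
TRAIN = [
    ("Are we still meeting for lunch tomorrow", "normal"),
    ("Please send me the meeting notes tonight", "normal"),
    ("Big discount on shoes this weekend only", "promotional"),
    ("Get 50% discount on all shoes today", "promotional"),
    ("Claim your prize now by account verification", "fraudulent"),
    ("Your account is blocked, send verification code to claim", "fraudulent"),
]

docs = [preprocess(text) for text, _ in TRAIN]
labels = [label for _, label in TRAIN]
idf = fit_idf(docs)
model = train_nb([tfidf(doc, idf) for doc in docs], labels, set(idf))

def classify(message):
    """Preprocess, vectorize, and classify a single raw message."""
    return predict(model, tfidf(preprocess(message), idf))

if __name__ == "__main__":
    for msg in ["Lunch meeting tomorrow",
                "Discount on shoes today",
                "Verification needed: claim your account prize"]:
        print(msg, "->", classify(msg))
```

Training reduces to summing token weights per class, which is consistent with the short Naïve Bayes training time reported above. A faithful reproduction would also need the study's own dataset, a language-appropriate stemmer, and a Random Forest baseline.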

Keywords

WhatsApp Classification, Message Classification, Naïve Bayes, Random Forest, Text Mining

