Research Article

Breast Cancer Classification with Principal Component Analysis and Smote using Random Forest Method and Support Vector Machine

by Rian Oktafiani, Enny Itje Sela
International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 186 - Number 16
Year of Publication: 2024
Authors: Rian Oktafiani, Enny Itje Sela
DOI: 10.5120/ijca2024923537

Rian Oktafiani, Enny Itje Sela. Breast Cancer Classification with Principal Component Analysis and Smote using Random Forest Method and Support Vector Machine. International Journal of Computer Applications. 186, 16 (Apr 2024), 1-8. DOI=10.5120/ijca2024923537

@article{ 10.5120/ijca2024923537,
author = { Rian Oktafiani, Enny Itje Sela },
title = { Breast Cancer Classification with Principal Component Analysis and Smote using Random Forest Method and Support Vector Machine },
journal = { International Journal of Computer Applications },
issue_date = { Apr 2024 },
volume = { 186 },
number = { 16 },
month = { Apr },
year = { 2024 },
issn = { 0975-8887 },
pages = { 1-8 },
numpages = {9},
url = { https://ijcaonline.org/archives/volume186/number16/breast-cancer-classification-with-principal-component-analysis-and-smote-using-random-forest-method-and-support-vector-machine/ },
doi = { 10.5120/ijca2024923537 },
publisher = {Foundation of Computer Science (FCS), NY, USA},
address = {New York, USA}
}
%0 Journal Article
%1 2024-04-27T03:06:46.073487+05:30
%A Rian Oktafiani
%A Enny Itje Sela
%T Breast Cancer Classification with Principal Component Analysis and Smote using Random Forest Method and Support Vector Machine
%J International Journal of Computer Applications
%@ 0975-8887
%V 186
%N 16
%P 1-8
%D 2024
%I Foundation of Computer Science (FCS), NY, USA
Abstract

Inaccurate breast cancer classification can put patients' lives at risk. The high dimensionality and imbalanced class distribution of breast cancer medical data present a challenge for machine learning techniques. Moreover, studies that examine the parameters of the algorithm model are still scarce, and inappropriate parameter selection may lead to low accuracy. To classify breast cancer, this study compares the Random Forest (RF) and Support Vector Machine (SVM) algorithms. The parameters analyzed are the max depth parameter in Random Forest and the linear, polynomial, and RBF kernels in Support Vector Machine. Principal Component Analysis (PCA) is used for feature reduction, and the Synthetic Minority Oversampling Technique (SMOTE) is used to overcome class imbalance. The SVM method achieves the best accuracy, 99.07%, with precision, recall, and F1 score of 99%, using the RBF kernel with the number of PCA components set to 6. Random Forest achieves its best test accuracy of 98.32%, with precision, recall, and F1 score of 98%, using max depth = 8 and 6 PCA components. It can therefore be concluded that using SMOTE and PCA can improve accuracy and that the SVM method is better than RF for breast cancer classification. Future studies can test various datasets to examine the impact of additional parameters and classification techniques.
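The workflow summarized in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. Several choices are assumptions made only for illustration: the copy of the Wisconsin Diagnostic data bundled with scikit-learn (`load_breast_cancer`), an 80/20 stratified split, standardization before PCA, applying PCA before oversampling, oversampling only the training split, and default hyperparameters apart from `max_depth=8` and `n_components=6`. The helper `smote_oversample` is a hypothetical, simplified SMOTE for the two-class case, not a library function.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC


def smote_oversample(X, y, k=5, seed=0):
    """Simplified two-class SMOTE (hypothetical helper): create synthetic
    minority samples by interpolating between a minority sample and one of
    its k nearest minority neighbours until both classes are equal in size."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    minority = classes[np.argmin(counts)]
    n_new = int(counts.max() - counts.min())
    if n_new == 0:
        return X, y
    X_min = X[y == minority]
    # Ask for k + 1 neighbours because each point is its own nearest neighbour.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    neigh_idx = nn.kneighbors(X_min, return_distance=False)[:, 1:]
    base = rng.integers(0, len(X_min), size=n_new)
    picked = neigh_idx[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))  # interpolation factor in [0, 1)
    X_new = X_min[base] + gap * (X_min[picked] - X_min[base])
    y_new = np.full(n_new, minority)
    return np.vstack([X, X_new]), np.concatenate([y, y_new])


def run_experiment(seed=42):
    """Fit RF (max_depth=8) and SVM (linear, poly, RBF) on 6 PCA components."""
    X, y = load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=seed
    )
    # Fit the scaler and PCA on the training split only to avoid leakage.
    scaler = StandardScaler().fit(X_train)
    pca = PCA(n_components=6, random_state=seed).fit(scaler.transform(X_train))
    Z_train = pca.transform(scaler.transform(X_train))
    Z_test = pca.transform(scaler.transform(X_test))
    # Balance the classes in the training split (assumed ordering: PCA first).
    Z_bal, y_bal = smote_oversample(Z_train, y_train, seed=seed)
    models = {
        "rf_max_depth_8": RandomForestClassifier(max_depth=8, random_state=seed),
        "svm_linear": SVC(kernel="linear"),
        "svm_poly": SVC(kernel="poly"),
        "svm_rbf": SVC(kernel="rbf"),
    }
    scores = {}
    for name, model in models.items():
        model.fit(Z_bal, y_bal)
        scores[name] = accuracy_score(y_test, model.predict(Z_test))
    return y_bal, scores


if __name__ == "__main__":
    _, results = run_experiment()
    for name, acc in results.items():
        print(f"{name}: test accuracy = {acc:.4f}")
```

The accuracies printed by this sketch depend on the split, the random seed, and the assumptions listed above, so they should not be expected to match the figures reported in the abstract.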

Index Terms

Computer Science
Information Sciences
Data Mining
Pattern Recognition
Classification
Machine Learning

Keywords

Classification, Breast Cancer, Principal Component Analysis, SMOTE, Random Forest, Support Vector Machine