Grid Search CV Implementation in Random Forest Algorithm to Improve Accuracy of Breast Cancer Data

Dimas Aryo Anggoro (1), Nur Aini Afdallah (2)
(1) Informatics Department, Universitas Muhammadiyah Surakarta, Jl. Ahmad Yani, Surakarta, 57162, Indonesia
(2) Informatics Department, Universitas Muhammadiyah Surakarta, Jl. Ahmad Yani, Surakarta, 57162, Indonesia
How to cite (IJASEIT):
Anggoro, Dimas Aryo, and Nur Aini Afdallah. “Grid Search CV Implementation in Random Forest Algorithm to Improve Accuracy of Breast Cancer Data”. International Journal on Advanced Science, Engineering and Information Technology, vol. 12, no. 2, Apr. 2022, pp. 515-20, doi:10.18517/ijaseit.12.2.15487.
Breast cancer is the most common cancer in women and the second leading cause of death worldwide. Accurate diagnosis plays an important role in choosing treatment strategies and in patient safety, which motivates the use of machine learning for disease prediction. This paper aims to determine the best parameter values for breast cancer data using the Grid Search CV method and to classify the data with the Random Forest (RF) algorithm. It also compares the accuracy obtained with and without Grid Search CV. Grid Search CV is a method for determining the optimal model parameters so that the classifier predicts the test data reliably. The highest accuracy, 0.9545, is obtained by the random forest algorithm with grid search; without grid search, the random forest algorithm reaches an accuracy of 0.9480. For further research, we suggest applying the Grid Search CV method to the breast cancer dataset with other algorithms, such as Logistic Regression, XGBoost, and SVM. The same algorithm could also be applied to different datasets to test whether Grid Search CV increases accuracy.
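The comparison described in the abstract can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: it assumes scikit-learn, uses the Wisconsin Diagnostic dataset bundled with scikit-learn as a stand-in for the paper's data, and the parameter grid, the 80/20 split, the 5-fold cross-validation, and `random_state=0` are all hypothetical choices. It should not be expected to reproduce the reported accuracies.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Assumption: scikit-learn's bundled breast cancer data stands in for the paper's dataset.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Baseline: random forest with default parameters, no grid search.
baseline = RandomForestClassifier(random_state=0).fit(X_train, y_train)
baseline_acc = accuracy_score(y_test, baseline.predict(X_test))

# Hypothetical grid: every combination is scored by 5-fold cross-validation
# on the training split only, so the test split stays unseen during tuning.
param_grid = {"n_estimators": [50, 100], "max_depth": [None, 5]}
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=5,
    scoring="accuracy",
)
search.fit(X_train, y_train)

# GridSearchCV refits the best combination on the full training split by default.
tuned_acc = accuracy_score(y_test, search.best_estimator_.predict(X_test))

print("best parameters:", search.best_params_)
print("accuracy without grid search:", baseline_acc)
print("accuracy with grid search:", tuned_acc)
```

On a dataset this small, the gap between the two accuracies depends on the split and the grid, so the tuned model is not guaranteed to score higher on any single run.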
