A Review of Feature Selection Methods on Diabetes Mellitus Classification

Nur Farahaina Idris (1), Mohd Arfian Ismail (2), Shahreen Kasim (3), Rohayanti Hassan (4), Deshinta Arrova Dewi (5), Abdullah Munzir Mohd Fauzi (6), Rahmat Hidayat (7)
(1) Faculty of Computing, Universiti Malaysia Pahang Al-Sultan Abdullah, Pekan, Pahang, Malaysia
(2) Faculty of Computing, Universiti Malaysia Pahang Al-Sultan Abdullah, Pekan, Pahang, Malaysia
(3) Faculty of Computer Science and Information Technology, Universiti Tun Hussein Onn Malaysia, Parit Raja, Johor, Malaysia
(4) Faculty of Computing, Universiti Teknologi Malaysia, Johor, Malaysia
(5) INTI International University, Nilai, Negeri Sembilan, Malaysia
(6) MZR Global Sdn Bhd, Shah Alam, Selangor, Malaysia
(7) Department of Information Technology, Politeknik Negeri Padang, Sumatera Barat, Indonesia
How to cite (IJASEIT):
N. F. Idris et al., “A Review of Feature Selection Methods on Diabetes Mellitus Classification,” Int. J. Adv. Sci. Eng. Inf. Technol., vol. 15, no. 3, pp. 686–692, Jun. 2025.
Diabetes is a leading cause of death in the United States and can lead to serious health complications. In recent decades, artificial intelligence and its subfield, machine learning, have been increasingly used to aid disease diagnosis. Machine learning methods must be robust enough to handle the variability in diabetes datasets, which often span diverse patient demographics, clinical characteristics, and environmental factors. This motivates researchers to develop feature selection methods that complement machine learning models by reducing training time and model complexity. However, feature selection can degrade classification accuracy by inadvertently removing essential features, or it can increase the time required because of repeated evaluation of candidate feature subsets. Hence, a thorough review of feature selection methods for diabetes classification is needed to evaluate their effectiveness. Feature selection methods fall into three primary categories: embedded, wrapper, and filter methods, each with distinct mechanisms and effects on the classification process. This study reviews a representative method from each category: Random Forest (embedded), the Chi-Square test (filter), and Recursive Feature Elimination (wrapper). The Chi-Square test is efficient but applicable only to categorical features; Random Forest is effective but incurs high complexity and longer runtimes because of its ensemble nature; and Recursive Feature Elimination produces the best performance but is less suitable for high-dimensional data. The findings indicate that Recursive Feature Elimination is the most suitable for diabetes classification, as it is fast and yields good performance.
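To make the distinction between the three categories concrete, the minimal sketch below applies one method of each kind using scikit-learn: a Chi-Square filter via SelectKBest, embedded Random Forest importances, and wrapper-style Recursive Feature Elimination. The synthetic placeholder data, the choice of four retained features, and the LogisticRegression base estimator for RFE are illustrative assumptions, not the experimental setup of the studies reviewed.

# Minimal sketch of the three feature selection families (assumptions noted above).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE, SelectKBest, chi2
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import MinMaxScaler

# Placeholder standing in for a diabetes dataset (e.g., eight clinical features, binary label).
X, y = make_classification(n_samples=500, n_features=8, n_informative=5, random_state=0)
X = MinMaxScaler().fit_transform(X)  # chi2 requires non-negative (ideally categorical/count) inputs

# Filter: Chi-Square scores each feature against the label, independently of any classifier.
chi2_selector = SelectKBest(score_func=chi2, k=4).fit(X, y)
print("Chi-Square keeps features:", np.flatnonzero(chi2_selector.get_support()))

# Embedded: feature importances fall out of training the Random Forest itself.
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
print("Random Forest importances:", np.round(rf.feature_importances_, 3))

# Wrapper: RFE repeatedly refits the estimator and prunes the weakest feature until four remain.
rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=4).fit(X, y)
print("RFE keeps features:", np.flatnonzero(rfe.support_))

On real diabetes data, the feature subsets returned by each selector would then be compared by training the downstream classifier on each reduced set.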


This work is licensed under a Creative Commons Attribution 4.0 International License.
