Machine learning and deep learning predictive models for type 2 diabetes: a systematic review
Cite | References | Machine learning model | Validation parameter | Data sampling | Complementary techniques | Description of the population |
---|---|---|---|---|---|---|
Type of data: electronic health records | | | | | | |
[29] | Arellano-Campos et al. (2019) | Cox proportional hazard regression | Accuracy: 0.75, hazard ratios | Cross-validation (k = 10) and bootstrapping | Beta-coefficients model | Baseline: 7636 follow-up: 6144 diabetes: 331 age: 32–54 |
[30] | You et al. (2019) | Super learner: ensemble that learns a weighted combination of candidate algorithms | Average treatment effect | Cross-validation | Targeted learning, query language, logistic and tree regression | Total: 78,894 control: 41,127 diabetes: 37,767 age: > 40 |
[27] | Maxwell et al. (2017) | Sigmoid function-Deep Neural Network with cross entropy as loss function | Accuracy: 0.921 F1-score: 0.823 precision: 0.915 sensitivity: 0.867 | Training set (90%) test set (10%) tenfold cross-validation | RAkEL-LibSVM RAkEL-MLP RAkEL-SMO RAkEL-J48 RAkEL-RF MLkNN | Total: 110,300 imbalanced 6 disease categories |
[28] | Nguyen et al. (2019) | Deep Neural Network with three embedding and two hidden layers | Specificity 0.96 accuracy: 0.84 sensitivity: 0.31 AUC (ROC): 0.84 | Training set (70%): cross-validation 9:1 test set (30%) | Generalized linear model large-scale regression | Total: 76,214 78 diseases age: 25–78 |
[31] | Pham et al. (2017) | Recurrent Neural Network Convolutional-Long Short-Term Memory (C-LSTM) | F1-score: 0.79 precision: 0.66 | Training set (66%) tuning set (17%) test set (17%) | Support vector machine and random forests | Diabetes: 12,000 age: 18–100 mean age: 73 |
[32] | Spänig et al. (2019) | Deep Neural Networks with hyperbolic tangent (tanh) activation | AUC (ROC) = 0.71 AUC (ROC) = 0.68 | Training set (80%) test set (20%) | Sub-sampling approach, support vector machine with RBF kernel | Total: 4814 diabetes: 646 diagnosed: 397 not diagnosed: 257 age: 45–75 imbalanced |
[33] | Wang et al. (2020) | Convolutional neural network and bidirectional long short-term memory | Precision: 92.3 recall: 90.5 F score: 91.3 accuracy: 92.8 | Training set (70%) validation set (10%) test set (20%) | SVM-TFIDF CNN BiLSTM | Total: 18,625 diabetes: 5645 10 disease categories |
[34] | Kim et al. (2020) | Class activation map and CNN (SSANet) | R2 = 0.75 MAE = 3.55 AUC (ROC) = 0.77 | Training set (89%) validation set (1%) test set (10%) | Linear regression | Total: 412,026 normal: 243,668 diabetes: 14,189 age: 19–90 |
[35] | Bernardini et al. (2020) | Sparse balanced support vector machine (SB-SVM) | Recall = 0.7464 AUC (ROC) = 0.8143 | Tenfold cross-validation | Sparse 1-norm SVM | Total: 2433 diabetes: 225 control: 2208 age: 60–80 imbalanced |
[36] | Mei et al. (2017) | Hierarchical recurrent neural network | AUC (ROC) = 0.9268 Accuracy = 0.6745 | Training set (80%) validation set (10%) test set (10%) | Linear regression | Total: 620,633 |
[25] | Prabhu et al. (2019) | Deep belief neural network | Recall: 1.0 precision: 0.68 F1 score: 0.80 | Training set validation set test set | Principal component analysis | Pima Indian Women Diabetes Dataset |
[13] | Bernardini et al. (2020) | Multiple instance learning boosting | Accuracy: 0.83 F1-score: 0.81 precision: 0.82 recall: 0.83 AUC (ROC): 0.89 | Tenfold cross-validation | None | Total: 252 diabetes: 252 age: 54–72 |
[37] | Solares et al. (2019) | Hazard ratios using Cox regression | AUC (ROC): 0.75, concordance (C-statistic) | Derivation set (80%) validation (20%) | None | Total: 80,964 diabetes: 2267 age: 50 |
[38] | Kumar et al. (2017) | Support vector machine, Naive Bayes, K-nearest neighbor, C4.5 decision tree | Precision: 0.65, 0.68, 0.7, 0.72 recall: 0.69, 0.68, 0.7, 0.74 accuracy: 0.69, 0.67, 0.7, 0.74 F-score: 0.65, 0.68, 0.7, 0.72 | N-fold (N = 10) cross-validation | None | Diabetes: 200 age: 1–100 |
[39] | Olivera et al. (2017) | Logistic regression, artificial neural network, K-nearest neighbor, Naïve Bayes | AUC (ROC): 75.44, 75.48, 74.94, 74.47 balanced accuracy: 69.3, 69.47, 68.74, 68.95 | Training set (70%) test set (30%) tenfold cross-validation | Forward selection | Diabetes: 12,447 unknown: 1359 age: 35–74 |
[10] | Alghamdi et al. (2017) | Naïve Bayes tree, random forest, and logistic model tree, j48 decision tree | Kappa: 1.34, 3.63, 1.37, 0.70, 1.14 recall (%): 99.2, 99.2, 90.8, 99.9, 99.4 specificity (%): 1.6, 3.1, 21.2, 0.50, 1.3 accuracy (%): 83.9, 84.1, 79.9, 84.3, 84.1 | N-fold cross-validation | Multiple linear regression, gain ranking method, synthetic minority oversampling technique | Total: 32,555 diabetes: 5099 imbalanced |
[14] | Xie et al. (2017) | K2 structure-learning algorithm | Accuracy = 82.48 | Training set (75%) test set (25%) | None | Total: 21,285 diabetes: 1124 age: 35–65 |
[40] | Peddinti et al. (2017) | Regularised least-squares regression for binary risk classification | Odds ratio, accuracy: 0.77 | Tenfold cross-validation | Logistic regression | Total: 543 diabetes: 146 age: 48–50 |
[8] | Maniruzzaman et al. (2017) | Linear discriminant analysis, quadratic discriminant analysis, Naïve Bayes, Gaussian process classification, support vector machine, artificial neural network, Adaboost, logistic regression, decision tree, random forest | Accuracy: 0.92 sensitivity: 0.96 specificity: 0.80 PPV: 0.91 NPV: 0.91 AUC (ROC): 0.93 | Cross-validation K2, K4, K5, K10, and JK | Random forest, logistic regression, mutual information, principal component analysis, analysis of variance Fisher discriminant ratio | Pima Indian diabetic dataset |
[41] | Dutta et al. (2018) | Logistic regression support vector machine random forest | Sensitivity: 0.80, 0.75, 0.84 F1-score: 0.80, 0.79, 0.84 | Training set (67%) test set (33%) | None | Diabetes: 130 control: 262 imbalanced age: 21–81 |
[42] | Alhassan et al. (2018) | Long short-term memory (LSTM) and gated recurrent unit (GRU) deep learning | Accuracy: 0.97 F1-score: 0.96 | Training set (90%) test set (10%) tenfold cross-validation | Logistic regression, support vector machine, multi-layer perceptron | Total: 41,000,000 imbalanced diabetes: 62% |
[15] | Hertroijs et al. (2018) | Latent growth mixture modelling | Specificity: 81.2% sensitivity: 78.4% accuracy: 92.3% | Training set (90%) test set (10%) fivefold cross-validation | K-nearest neighbour | Total: 105,814 age: > 18 |
[43] | Kuo et al. (2020) | Random forest C5.0 support vector machine | Accuracy: 1 F1-score: 1 AUC (ROC): 1 sensitivity: 1 | Tenfold cross-validation | Information gain (features) gain ratio | Total: 149 diabetes: 149 age: 21–91 |
[44] | Pimentel et al. (2018) | Naïve Bayes, alternating decision tree, random forest, random tree, k-nearest neighbor, support vector machine | Specificity: 0.76, 0.88, 0.87, 0.97, 0.82, 0.85 sensitivity: 0.62, 0.50, 0.33, 0.42, 0.40, 0.59 AUC (ROC): 0.73, 0.81, 0.87, 0.74, 0.62, 0.63 | Training set (70%) test set (30%) tenfold cross-validation | SMOTE | Total: 9947 imbalanced diabetes: 13% age: 21–93 |
[45] | Talaei-Khoei et al. (2018) | Artificial neural network, support vector machine, logistic regression, decision tree | AUC (ROC): 0.614, 0.831, 0.738, 0.793 sensitivity: 0.608, 0.683, 0.677, 0.687 specificity: 0.783, 0.950, 0.712, 0.651 MCC: 0.797, 0.922, 0.581, 0.120 MCE: 0.844, 0.989, 0.771, 0.507 | Oversampling technique, random under-sampling | Synthetic minority oversampling, LASSO, AIC and BIC | Total: 10,911 imbalanced diabetes: 51.9% |
[46] | Perveen et al. (2019) | J48 decision tree, Naïve Bayes | TPR: 0.85, 0.782, 0.852, 0.774 FPR: 0.218, 0.15, 0.226, 0.148 precision: 0.814, 0.782, 0.807 recall: 0.85, 0.802, 0.852, 0.824 F-measure: 0.831, 0.634, 0.829, 0.774 MCC: 0.634, 0.823, 0.628, 0.798 AUC (ROC): 0.883, 0.873, 0.836, 0.826 | K-medoids under-sampling | Logistic regression | Total: 667,907 age: 22–74 diabetes: 8.13% imbalanced |
[47] | Yuvaraj et al. (2019) | Decision tree, Naïve Bayes, random forest | Precision: 87, 91, 94 recall: 77, 82, 88 F-measure: 82, 86, 91 accuracy: 88, 91, 94 | Training set (70%) test set (30%) | Information gain, RHadoop | Total: 75,664 |
[48] | Deo et al. (2019) | Bagged trees, linear support vector machine | Accuracy: 91% AUC (ROC): 0.908 | Training set (70%) test set (30%) fivefold cross-validation, holdout validation | Synthetic minority oversampling technique, Gower’s distance | Total: 140 diabetes: 14 imbalanced age: 12–90 |
[49] | Jakka et al. (2019) | K-nearest neighbor, decision tree, Naive Bayes, support vector machine, logistic regression, random forest | Accuracy: 0.73, 0.70, 0.75, 0.66, 0.78, 0.74 recall: 0.69, 0.72, 0.74, 0.64, 0.76, 0.69 F1-score: 0.69, 0.72, 0.74, 0.40, 0.75, 0.69 misclassification rate: 0.31, 0.29, 0.26, 0.36, 0.24, 0.29 AUC (ROC): 0.70, 0.69, 0.70, 0.61, 0.74, 0.70 | None | None | Pima Indians Diabetes dataset |
[50] | Radja et al. (2019) | Naive Bayes, support vector machine, decision table, J48 decision tree | Precision: 0.80, 0.79, 0.76, 0.79 precision: 0.68, 0.74, 0.60, 0.63 recall: 0.84, 0.90, 0.81, 0.81 recall: 0.61, 0.54, 0.53, 0.60 F1-score: 0.76, 0.76, 0.71, 0.74 | Tenfold cross-validation | None | Total: 768 diabetes: 500 control: 268 |
[51] | Choi et al. (2019) | Logistic regression, linear discriminant analysis, quadratic discriminant analysis, K-nearest neighbor | AUC (ROC): 0.78, 0.77, 0.76, 0.77 | Tenfold cross-validation | Information gain | Total: 8454 diabetes: 404 age: 40–72 |
[52] | Akula et al. (2019) | K nearest neighbor, support vector machine, decision tree, random forest, gradient boosting, neural network, Naive Bayes | Overall accuracy: 0.86 precision: 0.24 negative prediction: 0.99 sensitivity: 0.88 specificity: 0.85 F1-score: 0.38 | Training set: 800 test set: 10,000 | None | Pima Indians Diabetes Dataset Practice Fusion Dataset total: 10,000 age: 18–80 |
[53] | Xie et al. (2019) | Support vector machine, decision tree, logistic regression, random forest, neural network, Naive Bayes | Accuracy: 0.81, 0.74, 0.81, 0.79, 0.82, 0.78 sensitivity: 0.43, 0.52, 0.46, 0.50, 0.37, 0.48 specificity: 0.87, 0.78, 0.87, 0.84, 0.90, 0.82 AUC (ROC): 0.78, 0.72, 0.79, 0.76, 0.80, 0.76 | Training set (67%) test set (33%) | Odds ratio, synthetic minority over-sampling technique | Total: 138,146 diabetes: 20,467 age: 30–80 |
[54] | Lai et al. (2019) | Gradient boosting machine, logistic regression, random forest, Rpart | AUC (ROC): 84.7%, 84.0%, 83.4%, 78.2% | Training set (80%) test set (20%) tenfold cross-validation | Misclassification costs | Total: 13,309 diabetes: 20.9% age: 18–90 imbalanced |
[17] | Brisimi et al. (2018) | Alternating clustering and classification | AUC (ROC): 0.8814, 0.8861, 0.8829, 0.8812 | Training set (40%) test set (60%) | Sparse (l1-regularized), support vector machines, random forests, gradient tree boosting | Diabetes: 47,452 control: 116,934 age mean: 66 |
[55] | Abbas et al. (2019) | Support vector machine with Gaussian radial basis | Accuracy: 96.80% sensitivity: 80.09% | Tenfold cross-validation | Minimum redundancy maximum relevance algorithm | Total: 1438 diabetes: 161 age: 25–64 |
[56] | Sarker et al. (2020) | K-nearest neighbors | Precision: 0.75 recall: 0.76 F-score: 0.75 AUC (ROC): 0.72 | Tenfold cross-validation | Adaptive boosting, logistic regression, Naive Bayes, support vector machine, decision tree | Total: 500 age: 10–80 |
[57] | Cahn et al. (2020) | Gradient boosting trees model | AUC (ROC): 0.87 sensitivity: 0.61 specificity: 0.91 PPV: 0.16 | Training set: THIN dataset validation sets: AppleTree and MHS datasets | Logistic regression | Age: 40–80 THIN: total = 3,068,319 pre-DM: 40% DM: 2.9% AppleTree: pre-DM: 381,872 DM: 2.3% MHS: pre-DM: 12,951 DM: 2.7% |
[58] | Garcia-Carretero et al. (2020) | K-nearest neighbors | Accuracy: 0.977 sensitivity: 0.998 specificity: 0.838 PPV: 0.976 NPV: 0.984 AUC (ROC): 0.89 | Tenfold cross-validation | Random forest | Age: 44–72 pre-DM = 1647 diabetes: 13% |
[59] | Zhang et al. (2020) | Logistic regression, classification and regression tree, gradient boosting machine, artificial neural networks, random forest, support vector machine | AUC (ROC): 0.84, 0.81, 0.87, 0.85, 0.87, 0.84 accuracy: 0.75, 0.80, 0.81, 0.74, 0.86, 0.76 sensitivity: 0.79, 0.67, 0.76, 0.81, 0.80, 0.75 specificity: 0.75, 0.81, 0.82, 0.73, 0.78, 0.77 PPV: 0.23, 0.26, 0.29, 0.26, 0.26, 0.24 NPV: 0.97, 0.96, 0.97, 0.98, 0.98, 0.97 | Tenfold cross-validation | Synthetic minority over-sampling technique | Total: 36,652 age: 18–79 |
[26] | Albahli et al. (2020) | Logistic regression | Accuracy: 0.97 | Tenfold cross-validation | Random forest, eXtreme Gradient Boosting | Pima Indians Diabetes dataset age: 21–81 |
[60] | Haq et al. (2020) | Decision tree (iterative Dichotomiser 3) | Accuracy: 0.99 sensitivity: 1 specificity: 0.98 MCC: 0.99 F1-score: 1 AUC (ROC): 0.998 | Training set (70%) test set (30%) hold-out training set (90%) test set (10%) tenfold cross-validation | AdaBoost, random forest | Total = 2000 diabetes: 684 age: 21–81 |
[61] | Yang et al. (2020) | Linear discriminant analysis, support vector machine, random forest | AUC: 0.85, 0.84, 0.83 sensitivity: 0.80, 0.79, 0.78 specificity: 0.74, 0.75, 0.73 accuracy: 0.75, 0.74, 0.74 PPV: 0.36, 0.36, 0.35 | Training set (80%, 2011–2014), test set (20%, 2011–2014) and validation set (2015–2016) fivefold cross-validation | Binary logistic regression | Total = 8057 age: 20–89 imbalanced |
[62] | Ahn et al. (2020) | Random forest, support vector machine | AUC (ROC): 1.00, 0.95 | Tenfold cross-validation | ELISA | Age: 43–68 |
[63] | Sarwar et al. (2018) | K nearest neighbors, Naive Bayes, support vector machine, decision tree, logistic regression, random forest | Accuracy: 0.77, 0.74, 0.77, 0.71, 0.74, 0.71 | Training set (70%) test set (30%) tenfold cross-validation | None | Pima Indians Diabetes Dataset |
[64] | Zou et al. (2018) | Random forest, J48 decision tree, Deep Neural Network | Accuracy: 0.81, 0.79, 0.78 sensitivity: 0.85, 0.82, 0.82 specificity: 0.77, 0.76, 0.75 MCC: 0.62, 0.57, 0.57 | Fivefold cross-validation | Principal component analysis, minimum redundancy maximum relevance | Pima Indians Diabetes dataset and Luzhou dataset |
[65] | Farran et al. (2019) | Logistic regression, k-nearest neighbours, support vector machine | AUC (ROC): 3-year: 0.74, 0.83, 0.73 5-year: 0.72, 0.82, 0.68 7-year: 0.70, 0.79, 0.71 | Fivefold cross-validation | None | Diabetes: 40,773 control: 107,821 age: 13–65 |
[66] | Xiong et al. (2019) | Multilayer perceptron, AdaBoost, random forest, support vector machine, gradient boosting | Accuracy: 0.87, 0.86, 0.86, 0.86, 0.86 | Training set (60%) test set (20%) tenfold cross-validation set (20%) | Missing values imputed with feature mean | Total: 11,845 diabetes: 845 age: 20–100 |
[67] | Dinh et al. (2019) | Support vector machine, random forest, gradient boosting, logistic regression | AUC (ROC): 0.89, 0.94, 0.96, 0.72 sensitivity: 0.81, 0.86, 0.89, 0.67 precision: 0.81, 0.86, 0.89, 0.67 F1-score: 0.81, 0.86, 0.89, 0.67 | Training set (80%) test set (20%) tenfold cross-validation | None | Case 1: 21,131 diabetes: 5532 case 2: 16,426 prediabetes: 6482 |
[68] | Liu et al. (2019) | LASSO, SCAD, MCP, stepwise regression | AUC (ROC): 0.71, 0.70, 0.70, 0.71 sensitivity: 0.64, 0.64, 0.64, 0.63 specificity: 0.68, 0.68, 0.68, 0.68 precision: 0.35, 0.35, 0.35, 0.35 NPV: 0.87, 0.87, 0.87, 0.87 | Training set (70%) test set (30%) tenfold cross-validation | None | Total: 5481 age: > 40 |
[9] | Muhammad et al. (2020) | Logistic regression, support vector machine, K-nearest neighbor, random forest, Naive Bayes, gradient boosting | Accuracy: 0.81, 0.85, 0.82, 0.89, 0.77, 0.86 AUC (ROC): 0.80, 0.85, 0.82, 0.86, 0.77, 0.86 | None | Correlation coefficient analysis | Total: 383 age: 1–150 diabetes: 51.9% |
[69] | Tang et al. (2020) | EMR-image multimodal network (CNN) | Accuracy: 0.86 F1-score: 0.76 AUC (ROC): 0.89 Sensitivity: 0.68 Precision: 0.88 | Fivefold cross-validation | None | Total: 997 diabetes: 401 |
[70] | Maniruzzaman et al. (2021) | Naive Bayes, decision tree, AdaBoost, random forest | Accuracy: 0.87, 0.90, 0.91, 0.93 AUC (ROC): 0.82, 0.78, 0.90, 0.95 | Tenfold cross-validation | Logistic regression | Total: 6561 diabetes: 657 age: 30–64 imbalanced |
[71] | Boutilier et al. (2021) | Random forest, logistic regression, AdaBoost, K-nearest neighbors, decision trees | AUC (ROC): 0.91, 0.91, 0.90, 0.86, 0.78 | Tenfold cross-validation | 2-sided Wilcoxon signed-rank test | Total: 2278 diabetes: 833 age: 35–63 |
[72] | Li et al. (2021) | Extreme gradient boosting (GBT) | AUC (ROC): 0.91 precision: 0.82 sensitivity: 0.80 F1-score: 0.77 | Training set (60%) validation (20%) test set (20%) | Genetic algorithm | Diabetics: 570 control: 570 prediabetics: 570 age: 33–68 |
[73] | Lam et al. (2021) | Random forest, logistic regression, extreme gradient boosting (GBT) | AUC (ROC): 0.86 F1-score: 0.82 | Tenfold cross-validation | None | Control: 19,852 diabetes: 3103 age: 40–69 |
[74] | Deberneh et al. (2021) | Random forest, support vector machine, XGBoost | Accuracy: 0.73, 0.73, 0.72 precision: 0.74, 0.74, 0.74 F1-score: 0.74, 0.74, 0.73 sensitivity: 0.73, 0.74, 0.72 Kappa: 0.60, 0.60, 0.58 MCC: 0.60, 0.60, 0.58 | Tenfold cross-validation | ANOVA, Chi-squared, SMOTE, feature importance | Total: 535,169 diabetes: 4.3% prediabetes: 36% age: 18–108 |
[75] | He et al. (2021) | Cox regression | C-statistic: 0.762 | Hold-out | None | Total: 68,299 diabetes: 1281 age: 40–69 |
[76] | García-Ordás et al. (2021) | Convolutional neural network (CNN) | Accuracy: 0.92 | Training set (90%) test set (10%) | Variational and sparse autoencoders | Pima Indians |
[77] | Kanimozhi et al. (2021) | Hybrid particle swarm optimization-artificial fish swarm optimization | Accuracy: 1, 0.99 specificity: 0.86, 0.83 sensitivity: 1, 0.99 MCC: 0.91, 0.92 Kappa: 0.96, 0.98 | Training set (90%) test set (10%) fivefold cross-validation | Min–max scaling, kernel extreme learning machine | Pima Indians Diabetics, Diabetic Research Center |
[78] | Ravaut et al. (2021) | Extreme gradient boosting tree | AUC (ROC): 0.84 | Training set (86%) validation (7%) test set (7%) | Mean absolute Shapley values | Total: 15,862,818 diabetes: 19,137 age: 40–69 |
[79] | De Silva et al. (2021) | Logistic regression | AUC (ROC): 0.75 accuracy: 0.62 specificity: 0.62 sensitivity: 0.77 PPV: 0.09 NPV: 0.98 | Training set (30%) validation (30%) test set (40%) | SMOTE ROSE | Total: 16,429 diabetes: 5.6% age: >20 |
[80] | Kim et al. (2021) | Deep neural network, logistic regression, decision tree | Accuracy: 0.80, 0.80, 0.71 | Fivefold cross-validation | Wald test | Total: 3889 diabetes: 746 age: 40–69 |
[81] | Vangeepuram et al. (2021) | Naive Bayes | AUC (ROC): 0.75 accuracy: 0.62 specificity: 0.62 sensitivity: 0.77 PPV: 0.09 NPV: 0.98 | Fivefold cross-validation | Friedman-Nemenyi | Total: 2858 diabetes: 828 age: 12–19 |
[82] | Recenti et al. (2021) | Random forest Ada-boost gradient boosting | Accuracy: 0.90, 0.79, 0.86 precision: 0.88, 0.78, 0.84 F1-score: 0.90, 0.81, 0.87 sensitivity: 0.93, 0.84, 0.90 specificity: 0.87, 0.76, 0.82 AUC (ROC): 0.97, 0.90, 0.95 | Tenfold cross-validation | SMOTE | Total: 2943 age: 66–98 imbalance |
[83] | Ramesh et al. (2021) | Support vector machine | Accuracy: 0.83 specificity: 0.79 sensitivity: 0.87 | Tenfold cross-validation | MICE, LASSO | Pima Indians |
[84] | Lama et al. (2021) | Random forest | AUC (ROC): 0.78 | Fivefold cross-validation | SHAP TreeExplainer | Total: 3342 diabetes: 556 age: 35–54 |
[85] | Shashikant et al. (2021) | Gaussian process-based kernel | Accuracy: 0.93 precision: 0.94 F1-score: 0.95 sensitivity: 0.96 specificity: 0.82 AUC (ROC): 0.89 | Tenfold cross-validation | Non-linear HRV | Total: 135 diabetes: 100 age: 20–70 |
[86] | Kalagotla et al. (2021) | Stacking of multi-layer perceptron, support vector machine, logistic regression | Accuracy: 0.78 precision: 0.72 sensitivity: 0.51 F1-score: 0.60 | Hold-out, k-fold cross-validation | Matrix correlation | Pima Indians |
[87] | Moon et al. (2021) | Logistic regression | AUC (ROC): 0.94 | Training set (47%) validation (30%) test set (23%) | Cox regression | Total: 14,977 diabetes: 636 age: 48–69 |
[88] | Ihnaini et al. (2021) | Ensemble deep learning model | Accuracy: 0.99 precision: 1 sensitivity: 0.99 F1-score: 0.99 RMSE: 0 MAE: 0.6 | Hold-out | None | Pima Indians merged with Hospital Frankfurt (Germany) dataset |
[89] | Rufo et al. (2021) | LightGBM | Accuracy: 0.98 specificity: 0.96 AUC (ROC): 0.98 sensitivity: 0.99 | Tenfold cross-validation | Min–max scale | Diabetes: 1030 control: 1079 age: 12–90 |
[90] | Haneef et al. (2021) | Linear discriminant analysis | Accuracy: 0.67 specificity: 0.67 sensitivity: 0.62 | Training set (80%) test set (20%) | Z-score transformation, random down-sampling | Total: 44,659 age: 18–69 imbalanced |
[91] | Wei et al. (2022) | Random forest | AUC (ROC): 0.70 R2: 0.40 | Training set (70%) test set (30%) tenfold cross-validation | LASSO, PCA | Total: 8501 age: 15–50 diabetes: 8.92% imbalanced |
[92] | Leerojanaprapa et al. (2019) | Bayesian network | AUC (ROC): 0.78 | Training set (70%) test set (30%) | None | Total: 11,240 diabetes: 5.53% age: 15–19 |
[93] | Subbaiah et al. (2020) | Random forest | Accuracy: 1 specificity: 1 sensitivity: 1 Kappa: 1 | Training set (70%) test set (30%) | None | Pima Indians |
[94] | Thenappan et al. (2020) | Support vector machine | Accuracy: 0.97 specificity: 0.96 sensitivity: 0.94 precision: 0.96 | Training set (70%) test set (30%) | Principal component analysis | Pima Indians |
[95] | Sneha et al. (2019) | Support vector machine, random forest, Naive Bayes, decision tree, k-nearest neighbors | Accuracy: 0.78, 0.75, 0.74, 0.73, 0.63 | Training set (70%) test set (30%) | None | Total: 2500 age: 29–70 |
[96] | Jain et al. (2020) | Support vector machine, random forest, k-nearest neighbors | Accuracy: 0.74, 0.74, 0.76 precision: 0.67, 0.72, 0.70 sensitivity: 0.52, 0.44, 0.54 F1-score: 0.58, 0.55, 0.61 AUC (ROC): 0.74, 0.83, 0.83 | Training set (70%) test set (30%) | None | Control: 500 diabetes: 268 age: 21–81 |
[97] | Syed et al. (2020) | Decision forest | F1-Score: 0.87 precision: 0.81 AUC (ROC): 0.90 Sensitivity: 0.91 | Training set (80%) test set (20%) | Pearson Chi-squared | Total: 4896 diabetes: 990 age: 40–60 |
[98] | Nuankaew et al. (2020) | Average weighted objective distance | Precision: 0.99 accuracy: 0.90 specificity: 0.97 | Training set (70%) test set (30%) | None | Mendeley data for diabetes |
[99] | Samreen et al. (2021) | Stacked ensemble of NB, LR, KNN, DT, SVM, RF, AdaBoost, GBT | Accuracy: 0.98, 0.99 (SVD) | Training set (70%) test set (30%) tenfold cross-validation | One hot encoding, singular value decomposition | Age: 20–90 |
[100] | Fazakis et al. (2021) | Weighted voting LR-RF | AUC (ROC): 0.88 | Hold-out | Forward/backward stepwise selection | English longitudinal study of ageing |
[101] | Omana et al. (2021) | Newton’s divided difference method | Accuracy: 0.97 S-error: 0.06 | Hold-out | Non-linear autoregressive regression | Total: 812,007 diabetes: 23.49% |
[102] | Ravaut et al. (2021) | Extreme gradient boosting tree | AUC (ROC): 0.80 | Training set (87%) validation (7%) test set (6%) | Mean absolute Shapley values | Total: 14,786,763 diabetes: 27,820 age: 10–100 imbalance |
[103] | Lang et al. (2021) | Deep belief network | AUC (ROC): 0.82 sensitivity: 0.80 specificity: 0.73 | Hold-out | Stratified sampling | Total: 1778 diabetes: 279 |
[104] | Gupta et al. (2021) | Deep Neural Network | Precision: 0.90 accuracy: 0.95 sensitivity: 0.95 F1-score: 0.93 specificity: 0.95 | Hold-out | None | Pima Indians |
[105] | Roy et al. (2021) | Gradient boosting tree | Accuracy: 0.92 precision: 0.86 sensitivity: 0.87 specificity: 0.79 AUC (ROC): 0.84 | Tenfold cross-validation | Correlation matrix SMOTE | Total: 500 diabetes: 289 age: 20–80 Imbalanced |
[106] | Zhang et al. (2021) | Bagging and boosting (GBT, RF, GBM) | Accuracy: 0.82 sensitivity: 0.85 specificity: 0.82 AUC (ROC): 0.89 | Training set (80%) test set (20%) tenfold cross-validation | SMOTE | Total: 37,730 diabetes: 9.4% age: 50–70 imbalanced |
[107] | Turnea et al. (2018) | Decision tree | Accuracy: 0.74 sensitivity: 0.60 specificity: 0.82 RMSE: 26.1 | Training set (75%) test set (25%) | None | Pima Indians |
[108] | Vettoretti et al. (2021) | RFE-Borda | RMSE: 0.98 | None | Correlation matrix | English longitudinal study of ageing |
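The validation protocol most rows above report is k-fold (usually tenfold) cross-validation. A minimal sketch of that protocol, using a hypothetical one-feature threshold classifier and synthetic glucose records that are assumptions for illustration and do not come from any reviewed study:

```python
import random

def k_fold_indices(n, k=10, seed=0):
    """Shuffle record indices and split them into k roughly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def fit_threshold(train):
    """Toy 'training': place the decision threshold midway between the
    mean glucose of the control and diabetes classes."""
    g0 = [r["glucose"] for r in train if r["label"] == 0]
    g1 = [r["glucose"] for r in train if r["label"] == 1]
    return (sum(g0) / len(g0) + sum(g1) / len(g1)) / 2

def cross_validate(records, k=10):
    """Hold each fold out once, refit on the rest, and score on the held-out fold."""
    folds = k_fold_indices(len(records), k)
    accuracies = []
    for fold in folds:
        held_out = set(fold)
        train = [r for i, r in enumerate(records) if i not in held_out]
        threshold = fit_threshold(train)
        correct = sum(
            (records[i]["glucose"] > threshold) == (records[i]["label"] == 1)
            for i in fold
        )
        accuracies.append(correct / len(fold))
    return accuracies

# Synthetic patient records: glucose drawn around two class means.
rng = random.Random(1)
records = (
    [{"glucose": rng.gauss(100, 15), "label": 0} for _ in range(200)]
    + [{"glucose": rng.gauss(150, 15), "label": 1} for _ in range(200)]
)
scores = cross_validate(records)
print(round(sum(scores) / len(scores), 3))  # mean fold accuracy
```

The hold-out and training/validation/test splits in other rows are the same idea with a single fixed partition instead of k rotating ones.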
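Many rows flag imbalanced cohorts and list the synthetic minority oversampling technique (SMOTE) as the remedy. A minimal sketch of the core idea, assuming a single numeric feature and toy values chosen for illustration; real studies would apply a full implementation such as imbalanced-learn's `SMOTE` across all features:

```python
import random

def smote(minority, n_new, k=3, seed=0):
    """Create n_new synthetic minority samples: pick a minority sample,
    pick one of its k nearest neighbours, and interpolate between them."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbours of `base` by absolute distance, excluding itself
        neighbours = sorted(
            (m for m in minority if m is not base), key=lambda m: abs(m - base)
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(base + gap * (neighbour - base))
    return synthetic

rng = random.Random(1)
majority = [rng.gauss(100, 10) for _ in range(50)]  # 50 majority samples
minority = [110, 118, 121, 125, 130]                # only 5 minority samples
new_samples = smote(minority, n_new=45)
balanced_minority = minority + new_samples
print(len(balanced_minority))  # now matches the majority class size: 50
```

Under-sampling variants in the table (random under-sampling, K-medoids) instead shrink the majority class toward the minority count.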