Table 2 Detailed classification of methods that predict the main factors for diagnosing the onset of diabetes

From: Machine learning and deep learning predictive models for type 2 diabetes: a systematic review

| References | Machine learning model | Validation parameter | Data sampling | Complementary techniques | Description of the population |
|---|---|---|---|---|---|
| **Type of data: electronic health records** | | | | | |
| [29] Arellano-Campos et al. (2019) | Cox proportional hazards regression | Accuracy: 0.75; hazard ratios | Cross-validation (k = 10) and bootstrapping | Beta-coefficients model | Baseline: 7636; follow-up: 6144; diabetes: 331; age: 32–54 |
| [30] You et al. (2019) | Super learner: ensemble built from a weighted combination of algorithms | Average treatment effect | Cross-validation | Targeted learning; query language; logistic and tree regression | Total: 78,894; control: 41,127; diabetes: 37,767; age: > 40 |
| [27] Maxwell et al. (2017) | Deep neural network with sigmoid activation and cross-entropy loss | Accuracy: 0.921; F1-score: 0.823; precision: 0.915; sensitivity: 0.867 | Training set (90%), test set (10%); tenfold cross-validation | RAkEL-LibSVM, RAkEL-MLP, RAkEL-SMO, RAkEL-J48, RAkEL-RF, MLkNN | Total: 110,300; 6 disease categories; imbalanced |
| [28] Nguyen et al. (2019) | Deep neural network with three embedding and two hidden layers | Specificity: 0.96; accuracy: 0.84; sensitivity: 0.31; AUC (ROC): 0.84 | Training set (70%) with 9:1 cross-validation; test set (30%) | Generalized linear model; large-scale regression | Total: 76,214; 78 diseases; age: 25–78 |
| [31] Pham et al. (2017) | Recurrent neural network: convolutional long short-term memory (C-LSTM) | F1-score: 0.79; precision: 0.66 | Training set (66%), tuning set (17%), test set (17%) | Support vector machine and random forests | Diabetes: 12,000; age: 18–100; mean age: 73 |
| [32] Spänig et al. (2019) | Deep neural networks with hyperbolic-tangent activation | AUC (ROC): 0.71, 0.68 | Training set (80%), test set (20%) | Sub-sampling approach; support vector machine with RBF kernel | Total: 4814; diabetes: 646 (diagnosed: 397, undiagnosed: 257); age: 45–75; imbalanced |
| [33] Wang et al. (2020) | Convolutional neural network and bidirectional long short-term memory | Precision: 92.3; recall: 90.5; F-score: 91.3; accuracy: 92.8 | Training set (70%), validation set (10%), test set (20%) | SVM-TFIDF, CNN, BiLSTM | Total: 18,625; diabetes: 5645; 10 disease categories |
| [34] Kim et al. (2020) | Class activation map and CNN (SSANet) | R²: 0.75; MAE: 3.55; AUC (ROC): 0.77 | Training set (89%), validation set (1%), test set (10%) | Linear regression | Total: 412,026; normal: 243,668; diabetes: 14,189; age: 19–90 |
| [35] Bernardini et al. (2020) | Sparse balanced support vector machine (SB-SVM) | Recall: 0.7464; AUC (ROC): 0.8143 | Tenfold cross-validation | Sparse 1-norm SVM | Total: 2433; diabetes: 225; control: 2208; age: 60–80; imbalanced |
| [36] Mei et al. (2017) | Hierarchical recurrent neural network | AUC (ROC): 0.9268; accuracy: 0.6745 | Training set (80%), validation set (10%), test set (10%) | Linear regression | Total: 620,633 |
| [25] Prabhu et al. (2019) | Deep belief neural network | Recall: 1.0; precision: 0.68; F1-score: 0.80 | Training, validation, and test sets | Principal component analysis | Pima Indian Women Diabetes Dataset |
| [13] Bernardini et al. (2020) | Multiple instance learning boosting | Accuracy: 0.83; F1-score: 0.81; precision: 0.82; recall: 0.83; AUC (ROC): 0.89 | Tenfold cross-validation | None | Total: 252; diabetes: 252; age: 54–72 |
| [37] Solares et al. (2019) | Hazard ratios using Cox regression | AUC (ROC): 0.75; concordance (C-statistic) | Derivation set (80%), validation set (20%) | None | Total: 80,964; diabetes: 2267; age: 50 |
| [38] Kumar et al. (2017) | Support vector machine, Naive Bayes, K-nearest neighbor, C4.5 decision tree | Precision: 0.65, 0.68, 0.7, 0.72; recall: 0.69, 0.68, 0.7, 0.74; accuracy: 0.69, 0.67, 0.7, 0.74; F-score: 0.65, 0.68, 0.7, 0.72 | N-fold (N = 10) cross-validation | None | Diabetes: 200; age: 1–100 |
| [39] Olivera et al. (2017) | Logistic regression, artificial neural network, K-nearest neighbor, Naïve Bayes | AUC (ROC): 75.44, 75.48, 74.94, 74.47; balanced accuracy: 69.3, 69.47, 68.74, 68.95 | Training set (70%), test set (30%); tenfold cross-validation | Forward selection | Diabetes: 12,447; unknown: 1359; age: 35–74 |
| [10] Alghamdi et al. (2017) | Naïve Bayes tree, random forest, logistic model tree, J48 decision tree | Kappa: 1.34, 3.63, 1.37, 0.70, 1.14; recall (%): 99.2, 99.2, 90.8, 99.9, 99.4; specificity (%): 1.6, 3.1, 21.2, 0.50, 1.3; accuracy (%): 83.9, 84.1, 79.9, 84.3, 84.1 | N-fold cross-validation | Multiple linear regression; gain ranking method; synthetic minority oversampling technique | Total: 32,555; diabetes: 5099; imbalanced |
| [14] Xie et al. (2017) | K2 structure-learning algorithm | Accuracy: 82.48 | Training set (75%), test set (25%) | None | Total: 21,285; diabetes: 1124; age: 35–65 |
| [40] Peddinti et al. (2017) | Regularised least-squares regression for binary risk classification | Odds ratio; accuracy: 0.77 | Tenfold cross-validation | Logistic regression | Total: 543; diabetes: 146; age: 48–50 |
| [8] Maniruzzaman et al. (2017) | Linear discriminant analysis, quadratic discriminant analysis, Naïve Bayes, Gaussian process classification, support vector machine, artificial neural network, AdaBoost, logistic regression, decision tree, random forest | Accuracy: 0.92; sensitivity: 0.96; specificity: 0.80; PPV: 0.91; NPV: 0.91; AUC (ROC): 0.93 | Cross-validation: K2, K4, K5, K10, and JK (jackknife) | Random forest, logistic regression, mutual information, principal component analysis, analysis of variance, Fisher discriminant ratio | Pima Indian diabetes dataset |
| [41] Dutta et al. (2018) | Logistic regression, support vector machine, random forest | Sensitivity: 0.80, 0.75, 0.84; F1-score: 0.80, 0.79, 0.84 | Training set (67%), test set (33%) | None | Diabetes: 130; control: 262; age: 21–81; imbalanced |
| [42] Alhassan et al. (2018) | Long short-term memory (LSTM) and gated recurrent unit (GRU) deep learning | Accuracy: 0.97; F1-score: 0.96 | Training set (90%), test set (10%); tenfold cross-validation | Logistic regression, support vector machine, multi-layer perceptron | Total: 41,000,000; diabetes: 62%; imbalanced |
| [15] Hertroijs et al. (2018) | Latent growth mixture modelling | Specificity: 81.2%; sensitivity: 78.4%; accuracy: 92.3% | Training set (90%), test set (10%); fivefold cross-validation | K-nearest neighbour | Total: 105,814; age: > 18 |
| [43] Kuo et al. (2020) | Random forest, C5.0, support vector machine | Accuracy: 1; F1-score: 1; AUC (ROC): 1; sensitivity: 1 | Tenfold cross-validation | Information gain (features); gain ratio | Total: 149; diabetes: 149; age: 21–91 |
| [44] Pimentel et al. (2018) | Naïve Bayes, alternating decision tree, random forest, random tree, k-nearest neighbor, support vector machine | Specificity: 0.76, 0.88, 0.87, 0.97, 0.82, 0.85; sensitivity: 0.62, 0.50, 0.33, 0.42, 0.40, 0.59; AUC (ROC): 0.73, 0.81, 0.87, 0.74, 0.62, 0.63 | Training set (70%), test set (30%); tenfold cross-validation | SMOTE | Total: 9947; diabetes: 13%; age: 21–93; imbalanced |
| [45] Talaei-Khoei et al. (2018) | Artificial neural network, support vector machine, logistic regression, decision tree | AUC (ROC): 0.614, 0.831, 0.738, 0.793; sensitivity: 0.608, 0.683, 0.677, 0.687; specificity: 0.783, 0.950, 0.712, 0.651; MCC: 0.797, 0.922, 0.581, 0.120; MCE: 0.844, 0.989, 0.771, 0.507 | Oversampling technique; random under-sampling | Synthetic minority oversampling; LASSO; AIC and BIC | Total: 10,911; diabetes: 51.9%; imbalanced |
| [46] Perveen et al. (2019) | J48 decision tree, Naïve Bayes | TPR: 0.85, 0.782, 0.852, 0.774; FPR: 0.218, 0.15, 0.226, 0.148; precision: 0.814, 0.782, 0.807; recall: 0.85, 0.802, 0.852, 0.824; F-measure: 0.831, 0.634, 0.829, 0.774; MCC: 0.634, 0.823, 0.628, 0.798; AUC (ROC): 0.883, 0.873, 0.836, 0.826 | K-medoids under-sampling | Logistic regression | Total: 667,907; age: 22–74; diabetes: 8.13%; imbalanced |
| [47] Yuvaraj et al. (2019) | Decision tree, Naïve Bayes, random forest | Precision: 87, 91, 94; recall: 77, 82, 88; F-measure: 82, 86, 91; accuracy: 88, 91, 94 | Training set (70%), test set (30%) | Information gain; RHadoop | Total: 75,664 |
| [48] Deo et al. (2019) | Bagged trees, linear support vector machine | Accuracy: 91%; AUC (ROC): 0.908 | Training set (70%), test set (30%); fivefold cross-validation; holdout validation | Synthetic minority oversampling technique; Gower’s distance | Total: 140; diabetes: 14; age: 12–90; imbalanced |
| [49] Jakka et al. (2019) | K-nearest neighbor, decision tree, Naive Bayes, support vector machine, logistic regression, random forest | Accuracy: 0.73, 0.70, 0.75, 0.66, 0.78, 0.74; recall: 0.69, 0.72, 0.74, 0.64, 0.76, 0.69; F1-score: 0.69, 0.72, 0.74, 0.40, 0.75, 0.69; misclassification rate: 0.31, 0.29, 0.26, 0.36, 0.24, 0.29; AUC (ROC): 0.70, 0.69, 0.70, 0.61, 0.74, 0.70 | None | None | Pima Indians Diabetes dataset |
| [50] Radja et al. (2019) | Naive Bayes, support vector machine, decision table, J48 decision tree | Precision: 0.80, 0.79, 0.76, 0.79 and 0.68, 0.74, 0.60, 0.63; recall: 0.84, 0.90, 0.81, 0.81 and 0.61, 0.54, 0.53, 0.60; F1-score: 0.76, 0.76, 0.71, 0.74 | Tenfold cross-validation | None | Total: 768; diabetes: 500; control: 268 |
| [51] Choi et al. (2019) | Logistic regression, linear discriminant analysis, quadratic discriminant analysis, K-nearest neighbor | AUC (ROC): 0.78, 0.77, 0.76, 0.77 | Tenfold cross-validation | Information gain | Total: 8454; diabetes: 404; age: 40–72 |
| [52] Akula et al. (2019) | K-nearest neighbor, support vector machine, decision tree, random forest, gradient boosting, neural network, Naive Bayes | Overall accuracy: 0.86; precision: 0.24; negative prediction: 0.99; sensitivity: 0.88; specificity: 0.85; F1-score: 0.38 | Training set: 800; test set: 10,000 | None | Pima Indians Diabetes and Practice Fusion datasets; total: 10,000; age: 18–80 |
| [53] Xie et al. (2019) | Support vector machine, decision tree, logistic regression, random forest, neural network, Naive Bayes | Accuracy: 0.81, 0.74, 0.81, 0.79, 0.82, 0.78; sensitivity: 0.43, 0.52, 0.46, 0.50, 0.37, 0.48; specificity: 0.87, 0.78, 0.87, 0.84, 0.90, 0.82; AUC (ROC): 0.78, 0.72, 0.79, 0.76, 0.80, 0.76 | Training set (67%), test set (33%) | Odds ratio; synthetic minority over-sampling technique | Total: 138,146; diabetes: 20,467; age: 30–80 |
| [54] Lai et al. (2019) | Gradient boosting machine, logistic regression, random forest, Rpart | AUC (ROC): 84.7%, 84.0%, 83.4%, 78.2% | Training set (80%), test set (20%); tenfold cross-validation | Misclassification costs | Total: 13,309; diabetes: 20.9%; age: 18–90; imbalanced |
| [17] Brisimi et al. (2018) | Alternating clustering and classification | AUC (ROC): 0.8814, 0.8861, 0.8829, 0.8812 | Training set (40%), test set (60%) | Sparse (l1-regularized) support vector machines, random forests, gradient tree boosting | Diabetes: 47,452; control: 116,934; mean age: 66 |
| [55] Abbas et al. (2019) | Support vector machine with Gaussian radial basis kernel | Accuracy: 96.80%; sensitivity: 80.09% | Tenfold cross-validation | Minimum redundancy maximum relevance algorithm | Total: 1438; diabetes: 161; age: 25–64 |
| [56] Sarker et al. (2020) | K-nearest neighbors | Precision: 0.75; recall: 0.76; F-score: 0.75; AUC (ROC): 0.72 | Tenfold cross-validation | Adaptive boosting, logistic regression, Naive Bayes, support vector machine, decision tree | Total: 500; age: 10–80 |
| [57] Cahn et al. (2020) | Gradient boosting trees model | AUC (ROC): 0.87; sensitivity: 0.61; specificity: 0.91; PPV: 0.16 | Training set: THIN dataset; validation sets: AppleTree and MHS datasets | Logistic regression | Age: 40–80; THIN: total 3,068,319, pre-DM 40%, DM 2.9%; AppleTree: pre-DM 381,872, DM 2.3%; MHS: pre-DM 12,951, DM 2.7% |
| [58] Garcia-Carretero et al. (2020) | K-nearest neighbors | Accuracy: 0.977; sensitivity: 0.998; specificity: 0.838; PPV: 0.976; NPV: 0.984; AUC (ROC): 0.89 | Tenfold cross-validation | Random forest | Age: 44–72; pre-DM: 1647; diabetes: 13% |
| [59] Zhang et al. (2020) | Logistic regression, classification and regression tree, gradient boosting machine, artificial neural networks, random forest, support vector machine | AUC (ROC): 0.84, 0.81, 0.87, 0.85, 0.87, 0.84; accuracy: 0.75, 0.80, 0.81, 0.74, 0.86, 0.76; sensitivity: 0.79, 0.67, 0.76, 0.81, 0.80, 0.75; specificity: 0.75, 0.81, 0.82, 0.73, 0.78, 0.77; PPV: 0.23, 0.26, 0.29, 0.26, 0.26, 0.24; NPV: 0.97, 0.96, 0.97, 0.98, 0.98, 0.97 | Tenfold cross-validation | Synthetic minority over-sampling technique | Total: 36,652; age: 18–79 |
| [26] Albahli et al. (2020) | Logistic regression | Accuracy: 0.97 | Tenfold cross-validation | Random forest; eXtreme gradient boosting | Pima Indians Diabetes dataset; age: 21–81 |
| [60] Haq et al. (2020) | Decision tree (Iterative Dichotomiser 3) | Accuracy: 0.99; sensitivity: 1; specificity: 0.98; MCC: 0.99; F1-score: 1; AUC (ROC): 0.998 | Training set (70%), test set (30%); hold-out with training set (90%), test set (10%); tenfold cross-validation | AdaBoost, random forest | Total: 2000; diabetes: 684; age: 21–81 |
| [61] Yang et al. (2020) | Linear discriminant analysis, support vector machine, random forest | AUC: 0.85, 0.84, 0.83; sensitivity: 0.80, 0.79, 0.78; specificity: 0.74, 0.75, 0.73; accuracy: 0.75, 0.74, 0.74; PPV: 0.36, 0.36, 0.35 | Training set (80%, 2011–2014), test set (20%, 2011–2014), validation set (2015–2016); fivefold cross-validation | Binary logistic regression | Total: 8057; age: 20–89; imbalanced |
| [62] Ahn et al. (2020) | Random forest, support vector machine | AUC (ROC): 1.00, 0.95 | Tenfold cross-validation | ELISA | Age: 43–68 |
| [63] Sarwar et al. (2018) | K-nearest neighbors, Naive Bayes, support vector machine, decision tree, logistic regression, random forest | Accuracy: 0.77, 0.74, 0.77, 0.71, 0.74, 0.71 | Training set (70%), test set (30%); tenfold cross-validation | None | Pima Indians Diabetes Dataset |
| [64] Zou et al. (2018) | Random forest, J48 decision tree, deep neural network | Accuracy: 0.81, 0.79, 0.78; sensitivity: 0.85, 0.82, 0.82; specificity: 0.77, 0.76, 0.75; MCC: 0.62, 0.57, 0.57 | Fivefold cross-validation | Principal component analysis; minimum redundancy maximum relevance | Pima Indian diabetes and Luzhou datasets |
| [65] Farran et al. (2019) | Logistic regression, k-nearest neighbours, support vector machine | AUC (ROC), 3-year: 0.74, 0.83, 0.73; 5-year: 0.72, 0.82, 0.68; 7-year: 0.70, 0.79, 0.71 | Fivefold cross-validation | None | Diabetes: 40,773; control: 107,821; age: 13–65 |
| [66] Xiong et al. (2019) | Multilayer perceptron, AdaBoost, random forest, support vector machine, gradient boosting | Accuracy: 0.87, 0.86, 0.86, 0.86, 0.86 | Training set (60%), test set (20%), tenfold cross-validation set (20%) | Missing values imputed with the feature mean | Total: 11,845; diabetes: 845; age: 20–100 |
| [67] Dinh et al. (2019) | Support vector machine, random forest, gradient boosting, logistic regression | AUC (ROC): 0.89, 0.94, 0.96, 0.72; sensitivity: 0.81, 0.86, 0.89, 0.67; precision: 0.81, 0.86, 0.89, 0.67; F1-score: 0.81, 0.86, 0.89, 0.67 | Training set (80%), test set (20%); tenfold cross-validation | None | Case 1: 21,131 (diabetes: 5532); case 2: 16,426 (prediabetes: 6482) |
| [68] Liu et al. (2019) | LASSO, SCAD, MCP, stepwise regression | AUC (ROC): 0.71, 0.70, 0.70, 0.71; sensitivity: 0.64, 0.64, 0.64, 0.63; specificity: 0.68, 0.68, 0.68, 0.68; precision: 0.35, 0.35, 0.35, 0.35; NPV: 0.87, 0.87, 0.87, 0.87 | Training set (70%), test set (30%); tenfold cross-validation | None | Total: 5481; age: > 40 |
| [9] Muhammad et al. (2020) | Logistic regression, support vector machine, K-nearest neighbor, random forest, Naive Bayes, gradient boosting | Accuracy: 0.81, 0.85, 0.82, 0.89, 0.77, 0.86; AUC (ROC): 0.80, 0.85, 0.82, 0.86, 0.77, 0.86 | None | Correlation coefficient analysis | Total: 383; diabetes: 51.9%; age: 1–150 |
| [69] Tang et al. (2020) | EMR-image multimodal network (CNN) | Accuracy: 0.86; F1-score: 0.76; AUC (ROC): 0.89; sensitivity: 0.68; precision: 0.88 | Fivefold cross-validation | None | Total: 997; diabetes: 401 |
| [70] Maniruzzaman et al. (2021) | Naive Bayes, decision tree, AdaBoost, random forest | Accuracy: 0.87, 0.90, 0.91, 0.93; AUC (ROC): 0.82, 0.78, 0.90, 0.95 | Tenfold cross-validation | Logistic regression | Total: 6561; diabetes: 657; age: 30–64; imbalanced |
| [71] Boutilier et al. (2021) | Random forest, logistic regression, AdaBoost, K-nearest neighbors, decision trees | AUC (ROC): 0.91, 0.91, 0.90, 0.86, 0.78 | Tenfold cross-validation | Two-sided Wilcoxon signed-rank test | Total: 2278; diabetes: 833; age: 35–63 |
| [72] Li et al. (2021) | Extreme gradient boosting (GBT) | AUC (ROC): 0.91; precision: 0.82; sensitivity: 0.80; F1-score: 0.77 | Training set (60%), validation set (20%), test set (20%) | Genetic algorithm | Diabetes: 570; control: 570; prediabetes: 570; age: 33–68 |
| [73] Lam et al. (2021) | Random forest, logistic regression, extreme gradient boosting (GBT) | AUC (ROC): 0.86; F1-score: 0.82 | Tenfold cross-validation | None | Control: 19,852; diabetes: 3103; age: 40–69 |
| [74] Deberneh et al. (2021) | Random forest, support vector machine, XGBoost | Accuracy: 0.73, 0.73, 0.72; precision: 0.74, 0.74, 0.74; F1-score: 0.74, 0.74, 0.73; sensitivity: 0.73, 0.74, 0.72; Kappa: 0.60, 0.60, 0.58; MCC: 0.60, 0.60, 0.58 | Tenfold cross-validation | ANOVA, chi-squared, SMOTE, feature importance | Total: 535,169; diabetes: 4.3%; prediabetes: 36%; age: 18–108 |
| [75] He et al. (2021) | Cox regression | C-statistic: 0.762 | Hold-out | None | Total: 68,299; diabetes: 1281; age: 40–69 |
| [76] García-Ordás et al. (2021) | Convolutional neural network (DNN) | Accuracy: 0.92 | Training set (90%), test set (10%) | Variational and sparse autoencoders | Pima Indians |
| [77] Kanimozhi et al. (2021) | Hybrid particle swarm optimization–artificial fish swarm optimization | Accuracy: 1, 0.99; specificity: 0.86, 0.83; sensitivity: 1, 0.99; MCC: 0.91, 0.92; Kappa: 0.96, 0.98 | Training set (90%), test set (10%); fivefold cross-validation | Min–max scaling; kernel extreme learning machine | Pima Indians Diabetes dataset; Diabetic Research Center |
| [78] Ravaut et al. (2021) | Extreme gradient boosting tree | AUC (ROC): 0.84 | Training set (86%), validation set (7%), test set (7%) | Mean absolute Shapley values | Total: 15,862,818; diabetes: 19,137; age: 40–69 |
| [79] De Silva et al. (2021) | Logistic regression | AUC (ROC): 0.75; accuracy: 0.62; specificity: 0.62; sensitivity: 0.77; PPV: 0.09; NPV: 0.98 | Training set (30%), validation set (30%), test set (40%) | SMOTE, ROSE | Total: 16,429; diabetes: 5.6%; age: > 20 |
| [80] Kim et al. (2021) | Deep neural network, logistic regression, decision tree | Accuracy: 0.80, 0.80, 0.71 | Fivefold cross-validation | Wald test | Total: 3889; diabetes: 746; age: 40–69 |
| [81] Vangeepuram et al. (2021) | Naive Bayes | AUC (ROC): 0.75; accuracy: 0.62; specificity: 0.62; sensitivity: 0.77; PPV: 0.09; NPV: 0.98 | Fivefold cross-validation | Friedman–Nemenyi test | Total: 2858; diabetes: 828; age: 12–19 |
| [82] Recenti et al. (2021) | Random forest, AdaBoost, gradient boosting | Accuracy: 0.90, 0.79, 0.86; precision: 0.88, 0.78, 0.84; F1-score: 0.90, 0.81, 0.87; sensitivity: 0.93, 0.84, 0.90; specificity: 0.87, 0.76, 0.82; AUC (ROC): 0.97, 0.90, 0.95 | Tenfold cross-validation | SMOTE | Total: 2943; age: 66–98; imbalanced |
| [83] Ramesh et al. (2021) | Support vector machine | Accuracy: 0.83; specificity: 0.79; sensitivity: 0.87 | Tenfold cross-validation | MICE, LASSO | Pima Indians |
| [84] Lama et al. (2021) | Random forest | AUC (ROC): 0.78 | Fivefold cross-validation | SHAP TreeExplainer | Total: 3342; diabetes: 556; age: 35–54 |
| [85] Shashikant et al. (2021) | Gaussian process-based kernel | Accuracy: 0.93; precision: 0.94; F1-score: 0.95; sensitivity: 0.96; specificity: 0.82; AUC (ROC): 0.89 | Tenfold cross-validation | Non-linear HRV | Total: 135; diabetes: 100; age: 20–70 |
| [86] Kalagotla et al. (2021) | Stacking of multi-layer perceptron, support vector machine, logistic regression | Accuracy: 0.78; precision: 0.72; sensitivity: 0.51; F1-score: 0.60 | Hold-out; k-fold cross-validation | Matrix correlation | Pima Indians |
| [87] Moon et al. (2021) | Logistic regression | AUC (ROC): 0.94 | Training set (47%), validation set (30%), test set (23%) | Cox regression | Total: 14,977; diabetes: 636; age: 48–69 |
| [88] Ihnaini et al. (2021) | Ensemble deep learning model | Accuracy: 0.99; precision: 1; sensitivity: 0.99; F1-score: 0.99; RMSE: 0; MAE: 0.6 | Hold-out | None | Pima Indians merged with the Hospital Frankfurt (Germany) dataset |
| [89] Rufo et al. (2021) | LightGBM | Accuracy: 0.98; specificity: 0.96; AUC (ROC): 0.98; sensitivity: 0.99 | Tenfold cross-validation | Min–max scaling | Diabetes: 1030; control: 1079; age: 12–90 |
| [90] Haneef et al. (2021) | Linear discriminant analysis | Accuracy: 0.67; specificity: 0.67; sensitivity: 0.62 | Training set (80%), test set (20%) | Z-score transformation; random down-sampling | Total: 44,659; age: 18–69; imbalanced |
| [91] Wei et al. (2022) | Random forest | AUC (ROC): 0.70; R²: 0.40 | Training set (70%), test set (30%); tenfold cross-validation | LASSO, PCA | Total: 8501; diabetes: 8.92%; age: 15–50; imbalanced |
| [92] Leerojanaprapa et al. (2019) | Bayesian network | AUC (ROC): 0.78 | Training set (70%), test set (30%) | None | Total: 11,240; diabetes: 5.53%; age: 15–19 |
| [93] Subbaiah et al. (2020) | Random forest | Accuracy: 1; specificity: 1; sensitivity: 1; Kappa: 1 | Training set (70%), test set (30%) | None | Pima Indians |
| [94] Thenappan et al. (2020) | Support vector machine | Accuracy: 0.97; specificity: 0.96; sensitivity: 0.94; precision: 0.96 | Training set (70%), test set (30%) | Principal component analysis | Pima Indians |
| [95] Sneha et al. (2019) | Support vector machine, random forest, Naive Bayes, decision tree, k-nearest neighbors | Accuracy: 0.78, 0.75, 0.74, 0.73, 0.63 | Training set (70%), test set (30%) | None | Total: 2500; age: 29–70 |
| [96] Jain et al. (2020) | Support vector machine, random forest, k-nearest neighbors | Accuracy: 0.74, 0.74, 0.76; precision: 0.67, 0.72, 0.70; sensitivity: 0.52, 0.44, 0.54; F1-score: 0.58, 0.55, 0.61; AUC (ROC): 0.74, 0.83, 0.83 | Training set (70%), test set (30%) | None | Control: 500; diabetes: 268; age: 21–81 |
| [97] Syed et al. (2020) | Decision forest | F1-score: 0.87; precision: 0.81; AUC (ROC): 0.90; sensitivity: 0.91 | Training set (80%), test set (20%) | Pearson chi-squared | Total: 4896; diabetes: 990; age: 40–60 |
| [98] Nuankaew et al. (2020) | Average weighted objective distance | Precision: 0.99; accuracy: 0.90; specificity: 0.97 | Training set (70%), test set (30%) | None | Mendeley diabetes dataset |
| [99] Samreen et al. (2021) | Stacking of NB, LR, KNN, DT, SVM, RF, AdaBoost, and GBT | Accuracy: 0.98, 0.99 (with SVD) | Training set (70%), test set (30%); tenfold cross-validation | One-hot encoding; singular value decomposition | Age: 20–90 |
| [100] Fazakis et al. (2021) | Weighted voting LR-RF | AUC (ROC): 0.88 | Hold-out | Forward/backward stepwise selection | English Longitudinal Study of Ageing |
| [101] Omana et al. (2021) | Newton’s divided-difference method | Accuracy: 0.97; S-error: 0.06 | Hold-out | Non-linear autoregressive regression | Total: 812,007; diabetes: 23.49% |
| [102] Ravaut et al. (2021) | Extreme gradient boosting tree | AUC (ROC): 0.80 | Training set (87%), validation set (7%), test set (6%) | Mean absolute Shapley values | Total: 14,786,763; diabetes: 27,820; age: 10–100; imbalanced |
| [103] Lang et al. (2021) | Deep belief network | AUC (ROC): 0.82; sensitivity: 0.80; specificity: 0.73 | Hold-out | Stratified sampling | Total: 1778; diabetes: 279 |
| [104] Gupta et al. (2021) | Deep neural network | Precision: 0.90; accuracy: 0.95; sensitivity: 0.95; F1-score: 0.93; specificity: 0.95 | Hold-out | None | Pima Indians |
| [105] Roy et al. (2021) | Gradient boosting tree | Accuracy: 0.92; precision: 0.86; sensitivity: 0.87; specificity: 0.79; AUC (ROC): 0.84 | Tenfold cross-validation | Correlation matrix; SMOTE | Total: 500; diabetes: 289; age: 20–80; imbalanced |
| [106] Zhang et al. (2021) | Bagging and boosting: GBT, RF, GBM | Accuracy: 0.82; sensitivity: 0.85; specificity: 0.82; AUC (ROC): 0.89 | Training set (80%), test set (20%); tenfold cross-validation | SMOTE | Total: 37,730; diabetes: 9.4%; age: 50–70; imbalanced |
| [107] Turnea et al. (2018) | Decision tree | Accuracy: 0.74; sensitivity: 0.60; specificity: 0.82; RMSE: 26.1 | Training set (75%), test set (25%) | None | Pima Indians |
| [108] Vettoretti et al. (2021) | RFE-Borda | RMSE: 0.98 | None | Correlation matrix | English Longitudinal Study of Ageing |
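
Since nearly every row above pairs a set of classifiers with tenfold cross-validation and reports accuracy and AUC (ROC), a minimal sketch of that shared evaluation pattern may be useful. It uses scikit-learn; the synthetic 768 × 8 matrix is only a stand-in for a Pima-style cohort, and the three models are illustrative, not the configuration of any reviewed study.

```python
# Sketch of the recurring "tenfold cross-validation, report accuracy and
# AUC (ROC) per model" pattern from the table above. All data and model
# choices here are placeholders.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with Pima-like shape (768 samples, 8 features, ~35% positives).
X, y = make_classification(n_samples=768, n_features=8, weights=[0.65, 0.35], random_state=0)

models = {
    "logistic regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "random forest": RandomForestClassifier(n_estimators=200, random_state=0),
    "k-nearest neighbors": make_pipeline(StandardScaler(), KNeighborsClassifier()),
}

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)  # "tenfold cross-validation"
for name, model in models.items():
    acc = cross_val_score(model, X, y, cv=cv, scoring="accuracy").mean()
    auc = cross_val_score(model, X, y, cv=cv, scoring="roc_auc").mean()
    print(f"{name}: accuracy = {acc:.2f}, AUC (ROC) = {auc:.2f}")
```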
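
The sensitivity, specificity, PPV, and NPV figures that dominate the "Validation parameter" column are all ratios of cells of a 2×2 confusion matrix. A small helper with placeholder labels makes the definitions explicit:

```python
# Deriving the metrics reported in the "Validation parameter" column from
# a binary confusion matrix; y_true and y_pred are illustrative placeholders.
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 0, 1, 0, 1, 1, 0, 0, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
sensitivity = tp / (tp + fn)   # recall / true-positive rate
specificity = tn / (tn + fp)   # true-negative rate
ppv = tp / (tp + fp)           # positive predictive value (precision)
npv = tn / (tn + fn)           # negative predictive value
print(f"sensitivity={sensitivity:.2f} specificity={specificity:.2f} "
      f"PPV={ppv:.2f} NPV={npv:.2f}")
```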
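
Many of the cohorts above are flagged as imbalanced and paired with SMOTE in the "Complementary techniques" column. The table does not record how each study combined SMOTE with cross-validation, so the sketch below shows one common, leakage-safe arrangement (an assumption, not a claim about any specific study): placing SMOTE inside an imbalanced-learn pipeline so it resamples only the training folds, never the evaluation fold.

```python
# One leakage-safe way to combine SMOTE with cross-validation: inside an
# imbalanced-learn Pipeline, SMOTE is fit and applied per training fold.
# Data and classifier are placeholders.
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic imbalanced cohort (~10% positives), echoing the "imbalanced" rows.
X, y = make_classification(n_samples=5000, weights=[0.9, 0.1], random_state=0)

pipe = Pipeline([
    ("smote", SMOTE(random_state=0)),                    # oversample minority class (train folds only)
    ("clf", GradientBoostingClassifier(random_state=0)),
])
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
auc = cross_val_score(pipe, X, y, cv=cv, scoring="roc_auc").mean()
print(f"AUC (ROC) with fold-wise SMOTE: {auc:.2f}")
```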