Machine learning and deep learning predictive models for type 2 diabetes: a systematic review
Cite | References | Machine learning model | Validation parameter | Data sampling | Complementary techniques | Description of the population |
---|---|---|---|---|---|---|
Type of data: electronic health records | | | | | | |
[29] | Arellano-Campos et al. (2019) | Cox proportional hazard regression | Accuracy: 0.75, hazard ratios | Cross-validation (k = 10) and bootstrapping | Beta-coefficients model | Baseline: 7636 follow-up: 6144 diabetes: 331 age: 32–54 |
[30] | You et al. (2019) | Super learner: ensemble that learns a weighted combination of candidate algorithms | Average treatment effect | Cross-validation | Targeted learning, query language, logistic and tree regression | Total: 78,894 control: 41,127 diabetes: 37,767 age: > 40 |
[27] | Maxwell et al. (2017) | Sigmoid function-Deep Neural Network with cross entropy as loss function | Accuracy: 0.921 F1-score: 0.823 precision: 0.915 sensitivity: 0.867 | Training set (90%) test set (10%) tenfold cross-validation | RAkEL-LibSVM RAkEL-MLP RAkEL-SMO RAkEL-J48 RAkEL-RF MLkNN | Total: 110,300 imbalanced 6 disease categories |
[28] | Nguyen et al. (2019) | Deep Neural Network with three embedding and two hidden layers | Specificity 0.96 accuracy: 0.84 sensitivity: 0.31 AUC (ROC): 0.84 | Training set (70%): cross-validation 9:1 test set (30%) | Generalized linear model large-scale regression | Total: 76,214 78 diseases age: 25–78 |
[31] | Pham et al. (2017) | Recurrent Neural Network Convolutional-Long Short-Term Memory (C-LSTM) | F1-score: 0.79 precision: 0.66 | Training set (66%) tuning set (17%) test set (17%) | Support vector machine and random forests | Diabetes: 12,000 age: 18–100 mean age: 73 |
[32] | Spänig et al. (2019) | Deep Neural Networks with hyperbolic tangent (tanh) activation | AUC (ROC) = 0.71 AUC (ROC) = 0.68 | Training set (80%) test set (20%) | Sub-sampling approach, support vector machine with RBF kernel | Total: 4814 diabetes: 646 diagnosed: 397 not diagnosed: 257 age: 45–75 imbalanced |
[33] | Wang et al. (2020) | Convolutional neural network and bidirectional long short-term memory | Precision: 92.3 recall: 90.5 F score: 91.3 accuracy: 92.8 | Training set (70%) validation set (10%) test set (20%) | SVM-TFIDF CNN BiLSTM | Total: 18,625 diabetes: 5645 10 disease categories |
[34] | Kim et al. (2020) | Class activation map and CNN (SSANet) | R2 = 0.75 MAE = 3.55 AUC (ROC) = 0.77 | Training set (89%) validation set (1%) test set (10%) | Linear regression | Total: 412,026 normal: 243,668 diabetes: 14,189 age: 19–90 |
[35] | Bernardini et al. (2020) | Sparse balanced support vector machine (SB-SVM) | Recall = 0.7464 AUC (ROC) = 0.8143 | Tenfold cross-validation | Sparse 1-norm SVM | Total: 2433 diabetes: 225 control: 2208 age: 60–80 imbalanced |
[36] | Mei et al. (2017) | Hierarchical recurrent neural network | AUC (ROC) = 0.9268 Accuracy = 0.6745 | Training set (80%) validation set (10%) test set (10%) | Linear regression | Total: 620,633 |
[25] | Prabhu et al. (2019) | Deep belief neural network | Recall: 1.0 precision: 0.68 F1 score: 0.80 | Training set validation set test set | Principal component analysis | Pima Indian Women Diabetes Dataset |
[13] | Bernardini et al. (2020) | Multiple instance learning boosting | Accuracy: 0.83 F1-score: 0.81 precision: 0.82 recall: 0.83 AUC (ROC): 0.89 | Tenfold cross-validation | None | Total: 252 diabetes: 252 age: 54–72 |
[37] | Solares et al. (2019) | Hazard ratios using Cox regression | AUC (ROC): 0.75, concordance (C-statistic) | Derivation set (80%) validation (20%) | None | Total: 80,964 diabetes: 2267 age: 50 |
[38] | Kumar et al. (2017) | Support vector machine, Naive Bayes, K-nearest neighbor, C4.5 decision tree | Precision: 0.65, 0.68, 0.7, 0.72 recall: 0.69, 0.68, 0.7, 0.74 accuracy: 0.69, 0.67, 0.7, 0.74 F-score: 0.65, 0.68, 0.7, 0.72 | N-fold (N = 10) cross-validation | None | Diabetes: 200 age: 1–100 |
[39] | Olivera et al. (2017) | Logistic regression, artificial neural network, K-nearest neighbor, Naïve Bayes | AUC (ROC): 75.44, 75.48, 74.94, 74.47 balanced accuracy: 69.3, 69.47, 68.74, 68.95 | Training set (70%) test set (30%) tenfold cross-validation | Forward selection | Diabetes: 12,447 unknown: 1359 age: 35–74 |
[10] | Alghamdi et al. (2017) | Naïve Bayes tree, random forest, and logistic model tree, j48 decision tree | Kappa: 1.34, 3.63, 1.37, 0.70, 1.14 recall (%): 99.2, 99.2, 90.8, 99.9, 99.4 specificity (%): 1.6, 3.1, 21.2, 0.50, 1.3 accuracy (%): 83.9, 84.1, 79.9, 84.3, 84.1 | N-fold cross-validation | Multiple linear regression, gain ranking method, synthetic minority oversampling technique | Total: 32,555 diabetes: 5099 imbalanced |
[14] | Xie et al. (2017) | K2 structure-learning algorithm | Accuracy = 82.48 | Training set (75%) test set (25%) | None | Total: 21,285 diabetes: 1124 age: 35–65 |
[40] | Peddinti et al. (2017) | Regularised least-squares regression for binary risk classification | Odds ratio, accuracy: 0.77 | Tenfold cross-validation | Logistic regression | Total: 543 diabetes: 146 age: 48–50 |
[8] | Maniruzzaman et al. (2017) | Linear discriminant analysis, quadratic discriminant analysis, Naïve Bayes, Gaussian process classification, support vector machine, artificial neural network, Adaboost, logistic regression, decision tree, random forest | Accuracy: 0.92 sensitivity: 0.96 specificity: 0.80 PPV: 0.91 NPV: 0.91 AUC (ROC): 0.93 | Cross-validation K2, K4, K5, K10, and JK | Random forest, logistic regression, mutual information, principal component analysis, analysis of variance Fisher discriminant ratio | Pima Indian diabetic dataset |
[41] | Dutta et al. (2018) | Logistic regression support vector machine random forest | Sensitivity: 0.80, 0.75, 0.84 F1-score: 0.80, 0.79, 0.84 | Training set (67%) test set (33%) | None | Diabetes: 130 control: 262 imbalanced age: 21–81 |
[42] | Alhassan et al. (2018) | Long short-term memory (LSTM) and gated recurrent unit (GRU) deep learning | Accuracy: 0.97 F1-score: 0.96 | Training set (90%) test set (10%) tenfold cross-validation | Logistic regression, support vector machine, multi-layer perceptron | Total: 41,000,000 imbalanced diabetes: 62% |
[15] | Hertroijs et al. (2018) | Latent growth mixture modelling | Specificity: 81.2% sensitivity: 78.4% accuracy: 92.3% | Training set (90%) test set (10%) fivefold cross-validation | K-nearest neighbour | Total: 105,814 age: > 18 |
[43] | Kuo et al. (2020) | Random forest C5.0 support vector machine | Accuracy: 1 F1-score: 1 AUC (ROC): 1 sensitivity: 1 | Tenfold cross-validation | Information gain (features) gain ratio | Total: 149 diabetes: 149 age: 21–91 |
[44] | Pimentel et al. (2018) | Naïve Bayes, alternating decision tree, random forest, random tree, k-nearest neighbor, support vector machine | Specificity: 0.76, 0.88, 0.87, 0.97, 0.82, 0.85 sensitivity: 0.62, 0.50, 0.33, 0.42, 0.40, 0.59 AUC (ROC): 0.73, 0.81, 0.87, 0.74, 0.62, 0.63 | Training set (70%) test set (30%) tenfold cross-validation | SMOTE | Total: 9947 imbalanced diabetes: 13% age: 21–93 |
[45] | Talaei-Khoei et al. (2018) | Artificial neural network, support vector machine, logistic regression, decision tree | AUC (ROC): 0.614, 0.831, 0.738, 0.793 sensitivity: 0.608, 0.683, 0.677, 0.687 specificity: 0.783, 0.950, 0.712, 0.651 MCC: 0.797, 0.922, 0.581, 0.120 MCE: 0.844, 0.989, 0.771, 0.507 | Oversampling technique, random under-sampling | Synthetic minority oversampling, LASSO, AIC and BIC | Total: 10,911 imbalanced diabetes: 51.9% |
[46] | Perveen et al. (2019) | J48 decision tree, Naïve Bayes | TPR: 0.85, 0.782, 0.852, 0.774 FPR: 0.218, 0.15, 0.226, 0.148 precision: 0.814, 0.782, 0.807 recall: 0.85, 0.802, 0.852, 0.824 F-measure: 0.831, 0.634, 0.829, 0.774 MCC: 0.634, 0.823, 0.628, 0.798 AUC (ROC): 0.883, 0.873, 0.836, 0.826 | K-medoids under-sampling | Logistic regression | Total: 667,907 age: 22–74 diabetes: 8.13% imbalanced |
[47] | Yuvaraj et al. (2019) | Decision tree, Naïve Bayes, random forest | Precision: 87, 91, 94 recall: 77, 82, 88 F-measure: 82, 86, 91 accuracy: 88, 91, 94 | Training set (70%) test set (30%) | Information gain, RHadoop | Total: 75,664 |
[48] | Deo et al. (2019) | Bagged trees, linear support vector machine | Accuracy: 91% AUC (ROC): 0.908 | Training set (70%) test set (30%) fivefold cross-validation, holdout validation | Synthetic minority oversampling technique, Gower’s distance | Total: 140 diabetes: 14 imbalanced age: 12–90 |
[49] | Jakka et al. (2019) | K-nearest neighbor, decision tree, Naive Bayes, support vector machine, logistic regression, random forest | Accuracy: 0.73, 0.70, 0.75, 0.66, 0.78, 0.74 recall: 0.69, 0.72, 0.74, 0.64, 0.76, 0.69 F1-score: 0.69, 0.72, 0.74, 0.40, 0.75, 0.69 misclassification rate: 0.31, 0.29, 0.26, 0.36, 0.24, 0.29 AUC (ROC): 0.70, 0.69, 0.70, 0.61, 0.74, 0.70 | None | None | Pima Indians Diabetes dataset |
[50] | Radja et al. (2019) | Naive Bayes, support vector machine, decision table, J48 decision tree | Precision: 0.80, 0.79, 0.76, 0.79 precision: 0.68, 0.74, 0.60, 0.63 recall: 0.84, 0.90, 0.81, 0.81 recall: 0.61, 0.54, 0.53, 0.60 F1-score: 0.76, 0.76, 0.71, 0.74 | Tenfold cross-validation | None | Total: 768 diabetes: 500 control: 268 |
[51] | Choi et al. (2019) | Logistic regression, linear discriminant analysis, quadratic discriminant analysis, K-nearest neighbor | AUC (ROC): 0.78, 0.77, 0.76, 0.77 | Tenfold cross-validation | Information gain | Total: 8454 diabetes: 404 age: 40–72 |
[52] | Akula et al. (2019) | K nearest neighbor, support vector machine, decision tree, random forest, gradient boosting, neural network, Naive Bayes | Overall accuracy: 0.86 precision: 0.24 negative prediction: 0.99 sensitivity: 0.88 specificity: 0.85 F1-score: 0.38 | Training set: 800 test set: 10,000 | None | Pima Indians Diabetes Dataset Practice Fusion Dataset total: 10,000 age: 18–80 |
[53] | Xie et al. (2019) | Support vector machine, decision tree, logistic regression, random forest, neural network, Naive Bayes | Accuracy: 0.81, 0.74, 0.81, 0.79, 0.82, 0.78 sensitivity: 0.43, 0.52, 0.46, 0.50, 0.37, 0.48 specificity: 0.87, 0.78, 0.87, 0.84, 0.90, 0.82 AUC (ROC): 0.78, 0.72, 0.79, 0.76, 0.80, 0.76 | Training set (67%) test set (33%) | Odds ratio, synthetic minority over-sampling technique | Total: 138,146 diabetes: 20,467 age: 30–80 |
[54] | Lai et al. (2019) | Gradient boosting machine, logistic regression, random forest, Rpart | AUC (ROC): 84.7%, 84.0%, 83.4%, 78.2% | Training set (80%) test set (20%) tenfold cross-validation | Misclassification costs | Total: 13,309 diabetes: 20.9% age: 18–90 imbalanced |
[17] | Brisimi et al. (2018) | Alternating clustering and classification | AUC (ROC): 0.8814, 0.8861, 0.8829, 0.8812 | Training set (40%) test set (60%) | Sparse (l1-regularized), support vector machines, random forests, gradient tree boosting | Diabetes: 47,452 control: 116,934 age mean: 66 |
[55] | Abbas et al. (2019) | Support vector machine with Gaussian radial basis | Accuracy: 96.80% sensitivity: 80.09% | Tenfold cross-validation | Minimum redundancy maximum relevance algorithm | Total: 1438 diabetes: 161 age: 25–64 |
[56] | Sarker et al. (2020) | K-nearest neighbors | Precision: 0.75 recall: 0.76 F-score: 0.75 AUC (ROC): 0.72 | Tenfold cross-validation | Adaptive boosting, logistic regression, Naive Bayes, support vector machine, decision tree | Total: 500 age: 10–80 |
[57] | Cahn et al. (2020) | Gradient boosting trees model | AUC (ROC): 0.87 sensitivity: 0.61 specificity: 0.91 PPV: 0.16 | Training set: THIN dataset validation sets: AppleTree and MHS datasets | Logistic regression | Age: 40–80 THIN: total = 3,068,319 pre-DM: 40% DM: 2.9% AppleTree: pre-DM: 381,872 DM: 2.3% MHS: pre-DM: 12,951 DM: 2.7% |
[58] | Garcia-Carretero et al. (2020) | K-nearest neighbors | Accuracy: 0.977 sensitivity: 0.998 specificity: 0.838 PPV: 0.976 NPV: 0.984 AUC (ROC): 0.89 | Tenfold cross-validation | Random forest | Age: 44–72 pre-DM = 1647 diabetes: 13% |
[59] | Zhang et al. (2020) | Logistic regression, classification and regression tree, gradient boosting machine, artificial neural networks, random forest, support vector machine | AUC (ROC): 0.84, 0.81, 0.87, 0.85, 0.87, 0.84 accuracy: 0.75, 0.80, 0.81, 0.74, 0.86, 0.76 sensitivity: 0.79, 0.67, 0.76, 0.81, 0.80, 0.75 specificity: 0.75, 0.81, 0.82, 0.73, 0.78, 0.77 PPV: 0.23, 0.26, 0.29, 0.26, 0.26, 0.24 NPV: 0.97, 0.96, 0.97, 0.98, 0.98, 0.97 | Tenfold cross-validation | Synthetic minority over-sampling technique | Total: 36,652 age: 18–79 |
[26] | Albahli et al. (2020) | Logistic regression | Accuracy: 0.97 | Tenfold cross-validation | Random forest, eXtreme Gradient Boosting | Pima Indians Diabetes dataset age: 21–81 |
[60] | Haq et al. (2020) | Decision tree (iterative Dichotomiser 3) | Accuracy: 0.99 sensitivity: 1 specificity: 0.98 MCC: 0.99 F1-score: 1 AUC (ROC): 0.998 | Training set (70%) test set (30%) hold-out training set (90%) test set (10%) tenfold cross-validation | AdaBoost, random forest | Total = 2000 diabetes: 684 age: 21–81 |
[61] | Yang et al. (2020) | Linear discriminant analysis, support vector machine, random forest | AUC: 0.85, 0.84, 0.83 sensitivity: 0.80, 0.79, 0.78 specificity: 0.74, 0.75, 0.73 accuracy: 0.75, 0.74, 0.74 PPV: 0.36, 0.36, 0.35 | Training set (80%, 2011–2014), test set (20%, 2011–2014) and validation set (2015–2016) fivefold cross-validation | Binary logistic regression | Total = 8057 age: 20–89 imbalanced |
[62] | Ahn et al. (2020) | Random forest, support vector machine | AUC (ROC): 1.00, 0.95 | Tenfold cross-validation | ELISA | Age: 43–68 |
[63] | Sarwar et al. (2018) | K nearest neighbors, Naive Bayes, support vector machine, decision tree, logistic regression, random forest | Accuracy: 0.77, 0.74, 0.77, 0.71, 0.74, 0.71 | Training set (70%) test set (30%) tenfold cross-validation | None | Pima Indians Diabetes Dataset |
[64] | Zou et al. (2018) | Random forest, J48 decision tree, Deep Neural Network | Accuracy: 0.81, 0.79, 0.78 sensitivity: 0.85, 0.82, 0.82 specificity: 0.77, 0.76, 0.75 MCC: 0.62, 0.57, 0.57 | Fivefold cross-validation | Principal component analysis, minimum redundancy maximum relevance | Pima Indians Diabetes dataset and Luzhou dataset |
[65] | Farran et al. (2019) | Logistic regression, k-nearest neighbours, support vector machine | AUC (ROC): 3-year: 0.74, 0.83, 0.73 5-year: 0.72, 0.82, 0.68 7-year: 0.70, 0.79, 0.71 | Fivefold cross-validation | None | Diabetes: 40,773 control: 107,821 age: 13–65 |
[66] | Xiong et al. (2019) | Multilayer perceptron, AdaBoost, random forest, support vector machine, gradient boosting | Accuracy: 0.87, 0.86, 0.86, 0.86, 0.86 | Training set (60%) test set (20%) tenfold cross-validation set (20%) | Missing values imputed with feature mean | Total: 11,845 diabetes: 845 age: 20–100 |
[67] | Dinh et al. (2019) | Support vector machine, random forest, gradient boosting, logistic regression | AUC (ROC): 0.89, 0.94, 0.96, 0.72 sensitivity: 0.81, 0.86, 0.89, 0.67 precision: 0.81, 0.86, 0.89, 0.67 F1-score: 0.81, 0.86, 0.89, 0.67 | Training set (80%) test set (20%) tenfold cross-validation | None | Case 1: 21,131 diabetes: 5532 case 2: 16,426 prediabetes: 6482 |
[68] | Liu et al. (2019) | LASSO, SCAD, MCP, stepwise regression | AUC (ROC): 0.71, 0.70, 0.70, 0.71 sensitivity: 0.64, 0.64, 0.64, 0.63 specificity: 0.68, 0.68, 0.68, 0.68 precision: 0.35, 0.35, 0.35, 0.35 NPV: 0.87, 0.87, 0.87, 0.87 | Training set (70%) test set (30%) tenfold cross-validation | None | Total: 5481 age: > 40 |
[9] | Muhammad et al. (2020) | Logistic regression, support vector machine, K-nearest neighbor, random forest, Naive Bayes, gradient boosting | Accuracy: 0.81, 0.85, 0.82, 0.89, 0.77, 0.86 AUC (ROC): 0.80, 0.85, 0.82, 0.86, 0.77, 0.86 | None | Correlation coefficient analysis | Total: 383 age: 1–150 diabetes: 51.9% |
[69] | Tang et al. (2020) | EMR-image multimodal network (CNN) | Accuracy: 0.86 F1-score: 0.76 AUC (ROC): 0.89 Sensitivity: 0.68 Precision: 0.88 | Fivefold cross-validation | None | Total: 997 diabetes: 401 |
[70] | Maniruzzaman et al. (2021) | Naive Bayes, decision tree, AdaBoost, random forest | Accuracy: 0.87, 0.90, 0.91, 0.93 AUC (ROC): 0.82, 0.78, 0.90, 0.95 | Tenfold cross-validation | Logistic regression | Total: 6561 diabetes: 657 age: 30–64 imbalanced |
[71] | Boutilier et al. (2021) | Random forest, logistic regression, AdaBoost, K-nearest neighbors, decision trees | AUC (ROC): 0.91, 0.91, 0.90, 0.86, 0.78 | Tenfold cross-validation | 2-sided Wilcoxon signed-rank test | Total: 2278 diabetes: 833 age: 35–63 |
[72] | Li et al. (2021) | Extreme gradient boosting (GBT) | AUC (ROC): 0.91 precision: 0.82 sensitivity: 0.80 F1-score: 0.77 | Training set (60%) validation (20%) test set (20%) | Genetic algorithm | Diabetics: 570 control: 570 prediabetics: 570 age: 33–68 |
[73] | Lam et al. (2021) | Random forest, logistic regression, extreme gradient boosting (GBT) | AUC (ROC): 0.86 F1-score: 0.82 | Tenfold cross-validation | None | Control: 19,852 diabetes: 3103 age: 40–69 |
[74] | Deberneh et al. (2021) | Random forest, support vector machine, XGBoost | Accuracy: 0.73, 0.73, 0.72 precision: 0.74, 0.74, 0.74 F1-score: 0.74, 0.74, 0.73 sensitivity: 0.73, 0.74, 0.72 Kappa: 0.60, 0.60, 0.58 MCC: 0.60, 0.60, 0.58 | Tenfold cross-validation | ANOVA, Chi-squared, SMOTE, feature importance | Total: 535,169 diabetes: 4.3% prediabetes: 36% age: 18–108 |
[75] | He et al. (2021) | Cox regression | C-statistic: 0.762 | Hold-out | None | Total: 68,299 diabetes: 1281 age: 40–69 |
[76] | García-Ordás et al. (2021) | Convolutional neural network (CNN) | Accuracy: 0.92 | Training set (90%) test set (10%) | Variational and sparse autoencoders | Pima Indians |
[77] | Kanimozhi et al. (2021) | Hybrid particle swarm optimization-artificial fish swarm optimization | Accuracy: 1, 0.99 specificity: 0.86, 0.83 sensitivity: 1, 0.99 MCC: 0.91, 0.92 Kappa: 0.96, 0.98 | Training set (90%) test set (10%) fivefold cross-validation | Min–max scaling, kernel extreme learning machine | Pima Indians Diabetics, Diabetic Research Center |
[78] | Ravaut et al. (2021) | Extreme gradient boosting tree | AUC (ROC): 0.84 | Training set (86%) validation (7%) test set (7%) | Mean absolute Shapley values | Total: 15,862,818 diabetes: 19,137 age: 40–69 |
[79] | De Silva et al. (2021) | Logistic regression | AUC (ROC): 0.75 accuracy: 0.62 specificity: 0.62 sensitivity: 0.77 PPV: 0.09 NPV: 0.98 | Training set (30%) validation (30%) test set (40%) | SMOTE ROSE | Total: 16,429 diabetes: 5.6% age: >20 |
[80] | Kim et al. (2021) | Deep neural network, logistic regression, decision tree | Accuracy: 0.80, 0.80, 0.71 | Fivefold cross-validation | Wald test | Total: 3889 diabetes: 746 age: 40–69 |
[81] | Vangeepuram et al. (2021) | Naive Bayes | AUC (ROC): 0.75 accuracy: 0.62 specificity: 0.62 sensitivity: 0.77 PPV: 0.09 NPV: 0.98 | Fivefold cross-validation | Friedman-Nemenyi | Total: 2858 diabetes: 828 age: 12–19 |
[82] | Recenti et al. (2021) | Random forest Ada-boost gradient boosting | Accuracy: 0.90, 0.79, 0.86 precision: 0.88, 0.78, 0.84 F1-score: 0.90, 0.81, 0.87 sensitivity: 0.93, 0.84, 0.90 specificity: 0.87, 0.76, 0.82 AUC (ROC): 0.97, 0.90, 0.95 | Tenfold cross-validation | SMOTE | Total: 2943 age: 66–98 imbalance |
[83] | Ramesh et al. (2021) | Support vector machine | Accuracy: 0.83 specificity: 0.79 sensitivity: 0.87 | Tenfold cross-validation | MICE, LASSO | Pima Indians |
[84] | Lama et al. (2021) | Random forest | AUC (ROC): 0.78 | Fivefold cross-validation | SHAP TreeExplainer | Total: 3342 diabetes: 556 age: 35–54 |
[85] | Shashikant et al. (2021) | Gaussian process-based kernel | Accuracy: 0.93 precision: 0.94 F1-score: 0.95 sensitivity: 0.96 specificity: 0.82 AUC (ROC): 0.89 | Tenfold cross-validation | Non-linear HRV | Total: 135 diabetes: 100 age: 20–70 |
[86] | Kalagotla et al. (2021) | Stacking of multi-layer perceptron, support vector machine, logistic regression | Accuracy: 0.78 precision: 0.72 sensitivity: 0.51 F1-score: 0.60 | Hold-out, k-fold cross-validation | Matrix correlation | Pima Indians |
[87] | Moon et al. (2021) | Logistic regression | AUC (ROC): 0.94 | Training set (47%) validation (30%) test set (23%) | Cox regression | Total: 14,977 diabetes: 636 age: 48–69 |
[88] | Ihnaini et al. (2021) | Ensemble deep learning model | Accuracy: 0.99 precision: 1 sensitivity: 0.99 F1-score: 0.99 RMSE: 0 MAE: 0.6 | Hold-out | None | Pima Indians merged with Hospital Frankfurt (Germany) dataset |
[89] | Rufo et al. (2021) | LightGBM | Accuracy: 0.98 specificity: 0.96 AUC (ROC): 0.98 sensitivity: 0.99 | Tenfold cross-validation | Min–max scale | Diabetes: 1030 control: 1079 age: 12–90 |
[90] | Haneef et al. (2021) | Linear discriminant analysis | Accuracy: 0.67 specificity: 0.67 sensitivity: 0.62 | Training set (80%) test set (20%) | Z-score transformation, random down-sampling | Total: 44,659 age: 18–69 imbalanced |
[91] | Wei et al. (2022) | Random forest | AUC (ROC): 0.70 R2: 0.40 | Training set (70%) test set (30%) tenfold cross-validation | LASSO, PCA | Total: 8501 age: 15–50 diabetes: 8.92% imbalanced |
[92] | Leerojanaprapa et al. (2019) | Bayesian network | AUC (ROC): 0.78 | Training set (70%) test set (30%) | None | Total: 11,240 diabetes: 5.53% age: 15–19 |
[93] | Subbaiah et al. (2020) | Random forest | Accuracy: 1 specificity: 1 sensitivity: 1 Kappa: 1 | Training set (70%) test set (30%) | None | Pima Indians |
[94] | Thenappan et al. (2020) | Support vector machine | Accuracy: 0.97 specificity: 0.96 sensitivity: 0.94 precision: 0.96 | Training set (70%) test set (30%) | Principal component analysis | Pima Indians |
[95] | Sneha et al. (2019) | Support vector machine, random forest, Naive Bayes, decision tree, k-nearest neighbors | Accuracy: 0.78, 0.75, 0.74, 0.73, 0.63 | Training set (70%) test set (30%) | None | Total: 2500 age: 29–70 |
[96] | Jain et al. (2020) | Support vector machine, random forest, k-nearest neighbors | Accuracy: 0.74, 0.74, 0.76 precision: 0.67, 0.72, 0.70 sensitivity: 0.52, 0.44, 0.54 F1-score: 0.58, 0.55, 0.61 AUC (ROC): 0.74, 0.83, 0.83 | Training set (70%) test set (30%) | None | Control: 500 diabetes: 268 age: 21–81 |
[97] | Syed et al. (2020) | Decision forest | F1-Score: 0.87 precision: 0.81 AUC (ROC): 0.90 Sensitivity: 0.91 | Training set (80%) test set (20%) | Pearson Chi-squared | Total: 4896 diabetes: 990 age: 40–60 |
[98] | Nuankaew et al. (2020) | Average weighted objective distance | Precision: 0.99 accuracy: 0.90 specificity: 0.97 | Training set (70%) test set (30%) | None | Mendeley data for diabetes |
[99] | Samreen et al. (2021) | Stacked ensemble of NB, LR, KNN, DT, SVM, RF, AdaBoost, GBT | Accuracy: 0.98, 0.99 (SVD) | Training set (70%) test set (30%) tenfold cross-validation | One hot encoding, singular value decomposition | Age: 20–90 |
[100] | Fazakis et al. (2021) | Weighted voting LR-RF | AUC (ROC): 0.88 | Hold-out | Forward/backward stepwise selection | English longitudinal study of ageing |
[101] | Omana et al. (2021) | Newton’s divided difference method | Accuracy: 0.97 S-error: 0.06 | Hold-out | Non-linear autoregressive regression | Total: 812,007 diabetes: 23.49% |
[102] | Ravaut et al. (2021) | Extreme gradient boosting tree | AUC (ROC): 0.80 | Training set (87%) validation (7%) test set (6%) | Mean absolute Shapley values | Total: 14,786,763 diabetes: 27,820 age: 10–100 imbalance |
[103] | Lang et al. (2021) | Deep belief network | AUC (ROC): 0.82 sensitivity: 0.80 specificity: 0.73 | Hold-out | Stratified sampling | Total: 1778 diabetes: 279 |
[104] | Gupta et al. (2021) | Deep Neural Network | Precision: 0.90 accuracy: 0.95 sensitivity: 0.95 F1-score: 0.93 specificity: 0.95 | Hold-out | None | Pima Indians |
[105] | Roy et al. (2021) | Gradient boosting tree | Accuracy: 0.92 precision: 0.86 sensitivity: 0.87 specificity: 0.79 AUC (ROC): 0.84 | Tenfold cross-validation | Correlation matrix SMOTE | Total: 500 diabetes: 289 age: 20–80 Imbalanced |
[106] | Zhang et al. (2021) | Bagging and boosting (GBT, RF, GBM) | Accuracy: 0.82 sensitivity: 0.85 specificity: 0.82 AUC (ROC): 0.89 | Training set (80%) test set (20%) tenfold cross-validation | SMOTE | Total: 37,730 diabetes: 9.4% age: 50–70 imbalanced |
[107] | Turnea et al. (2018) | Decision tree | Accuracy: 0.74 sensitivity: 0.60 specificity: 0.82 RMSE: 26.1 | Training set (75%) test set (25%) | None | Pima Indians |
[108] | Vettoretti et al. (2021) | RFE-Borda | RMSE: 0.98 | None | Correlation matrix | English longitudinal study of ageing |
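The validation protocol most rows above report is k-fold (usually tenfold) cross-validation. A minimal sketch of that protocol, using a hypothetical one-feature threshold classifier and synthetic glucose records that are assumptions for illustration and do not come from any reviewed study:

```python
import random

def k_fold_indices(n, k=10, seed=0):
    """Shuffle record indices and split them into k roughly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def fit_threshold(train):
    """Toy 'training': place the decision threshold midway between the
    mean glucose of the control and diabetes classes."""
    g0 = [r["glucose"] for r in train if r["label"] == 0]
    g1 = [r["glucose"] for r in train if r["label"] == 1]
    return (sum(g0) / len(g0) + sum(g1) / len(g1)) / 2

def cross_validate(records, k=10):
    """Hold each fold out once, refit on the rest, and score on the held-out fold."""
    folds = k_fold_indices(len(records), k)
    accuracies = []
    for fold in folds:
        held_out = set(fold)
        train = [r for i, r in enumerate(records) if i not in held_out]
        threshold = fit_threshold(train)
        correct = sum(
            (records[i]["glucose"] > threshold) == (records[i]["label"] == 1)
            for i in fold
        )
        accuracies.append(correct / len(fold))
    return accuracies

# Synthetic patient records: glucose drawn around two class means.
rng = random.Random(1)
records = (
    [{"glucose": rng.gauss(100, 15), "label": 0} for _ in range(200)]
    + [{"glucose": rng.gauss(150, 15), "label": 1} for _ in range(200)]
)
scores = cross_validate(records)
print(round(sum(scores) / len(scores), 3))  # mean fold accuracy
```

The hold-out and training/validation/test splits in other rows are the same idea with a single fixed partition instead of k rotating ones.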
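Many rows flag imbalanced cohorts and list the synthetic minority oversampling technique (SMOTE) as the remedy. A minimal sketch of the core idea, assuming a single numeric feature and toy values chosen for illustration; real studies would apply a full implementation such as imbalanced-learn's `SMOTE` across all features:

```python
import random

def smote(minority, n_new, k=3, seed=0):
    """Create n_new synthetic minority samples: pick a minority sample,
    pick one of its k nearest neighbours, and interpolate between them."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbours of `base` by absolute distance, excluding itself
        neighbours = sorted(
            (m for m in minority if m is not base), key=lambda m: abs(m - base)
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(base + gap * (neighbour - base))
    return synthetic

rng = random.Random(1)
majority = [rng.gauss(100, 10) for _ in range(50)]  # 50 majority samples
minority = [110, 118, 121, 125, 130]                # only 5 minority samples
new_samples = smote(minority, n_new=45)
balanced_minority = minority + new_samples
print(len(balanced_minority))  # now matches the majority class size: 50
```

Under-sampling variants in the table (random under-sampling, K-medoids) instead shrink the majority class toward the minority count.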