Skip to main content

Machine learning for predicting diabetes risk in western China adults



Diabetes mellitus is a global epidemic disease. Long-time exposure of patients to hyperglycemia can lead to various type of chronic tissue damage. Early diagnosis of and screening for diabetes are crucial to population health.


We collected the national physical examination data in Xinjiang, China, in 2020 (a total of more than 4 million people). Three types of physical examination indices were analyzed: questionnaire, routine physical examination and laboratory values. Integrated learning, deep learning and logistic regression methods were used to establish a risk model for type-2 diabetes mellitus. In addition, to improve the convenience and flexibility of the model, a diabetes risk score card was established based on logistic regression to assess the risk of the population.


An XGBoost-based risk prediction model outperformed the other five risk assessment algorithms. The AUC of the model was 0.9122. Based on the feature importance ranking map, we found that hypertension, fasting blood glucose, age, coronary heart disease, ethnicity, parental diabetes mellitus, triglycerides, waist circumference, total cholesterol, and body mass index were the most important features of the risk prediction model for type-2 diabetes.


This study established a diabetes risk assessment model based on multiple ethnicities, a large sample and many indices, and classified the diabetes risk of the population, thus providing a new forecast tool for the screening of patients and providing information on diabetes prevention for healthy populations.


Diabetes mellitus (DM) is a metabolic disease characterized by hyperglycemia. Hyperglycemia can cause chronic damage to tissues over time [1]. Diabetes has become a major health problem worldwide with a significant increase in DM patients. According to the International Diabetes Federation (IDF), approximately 537 million adults worldwide had diabetes in 2021 (with a prevalence of 10.5%), and it is estimated that by 2045, approximately 783 million people worldwide are likely to have diabetes (with a prevalence of approximately 12.2%) [2, 3]. In China, the number of adults with diabetes ranked first in the world in 2021 (approximately 140.9 million patients, with a prevalence rate of approximately 13.0%) [3, 4]. According to a survey, because individuals with type-2 diabetes mellitus (T2DM) usually lack the relevant knowledge, or they are asymptomatic, some individuals with T2DM patients can not be detected in time (approximately 50% of individuals with T2DM are undiagnosed) [3, 5]. It is necessary to identify individuals with diabetes in the population in an efficient and accurate manner. Therefore that early preventive measures and treatment can be taken to avoid further escalation of T2DM.

Currently, the scientific community has shifted its focus to the use of powerful computational methods for early and accurate prediction of diabetes [6,7,8,9,10,11]. Machine learning (ML) can iteratively learn nonlinear interactions from large amounts of data [12,13,14]. At present, based on electronic medical records and hospitalization data, ML methods have been used in the diagnosis and prediction of diabetes, prediabetes, complications and disease progression [7, 8, 15,16,17], as well as real-time blood glucose monitoring [18, 19], with some success. However, most of these models are created for the care of T2DM patients, and the sample size of training data is too small to reliably capture asymptomatic cases of early abnormal blood glucose, which are not suitable for mass screening of the population or public health planning [20, 21]. One study [22] reported that most models for diabetes prediction and risk assessment were rarely used because they relied on specific data. As physical examination data grows and ML rapidly develops, the use of physical examination data for disease risk assessment can provide better clinical guidance and facilitate large-scale screenings at an earlier stage [23]. However, at present, fewer scholars conduct diabetes screening based on health examination data [8, 24]. ML methods have not been applied to T2DM screening models and risk assessment in western China based on large-scale physical examination data.

We aimed to develop an ML model suitable for large-scale screening of T2DM among adults in western China. In this study, we established the model based on logistic regression (LR) and ML algorithms, including classification and regression tree (CART), light gradient boosting machine (LightGBM), random forest (RF), extreme gradient boosting (XGBoost), multilayer perceptron (MLP), and TabNet model, and combined them with western China large-scale health examination data, which are characterized by wide coverage, large volume and strong representation. In addition, in order to improve the convenience and flexibility of the model, a diabetes risk score card was established based on logistic regression to assess the risk of the population. This study is the first T2DM screening model that systematically compares various algorithms on a multiethnic and large sample basis.

Materials and methods

The dataset

We used the health examination data obtained from the national physical examination (NPE) project in 2020, which was previously described in detail [25]. The NPE health examination consisted of three parts: questionnaire, routine physical examination and laboratory tests.

A total of 9,333,091 people were enrolled in this study by signing an informed consent form. Participants were excluded from the study if they were (i) younger than 18 years old; or (ii) more than 20% of their baseline and laboratory test data were missing. Second, we removed variables unrelated to the study, such as participants’ names, contact phone numbers, and home addresses. After that, missing value processing (random forest interpolation) and extreme value processing (deletion) were performed for the remaining variables. The detailed analysis process is shown in Fig. 1. Finally, 4,075,431 samples were left, including 3,774,084 healthy individuals and 3,013,47 T2DM patients.

Fig. 1
figure 1

Flow Chart. CART classification and regression tree, LightGBM light gradient boosting machine, RF random forest, XGBoost extreme gradient boosting, LR logistic regression

Feature fusion

In our computational model, we combined three types of physical examination data: questionnaire data (9 features), routine tests (2 features), and laboratory values (9 features). A total of 20 features were sufficient to identify diabetes risk. Through the questionnaire, we collected demographic characteristics, diet, smoking, hypertension, coronary heart disease, and parental history of T2DM in the population. Body mass index (BMI) and waist circumference (WC) were collected through routine tests. Through laboratory testing, nine laboratory values were collected.

T2DM was defined if any of the following criteria were met: 2 h postprandial blood glucose (2hPG) ≥ 11.1 mmol/L, fasting blood glucose (FBG) ≥ 7.0 mmol/L, or a complaint of diabetes and the use of antidiabetic drugs.

Feature selection

To adjust the parameters and measure the model’s performance, the data were segmented using the 70–30 holdout method. The training set contained 2852801 samples (healthy population: 2641683, T2DM patients 211118). The possible risk factors for DM were preliminarily screened by reviewing the relevant literature. Univariate and multivariate logistic regression analyses were performed to analyze these characteristics, and correlation analysis was used to determine the correlation between each characteristic.

Classification algorithms

In this study, integrated learning (CART, LightGBM, RF, XGBoost), deep learning (TabNet and MLP) and LR models were used to construct a diabetes risk assessment model.

The CART algorithm is a tree arrangement algorithm. CART has the advantages of fast operation speed, high accuracy, high-dimensional data and no parameter assumptions. There are some problems with it, including high variance and overfitting, which limit its applicability as an independent prediction model.

The RF algorithm is a combination of bagging ensemble learning theory and the random subspace method [26, 27]. The core idea of the RF algorithm is to construct multiple independent classifiers, and then apply the average or majority voting principle to their predictions to determine the results of ensemble classifiers.

The XGBoost technique is a nonlinear machine learning technique based on trees [8]. XGBoost is based on combining weak estimators to predict hard-to-evaluate samples repeatedly [28], so as to constitute a strong estimator. The XGBoost can evaluate the importance of each input feature more easily than other black box techniques such as support vector machine (SVM) and artificial neural network (ANN) techniques.

The LightGBM algorithm is a decision tree-based ensemble algorithm that provides an effective implementation of gradient lifting [29]. Compared to traditional training algorithms, LightGBM has a faster training speed, a lower memory requirement, and a higher accuracy, which can lead to more efficient models.

MLP is a feed-forward, supervised artificial neural network structure that can contain multiple hidden layers through multilayer perceptrons to achieve classification modeling of nonlinear data.

TabNet is a neural network for tabular data that uses sequential attention mechanism to select the features to be reasoned about at each decision step, thus learning to obtain the most salient features for interpretability and more efficient learning.

In order to facilitate clinical and real-life applications, we designed a diabetes risk scorecard based on LR. In the process of establishing the score card, we used the chi-square method for continuous variables, and the discrete variables were directly divided into categories. We determined the final number of boxes according to the information value (IV) value curve. Then, the IV value of each feature was calculated and variables whose IV value was greater than 0.1 were selected into the scorecard model. Finally, the weight of evidence (WOE) value of each box was calculated, and the WOE was mapped back to the original dataset, and then LR was used to establish the model. The detailed process can be found in a previous study [8].

Model evaluation

To obtain the optimal parameters, we used grid search to perform hyperparameter debugging on four models to obtain the optimal parameters. Based on the confusion matrix, we calculated the accuracy, recall, sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), and receiver operating characteristic (ROC) curve of each model. Furthermore, we utilized the Kolmogorov–Smirnov (KS) value to appraise the efficiency of the scorecard model. A higher value of KS is indicative of an improved model. The greater the KS value, the more successful the model is. The KS value, which varies from 0 to 1, and when KS surpasses 0.3, the prediction performance of the model is deemed satisfactory.

Statistical analysis

The baseline characteristics of the study population are represented as the mean ± standard deviation when they are continuous variables, and as frequency (percentage) when they are categorical variables.

The differences in variables between diabetic patients with diabetes and healthy people were analyzed. The t test or Mann–Whitney test was used for continuous variables. The chi-square test or Fisher’s exact test were used for categorical variables. Statistical significance was inferred at a two-sided P-value < 0.05.

This study utilized Python Software Version 3.8.3. The libraries “Pandas” “NumPy” and “Matplotlib” were used for determining nulls and outliers as well as for interpolation. Meanwhile, the “Sklearn” library was used for construction and validation of the ML model. We use “PyTouch” to build a deep learning framework.


Basic characteristics

A total of 9,333,091 participants were included in this study. After data preprocessing, 4,075,431 participants were left, including 1,919,248 (47.09%) males and 2,156,183 (52.91%) females. A total of 3,774,084 healthy people and 301,347 T2DM patients were included. The prevalence of T2DM was calculated at 7.39% among the study population. The general characteristics of the study population are presented in Table 1. Patients with diabetes had older age, higher BMI, WC, HGB, WBC, FB, TC, TG, LDLC, lower PLT and HDLC than healthy people. Compared with the healthy population, the proportion of patients with hypertension and CAD was higher in diabetic patients with diabetes. The prevalence of T2DM was significantly different among people with different dietary habits and smoking statuses. For further details, see Table 1.

Table 1 Characteristics of participants in this study

We compared the prevalence of diabetes in different age groups (Fig. 2). It was found that the age of diabetes patients was concentrated in the range of 50–80 years old, accounting for approximately 60% of diabetes patients. Diabetes patients younger than the age of 40 accounted for 2.5% of the total number of diabetes patients.

Fig. 2
figure 2

Distribution of diabetes patients and healthy people by age. Healthy people (yellow) and T2DM patients (blue)

Feature selection

The possible risk factors for T2DM were preliminarily screened by reviewing relevant literature (Table 2). The Pearson’s Correlation Coefficient was utilized to reveal the interrelationship between the various features. The correlation between the factors was then depicted using heat maps (Additional file 1: Figure A1). In Additional file 1: Figure A1, BMI had a positive correlation with WC, while HTN and CDA showed a positive correlation.

Table 2 Multivariate logistic regression analysis in the development group

Univariate logistic regression analysis (Additional file 1: Table A1) and a multivariate logistic regression analysis (Table 2) were performed for these features. We found that age, unbalanced diet, smoking, hypertension, CAD, PDM, WC, BMI, WBC, FGB, TC, and TG were positively associated with the risk of T2DM. HDL was negatively associated with T2DM. Multivariate logistic regression showed that HGB, PLT and LDLC were negatively correlated with the risk of T2DM, which may be related to the data itself and affected by missing values. Considering that some previous studies found a relationship between TC and T2DM, combined with correlation analysis and logistic regression analysis, finally, sex, age, ethnicity, EH, SS, HTN, CAD, PDM, WC, BMI, WBC, PLT, FBG, ECG, TC, TG, LDLC, and HDLC were chosen to construct the diabetes risk prediction model.

Tuning of the parameters

To obtain the optimal parameters, we used grid search and cross-validation to conduct hyperparameter debugging for seven models, as shown in Additional file 1: Table A2.

Comparison of model performance

In this study, we constructed various tree-based machine learning models, such as CART, LightGBM, RF, XGBoost, MLP and TabNet, as well as the LR model. Table 3 and Additional file 1: Figure A2 show the performance of each prediction model on the validation group. The results showed that XGBoost had a good model performance, with an AUC of 0.9122. XGBoost also showed superiority in accuracy (0.8314), precision (0.2800), PPV (0.9829) and NPV (0.9122). Table 3 demonstrates the efficacy of each prediction model on the validation group.

Table 3 Performance metrics of the machine learning models

Figure 3 shows the ROC curves and AUC of different prediction models in the development group and validation group. It is found that XGBoost performed better than the other prediction models. The AUC of the development group was 0.9209, and the AUC of the validation group was 0.9122. The results showed that the XGBoost algorithm showed excellent advantages in predicting the risk of diabetes in this study.

Fig. 3
figure 3

ROC curves of different learning machine learning algorithms on the training and validation sets. A ROC curve in the development group. B ROC curve in the validation group

We used SHapley Additive Explanations (SHAP) to explain the characteristic contributions of the XGBoost model. Figure 4 showes the feature importance of the XGBoost algorithms. We found that HTN, FGB, age, PDM, CAD, ethnicity, TG, WC, BMI and TC were identified as the top ten of the most important factors.

Fig. 4
figure 4

Feature importance of the XGBoost model. HTN hypertension, FBG fasting blood glucose, PDM parental diabetes mellitus, CAD coronary heart disease, WC waist circumference, BMI body mass index, WBC white blood cell, HGB hemoglobin, PLT platelet, TC total cholesterol, TG triglyceride, LDLC low density lipoprotein cholesterol, HDLC high density lipoprotein cholesterol

Diabete risk score card

A diabetes risk score card with a scale of 100 was designed for this study. The diabetes risk score card was used to evaluates an individual's risk of diabetes by aiding in the calculation of their risk score. Based on the IV value, the risk score card model was established using age, FBG, HTN, WC, BMI, TG, CAD and ethnicity as variables. The ROC and KS curves of the validation group are displayed in Fig. 5.

Fig. 5
figure 5

ROC and KS curves of the diabetes risk score card. A ROC curve; B KS curve

We used the score card scaling algorithm to convert the model into score cards (Table 4). The score card comprises the baseline score as well as the associated score for each box within each feature. When using a scorecard, the total score is the sum of the base score and the feature score, which represents the diabetes risk value. In this study, the base score was 46.3.

Table 4 Diabetes risk score card

The Kolmogorov–Smirnov curve (Fig. 5) was utilized to illustrate the totality of the score and to determine the risk interval. The higher the KS value is, the greater the segmentation ability of the model’s corresponding threshold value will be. As illustrated in Fig. 5, the apex of the inflection point is achieved when the score is equal to 45. Therefore, to easily calculate the risk interval, we set 50 as the intermediate threshold. The higher the score generated from testing, the lower the risk of diabetes; conversely, the lower the score, the greater the likelihood of developing diabetes. To supply users with a more direct evaluation, four risk categories have been established in accordance with the KS chart (Table 5).

Table 5 Risk interval division and threshold of diabetes risk score card

Comparison with existing models

To further validate the efficacy of our model, a comparison of the proposed model against other leading methods was conducted, the results of which are presented in Table 6.

Table 6 Comparison with existing models


The increasing burden of diabetes has become a global challenge [3, 32]. Through mass screening, early identification and intervention of patients with diabetes can be achieved to delay or prevent the development of the disease [33, 34]. The most efficacious method for widespread screening of diabetes has yet to be identified. In this study, the T2DM risk prediction models were developed and validated on data from more than 4 million people. The data were obtained from the cross-sectional data of NPE, including more than 9 million people in 14 prefectures of Xinjiang, China, which can be considered representative of the overall population of Xinjiang. Following the evaluation of the model's performance, it was determined that the XGBoost model was the optimal model for predicting the risk of T2DM, with an AUC was is 0.9122.

In this study, we used questionnaires to obtain indicators of hypertension and cardiovascular diseases, genetic history and smoking and diet in the population, which not only captured the medical history of each patient, but also included demographic factors and laboratory test indicators. Univariate and multivariate logistic regression analyses showed that sex, age, ethnicity, EH, SS, HTN, CAD, PDM, WC, BMI, WBC, PLT, FBG, ECG, TC, TG, LDLC, and HDLC were important factors for diabetes. HTN, FGB, age, PDM, CAD, ethnicity, TG, WC, BMI, and TC were the most important predictors of diabetes. Except that the FGB was viewed as a recognized risk factor and predictor of T2DM, hypertension and CAD were the most important features of T2DM risk models, which presented with high predictive ability. Some studies have confirmed that hypertension, cardiovascular disease and diabetes are mutually promote and influence each other [35, 36]. Many pathophysiological mechanisms underlie the association between diabetes and cardiovascular disease. Among these mechanisms, several have been identified as potential contributors [36]. Including insulin resistance in the nitric-oxide pathway, the stimulatory effect of hyperinsulinemia on sympathetic drive, smooth muscle growth, and sodium-fluid retention, as well as the excitatory effect of hyperglycemia on the renin–angiotensin–aldosterone system, provide plausible explanations for the association between diabetes and cardiovascular disease. On the other hand, the functional changes occurring in the context of T2DM and hypertension significantly alter the hemodynamic stress on the heart and other organs. Some studies have also demonstrated the important role of ECG in the prediction of diabetes [37], and our study confirmed the association between abnormal ECG results and T2DM. Understanding these underlying mechanisms is crucial for developing targeted interventions to prevent and manage cardiovascular complications in individuals with diabetes.

Our study showed that age was also an important feature of diabetes prediction models. The FDRSMA is a classic and widely used diabetes risk scoring model [38]. The objective of FDRSM is to utilize six risk factors (including BMI, FBG, PDM, HDLC, blood pressure and TG) to evaluate the risk of T2DM among middle-aged individuals. T2DM is generally observed in adults and appears to be more prevalent among the elderly individuals. As people age, the glucose sensitivity of pancreatic cells decline and insulin secretion is impaired, leading to hyperglycemia and T2DM [39]. Several studies reported differences in the incidence of diabetes between ethnic groups [40,41,42] and confirmed that ethnicity could be a predictors of diabetes [40, 43,44,45,46]. In our study, we used Uyghur as a reference, with Han and Hui ethnic groups exhibiting a heightened susceptibility to diabetes. The kazakh, Mongolian and Tajik ethnic groups had a lower risk. Genetic and environmental differences (i.e., economic level, diet, lifestyle, climate) were taken into account. Family history of diabetes was also identified as an important risk factor for T2DM in our model, which is consistent with previous studies [47]. There is a significant genetic predisposition to T2DM, with a 2 to 30 fold increased risk for T2DM in those with a family history compared with those without a family history [48].

Many studies have demonstrated a connection between obesity and diabetes. Furthermore, our study discovered that augmented BMI and WC were correlated with a higher probability of having diabetes. The development of obesity gain can result in insulin resistance and diminished β-cell functionality in humans. According to the World Health Organization, the global increase in the prevalence of diabetes is believed to be related to chronic stress, being overweight [49], lacking of physical activity [50, 51], excessive consumption of alcohol [52, 53] and an unhealthy diet [54]. Our model also demonstrated that EH and SS were predictors of T2DM. In addition, we also found that people who smoked and those who had quit smoking had a higher risk of T2DM than those who did not smoke, and those who ate a vegetarian or meat-based diet had a higher probability of T2DM than those who ate a balanced meat-vegetarian diet.

We incorporated laboratory variables, including TC, TG and HDLC into the diabetes prediction model. Our findings indicated that TG was an independent risk factor for T2DM, while TC was not an independent risk factor for T2DM in our study. Consistent with other studies [55]. The feature importance ranking showed that TC, TG, LDLC and HDLC were all important features of the T2DM risk prediction model. Multiple studies have revealed that dyslipidemia and T2DM often coexist in individuals and share common pathological mechanisms, such as insulin resistance, metabolic disturbances, inflammation, and alterations in the gut microbiota [55, 56].

Currently, ML algorithms are increasingly used to predict diabetes and related diseases [11, 12, 18, 19, 30, 57,58,59]. In this study, a diabetes screening model based on CART, LightGBM, RF, XGBoost TabNet and MLP models was constructed. The AUC (0.9122), PPV (0.2800), NPV (0.9829) and accuracy (0.8314) of the XGBoost prediction model showed good performance in the validation group. It appears that our model outperforms the majority of existing models, which may be because the model is built on the basis of multiple features and big data. Other studies also found that XGBoost was effective in predicting the risk of diabetes [8, 16].

The development of the diabetes risk assessment score card assists clinicians and individuals alike in conducting self-examinations, with the aim of increasing the rate of diabetes cascade screening and enhancing individual lifestyle management. Hence, utilizing large-scale physical examination information to achieve prompt risk notification and identification of diabetes is the most practicable course of action.

This study has several advantages. First, based on the NPE project, it not only has a wide coverage and a large amount of data, but also includes a number of major ethnic groups in China, which can enable better assessment of the characteristics of the population in Xinjiang, China; in addition the risk prediction model has a good generalization ability in Xinjiang, China. Second, the risk factors affecting diabetes were fully considered in this study. Laboratory examination, questionnaire survey and routine examination data were fully taken into account to obtain indicators such as hypertension and cardiovascular diseases, genetic history and exercise and diet in the population, and the influencing factors of diabetes were comprehensively analyzed. Third, the results of our model all showed satisfactory predictive effects (XGBoost: AUC = 0.9122). This study also has several limitations. First, it is not possible to establish causality using cross-sectional data derived from national health examinations, therefore, these results should be subject to further investigated in subsequent research. Second, the health examination data used in our study were highly heterogeneous and had a high rate of missing data, which affected the power of the model.


T2DM imposes an inexorable and significant burden on society, including intangible costs of lost productivity, premature death, and poor quality of life. Our model is based on large-scale health examination data in Xinjiang, China, which was used to construct a large-scale early diabetes risk screening model. Our model can be applied directly to the physical examination database, providing a highly efficient means for the identification of high-risk diabetes records over at a large range. This allows for the understanding of potential diabetes risk ratios at the public health level and the implementation of more effective diabetes prevention and control strategies. It is of great significance for the early control of diabetes to identify early risk warning sings and perform screening based on large-scale physical examination data.

Availability of data and materials

The data used to support the findings of this study are available from the corresponding author upon request.


  1. World Health Organization. Global report on diabetes. Geneva: World Health Organization; 2016.

    Google Scholar 

  2. Cho NH, Shaw JE, Karuranga S, Huang Y, Da RFJ, Ohlrogge AW, Malanda B. IDF diabetes atlas: global estimates of diabetes prevalence for 2017 and projections for 2045. Diabetes Res Clin Pr. 2018;138:271–81.

    Article  CAS  Google Scholar 

  3. International Diabetes Federation. IDF diabetes atlas. Brussels: International Diabetes Federation; 2021.

    Google Scholar 

  4. Wang L, Peng W, Zhao Z, Zhang M, Shi Z, Song Z, Zhang X, Li C, Huang Z, Sun X, Wang L, Zhou M, Wu J, Wang Y. Prevalence and treatment of diabetes in China, 2013–2018. Jama-J Am Med Assoc. 2021;326:2498–506.

    Article  Google Scholar 

  5. Saeedi P, Petersohn I, Salpea P, Malanda B, Karuranga S, Unwin N, Colagiuri S, Guariguata L, Motala A, Ogurtsova K. Global and regional diabetes prevalence estimates for 2019 and projections for 2030 and 2045: results from the international diabetes federation diabetes atlas. Diabetes Res Clin Pr. 2019.

    Article  Google Scholar 

  6. Xiong XL, Zhang RX, Bi Y, Zhou WH, Zhu DL. Machine learning models in type 2 diabetes risk prediction: results from a cross-sectional retrospective study in chinese adults. Curr Med Sci. 2019;39:582–8.

    Article  PubMed  Google Scholar 

  7. Cahn A, Shoshan A, Sagiv T, Yesharim R, Goshen R, Shalev V, Raz I. Prediction of progression from pre-diabetes to diabetes: development and validation of a machine learning model. Diabetes Metab Res Rev. 2019.

    Article  PubMed  Google Scholar 

  8. Yang H, Luo YM, Ren XL, Wu M, He XL, Peng BW, Deng KJ, Yan D, Tang H, Lin H. Risk Prediction of diabetes: big data mining with fusion of multifarious physical examination indicators. Inform Fusion. 2021;75:140–9.

    Article  Google Scholar 

  9. Boutilier JJ, Chan TCY, Ranjan M, Deo S. Risk stratification for early detection of diabetes and hypertension in resource-limited settings: machine learning analysis. J Med Internet Res. 2021.

    Article  PubMed  PubMed Central  Google Scholar 

  10. Goel M, Sharma A, Chilwal AS, Kumari S, Kumar A, Bagler G. Machine learning models to predict sweetness of molecules. Comput Biol Med. 2023.

    Article  PubMed  Google Scholar 

  11. Zou Q, Qu K, Luo Y, Yin D, Ju Y, Tang H. Predicting diabetes mellitus with machine learning techniques. Front Genet. 2018;9:515.

    Article  PubMed  PubMed Central  Google Scholar 

  12. Kds A, Wkl A, Af B, Rtdc D, Cb E, Je A. Use and performance of machine learning models for type 2 diabetes prediction in community settings: a systematic review and meta-analysis—sciencedirect. Int J Med Inform. 2020;143:104268.

    Article  Google Scholar 

  13. Wu Y, Hu H, Cai J, Chen R, Zuo X, Cheng H, Yan D. Machine learning for predicting the 3-year risk of incident diabetes in chinese adults. Front Public Health. 2021.

    Article  PubMed  PubMed Central  Google Scholar 

  14. Shehab M, Abualigah L, Shambour Q, Abu-Hashem MA, Shambour MKY, Alsalibi AI, Gandomi AH. Machine learning in medical applications: a review of state-of-the-art methods. Comput Biol Med. 2022.

    Article  PubMed  Google Scholar 

  15. Dagliati A, Marini S, Sacchi L, Cogni G, Bellazzi R. Machine learning methods to predict diabetes complications. J Diabetes Sci Technol. 2017;12:193229681770637.

    Google Scholar 

  16. Ravaut M, Harish V, Sadeghi H, Leung KK, Rosella LC. Development and validation of a machine learning model using administrative health data to predict onset of type 2 diabetes. JAMA Netw Open. 2021;4:e2111315.

    Article  PubMed  PubMed Central  Google Scholar 

  17. Rabhi S, Blanchard F, Diallo AM, Zeghlache D, Lukas C, Berot A, Delemer B, Barraud S. Temporal deep learning framework for retinopathy prediction in patients with type 1 diabetes. Artif Intell Med. 2022.

    Article  PubMed  Google Scholar 

  18. Woldaregay AZ, Årsand E, Walderhaug S, Albers D, Mamykina L, Botsis T, Hartvigsen G. Data-driven modeling and prediction of blood glucose dynamics: machine learning applications in type 1 diabetes. Artif Intell Med. 2019;98:109–34.

    Article  PubMed  Google Scholar 

  19. Zhu T, Li K, Herrero P, Georgiou P. Deep learning for diabetes: a systematic review. IEEE J Biomed Health. 2021;25:2744–57.

    Article  Google Scholar 

  20. Choi SB, Kim WJ, Yoo TK, Park JS, Chung JW, Lee YH, Kang ES, Kim DW. Screening for prediabetes using machine learning models. Comput Math Method M. 2014.

    Article  Google Scholar 

  21. Choi SH, Kim TH, Lim S, Park KS, Jang HC, Cho NH. Hemoglobin A1c as a diagnostic tool for diabetes screening and new-onset diabetes prediction: a 6-year community-based prospective study. Diabetes Care. 2011;34:944–9.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  22. Noble D, Mathur R, Dent T, Meads C, Greenhalgh T. Risk models and scores for type 2 diabetes: systematic review. BMJ-Brit Med J. 2011.

    Article  Google Scholar 

  23. Kavakiotis I, Tsave O, Salifoglou A, Maglaveras N, Vlahavas I, Chouvarda I. Machine learning and data mining methods in diabetes research. Comput Struct Biotec. 2017;15:104–16.

    Article  Google Scholar 

  24. Deberneh HM, Kim I. Prediction of type 2 diabetes based on machine learning algorithm. Int J Env Res Pub He. 2021;18:3317.

    Article  Google Scholar 

  25. Ji W, Zhang Y, Cheng Y, Wang Y, Zhou Y. Development and validation of prediction models for hypertension risks: a cross-sectional study based on 4,287,407 participants. Front Cardiovasc Med. 2022.

    Article  PubMed  PubMed Central  Google Scholar 

  26. Schonlau M, Zou RY. The random forest algorithm for statistical learning. Stata J. 2020;20:3–29.

    Article  Google Scholar 

  27. Huang Y, Ren Y, Yang H, Ding Y, Liu Y, Yang Y, Mao A, Yang T, Wang Y, Xiao F, He Q, Zhang Y. Using a machine learning-based risk prediction model to analyze the coronary artery calcification score and predict coronary heart disease and risk assessment. Comput Biol Med. 2022.

    Article  PubMed  PubMed Central  Google Scholar 

  28. Lu Y, Fu X, Chen F, Wong K. Prediction of fetal weight at varying gestational age in the absence of ultrasound examination using ensemble learning. Artif Intell Med. 2020.

    Article  PubMed  Google Scholar 

  29. G. Ke, Q. Meng, T. Finley, T. Wang, W. Chen, W. Ma, Q. Ye, T. Liu, LightGBM: A Highly Efficient Gradient Boosting Decision Tree, 31st Conference on Neural Information Processing Systems (NIPS 2017). CA. 2017

  30. Zhou X, Qiao Q, Ji L, Ning F, Yang W, Weng J, Shan Z, Tian H, Ji Q, Lin L, Li Q, Xiao J, Gao W, Pang Z, Sun J. Nonlaboratory-based risk assessment algorithm for undiagnosed type 2 diabetes developed on a nation-wide diabetes survey. Diabetes Care. 2013;36:3944–52.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  31. Gao WG, Dong YH, Pang ZC, Nan HR, Wang SJ, Ren J, Zhang L, Tuomilehto J, Qiao Q. A simple Chinese risk score for undiagnosed diabetes. Diabetic Med. 2010;27:274–81.

    Article  CAS  PubMed  Google Scholar 

  32. Bommer C, Heesemann E, Sagalova V, Manne-Goehler J, Atun R, Bärnighausen T, Vollmer S. The global economic burden of diabetes in adults aged 20–79 years: a cost-of-illness study. Lancet Diabetes Endo. 2017;5:423–30.

    Article  Google Scholar 

  33. Li G, Zhang P, Wang J, Gregg EW, Yang W, Gong Q, Li H, Li H, Jiang Y, An Y, Shuai Y, Zhang B, Zhang J, Thompson TJ, Gerzoff RB, Roglic G, Hu Y, Bennett PH. The long-term effect of lifestyle interventions to prevent diabetes in the China Da Qing diabetes prevention study: a 20-year follow-up study. Lancet. 2008;371:1783–9.

    Article  PubMed  Google Scholar 

  34. Gillies CL, Abrams KR, Lambert PC, Cooper NJ, Sutton AJ, Hsu RT, Khunti K. Pharmacological and lifestyle interventions to prevent or delay type 2 diabetes in people with impaired glucose tolerance: systematic review and meta-analysis. BMJ-Brit Med J. 2007;334:299-302B.

    Article  Google Scholar 

  35. Strain WD, Paldánius PM. Diabetes, cardiovascular disease and the microcirculation. Cardiovasc Diabetol. 2018;17:57–57.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  36. Ferrannini E, Cushman WC. Diabetes and hypertension: the bad companions. Lancet. 2012;380:601–10.

    Article  PubMed  Google Scholar 

  37. Zanelli S, Ammi M, Hallab M, El YM. Diabetes detection and management through photoplethysmographic and electrocardiographic signals analysis: a systematic review. Sensors-Basel. 2022;22:7890.

    Article  CAS  Google Scholar 

  38. Wilson PW, Meigs JB, Sullivan L, Fox CS, Nathan DM, D’Agostino RS. Prediction of incident diabetes mellitus in middle-aged adults: the framingham offspring study. Arch Intern Med. 2007;167:1068–74.

    Article  PubMed  Google Scholar 

  39. Chang AM, Halter JB. Aging and insulin secretion. Am J Physiol-Endoc M. 2003;284:E7.

    CAS  Google Scholar 

  40. Wang L, Gao P, Zhang M, Huang Z, Zhang D, Deng Q, Li Y, Zhao Z, Qin X, Jin D, Zhou M, Tang X, Hu Y, Wang L. Prevalence and ethnic pattern of diabetes and prediabetes in China in 2013. JAMA-J Am Med Assoc. 2017;317:2515.

    Article  Google Scholar 

  41. Cheng YJ. Prevalence of diabetes by race and ethnicity in the United States, 2011–2016. J Am Med Assoc. 2019;322:2389–98.

    Article  Google Scholar 

  42. Wang MC, Shah NS, Carnethon MR, O’Brien MJ, Khan SS. Age at diagnosis of diabetes by race and ethnicity in the United States from 2011 to 2018. JAMA Intern Med. 2021.

    Article  PubMed  PubMed Central  Google Scholar 

  43. Golden SH, Yajnik C, Phatak S, Hanson RL, Knowler WC. Racial/ethnic differences in the burden of type 2 diabetes over the life course: a focus on the USA and India. Diabetologia. 2019;62:1751–60.

    Article  PubMed  PubMed Central  Google Scholar 

  44. Gong H, Pa L, Wang K, Mu H, Dong F, Ya S, Xu G, Tao N, Pan L, Wang B, Shan G. Prevalence of diabetes and associated factors in the Uyghur and Han population in Xinjiang China. Int J Env Res Pub He. 2015;12:12792–802.

    Article  CAS  Google Scholar 

  45. Schwartz N, Nachum Z, Green MS. The prevalence of gestational diabetes mellitus recurrence–effect of ethnicity and parity: a metaanalysis. Am J Obstet Gynecol. 2015;213:310–7.

    Article  PubMed  Google Scholar 

  46. Remsing SC, Abner SC, Reeves K, Coles B, Lawson C, Gillies C, Razieh C, Yates T, Davies MJ, Lilford R, Khunti K, Zaccardi F. Ethnicity and prognosis following a cardiovascular event in people with and without type 2 diabetes: observational analysis in over 5 million subjects in England. Diabetes Res Clin Pr. 2022.

    Article  Google Scholar 

  47. Sakurai M, Nakamura K, Miura K, Takamura T, Yoshita K, Sasaki S, Nagasawa SY, Morikawa Y, Ishizaki M, Kido T. Family history of diabetes, lifestyle factors, and the 7-year incident risk of type 2 diabetes mellitus in middle-aged Japanese men and women. J Diabetes Invest. 2013;4:261–8.

    Article  Google Scholar 

  48. Cornelis MC, Zaitlen N, Hu FB, Kraft P, Price AL. Genetic and environmental components of family history in type 2 diabetes. Hum Genet. 2015;134:259–67.

    Article  CAS  PubMed  Google Scholar 

  49. Carbone S, Buono M, Ozemek C, Lavie CJ. Obesity, risk of diabetes and role of physical activity, exercise training and cardiorespiratory fitness. Prog Cardiovasc Dis. 2019;62:327–33.

    Article  PubMed  Google Scholar 

  50. Yang Z, Scott CA, Mao C, Tang J, Farmer AJ. Resistance exercise versus aerobic exercise for type 2 diabetes: a systematic review and meta-analysis. Sports Med. 2014;44:487.

    Article  PubMed  Google Scholar 

  51. Pan B, Long G, Xun YQ, Chen YJ, Gao CY, Han X, Zuo LQ, Shan HQ, Yang KH, Ding GW. Exercise training modalities in patients with type 2 diabetes mellitus: a systematic review and network meta-analysis. Int J Behav Nutr Phy. 2018.

    Article  Google Scholar 

  52. Polsky S, Akturk HK. Alcohol consumption diabetes risk, and cardiovascular disease within diabetes. Curr Diabetes Rep. 2017;17:136–212.

    Article  Google Scholar 

  53. Knott C, Bell S, Britton A. Alcohol consumption and the risk of type 2 diabetes: a systematic review and dose-response meta-analysis of more than 1.9 million individuals from 38 observational studies. Diabetes Care. 2015.

    Article  PubMed  Google Scholar 

  54. Zhao Z, Li M, Li C, Wang T, Xu Y, Zhan Z, Dong W, Shen Z, Xu M, Lu J. Dietary preferences and diabetic risk in China: a large-scale nationwide internet data-based study. J Diabetes. 2020;12:270–8.

    Article  PubMed  Google Scholar 

  55. Wang K, Gong M, Xie S, Zhang M, Zheng H, Zhao X, Liu C. Nomogram prediction for the 3-year risk of type 2 diabetes in healthy mainland China residents. EPMA J. 2019;10:227–37.

    Article  PubMed  PubMed Central  Google Scholar 

  56. Verges B. Pathophysiology of diabetic dyslipidaemia: where are we? Diabetologia. 2015;58:886–99.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  57. Makroum MA, Adda M, Bouzouane A, Ibrahim H. Machine learning and smart devices for diabetes management: systematic review. Sensors. 2022;22:1843.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  58. Contreras I, Vehi J. Artificial intelligence for diabetes management and decision support: literature review. J Med Internet Res. 2018.

    Article  PubMed  PubMed Central  Google Scholar 

  59. Haq AU, Li JP, Khan J, Memon MH, Nazir S, Ahmad S, Khan GA, Ali A. Intelligent machine learning approach for effective recognition of diabetes in e-healthcare using clinical data. Sensors. 2020;20:2649.

    Article  PubMed  PubMed Central  Google Scholar 

Download references


The Xinjiang Uygur Autonomous Region Health Commission and the Health Management Institute of Xinjiang Medical University are hereby thanked for their support in providing data. In addition, thanks are expressed to all the participants for their assistance.


This work was supported by the Key Research and Development Program of China under Grant 2022YFC3601600 and 2021YFC2009400, in part by the National Natural Science Foundation of China (NSFC) under Grant 61876194, in part by the Province Natural Science Foundation of Guangdong under Grant 2021A1515011897, in part by the Key Research and Development Program of Guangzhou under Grant 202206010028, in part by the Fundamental Research Funds for the Central Universities, Sun Yat-sen University 23ptpy119, and in part by the Province Natural Science Foundation of Xinjiang, China, under Grant 2016D01C330.

Author information

Authors and Affiliations



LL drafted the composition and conducted analyses and interpretations of the data. YC and LL conceived and designed the research. YC performed the statistical analyses. WJ collected the data and critically revised important intellectual content. YC, ZH and ML conducted data preprocessing. YZ, YY and YW critically reviewed and edited the manuscript.

Corresponding authors

Correspondence to Yining Yang, Yushan Wang or Yi Zhou.

Ethics declarations

Ethics approval and consent to participate

This study was conducted in accordance with the principles outlined in the "Helsinki Declaration" and was approved by the Ethics Committee of the First Affiliated Hospital of Xinjiang Medical University. All methods were performed in accordance with the relevant guidelines and regulations (No. K202101-20).

Competing interests

The authors declare that this research was conducted in the absence of any commercial or financial relationships that could be construed as a potential competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Additional file 1

: Table A1. Univariate logistic regression analysis. Table A2. Hyperparameters of the model. Figure A1. The heat map to show the Pearson correlation of features. Figure A2. Confusion matrix of each classification model. (A) CART, (B) LightGBM, (C) RF, (D) XGBoost, (E) MLP, (F) TabNet and (G) LR.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit The Creative Commons Public Domain Dedication waiver ( applies to the data made available in this article, unless otherwise stated in a credit line to the data.

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Li, L., Cheng, Y., Ji, W. et al. Machine learning for predicting diabetes risk in western China adults. Diabetol Metab Syndr 15, 165 (2023).

Download citation

  • Received:

  • Accepted:

  • Published:

  • DOI: