ML/AI for Diabetes-2 Risk Management, Lifestyle/Daily-Life Support

Over the last decade, the burden of Type-2 Diabetes (T2D) has grown substantially, especially in developing countries. T2D is a common condition in which the level of sugar (glucose) in the blood becomes too high, and it is responsible for considerable morbidity and mortality.

The objective of this project is to summarize recent efforts to use ML/AI techniques to assist in the management of T2D, along with the associated challenges.

Our results indicate that ML methods are being progressively established as suitable for use in clinical daily practice, as well as for the self-management of diabetes. Consequently, these methods provide powerful tools for improving patients’ quality of life.


State-of-the-Art

The potential of AI to enable T2D solutions has been investigated in the context of multiple critical management issues. In this case study, we use the following T2D management categories to summarize the three most recent contributions described in the reviewed articles:

  • Blood glucose control/prediction
  • Risk and patient personalization
  • Lifestyle and daily-life support in T2D management

In recent years, various ML/AI techniques (DNN, SVM, KNN, DT, GBT, GBM, RF, LR, etc.) have been used to predict T2D and its complications. However, researchers and developers still face two main challenges when building T2D predictive models:

  • There is considerable heterogeneity in previous studies regarding the algorithms used, making it challenging to identify the optimal one.
  • There is a lack of transparency about the features used in the optimized models, which reduces their interpretability.

This systematic analysis aims to address both challenges. It primarily follows the earlier review, enriched with the most recent case studies (cf. References and Explore More).

Public-Domain Data Analysis

Conventionally, T2D ML/AI studies use the Kaggle PIMA Indian Diabetes (PID) dataset, originally from the National Institute of Diabetes and Digestive and Kidney Diseases (see the UC Irvine Machine Learning Repository). The objective of the dataset is to diagnostically predict whether or not a patient has diabetes, based on certain diagnostic measurements included in the dataset. Several constraints were placed on the selection of these instances from a larger database. In particular, all patients here are females at least 21 years old of Pima Indian heritage.

Dataset 1: 768 Females of Pima Indian Heritage

The PIMA dataset consists of several medical predictor variables and one target variable, Outcome. Predictor variables include the number of pregnancies the patient has had, their BMI, insulin level, age, and so on:

  • Pregnancies: Number of times pregnant
  • Glucose: Plasma glucose concentration at 2 hours in an oral glucose tolerance test
  • BloodPressure: Diastolic blood pressure (mm Hg)
  • SkinThickness: Triceps skin fold thickness (mm)
  • Insulin: 2-Hour serum insulin (mu U/ml)
  • BMI: Body mass index (weight in kg/(height in m)^2)
  • DiabetesPedigreeFunction: Diabetes pedigree function
  • Age: Age (years)
  • Outcome: Class variable (0 or 1)

An interesting attribute is the Diabetes Pedigree Function (pedi). It provides some data on T2D history in relatives and the genetic relationship of those relatives to the patient (the hereditary risk).

RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Pregnancies               768 non-null    int64  
 1   Glucose                   768 non-null    int64  
 2   BloodPressure             768 non-null    int64  
 3   SkinThickness             768 non-null    int64  
 4   Insulin                   768 non-null    int64  
 5   BMI                       768 non-null    float64
 6   DiabetesPedigreeFunction  768 non-null    float64
 7   Age                       768 non-null    int64  
 8   Outcome                   768 non-null    int64  
dtypes: float64(2), int64(7)
memory usage: 54.1 KB

Let’s count the number of zero values:

Pregnancies                   0
Glucose                       5
BloodPressure                35
SkinThickness               227
Insulin                     374
BMI                          11
DiabetesPedigreeFunction      0
Age                           0
Outcome                       0

Zeros in Glucose, BloodPressure, SkinThickness, Insulin, and BMI are physiologically implausible and encode missing measurements; they are to be replaced by mean/median values per column.
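A minimal pandas sketch of this count-and-impute step. The tiny inline DataFrame is illustrative only; in practice you would load the full PIMA CSV (e.g. with pd.read_csv):

```python
import numpy as np
import pandas as pd

# Illustrative stand-in for the PIMA table (zeros encode missing measurements)
df = pd.DataFrame({
    "Glucose": [148, 85, 0, 89],
    "Insulin": [0, 94, 168, 0],
    "BMI": [33.6, 26.6, 0.0, 28.1],
})

# Count zero values per column, as in the table above
print((df == 0).sum())

# Replace zeros with NaN, then impute with the per-column median
cols = ["Glucose", "Insulin", "BMI"]
df[cols] = df[cols].replace(0, np.nan)
df[cols] = df[cols].fillna(df[cols].median())
```

The median is usually preferred over the mean here because heavily skewed columns such as Insulin would otherwise pull the imputed value upward.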

The ML objective (a binary classification problem) is to predict whether the patient is diabetic (Outcome = 1) or not (Outcome = 0).

In principle, we can rename the columns using the description from the original dataset website:

[“NumTimesPrg”, “PlGlcConc”, “BloodP”, “SkinThick”, “TwoHourSerIns”, “BMI”, “DiPedFunc”, “Age”, “HasDiabetes”]
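Assuming the data sits in a pandas DataFrame with the nine original column names, the rename is a one-liner (the empty frame below is a stand-in for the loaded data):

```python
import pandas as pd

# Empty frame carrying the original PIMA column names (stand-in for the loaded data)
df = pd.DataFrame(columns=[
    "Pregnancies", "Glucose", "BloodPressure", "SkinThickness",
    "Insulin", "BMI", "DiabetesPedigreeFunction", "Age", "Outcome",
])

# Apply the shorter names from the original dataset website
df.columns = ["NumTimesPrg", "PlGlcConc", "BloodP", "SkinThick",
              "TwoHourSerIns", "BMI", "DiPedFunc", "Age", "HasDiabetes"]
print(list(df.columns))
```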

PIMA data statistics: mean, std, min, max, Q1, Q2, and Q3.

Histograms of input data columns:

Histograms of input data columns

The BMI density plot:

BMI density plot

The BMI vs Glucose scatter plot:

The BMI vs Glucose scatter plot

The composite BMI-Glucose-Age scatter plot:

The composite BMI-Glucose-Age scatter plot

It is clear that the PIMA population is generally young: most patients are under 50 years old.

Recall that a boxplot is drawn from the first quartile to the third quartile, with the line inside the box marking the median. The whiskers extend from both ends of the box toward the minimum and maximum values (in most implementations, to the most extreme points within 1.5×IQR of the box; points beyond that are drawn as outliers). A box-and-whisker plot is thus a type of distribution graph: it shows how the data are dispersed around the median, whether the data are skewed, and whether or not they are symmetrical.

Age density plot vs boxplot
BMI density plot vs boxplot
Blood Pressure density plot vs boxplot
Glucose density plot vs boxplot
Insulin density plot vs boxplot
Diabetes Pedigree Function density plot vs boxplot
Pregnancies density plot vs boxplot
Skin Thickness density plot vs boxplot
Boxplots of input data columns

Let’s check the skewness of Insulin, Skin Thickness, and Age. High skewness indicates that the data may not be normally distributed. Outliers (data values that lie far away from the other values) can strongly affect ML results; they are easiest to spot on the boxplots above (cf. Insulin and DPF).
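Pandas can quantify what the plots suggest. A short sketch with illustrative numbers (not the real PIMA values), where a result well above 1 signals the kind of strong right skew seen for Insulin:

```python
import pandas as pd

# Illustrative right-skewed sample: one extreme value drags the mean upward
s = pd.Series([15, 18, 20, 22, 25, 30, 200])

# Sample skewness; values > 1 indicate strong right skew
print(round(s.skew(), 2))
```

For the real data, df.skew() returns the skewness of every numeric column at once.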

We can also group our data by Outcome:

Outcome
0    500
1    268
Pie-chart of Outcome
Bar-chart of Outcome

Upon examining the distribution of class values (see the two plots above), we notice that there are 500 negative instances (65.1%) and 268 positive instances (34.9%).

Glucose vs Outcome histograms
BMI vs Outcome histograms
Glucose vs Outcome density plots
Pregnancies vs Outcome density plots
Pregnancies vs Outcome violin plots
Scatter pair-plot and density plots of input data columns grouped by Outcome

Reviewing histograms of all attributes in the dataset shows us the following:

  • Some of the attributes look normally distributed (plas, pres, skin, and mass).
  • Some of the attributes look like they may have an exponential distribution (preg, insu, pedi, age).
  • Age should probably have a normal distribution; the constraints on the data collection may have skewed it.
  • Testing for normality (normality plot) may be of interest. We could look at fitting the data to a normal distribution.

Reviewing scatter plots of all attributes in the dataset shows that:

  • There is no obvious relationship between age and onset of diabetes.
  • There is no obvious relationship between pedi function and onset of diabetes.
  • This may suggest that diabetes is not strongly hereditary in this sample, or that the Diabetes Pedigree Function needs refinement.
  • Larger values of plas combined with larger values of age, pedi, mass, insu, skin, pres, and preg tend to show a greater likelihood of testing positive for diabetes.

The cross-plots, histograms, and density plots above suggest that, for most features, the difference between the two group means (grouped by Outcome) is not statistically significant: where the intervals overlap substantially, we cannot conclude that the groups differ. As a rule of thumb, if there is no overlap, or the overlap is below roughly 25%, the difference is significant.

While this visual method of assessing the overlap is easy to perform, regrettably it comes at the cost of reducing our ability to detect differences. This is where ML comes in.
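Before reaching for ML, the visual overlap check can also be replaced by a formal two-sample test. The sketch below applies Welch's t-test to synthetic stand-ins for Glucose grouped by Outcome; the group sizes match the dataset, but the means and spreads are assumed for illustration:

```python
import numpy as np
from scipy import stats

# Synthetic stand-ins for Glucose in the two Outcome groups
# (group sizes from the dataset; means/spreads are illustrative assumptions)
rng = np.random.default_rng(0)
neg = rng.normal(110, 25, size=500)   # Outcome = 0
pos = rng.normal(140, 30, size=268)   # Outcome = 1

# Welch's t-test does not assume equal variances
t, p = stats.ttest_ind(neg, pos, equal_var=False)
print(f"t = {t:.2f}, p = {p:.3g}")
```

A small p-value formally confirms what the overlapping density plots can only hint at.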

It is important to consider all possible limitations of the data, which may include the following:

  • Results may be limited to Pima Indians, but they give us a good start on how to begin diagnosing diabetes in other populations.
  • Results may be limited to the time the data was collected (between 1960s and 1980s). Today’s medical procedures for diagnosing diabetes include a urine test and the hemoglobin A1c test, which shows the average level of blood sugar over the previous 3 months.
  • Dataset is rather small, which may limit performance of some ML algorithms.

Dataset 2: The Diabetes Dataset Available in scikit-learn

from sklearn.datasets import load_diabetes

Diabetes dataset
----------------

Ten baseline variables, age, sex, body mass index, average blood
pressure, and six blood serum measurements were obtained for each of n =
442 diabetes patients, as well as the response of interest, a
quantitative measure of disease progression one year after baseline.

**Data Set Characteristics:**

  :Number of Instances: 442

  :Number of Attributes: First 10 columns are numeric predictive values

  :Target: Column 11 is a quantitative measure of disease progression one year after baseline

  :Attribute Information:
      - age     age in years
      - sex
      - bmi     body mass index
      - bp      average blood pressure
      - s1      tc, total serum cholesterol
      - s2      ldl, low-density lipoproteins
      - s3      hdl, high-density lipoproteins
      - s4      tch, total cholesterol / HDL
      - s5      ltg, possibly log of serum triglycerides level
      - s6      glu, blood sugar level

Note: Each of these 10 feature variables have been mean centered and scaled by the standard deviation times `n_samples` (i.e. the sum of squares of each column totals 1).

Source URL:
https://www4.stat.ncsu.edu/~boos/var.select/diabetes.html

For more information see:
Bradley Efron, Trevor Hastie, Iain Johnstone and Robert Tibshirani (2004) "Least Angle Regression," Annals of Statistics (with discussion), 407-499.
(https://web.stanford.edu/~hastie/Papers/LARS/LeastAngle_2002.pdf)
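The dataset ships with scikit-learn, so the description above can be verified directly:

```python
from sklearn.datasets import load_diabetes

# Load the bundled diabetes regression dataset described above
data = load_diabetes()
X, y = data.data, data.target

print(X.shape)  # (442, 10): 442 patients, 10 baseline variables
print(y.shape)  # (442,): disease progression one year after baseline
print(data.feature_names)
```

Note that this is a regression target (a quantitative progression measure), unlike the binary Outcome of Dataset 1.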

ML/AI T2D Use-Cases

Feature Selection is the process used to select the input variables that are most important to your Machine Learning task.

Dataset 1

RFC classification report:

[[137  25]
 [ 31  61]]
              precision    recall  f1-score   support

           0       0.82      0.85      0.83       162
           1       0.71      0.66      0.69        92

    accuracy                           0.78       254
   macro avg       0.76      0.75      0.76       254
weighted avg       0.78      0.78      0.78       254
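The report above can be reproduced in outline as follows. Since the original train/test split is not given, this sketch trains a RandomForestClassifier on synthetic stand-in data of the same size with a one-third hold-out, so the exact numbers will differ:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the 768-row, 8-feature PIMA table
X, y = make_classification(n_samples=768, n_features=8, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.33,
                                          random_state=42)

clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_tr, y_tr)
y_pred = clf.predict(X_te)

# Confusion matrix and per-class precision/recall/F1, as printed above
print(confusion_matrix(y_te, y_pred))
print(classification_report(y_te, y_pred))
```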

The feature correlation matrix C below is simply a table displaying the correlation coefficients between all possible pairs of variables; each cell contains the Pearson correlation coefficient for one pair.

RFC correlation matrix heatmap
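A correlation matrix like the one behind this heatmap comes straight from pandas; the three-column DataFrame here is an illustrative stand-in for the full feature table:

```python
import pandas as pd

# Illustrative stand-in for the full PIMA feature table
df = pd.DataFrame({
    "Glucose": [148, 85, 183, 89, 137],
    "BMI": [33.6, 26.6, 23.3, 28.1, 43.1],
    "Outcome": [1, 0, 1, 0, 1],
})

# Pearson correlation coefficients for all variable pairs
C = df.corr(method="pearson")
print(C.round(2))
```

With seaborn installed, `sns.heatmap(C, annot=True)` renders the same matrix as the figure above.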

The feature importance plots below provide a score that indicates how useful or valuable each feature was in the training of the specific ML model. This importance is calculated explicitly for each attribute in the dataset, allowing attributes to be ranked and compared to each other.

RFC Feature ranking below: 1 – Glucose, 2 – BMI, 3 – Age, 4 – DPF, 5 – BP.

RFC Feature dominance
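The ranking plotted above comes from the model's feature_importances_ attribute. A sketch on synthetic stand-in data (reproducing the real ranking would require the actual PIMA features):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the eight PIMA predictors
names = ["Pregnancies", "Glucose", "BloodPressure", "SkinThickness",
         "Insulin", "BMI", "DiabetesPedigreeFunction", "Age"]
X, y = make_classification(n_samples=768, n_features=8, random_state=0)
clf = RandomForestClassifier(random_state=0).fit(X, y)

# Importances sum to 1; sort descending to obtain the ranking
ranking = sorted(zip(names, clf.feature_importances_),
                 key=lambda t: t[1], reverse=True)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

The same attribute exists on GradientBoostingClassifier and DecisionTreeClassifier, which is how the GBC and DTC rankings below are obtained.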

GBC Feature ranking below: 1 – Glucose, 2 – BMI, 3 – Age, 4 – DPF, 5 – Pregnancies.

GBC Feature dominance

DTC Feature ranking below: 1 – Glucose, 2 – BMI, 3 – Pregnancies.

DTC Feature dominance

Improved RFC Feature ranking below: 1 – Glucose, 2 – BMI, 3 – DPF, 4 – Age, 5 – BP.

Improved RFC Feature dominance
Accuracy of ML algorithms

We can see that RFC yields the most accurate predictions on our test data. It offers the following feature ranking: 1 – Glucose, 2 – BMI, 3 – DPF/Age. This is consistent with the ranking based on the correlation coefficients (see the matrix C above):

C(Outcome, Glucose)=0.47 > C(Outcome, BMI)=0.29 > C(Outcome, Age)=0.24 ~ C(Outcome, Pregnancies).

In addition, we can drop Age or Pregnancies because C(Age, Pregnancies)=0.54. We can also drop Skin Thickness (ST) because C(ST, Insulin)=0.44 and C(ST, BMI)=0.39. In principle, we can also drop Insulin because C(Insulin, Glucose)=0.33.

Dataset 2

sklearn data feature dominance

Model features:

      - age     age in years
      - sex
      - bmi     body mass index
      - bp      average blood pressure
      - s1      tc, total serum cholesterol
      - s2      ldl, low-density lipoproteins
      - s3      hdl, high-density lipoproteins
      - s4      tch, total cholesterol / HDL
      - s5      ltg, possibly log of serum triglycerides level
      - s6      glu, blood sugar level 

from sklearn.feature_selection import SelectFromModel

Features selected by SelectFromModel: ['s1' 's5']
Done in 0.001s
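The selection above can be reproduced with the approach from the scikit-learn feature-selection example: rank features by the absolute Ridge coefficients and keep everything above the third-largest weight. The RidgeCV estimator and the threshold rule are assumptions based on that example, not stated in the original:

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import RidgeCV

X, y = load_diabetes(return_X_y=True)
feature_names = np.array(load_diabetes().feature_names)

# Rank features by the magnitude of the RidgeCV coefficients
ridge = RidgeCV(alphas=np.logspace(-6, 6, num=5)).fit(X, y)
importance = np.abs(ridge.coef_)

# Threshold just above the third-largest weight, so only two features pass
threshold = np.sort(importance)[-3] + 0.01

sfm = SelectFromModel(ridge, threshold=threshold).fit(X, y)
selected = list(feature_names[sfm.get_support()])
print(f"Features selected by SelectFromModel: {selected}")
```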

Our visual inspection of the feature dominance plot above suggests the following ranking: 1 – s1 (TC) ~ s5 (LTG), 2 – BMI, 3 – s2 (LDL).

Lifestyle & Daily-Life Support

Lifestyle and daily-life support is a fundamental aspect of T2D risk management. ML/AI can promote a research-based, structured lifestyle-change program that has been shown to help prevent and delay the development of T2D.

You cannot change certain risk factors, such as your age, but you can change some lifestyle risks, including:
  • Being overweight with a large BMI
  • Lack of physical activity
  • An unhealthy diet
  • Smoking
  • High blood cholesterol
  • High blood pressure
  • Stress

Lifestyle changes are often advised for people at higher risk of diabetes and those who are newly diagnosed with type 2, to help manage their diabetes.

The recommended lifestyle interventions include:

  • Taking two and a half hours of moderate-intensity physical activity, or one hour and 15 minutes of high-intensity exercise, each week
  • Losing weight gradually to achieve a healthy body mass index
  • Replacing refined carbohydrates with wholegrain foods and increasing intake of vegetables and other foods high in dietary fibre
  • Reducing the amount of saturated fat in the diet

Key Takeaways

  • We have explored the Pima Indian dataset (Dataset 1) and the scikit-learn diabetes dataset (Dataset 2) with many visualizations, feature engineering, and numerical co-rendering.
  • Results support the tight glycemic control strategy that attempts to rigidly control glucose levels (typically an A1C level of 6.5% to 7.0% or lower).
  • This study shows that the more weight you lose, the greater the health benefits. Patients need to follow diabetes diet plans to lose weight.
  • Our DPF feature analysis indicates the following: if you have a family health history of diabetes, you are more likely to have prediabetes and develop diabetes. 
  • Even though the PIMA population is generally young, there is evidence that the likelihood of developing the condition increases drastically after age 45.
  • The second case study, using Dataset 2, supports the strong quasi-linear relationship between “bad” cholesterol and T2D, known as diabetic dyslipidemia.
  • The studies offer support for the management of T2D patients using advanced features, such as computerized alerts.
  • Recommendations have been proposed for improving daily-life support for diabetes therapies.
  • Our findings show the increasing overall importance of ML/AI for T2D risk management.
  • Future work: examine T2D medication adherence thresholds and the risk of hospitalization, to be addressed with the help of the proposed ML/AI algorithms.

Explore More

References
