ML/AI for Diabetes-2 Risk Management, Lifestyle/Daily-Life Support

Over the last decade, the burden of Type-2 Diabetes (T2D) has grown substantially, especially in developing countries. T2D is a common condition in which the level of sugar (glucose) in the blood becomes too high, and it is responsible for considerable morbidity and mortality.

The objective of this project is to summarize recent efforts to use ML/AI techniques to assist in the management of T2D, along with the associated challenges.

Our results indicate that ML methods are being progressively established as suitable for use in clinical daily practice, as well as for the self-management of diabetes. Consequently, these methods provide powerful tools for improving patients’ quality of life.


State-of-the-Art

The potential of AI to enable T2D solutions has been investigated in the context of multiple critical management issues. In this case study, we use the following T2D management categories to summarize the three most recent contributions described in the reviewed articles:

  • Blood glucose control/prediction
  • Risk and patient personalization
  • Lifestyle and daily-life support in T2D management

In recent years, various ML/AI techniques (DNN, SVM, KNN, DT, GBT, GBM, RF, LR, etc.) have been used to predict T2D and its complications. However, researchers and developers still face two main challenges when building T2D predictive models:

  • There is considerable heterogeneity in previous studies regarding the algorithms used, making it challenging to identify the optimal one.
  • There is a lack of transparency about the features used in the optimized models, which reduces their interpretability.

This systematic analysis aims to address both challenges. It primarily follows the earlier review, enriched with the most recent case studies (cf. References and Explore More).

Public-Domain Data Analysis

Conventionally, T2D ML/AI studies use the Kaggle PIMA Indian Diabetes (PID) dataset, originally from the National Institute of Diabetes and Digestive and Kidney Diseases (see the UC Irvine Machine Learning Repository). The objective of the dataset is to diagnostically predict whether or not a patient has diabetes, based on certain diagnostic measurements included in the dataset. Several constraints were placed on the selection of these instances from a larger database. In particular, all patients here are females at least 21 years old of Pima Indian heritage.

Dataset 1: 768 Females of Pima Indian Heritage

The PIMA dataset consists of several medical predictor variables and one target variable, Outcome. Predictor variables include the number of pregnancies the patient has had, their BMI, insulin level, age, and so on:

  • Pregnancies: Number of times pregnant
  • Glucose: Plasma glucose concentration at 2 hours in an oral glucose tolerance test
  • BloodPressure: Diastolic blood pressure (mm Hg)
  • SkinThickness: Triceps skin fold thickness (mm)
  • Insulin: 2-Hour serum insulin (mu U/ml)
  • BMI: Body mass index (weight in kg/(height in m)^2)
  • DiabetesPedigreeFunction: Diabetes pedigree function
  • Age: Age (years)
  • Outcome: Class variable (0 or 1)

An interesting attribute is the Diabetes Pedigree Function (pedi). It provides some data on T2D history in relatives and the genetic relationship of those relatives to the patient (the hereditary risk).

RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Pregnancies               768 non-null    int64  
 1   Glucose                   768 non-null    int64  
 2   BloodPressure             768 non-null    int64  
 3   SkinThickness             768 non-null    int64  
 4   Insulin                   768 non-null    int64  
 5   BMI                       768 non-null    float64
 6   DiabetesPedigreeFunction  768 non-null    float64
 7   Age                       768 non-null    int64  
 8   Outcome                   768 non-null    int64  
dtypes: float64(2), int64(7)
memory usage: 54.1 KB

Let’s count the number of zero values:

Pregnancies                   0
Glucose                       5
BloodPressure                35
SkinThickness               227
Insulin                     374
BMI                          11
DiabetesPedigreeFunction      0
Age                           0
Outcome                       0

Zeros in Glucose, BloodPressure, SkinThickness, Insulin, and BMI are physiologically implausible and encode missing measurements; they are to be replaced by mean/median values per column.
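A minimal pandas sketch of this count-and-impute step. The tiny inline DataFrame is illustrative only; in practice you would load the full PIMA CSV (e.g. with pd.read_csv):

```python
import numpy as np
import pandas as pd

# Illustrative stand-in for the PIMA table (zeros encode missing measurements)
df = pd.DataFrame({
    "Glucose": [148, 85, 0, 89],
    "Insulin": [0, 94, 168, 0],
    "BMI": [33.6, 26.6, 0.0, 28.1],
})

# Count zero values per column, as in the table above
print((df == 0).sum())

# Replace zeros with NaN, then impute with the per-column median
cols = ["Glucose", "Insulin", "BMI"]
df[cols] = df[cols].replace(0, np.nan)
df[cols] = df[cols].fillna(df[cols].median())
```

The median is usually preferred over the mean here because heavily skewed columns such as Insulin would otherwise pull the imputed value upward.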

The ML objective (a binary classification problem) is to predict whether the patient is diabetic (Outcome = 1) or not (Outcome = 0).

In principle, we can rename the columns using the description from the original dataset website:

[“NumTimesPrg”, “PlGlcConc”, “BloodP”, “SkinThick”, “TwoHourSerIns”, “BMI”, “DiPedFunc”, “Age”, “HasDiabetes”]
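Assuming the data sits in a pandas DataFrame with the nine original column names, the rename is a one-liner (the empty frame below is a stand-in for the loaded data):

```python
import pandas as pd

# Empty frame carrying the original PIMA column names (stand-in for the loaded data)
df = pd.DataFrame(columns=[
    "Pregnancies", "Glucose", "BloodPressure", "SkinThickness",
    "Insulin", "BMI", "DiabetesPedigreeFunction", "Age", "Outcome",
])

# Apply the shorter names from the original dataset website
df.columns = ["NumTimesPrg", "PlGlcConc", "BloodP", "SkinThick",
              "TwoHourSerIns", "BMI", "DiPedFunc", "Age", "HasDiabetes"]
print(list(df.columns))
```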

PIMA data statistics: mean, std, min, max, Q1, Q2, and Q3.

Histograms of input data columns:

Histograms of input data columns

The BMI density plot:

BMI density plot

The BMI vs Glucose scatter plot:

The BMI vs Glucose scatter plot

The composite BMI-Glucose-Age scatter plot:

The composite BMI-Glucose-Age scatter plot

It is clear that the PIMA population is generally young: most patients are under 50 years old.

Recall that a boxplot is drawn from the first quartile to the third quartile, with the line inside the box marking the median. The whiskers extend from both ends of the box toward the minimum and maximum values (in most implementations, to the most extreme points within 1.5×IQR of the box; points beyond that are drawn as outliers). A box-and-whisker plot is thus a type of distribution graph: it shows how the data are dispersed around the median, whether the data are skewed, and whether or not they are symmetrical.

Age density plot vs boxplot
BMI density plot vs boxplot
Blood Pressure density plot vs boxplot
Glucose density plot vs boxplot
Insulin density plot vs boxplot
Diabetes Pedigree Function density plot vs boxplot
Pregnancies density plot vs boxplot
Skin Thickness density plot vs boxplot
Boxplots of input data columns

Let’s check the skewness of Insulin, Skin Thickness, and Age. High skewness indicates that the data may not be normally distributed. Outliers (data values that lie far away from the other values) can strongly affect ML results; they are easiest to spot on the boxplots above (cf. Insulin and DPF).
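Pandas can quantify what the plots suggest. A short sketch with illustrative numbers (not the real PIMA values), where a result well above 1 signals the kind of strong right skew seen for Insulin:

```python
import pandas as pd

# Illustrative right-skewed sample: one extreme value drags the mean upward
s = pd.Series([15, 18, 20, 22, 25, 30, 200])

# Sample skewness; values > 1 indicate strong right skew
print(round(s.skew(), 2))
```

For the real data, df.skew() returns the skewness of every numeric column at once.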

We can also group our data by Outcome:

Outcome
0    500
1    268
Pie-chart of Outcome
Bar-chart of Outcome

Upon examining the distribution of class values (see the two plots above), we notice that there are 500 negative instances (65.1%) and 268 positive instances (34.9%).

Glucose vs Outcome histograms
BMI vs Outcome histograms
Glucose vs Outcome density plots
Pregnancies vs Outcome density plots
Pregnancies vs Outcome violin plots
Scatter pair-plot and density plots of input data columns grouped by Outcome

Reviewing histograms of all attributes in the dataset shows us the following:

  • Some of the attributes look normally distributed (plas, pres, skin, and mass).
  • Some of the attributes look like they may have an exponential distribution (preg, insu, pedi, age).
  • Age should probably have a normal distribution; the constraints on the data collection may have skewed it.
  • Testing for normality (normality plot) may be of interest. We could look at fitting the data to a normal distribution.

Reviewing scatter plots of all attributes in the dataset shows that:

  • There is no obvious relationship between age and onset of diabetes.
  • There is no obvious relationship between pedi function and onset of diabetes.
  • This may suggest that diabetes is not strongly hereditary in this sample, or that the Diabetes Pedigree Function needs refinement.
  • Larger values of plas combined with larger values of age, pedi, mass, insu, skin, pres, and preg tend to show a greater likelihood of testing positive for diabetes.

The cross-plots, histograms, and density plots above suggest that, for most features, the difference between the two group means (grouped by Outcome) is not statistically significant: where the intervals overlap substantially, we cannot conclude that the groups differ. As a rule of thumb, if there is no overlap, or the overlap is below roughly 25%, the difference is significant.

While this visual method of assessing the overlap is easy to perform, regrettably it comes at the cost of reducing our ability to detect differences. This is where ML comes in.
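Before reaching for ML, the visual overlap check can also be replaced by a formal two-sample test. The sketch below applies Welch's t-test to synthetic stand-ins for Glucose grouped by Outcome; the group sizes match the dataset, but the means and spreads are assumed for illustration:

```python
import numpy as np
from scipy import stats

# Synthetic stand-ins for Glucose in the two Outcome groups
# (group sizes from the dataset; means/spreads are illustrative assumptions)
rng = np.random.default_rng(0)
neg = rng.normal(110, 25, size=500)   # Outcome = 0
pos = rng.normal(140, 30, size=268)   # Outcome = 1

# Welch's t-test does not assume equal variances
t, p = stats.ttest_ind(neg, pos, equal_var=False)
print(f"t = {t:.2f}, p = {p:.3g}")
```

A small p-value formally confirms what the overlapping density plots can only hint at.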

It is important to consider all possible limitations of the data, which may include the following:

  • Results may be limited to Pima Indians, but they give us a good start on how to begin diagnosing diabetes in other populations.
  • Results may be limited to the time the data was collected (between 1960s and 1980s). Today’s medical procedures for diagnosing diabetes include a urine test and the hemoglobin A1c test, which shows the average level of blood sugar over the previous 3 months.
  • Dataset is rather small, which may limit performance of some ML algorithms.

Dataset 2: The Diabetes Dataset Available in scikit-learn

from sklearn.datasets import load_diabetes

Diabetes dataset
----------------

Ten baseline variables, age, sex, body mass index, average blood
pressure, and six blood serum measurements were obtained for each of n =
442 diabetes patients, as well as the response of interest, a
quantitative measure of disease progression one year after baseline.

**Data Set Characteristics:**

  :Number of Instances: 442

  :Number of Attributes: First 10 columns are numeric predictive values

  :Target: Column 11 is a quantitative measure of disease progression one year after baseline

  :Attribute Information:
      - age     age in years
      - sex
      - bmi     body mass index
      - bp      average blood pressure
      - s1      tc, total serum cholesterol
      - s2      ldl, low-density lipoproteins
      - s3      hdl, high-density lipoproteins
      - s4      tch, total cholesterol / HDL
      - s5      ltg, possibly log of serum triglycerides level
      - s6      glu, blood sugar level

Note: Each of these 10 feature variables have been mean centered and scaled by the standard deviation times `n_samples` (i.e. the sum of squares of each column totals 1).

Source URL:
https://www4.stat.ncsu.edu/~boos/var.select/diabetes.html

For more information see:
Bradley Efron, Trevor Hastie, Iain Johnstone and Robert Tibshirani (2004) "Least Angle Regression," Annals of Statistics (with discussion), 407-499.
(https://web.stanford.edu/~hastie/Papers/LARS/LeastAngle_2002.pdf)
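The dataset ships with scikit-learn, so the description above can be verified directly:

```python
from sklearn.datasets import load_diabetes

# Load the bundled diabetes regression dataset described above
data = load_diabetes()
X, y = data.data, data.target

print(X.shape)  # (442, 10): 442 patients, 10 baseline variables
print(y.shape)  # (442,): disease progression one year after baseline
print(data.feature_names)
```

Note that this is a regression target (a quantitative progression measure), unlike the binary Outcome of Dataset 1.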

ML/AI T2D Use-Cases

Feature Selection is the process used to select the input variables that are most important to your Machine Learning task.

Dataset 1

RFC classification report:

[[137  25]
 [ 31  61]]
              precision    recall  f1-score   support

           0       0.82      0.85      0.83       162
           1       0.71      0.66      0.69        92

    accuracy                           0.78       254
   macro avg       0.76      0.75      0.76       254
weighted avg       0.78      0.78      0.78       254
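The report above can be reproduced in outline as follows. Since the original train/test split is not given, this sketch trains a RandomForestClassifier on synthetic stand-in data of the same size with a one-third hold-out, so the exact numbers will differ:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the 768-row, 8-feature PIMA table
X, y = make_classification(n_samples=768, n_features=8, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.33,
                                          random_state=42)

clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_tr, y_tr)
y_pred = clf.predict(X_te)

# Confusion matrix and per-class precision/recall/F1, as printed above
print(confusion_matrix(y_te, y_pred))
print(classification_report(y_te, y_pred))
```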

The feature correlation matrix C below is simply a table displaying the correlation coefficients between all possible pairs of variables; each cell contains the Pearson correlation coefficient for one pair.

RFC correlation matrix heatmap
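A correlation matrix like the one behind this heatmap comes straight from pandas; the three-column DataFrame here is an illustrative stand-in for the full feature table:

```python
import pandas as pd

# Illustrative stand-in for the full PIMA feature table
df = pd.DataFrame({
    "Glucose": [148, 85, 183, 89, 137],
    "BMI": [33.6, 26.6, 23.3, 28.1, 43.1],
    "Outcome": [1, 0, 1, 0, 1],
})

# Pearson correlation coefficients for all variable pairs
C = df.corr(method="pearson")
print(C.round(2))
```

With seaborn installed, `sns.heatmap(C, annot=True)` renders the same matrix as the figure above.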

The feature importance plots below provide a score that indicates how useful or valuable each feature was in the training of the specific ML model. This importance is calculated explicitly for each attribute in the dataset, allowing attributes to be ranked and compared to each other.

RFC Feature ranking below: 1 – Glucose, 2 – BMI, 3 – Age, 4 – DPF, 5 – BP.

RFC Feature dominance
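The ranking plotted above comes from the model's feature_importances_ attribute. A sketch on synthetic stand-in data (reproducing the real ranking would require the actual PIMA features):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the eight PIMA predictors
names = ["Pregnancies", "Glucose", "BloodPressure", "SkinThickness",
         "Insulin", "BMI", "DiabetesPedigreeFunction", "Age"]
X, y = make_classification(n_samples=768, n_features=8, random_state=0)
clf = RandomForestClassifier(random_state=0).fit(X, y)

# Importances sum to 1; sort descending to obtain the ranking
ranking = sorted(zip(names, clf.feature_importances_),
                 key=lambda t: t[1], reverse=True)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

The same attribute exists on GradientBoostingClassifier and DecisionTreeClassifier, which is how the GBC and DTC rankings below are obtained.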

GBC Feature ranking below: 1 – Glucose, 2 – BMI, 3 – Age, 4 – DPF, 5 – Pregnancies.

GBC Feature dominance

DTC Feature ranking below: 1 – Glucose, 2 – BMI, 3 – Pregnancies.

DTC Feature dominance

Improved RFC Feature ranking below: 1 – Glucose, 2 – BMI, 3 – DPF, 4 – Age, 5 – BP.

Improved RFC Feature dominance
Accuracy of ML algorithms

We can see that RFC yields the most accurate predictions on our test data. It offers the following feature ranking: 1 – Glucose, 2 – BMI, 3 – DPF/Age. This is consistent with the ranking based on the correlation coefficients (see the matrix C above):

C(Outcome, Glucose)=0.47 > C(Outcome, BMI)=0.29 > C(Outcome, Age)=0.24 ~ C(Outcome, Pregnancies).

In addition, we can drop Age or Pregnancies because C(Age, Pregnancies)=0.54. We can also drop Skin Thickness (ST) because C(ST, Insulin)=0.44 and C(ST, BMI)=0.39. In principle, we can also drop Insulin because C(Insulin, Glucose)=0.33.

Dataset 2

sklearn data feature dominance

Model features:

      - age     age in years
      - sex
      - bmi     body mass index
      - bp      average blood pressure
      - s1      tc, total serum cholesterol
      - s2      ldl, low-density lipoproteins
      - s3      hdl, high-density lipoproteins
      - s4      tch, total cholesterol / HDL
      - s5      ltg, possibly log of serum triglycerides level
      - s6      glu, blood sugar level 

from sklearn.feature_selection import SelectFromModel

Features selected by SelectFromModel: ['s1' 's5']
Done in 0.001s
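The selection above can be reproduced with the approach from the scikit-learn feature-selection example: rank features by the absolute Ridge coefficients and keep everything above the third-largest weight. The RidgeCV estimator and the threshold rule are assumptions based on that example, not stated in the original:

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import RidgeCV

X, y = load_diabetes(return_X_y=True)
feature_names = np.array(load_diabetes().feature_names)

# Rank features by the magnitude of the RidgeCV coefficients
ridge = RidgeCV(alphas=np.logspace(-6, 6, num=5)).fit(X, y)
importance = np.abs(ridge.coef_)

# Threshold just above the third-largest weight, so only two features pass
threshold = np.sort(importance)[-3] + 0.01

sfm = SelectFromModel(ridge, threshold=threshold).fit(X, y)
selected = list(feature_names[sfm.get_support()])
print(f"Features selected by SelectFromModel: {selected}")
```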

Our visual inspection of the feature dominance plot above suggests the following ranking: 1 – s1 (TC) ~ s5 (LTG), 2 – BMI, 3 – s2 (LDL).

Lifestyle & Daily-Life Support

Lifestyle and daily-life support is a fundamental aspect of T2D risk management. ML/AI can promote a research-based, structured lifestyle-change program that has been shown to help prevent and delay the development of T2D.

You cannot change certain risk factors, such as your age, but you can change some lifestyle risks, including:
  • Being overweight with a large BMI
  • Lack of physical activity
  • An unhealthy diet
  • Smoking
  • High blood cholesterol
  • High blood pressure
  • Stress

Lifestyle changes are often advised for people at higher risk of diabetes and those who are newly diagnosed with type 2, to help manage their diabetes.

The recommended lifestyle interventions include:

  • Taking two and a half hours of moderate-intensity physical activity, or one hour and 15 minutes of high-intensity exercise, each week
  • Losing weight gradually to achieve a healthy body mass index
  • Replacing refined carbohydrates with wholegrain foods and increasing intake of vegetables and other foods high in dietary fibre
  • Reducing the amount of saturated fat in the diet

Key Takeaways

  • We have explored the Pima Indian dataset (Dataset 1) and the scikit-learn diabetes dataset (Dataset 2) with many visualizations, feature engineering, and numerical co-rendering.
  • Results support the tight glycemic control strategy that attempts to rigidly control glucose levels (typically an A1C level of 6.5% to 7.0% or lower).
  • This study shows that the more weight you lose, the greater the health benefits. Patients need to follow diabetes diet plans to lose weight.
  • Our DPF feature analysis indicates the following: if you have a family health history of diabetes, you are more likely to have prediabetes and develop diabetes. 
  • Even though the PIMA population is generally young, there is evidence that the likelihood of developing the condition increases drastically after age 45.
  • The second case study, using Dataset 2, supports the strong quasi-linear relationship between “bad” cholesterol and T2D, known as diabetic dyslipidemia.
  • The studies offer support for the management of T2D patients using advanced features, such as computerized alerts.
  • Recommendations have been proposed for improving daily-life support for diabetes therapies.
  • Our findings show the increasing overall importance of ML/AI for T2D risk management.
  • Future work: examine T2D medication adherence thresholds and the risk of hospitalization, to be addressed with the help of the proposed ML/AI algorithms.

Explore More

References
