# HPO-Tuned ML Diabetes-2 Prediction

This blog was inspired by the recent tests with Machine Learning (ML) hyper-parameter optimization (HPO) and cross-validation scores for enhanced prediction of Type-2 diabetes (T2D).

The basic workflow includes RandomizedSearchCV HPO and accuracy metrics:

• Import libraries and data
• Data Pre-Processing, Cleaning, Scaling, Resampling and Transformations
• Exploratory Data Analysis (EDA)
• Feature Engineering (FE)
• Model Building – GaussianNB
• RandomForestClassifier with RandomizedSearchCV
• DecisionTreeClassifier
• Compute confusion matrix, accuracy score

The advanced workflow includes EDA, FE, HPO, and complete X-validation:

• Import libraries and data
• Feature Engineering and Exploratory Data Analysis
• Replace missing values by median values
• Creating metrics for evaluations
• Model building with GBC and HPO with GridSearchCV
• X-validation and full classification report
• Export the best model with pickle

Both workflows compare several ML algorithms, SMOTE oversampling, scaling and various performance metrics.

## Basic Workflow

Let’s set the working directory YOURPATH

import os
os.chdir(‘YOURPATH’)
os. getcwd()

Importing essential libraries
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns

and importing the input dataset

We rename DiabetesPedigreeFunction as DPF
df = df.rename(columns={‘DiabetesPedigreeFunction’:’DPF’})

while replacing the 0 values from [‘Glucose’,’BloodPressure’,’SkinThickness’,’Insulin’,’BMI’] by NaN
df_copy = df.copy(deep=True)
df_copy[[‘Glucose’,’BloodPressure’,’SkinThickness’,’Insulin’,’BMI’]] = df_copy[[‘Glucose’,’BloodPressure’,’SkinThickness’,’Insulin’,’BMI’]].replace(0,np.nan)

Print No. of rows and columns
print(f’Total Rows {df_copy.shape[0]}’)
print(f’Total Rows {df_copy.shape[1]}’)

```Total Rows 768
Total Rows 9```

We collect all columns having missing values and at what percentage
columns_with_na=[features for features in df_copy.columns if df_copy[features].isnull().sum()>0]

for feature in columns_with_na:
print(feature, np.round(df_copy[feature].isnull().mean(), 4), ‘ % missing values’)

```Glucose 0.0065  % missing values
BloodPressure 0.0456  % missing values
SkinThickness 0.2956  % missing values
Insulin 0.487  % missing values
BMI 0.0143  % missing values```

Let’s plot the counts of missing values per Outcome

for feature in columns_with_na:
data = df_copy.copy()
data[feature] = np.where(data[feature].isnull(), 1, 0)
data.groupby(feature)[‘Outcome’].value_counts().plot.bar()
plt.title(feature)
plt.show()

It is clear that Insulin and Skin Thickness have the largest number of missing values for both values of Outcome.

Get all the columns having numerical values
num_data = [features for features in df_copy.columns if df_copy[features].dtypes != ‘O’]
print(f’Number of numerical columns is {len(num_data)}’)

`Number of numerical columns is 9`

Check if any numerical columns are discrete
discrete_columns = [feature for feature in df_copy.columns if len(df_copy[feature].unique()) < 20 and feature not in [‘Outcome’] ]
print(f’Number of discrete Columns is {len(discrete_columns)}’)
for i in discrete_columns:
print(f'{i} has {len(df_copy[i].unique())} discrete values’)

```Number of discrete Columns is 1
Pregnancies has 17 discrete values```

Check distribution of the discrete data
for feature in discrete_columns:
dt=df_copy.copy()
dt.groupby(feature)[‘Outcome’].value_counts().plot.bar()
plt.xlabel(feature)
plt.ylabel(‘Outcome’)
plt.title(feature)
plt.show()

Check if any numerical columns are continuous
continuous_columns = [feature for feature in num_data if feature not in discrete_columns and feature not in [‘Outcome’] ]
print(f’Number of Continous Columns is {len(continuous_columns)}’)

`Number of Continuous Columns is 7`

Check distribution of the Continuous data to see their outliers
for feature in continuous_columns:
dt = df_copy.copy()
dt[feature].hist(bins=25)
plt.xlabel(feature)
plt.ylabel(“Count”)
plt.title(feature)
plt.show()

Collect all columns having missing values and at what percentage
columns_with_na=[features for features in df_copy.columns if df_copy[features].isnull().sum()>0]

for feature in columns_with_na:
print(feature, np.round(df_copy[feature].isnull().mean(), 4), ‘ % missing values’)

```Glucose 0.0065  % missing values
BloodPressure 0.0456  % missing values
SkinThickness 0.2956  % missing values
Insulin 0.487  % missing values
BMI 0.0143  % missing values```

Column editing: replace NaN values with median of that column and make new column as df_copy[feature_nan]. Give it value 1 if NaN is present else 0
for feature in columns_with_na:
median_value = df_copy[feature].mean()
df_copy[feature+’nan’] = np.where(df_copy[feature].isnull(), 1, 0)
df_copy[feature].fillna(median_value, inplace = True)

df_copy[columns_with_na].isnull().sum()

```Glucose          0
BloodPressure    0
SkinThickness    0
Insulin          0
BMI              0
dtype: int64```

Remove any duplicate entry
final_df =df_copy.loc[:,~df_copy.columns.duplicated()]

final_df.shape

`(768, 14)`

Let’s import StandardScaler and SMOTE

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
from imblearn import over_sampling
from imblearn.over_sampling import SMOTE
oversample = SMOTE()

Model Building:
from sklearn.model_selection import train_test_split
X0 = final_df.drop(columns=’Outcome’)

let’s transform our data
X = X0
y = final_df[‘Outcome’]
X, y = oversample.fit_resample(X, y)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=0)

Naive Bayes:
from sklearn.naive_bayes import GaussianNB
classifier = GaussianNB(var_smoothing=1e-06)

classifier.fit(X_train, y_train)

```GaussianNB(var_smoothing=1e-06)
```

y_pred = classifier.predict(X_test)
y_pred

Let’s call confusion_matrix and accuracy_score

from sklearn.metrics import confusion_matrix, accuracy_score
cm = confusion_matrix(y_test, y_pred)
print(cm)
print(accuracy_score(y_test, y_pred))

```[[77 28]
[18 77]]
0.77```

Random Forest Model + HPO:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

clf = RandomForestClassifier()

n_estimators = [10, 20, 30, 50, 100]
max_depth = [2, 3, 5, 7, 10]

##### Define the grid of hyperparameters to search

hyperparameter_grid = {
‘n_estimators’: n_estimators,
‘max_depth’:max_depth,
}
random_cv = RandomizedSearchCV(estimator=clf,param_distributions=hyperparameter_grid,
cv=5, n_iter=5,scoring = ‘neg_mean_absolute_error’,n_jobs = 4,verbose = 5,
return_train_score = True,random_state=42)
random_cv.fit(X_train,y_train)
random_cv.best_estimator_

```Fitting 5 folds for each of 5 candidates, totalling 25 fits
```

Out[93]:

`RandomForestClassifier(max_depth=10, n_estimators=50)`

clf = RandomForestClassifier(max_depth=5, n_estimators=50)
clf.fit(X_train, y_train)

`RandomForestClassifier(max_depth=5, n_estimators=50)`

y_pred = clf.predict(X_test)
y_pred

Let’s check confusion_matrix and accuracy_score

from sklearn.metrics import confusion_matrix, accuracy_score
cm = confusion_matrix(y_test, y_pred)
print(cm)
print(accuracy_score(y_test, y_pred))

```[[80 25]
[ 8 87]]
0.835```

We can see that RandomForestClassifier+RandomizedSearchCV is more accurate than GaussianNB.

Decision Tree + HPO:

from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV
classifier = DecisionTreeClassifier()

from scipy.stats import randint
from sklearn.model_selection import RandomizedSearchCV

parameters = {“max_depth”:[2,3, None],
“max_features”:randint(1,9),
“min_samples_leaf”:randint(1,8),
“criterion”: [“gini”, “entropy”]
}

tree_cv = RandomizedSearchCV(classifier, parameters, cv = 5)
tree_cv.fit(X_train, y_train)
tree_cv.best_estimator_

```DecisionTreeClassifier(max_features=7, min_samples_leaf=3)
```

classifier = DecisionTreeClassifier(max_depth=15, max_features=8, min_samples_leaf=4)
classifier.fit(X_train, y_train)

`DecisionTreeClassifier(max_depth=15, max_features=8, min_samples_leaf=4)`

y_pred = classifier.predict(X_test)
y_pred

Let’s check confusion_matrix and accuracy_score

```[[85 20]
[20 75]]
0.8```

Creating a pickle file for the clf model
import pickle
filename = ‘diabetes-model.pkl’
pickle.dump(clf, open(filename, ‘wb’))

## In-Depth EDA+RF Analysis

Importing libraries and data

import numpy as np
import pandas as pd
import pickle
import seaborn as sns

and checking the content

Data Visualization + EDA:

Histograms

df.plot.hist(subplots=True, layout=(4,4), figsize=(20, 20), bins=20)

fig=sns.pairplot(
df,
x_vars=[‘Pregnancies’,’Glucose’,’BloodPressure’,’SkinThickness’,’Insulin’,’BMI’,’DiabetesPedigreeFunction’],
y_vars=[‘Age’],
)
fig.savefig(“hcpairplot.png”)

df.boxplot(column=[‘Pregnancies’,’Glucose’,’BloodPressure’,’SkinThickness’,’Insulin’,’BMI’,’DiabetesPedigreeFunction’,’Age’], figsize=(30,20), fontsize=24)
plt.savefig(‘hcboxplot.png’)

We can see that Insulin has the largest number of outliers.

Renaming DiabetesPedigreeFunction as DPF
df = df.rename(columns={‘DiabetesPedigreeFunction’:’DPF’})

Replacing the 0 values from [‘Glucose’,’BloodPressure’,’SkinThickness’,’Insulin’,’BMI’] by NaN
df_copy = df.copy(deep=True)
df_copy[[‘Glucose’,’BloodPressure’,’SkinThickness’,’Insulin’,’BMI’]] = df_copy[[‘Glucose’,’BloodPressure’,’SkinThickness’,’Insulin’,’BMI’]].replace(0,np.NaN)

Replacing NaN value by mean, median depending upon distribution
df_copy[‘Glucose’].fillna(df_copy[‘Glucose’].mean(), inplace=True)
df_copy[‘BloodPressure’].fillna(df_copy[‘BloodPressure’].mean(), inplace=True)
df_copy[‘SkinThickness’].fillna(df_copy[‘SkinThickness’].median(), inplace=True)
df_copy[‘Insulin’].fillna(df_copy[‘Insulin’].median(), inplace=True)
df_copy[‘BMI’].fillna(df_copy[‘BMI’].median(), inplace=True)

Random Forest Classifier Model:
from sklearn.model_selection import train_test_split
X = df.drop(columns=’Outcome’)
y = df[‘Outcome’]
X1=scaler.fit_transform(X)
X, y = oversample.fit_resample(X1, y)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=0)

from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier(n_estimators = 50,random_state=0)
classifier.fit(X_train, y_train)

import pickle
filename = ‘RFmodel.pkl’
with open(filename, ‘wb’) as file:
pickle.dump(classifier, file)

Predicting the Test set results
y_pred = classifier.predict(X_test)

df2 = pd.DataFrame({‘Actual’:y_test,’Predicted’:y_pred})
df2

df2.plot.hist(alpha=0.5)

Let’s call confusion_matrix and accuracy_score

from sklearn.metrics import confusion_matrix, accuracy_score
print(“Confusion Matrix:”, confusion_matrix(y_test, y_pred))
print(“Accuracy:”,accuracy_score(y_test,y_pred))

```Confusion Matrix: [[86 19]
[ 9 86]]
Accuracy: 0.86```

Let’s check other metrics

from sklearn import metrics
print(‘Root Mean Squared Error (RMSE):’, np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
print(‘Mean Absolute Error (MAE):’, metrics.mean_absolute_error(y_test, y_pred))
print(‘Mean Squared Error (MSE):’, metrics.mean_squared_error(y_test, y_pred))

```Root Mean Squared Error (RMSE): 0.37416573867739417
Mean Absolute Error (MAE): 0.14
Mean Squared Error (MSE): 0.14```

Importing libraries and data as 768 rows × 9 columns

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import pickle
%matplotlib inline

from sklearn.model_selection import train_test_split,cross_validate,GridSearchCV

data.info()

```<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
#   Column                    Non-Null Count  Dtype
---  ------                    --------------  -----
0   Pregnancies               768 non-null    int64
1   Glucose                   768 non-null    int64
2   BloodPressure             768 non-null    int64
3   SkinThickness             768 non-null    int64
4   Insulin                   768 non-null    int64
5   BMI                       768 non-null    float64
6   DiabetesPedigreeFunction  768 non-null    float64
7   Age                       768 non-null    int64
8   Outcome                   768 non-null    int64
dtypes: float64(2), int64(7)
memory usage: 54.1 KB```

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
from imblearn import over_sampling
from imblearn.over_sampling import SMOTE
oversample = SMOTE()

Let’s look at the pairplot

fig=sns.pairplot(data, hue=’Outcome’,vars = data.columns[1:-1])
fig.savefig(“hcpxcorrpairplot.png”)

Let’s look at the correlation matrix C(*,*)

We can see that C(Glucose, Outcome)=0.47 (highest correlation), while C(BP, Outcome)=0.065 (lowest correlation). At the same time, C(Age, Pregnancies)=0.54, so we can drop one of these features such as Pregnancies.

We can check Outcome

sns.countplot(data[‘Outcome’])

and correlation values consistent with the above matrix C plot

Check missing values

print(“total number of rows : {0}”.format(len(data)))
print(“number of rows missing Glucose: {0}”.format(len(data.loc[data[‘Glucose’] == 0])))
print(“number of rows missing diastolic_bp: {0}”.format(len(data.loc[data[‘BloodPressure’] == 0])))
print(“number of rows missing insulin: {0}”.format(len(data.loc[data[‘Insulin’] == 0])))
print(“number of rows missing bmi: {0}”.format(len(data.loc[data[‘BMI’] == 0])))
print(“number of rows missing diab_pred: {0}”.format(len(data.loc[data[‘DiabetesPedigreeFunction’] == 0])))
print(“number of rows missing age: {0}”.format(len(data.loc[data[‘Age’] == 0])))
print(“number of rows missing skin: {0}”.format(len(data.loc[data[‘SkinThickness’] == 0])))

```total number of rows : 768
number of rows missing Glucose: 5
number of rows missing diastolic_bp: 35
number of rows missing insulin: 374
number of rows missing bmi: 11
number of rows missing diab_pred: 0
number of rows missing age: 0
number of rows missing skin: 227```

Let’s replace missing values

df = data.copy()

for column in [‘Glucose’,’SkinThickness’,’Insulin’]:
median_0 = data[column][data[‘Outcome’]==0].median()
median_1 = data[column][data[‘Outcome’]==1].median()

```df[column][df['Outcome']==0] = data[column][df['Outcome']==0].fillna(median_0)
df[column][df['Outcome']==1] = data[column][df['Outcome']==1].fillna(median_1)
```

df.BloodPressure.fillna(df.BloodPressure.median(),inplace=True)
df.BMI.fillna(df.BMI.median(),inplace=True)

Let’s create X (features) and y (target) objects

X = df.drop(‘Outcome’,axis=1)
X

y = df.Outcome
y

Let’s apply our data transformations and splitting

from sklearn.model_selection import train_test_split with test_size=0.20

X1=scaler.fit_transform(X)
X, y = oversample.fit_resample(X1, y)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=0)

Creating metrics for evaluations
from sklearn import metrics
f1 = metrics.make_scorer(metrics.f1_score)
accuracy = metrics.make_scorer(metrics.accuracy_score)
precision = metrics.make_scorer(metrics.precision_score)
recall = metrics.make_scorer(metrics.recall_score)
auc = metrics.make_scorer(metrics.roc_auc_score)
scoring = {
“accuracy”:accuracy,
“precision”:precision,
“recall”: recall,
“f1”:f1,
}
def printResults(cv):
print(“Accuracy {:.3f} ({:.3f})”.format(cv[“test_accuracy”].mean(), cv[“test_accuracy”].std()))
print(“Precision {:.3f} ({:.3f})”.format(cv[“test_precision”].mean(), cv[“test_precision”].std()))
print(“Recall {:.3f} ({:.3f})”.format(cv[“test_recall”].mean(), cv[“test_recall”].std()))
print(“F1 {:.3f} ({:.3f})”.format(cv[“test_f1”].mean(), cv[“test_f1”].std()))

Creating our ML Model
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

gbc.fit(X_train,y_train)

`GradientBoostingClassifier()`

Making test predictions

y_pred = gbc.predict(X_test)

Let’s check results

print(metrics.classification_report(y_test, y_pred))

```precision    recall  f1-score   support

0       0.76      0.71      0.74        96
1       0.75      0.80      0.77       104

accuracy                           0.76       200
macro avg       0.76      0.75      0.75       200
weighted avg       0.76      0.76      0.75       200```

cm = metrics.confusion_matrix(y_test, y_pred,normalize=’all’)
sns.heatmap(cm, annot=True, cmap=”Blues”);

cv_gbc = cross_validate(gbc, X, y, scoring=scoring, cv=5)

printResults(cv_gbc)

```Accuracy  0.787 (0.069)
Precision 0.779 (0.063)
Recall    0.800 (0.089)
F1        0.789 (0.072)```

Tuning our ML Model
params = {
‘loss’: [‘deviance’,’exponential’],
‘learning_rate’: [0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1.0],
‘n_estimators’: [100,200,300,400,500,600,700,800,800,1000],
}
gs = GridSearchCV(estimator = gbc,param_grid=params,cv=5)
gs.fit(X,y)

```GridSearchCV(cv=5, estimator=GradientBoostingClassifier(),
param_grid={'learning_rate': [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7,
0.8, 0.9, 1.0],
'loss': ['deviance', 'exponential'],
'n_estimators': [100, 200, 300, 400, 500, 600, 700,
800, 800, 1000]})```

gs.best_score_

`0.806`

gs.best_params_

`{'learning_rate': 0.7, 'loss': 'exponential', 'n_estimators': 100}`

gbc_best.fit(X_train,y_train)

`GradientBoostingClassifier(learning_rate=0.2, n_estimators=200)`

Let’s make predictions

y_pred = gbc_best.predict(X_test)

print(metrics.classification_report(y_test, y_pred))

```precision    recall  f1-score   support

0       0.77      0.77      0.77        96
1       0.79      0.79      0.79       104

accuracy                           0.78       200
macro avg       0.78      0.78      0.78       200
weighted avg       0.78      0.78      0.78       200
```

cm = metrics.confusion_matrix(y_test, y_pred,normalize=’all’)
sns.heatmap(cm, annot=True, cmap=”Blues”);

cv_gbc_best = cross_validate(gbc_best, X, y, cv=5, scoring=scoring)

printResults(cv_gbc_best)

```Accuracy  0.788 (0.056)
Precision 0.783 (0.030)
Recall    0.794 (0.121)
F1        0.785 (0.069)```

Let’s invoke pickle

import pickle

filename = ‘diabetes-model.pkl’
pickle.dump(gbc_best,open(filename,’wb’))

`GradientBoostingClassifier(n_estimators=400)`

print(metrics.classification_report(y_test, y_pred))

``` precision    recall  f1-score   support

0       0.75      0.72      0.73        96
1       0.75      0.78      0.76       104

accuracy                           0.75       200
macro avg       0.75      0.75      0.75       200
weighted avg       0.75      0.75      0.75       200```

cm = metrics.confusion_matrix(y_test, y_pred,normalize=’all’)
sns.heatmap(cm, annot=True, cmap=”Blues”);

## Summary

• The objective of the project is to diagnostically predict whether or not a patient has diabetes, based on certain diagnostic measurements included in the input dataset from Kaggle.
• Data editing: we replaced missing values with mean/median feature values and removed outliers.
• We applied StandardScaler and SMOTE data transformations before splitting.
• RandomForestClassifier+RandomizedSearchCV (accuracy ~86%) is more accurate than GaussianNB and DecisionTreeClassifier

Accuracy 0.788 (0.056)

Precision 0.783 (0.030)

Recall 0.794 (0.121)

F1 0.785 (0.069)

• The model created as a result of HPO became the model with the best Cross Validation Score value, as shown above.

## Explore More

HealthTech ML/AI Q3 ’22 Round-Up

The Application of ML/AI in Diabetes

Diabetes Prediction using ML/AI in Python

HealthTech ML/AI Use-Cases

One-Time
Monthly
Yearly

#### Make a yearly donation

Choose an amount

\$5.00
\$15.00
\$100.00
\$5.00
\$15.00
\$100.00
\$5.00
\$15.00
\$100.00

Or enter a custom amount

\$