Featured Photo by Karolina Grabowska on Pexels.
- Cardiovascular disease (CVD) is the leading cause of mortality and morbidity worldwide. With growing pressure to improve care and to translate the latest medical advances into actionable plans, clinical decision-making is increasingly challenging for cardiologists.
- The scope of this project falls within AI-driven cardiovascular medicine. Specifically, it focuses on the early diagnosis of heart disease using Artificial Neural Networks (ANN).
- The ANN method aims to improve the detection of heart issues, which could lead to better outcomes since early diagnosis is critical.
- This project uses a dataset of 303 patients distributed by the UCI Machine Learning Repository.
Data Preparation
Let’s set the working directory to HEART23
import os
os.chdir('HEART23')
os.getcwd()
and import the libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
from scipy.stats import skew
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, roc_curve, roc_auc_score, precision_score, recall_score
import scikitplot as skplt
import tensorflow as tf
Let’s read the input data
data = pd.read_csv('heartcvd.csv')
data.head()

where an output of 0 means that the person has a low chance of CVD, while 1 suggests that the person has a high chance of CVD.
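For reference, here is a short glossary of the column names; the descriptions follow the UCI dataset documentation and the axis labels used later in this post.
# Feature glossary for the UCI heart-disease columns
feature_descriptions = {
    'age':      'Age of the patient',
    'sex':      'Gender of the patient',
    'cp':       'Chest pain type',
    'trtbps':   'Resting blood pressure (in mm Hg)',
    'chol':     'Cholesterol in mg/dl',
    'fbs':      'Fasting blood sugar > 120 mg/dl',
    'restecg':  'Resting electrocardiographic results',
    'thalachh': 'Maximum heart rate achieved',
    'exng':     'Exercise induced angina',
    'oldpeak':  'ST depression induced by exercise relative to rest',
    'slp':      'Slope of the peak exercise ST segment',
    'caa':      'Number of major vessels colored by fluoroscopy',
    'thall':    'Thalium stress test result',
    'output':   'Target: 1 = high chance of CVD, 0 = low chance',
}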
Let’s check the data structure:
print(f'Number of rows: {data.shape[0]}')
print(f'Number of columns: {data.shape[1]}')
Number of rows: 303
Number of columns: 14
We have 5 numerical features and 8 categorical features in our dataset
data.nunique()
age          41
sex           2
cp            4
trtbps       49
chol        152
fbs           2
restecg       3
thalachh     91
exng          2
oldpeak      40
slp           3
caa           5
thall         4
output        2
dtype: int64
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 303 entries, 0 to 302
Data columns (total 14 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   age       303 non-null    int64
 1   sex       303 non-null    int64
 2   cp        303 non-null    int64
 3   trtbps    303 non-null    int64
 4   chol      303 non-null    int64
 5   fbs       303 non-null    int64
 6   restecg   303 non-null    int64
 7   thalachh  303 non-null    int64
 8   exng      303 non-null    int64
 9   oldpeak   303 non-null    float64
 10  slp       303 non-null    int64
 11  caa       303 non-null    int64
 12  thall     303 non-null    int64
 13  output    303 non-null    int64
dtypes: float64(1), int64(13)
memory usage: 33.3 KB
data.describe().T

All of our categorical columns have an integer datatype. I will convert them into an object datatype
cat_columns = ['sex', 'cp', 'fbs', 'restecg', 'exng', 'slp', 'caa', 'thall', 'output']
num_columns = ['age', 'trtbps', 'oldpeak', 'chol', 'thalachh']
data[cat_columns] = data[cat_columns].astype(str)
Let’s check whether there are any missing values in our dataset
data.isnull().sum()
age         0
sex         0
cp          0
trtbps      0
chol        0
fbs         0
restecg     0
thalachh    0
exng        0
oldpeak     0
slp         0
caa         0
thall       0
output      0
dtype: int64
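No values are missing. Had there been any, a simple imputation step could have been inserted here; a hypothetical sketch using scikit-learn's SimpleImputer (not needed for this dataset):
from sklearn.impute import SimpleImputer

# Hypothetical: fill missing numerical values with the median of each column
imputer = SimpleImputer(strategy='median')
data[num_columns] = imputer.fit_transform(data[num_columns])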
Exploratory Data Analysis (EDA)
Let's compare distributions of various features based on target variable
sns.set_context('notebook', font_scale=1.2)
fig, ax = plt.subplots(2, 2, figsize=(20, 13))
plt.suptitle('Distribution of various features based on target variable', fontsize=20)
ax1 = sns.histplot(x='age', data=data, hue='output', kde=True, ax=ax[0, 0], palette='winter')
ax1.set(xlabel='Age', title='Distribution of age based on target variable')
ax2 = sns.histplot(x='trtbps', data=data, hue='output', kde=True, ax=ax[0, 1], palette='plasma')
ax2.set(xlabel='Resting blood pressure (in mm Hg)', title='Distribution of BP based on target variable')
ax3 = sns.histplot(x='chol', data=data, hue='output', kde=True, ax=ax[1, 0], palette='winter')
ax3.set(xlabel='Cholesterol in mg/dl', title='Distribution of Cholesterol based on target variable')
ax4 = sns.histplot(x='thalachh', data=data, hue='output', kde=True, ax=ax[1, 1], palette='plasma')
ax4.set(xlabel='Max Heart Rate Achieved', title='Distribution of maximum heart rate achieved based on target variable')
plt.show()

We can see a pattern in the distribution of maximum heart rate achieved. Those who have reached a higher maximum heart rate are more likely to have CVD.
Let’s compare boxplots of various features based on target variable
sns.set_context('notebook', font_scale=1.2)
fig, ax = plt.subplots(2, 2, figsize=(20, 10))
plt.suptitle('Boxplot of various features based on target variable', fontsize=20)
ax1 = sns.boxplot(x='age', data=data, ax=ax[0, 0], color='#40bf80')
ax1.set(xlabel='Age')
ax2 = sns.boxplot(x='trtbps', data=data, ax=ax[0, 1], color='#40bf80')
ax2.set(xlabel='Resting blood pressure (in mm Hg)')
ax3 = sns.boxplot(x='chol', data=data, hue='output', ax=ax[1, 0], color='#40bf80')
ax3.set(xlabel='Cholesterol in mg/dl')
ax4 = sns.boxplot(x='thalachh', data=data, ax=ax[1, 1], color='#40bf80')
ax4.set(xlabel='Max Heart Rate Achieved')
plt.savefig('cvd1boxplotstarget.png')

There are some outliers in the Blood Pressure and Cholesterol columns.
Let’s look at the feature correlations using the sns heatmap
plt.figure(figsize=(16, 8))
# Note: on pandas >= 2.0, pass numeric_only=True, since non-numeric columns are no longer dropped automatically
sns.heatmap(data.corr(), annot=True, cmap='YlGnBu', fmt='.2f');

There is no strong correlation between our numerical features.
Let’s compare count plots for various categorical features
sns.set_context('notebook', font_scale=1)
fig, ax = plt.subplots(3, 3, figsize=(20, 15))
ax[2, 2].axis('off')
plt.suptitle('Count plot for various categorical features', fontsize=20)
ax1 = sns.countplot(x='sex', data=data, ax=ax[0, 0], hue='output', palette='viridis')
ax1.set(xlabel='Gender of the patient')
ax2 = sns.countplot(x='exng', data=data, hue='output', ax=ax[0, 1], palette='viridis')
ax2.set(xlabel='Exercise induced angina')
ax3 = sns.countplot(x='caa', data=data, ax=ax[0, 2], hue='output', palette='viridis')
ax3.set(xlabel='Number of major vessels')
ax4 = sns.countplot(x='cp', data=data, hue='output', ax=ax[1, 0], palette='viridis')
ax4.set(xlabel='Chest pain type')
ax5 = sns.countplot(x='fbs', data=data, hue='output', ax=ax[1, 1], palette='viridis')
ax5.set(xlabel='Fasting blood sugar')
ax6 = sns.countplot(x='restecg', data=data, ax=ax[1, 2], hue='output', palette='viridis')
ax6.set(xlabel='Resting electrocardiographic results')
ax7 = sns.countplot(x='slp', data=data, ax=ax[2, 0], hue='output', palette='viridis')
ax7.set(xlabel='Slope')
ax8 = sns.countplot(x='thall', data=data, ax=ax[2, 1], hue='output', palette='viridis')
ax8.set(xlabel='Thalium Stress Test result')
plt.show()

data['output'].value_counts()
1    165
0    138
Name: output, dtype: int64
plt.figure(figsize=(6, 8))
data_pie = [165, 138]
labels = ['High Chances', 'Low Chances']
explode = [0.1, 0]
plt.pie(data_pie, labels=labels, explode=explode, autopct='%1.2f%%', shadow=True, colors=['#256D85', '#3BACB6'])
plt.show()

This is a fairly well-balanced dataset.
Checking for skewness:
def skewness(data):
    # Compute the skew of every numerical column and return it as a DataFrame
    skew_df = pd.DataFrame(data.select_dtypes(np.number).columns, columns=['Feature'])
    skew_df['Skew'] = skew_df['Feature'].apply(lambda feature: skew(data[feature]))
    skew_df['Absolute Skew'] = skew_df['Skew'].apply(abs)
    return skew_df

skewness(data=data[num_columns])

Since the oldpeak and chol columns are noticeably skewed (their absolute skew is above the common 0.5 rule of thumb), we will apply a log transformation
data['oldpeak'] = np.log1p(data['oldpeak'])
data['chol'] = np.log1p(data['chol'])
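# Optional sanity check: re-run the skewness helper on the transformed
# columns; the absolute skew of oldpeak and chol should now be lower.
skewness(data=data[num_columns])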
data = pd.get_dummies(data, drop_first=True)
data.head()

Train/Test Data
Let’s prepare our data for DL using StandardScaler and train_test_split with test_size= 0.20
X = data.drop('output_1', axis=1)
y = data.output_1
scaler = StandardScaler()
X = scaler.fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)
print(X_train.shape)
(242, 22)
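Note that the scaler above is fit on the full dataset before splitting, which leaks a small amount of test-set information into the scaling statistics. A leakage-free variant (a minimal sketch for reference, not what was run here) would split first and fit the scaler on the training portion only:
# Leakage-free alternative: split first, then fit the scaler on the training data only
X_raw = data.drop('output_1', axis=1)
X_tr, X_te, y_tr, y_te = train_test_split(X_raw, y, test_size=0.20, random_state=42)
scaler_alt = StandardScaler()
X_tr = scaler_alt.fit_transform(X_tr)   # fit + transform on the training split
X_te = scaler_alt.transform(X_te)       # transform only, using training statistics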
Training Sequential ANN
model = tf.keras.Sequential([
    tf.keras.Input(shape=(22,)),
    tf.keras.layers.Dense(100, activation='relu'),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(1, activation='sigmoid')
])
model.compile(
    loss=tf.keras.losses.BinaryCrossentropy(),
    optimizer=tf.keras.optimizers.Adam(),
    metrics=[tf.keras.metrics.AUC(name='auc')]
)
history = model.fit(X_train, y_train, epochs=50, validation_split=0.20)
Epoch 1/50
7/7 [==============================] - 1s 33ms/step - loss: 0.7810 - auc: 0.3940 - val_loss: 0.6786 - val_auc: 0.5731
Epoch 2/50
7/7 [==============================] - 0s 5ms/step - loss: 0.6791 - auc: 0.5831 - val_loss: 0.6047 - val_auc: 0.7491
Epoch 3/50
7/7 [==============================] - 0s 5ms/step - loss: 0.6081 - auc: 0.7379 - val_loss: 0.5497 - val_auc: 0.8367
...
Epoch 49/50
7/7 [==============================] - 0s 4ms/step - loss: 0.2459 - auc: 0.9695 - val_loss: 0.3773 - val_auc: 0.9150
Epoch 50/50
7/7 [==============================] - 0s 5ms/step - loss: 0.2595 - auc: 0.9608 - val_loss: 0.3787 - val_auc: 0.9167
After 50 epochs the ANN reaches loss: 0.2595, auc: 0.9608, val_loss: 0.3787, val_auc: 0.9167.
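Since val_loss stops improving after roughly 25 epochs, an EarlyStopping callback could be added to halt training automatically and keep the best weights. A hedged variant (not what was run above):
# Optional: stop training once val_loss stops improving and restore the best weights
early_stop = tf.keras.callbacks.EarlyStopping(
    monitor='val_loss',
    patience=10,                 # epochs with no improvement before stopping
    restore_best_weights=True
)
history = model.fit(X_train, y_train, epochs=50,
                    validation_split=0.20, callbacks=[early_stop])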
Let’s print our model
model.summary()
Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #
=================================================================
 dense (Dense)               (None, 100)               2300
 dropout (Dropout)           (None, 100)               0
 dense_1 (Dense)             (None, 1)                 101
=================================================================
Total params: 2,401
Trainable params: 2,401
Non-trainable params: 0
_________________________________________________________________
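The parameter counts can be verified by hand: the hidden layer has 22 inputs × 100 units plus 100 biases, and the output layer has 100 weights plus 1 bias.
# Parameter count check: weights + biases per Dense layer
print(22 * 100 + 100)   # 2300 -- hidden layer
print(100 * 1 + 1)      # 101  -- output layer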
Let’s check the train/test loss and AUC values (the model was compiled with AUC as its only metric)
scoret = model.evaluate(X_train, y_train, verbose=0)
print('Train loss:', scoret[0])
print('Train AUC:', scoret[1])
Train loss: 0.26860886812210083
Train AUC: 0.9574049711227417
score = model.evaluate(X_test, y_test, verbose=0)
print('Test loss:', score[0])
print('Test AUC:', score[1])
Test loss: 0.3528120219707489
Test AUC: 0.9224138259887695
ANN Model Validation
Let’s look at the keys stored in history.history and plot the training curves
print(history.history.keys())
dict_keys(['loss', 'auc', 'val_loss', 'val_auc'])
plt.plot(history.history['auc'])
plt.plot(history.history['val_auc'])
plt.legend(['train', 'validation'], loc='lower right')
plt.ylabel('auc')
plt.xlabel('epoch')

plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.legend(['train', 'validation'], loc='upper right')
plt.ylabel('loss')
plt.xlabel('epoch')

Let’s check our test predictions
pred = model.predict(X_test)
# Threshold the sigmoid outputs at 0.5 and flatten to a 1-D array of 0/1 labels
pred = tf.cast(tf.round(pred), dtype=tf.int32).numpy().reshape(-1)
2/2 [==============================] - 0s 2ms/step
print(f'Accuracy of our model is {round(accuracy_score(y_test, pred) * 100, 2)}%')
Accuracy of our model is 86.89%
print(f'Precision: {round(precision_score(y_test, pred), 2)}')
Precision: 0.88
print(f'Recall: {round(recall_score(y_test, pred), 2)}')
Recall: 0.88
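The 0.5 cut-off used above is not the only option: because the sigmoid output is a probability, the decision threshold can be tuned to trade precision against recall. A minimal sketch (assuming the raw probabilities are taken from model.predict):
# Sweep the decision threshold to see the precision/recall trade-off
probs = model.predict(X_test).reshape(-1)
for t in [0.3, 0.4, 0.5, 0.6, 0.7]:
    pred_t = (probs >= t).astype(int)
    print(f'threshold={t:.1f}  precision={precision_score(y_test, pred_t):.2f}  '
          f'recall={recall_score(y_test, pred_t):.2f}')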
Let’s plot the confusion matrix
skplt.metrics.plot_confusion_matrix(y_test, pred, figsize=(6, 6), cmap='YlGnBu')

Let’s plot the ROC Curve
fpr, tpr, thresholds = roc_curve(y_test, pred)
plt.figure(figsize=(6, 4))
plt.plot(fpr, tpr, linewidth=2, color='teal')
plt.plot([0, 1], [0, 1], 'r--')
plt.title('ROC Curve')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.show()

The ROC AUC score is
roc_auc = roc_auc_score(y_test, pred)
print('ROC AUC Score: {:.2f}'.format(roc_auc))
ROC AUC Score: 0.87
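Note that the ROC curve and AUC above are computed from the hard 0/1 predictions, which collapses the curve to a single operating point. Passing the raw sigmoid probabilities instead gives a smoother curve and usually a somewhat higher AUC; a hedged sketch:
# ROC/AUC from predicted probabilities rather than thresholded labels
probs = model.predict(X_test).reshape(-1)
fpr_p, tpr_p, _ = roc_curve(y_test, probs)
print('ROC AUC (probabilities): {:.2f}'.format(roc_auc_score(y_test, probs)))
plt.plot(fpr_p, tpr_p, linewidth=2, color='teal')
plt.plot([0, 1], [0, 1], 'r--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.show()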
The F1-score is
import sklearn
f1 = sklearn.metrics.f1_score(y_test, pred)
print(‘F1 score: %f’ % f1)
F1 score: 0.870968
The Cohen’s kappa score is
kappa = sklearn.metrics.cohen_kappa_score(y_test, pred)
print(‘Cohens kappa: %f’ % kappa)
Cohens kappa: 0.737916
The F-beta score is
from sklearn.metrics import fbeta_score
fbeta_score(y_test, pred, average=None, beta=0.5)
array([0.8496732 , 0.88815789])
The average F-beta score is
fbeta_score(y_test, pred, average='macro', beta=0.5)
0.8689155486756106
The Jaccard similarity coefficient score is
from sklearn.metrics import jaccard_score
jaccard_score(y_test, pred, average=None)
array([0.76470588, 0.77142857])
The average Jaccard score is
jaccard_score(y_test, pred, average='macro')
0.7680672268907562
The final classification report is
from sklearn.metrics import classification_report
target_names = ['Healthy', 'Unhealthy']
print(classification_report(y_test, pred, target_names=target_names))
              precision    recall  f1-score   support

     Healthy       0.84      0.90      0.87        29
   Unhealthy       0.90      0.84      0.87        32

    accuracy                           0.87        61
   macro avg       0.87      0.87      0.87        61
weighted avg       0.87      0.87      0.87        61
The Hamming loss is
from sklearn.metrics import hamming_loss
hamming_loss(y_test, pred)
0.13114754098360656
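For a binary, single-label problem the Hamming loss is simply the fraction of misclassified samples, i.e. 1 minus the accuracy; a quick check:
# For binary single-label classification, hamming loss == 1 - accuracy
print(1 - accuracy_score(y_test, pred))   # should also print ~0.1311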
Summary
- CVD is one of the leading causes of death worldwide. According to the WHO, an estimated 17.9 million people die from CVD each year.
- Among the AI-driven technologies developed for early detection of CVD, ANN/DL binary classification has been shown to improve early diagnosis.
- Results confirm the excellent performance of our ANN classifier:
- The F1 score of 0.87 and test AUC above 90% are consistent with the earlier DL study.
- The validation tests indicate the potential effectiveness of the proposed approach in a real-world healthcare environment.
Explore More
- ECG Early Warning System (EWS) in Terms of Time-Variant Deformations and Creep-Recovery Strain Tests
- ECG Early Warning System (EWS) in Terms of the Heart Stress-Strain Failure Curve
- AI-Based ECG Recognition – EOY ’22 Status
- 99% Accurate Breast Cancer Classification using Neural Networks in TensorFlow 2.11.0
- DL-Assisted ECG/EKG Anomaly Detection using LSTM Autoencoder
- HealthTech ML/AI Q3 ’22 Round-Up
- HealthTech ML/AI Use-Cases
- Heart Failure Prediction using Supervised ML/AI Technique
Infographic