Early Heart Attack Prediction using ECG Autoencoder and 19 ML/AI Models with Test Performance QC Comparisons

  • Machine learning (ML) can be used to predict the risk of a heart attack. The algorithms typically used for this task are supervised ML algorithms such as Random Forest, Logistic Regression, and Support Vector Machines. These algorithms can be trained on data from past heart attack patients and then used to predict the likelihood of a heart attack in a new patient. The features used for this prediction can include age, gender, lifestyle factors, medical history, and other relevant factors.
  • In addition, electrocardiograms (ECG) are important for diagnosing cardiovascular disease (CVD). However, popular classification models based on supervised ML often fail to detect abnormal ECGs effectively. Following recent research studies, we propose an ECG anomaly detection framework based on deep learning autoencoders.
  • Autoencoders are a type of Artificial Neural Network (ANN) used to learn complex data representations and compress data effectively. An autoencoder is trained to learn a representation of the input data with a lower dimensionality than the original input, by learning to reconstruct the input from that compressed representation. Autoencoders can be used for a variety of tasks, such as dimensionality reduction, denoising, and feature extraction.
  • This post consists of two parts: (1) ECG anomaly detection using autoencoders and (2) early heart attack prediction using available supervised ML algorithms.
  • About Input Kaggle Datasets:
  • The first dataset contains the ECG readings of patients
  • The second dataset is for heart attack classification using the following features
  • Age : Age of the patient
  • Sex : Sex of the patient
  • exang: exercise induced angina (1 = yes; 0 = no)
  • ca: number of major vessels (0-3)
  • cp : chest pain type
    • Value 1: typical angina
    • Value 2: atypical angina
    • Value 3: non-anginal pain
    • Value 4: asymptomatic
  • trtbps : resting blood pressure (in mm Hg)
  • chol : cholesterol in mg/dl fetched via BMI sensor
  • fbs : (fasting blood sugar > 120 mg/dl) (1 = true; 0 = false)
  • rest_ecg : resting electrocardiographic results
    • Value 0: normal
    • Value 1: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV)
    • Value 2: showing probable or definite left ventricular hypertrophy by Estes’ criteria
  • thalach : maximum heart rate achieved
  • target : 0 = less chance of heart attack, 1 = more chance of heart attack

Table of Contents

  1. ECG Autoencoder
  2. Input Dataset
  3. Exploratory Data Analysis (EDA)
  4. Feature Correlations
  5. Multiple ML/AI Models
  6. Feature Importance
  7. Model Performance
  8. Summary
  9. Explore More

ECG Autoencoder

Let's set the working directory to YOURPATH

import os
os.chdir('YOURPATH')
os.getcwd()

and import the following libraries

import tensorflow as tf
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

from tensorflow.keras import layers, losses
from sklearn.model_selection import train_test_split
from tensorflow.keras.models import Model

Let’s read the input dataset

df = pd.read_csv('ecg.csv', header=None)

Let’s perform the following data manipulations:

  1. Split data and labels

data = df.iloc[:,:-1].values
labels = df.iloc[:,-1].values

2. Split train and test data with test_size = 0.2

train_data, test_data, train_labels, test_labels = train_test_split(data, labels, test_size = 0.2, random_state = 21)

3. Normalization

# min-max normalize to [0, 1] using the training-set range
min_val = tf.reduce_min(train_data)
max_val = tf.reduce_max(train_data)

train_data = (train_data - min_val) / (max_val - min_val)
test_data = (test_data - min_val) / (max_val - min_val)

4. Data format conversion

train_data = tf.cast(train_data, dtype=tf.float32)
test_data = tf.cast(test_data, dtype=tf.float32)

5. Label format conversion

train_labels = train_labels.astype(bool)
test_labels = test_labels.astype(bool)

6. Separate normal and abnormal ECG data

  • Normal ECG data
    n_train_data = train_data[train_labels]
    n_test_data = test_data[test_labels]
  • Abnormal ECG data
    an_train_data = train_data[~train_labels]
    an_test_data = test_data[~test_labels]

Let's plot a normal ECG

plt.plot(np.arange(140), n_train_data[0])
plt.grid()
plt.title('Normal ECG')
plt.show()

Normal ECG

Let's plot an abnormal ECG

plt.plot(np.arange(140), an_train_data[0])
plt.grid()
plt.title('Abnormal ECG')
plt.show()

Abnormal ECG

Let's introduce the detector = encoder + decoder class

class detector(Model):
    def __init__(self):
        super(detector, self).__init__()
        self.encoder = tf.keras.Sequential([
            layers.Dense(32, activation='relu'),
            layers.Dense(16, activation='relu'),
            layers.Dense(8, activation='relu')
        ])
        self.decoder = tf.keras.Sequential([
            layers.Dense(16, activation='relu'),
            layers.Dense(32, activation='relu'),
            layers.Dense(140, activation='sigmoid')
        ])

    def call(self, x):
        encoded = self.encoder(x)
        decoded = self.decoder(encoded)
        return decoded

Let's compile and train the model

autoencoder = detector()
autoencoder.compile(optimizer='adam', loss='mae')
autoencoder.fit(n_train_data, n_train_data, epochs=20, batch_size=512, validation_data=(n_test_data, n_test_data))

Epoch 1/20
5/5 [==============================] - 1s 42ms/step - loss: 0.0595 - val_loss: 0.0573
Epoch 2/20
5/5 [==============================] - 0s 7ms/step - loss: 0.0568 - val_loss: 0.0557
Epoch 3/20
5/5 [==============================] - 0s 8ms/step - loss: 0.0551 - val_loss: 0.0537
Epoch 4/20
5/5 [==============================] - 0s 7ms/step - loss: 0.0528 - val_loss: 0.0508
Epoch 5/20
5/5 [==============================] - 0s 7ms/step - loss: 0.0495 - val_loss: 0.0471
Epoch 6/20
5/5 [==============================] - 0s 7ms/step - loss: 0.0460 - val_loss: 0.0438
Epoch 7/20
5/5 [==============================] - 0s 7ms/step - loss: 0.0427 - val_loss: 0.0405
Epoch 8/20
5/5 [==============================] - 0s 7ms/step - loss: 0.0395 - val_loss: 0.0374
Epoch 9/20
5/5 [==============================] - 0s 7ms/step - loss: 0.0367 - val_loss: 0.0348
Epoch 10/20
5/5 [==============================] - 0s 7ms/step - loss: 0.0342 - val_loss: 0.0326
Epoch 11/20
5/5 [==============================] - 0s 7ms/step - loss: 0.0321 - val_loss: 0.0307
Epoch 12/20
5/5 [==============================] - 0s 7ms/step - loss: 0.0303 - val_loss: 0.0290
Epoch 13/20
5/5 [==============================] - 0s 7ms/step - loss: 0.0286 - val_loss: 0.0274
Epoch 14/20
5/5 [==============================] - 0s 8ms/step - loss: 0.0272 - val_loss: 0.0261
Epoch 15/20
5/5 [==============================] - 0s 7ms/step - loss: 0.0261 - val_loss: 0.0253
Epoch 16/20
5/5 [==============================] - 0s 8ms/step - loss: 0.0253 - val_loss: 0.0247
Epoch 17/20
5/5 [==============================] - 0s 7ms/step - loss: 0.0247 - val_loss: 0.0241
Epoch 18/20
5/5 [==============================] - 0s 7ms/step - loss: 0.0241 - val_loss: 0.0234
Epoch 19/20
5/5 [==============================] - 0s 7ms/step - loss: 0.0235 - val_loss: 0.0228
Epoch 20/20
5/5 [==============================] - 0s 8ms/step - loss: 0.0228 - val_loss: 0.0222
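
The epoch log above can also be inspected graphically. Keras returns a History object from fit, so if that return value is captured the training and validation loss curves can be plotted to check for over- or under-fitting; a minimal sketch, assuming the fit call above is assigned to a variable named history:

# Assumes the fit call above was captured, e.g.
# history = autoencoder.fit(n_train_data, n_train_data, epochs=20,
#                           batch_size=512, validation_data=(n_test_data, n_test_data))
plt.plot(history.history['loss'], label='Training loss')
plt.plot(history.history['val_loss'], label='Validation loss')
plt.xlabel('Epoch')
plt.ylabel('MAE loss')
plt.legend()
plt.grid()
plt.show()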

Let's define a function to plot an original ECG together with its reconstruction and shade the reconstruction error

def plot(data, n):
    enc_img = autoencoder.encoder(data)
    dec_img = autoencoder.decoder(enc_img)
    plt.plot(data[n], 'b')
    plt.plot(dec_img[n], 'r')
    plt.fill_between(np.arange(140), data[n], dec_img[n], color='lightcoral')
    plt.legend(labels=['Input', 'Reconstruction', 'Error'])
    plt.show()

Let’s plot the normal ECG

plot(n_test_data, 0)

plot the original normal ECG and reconstructed one and also show the error

Let’s plot the abnormal ECG

plot(an_test_data, 0)

plot the original abnormal ECG and reconstructed one and also show the error

Let's apply the autoencoder to n_train_data and set the detection threshold t to the mean training reconstruction error plus one standard deviation

reconstructed = autoencoder(n_train_data)
train_loss = losses.mae(reconstructed, n_train_data)
t = np.mean(train_loss) + np.std(train_loss)

def prediction(model, data, threshold):
    # True means the reconstruction error is below the threshold,
    # i.e. the ECG is classified as normal
    rec = model(data)
    loss = losses.mae(rec, data)
    return tf.math.less(loss, threshold)

print(t)

0.034331195

Let's apply the autoencoder prediction to n_test_data with t = 0.034331195

pred = prediction(autoencoder, n_test_data, t)
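
To quantify how well the threshold t separates the two classes, the same prediction can be run on the full test set and compared against test_labels; a minimal QC sketch (variable names here are illustrative, and the exact numbers depend on the trained weights):

from sklearn.metrics import accuracy_score, precision_score, recall_score

# prediction() returns True where the reconstruction error is below t,
# i.e. the ECG is flagged as normal; test_labels is also True for normal ECGs
full_test_pred = prediction(autoencoder, test_data, t)
print('Accuracy :', accuracy_score(test_labels, full_test_pred))
print('Precision:', precision_score(test_labels, full_test_pred))
print('Recall   :', recall_score(test_labels, full_test_pred))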

Let’s look at a few selected ECG curves

plot(n_test_data, 0)
plot(n_test_data, 1)
plot(n_test_data, 3)

ECG curves test data 0, 1
ECG curve test data 3

Input Heart Dataset

Let’s discuss Part 2 with the focus on predicting heart attack occurrence.

We don’t need to set the working directory again.

Let’s import the key libraries

import numpy as np # linear algebra
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings("ignore")
%matplotlib inline
from sklearn.preprocessing import OrdinalEncoder

Let’s read the input CVD dataset

df = pd.read_csv('heart.csv')

df.head()

Input data table

and check the descriptive statistics summary

df.describe().T

Input dataset descriptive statistics summary

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 303 entries, 0 to 302
Data columns (total 14 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       303 non-null    int64  
 1   sex       303 non-null    int64  
 2   cp        303 non-null    int64  
 3   trtbps    303 non-null    int64  
 4   chol      303 non-null    int64  
 5   fbs       303 non-null    int64  
 6   restecg   303 non-null    int64  
 7   thalachh  303 non-null    int64  
 8   exng      303 non-null    int64  
 9   oldpeak   303 non-null    float64
 10  slp       303 non-null    int64  
 11  caa       303 non-null    int64  
 12  thall     303 non-null    int64  
 13  output    303 non-null    int64  
dtypes: float64(1), int64(13)
memory usage: 33.3 KB

df.drop_duplicates(inplace = True)

df.shape

(302, 14)

We are ready for EDA.

Exploratory Data Analysis (EDA)

Let’s rename the columns

df.rename(columns = {'cp':'Chest_Pain_type'}, inplace = True)
df.rename(columns = {'exng':'Exercise_induced_Angina'}, inplace = True)
df.rename(columns = {'trtbps':'Resting_Blood_Pressure'}, inplace = True)
df.rename(columns = {'fbs':'Fasting_blood_sugar'}, inplace = True)
df.rename(columns = {'chol':'Serum_Cholesterol'}, inplace = True)
df.rename(columns = {'restecg':'Resting_ECG'}, inplace = True)
df.rename(columns = {'thalachh':'Maximum_Heart_Rate'}, inplace = True)
df.rename(columns = {'slp':'ECG_Slope'}, inplace = True)
df.rename(columns = {'caa':'Coronary_artery_abnormality'}, inplace = True)
df.rename(columns = {'thall':'Thalladium_test'}, inplace = True)
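
The same renaming can also be written as a single call with one mapping dictionary, which is easier to keep in sync; an equivalent sketch:

df.rename(columns={
    'cp': 'Chest_Pain_type',
    'exng': 'Exercise_induced_Angina',
    'trtbps': 'Resting_Blood_Pressure',
    'fbs': 'Fasting_blood_sugar',
    'chol': 'Serum_Cholesterol',
    'restecg': 'Resting_ECG',
    'thalachh': 'Maximum_Heart_Rate',
    'slp': 'ECG_Slope',
    'caa': 'Coronary_artery_abnormality',
    'thall': 'Thalladium_test'
}, inplace=True)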

and set the background color/palette

background_color = "#ffe6e6"
color_palette = ["#800000", "#8000ff", "#6aac90", "#5833ff", "#da8829"]

Let’s plot the target count

fig = plt.figure(figsize=(18,7))
gs = fig.add_gridspec(1,3)
gs.update(wspace=0.03, hspace=0.05)
ax0 = fig.add_subplot(gs[0,0])
ax1 = fig.add_subplot(gs[0,1])

fig.patch.set_facecolor(background_color)
ax0.set_facecolor(background_color)
ax1.set_facecolor(background_color)

ax0.text(0.5, 0.5, "Count of the target\n___________",
    horizontalalignment='center',
    verticalalignment='center',
    fontsize=18,
    fontweight='bold',
    fontfamily='serif',
    color='#000000')

ax0.set_xticklabels([])
ax0.set_yticklabels([])
ax0.tick_params(left=False, bottom=False)

ax1.text(0.9, 177, "Output", fontsize=14, fontweight='bold', fontfamily='serif', color="#000000")
ax1.grid(color='#000000', linestyle=':', axis='y', zorder=0, dashes=(1,5))
sns.countplot(ax=ax1, data=df, x='output', palette=color_palette)
ax1.set_xlabel("")
ax1.set_ylabel("")
ax1.set_xticklabels(["Low chances of attack", "High chances of attack"], fontsize=14)

ax0.spines["top"].set_visible(False)
ax0.spines["left"].set_visible(False)
ax0.spines["bottom"].set_visible(False)
ax0.spines["right"].set_visible(False)
ax1.spines["top"].set_visible(False)
ax1.spines["left"].set_visible(False)
ax1.spines["right"].set_visible(False)

Count of the target

Let’s begin with the univariate analysis of continuous features

fig2 = plt.figure(figsize=(18,10))
gs = fig2.add_gridspec(2,3)
gs.update(wspace=0.3, hspace=0.15)
ax0 = fig2.add_subplot(gs[0,0])
ax1 = fig2.add_subplot(gs[0,1])
ax2 = fig2.add_subplot(gs[0,2])
ax3 = fig2.add_subplot(gs[1,0])
ax4 = fig2.add_subplot(gs[1,1])
ax5 = fig2.add_subplot(gs[1,2])

axes = [ax0,ax1,ax2,ax3,ax4,ax5]

fig2.patch.set_facecolor(background_color)
for a in axes:
    a.set_facecolor(background_color)
    a.grid(color='#000000', linestyle=':', axis='y', zorder=0, dashes=(1,5))

for s in ["top","right",'bottom',"left"]:
    ax0.spines[s].set_visible(False)

ax0.tick_params(left=False, bottom=False)
ax0.set_xticklabels([])
ax0.set_yticklabels([])
ax0.text(0.5, 0.5,
    'Box plot for various\n continuous features\n_________________',
    horizontalalignment='center',
    verticalalignment='center',
    fontsize=18, fontweight='bold',
    fontfamily='serif',
    color="#000000")

ax1.text(-0.05, 81, 'Age', fontsize=14, fontweight='bold', fontfamily='serif', color="#000000")
sns.boxenplot(ax=ax1, y=df['age'], palette=["#5833ff"], width=0.6)

ax2.text(-0.05, 208, 'Resting Blood Pressure', fontsize=14, fontweight='bold', fontfamily='serif', color="#000000")
sns.boxenplot(ax=ax2, y=df['Resting_Blood_Pressure'], palette=["#8000ff"], width=0.6)

ax3.text(-0.05, 600, 'Serum Cholesterol', fontsize=14, fontweight='bold', fontfamily='serif', color="#000000")
sns.boxenplot(ax=ax3, y=df['Serum_Cholesterol'], palette=["#da8829"], width=0.6)

ax4.text(-0.09, 210, 'Maximum Heart Rate achieved', fontsize=14, fontweight='bold', fontfamily='serif', color="#000000")
sns.boxenplot(ax=ax4, y=df['Maximum_Heart_Rate'], palette=["#800000"], width=0.6)

ax5.text(-0.1, 6.6, 'Oldpeak', fontsize=14, fontweight='bold', fontfamily='serif', color="#000000")
sns.boxenplot(ax=ax5, y=df['oldpeak'], palette=["#6aac90"], width=0.6)

for s in ["top","right","left"]:
    ax1.spines[s].set_visible(False)
    ax2.spines[s].set_visible(False)
    ax3.spines[s].set_visible(False)
    ax4.spines[s].set_visible(False)
    ax5.spines[s].set_visible(False)

Box plot for various continuous features

Let’s look at the univariate analysis of categorical variables

fig = plt.figure(figsize=(18,15))
gs = fig.add_gridspec(3,3)
gs.update(wspace=0.5, hspace=0.25)
ax0 = fig.add_subplot(gs[0,0])
ax1 = fig.add_subplot(gs[0,1])
ax2 = fig.add_subplot(gs[0,2])
ax3 = fig.add_subplot(gs[1,0])
ax4 = fig.add_subplot(gs[1,1])
ax5 = fig.add_subplot(gs[1,2])
ax6 = fig.add_subplot(gs[2,0])
ax7 = fig.add_subplot(gs[2,1])
ax8 = fig.add_subplot(gs[2,2])

axes = [ax0,ax1,ax2,ax3,ax4,ax5,ax6,ax7,ax8]
fig.patch.set_facecolor(background_color)

for a in axes:
    a.set_facecolor(background_color)
    a.grid(color='#000000', linestyle=':', axis='y', zorder=0, dashes=(1,5))
    a.set_xlabel(" ")
    a.set_ylabel(" ")

for s in ["top","right",'bottom',"left"]:
    ax0.spines[s].set_visible(False)

ax0.tick_params(left=False, bottom=False)
ax0.set_xticklabels([])
ax0.set_yticklabels([])
ax0.text(0.5, 0.5,
    'Count plot for various\n categorical features\n_________________',
    horizontalalignment='center',
    verticalalignment='center',
    fontsize=18, fontweight='bold',
    fontfamily='serif',
    color="#000000")

ax1.text(0.3, 220, 'Sex', fontsize=14, fontweight='bold', fontfamily='serif', color="#000000")
sns.countplot(ax=ax1, data=df, x='sex', palette=color_palette)
ax1.set_xlabel("")
ax1.set_ylabel("")

ax2.text(0.3, 220, 'Exercise induced Angina', fontsize=14, fontweight='bold', fontfamily='serif', color="#000000")
sns.countplot(ax=ax2, data=df, x='Exercise_induced_Angina', palette=color_palette)
ax2.set_xlabel("")
ax2.set_ylabel("")

ax3.text(1.5, 200, 'Number of normal coronary vessels', fontsize=14, fontweight='bold', fontfamily='serif', color="#000000")
sns.countplot(ax=ax3, data=df, x='Coronary_artery_abnormality', palette=color_palette)
ax3.set_xlabel("")
ax3.set_ylabel("")

ax4.text(1.5, 162, 'Chest Pain type', fontsize=14, fontweight='bold', fontfamily='serif', color="#000000")
sns.countplot(ax=ax4, data=df, x='Chest_Pain_type', palette=color_palette)
ax4.set_xlabel("")
ax4.set_ylabel("")

ax5.text(0.5, 290, 'Fasting blood sugar', fontsize=14, fontweight='bold', fontfamily='serif', color="#000000")
sns.countplot(ax=ax5, data=df, x='Fasting_blood_sugar', palette=color_palette)
ax5.set_xlabel("")
ax5.set_ylabel("")

ax6.text(0.75, 165, 'Resting Ecg', fontsize=14, fontweight='bold', fontfamily='serif', color="#000000")
sns.countplot(ax=ax6, data=df, x='Resting_ECG', palette=color_palette)
ax6.set_xlabel("")
ax6.set_ylabel("")

ax7.text(0.85, 155, 'ECG Slope', fontsize=14, fontweight='bold', fontfamily='serif', color="#000000")
sns.countplot(ax=ax7, data=df, x='ECG_Slope', palette=color_palette)
ax7.set_xlabel("")
ax7.set_ylabel("")

ax8.text(1.2, 180, 'Thalladium Stress test', fontsize=14, fontweight='bold', fontfamily='serif', color="#000000")
sns.countplot(ax=ax8, data=df, x='Thalladium_test', palette=color_palette)
ax8.set_xlabel("")
ax8.set_ylabel("")

for s in ["top","right","left"]:
    ax1.spines[s].set_visible(False)
    ax2.spines[s].set_visible(False)
    ax3.spines[s].set_visible(False)
    ax4.spines[s].set_visible(False)
    ax5.spines[s].set_visible(False)
    ax6.spines[s].set_visible(False)
    ax7.spines[s].set_visible(False)
    ax8.spines[s].set_visible(False)

The univariate analysis of categorical variables

Let’s perform the multivariate analysis of continuous variables
fig = plt.figure(figsize=(18,18))
gs = fig.add_gridspec(5,3)
gs.update(wspace=0.5, hspace=0.5)
ax0 = fig.add_subplot(gs[0,0])
ax1 = fig.add_subplot(gs[0,1])
ax2 = fig.add_subplot(gs[1,0])
ax3 = fig.add_subplot(gs[1,1])
ax4 = fig.add_subplot(gs[2,0])
ax5 = fig.add_subplot(gs[2,1])
ax6 = fig.add_subplot(gs[3,0])
ax7 = fig.add_subplot(gs[3,1])
ax8 = fig.add_subplot(gs[4,0])
ax9 = fig.add_subplot(gs[4,1])

axes = [ax0,ax1,ax2,ax3,ax4,ax5,ax6,ax7,ax8,ax9]

fig.patch.set_facecolor(background_color)
for a in axes:
    a.set_facecolor(background_color)

var = ['Age','Resting Blood pressure','Serum Cholesterol','Maximum_Heart_Rate','oldpeak']
axe = [ax0,ax2,ax4,ax6,ax8]
i = 0
for a in axe:

    a.text(0.5, 0.5, "Distribution of " + var[i] + " \naccording to\n target variable\n___________",
    horizontalalignment='center',
    verticalalignment='center',
    fontsize=18,
    fontweight='bold',
    fontfamily='serif',
    color='#000000')
    i = i + 1
    a.spines["bottom"].set_visible(False)
    a.set_xticklabels([])
    a.set_yticklabels([])
    a.tick_params(left=False, bottom=False)

axs = [ax1,ax3,ax5,ax7,ax9]
i = 0
for a in axs:
    a.grid(color='#000000', linestyle=':', axis='y', zorder=0, dashes=(1,5))
    sns.kdeplot(ax=a, data=df, x=df.columns[i], hue="output", fill=True, palette=["#8000ff","#da8829"], alpha=.5, linewidth=0)
    i = i + 1

for k in ["top","left","right"]:
    ax0.spines[k].set_visible(False)
    ax1.spines[k].set_visible(False)
    ax2.spines[k].set_visible(False)
    ax3.spines[k].set_visible(False)
    ax4.spines[k].set_visible(False)
    ax5.spines[k].set_visible(False)
    ax6.spines[k].set_visible(False)
    ax7.spines[k].set_visible(False)
    ax8.spines[k].set_visible(False)
    ax9.spines[k].set_visible(False)

The multivariate analysis of continuous variables

Let’s look at the multivariable analysis of categorical features

fig = plt.figure(figsize=(18,20))
gs = fig.add_gridspec(6,3)
gs.update(wspace=0.5, hspace=0.5)
ax0 = fig.add_subplot(gs[0,0])
ax1 = fig.add_subplot(gs[0,1])
ax2 = fig.add_subplot(gs[1,0])
ax3 = fig.add_subplot(gs[1,1])
ax4 = fig.add_subplot(gs[2,0])
ax5 = fig.add_subplot(gs[2,1])
ax6 = fig.add_subplot(gs[3,0])
ax7 = fig.add_subplot(gs[3,1])
ax8 = fig.add_subplot(gs[4,0])
ax9 = fig.add_subplot(gs[4,1])

fig.patch.set_facecolor(background_color)
axes = [ax0,ax1,ax2,ax3,ax4,ax5,ax6,ax7,ax8,ax9]
for a in axes:
    a.set_facecolor(background_color)

ax0.text(0.5, 0.5, "Chest pain\ndistribution\n__________",
    horizontalalignment='center',
    verticalalignment='center',
    fontsize=16,
    fontweight='bold',
    fontfamily='serif',
    )
ax0.spines["bottom"].set_visible(False)
ax0.set_xticklabels([])
ax0.set_yticklabels([])
ax0.tick_params(left=False, bottom=False)
ax0.text(1, .5, "0 - Typical Angina\n1 - Atypical Angina\n2 - Non-anginal Pain\n3 - Asymptomatic",
    horizontalalignment='center',
    verticalalignment='center',
    fontsize=14
    )

ax1.grid(color='#000000', linestyle=':', axis='y', zorder=0, dashes=(1,5))
sns.kdeplot(ax=ax1, data=df, x='Chest_Pain_type', hue="output", fill=True, palette=["#8000ff","#da8829"], alpha=.5, linewidth=0)
ax1.set_xlabel("")
ax1.set_ylabel("")

ax2.text(0.5, 0.5, "Number of\nnormal major vessels \n___________",
    horizontalalignment='center',
    verticalalignment='center',
    fontsize=16,
    fontweight='bold',
    fontfamily='serif',
    color='#000000')
ax2.text(1, .5, "0 vessels\n1 vessel\n2 vessels\n3 vessels\n4 vessels",
    horizontalalignment='center',
    verticalalignment='center',
    fontsize=14
    )

ax2.spines["bottom"].set_visible(False)
ax2.set_xticklabels([])
ax2.set_yticklabels([])
ax2.tick_params(left=False, bottom=False)

ax3.grid(color='#000000', linestyle=':', axis='y', zorder=0, dashes=(1,5))
sns.kdeplot(ax=ax3, data=df, x='Coronary_artery_abnormality', hue="output", fill=True, palette=["#8000ff","#da8829"], alpha=.5, linewidth=0)
ax3.set_xlabel("")
ax3.set_ylabel("")

ax4.text(0.5, 0.5, "Coronary artery disease\naccording to\nsex\n______",
    horizontalalignment='center',
    verticalalignment='center',
    fontsize=16,
    fontweight='bold',
    fontfamily='serif',
    color='#000000')
ax4.text(1, .5, "0 - Female\n1 - Male",
    horizontalalignment='center',
    verticalalignment='center',
    fontsize=14
    )
ax4.spines["bottom"].set_visible(False)
ax4.set_xticklabels([])
ax4.set_yticklabels([])
ax4.tick_params(left=False, bottom=False)

ax5.grid(color='#000000', linestyle=':', axis='y', zorder=0, dashes=(1,5))
sns.countplot(ax=ax5, data=df, x='sex', palette=["#8000ff","#da8829"], hue='output')
ax5.set_xlabel("")
ax5.set_ylabel("")

ax6.text(0.5, 0.5, "Distribution of results\naccording to\n target variable\n___________",
    horizontalalignment='center',
    verticalalignment='center',
    fontsize=16,
    fontweight='bold',
    fontfamily='serif',
    color='#000000')
ax6.text(1, .5, "Thalium Stress\nTest Result\n0, 1, 2, 3",
    horizontalalignment='center',
    verticalalignment='center',
    fontsize=14
    )
ax6.spines["bottom"].set_visible(False)
ax6.set_xticklabels([])
ax6.set_yticklabels([])
ax6.tick_params(left=False, bottom=False)

ax7.grid(color='#000000', linestyle=':', axis='y', zorder=0, dashes=(1,5))
sns.kdeplot(ax=ax7, data=df, x='Thalladium_test', hue="output", fill=True, palette=["#8000ff","#da8829"], alpha=.5, linewidth=0)
ax7.set_xlabel("")
ax7.set_ylabel("")

ax8.text(0.5, 0.5, "Box plot of\n Max Heart rate \nwith outcome\n_______",
    horizontalalignment='center',
    verticalalignment='center',
    fontsize=16,
    fontweight='bold',
    fontfamily='serif',
    color='#000000')
ax8.text(1, .5, "Maximum heart\nrate achieved",
    horizontalalignment='center',
    verticalalignment='center',
    fontsize=14
    )

ax8.spines["bottom"].set_visible(False)
ax8.set_xticklabels([])
ax8.set_yticklabels([])
ax8.tick_params(left=False, bottom=False)

ax9.grid(color='#000000', linestyle=':', axis='y', zorder=0, dashes=(1,5))
sns.boxenplot(ax=ax9, data=df, x='output', y='Maximum_Heart_Rate', palette=["#8000ff","#da8829"])
ax9.set_xlabel("")
ax9.set_ylabel("")

for i in ["top","left","right"]:
    ax0.spines[i].set_visible(False)
    ax1.spines[i].set_visible(False)
    ax2.spines[i].set_visible(False)
    ax3.spines[i].set_visible(False)
    ax4.spines[i].set_visible(False)
    ax5.spines[i].set_visible(False)
    ax6.spines[i].set_visible(False)
    ax7.spines[i].set_visible(False)
    ax8.spines[i].set_visible(False)
    ax9.spines[i].set_visible(False)

The multivariable analysis of categorical features


Let’s look at boxplots of individual features

dataset = df

dataset['age'].plot(kind='box', subplots=True, sharex=False, sharey=False)
plt.figure()

Boxplot age

dataset['Resting_Blood_Pressure'].plot(kind='box', subplots=True, sharex=False, sharey=False)
plt.figure()

Boxplot Resting_Blood_Pressure

dataset['Serum_Cholesterol'].plot(kind='box', subplots=True, sharex=False, sharey=False)
plt.figure()

Boxplot Serum_Cholesterol

dataset['Maximum_Heart_Rate'].plot(kind='box', subplots=True, sharex=False, sharey=False)
plt.figure()

Boxplot Maximum_Heart_Rate

Feature Correlations

Let’s plot the feature correlation matrix heatmap

plt.figure(figsize=(14,10))
sns.heatmap(dataset.corr(), annot=True, cmap='cubehelix_r')
plt.show()

The feature correlation matrix heatmap
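
To read the heatmap numerically, the features can also be ranked by the absolute value of their correlation with the target; a minimal sketch (corr_with_target is an illustrative name):

# correlation of each feature with the target, strongest first
corr_with_target = dataset.corr()['output'].drop('output')
print(corr_with_target.abs().sort_values(ascending=False))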

Let’s compare the seaborn hexagonal bin plots

sns.jointplot(x="age", y="Maximum_Heart_Rate", data=dataset, ratio=10, kind='hex', color='green')
plt.show()

Hexagonal bin plot age vs Maximum_Heart_Rate

sns.jointplot(x="Serum_Cholesterol", y="Maximum_Heart_Rate", data=dataset, ratio=10, kind='hex', color='green')
plt.show()

Hexagonal bin plot Serum_Cholesterol vs Maximum_Heart_Rate

sns.jointplot(x="Resting_Blood_Pressure", y="Maximum_Heart_Rate", data=dataset, ratio=10, kind='hex', color='green')
plt.show()

Hexagonal bin plot Resting_Blood_Pressure vs Maximum_Heart_Rate

sns.jointplot(x="Resting_Blood_Pressure", y="Serum_Cholesterol", data=dataset, ratio=10, kind='hex', color='green')
plt.show()

Hexagonal bin plot Resting_Blood_Pressure vs Serum_Cholesterol

sns.jointplot(x="Resting_Blood_Pressure", y="age", data=dataset, ratio=10, kind='hex', color='green')
plt.show()

Hexagonal bin plot  Resting_Blood_Pressure vs age

Multiple ML/AI Models

Let’s import the libraries

from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score,recall_score
from sklearn.metrics import f1_score
from sklearn.metrics import confusion_matrix,accuracy_score, classification_report
from sklearn.metrics import ConfusionMatrixDisplay

and prepare the input data

df1 = df

Let's define the columns to be encoded and scaled
cat_cols = ['Coronary_artery_abnormality','Chest_Pain_type','Resting_ECG','ECG_Slope','Thalladium_test']
con_cols = ["age","Resting_Blood_Pressure","Serum_Cholesterol","Maximum_Heart_Rate","oldpeak"]

one-hot encoding the categorical columns
df1 = pd.get_dummies(df1, columns = cat_cols, drop_first = True)

and defining the features and target
X = df1.drop(['output'], axis=1)
y = df1[['output']]

Let’s perform train_test_split with test_size = 0.2

X_train, X_test, y_train, y_test = train_test_split(X,y, test_size = 0.2, random_state = 42)

instantiating the scaler
scaler = StandardScaler()

and scaling the features
# Note: the scaler is fit on all columns of X_train, but X_train itself is never
# replaced by its scaled version; only X_test and the (unused) scaled DataFrame
# are transformed (see the note after this block).
d = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
names = X.columns
scaled = pd.DataFrame(d, columns=names)
#scaled.head()
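
Because X_train is never replaced by its scaled version while X_test is, the two splits end up on different scales. A minimal alternative sketch that scales only the continuous columns consistently on both splits; the _s-suffixed variable names are illustrative, and the results reported below were produced with the original snippet, not this variant:

# split again and scale only the continuous columns, fitting the scaler on the
# training split and applying the same transformation to the test split
X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(X, y, test_size=0.2, random_state=42)
X_train_s, X_test_s = X_train_s.copy(), X_test_s.copy()

scaler_s = StandardScaler()
X_train_s[con_cols] = scaler_s.fit_transform(X_train_s[con_cols])
X_test_s[con_cols] = scaler_s.transform(X_test_s[con_cols])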

Let’s begin training the most popular models

models = [RandomForestClassifier(), KNeighborsClassifier(), SVC(), LogisticRegression()]

for a in models:
    a.fit(X_train, y_train)
    y_pred = a.predict(X_test)

    print(f'model: {str(a)}')
    print(classification_report(y_test, y_pred, zero_division=1))
    print('-'*30, '\n')
model: RandomForestClassifier()
              precision    recall  f1-score   support

           0       0.78      0.86      0.82        29
           1       0.86      0.78      0.82        32

    accuracy                           0.82        61
   macro avg       0.82      0.82      0.82        61
weighted avg       0.82      0.82      0.82        61

------------------------------ 

model: KNeighborsClassifier()
              precision    recall  f1-score   support

           0       0.48      1.00      0.64        29
           1       1.00      0.00      0.00        32

    accuracy                           0.48        61
   macro avg       0.74      0.50      0.32        61
weighted avg       0.75      0.48      0.31        61

------------------------------ 

model: SVC()
              precision    recall  f1-score   support

           0       0.48      1.00      0.64        29
           1       1.00      0.00      0.00        32

    accuracy                           0.48        61
   macro avg       0.74      0.50      0.32        61
weighted avg       0.75      0.48      0.31        61

------------------------------ 

model: LogisticRegression()
              precision    recall  f1-score   support

           0       0.79      0.90      0.84        29
           1       0.89      0.78      0.83        32

    accuracy                           0.84        61
   macro avg       0.84      0.84      0.84        61
weighted avg       0.84      0.84      0.84        61

------------------------------ 
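
For a side-by-side QC comparison, the same loop can collect the test metrics of each model into a single table instead of printing separate reports; a minimal sketch (results is an illustrative name, and the numbers depend on the preprocessing state and random seeds above):

results = []
for a in models:
    a.fit(X_train, y_train.values.ravel())
    y_pred = a.predict(X_test)
    results.append({'model': type(a).__name__,
                    'accuracy': accuracy_score(y_test, y_pred),
                    'precision': precision_score(y_test, y_pred, zero_division=1),
                    'recall': recall_score(y_test, y_pred, zero_division=1),
                    'f1': f1_score(y_test, y_pred, zero_division=1)})

print(pd.DataFrame(results).sort_values('f1', ascending=False))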

Feature Importance

Let’s select RandomForestClassifier

FinalM = RandomForestClassifier()
FinalM.fit(X_train, y_train)
y_pred = FinalM.predict(X_test)
print(classification_report(y_test, y_pred, zero_division=1))
feature_importances = FinalM.feature_importances_
feature_importances_df = pd.DataFrame({'features': list(X_train), 'feature_importances': feature_importances})
feature_importances_df.sort_values('feature_importances', ascending=False)

   precision    recall  f1-score   support

           0       0.86      0.83      0.84        29
           1       0.85      0.88      0.86        32

    accuracy                           0.85        61
   macro avg       0.85      0.85      0.85        61
weighted avg       0.85      0.85      0.85        61
RandomForestClassifier feature importance table

Let's visualize this table as a bar plot

plt.figure(figsize=(20,15))
importance = FinalM.feature_importances_
idxs = np.argsort(importance, axis=0, order=None)
sort = np.sort(idxs)
# note: sorting the argsort indices and slicing selects the first 15 features
# in column order rather than the 15 most important ones
new = sort[:15]
ax = plt.axes()
ax.set_facecolor(background_color)
plt.title("Feature Importance", fontsize=30)
plt.barh(range(len(new)), importance[new], align="center", color=color_palette)
plt.yticks(range(len(new)), (feature_importances_df['features'][i] for i in new), fontsize=20)
plt.xlabel("Random Forest Feature Importance", fontsize=18)
plt.tight_layout()
plt.show()


Random Forest Feature Importance bar plot
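
As a cross-check on the impurity-based importances above, permutation importance on the test split can be computed with scikit-learn's inspection module; a minimal sketch (perm_df is an illustrative name):

from sklearn.inspection import permutation_importance

# measures the drop in test accuracy when each feature is shuffled,
# which is less biased toward high-cardinality features
perm = permutation_importance(FinalM, X_test, y_test.values.ravel(),
                              n_repeats=10, random_state=42)
perm_df = pd.DataFrame({'features': names, 'importance': perm.importances_mean})
print(perm_df.sort_values('importance', ascending=False))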

Model Performance

Let’s select LogisticRegression and plot both classification report and the normalized confusion matrix

FinalM = LogisticRegression()
FinalM.fit(X_train, y_train)
y_pred = FinalM.predict(X_test)
print(classification_report(y_test,y_pred, zero_division=1))

   precision    recall  f1-score   support

           0       0.79      0.90      0.84        29
           1       0.89      0.78      0.83        32

    accuracy                           0.84        61
   macro avg       0.84      0.84      0.84        61
weighted avg       0.84      0.84      0.84        61

cm = confusion_matrix(y_test, y_pred)
target_names = ['Low', 'High']

# normalize each row (actual class) so the diagonal shows per-class recall
cmn = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
fig, ax = plt.subplots(figsize=(10,10))
sns.heatmap(cmn, annot=True, fmt='.2f', xticklabels=target_names, yticklabels=target_names)
plt.ylabel('Actual')
plt.xlabel('Predicted')
plt.show(block=False)

LogisticRegression normalized confusion matrix

Let’s select RandomForestClassifier

FinalM = RandomForestClassifier()
FinalM.fit(X_train, y_train)
y_pred = FinalM.predict(X_test)
print(classification_report(y_test,y_pred, zero_division=1))

precision    recall  f1-score   support

           0       0.86      0.83      0.84        29
           1       0.85      0.88      0.86        32

    accuracy                           0.85        61
   macro avg       0.85      0.85      0.85        61
weighted avg       0.85      0.85      0.85        61

and plot the normalized confusion matrix

cm = confusion_matrix(y_test, y_pred)
target_names = ['Low', 'High']

# normalize each row (actual class) so the diagonal shows per-class recall
cmn = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
fig, ax = plt.subplots(figsize=(10,10))
sns.heatmap(cmn, annot=True, fmt='.2f', xticklabels=target_names, yticklabels=target_names)
plt.ylabel('Actual')
plt.xlabel('Predicted')
plt.show(block=False)

RandomForestClassifier normalized confusion matrix
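
Accuracy alone can hide class-specific behaviour, so the two finalists can also be compared on ROC-AUC using their predicted probabilities; a minimal sketch using the same X_train/X_test as above (the max_iter and random_state values are illustrative):

from sklearn.metrics import roc_auc_score

for clf in [LogisticRegression(max_iter=1000), RandomForestClassifier(random_state=42)]:
    clf.fit(X_train, y_train.values.ravel())
    proba = clf.predict_proba(X_test)[:, 1]   # probability of the positive class
    print(type(clf).__name__, 'ROC-AUC:', round(roc_auc_score(y_test.values.ravel(), proba), 3))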

Let’s look at all estimators

from sklearn.utils import all_estimators

estimators = all_estimators(type_filter='classifier')

all_clfs = []
for name, ClassifierClass in estimators:
    print('Appending', name)
    try:
        clf = ClassifierClass()
        all_clfs.append(clf)
    except Exception as e:
        print('Unable to import', name)
        print(e)

Appending AdaBoostClassifier
Appending BaggingClassifier
Appending BernoulliNB
Appending CalibratedClassifierCV
Appending CategoricalNB
Appending ClassifierChain
Unable to import ClassifierChain
__init__() missing 1 required positional argument: 'base_estimator'
Appending ComplementNB
Appending DecisionTreeClassifier
Appending DummyClassifier
Appending ExtraTreeClassifier
Appending ExtraTreesClassifier
Appending GaussianNB
Appending GaussianProcessClassifier
Appending GradientBoostingClassifier
Appending HistGradientBoostingClassifier
Appending KNeighborsClassifier
Appending LabelPropagation
Appending LabelSpreading
Appending LinearDiscriminantAnalysis
Appending LinearSVC
Appending LogisticRegression
Appending LogisticRegressionCV
Appending MLPClassifier
Appending MultiOutputClassifier
Unable to import MultiOutputClassifier
__init__() missing 1 required positional argument: 'estimator'
Appending MultinomialNB
Appending NearestCentroid
Appending NuSVC
Appending OneVsOneClassifier
Unable to import OneVsOneClassifier
__init__() missing 1 required positional argument: 'estimator'
Appending OneVsRestClassifier
Unable to import OneVsRestClassifier
__init__() missing 1 required positional argument: 'estimator'
Appending OutputCodeClassifier
Unable to import OutputCodeClassifier
__init__() missing 1 required positional argument: 'estimator'
Appending PassiveAggressiveClassifier
Appending Perceptron
Appending QuadraticDiscriminantAnalysis
Appending RadiusNeighborsClassifier
Appending RandomForestClassifier
Appending RidgeClassifier
Appending RidgeClassifierCV
Appending SGDClassifier
Appending SVC
Appending StackingClassifier
Unable to import StackingClassifier
__init__() missing 1 required positional argument: 'estimators'
Appending VotingClassifier
Unable to import VotingClassifier
__init__() missing 1 required positional argument: 'estimators'

print(all_clfs)

[AdaBoostClassifier(), BaggingClassifier(), BernoulliNB(), CalibratedClassifierCV(), CategoricalNB(), ComplementNB(), DecisionTreeClassifier(), DummyClassifier(), ExtraTreeClassifier(), ExtraTreesClassifier(), GaussianNB(), GaussianProcessClassifier(), GradientBoostingClassifier(), HistGradientBoostingClassifier(), KNeighborsClassifier(), LabelPropagation(), LabelSpreading(), LinearDiscriminantAnalysis(), LinearSVC(), LogisticRegression(), LogisticRegressionCV(), MLPClassifier(), MultinomialNB(), NearestCentroid(), NuSVC(), PassiveAggressiveClassifier(), Perceptron(), QuadraticDiscriminantAnalysis(), RadiusNeighborsClassifier(), RandomForestClassifier(), RidgeClassifier(), RidgeClassifierCV(), SGDClassifier(), SVC()]
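
The all_clfs list can then be screened in one pass, wrapping each fit/predict in try/except since some of these default-constructed estimators make assumptions this data violates; a minimal sketch (scores is an illustrative name):

scores = {}
for clf in all_clfs:
    try:
        clf.fit(X_train, y_train.values.ravel())
        scores[type(clf).__name__] = accuracy_score(y_test, clf.predict(X_test))
    except Exception as e:
        print('Skipping', type(clf).__name__, '-', e)

print(pd.Series(scores).sort_values(ascending=False))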

Let’s import libraries and create multiple model pipelines

import xgboost as xgb
import lightgbm as lgb
from sklearn.ensemble import AdaBoostClassifier,BaggingClassifier,ExtraTreesClassifier,GradientBoostingClassifier,HistGradientBoostingClassifier
from sklearn.ensemble import StackingClassifier
from sklearn.naive_bayes import BernoulliNB,ComplementNB,GaussianNB,MultinomialNB
from sklearn.calibration import CalibratedClassifierCV
from sklearn.tree import DecisionTreeClassifier,ExtraTreeClassifier
from sklearn.dummy import DummyClassifier
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.semi_supervised import LabelPropagation,LabelSpreading
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.svm import LinearSVC,NuSVC
from sklearn.linear_model import LogisticRegressionCV,PassiveAggressiveClassifier,Perceptron,RidgeClassifier,RidgeClassifierCV
from sklearn.linear_model import SGDClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.neighbors import NearestCentroid
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.multiclass import OutputCodeClassifier,OneVsRestClassifier
from sklearn.multioutput import ClassifierChain
from sklearn.kernel_approximation import RBFSampler
allclfs = []
clf = AdaBoostClassifier()
allclfs.append(clf)
clf = BaggingClassifier()
allclfs.append(clf)
clf = CalibratedClassifierCV()
allclfs.append(clf)

import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.multiclass import OneVsRestClassifier

classifier = Pipeline([
    ('tfidf', TfidfTransformer()),
    ('clf', OneVsRestClassifier(LinearSVC()))])

base_lr = LogisticRegression()
chains = [ClassifierChain(base_lr, order="random", random_state=i) for i in range(10)]

forest = RandomForestClassifier(n_estimators=100,
random_state=123)

boost = xgb.XGBClassifier(random_state=123, verbosity=0, use_label_encoder=False)

lgclassifier = LogisticRegression(random_state=123)
estimators = [
    ('rf', forest),
    ('xgb', boost)
]
stackclf = StackingClassifier(estimators=estimators,
                              final_estimator=lgclassifier,
                              cv=10)

svm = LinearSVC(random_state=42)

ecoc_classifier = OutputCodeClassifier(svm, code_size=6)

from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

estimator = []
estimator.append(('LR',
                  LogisticRegression(solver='lbfgs',
                                     multi_class='multinomial',
                                     max_iter=200)))
estimator.append(('SVC', SVC(gamma='auto', probability=True)))
estimator.append(('DTC', DecisionTreeClassifier()))

vot_hard = VotingClassifier(estimators=estimator, voting='hard')

vot_soft = VotingClassifier(estimators=estimator, voting='soft')

Expanded list of models:

models = [AdaBoostClassifier(), BaggingClassifier(), CalibratedClassifierCV(),ComplementNB(),DecisionTreeClassifier(),DummyClassifier(),
ExtraTreeClassifier(),ExtraTreesClassifier(),GaussianNB(),GaussianProcessClassifier(),GradientBoostingClassifier(),
HistGradientBoostingClassifier(),LabelPropagation(),LabelSpreading(),LinearDiscriminantAnalysis(),LinearSVC(),
LogisticRegressionCV(),MLPClassifier(),MultinomialNB(),NearestCentroid(),NuSVC(),PassiveAggressiveClassifier(),
Perceptron(),QuadraticDiscriminantAnalysis(),xgb.XGBClassifier(),lgb.LGBMClassifier(),RidgeClassifier(),
RidgeClassifierCV(),SGDClassifier(),stackclf,vot_hard,vot_soft,ecoc_classifier,classifier]

for a in models:
    a.fit(X_train, y_train)
    y_pred = a.predict(X_test)

    print(f'model: {str(a)}')
    print(classification_report(y_test, y_pred, zero_division=1))
    print('-'*30, '\n')
model: AdaBoostClassifier()
              precision    recall  f1-score   support

           0       0.62      0.97      0.76        29
           1       0.94      0.47      0.62        32

    accuracy                           0.70        61
   macro avg       0.78      0.72      0.69        61
weighted avg       0.79      0.70      0.69        61

------------------------------ 
model: BaggingClassifier()
              precision    recall  f1-score   support

           0       0.60      0.62      0.61        29
           1       0.65      0.62      0.63        32

    accuracy                           0.62        61
   macro avg       0.62      0.62      0.62        61
weighted avg       0.62      0.62      0.62        61

------------------------------ 

model: CalibratedClassifierCV()
              precision    recall  f1-score   support

           0       0.91      0.69      0.78        29
           1       0.77      0.94      0.85        32

    accuracy                           0.82        61
   macro avg       0.84      0.81      0.81        61
weighted avg       0.84      0.82      0.82        61

------------------------------ 

model: ComplementNB()
              precision    recall  f1-score   support

           0       0.78      0.86      0.82        29
           1       0.86      0.78      0.82        32

    accuracy                           0.82        61
   macro avg       0.82      0.82      0.82        61
weighted avg       0.82      0.82      0.82        61

------------------------------ 

model: DecisionTreeClassifier()
              precision    recall  f1-score   support

           0       0.56      0.76      0.65        29
           1       0.68      0.47      0.56        32

    accuracy                           0.61        61
   macro avg       0.62      0.61      0.60        61
weighted avg       0.63      0.61      0.60        61

------------------------------ 

model: DummyClassifier()
              precision    recall  f1-score   support

           0       1.00      0.00      0.00        29
           1       0.52      1.00      0.69        32

    accuracy                           0.52        61
   macro avg       0.76      0.50      0.34        61
weighted avg       0.75      0.52      0.36        61

------------------------------ 

model: ExtraTreeClassifier()
              precision    recall  f1-score   support

           0       0.64      0.24      0.35        29
           1       0.56      0.88      0.68        32

    accuracy                           0.57        61
   macro avg       0.60      0.56      0.52        61
weighted avg       0.60      0.57      0.52        61

------------------------------ 

model: ExtraTreesClassifier()
              precision    recall  f1-score   support

           0       0.81      0.86      0.83        29
           1       0.87      0.81      0.84        32

    accuracy                           0.84        61
   macro avg       0.84      0.84      0.84        61
weighted avg       0.84      0.84      0.84        61

------------------------------ 

model: GaussianNB()
              precision    recall  f1-score   support

           0       0.58      1.00      0.73        29
           1       1.00      0.34      0.51        32

    accuracy                           0.66        61
   macro avg       0.79      0.67      0.62        61
weighted avg       0.80      0.66      0.62        61

------------------------------ 

model: GaussianProcessClassifier()
              precision    recall  f1-score   support

           0       0.48      1.00      0.64        29
           1       1.00      0.00      0.00        32

    accuracy                           0.48        61
   macro avg       0.74      0.50      0.32        61
weighted avg       0.75      0.48      0.31        61

------------------------------ 

model: GradientBoostingClassifier()
              precision    recall  f1-score   support

           0       0.63      0.83      0.72        29
           1       0.78      0.56      0.65        32

    accuracy                           0.69        61
   macro avg       0.71      0.70      0.69        61
weighted avg       0.71      0.69      0.68        61

------------------------------ 

model: HistGradientBoostingClassifier()
              precision    recall  f1-score   support

           0       0.68      0.86      0.76        29
           1       0.83      0.62      0.71        32

    accuracy                           0.74        61
   macro avg       0.75      0.74      0.74        61
weighted avg       0.76      0.74      0.73        61

------------------------------ 

model: LabelPropagation()
              precision    recall  f1-score   support

           0       0.48      1.00      0.64        29
           1       1.00      0.00      0.00        32

    accuracy                           0.48        61
   macro avg       0.74      0.50      0.32        61
weighted avg       0.75      0.48      0.31        61

------------------------------ 

model: LabelSpreading()
              precision    recall  f1-score   support

           0       0.48      1.00      0.64        29
           1       1.00      0.00      0.00        32

    accuracy                           0.48        61
   macro avg       0.74      0.50      0.32        61
weighted avg       0.75      0.48      0.31        61

------------------------------ 

model: LinearDiscriminantAnalysis()
              precision    recall  f1-score   support

           0       0.92      0.83      0.87        29
           1       0.86      0.94      0.90        32

    accuracy                           0.89        61
   macro avg       0.89      0.88      0.88        61
weighted avg       0.89      0.89      0.88        61

------------------------------ 

model: LinearSVC()
              precision    recall  f1-score   support

           0       0.78      0.86      0.82        29
           1       0.86      0.78      0.82        32

    accuracy                           0.82        61
   macro avg       0.82      0.82      0.82        61
weighted avg       0.82      0.82      0.82        61

------------------------------ 

model: LogisticRegressionCV()
              precision    recall  f1-score   support

           0       0.82      0.97      0.89        29
           1       0.96      0.81      0.88        32

    accuracy                           0.89        61
   macro avg       0.89      0.89      0.89        61
weighted avg       0.90      0.89      0.88        61

------------------------------ 

model: MLPClassifier()
              precision    recall  f1-score   support

           0       0.81      0.86      0.83        29
           1       0.87      0.81      0.84        32

    accuracy                           0.84        61
   macro avg       0.84      0.84      0.84        61
weighted avg       0.84      0.84      0.84        61

------------------------------ 

model: MultinomialNB()
              precision    recall  f1-score   support

           0       0.78      0.86      0.82        29
           1       0.86      0.78      0.82        32

    accuracy                           0.82        61
   macro avg       0.82      0.82      0.82        61
weighted avg       0.82      0.82      0.82        61

------------------------------ 

model: NearestCentroid()
              precision    recall  f1-score   support

           0       0.48      1.00      0.64        29
           1       1.00      0.00      0.00        32

    accuracy                           0.48        61
   macro avg       0.74      0.50      0.32        61
weighted avg       0.75      0.48      0.31        61

------------------------------ 

model: NuSVC()
              precision    recall  f1-score   support

           0       0.48      1.00      0.64        29
           1       1.00      0.00      0.00        32

    accuracy                           0.48        61
   macro avg       0.74      0.50      0.32        61
weighted avg       0.75      0.48      0.31        61

------------------------------ 

model: PassiveAggressiveClassifier()
              precision    recall  f1-score   support

           0       0.78      0.86      0.82        29
           1       0.86      0.78      0.82        32

    accuracy                           0.82        61
   macro avg       0.82      0.82      0.82        61
weighted avg       0.82      0.82      0.82        61

------------------------------ 

model: Perceptron()
              precision    recall  f1-score   support

           0       0.74      0.79      0.77        29
           1       0.80      0.75      0.77        32

    accuracy                           0.77        61
   macro avg       0.77      0.77      0.77        61
weighted avg       0.77      0.77      0.77        61

------------------------------ 

model: QuadraticDiscriminantAnalysis()
              precision    recall  f1-score   support

           0       1.00      0.00      0.00        29
           1       0.52      1.00      0.69        32

    accuracy                           0.52        61
   macro avg       0.76      0.50      0.34        61
weighted avg       0.75      0.52      0.36        61

------------------------------ 
model: XGBClassifier(base_score=0.5, booster='gbtree', callbacks=None,
              colsample_bylevel=1, colsample_bynode=1, colsample_bytree=1,
              early_stopping_rounds=None, enable_categorical=False,
              eval_metric=None, feature_types=None, gamma=0, gpu_id=-1,
              grow_policy='depthwise', importance_type=None,
              interaction_constraints='', learning_rate=0.300000012,
              max_bin=256, max_cat_threshold=64, max_cat_to_onehot=4,
              max_delta_step=0, max_depth=6, max_leaves=0, min_child_weight=1,
              missing=nan, monotone_constraints='()', n_estimators=100,
              n_jobs=0, num_parallel_tree=1, predictor='auto', random_state=0, ...)
              precision    recall  f1-score   support

           0       0.68      0.79      0.73        29
           1       0.78      0.66      0.71        32

    accuracy                           0.72        61
   macro avg       0.73      0.72      0.72        61
weighted avg       0.73      0.72      0.72        61

------------------------------ 

model: LGBMClassifier()
              precision    recall  f1-score   support

           0       0.67      0.90      0.76        29
           1       0.86      0.59      0.70        32

    accuracy                           0.74        61
   macro avg       0.77      0.75      0.73        61
weighted avg       0.77      0.74      0.73        61

------------------------------ 

model: RidgeClassifier()
              precision    recall  f1-score   support

           0       0.93      0.86      0.89        29
           1       0.88      0.94      0.91        32

    accuracy                           0.90        61
   macro avg       0.90      0.90      0.90        61
weighted avg       0.90      0.90      0.90        61

------------------------------ 

model: RidgeClassifierCV()
              precision    recall  f1-score   support

           0       0.93      0.86      0.89        29
           1       0.88      0.94      0.91        32

    accuracy                           0.90        61
   macro avg       0.90      0.90      0.90        61
weighted avg       0.90      0.90      0.90        61

------------------------------ 

model: SGDClassifier()
              precision    recall  f1-score   support

           0       0.74      0.90      0.81        29
           1       0.88      0.72      0.79        32

    accuracy                           0.80        61
   macro avg       0.81      0.81      0.80        61
weighted avg       0.82      0.80      0.80        61

------------------------------ 

model: StackingClassifier(cv=10,
                   estimators=[('rf', RandomForestClassifier(random_state=123)),
                               ('xgb',
                                XGBClassifier(base_score=None, booster=None,
                                              callbacks=None,
                                              colsample_bylevel=None,
                                              colsample_bynode=None,
                                              colsample_bytree=None,
                                              early_stopping_rounds=None,
                                              enable_categorical=False,
                                              eval_metric=None,
                                              feature_types=None, gamma=None,
                                              gpu_id=None, grow_policy=None,
                                              impor...
                                              interaction_constraints=None,
                                              learning_rate=None, max_bin=None,
                                              max_cat_threshold=None,
                                              max_cat_to_onehot=None,
                                              max_delta_step=None,
                                              max_depth=None, max_leaves=None,
                                              min_child_weight=None,
                                              missing=nan,
                                              monotone_constraints=None,
                                              n_estimators=100, n_jobs=None,
                                              num_parallel_tree=None,
                                              predictor=None, random_state=123, ...))],
                   final_estimator=LogisticRegression(random_state=123))
              precision    recall  f1-score   support

           0       0.76      0.86      0.81        29
           1       0.86      0.75      0.80        32

    accuracy                           0.80        61
   macro avg       0.81      0.81      0.80        61
weighted avg       0.81      0.80      0.80        61

------------------------------ 

model: VotingClassifier(estimators=[('LR',
                              LogisticRegression(max_iter=200,
                                                 multi_class='multinomial')),
                             ('SVC', SVC(gamma='auto', probability=True)),
                             ('DTC', DecisionTreeClassifier())])
              precision    recall  f1-score   support

           0       0.96      0.76      0.85        29
           1       0.82      0.97      0.89        32

    accuracy                           0.87        61
   macro avg       0.89      0.86      0.87        61
weighted avg       0.88      0.87      0.87        61

------------------------------ 

model: VotingClassifier(estimators=[('LR',
                              LogisticRegression(max_iter=200,
                                                 multi_class='multinomial')),
                             ('SVC', SVC(gamma='auto', probability=True)),
                             ('DTC', DecisionTreeClassifier())],
                 voting='soft')
              precision    recall  f1-score   support

           0       0.87      0.45      0.59        29
           1       0.65      0.94      0.77        32

    accuracy                           0.70        61
   macro avg       0.76      0.69      0.68        61
weighted avg       0.75      0.70      0.68        61

------------------------------ 

model: OutputCodeClassifier(code_size=6, estimator=LinearSVC(random_state=42))
              precision    recall  f1-score   support

           0       0.78      0.86      0.82        29
           1       0.86      0.78      0.82        32

    accuracy                           0.82        61
   macro avg       0.82      0.82      0.82        61
weighted avg       0.82      0.82      0.82        61

------------------------------ 

model: Pipeline(steps=[('tfidf', TfidfTransformer()),
                ('clf', OneVsRestClassifier(estimator=LinearSVC()))])
              precision    recall  f1-score   support

           0       0.81      0.90      0.85        29
           1       0.90      0.81      0.85        32

    accuracy                           0.85        61
   macro avg       0.85      0.85      0.85        61
weighted avg       0.86      0.85      0.85        61

------------------------------ 

Let's apply the full set of Machine Learning methods and compare their performance:

import sklearn
from sklearn import linear_model, ensemble, gaussian_process, svm, tree, naive_bayes, neighbors, metrics

MLA = [
    # GLM
    sklearn.linear_model.LogisticRegressionCV(),
    sklearn.linear_model.PassiveAggressiveClassifier(),
    sklearn.linear_model.RidgeClassifierCV(),
    sklearn.linear_model.SGDClassifier(),
    sklearn.linear_model.Perceptron(),

    # Ensemble Methods
    sklearn.ensemble.AdaBoostClassifier(),
    sklearn.ensemble.BaggingClassifier(),
    sklearn.ensemble.ExtraTreesClassifier(),
    sklearn.ensemble.GradientBoostingClassifier(),
    sklearn.ensemble.RandomForestClassifier(),

    # Gaussian Processes
    sklearn.gaussian_process.GaussianProcessClassifier(),

    # SVM
    sklearn.svm.SVC(probability=True),
    sklearn.svm.NuSVC(probability=True),
    sklearn.svm.LinearSVC(),

    # Trees
    sklearn.tree.DecisionTreeClassifier(),

    # Naive Bayes
    sklearn.naive_bayes.BernoulliNB(),
    sklearn.naive_bayes.GaussianNB(),

    # Nearest Neighbors
    sklearn.neighbors.KNeighborsClassifier(),
]

MLA_columns = []
MLA_compare = pd.DataFrame(columns=MLA_columns)

row_index = 0
for alg in MLA:
    # fit each algorithm and collect its train/test metrics
    predicted = alg.fit(X_train, y_train).predict(X_test)
    fp, tp, th = sklearn.metrics.roc_curve(y_test, predicted)
    MLA_name = alg.__class__.__name__
    MLA_compare.loc[row_index, 'MLA used'] = MLA_name
    MLA_compare.loc[row_index, 'Train Accuracy'] = round(alg.score(X_train, y_train), 4)
    MLA_compare.loc[row_index, 'Test Accuracy'] = round(alg.score(X_test, y_test), 4)
    MLA_compare.loc[row_index, 'Precision'] = sklearn.metrics.precision_score(y_test, predicted)
    MLA_compare.loc[row_index, 'Recall'] = sklearn.metrics.recall_score(y_test, predicted)
    MLA_compare.loc[row_index, 'AUC'] = sklearn.metrics.auc(fp, tp)

    row_index += 1

MLA_compare.sort_values(by=['Test Accuracy'], ascending=False, inplace=True)
MLA_compare

Multiple ML models performance table

Let’s perform MLA Train Accuracy Comparison

plt.subplots(figsize=(13,5))
sns.barplot(x="MLA used", y="Train Accuracy", data=MLA_compare, palette='hot', edgecolor=sns.color_palette('dark',7))
plt.xticks(rotation=90)
plt.title('MLA Train Accuracy Comparison')
plt.show()

MLA Train Accuracy Comparison bar plot

Let’s check the test data accuracy
plt.subplots(figsize=(13,5))
sns.barplot(x="MLA used", y="Test Accuracy", data=MLA_compare, palette='hot', edgecolor=sns.color_palette('dark',7))
plt.xticks(rotation=90)
plt.title('Accuracy of different machine learning models')
plt.show()

Test accuracy of different machine learning models

Creating plots to compare precision of the MLAs
plt.subplots(figsize=(13,5))
sns.barplot(x="MLA used", y="Precision", data=MLA_compare, palette='hot', edgecolor=sns.color_palette('dark',7))
plt.xticks(rotation=90)
plt.title('Comparing different Machine Learning Models')
plt.show()

Comparing precision of different Machine Learning Models

Creating plots for MLA recall comparison
plt.subplots(figsize=(13,5))
sns.barplot(x="MLA used", y="Recall", data=MLA_compare, palette='hot', edgecolor=sns.color_palette('dark',7))
plt.xticks(rotation=90)
plt.title('MLA Recall Comparison')
plt.show()

MLA Recall Comparison

Creating plot for MLA AUC comparison
plt.subplots(figsize=(13,5))
sns.barplot(x="MLA used", y="AUC", data=MLA_compare, palette='hot', edgecolor=sns.color_palette('dark',7))
plt.xticks(rotation=90)
plt.title('MLA AUC Comparison')
plt.show()

MLA AUC comparison

Creating a plot to show the ROC curves for all MLA
index = 1
for alg in MLA:
    # plot each model's ROC curve on the same axes
    predicted = alg.fit(X_train, y_train).predict(X_test)
    fp, tp, th = sklearn.metrics.roc_curve(y_test, predicted)
    roc_auc_mla = sklearn.metrics.auc(fp, tp)
    MLA_name = alg.__class__.__name__
    plt.plot(fp, tp, lw=2, alpha=0.3, label='ROC %s (AUC = %0.2f)' % (MLA_name, roc_auc_mla))

    index += 1

plt.title('ROC Curve comparison')
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
plt.plot([0, 1], [0, 1], 'r--')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()

ROC Curve comparison for all MLA
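As an optional refinement (a sketch that was not part of the original run), the same comparison can be drawn from continuous scores instead of hard 0/1 predictions for models that expose predict_proba or decision_function, which yields smoother ROC curves and more faithful AUC values:

plt.figure(figsize=(10, 6))
for alg in MLA:
    alg.fit(X_train, y_train)
    if hasattr(alg, "predict_proba"):
        scores = alg.predict_proba(X_test)[:, 1]   # probability of class 1
    elif hasattr(alg, "decision_function"):
        scores = alg.decision_function(X_test)     # signed distance to the decision boundary
    else:
        continue
    fp, tp, th = sklearn.metrics.roc_curve(y_test, scores)
    plt.plot(fp, tp, lw=2, alpha=0.4,
             label='ROC %s (AUC = %0.2f)' % (alg.__class__.__name__, sklearn.metrics.auc(fp, tp)))
plt.plot([0, 1], [0, 1], 'r--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
plt.show()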

Let’s invoke scikitplot to validate selected ML models of interest

import scikitplot as skplt

import sklearn

from sklearn.model_selection import train_test_split

from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor, GradientBoostingClassifier, ExtraTreesClassifier
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

import matplotlib.pyplot as plt

import sys
import warnings
warnings.filterwarnings("ignore")

print("Scikit Plot Version : ", skplt.__version__)
print("Scikit Learn Version : ", sklearn.__version__)
print("Python Version : ", sys.version)

Scikit Plot Version :  0.3.7
Scikit Learn Version :  1.2.2
Python Version :  3.9.16 (main, Jan 11 2023, 16:16:36) [MSC v.1916 64 bit (AMD64)]

Let’s plot the LR Classification Learning Curve

skplt.estimators.plot_learning_curve(LogisticRegression(), X_train, y_train,
                                     cv=7, shuffle=True, scoring="accuracy",
                                     n_jobs=-1, figsize=(6,4),
                                     title_fontsize="large", text_fontsize="large",
                                     title="LR Classification Learning Curve");

LR Classification Learning Curve

Let's compare the test-set scores (note that GradientBoostingClassifier.score returns classification accuracy, whereas RandomForestRegressor.score returns R², so the two values are not directly comparable)

gb_classif = GradientBoostingClassifier()
gb_classif.fit(X_train, y_train)
gb_classif.score(X_test, y_test)

0.6557377049180327

rf_reg = RandomForestRegressor()
rf_reg.fit(X_train, y_train)
rf_reg.score(X_test, y_test)

0.2841965517241377
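As a sanity check (a minimal sketch, not part of the original listing), the regressor's continuous predictions can be thresholded at 0.5 to obtain an accuracy value that is comparable with the classifiers:

from sklearn.metrics import accuracy_score

# assumption: predictions >= 0.5 are treated as class 1 (more chance of heart attack)
rf_reg_acc = accuracy_score(y_test, (rf_reg.predict(X_test) >= 0.5).astype(int))
print(rf_reg_acc)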

Let’s plot the final LR ROC Curve

log_reg=LogisticRegression()
log_reg.fit(X_train, y_train)

Y_test_probs = log_reg.predict_proba(X_test)

skplt.metrics.plot_roc_curve(y_test, Y_test_probs,
                             title="LR ROC Curve", figsize=(12,6));

LR ROC Curve

Let's plot the final LR confusion matrix again, this time row-normalized with seaborn

from sklearn.metrics import confusion_matrix

# rows = actual class, columns = predicted class
cm = confusion_matrix(y_test, log_reg.predict(X_test))
target_names = ['Low', 'High']

# normalize each row by the number of true samples in that class
cmn = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
fig, ax = plt.subplots(figsize=(10,10))
sns.heatmap(cmn, annot=True, fmt='.2f', xticklabels=target_names, yticklabels=target_names)
plt.ylabel('Actual')
plt.xlabel('Predicted')
plt.show(block=False)

LR normalized confusion matrix (final model).

Let’s plot the LR Precision-Recall Curve

skplt.metrics.plot_precision_recall_curve(y_test, Y_test_probs,
                                          title="LR Precision-Recall Curve", figsize=(12,6));

LR Precision-Recall Curve

Let’s look at the KS Statistic plot

Y_probas = log_reg.predict_proba(X_test)

skplt.metrics.plot_ks_statistic(y_test, Y_probas, figsize=(10,6));

LR KS Statistic plot

Let’s examine the Cumulative Gains Curve
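The corresponding scikit-plot call is not shown in the original listing; a minimal sketch, reusing the Y_probas probabilities computed above, would be:

skplt.metrics.plot_cumulative_gain(y_test, Y_probas, figsize=(10,6));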

LR Cumulative Gains Curve

Let’s check the LR Lift Curve

skplt.metrics.plot_lift_curve(y_test, Y_probas, figsize=(10,6));

LR Lift Curve

Let’s examine the Elbow Plot

skplt.cluster.plot_elbow_curve(KMeans(random_state=1),
                               X_train,
                               cluster_ranges=range(2, 20),
                               figsize=(8,6));

Kmeans Elbow Plot

Let’s check PCA component explained variances

kmeans = KMeans(n_clusters=10, random_state=1)
kmeans.fit(X_train, y_train)
cluster_labels = kmeans.predict(X_test)

pca = PCA(random_state=1)
pca.fit(X_train)

skplt.decomposition.plot_pca_component_variance(pca, figsize=(8,6));

PCA component explained variances

Let's look at the yellowbrick discrimination threshold plot for LR

from sklearn.linear_model import LogisticRegression
from yellowbrick.classifier import DiscriminationThreshold

viz = DiscriminationThreshold(LogisticRegression(random_state=123),
                              classes=target_names,
                              cv=0.2,
                              fig=plt.figure(figsize=(9,6)))

viz.fit(X_train, y_train)

viz.score(X_test, y_test)

viz.show();

The yellowbrick discrimination threshold plot for LR

As an example of yellowbrick data visualization, let's look at the DecisionTreeClassifier

import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import sklearn
import yellowbrick

pd.set_option(“display.max_columns”, 35)

import warnings
warnings.filterwarnings(“ignore”)

from yellowbrick.classifier import ClassificationReport
from sklearn.tree import DecisionTreeClassifier

viz = ClassificationReport(DecisionTreeClassifier(random_state=123),
                           classes=target_names,
                           support=True,
                           fig=plt.figure(figsize=(8,6)))

viz.fit(X_train, y_train)

viz.score(X_test, y_test)

viz.show();

ClassificationReport DecisionTreeClassifier

Let's check the ROC curves for the DecisionTreeClassifier

from yellowbrick.classifier import ROCAUC

viz = ROCAUC(DecisionTreeClassifier(random_state=123),
             classes=target_names,
             fig=plt.figure(figsize=(7,5)))

viz.fit(X_train, y_train)

viz.score(X_test, y_test)

viz.show();

ROC curves for DecisionTreeClassifier

Let’s plot the Precision-Recall Curve for DecisionTreeClassifier

from yellowbrick.classifier import PrecisionRecallCurve

viz = PrecisionRecallCurve(DecisionTreeClassifier(random_state=123),
                           classes=target_names,
                           ap_score=True,
                           iso_f1_curves=True,
                           fig=plt.figure(figsize=(7,5)))

viz.fit(X_train, y_train)

viz.score(X_test, y_test)

viz.show();

The Precision-Recall Curve for DecisionTreeClassifier

The same considerations apply to other ML algorithms mentioned above.
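For instance, a minimal sketch (assuming the same X_train/X_test split and target_names used above) that loops the yellowbrick classification report over a few of the other models from the comparison:

from yellowbrick.classifier import ClassificationReport
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

for model in [RandomForestClassifier(random_state=123), LogisticRegression(max_iter=200)]:
    viz = ClassificationReport(model, classes=target_names, support=True,
                               fig=plt.figure(figsize=(8,6)))
    viz.fit(X_train, y_train)
    viz.score(X_test, y_test)
    viz.show();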

Summary

Part 1

  • The ECG is a proven noninvasive test for predicting CVD. However, the ECG signal is prone to contamination by different kinds of noise, which may deform the heartbeat waveform and lead cardiologists to mislabel or misinterpret heartbeats.
  • To address this problem, this post proposes a robust and automated approach to analyzing ECG signals with a Keras Sequential autoencoder ANN. The autoencoder was trained to learn a lower-dimensional representation of the input data by learning to reconstruct the input from that compressed representation.
  • Results show losses.mae = 0.034331195; the corresponding validation loss curve is shown below.
ECG val_loss vs. epoch
  • It appears that the trained ANN also handles the noise present in the ECG time-series signals: the achieved accuracy is very high, and overfitting is addressed in a robust and efficient manner (a minimal illustrative sketch of reconstruction-error thresholding follows this list).
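For reference, here is a minimal sketch of how a reconstruction-error threshold is typically applied in this kind of autoencoder-based anomaly detection. It assumes the trained autoencoder model and the normal_train_data array from Part 1; it is an illustration, not the exact code of the original run:

import numpy as np
import tensorflow as tf

# reconstruction errors (MAE) of the autoencoder on normal training beats
reconstructions = autoencoder.predict(normal_train_data)
train_loss = tf.keras.losses.mae(reconstructions, normal_train_data)

# flag a beat as anomalous when its error exceeds mean + 1 std of the training errors
threshold = np.mean(train_loss) + np.std(train_loss)

def is_normal(model, data, threshold):
    # True where the reconstruction error stays below the threshold
    reconstructions = model(data)
    loss = tf.keras.losses.mae(reconstructions, data)
    return tf.math.less(loss, threshold)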

Part 2

  • In the second part of this post, we present an ML-based heart attack prediction method in which different risk factors are analyzed and CVD risk is predicted using multiple supervised ML approaches.
  • The heart disease dataset was collected from the UCI ML Repository and analyzed with these ML methods, with a focus on optimizing the prediction across the different input parameters.
  • Logistic Regression and Ridge Classifier provided the best predictions among the 19 models, each reaching an AUC of 0.90. The models were evaluated on accuracy, precision, recall, and area under the ROC curve (AUC); ML-based prediction of heart attack risk proved highly effective, especially for the boosting algorithms.
  • In-depth performance QC for Logistic Regression (LR): the class-1 ROC AUC is 0.93; the learning curves show a training score of 0.87 and a cross-validation score of 0.83; the class-1 precision-recall AUC is 0.94; the discrimination threshold is 0.45; and the KS statistic is 0.768 at a threshold of 0.159.
  • According to the Random Forest Classifier, the top 3 most dominant features are oldpeak, Max_Heart_Rate, and age.
  • K-means/PCA: the elbow plot suggests 10 clusters, and the first principal component explains 0.758 of the variance.
  • This study opens the door to applying both autoencoders and supervised ML algorithms to CVD prediction via hybrid models assessed with a range of performance measures.
  • Such predictions can help clinicians analyze heart disease risk factors and interpret a patient's ECG. Optimizing the ML models produced promising results for predicting early symptoms of CVD, and further gains are possible by incorporating additional risk factors associated with the condition.

Explore More

