Titanic Benchmark Hypothesis Testing in Disaster Risk Management: (Auto)EDA, ML, HPO & SHAP

Featured Image by Fazzi Studio via Canva.

The objective of this project is to apply the Titanic benchmark to hypothesis testing in disaster risk management.
In principle, it is possible to evaluate several hypotheses (arranged according to whether they belong to what can be called “economic,” “natural,” or “social” factors) that can be tested using the data on who survived and who perished in the Titanic disaster. This is an interesting issue in itself as the probability of survival differs greatly between passengers.
In this post, we will perform a Machine Learning (ML) analysis of the fatalities on the ship using the Titanic dataset on Kaggle. The main question that we are addressing here is whether there is a statistically significance relation between the death of the person and their passenger class, age, sex and/or port where they embarked their journey.
We have implemented and tested the comprehensive ML pipeline that consists of the following steps:
- Importing the Input Data
- Input Data QC & Pre-Processing
- Descriptive Statistical Analysis
- Exploratory Data Analysis (EDA)
- DataPrep/SweetViz AutoEDA
- Feature Engineering & Data Preparation
- ML Training, Testing, Tuning, and Prediction
- Model Tuning via Hyperparameter Optimization (HPO)
- Complete ML Performance Validation & Classification Report
- Final SHAP Interpretation of Model Features vs Predictions

About the Kaggle dataset : 890 passenger samples were given under the training set, and there were associated labels as to whether the passengers survived or not. Each passenger’s name, sex, age, passenger class, and point of embarkation was provided. In the test data, 418 samples that used the same format were given. The dataset is not complete as, for several samples, one or many fields were empty. All sample points for sex and passenger class were complete.

Read more details about the data dictionary in the dataset description section
- survival: Survival
- pclass: Ticket class
- sex: Sex
- Age: Age in years
- sibsp: # of siblings / spouses aboard the Titanic
- parch: # of parents / children aboard the Titanic
- ticket: Ticket number
- fare: Passenger fare
- cabin: Cabin number
- embarked: Port of embarkation

Let’s get started!

Table of Contents

An Environment Setup
Input Data Reading & Preparation
Descriptive Statistics of Input Data
Exploratory Data Analysis (EDA)
DataPrep AutoEDA
SweetViz AutoEDA
ML Training, Testing & Validation
Full ML Classification Report
SHAP Interpretation
Conclusions
Explore More
References

An Environment Setup

Setting up the working directory YOURPATH

import os
os.chdir('YOURPATH')    
os. getcwd()

Installing packages in Jupyter

!pip install --user plotly , seaborn ,missingo ,dataprep, sweetviz, sklearn, scikitplot, scipy, shap, yellowbrick

Conversely, installing packages with requirements.txt

!pip install -r requirements.txt

and in the requirements.txt file you put your modules in a list, with one item per line.

Setting up a Conda Environment (optional)

Creating an environment <env_name> from cmd

conda create --name <env_name>

Activating the environment with

conda activate <env_name>

To deactivate an environment, type: conda deactivate.

Read more about Conda Environment Best Practices

Input Data Reading & Preparation

Let’s focus on the Titanic Case – Machine Learning from Disaster using the Kaggle dataset.
Importing basic libraries

import warnings
warnings.filterwarnings('ignore')

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

import plotly.express as px
import plotly.graph_objects as go
import plotly.io as pio
pio.templates

Templates configuration
-----------------------
    Default template: 'plotly'
    Available templates:
        ['ggplot2', 'seaborn', 'simple_white', 'plotly',
         'plotly_white', 'plotly_dark', 'presentation', 'xgridoff',
         'ygridoff', 'gridon', 'none']

Reading the input data

train_data = pd.read_csv("traintitanic.csv")
test_data = pd.read_csv("testtitanic.csv")
gender_data = pd.read_csv("gender_submission.csv")

Checking the percentage of missing values per column

ddf=train_data.copy()
missing_percentage = ddf.isnull().mean() * 100
print("Percentage of missing values for each column:")
print(missing_percentage)

Percentage of missing values for each column:
PassengerId     0.000000
Survived        0.000000
Pclass          0.000000
Name            0.000000
Sex             0.000000
Age            19.865320
SibSp           0.000000
Parch           0.000000
Ticket          0.000000
Fare            0.000000
Cabin          77.104377
Embarked        0.224467
dtype: float64

Plotting the percentage of missing values per column using missingo

import pandas as pd
import missingno as msno
msno.bar(ddf)

Missingo bar plot of missing values per column.

Editing, transforming and cleaning the input data

train_data['Embarked'] = train_data.Embarked.fillna(train_data.Embarked.dropna().max())
test_data['Fare'] = test_data.Fare.fillna(test_data.Fare.dropna().mean())
guess_ages = np.zeros((2,3))
guess_ages

combine = [train_data , test_data]

# Converting Sex categories (male and female) to 0 and 1:
for dataset in combine:
    dataset['Sex'] = dataset['Sex'].map( {'female': 1, 'male': 0} ).astype(int)

# Filling missed age feature:

for dataset in combine:
    for i in range(0, 2):
        for j in range(0, 3):
            guess_df = dataset[(dataset['Sex'] == i) & \
                                  (dataset['Pclass'] == j+1)]['Age'].dropna()
            age_guess = guess_df.median()

            # Convert random age float to nearest .5 age
            guess_ages[i,j] = int( age_guess/0.5 + 0.5 ) * 0.5
            
    for i in range(0, 2):
        for j in range(0, 3):
            dataset.loc[ (dataset.Age.isnull()) & (dataset.Sex == i) & (dataset.Pclass == j+1),\
                    'Age'] = guess_ages[i,j]

    dataset['Age'] = dataset['Age'].astype(int)
# Combine SibSp and Parch became the Relatives Column
data = [train_data, test_data]
for dataset in data:
    dataset['relatives'] = dataset['SibSp'] + dataset['Parch']
    dataset.loc[dataset['relatives'] > 0, 'travelled_alone'] = 'No'
    dataset.loc[dataset['relatives'] == 0, 'travelled_alone'] = 'Yes'

#Changing cateogorcial data to 'string' type categorical data
train_data['Pclass'][train_data['Pclass'] == 1] = 'High Class'
train_data['Pclass'][train_data['Pclass'] == 2] = 'Middle Class'
train_data['Pclass'][train_data['Pclass'] == 3] = 'Low Class'

train_data['relatives'][train_data['relatives'] == 0] = 'Travel Alone'
train_data['relatives'][train_data['relatives'] == 1] = 'brothers'
train_data['relatives'][train_data['relatives'] == 2] = 'sisters'
train_data['relatives'][train_data['relatives'] == 3] = 'wife'
train_data['relatives'][train_data['relatives'] == 4] = 'husband'
train_data['relatives'][train_data['relatives'] == 5] = 'father'
train_data['relatives'][train_data['relatives'] == 6] = 'mother'
train_data['relatives'][train_data['relatives'] == 7] = 'descendants'
train_data['relatives'][train_data['relatives'] == 10] = 'others'


#Changing cateogorcial data to 'string' type categorical data
test_data['Pclass'][test_data['Pclass'] == 1] = 'High Class'
test_data['Pclass'][test_data['Pclass'] == 2] = 'Middle Class'
test_data['Pclass'][test_data['Pclass'] == 3] = 'Low Class'

test_data['relatives'][test_data['relatives'] == 0] = 'Travel Alone'
test_data['relatives'][test_data['relatives'] == 1] = 'brothers'
test_data['relatives'][test_data['relatives'] == 2] = 'sisters'
test_data['relatives'][test_data['relatives'] == 3] = 'wife'
test_data['relatives'][test_data['relatives'] == 4] = 'husband'
test_data['relatives'][test_data['relatives'] == 5] = 'father'
test_data['relatives'][test_data['relatives'] == 6] = 'mother'
test_data['relatives'][test_data['relatives'] == 7] = 'descendants'
test_data['relatives'][test_data['relatives'] == 10] = 'others'

# Dropping Unuseful feature because it has too many missed values:

train_data.drop(columns = ["Cabin"] , inplace = True)
train_data.drop(columns = ["Ticket"] , inplace = True)
train_data.drop(columns = ["SibSp"] , inplace = True)
train_data.drop(columns = ["Parch"] , inplace = True)

test_data.drop(columns = ["Cabin"] , inplace = True)
test_data.drop(columns = ["Ticket"] , inplace = True)
test_data.drop(columns = ["SibSp"] , inplace = True)
test_data.drop(columns = ["Parch"] , inplace = True)

#Select numerical and categorical variables
data_num = train_data.select_dtypes(include=['int64','int32' ,'float64'])
data_cat = train_data.select_dtypes(include=['object'])

#Resetting the index for both types of variables
data_num.mode().iloc[0].reset_index(name= 'Mode')
index	Mode
0	PassengerId	1.00
1	Survived	0.00
2	Sex	0.00
3	Age	25.00
4	Fare	8.05
data_cat.mode().iloc[0].reset_index(name= 'Mode')
index	Mode
0	Pclass	Low Class
1	Name	Abbing, Mr. Anthony
2	Embarked	S
3	relatives	Travel Alone
4	travelled_alone	Yes

Descriptive Statistics of Input Data

Examining the basic data info with descriptive statistics

train_data.shape
(891, 10)

train_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 10 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   PassengerId      891 non-null    int64  
 1   Survived         891 non-null    int64  
 2   Pclass           891 non-null    object 
 3   Name             891 non-null    object 
 4   Sex              891 non-null    int32  
 5   Age              891 non-null    int32  
 6   Fare             891 non-null    float64
 7   Embarked         891 non-null    object 
 8   relatives        891 non-null    object 
 9   travelled_alone  891 non-null    object 
dtypes: float64(1), int32(2), int64(2), object(5)
memory usage: 62.8+ KB

train_data.head()

Checking the number of unique values in input columns

for column in train_data.columns:
    num_unique_values = train_data[column].nunique()
    print(f'Number of unique values in {column}: {num_unique_values}')

Number of unique values in PassengerId: 891
Number of unique values in Survived: 2
Number of unique values in Pclass: 3
Number of unique values in Name: 891
Number of unique values in Sex: 2
Number of unique values in Age: 71
Number of unique values in Fare: 248
Number of unique values in Embarked: 3
Number of unique values in relatives: 9
Number of unique values in travelled_alone: 2

Checking the descriptive statistics of numerical values

train_data.describe().T

Descriptive statistics of numerical values.

Counting the total number of null values per column

train_data.isnull().sum()
PassengerId        0
Survived           0
Pclass             0
Name               0
Sex                0
Age                0
Fare               0
Embarked           0
relatives          0
travelled_alone    0
dtype: int64

Calculating the numerical data variance

data_num.var()
PassengerId    66231.000000
Survived           0.236772
Sex                0.228475
Age              177.591301
Fare            2469.436846
dtype: float64

Variance-to-STD conversion

np.sqrt(data_num.var())
PassengerId    257.353842
Survived         0.486592
Sex              0.477990
Age             13.326339
Fare            49.693429
dtype: float64

Computing the IQR for detecting outliers

print('Inter Quartile Range (IQR):')
data_num.quantile(0.75)-data_num.quantile(0.25)
Inter Quartile Range (IQR):
PassengerId    445.0000
Survived         1.0000
Sex              1.0000
Age             15.0000
Fare            23.0896
dtype: float64

Exploratory Data Analysis (EDA)

Examining outliers of numerical values via boxplots

import seaborn as sns 
import matplotlib.pyplot as plt
plt.figure(figsize=(8, 6)) 
sns.boxplot(x=train_data['Age'], color='lightgreen')

import seaborn as sns 
import matplotlib.pyplot as plt
plt.figure(figsize=(8, 6)) 
sns.boxplot(x=train_data['Fare'], color='lightgreen')

Computing and plotting the correlation matrix of numerical values

traincat=train_data[['Survived','Sex','Age','Fare']]
# Compute correlation matrix
corr_matrix = traincat.corr()
# Generate heatmap
plt.figure(figsize=(10, 6))
sns.heatmap(corr_matrix, annot=False, cmap='coolwarm', square=True)

Portraying the violin plots of numerical values

# Features of interest
selected_features = ['Sex','Age','Fare']
# Plotting violin plots for selected features
for feature in selected_features:
    plt.figure(figsize=(8, 6))
    sns.violinplot(x='Survived', y=feature, data=traincat, hue='Survived', palette='Blues', inner='quartile', legend=False)
    plt.title(f'Violin Plot for {feature} by Survival Status')
    plt.xlabel('Survival Status')
    plt.ylabel(feature)
    plt.show()

Looking at the pairplot of numerical features

sns.pairplot(traincat, hue='Survived')

Looking at the Fare vs Age scatterplot

plt.figure(figsize=(7, 6))
sns.scatterplot(x='Age', y='Fare', data=traincat, hue='Survived', palette='pastel')

Comparing it to the Fare vs Age jointplots

sns.jointplot(x='Age', y='Fare', data=traincat,color='darkblue')
plt.show()

sns.jointplot(x = 'Age',  y = 'Fare' , data =traincat,  kind = 'kde', fill = True)
plt.show()

Fare vs Age jointplots - smooth contours

Let’s get extra insights into the categorical variables using pie plots and the train subset

# Group the data frame by classes in the pclass column, and count the number of occurrences of each group.
df= pd.read_csv('train.csv')
pclass_count = df.groupby('Pclass')['Pclass'].count()
pclass_count
Pclass
1    216
2    184
3    491
Name: Pclass, dtype: int64

plt.figure(figsize=(7,7))
plt.title('Grouped by pclass')
plt.pie(pclass_count.values, labels=['Class 1', 'Class 2', 'Class 3'], 
 autopct='%1.1f%%', textprops={'fontsize':13})
plt.show()

# Group the data frame by classes in the pclass column, and count the number of occurrences of each group.
sex_count = df.groupby('Sex')['Sex'].count()
sex_count
Sex
female    314
male      577
Name: Sex, dtype: int64

plt.figure(figsize=(7,7))
plt.title('Grouped by gender')
plt.pie(sex_count.values, labels=['female', 'male'], 
 autopct='%1.1f%%', textprops={'fontsize':13})
plt.show()

# Group the data frame by classes in the Embarked column, and count the number of occurrences of each group.
embark_count = df.groupby('Embarked')['Embarked'].count()
embark_count
Embarked
C    168
Q     77
S    644
Name: Embarked, dtype: int64
plt.figure(figsize=(7,7))
plt.title('Grouped by embarkation')
plt.pie(embark_count.values, labels=['Cherbourg', 'Queenstown', 'Southampton'], 
        autopct='%1.1f%%', textprops={'fontsize':13})
plt.show()

DataPrep AutoEDA

Let’s discover DataPrep to make EDA easier in Python:

import dataprep
from dataprep.eda import *
from dataprep.datasets import load_dataset
from dataprep.eda import plot, plot_correlation, plot_missing, plot_diff, create_report

df = load_dataset("titanic")
plot(df)

DataPrep.EDA Report
Stats and Insights
Dataset Statistics
Number of Variables	12
Number of Rows	891
Missing Cells	866
Missing Cells (%)	8.1%
Duplicate Rows	0
Duplicate Rows (%)	0.0%
Total Size in Memory	315.7 KB
Average Row Size in Memory	362.8 B
Variable Types	
Numerical: 3
Categorical: 9
Dataset Insights
PassengerId is uniformly distributed	Uniform
Age has 177 (19.87%) missing values	Missing
Cabin has 687 (77.1%) missing values	Missing
Fare is skewed	Skewed
Name has a high cardinality: 891 distinct values	High Cardinality
Ticket has a high cardinality: 681 distinct values	High Cardinality
Cabin has a high cardinality: 147 distinct values	High Cardinality
Survived has constant length 1	Constant Length
Pclass has constant length 1	Constant Length
SibSp has constant length 1	Constant Length
Dataset Insights
Parch has constant length 1	Constant Length
Embarked has constant length 1	Constant Length
Name has all distinct values	Unique
1
2
Number of plots per page:

PassengerId
'hist.bins': 50
Number of bins in the histogram
'hist.yscale': 'linear'
Y-axis scale ("linear" or "log")
'hist.color': '#aec7e8'

etc.

The content of the titanic-analysis-report.html file is as follows:

Some observations

The maximum and minimum age for Pclass 1 > Pclass 2 > Pclass 3. This is understandable because it takes years to accumulate wealth.
The average age of those that survived in all the classes is lower than those that did not survive.
Passengers that survived in the Pclass 1 paid the highest fare. Which concluded the higher your fare in the Pclass 1 the higher your rate of survival.

SweetViz AutoEDA

Conducting the second round of AutoEDA by importing the sweetviz library

# importing sweetviz
import sweetviz as sv
#analyzing the dataset
advert_report = sv.analyze(train_data)
#display the report
advert_report.show_html('sweetviz_titanic.html')

Checking the contents of the above html report

Associations:

[Only including dataset “DataFrame”]
■ Squares are categorical associations (uncertainty coefficient & correlation ratio) from 0 to 1. The uncertainty coefficient is asymmetrical, (i.e. ROW LABEL values indicate how much they PROVIDE INFORMATION to each LABEL at the TOP).

• Circles are the symmetrical numerical correlations (Pearson’s) from -1 to 1. The trivial diagonal is intentionally left blank for clarity.

ML Training, Testing & Validation

Applying the algorithm to prepare the input dataset for ML training & testing

#Feature Engineering
train_data['Pclass'] = train_data['Pclass'].fillna(train_data['Pclass'].mode()[0])
dummies_Pclass = pd.get_dummies(train_data['Pclass'],prefix='Pclass')
train_data = pd.concat([train_data, dummies_Pclass], axis=1)

train_data['Embarked'] = train_data['Embarked'].fillna(train_data['Embarked'].mode()[0])
dummies_Embarked = pd.get_dummies(train_data['Embarked'],prefix='Embarked')
train_data = pd.concat([train_data, dummies_Embarked], axis=1)

train_data['travelled_alone'] = train_data['travelled_alone'].fillna(train_data['travelled_alone'].mode()[0])
dummies_travel = pd.get_dummies(train_data['travelled_alone'],prefix='travelled_alone')
train_data = pd.concat([train_data, dummies_travel], axis=1)
train_data.drop(['Pclass','Name','Fare', 'Embarked', 'relatives', 'travelled_alone'], axis=1,inplace=True)
Modelling_Train = train_data.iloc[:891,]
Modelling_Test = train_data.iloc[891:,]
X = Modelling_Train.drop(['Survived'],axis=1)
y = Modelling_Test['Survived']
              
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(train_data.drop('Survived', 1), train_data['Survived'], test_size = .3)

Importing the relevant libraries and introducing the ML modeling functions

import scipy.stats as stats
from sklearn.pipeline import Pipeline
from sklearn import metrics
from sklearn.metrics import confusion_matrix,classification_report,accuracy_score,roc_auc_score
from sklearn.model_selection import RandomizedSearchCV,GridSearchCV,train_test_split,cross_val_score
from sklearn.preprocessing import MinMaxScaler,StandardScaler
from scipy.stats import randint
from statsmodels.stats.outliers_influence import variance_inflation_factor

from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
import sklearn.metrics as metrics
from sklearn.metrics import precision_score, recall_score, accuracy_score
from sklearn.metrics import roc_curve, auc, roc_auc_score

def models(models):
  results = pd.DataFrame({'accuracy_train':[],'accuracy_test':[],
                          'recall_train':[], 'recall_test':[],
                          'precision_train':[],'precision_test':[],
                          'AUC_train': [], 'AUC_test': [], 
                          'f1_score_train':[], 'f1_score_test':[]})
  

  for model in models:
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)


    results = results.append({'accuracy_train': accuracy_score(y_train, model.predict(X_train)),
                              'accuracy_test': accuracy_score(y_test, model.predict(X_test)),
                              'recall_train': recall_score(y_train, model.predict(X_train), average='macro'),
                              'recall_test': recall_score(y_test, model.predict(X_test), average='macro'),
                              'precision_train': precision_score(y_train, model.predict(X_train), average='macro'),
                              'precision_test': precision_score(y_test, model.predict(X_test), average='macro'),
                              'AUC_train': roc_auc_score(y_train, model.predict(X_train)),
                              'AUC_test': roc_auc_score(y_test, model.predict(X_test)),
                              'f1_score_train': metrics.f1_score(y_train, model.predict(X_train)),
                              'f1_score_test':  metrics.f1_score(y_test, model.predict(X_test))
                              }, ignore_index=True)
    

   
  
  results['model'] = ['Logistic Regression', 'KNN','SVM','Naive Bayes', 'Decision Tree', 'Random Forest']
  return results

Running the ML modelling experiment

resultmodels=models([LogisticRegression(), KNeighborsClassifier(), SVC(),GaussianNB(), DecisionTreeClassifier(), RandomForestClassifier()])

Examining the output table resultmodels

accuracy_train	accuracy_test	recall_train	recall_test	precision_train	precision_test	AUC_train	AUC_test	f1_score_train	f1_score_test	model
0	0.792937	0.820896	0.774484	0.799549	0.783857	0.813683	0.774484	0.799549	0.721382	0.750000	Logistic Regression
1	0.707865	0.522388	0.666822	0.454378	0.695893	0.440418	0.666822	0.454378	0.562500	0.219512	KNN
2	0.613162	0.623134	0.500000	0.500000	0.306581	0.311567	0.500000	0.500000	0.000000	0.000000	SVM
3	0.768860	0.742537	0.761742	0.726893	0.756783	0.726022	0.761742	0.726893	0.709677	0.660099	Naive Bayes
4	1.000000	0.731343	1.000000	0.706172	1.000000	0.713671	1.000000	0.706172	1.000000	0.628866	Decision Tree
5	1.000000	0.776119	1.000000	0.740143	1.000000	0.771281	1.000000	0.740143	1.000000	0.666667	Random Forest

Plotting the output table resultmodels

#Accuracy
import numpy as np 
import matplotlib.pyplot as plt 
 
# set width of bar 
barWidth = 0.25
fig = plt.subplots(figsize =(10, 7)) 
 
# set height of bar 
IT = resultmodels['accuracy_train']
ECE = resultmodels['accuracy_test']

# Set position of bar on X axis 
br1 = np.arange(6) 
br2 = [x + barWidth for x in br1] 
 
# Make the plot
plt.bar(br1, IT, color ='r', width = barWidth, 
        edgecolor ='grey', label ='accuracy_train') 
plt.bar(br2, ECE, color ='g', width = barWidth, 
        edgecolor ='grey', label ='accuracy_test') 

# Adding Xticks 
plt.xlabel('Models', fontweight ='bold', fontsize = 18) 
plt.ylabel('Accuracy', fontweight ='bold', fontsize = 18) 
plt.xticks([r + barWidth for r in range(len(IT))], 
        ['LR', 'KNN', 'SVC', 'GNB', 'DT','RF'])
 
plt.legend()
plt.show()

#Recall
import numpy as np 
import matplotlib.pyplot as plt 
 
# set width of bar 
barWidth = 0.25
fig = plt.subplots(figsize =(10, 7)) 
 
# set height of bar 
IT = resultmodels['recall_train']
ECE = resultmodels['recall_test']

# Set position of bar on X axis 
br1 = np.arange(6) 
br2 = [x + barWidth for x in br1] 
 
# Make the plot
plt.bar(br1, IT, color ='r', width = barWidth, 
        edgecolor ='grey', label ='recall_train') 
plt.bar(br2, ECE, color ='g', width = barWidth, 
        edgecolor ='grey', label ='recall_test') 

# Adding Xticks 
plt.xlabel('Models', fontweight ='bold', fontsize = 18) 
plt.ylabel('Recall', fontweight ='bold', fontsize = 18) 
plt.xticks([r + barWidth for r in range(len(IT))], 
        ['LR', 'KNN', 'SVC', 'GNB', 'DT','RF'])
 
plt.legend()
plt.show()

#F1-Score
import numpy as np 
import matplotlib.pyplot as plt 
 
# set width of bar 
barWidth = 0.25
fig = plt.subplots(figsize =(10, 7)) 
 
# set height of bar 
IT = resultmodels['f1_score_train']
ECE = resultmodels['f1_score_test']

# Set position of bar on X axis 
br1 = np.arange(6) 
br2 = [x + barWidth for x in br1] 
 
# Make the plot
plt.bar(br1, IT, color ='r', width = barWidth, 
        edgecolor ='grey', label ='f1_score_train') 
plt.bar(br2, ECE, color ='g', width = barWidth, 
        edgecolor ='grey', label ='f1_score_test') 

# Adding Xticks 
plt.xlabel('Models', fontweight ='bold', fontsize = 18) 
plt.ylabel('F1-Score', fontweight ='bold', fontsize = 18) 
plt.xticks([r + barWidth for r in range(len(IT))], 
        ['LR', 'KNN', 'SVC', 'GNB', 'DT','RF'])
 
plt.legend()
plt.show()

#AUC score
import numpy as np 
import matplotlib.pyplot as plt 
 
# set width of bar 
barWidth = 0.25
fig = plt.subplots(figsize =(10, 7)) 
 
# set height of bar 
IT = resultmodels['AUC_train']
ECE = resultmodels['AUC_test']

# Set position of bar on X axis 
br1 = np.arange(6) 
br2 = [x + barWidth for x in br1] 
 
# Make the plot
plt.bar(br1, IT, color ='r', width = barWidth, 
        edgecolor ='grey', label ='AUC_train') 
plt.bar(br2, ECE, color ='g', width = barWidth, 
        edgecolor ='grey', label ='AUC_test') 

# Adding Xticks 
plt.xlabel('Models', fontweight ='bold', fontsize = 18) 
plt.ylabel('AUC Score', fontweight ='bold', fontsize = 18) 
plt.xticks([r + barWidth for r in range(len(IT))], 
        ['LR', 'KNN', 'SVC', 'GNB', 'DT','RF'])
 
plt.legend()
plt.show()

Full ML Classification Report

Selecting the best performing ML algorithm to run hyperparameter optimization (HPO)

rf = RandomForestClassifier()
rf.fit(X_train, y_train)

y_pred = rf.predict(X_test)

scores = cross_val_score(rf, X_train, y_train, cv=5)
print('Cross-Validation Accuracy Scores', scores)
print('Cross-Validation Accuracy Scores', scores.mean())
Cross-Validation Accuracy Scores [0.712      0.8        0.776      0.77419355 0.7983871 ]
Cross-Validation Accuracy Scores 0.7721161290322581

Running Random Forest HPO

from sklearn.model_selection import GridSearchCV
from scipy.stats import uniform
import numpy as np
%timeit

param_grid = [
{'criterion':['gini', 'entropy'],'n_estimators': [25, 50, 100], 'max_features': ['auto', 'log2'], 
'max_depth': [25, 50, 100],
'min_samples_split':[2, 5, 10], 'min_samples_leaf':[4, 8, 16], 'bootstrap': [True, False]}
 ]

rf = RandomForestClassifier(random_state=42)

grid_search_forest = GridSearchCV(rf, param_grid, cv=5)
grid_search_forest.fit(X_train, y_train)

grid_search_forest.best_estimator_
RandomForestClassifier(max_depth=25, max_features='log2', min_samples_leaf=8,
                       n_estimators=25, random_state=42)

Running Random Forest (RF) with optimized parameters

rf = RandomForestClassifier(bootstrap=False, criterion='entropy', max_depth=25,
                       min_samples_leaf=16, n_estimators=50, random_state=42)
rf.fit(X_train,y_train)
rf_predicted = rf.predict(X_test)
rf_conf_matrix = confusion_matrix(y_test, rf_predicted)
rf_acc_score = accuracy_score(y_test, rf_predicted)

print("confussion matrix")
print(rf_conf_matrix)
print("\n")
print("Accuracy of Random Forest:",rf_acc_score*100,'\n')
print(classification_report(y_test,rf_predicted))

confussion matrix
[[159  12]
 [ 30  67]]


Accuracy of Random Forest: 84.32835820895522 

              precision    recall  f1-score   support

           0       0.84      0.93      0.88       171
           1       0.85      0.69      0.76        97

    accuracy                           0.84       268
   macro avg       0.84      0.81      0.82       268
weighted avg       0.84      0.84      0.84       268

Plotting the RF feature importance score

feat_importances = pd.Series(rf.feature_importances_, index=X.columns)
ax = feat_importances.nlargest(10).plot(kind='barh')
ax.invert_yaxis()
plt.xlabel('score')
plt.ylabel('feature')
plt.title('feature importance score')

SciKit-Plot ML diagnostics using binary classification examples

import scikitplot as skplt

import sklearn

from sklearn.model_selection import train_test_split

from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor, GradientBoostingClassifier, ExtraTreesClassifier
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

import matplotlib.pyplot as plt

import sys
import warnings
warnings.filterwarnings("ignore")

print("Scikit Plot Version : ", skplt.__version__)
print("Scikit Learn Version : ", sklearn.__version__)
print("Python Version : ", sys.version)

%matplotlib inline
Scikit Plot Version :  0.3.7
Scikit Learn Version :  1.4.1.post1
Python Version :  3.11.5 | packaged by Anaconda, Inc. | (main, Sep 11 2023, 13:26:23) [MSC v.1916 64 bit (AMD64)]

RF Classification Learning Curve

skplt.estimators.plot_learning_curve(rf, X_train, y_train,
                                     cv=7, shuffle=True, scoring="accuracy",
                                     n_jobs=-1, figsize=(6,4), title_fontsize="large", text_fontsize="large",
                                     title="RF Classification Learning Curve");

RF Normalized Confusion Matrix

skplt.metrics.plot_confusion_matrix(y_test, rf_predicted,
                                    normalize=True,
                                    title="RF Confusion Matrix",
                                    cmap="Purples"
                                    );

RF ROC Curve

y_test_probs = rf.predict_proba(X_test)

skplt.metrics.plot_roc_curve(y_test, y_test_probs,
                       title="RF ROC Curve", figsize=(12,6));

RF Precision-Recall Curve

skplt.metrics.plot_precision_recall_curve(y_test, y_test_probs,
                       title="RF Precision-Recall Curve", figsize=(12,6));

KS Statistic Plot

y_probas = rf.predict_proba(X_test)

skplt.metrics.plot_ks_statistic(y_test, y_probas, figsize=(10,6));

RF Cumulative Gains Curve

skplt.metrics.plot_cumulative_gain(y_test, y_probas, figsize=(10,6));

RF Lift Curve

skplt.metrics.plot_lift_curve(y_test, y_probas, figsize=(10,6));

Trained ML calibration plots

lr_probas = LogisticRegression().fit(X_train, y_train).predict_proba(X_test)
rf_probas = RandomForestClassifier().fit(X_train, y_train).predict_proba(X_test)
gb_probas = GradientBoostingClassifier().fit(X_train, y_train).predict_proba(X_test)
et_scores = ExtraTreesClassifier().fit(X_train, y_train).predict_proba(X_test)

probas_list = [lr_probas, rf_probas, gb_probas, et_scores]
clf_names = ['Logistic Regression', 'Random Forest', 'Gradient Boosting', 'Extra Trees Classifier']
skplt.metrics.plot_calibration_curve(y_test,
                                     probas_list,
                                     clf_names, n_bins=15,
                                     figsize=(12,6)
                                     );

RF Kmeans Elbow Plot

skplt.cluster.plot_elbow_curve(KMeans(random_state=1),
                               X_train,
                               cluster_ranges=range(2, 20),
                               figsize=(8,6));

RF Silhouette Analysis

kmeans = KMeans(n_clusters=10, random_state=1)
kmeans.fit(X_train, y_train)
cluster_labels = kmeans.predict(X_test)
skplt.metrics.plot_silhouette(X_test, cluster_labels,
                              figsize=(8,6));

RF PCA Component Explained Variances

pca = PCA(random_state=1)
pca.fit(X_train)

skplt.decomposition.plot_pca_component_variance(pca, figsize=(8,6));

RF PCA 2-D Projection

skplt.decomposition.plot_pca_2d_projection(pca, X_train, y_train,
                                           figsize=(10,10),
                                           cmap="tab10");

yellowbrick RF Classification Report

import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import sklearn
import yellowbrick

pd.set_option("display.max_columns", 35)

import warnings
warnings.filterwarnings("ignore")

from yellowbrick.classifier import ClassificationReport

target_names=['0','1']

viz = ClassificationReport(rf,
                           classes=target_names,
                           support=True,
                           fig=plt.figure(figsize=(8,6)))

viz.fit(X_train, y_train)

viz.score(X_test, y_test)

viz.show();

RF Class Prediction Error

from yellowbrick.classifier import ClassPredictionError

viz = ClassPredictionError(rf,
                           classes=target_names,
                           fig=plt.figure(figsize=(9,6)))

viz.fit(X_train, y_train)

viz.score(X_test, y_test)

viz.show();

yellowbrick RF confusion matrix

from yellowbrick.classifier import ConfusionMatrix
from sklearn.ensemble import RandomForestClassifier

rfmodel=RandomForestClassifier(criterion='entropy', max_depth=25, max_features='log2',
                       min_samples_leaf=16, n_estimators=25, random_state=42)

visualizer = ConfusionMatrix(rfmodel, classes=target_names)

visualizer.fit(X_train, y_train)

visualizer.score(X_test, y_test)

visualizer.show();

RF Discrimination Threshold

from yellowbrick.classifier import DiscriminationThreshold


viz = DiscriminationThreshold(RandomForestClassifier(random_state=123),
             classes=target_names,
             cv=0.2,
             fig=plt.figure(figsize=(9,6)))

viz.fit(X_train, y_train)

viz.score(X_test, y_test)

viz.show();

For those classification problems that have a severe class imbalance, the default threshold of 0.5 can result in poor performance. We can see that the RF optimal threshold as 0.45 (compared to the default of 0.5) that achieves an F-Measure of about 0.7 for class 1.

SHAP Interpretation

Let’s analyze the SHAP (Shapley Additive explanations) values to compare feature contributions to the RF model.

#SHAP (Shapley Additive explanation)
import shap 
feature_names = X.columns.values.tolist()
X_test_df = pd.DataFrame(data=X_test, columns=feature_names)

explainer = shap.TreeExplainer(grid_search_forest.best_estimator_)
shap_values = explainer.shap_values(X_test_df)

shap.summary_plot(shap_values[1], X_test_df, plot_type="bar")

SHAP average impact on model output magnitude.

shap.summary_plot(shap_values[1], X_test_df)

This plot shows SHAP values for top features for every passenger in the training data. For each feature, the associated SHAP value is plotted on the x-axis. Each point represents a single model decision. The more influential a feature is, the more negative or positive it’s associated SHAP value. A SHAP value of 0 means that feature did nothing to move the decision away from the naive/reference value. In the plot, we see that the Sex feature is highly influential with strong negative and positive SHAP values for many predicted outcomes.

def plot_shap(model, passenger):

    explainer = shap.TreeExplainer(model)
    shap_values = explainer.shap_values(passenger)
    shap.initjs()
    return shap.force_plot(explainer.expected_value[1], shap_values[1], passenger)

passenger = X_test_df.iloc[1,:].astype(float)
plot_shap(grid_search_forest.best_estimator_, passenger)
passenger = X_test_df.iloc[2,:].astype(float)
plot_shap(grid_search_forest.best_estimator_, passenger)
passenger = X_test_df.iloc[5,:].astype(float)
plot_shap(grid_search_forest.best_estimator_, passenger)
shap_val = explainer.shap_values(X_test)
shap.summary_plot(shap_val, X_test)

This plot demonstrates the model feature importance ranked by mean absolute SHAP. It shows how important each feature was on average over the passenger population.

shap_values = explainer.shap_values(X_train.iloc[:50])
shap.force_plot(explainer.expected_value[1], shap_values[1], X_test_df.iloc[:50])

Below is a gallery of the more detailed SHAP plots comparing the most dominant features, viz. Sex, Age, and Pclass

Explanations for the predicted outcomes of 3 passengers. The features x for each passenger are displayed followed by an additive force layout, which graphically conveys prominent SHAP values that influenced each decision. The term f(x) is the evaluation of the trained model f on the features x, which is the predicted outcome for the corresponding passenger. The base value serves as a reference for the force bars, and is calculated as the average predicted outcome over the training data. Red (blue) force bars represent features that have increased (decreased) the passengers chance for survival, relative to the base value. The wider the force bars are, the more influential the corresponding feature.

SHAP age vs Sex high and Pclass low with 2 averaged samples.

In the above plots, positive SHAP value means positive impact on prediction, leading the model to predict 1 (e.g. Passenger survived the Titanic). Negative SHAP value means negative impact, leading the model to predict 0 (e.g. passenger didn’t survive the Titanic).

Conclusions

The ultimate aim of this post was to use ML to create a model that predicts which passengers survived the Titanic shipwreck. Apart from elements of luck involved in surviving, it seems some groups of people were more likely to survive than others.
Specifically, we have addressed the challenge of building a predictive model that answers the question: “what sorts of people were more likely to survive?” using passenger data such as age, gender, socio-economic class, etc.
The Goal: Predict whether a passenger survived or not. 0 for not surviving, 1 for surviving.
First and foremost, we have tried to get preliminary insights into the underlying dataset structure, and get familiar with it using descriptive statistics, so we can create more efficient models. This step is important because a lot of the beginners are often confused as to how to get started.
Next, we have carried out the comprehensive Exploratory Data Analysis (EDA) to uncover underlying patterns, grasp the dataset’s structure, and identify any potential anomalies or relationships between variables.
EDA is basically used to identify a human error, missing values, or outliers. It extracts useful variables and removes useless variables.
We have applied two additional rounds of AutoEDA to the input data using DataPrep and sweetviz libraries. DataPrep is a fully automated data analysis for visually exploring, cleaning, and preparing our data for ML. SweetViz is an open-source Python library that generates beautiful, high-density visualizations to perform EDA with just two lines of code.
By performing the aforementioned 3-step EDA, we have checked the quality of the data. This QC visualization and statistical analysis process ensures that further analysis is based on accurate and insightful information, thereby reducing the likelihood of errors in subsequent stages.
We have trained and compared the following 5 ML classifiers: LogisticRegression(), KNeighborsClassifier(), SVC(),GaussianNB(), DecisionTreeClassifier(), and RandomForestClassifier().
It appears that 84.3% ACC Random Forest (RF) Classifier after HPO-type tuning is the best performing model.
We have compiled the full RF binary classification report: First, the report shows a few standard quality metrics such as Accuracy, Precision, Recall, AUC, and F-1 Score (cf. yellowbrick classification table). Secondly, the reports includes a great variety of ML performance plots such as confusion matrix, feature importance, learning curves, ROC curve, Precision-Recall Curve, KS Statistic Plot, Cumulative Gains Curve, Lift Curve, trained ML calibration plots, Kmeans Elbow Plot, Silhouette Analysis, PCA Component Explained Variances, PCA 2-D Projection, Class Prediction Error, and Discrimination Threshold.
We have explored SHAP values to discover relationships between model features and ML predictions. This helps debug potential biases, identify data issues, and justify the model’s decisions.

Titanic Benchmark Hypothesis Testing in Disaster Risk Management: (Auto)EDA, ML, HPO & SHAP

An Environment Setup

Input Data Reading & Preparation

Descriptive Statistics of Input Data

Exploratory Data Analysis (EDA)

DataPrep AutoEDA

SweetViz AutoEDA

ML Training, Testing & Validation

Full ML Classification Report

SHAP Interpretation

Conclusions

Explore More

References

Discover more from Our Blogs

Leave a comment Cancel reply

Titanic Benchmark Hypothesis Testing in Disaster Risk Management: (Auto)EDA, ML, HPO & SHAP

An Environment Setup

Input Data Reading & Preparation

Descriptive Statistics of Input Data

Exploratory Data Analysis (EDA)

DataPrep AutoEDA

SweetViz AutoEDA

ML Training, Testing & Validation

Full ML Classification Report

SHAP Interpretation

Conclusions

Explore More

References

Share this:

Discover more from Our Blogs

Leave a comment Cancel reply

Discover more from Our Blogs