Titanic Benchmark Hypothesis Testing in Disaster Risk Management: (Auto)EDA, ML, HPO & SHAP

Featured Image by Fazzi Studio via Canva.

  • The objective of this project is to apply the Titanic benchmark to hypothesis testing in disaster risk management.
  • In principle, it is possible to evaluate several hypotheses (arranged according to whether they belong to what can be called “economic,” “natural,” or “social” factors) that can be tested using the data on who survived and who perished in the Titanic disaster. This is an interesting issue in itself as the probability of survival differs greatly between passengers.
  • In this post, we will perform a Machine Learning (ML) analysis of the fatalities on the ship using the Titanic dataset on Kaggle. The main question that we are addressing here is whether there is a statistically significance relation between the death of the person and their passenger class, age, sex and/or port where they embarked their journey.
  • We have implemented and tested the comprehensive ML pipeline that consists of the following steps:
  • About the Kaggle dataset : 890 passenger samples were given under the training set, and there were associated labels as to whether the passengers survived or not. Each passenger’s name, sex, age, passenger class, and point of embarkation was provided. In the test data, 418 samples that used the same format were given. The dataset is not complete as, for several samples, one or many fields were empty. All sample points for sex and passenger class were complete.
  • Read more details about the data dictionary in the dataset description section
    • survival: Survival
    • pclass: Ticket class
    • sex: Sex
    • Age: Age in years
    • sibsp: # of siblings / spouses aboard the Titanic
    • parch: # of parents / children aboard the Titanic
    • ticket: Ticket number
    • fare: Passenger fare
    • cabin: Cabin number
    • embarked: Port of embarkation
  • Let’s get started!

Table of Contents

  1. An Environment Setup
  2. Input Data Reading & Preparation
  3. Descriptive Statistics of Input Data
  4. Exploratory Data Analysis (EDA)
  5. DataPrep AutoEDA
  6. SweetViz AutoEDA
  7. ML Training, Testing & Validation
  8. Full ML Classification Report
  9. SHAP Interpretation
  10. Conclusions
  11. Explore More
  12. References

An Environment Setup

  • Setting up the working directory YOURPATH
import os
os.chdir('YOURPATH')    
os. getcwd()
!pip install --user plotly , seaborn ,missingo ,dataprep, sweetviz, sklearn, scikitplot, scipy, shap, yellowbrick
  • Conversely, installing packages with requirements.txt
!pip install -r requirements.txt

and in the requirements.txt file you put your modules in a list, with one item per line.

Creating an environment <env_name> from cmd

conda create --name <env_name>

Activating the environment with

conda activate <env_name>

To deactivate an environment, type: conda deactivate. 

Input Data Reading & Preparation

import warnings
warnings.filterwarnings('ignore')

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

import plotly.express as px
import plotly.graph_objects as go
import plotly.io as pio
pio.templates

Templates configuration
-----------------------
    Default template: 'plotly'
    Available templates:
        ['ggplot2', 'seaborn', 'simple_white', 'plotly',
         'plotly_white', 'plotly_dark', 'presentation', 'xgridoff',
         'ygridoff', 'gridon', 'none']
  • Reading the input data
train_data = pd.read_csv("traintitanic.csv")
test_data = pd.read_csv("testtitanic.csv")
gender_data = pd.read_csv("gender_submission.csv")
  • Checking the percentage of missing values per column
ddf=train_data.copy()
missing_percentage = ddf.isnull().mean() * 100
print("Percentage of missing values for each column:")
print(missing_percentage)

Percentage of missing values for each column:
PassengerId     0.000000
Survived        0.000000
Pclass          0.000000
Name            0.000000
Sex             0.000000
Age            19.865320
SibSp           0.000000
Parch           0.000000
Ticket          0.000000
Fare            0.000000
Cabin          77.104377
Embarked        0.224467
dtype: float64
  • Plotting the percentage of missing values per column using missingo
import pandas as pd
import missingno as msno
msno.bar(ddf)
Missingo bar plot of missing values per column.
  • Editing, transforming and cleaning the input data
train_data['Embarked'] = train_data.Embarked.fillna(train_data.Embarked.dropna().max())
test_data['Fare'] = test_data.Fare.fillna(test_data.Fare.dropna().mean())
guess_ages = np.zeros((2,3))
guess_ages

combine = [train_data , test_data]

# Converting Sex categories (male and female) to 0 and 1:
for dataset in combine:
    dataset['Sex'] = dataset['Sex'].map( {'female': 1, 'male': 0} ).astype(int)

# Filling missed age feature:

for dataset in combine:
    for i in range(0, 2):
        for j in range(0, 3):
            guess_df = dataset[(dataset['Sex'] == i) & \
                                  (dataset['Pclass'] == j+1)]['Age'].dropna()
            age_guess = guess_df.median()

            # Convert random age float to nearest .5 age
            guess_ages[i,j] = int( age_guess/0.5 + 0.5 ) * 0.5
            
    for i in range(0, 2):
        for j in range(0, 3):
            dataset.loc[ (dataset.Age.isnull()) & (dataset.Sex == i) & (dataset.Pclass == j+1),\
                    'Age'] = guess_ages[i,j]

    dataset['Age'] = dataset['Age'].astype(int)
# Combine SibSp and Parch became the Relatives Column
data = [train_data, test_data]
for dataset in data:
    dataset['relatives'] = dataset['SibSp'] + dataset['Parch']
    dataset.loc[dataset['relatives'] > 0, 'travelled_alone'] = 'No'
    dataset.loc[dataset['relatives'] == 0, 'travelled_alone'] = 'Yes'

#Changing cateogorcial data to 'string' type categorical data
train_data['Pclass'][train_data['Pclass'] == 1] = 'High Class'
train_data['Pclass'][train_data['Pclass'] == 2] = 'Middle Class'
train_data['Pclass'][train_data['Pclass'] == 3] = 'Low Class'

train_data['relatives'][train_data['relatives'] == 0] = 'Travel Alone'
train_data['relatives'][train_data['relatives'] == 1] = 'brothers'
train_data['relatives'][train_data['relatives'] == 2] = 'sisters'
train_data['relatives'][train_data['relatives'] == 3] = 'wife'
train_data['relatives'][train_data['relatives'] == 4] = 'husband'
train_data['relatives'][train_data['relatives'] == 5] = 'father'
train_data['relatives'][train_data['relatives'] == 6] = 'mother'
train_data['relatives'][train_data['relatives'] == 7] = 'descendants'
train_data['relatives'][train_data['relatives'] == 10] = 'others'


#Changing cateogorcial data to 'string' type categorical data
test_data['Pclass'][test_data['Pclass'] == 1] = 'High Class'
test_data['Pclass'][test_data['Pclass'] == 2] = 'Middle Class'
test_data['Pclass'][test_data['Pclass'] == 3] = 'Low Class'

test_data['relatives'][test_data['relatives'] == 0] = 'Travel Alone'
test_data['relatives'][test_data['relatives'] == 1] = 'brothers'
test_data['relatives'][test_data['relatives'] == 2] = 'sisters'
test_data['relatives'][test_data['relatives'] == 3] = 'wife'
test_data['relatives'][test_data['relatives'] == 4] = 'husband'
test_data['relatives'][test_data['relatives'] == 5] = 'father'
test_data['relatives'][test_data['relatives'] == 6] = 'mother'
test_data['relatives'][test_data['relatives'] == 7] = 'descendants'
test_data['relatives'][test_data['relatives'] == 10] = 'others'

# Dropping Unuseful feature because it has too many missed values:

train_data.drop(columns = ["Cabin"] , inplace = True)
train_data.drop(columns = ["Ticket"] , inplace = True)
train_data.drop(columns = ["SibSp"] , inplace = True)
train_data.drop(columns = ["Parch"] , inplace = True)

test_data.drop(columns = ["Cabin"] , inplace = True)
test_data.drop(columns = ["Ticket"] , inplace = True)
test_data.drop(columns = ["SibSp"] , inplace = True)
test_data.drop(columns = ["Parch"] , inplace = True)

#Select numerical and categorical variables
data_num = train_data.select_dtypes(include=['int64','int32' ,'float64'])
data_cat = train_data.select_dtypes(include=['object'])

#Resetting the index for both types of variables
data_num.mode().iloc[0].reset_index(name= 'Mode')
index	Mode
0	PassengerId	1.00
1	Survived	0.00
2	Sex	0.00
3	Age	25.00
4	Fare	8.05
data_cat.mode().iloc[0].reset_index(name= 'Mode')
index	Mode
0	Pclass	Low Class
1	Name	Abbing, Mr. Anthony
2	Embarked	S
3	relatives	Travel Alone
4	travelled_alone	Yes

Descriptive Statistics of Input Data

  • Examining the basic data info with descriptive statistics
train_data.shape
(891, 10)
train_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 10 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   PassengerId      891 non-null    int64  
 1   Survived         891 non-null    int64  
 2   Pclass           891 non-null    object 
 3   Name             891 non-null    object 
 4   Sex              891 non-null    int32  
 5   Age              891 non-null    int32  
 6   Fare             891 non-null    float64
 7   Embarked         891 non-null    object 
 8   relatives        891 non-null    object 
 9   travelled_alone  891 non-null    object 
dtypes: float64(1), int32(2), int64(2), object(5)
memory usage: 62.8+ KB
train_data.head()
Input data table
  • Checking the number of unique values in input columns
for column in train_data.columns:
    num_unique_values = train_data[column].nunique()
    print(f'Number of unique values in {column}: {num_unique_values}')

Number of unique values in PassengerId: 891
Number of unique values in Survived: 2
Number of unique values in Pclass: 3
Number of unique values in Name: 891
Number of unique values in Sex: 2
Number of unique values in Age: 71
Number of unique values in Fare: 248
Number of unique values in Embarked: 3
Number of unique values in relatives: 9
Number of unique values in travelled_alone: 2
  • Checking the descriptive statistics of numerical values
train_data.describe().T
Descriptive statistics of numerical values.
  • Counting the total number of null values per column
train_data.isnull().sum()
PassengerId        0
Survived           0
Pclass             0
Name               0
Sex                0
Age                0
Fare               0
Embarked           0
relatives          0
travelled_alone    0
dtype: int64
  • Calculating the numerical data variance
data_num.var()
PassengerId    66231.000000
Survived           0.236772
Sex                0.228475
Age              177.591301
Fare            2469.436846
dtype: float64
  • Variance-to-STD conversion
np.sqrt(data_num.var())
PassengerId    257.353842
Survived         0.486592
Sex              0.477990
Age             13.326339
Fare            49.693429
dtype: float64
  • Computing the IQR for detecting outliers
print('Inter Quartile Range (IQR):')
data_num.quantile(0.75)-data_num.quantile(0.25)
Inter Quartile Range (IQR):
PassengerId    445.0000
Survived         1.0000
Sex              1.0000
Age             15.0000
Fare            23.0896
dtype: float64

Exploratory Data Analysis (EDA)

  • Examining outliers of numerical values via boxplots
import seaborn as sns 
import matplotlib.pyplot as plt
plt.figure(figsize=(8, 6)) 
sns.boxplot(x=train_data['Age'], color='lightgreen')
Age boxplot
import seaborn as sns 
import matplotlib.pyplot as plt
plt.figure(figsize=(8, 6)) 
sns.boxplot(x=train_data['Fare'], color='lightgreen')
  • Computing and plotting the correlation matrix of numerical values
traincat=train_data[['Survived','Sex','Age','Fare']]
# Compute correlation matrix
corr_matrix = traincat.corr()
# Generate heatmap
plt.figure(figsize=(10, 6))
sns.heatmap(corr_matrix, annot=False, cmap='coolwarm', square=True)
Correlation matrix of numerical values
  • Portraying the violin plots of numerical values
# Features of interest
selected_features = ['Sex','Age','Fare']
# Plotting violin plots for selected features
for feature in selected_features:
    plt.figure(figsize=(8, 6))
    sns.violinplot(x='Survived', y=feature, data=traincat, hue='Survived', palette='Blues', inner='quartile', legend=False)
    plt.title(f'Violin Plot for {feature} by Survival Status')
    plt.xlabel('Survival Status')
    plt.ylabel(feature)
    plt.show()
Violin plot for sex vs survival status
Violin plot for age vs survival status
Violin plot for fare vs survival status
  • Looking at the pairplot of numerical features
sns.pairplot(traincat, hue='Survived')
A pairplot of numerical features
  • Looking at the Fare vs Age scatterplot
plt.figure(figsize=(7, 6))
sns.scatterplot(x='Age', y='Fare', data=traincat, hue='Survived', palette='pastel')
Fare vs Age scatterplot
  • Comparing it to the Fare vs Age jointplots
sns.jointplot(x='Age', y='Fare', data=traincat,color='darkblue')
plt.show()
Fare vs Age jointplots - histograms
sns.jointplot(x = 'Age',  y = 'Fare' , data =traincat,  kind = 'kde', fill = True)
plt.show()
Fare vs Age jointplots - smooth contours
# Group the data frame by classes in the pclass column, and count the number of occurrences of each group.
df= pd.read_csv('train.csv')
pclass_count = df.groupby('Pclass')['Pclass'].count()
pclass_count
Pclass
1    216
2    184
3    491
Name: Pclass, dtype: int64

plt.figure(figsize=(7,7))
plt.title('Grouped by pclass')
plt.pie(pclass_count.values, labels=['Class 1', 'Class 2', 'Class 3'], 
 autopct='%1.1f%%', textprops={'fontsize':13})
plt.show()
Pie chart of Pclass distribution
# Group the data frame by classes in the pclass column, and count the number of occurrences of each group.
sex_count = df.groupby('Sex')['Sex'].count()
sex_count
Sex
female    314
male      577
Name: Sex, dtype: int64

plt.figure(figsize=(7,7))
plt.title('Grouped by gender')
plt.pie(sex_count.values, labels=['female', 'male'], 
 autopct='%1.1f%%', textprops={'fontsize':13})
plt.show()
Pie chart of gender distribution
# Group the data frame by classes in the Embarked column, and count the number of occurrences of each group.
embark_count = df.groupby('Embarked')['Embarked'].count()
embark_count
Embarked
C    168
Q     77
S    644
Name: Embarked, dtype: int64
plt.figure(figsize=(7,7))
plt.title('Grouped by embarkation')
plt.pie(embark_count.values, labels=['Cherbourg', 'Queenstown', 'Southampton'], 
        autopct='%1.1f%%', textprops={'fontsize':13})
plt.show()
Pie chart of embarkation distribution

DataPrep AutoEDA

  • Let’s discover DataPrep to make EDA easier in Python:
import dataprep
from dataprep.eda import *
from dataprep.datasets import load_dataset
from dataprep.eda import plot, plot_correlation, plot_missing, plot_diff, create_report

df = load_dataset("titanic")
plot(df)
DataPrep.EDA Report
Stats and Insights
Dataset Statistics
Number of Variables	12
Number of Rows	891
Missing Cells	866
Missing Cells (%)	8.1%
Duplicate Rows	0
Duplicate Rows (%)	0.0%
Total Size in Memory	315.7 KB
Average Row Size in Memory	362.8 B
Variable Types	
Numerical: 3
Categorical: 9
Dataset Insights
PassengerId is uniformly distributed	Uniform
Age has 177 (19.87%) missing values	Missing
Cabin has 687 (77.1%) missing values	Missing
Fare is skewed	Skewed
Name has a high cardinality: 891 distinct values	High Cardinality
Ticket has a high cardinality: 681 distinct values	High Cardinality
Cabin has a high cardinality: 147 distinct values	High Cardinality
Survived has constant length 1	Constant Length
Pclass has constant length 1	Constant Length
SibSp has constant length 1	Constant Length
Dataset Insights
Parch has constant length 1	Constant Length
Embarked has constant length 1	Constant Length
Name has all distinct values	Unique
1
2
Number of plots per page:

PassengerId
'hist.bins': 50
Number of bins in the histogram
'hist.yscale': 'linear'
Y-axis scale ("linear" or "log")
'hist.color': '#aec7e8'

etc.
  • The content of the titanic-analysis-report.html file is as follows:
Boxplots Pclass vs Age vs Fare
KDE of Fare
Titanic dataset input table

Some observations

  • The maximum and minimum age for Pclass 1 > Pclass 2 > Pclass 3. This is understandable because it takes years to accumulate wealth.
  • The average age of those that survived in all the classes is lower than those that did not survive.
  • Passengers that survived in the Pclass 1 paid the highest fare. Which concluded the higher your fare in the Pclass 1 the higher your rate of survival.

SweetViz AutoEDA

  • Conducting the second round of AutoEDA by importing the sweetviz library
# importing sweetviz
import sweetviz as sv
#analyzing the dataset
advert_report = sv.analyze(train_data)
#display the report
advert_report.show_html('sweetviz_titanic.html')
  • Checking the contents of the above html report
SweetViz report columns 1-4
SweetViz report columns 5-9
SweetViz report columns 10-12
Correlation matrix

Associations:

[Only including dataset “DataFrame”]
■ Squares are categorical associations (uncertainty coefficient & correlation ratio) from 0 to 1. The uncertainty coefficient is asymmetrical, (i.e. ROW LABEL values indicate how much they PROVIDE INFORMATION to each LABEL at the TOP).

• Circles are the symmetrical numerical correlations (Pearson’s) from -1 to 1. The trivial diagonal is intentionally left blank for clarity.

ML Training, Testing & Validation

  • Applying the algorithm to prepare the input dataset for ML training & testing
#Feature Engineering
train_data['Pclass'] = train_data['Pclass'].fillna(train_data['Pclass'].mode()[0])
dummies_Pclass = pd.get_dummies(train_data['Pclass'],prefix='Pclass')
train_data = pd.concat([train_data, dummies_Pclass], axis=1)

train_data['Embarked'] = train_data['Embarked'].fillna(train_data['Embarked'].mode()[0])
dummies_Embarked = pd.get_dummies(train_data['Embarked'],prefix='Embarked')
train_data = pd.concat([train_data, dummies_Embarked], axis=1)

train_data['travelled_alone'] = train_data['travelled_alone'].fillna(train_data['travelled_alone'].mode()[0])
dummies_travel = pd.get_dummies(train_data['travelled_alone'],prefix='travelled_alone')
train_data = pd.concat([train_data, dummies_travel], axis=1)
train_data.drop(['Pclass','Name','Fare', 'Embarked', 'relatives', 'travelled_alone'], axis=1,inplace=True)
Modelling_Train = train_data.iloc[:891,]
Modelling_Test = train_data.iloc[891:,]
X = Modelling_Train.drop(['Survived'],axis=1)
y = Modelling_Test['Survived']
              
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(train_data.drop('Survived', 1), train_data['Survived'], test_size = .3)
  • Importing the relevant libraries and introducing the ML modeling functions
import scipy.stats as stats
from sklearn.pipeline import Pipeline
from sklearn import metrics
from sklearn.metrics import confusion_matrix,classification_report,accuracy_score,roc_auc_score
from sklearn.model_selection import RandomizedSearchCV,GridSearchCV,train_test_split,cross_val_score
from sklearn.preprocessing import MinMaxScaler,StandardScaler
from scipy.stats import randint
from statsmodels.stats.outliers_influence import variance_inflation_factor

from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
import sklearn.metrics as metrics
from sklearn.metrics import precision_score, recall_score, accuracy_score
from sklearn.metrics import roc_curve, auc, roc_auc_score

def models(models):
  results = pd.DataFrame({'accuracy_train':[],'accuracy_test':[],
                          'recall_train':[], 'recall_test':[],
                          'precision_train':[],'precision_test':[],
                          'AUC_train': [], 'AUC_test': [], 
                          'f1_score_train':[], 'f1_score_test':[]})
  

  for model in models:
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)


    results = results.append({'accuracy_train': accuracy_score(y_train, model.predict(X_train)),
                              'accuracy_test': accuracy_score(y_test, model.predict(X_test)),
                              'recall_train': recall_score(y_train, model.predict(X_train), average='macro'),
                              'recall_test': recall_score(y_test, model.predict(X_test), average='macro'),
                              'precision_train': precision_score(y_train, model.predict(X_train), average='macro'),
                              'precision_test': precision_score(y_test, model.predict(X_test), average='macro'),
                              'AUC_train': roc_auc_score(y_train, model.predict(X_train)),
                              'AUC_test': roc_auc_score(y_test, model.predict(X_test)),
                              'f1_score_train': metrics.f1_score(y_train, model.predict(X_train)),
                              'f1_score_test':  metrics.f1_score(y_test, model.predict(X_test))
                              }, ignore_index=True)
    

   
  
  results['model'] = ['Logistic Regression', 'KNN','SVM','Naive Bayes', 'Decision Tree', 'Random Forest']
  return results
  • Running the ML modelling experiment
resultmodels=models([LogisticRegression(), KNeighborsClassifier(), SVC(),GaussianNB(), DecisionTreeClassifier(), RandomForestClassifier()])
  • Examining the output table resultmodels
accuracy_train	accuracy_test	recall_train	recall_test	precision_train	precision_test	AUC_train	AUC_test	f1_score_train	f1_score_test	model
0	0.792937	0.820896	0.774484	0.799549	0.783857	0.813683	0.774484	0.799549	0.721382	0.750000	Logistic Regression
1	0.707865	0.522388	0.666822	0.454378	0.695893	0.440418	0.666822	0.454378	0.562500	0.219512	KNN
2	0.613162	0.623134	0.500000	0.500000	0.306581	0.311567	0.500000	0.500000	0.000000	0.000000	SVM
3	0.768860	0.742537	0.761742	0.726893	0.756783	0.726022	0.761742	0.726893	0.709677	0.660099	Naive Bayes
4	1.000000	0.731343	1.000000	0.706172	1.000000	0.713671	1.000000	0.706172	1.000000	0.628866	Decision Tree
5	1.000000	0.776119	1.000000	0.740143	1.000000	0.771281	1.000000	0.740143	1.000000	0.666667	Random Forest
  • Plotting the output table resultmodels
#Accuracy
import numpy as np 
import matplotlib.pyplot as plt 
 
# set width of bar 
barWidth = 0.25
fig = plt.subplots(figsize =(10, 7)) 
 
# set height of bar 
IT = resultmodels['accuracy_train']
ECE = resultmodels['accuracy_test']

# Set position of bar on X axis 
br1 = np.arange(6) 
br2 = [x + barWidth for x in br1] 
 
# Make the plot
plt.bar(br1, IT, color ='r', width = barWidth, 
        edgecolor ='grey', label ='accuracy_train') 
plt.bar(br2, ECE, color ='g', width = barWidth, 
        edgecolor ='grey', label ='accuracy_test') 

# Adding Xticks 
plt.xlabel('Models', fontweight ='bold', fontsize = 18) 
plt.ylabel('Accuracy', fontweight ='bold', fontsize = 18) 
plt.xticks([r + barWidth for r in range(len(IT))], 
        ['LR', 'KNN', 'SVC', 'GNB', 'DT','RF'])
 
plt.legend()
plt.show() 
Accuracy for train/test data
#Recall
import numpy as np 
import matplotlib.pyplot as plt 
 
# set width of bar 
barWidth = 0.25
fig = plt.subplots(figsize =(10, 7)) 
 
# set height of bar 
IT = resultmodels['recall_train']
ECE = resultmodels['recall_test']

# Set position of bar on X axis 
br1 = np.arange(6) 
br2 = [x + barWidth for x in br1] 
 
# Make the plot
plt.bar(br1, IT, color ='r', width = barWidth, 
        edgecolor ='grey', label ='recall_train') 
plt.bar(br2, ECE, color ='g', width = barWidth, 
        edgecolor ='grey', label ='recall_test') 

# Adding Xticks 
plt.xlabel('Models', fontweight ='bold', fontsize = 18) 
plt.ylabel('Recall', fontweight ='bold', fontsize = 18) 
plt.xticks([r + barWidth for r in range(len(IT))], 
        ['LR', 'KNN', 'SVC', 'GNB', 'DT','RF'])
 
plt.legend()
plt.show() 
Recall for train/test data
#F1-Score
import numpy as np 
import matplotlib.pyplot as plt 
 
# set width of bar 
barWidth = 0.25
fig = plt.subplots(figsize =(10, 7)) 
 
# set height of bar 
IT = resultmodels['f1_score_train']
ECE = resultmodels['f1_score_test']

# Set position of bar on X axis 
br1 = np.arange(6) 
br2 = [x + barWidth for x in br1] 
 
# Make the plot
plt.bar(br1, IT, color ='r', width = barWidth, 
        edgecolor ='grey', label ='f1_score_train') 
plt.bar(br2, ECE, color ='g', width = barWidth, 
        edgecolor ='grey', label ='f1_score_test') 

# Adding Xticks 
plt.xlabel('Models', fontweight ='bold', fontsize = 18) 
plt.ylabel('F1-Score', fontweight ='bold', fontsize = 18) 
plt.xticks([r + barWidth for r in range(len(IT))], 
        ['LR', 'KNN', 'SVC', 'GNB', 'DT','RF'])
 
plt.legend()
plt.show() 
F1-score for train/test data
#AUC score
import numpy as np 
import matplotlib.pyplot as plt 
 
# set width of bar 
barWidth = 0.25
fig = plt.subplots(figsize =(10, 7)) 
 
# set height of bar 
IT = resultmodels['AUC_train']
ECE = resultmodels['AUC_test']

# Set position of bar on X axis 
br1 = np.arange(6) 
br2 = [x + barWidth for x in br1] 
 
# Make the plot
plt.bar(br1, IT, color ='r', width = barWidth, 
        edgecolor ='grey', label ='AUC_train') 
plt.bar(br2, ECE, color ='g', width = barWidth, 
        edgecolor ='grey', label ='AUC_test') 

# Adding Xticks 
plt.xlabel('Models', fontweight ='bold', fontsize = 18) 
plt.ylabel('AUC Score', fontweight ='bold', fontsize = 18) 
plt.xticks([r + barWidth for r in range(len(IT))], 
        ['LR', 'KNN', 'SVC', 'GNB', 'DT','RF'])
 
plt.legend()
plt.show()  
AUC score for train/test data

Full ML Classification Report

  • Selecting the best performing ML algorithm to run hyperparameter optimization (HPO)
rf = RandomForestClassifier()
rf.fit(X_train, y_train)

y_pred = rf.predict(X_test)

scores = cross_val_score(rf, X_train, y_train, cv=5)
print('Cross-Validation Accuracy Scores', scores)
print('Cross-Validation Accuracy Scores', scores.mean())
Cross-Validation Accuracy Scores [0.712      0.8        0.776      0.77419355 0.7983871 ]
Cross-Validation Accuracy Scores 0.7721161290322581
  • Running Random Forest HPO
from sklearn.model_selection import GridSearchCV
from scipy.stats import uniform
import numpy as np
%timeit

param_grid = [
{'criterion':['gini', 'entropy'],'n_estimators': [25, 50, 100], 'max_features': ['auto', 'log2'], 
'max_depth': [25, 50, 100],
'min_samples_split':[2, 5, 10], 'min_samples_leaf':[4, 8, 16], 'bootstrap': [True, False]}
 ]

rf = RandomForestClassifier(random_state=42)

grid_search_forest = GridSearchCV(rf, param_grid, cv=5)
grid_search_forest.fit(X_train, y_train)

grid_search_forest.best_estimator_
RandomForestClassifier(max_depth=25, max_features='log2', min_samples_leaf=8,
                       n_estimators=25, random_state=42)
  • Running Random Forest (RF) with optimized parameters
rf = RandomForestClassifier(bootstrap=False, criterion='entropy', max_depth=25,
                       min_samples_leaf=16, n_estimators=50, random_state=42)
rf.fit(X_train,y_train)
rf_predicted = rf.predict(X_test)
rf_conf_matrix = confusion_matrix(y_test, rf_predicted)
rf_acc_score = accuracy_score(y_test, rf_predicted)

print("confussion matrix")
print(rf_conf_matrix)
print("\n")
print("Accuracy of Random Forest:",rf_acc_score*100,'\n')
print(classification_report(y_test,rf_predicted))

confussion matrix
[[159  12]
 [ 30  67]]


Accuracy of Random Forest: 84.32835820895522 

              precision    recall  f1-score   support

           0       0.84      0.93      0.88       171
           1       0.85      0.69      0.76        97

    accuracy                           0.84       268
   macro avg       0.84      0.81      0.82       268
weighted avg       0.84      0.84      0.84       268
  • Plotting the RF feature importance score
feat_importances = pd.Series(rf.feature_importances_, index=X.columns)
ax = feat_importances.nlargest(10).plot(kind='barh')
ax.invert_yaxis()
plt.xlabel('score')
plt.ylabel('feature')
plt.title('feature importance score')
Random Forest feature importance score
import scikitplot as skplt

import sklearn

from sklearn.model_selection import train_test_split

from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor, GradientBoostingClassifier, ExtraTreesClassifier
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

import matplotlib.pyplot as plt

import sys
import warnings
warnings.filterwarnings("ignore")

print("Scikit Plot Version : ", skplt.__version__)
print("Scikit Learn Version : ", sklearn.__version__)
print("Python Version : ", sys.version)

%matplotlib inline
Scikit Plot Version :  0.3.7
Scikit Learn Version :  1.4.1.post1
Python Version :  3.11.5 | packaged by Anaconda, Inc. | (main, Sep 11 2023, 13:26:23) [MSC v.1916 64 bit (AMD64)]

  • RF Classification Learning Curve
skplt.estimators.plot_learning_curve(rf, X_train, y_train,
                                     cv=7, shuffle=True, scoring="accuracy",
                                     n_jobs=-1, figsize=(6,4), title_fontsize="large", text_fontsize="large",
                                     title="RF Classification Learning Curve");
RF Classification Learning Curve
  • RF Normalized Confusion Matrix
skplt.metrics.plot_confusion_matrix(y_test, rf_predicted,
                                    normalize=True,
                                    title="RF Confusion Matrix",
                                    cmap="Purples"
                                    );
RF Normalized Confusion Matrix
  • RF ROC Curve
y_test_probs = rf.predict_proba(X_test)

skplt.metrics.plot_roc_curve(y_test, y_test_probs,
                       title="RF ROC Curve", figsize=(12,6));
RF ROC Curve
  • RF Precision-Recall Curve
skplt.metrics.plot_precision_recall_curve(y_test, y_test_probs,
                       title="RF Precision-Recall Curve", figsize=(12,6));
RF Precision-Recall Curve
  • KS Statistic Plot
y_probas = rf.predict_proba(X_test)

skplt.metrics.plot_ks_statistic(y_test, y_probas, figsize=(10,6));
KS Statistic Plot
  • RF Cumulative Gains Curve
skplt.metrics.plot_cumulative_gain(y_test, y_probas, figsize=(10,6));
RF Cumulative Gains Curve
  • RF Lift Curve
skplt.metrics.plot_lift_curve(y_test, y_probas, figsize=(10,6));
RF Lift Curve
  • Trained ML calibration plots
lr_probas = LogisticRegression().fit(X_train, y_train).predict_proba(X_test)
rf_probas = RandomForestClassifier().fit(X_train, y_train).predict_proba(X_test)
gb_probas = GradientBoostingClassifier().fit(X_train, y_train).predict_proba(X_test)
et_scores = ExtraTreesClassifier().fit(X_train, y_train).predict_proba(X_test)

probas_list = [lr_probas, rf_probas, gb_probas, et_scores]
clf_names = ['Logistic Regression', 'Random Forest', 'Gradient Boosting', 'Extra Trees Classifier']
skplt.metrics.plot_calibration_curve(y_test,
                                     probas_list,
                                     clf_names, n_bins=15,
                                     figsize=(12,6)
                                     );
Trained ML calibration plots
  • RF Kmeans Elbow Plot
skplt.cluster.plot_elbow_curve(KMeans(random_state=1),
                               X_train,
                               cluster_ranges=range(2, 20),
                               figsize=(8,6));
RF Kmeans Elbow Plot
  • RF Silhouette Analysis
kmeans = KMeans(n_clusters=10, random_state=1)
kmeans.fit(X_train, y_train)
cluster_labels = kmeans.predict(X_test)
skplt.metrics.plot_silhouette(X_test, cluster_labels,
                              figsize=(8,6));
RF Silhouette Analysis
  • RF PCA Component Explained Variances
pca = PCA(random_state=1)
pca.fit(X_train)

skplt.decomposition.plot_pca_component_variance(pca, figsize=(8,6));
RF PCA Component Explained Variances
skplt.decomposition.plot_pca_2d_projection(pca, X_train, y_train,
                                           figsize=(10,10),
                                           cmap="tab10");
RF PCA 2-D Projection
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import sklearn
import yellowbrick

pd.set_option("display.max_columns", 35)

import warnings
warnings.filterwarnings("ignore")

from yellowbrick.classifier import ClassificationReport

target_names=['0','1']

viz = ClassificationReport(rf,
                           classes=target_names,
                           support=True,
                           fig=plt.figure(figsize=(8,6)))

viz.fit(X_train, y_train)

viz.score(X_test, y_test)

viz.show();

yellowbrick RF Classification Report
  • RF Class Prediction Error
from yellowbrick.classifier import ClassPredictionError

viz = ClassPredictionError(rf,
                           classes=target_names,
                           fig=plt.figure(figsize=(9,6)))

viz.fit(X_train, y_train)

viz.score(X_test, y_test)

viz.show();
RF Class Prediction Error
  • yellowbrick RF confusion matrix
from yellowbrick.classifier import ConfusionMatrix
from sklearn.ensemble import RandomForestClassifier

rfmodel=RandomForestClassifier(criterion='entropy', max_depth=25, max_features='log2',
                       min_samples_leaf=16, n_estimators=25, random_state=42)

visualizer = ConfusionMatrix(rfmodel, classes=target_names)

visualizer.fit(X_train, y_train)

visualizer.score(X_test, y_test)

visualizer.show();
yellowbrick RF confusion matrix
  • RF Discrimination Threshold
from yellowbrick.classifier import DiscriminationThreshold


viz = DiscriminationThreshold(RandomForestClassifier(random_state=123),
             classes=target_names,
             cv=0.2,
             fig=plt.figure(figsize=(9,6)))

viz.fit(X_train, y_train)

viz.score(X_test, y_test)

viz.show();
RF Discrimination Threshold
  • For those classification problems that have a severe class imbalance, the default threshold of 0.5 can result in poor performance. We can see that the RF optimal threshold as 0.45 (compared to the default of 0.5) that achieves an F-Measure of about 0.7 for class 1.

SHAP Interpretation

  • Let’s analyze the SHAP (Shapley Additive explanations) values to compare feature contributions to the RF model.
#SHAP (Shapley Additive explanation)
import shap 
feature_names = X.columns.values.tolist()
X_test_df = pd.DataFrame(data=X_test, columns=feature_names)

explainer = shap.TreeExplainer(grid_search_forest.best_estimator_)
shap_values = explainer.shap_values(X_test_df)

shap.summary_plot(shap_values[1], X_test_df, plot_type="bar")
SHAP average impact on model output magnitude.
shap.summary_plot(shap_values[1], X_test_df)
SHAP summary plot
  • This plot shows SHAP values for top features for every passenger in the training data. For each feature, the associated SHAP value is plotted on the x-axis. Each point represents a single model decision. The more influential a feature is, the more negative or positive it’s associated SHAP value. A SHAP value of 0 means that feature did nothing to move the decision away from the naive/reference value. In the plot, we see that the Sex feature is highly influential with strong negative and positive SHAP values for many predicted outcomes.
def plot_shap(model, passenger):

    explainer = shap.TreeExplainer(model)
    shap_values = explainer.shap_values(passenger)
    shap.initjs()
    return shap.force_plot(explainer.expected_value[1], shap_values[1], passenger)

passenger = X_test_df.iloc[1,:].astype(float)
plot_shap(grid_search_forest.best_estimator_, passenger)
passenger = X_test_df.iloc[2,:].astype(float)
plot_shap(grid_search_forest.best_estimator_, passenger)
passenger = X_test_df.iloc[5,:].astype(float)
plot_shap(grid_search_forest.best_estimator_, passenger)
shap_val = explainer.shap_values(X_test)
shap.summary_plot(shap_val, X_test)

SHAP mean values for classes 0 and 1.
  • This plot demonstrates the model feature importance ranked by mean absolute SHAP. It shows how important each feature was on average over the passenger population.
shap_values = explainer.shap_values(X_train.iloc[:50])
shap.force_plot(explainer.expected_value[1], shap_values[1], X_test_df.iloc[:50])
  • Below is a gallery of the more detailed SHAP plots comparing the most dominant features, viz. Sex, Age, and Pclass
SHAP sex v Pclass lower scale
SHAP sex v Pclass higher scale
SHAP sex v Pclass lower & higher scales
  • Explanations for the predicted outcomes of 3 passengers. The features x for each passenger are displayed followed by an additive force layout, which graphically conveys prominent SHAP values that influenced each decision. The term f(x) is the evaluation of the trained model f on the features x, which is the predicted outcome for the corresponding passenger. The base value serves as a reference for the force bars, and is calculated as the average predicted outcome over the training data. Red (blue) force bars represent features that have increased (decreased) the passengers chance for survival, relative to the base value. The wider the force bars are, the more influential the corresponding feature. 
SHAP sample order by similarity: Sex.
SHAP age vs Sex high and Pclass low.
SHAP age vs Sex high and Pclass low with 2 averaged samples.
  • In the above plots, positive SHAP value means positive impact on prediction, leading the model to predict 1 (e.g. Passenger survived the Titanic). Negative SHAP value means negative impact, leading the model to predict 0 (e.g. passenger didn’t survive the Titanic).

Conclusions

  • The ultimate aim of this post was to use ML to create a model that predicts which passengers survived the Titanic shipwreck. Apart from elements of luck involved in surviving, it seems some groups of people were more likely to survive than others.
  • Specifically, we have addressed the challenge of building a predictive model that answers the question: “what sorts of people were more likely to survive?” using passenger data such as age, gender, socio-economic class, etc.
  • The Goal: Predict whether a passenger survived or not. 0 for not surviving, 1 for surviving.
  • First and foremost, we have tried to get preliminary insights into the underlying dataset structure, and get familiar with it using descriptive statistics, so we can create more efficient models. This step is important because a lot of the beginners are often confused as to how to get started.
  • Next, we have carried out the comprehensive Exploratory Data Analysis (EDA) to uncover underlying patterns, grasp the dataset’s structure, and identify any potential anomalies or relationships between variables. 
  • EDA is basically used to identify a human error, missing values, or outliers. It extracts useful variables and removes useless variables.
  • We have applied two additional rounds of AutoEDA to the input data using DataPrep and sweetviz libraries. DataPrep is a fully automated data analysis for visually exploring, cleaning, and preparing our data for ML. SweetViz is an open-source Python library that generates beautiful, high-density visualizations to perform EDA with just two lines of code. 
  • By performing the aforementioned 3-step EDA, we have checked the quality of the data. This QC visualization and statistical analysis process ensures that further analysis is based on accurate and insightful information, thereby reducing the likelihood of errors in subsequent stages.
  • We have trained and compared the following 5 ML classifiers: LogisticRegression(), KNeighborsClassifier(), SVC(),GaussianNB(), DecisionTreeClassifier(), and RandomForestClassifier().
  • It appears that 84.3% ACC Random Forest (RF) Classifier after HPO-type tuning is the best performing model.
  • We have compiled the full RF binary classification report: First, the report shows a few standard quality metrics such as Accuracy, Precision, Recall, AUC, and F-1 Score (cf. yellowbrick classification table). Secondly, the reports includes a great variety of ML performance plots such as confusion matrix, feature importance, learning curves, ROC curve, Precision-Recall Curve, KS Statistic Plot, Cumulative Gains Curve, Lift Curve, trained ML calibration plots, Kmeans Elbow Plot, Silhouette Analysis, PCA Component Explained Variances, PCA 2-D Projection, Class Prediction Error, and Discrimination Threshold.
  • We have explored SHAP values to discover relationships between model features and ML predictions. This helps debug potential biases, identify data issues, and justify the model’s decisions.

Explore More

References



Discover more from Our Blogs

Subscribe to get the latest posts sent to your email.

Leave a comment

Discover more from Our Blogs

Subscribe now to keep reading and get access to the full archive.

Continue reading