Health Insurance Cross Sell Prediction with ML Model Tuning & Validation


  • It is no secret that AI carries enormous transformational potential for industries and society. Within the insurance sector, insurers already use AI to improve customer service, increase efficiency, and fight fraud more effectively.
  • Cross-selling is the sale of additional products or services to existing customers. Done well, it can increase customer lifetime value and reduce churn.
  • In this walkthrough, we discuss industry best practices for insurance cross-sell prediction by combining Machine Learning (ML) with model tuning (aka hyperparameter optimization).
  • Predicting medical insurance costs is a very important task in healthtech. As insurance providers strive to offer competitive and personalized healthcare plans, it is essential to understand the factors influencing premium prices and to build models that can predict them accurately.
  • Scope: input data preparation, Exploratory Data Analysis (EDA), Feature Engineering (FE), ML model training, testing and cross-validation, parameter optimization, and a final classification report with relevant QC metrics.
  • About the dataset: the client is an insurance company that has provided health insurance to its customers. They now need AI help in building a model to predict whether policyholders (customers) from the past year will also be interested in the Vehicle Insurance the company provides.

Table of Contents

  1. ML Tuning Pipelines
  2. HGBM Model Tuning, Validation & Interpretation
  3. XGB Model Validation
  4. IQR Filtering & RF Modeling
  5. Conclusions

ML Tuning Pipelines

  • Setting the working directory YOURPATH
import os
os.chdir('YOURPATH')
os.getcwd()
  • Reading the input dataset and looking at the basic structure
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

df = pd.read_csv('train.csv')

#Descriptive statistics of continuous variables 
df.describe().T
Descriptive statistics of continuous variables
df.describe(include=object).T

                 count  unique  top       freq
Gender          381109       2  Male      206089
Vehicle_Age     381109       3  1-2 Year  200316
Vehicle_Damage  381109       2  Yes       192413
  • Check for possible nulls in the dataset
df.isnull().sum()
id                      0
Gender                  0
Age                     0
Driving_License         0
Region_Code             0
Previously_Insured      0
Vehicle_Age             0
Vehicle_Damage          0
Annual_Premium          0
Policy_Sales_Channel    0
Vintage                 0
Response                0
dtype: int64

# Categorical variables 
for i in df.select_dtypes(include=['object']).columns:
    print(df[i].value_counts())

Male      206089
Female    175020
Name: Gender, dtype: int64
1-2 Year     200316
< 1 Year     164786
> 2 Years     16007
Name: Vehicle_Age, dtype: int64
Yes    192413
No     188696
Name: Vehicle_Damage, dtype: int64
  • Data preparation
df_trimmed = df.loc[:,['Gender','Age','Driving_License','Previously_Insured','Vehicle_Age','Vehicle_Damage','Annual_Premium','Vintage','Response']]
#Drop null values and create dummy variables 
df_final = pd.get_dummies(df_trimmed).dropna()

df_final.columns
Index(['Age', 'Driving_License', 'Previously_Insured', 'Annual_Premium',
       'Vintage', 'Response', 'Gender_Female', 'Gender_Male',
       'Vehicle_Age_1-2 Year', 'Vehicle_Age_< 1 Year', 'Vehicle_Age_> 2 Years',
       'Vehicle_Damage_No', 'Vehicle_Damage_Yes'],
      dtype='object')

df_final.Response.value_counts()
0    334399
1     46710
Name: Response, dtype: int64
#Create train test split 

from sklearn.model_selection import train_test_split
X = df_final.drop('Response', axis =1)
y = df_final.loc[:,['Response']]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

#balance the data (SMOTE) 
from imblearn.over_sampling import SMOTE 
smote = SMOTE(sampling_strategy =1)

X_train, y_train = smote.fit_resample(X_train,y_train)
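  • As a quick sanity check (illustrative, not part of the original run), we can confirm that SMOTE produced a perfectly balanced training set
# Class counts after resampling; sampling_strategy=1 targets a 1:1 ratio
print(y_train.value_counts())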
  • Create the baseline ML model using the Naïve Bayes algorithm
from sklearn.model_selection import cross_val_score

#import Naive Bayes Classifier 
from sklearn.naive_bayes import GaussianNB

#create classifier object
nb = GaussianNB()

#run cv for NB classifier
from sklearn.metrics import classification_report

nb_accuracy = cross_val_score(nb,X_train,y_train.values.ravel(), cv=3, scoring ='accuracy')
nb_f1 = cross_val_score(nb,X_train,y_train.values.ravel(), cv=3, scoring ='f1')

print('nb_accuracy: ' + str(nb_accuracy))
print('nb F1 Score: ' + str(nb_f1))
print('nb_accuracy_avg: ' + str(nb_accuracy.mean()) + '  |  nb_f1_avg: ' + str(nb_f1.mean()))

nb_accuracy: [0.80302801 0.81294581 0.81078189]
nb F1 Score: [0.82865318 0.83993805 0.83826875]
nb_accuracy_avg: 0.8089185690386936  |  nb_f1_avg: 0.8356199930577949
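  • One caveat: because SMOTE was applied before cross-validation, synthetic samples in a fold's training portion can closely resemble its validation portion, which tends to inflate CV scores. A leakage-free sketch, assuming X_train0/y_train0 denote the split before resampling (hypothetical names), applies SMOTE inside each fold via an imblearn pipeline
from imblearn.pipeline import make_pipeline as make_imb_pipeline

# SMOTE is refit on each fold's training portion only, so no synthetic
# points leak into the fold's validation portion.
nb_safe = make_imb_pipeline(SMOTE(sampling_strategy=1, random_state=42), GaussianNB())
nb_f1_safe = cross_val_score(nb_safe, X_train0, y_train0.values.ravel(), cv=3, scoring='f1')
print('leakage-free nb_f1_avg: ' + str(nb_f1_safe.mean()))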
  • Comparison with other ML algorithms
#Model Comparison & Selection

## Decision Tree
from sklearn.tree import DecisionTreeClassifier
dt = DecisionTreeClassifier(random_state =32)

dt_accuracy = cross_val_score(dt,X_train,y_train.values.ravel(), cv=3, scoring ='accuracy')
dt_f1 = cross_val_score(dt,X_train,y_train.values.ravel(), cv=3, scoring ='f1')

print('dt_accuracy: ' + str(dt_accuracy))
print('dt F1 Score: ' + str(dt_f1))
print('dt_accuracy_avg: ' + str(dt_accuracy.mean()) + '  |  dt_f1_avg: ' + str(dt_f1.mean()) + '\n')


## Logistic Regression
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression(random_state=32, max_iter = 2000, class_weight = 'balanced')

lr_accuracy = cross_val_score(lr,X_train,y_train.values.ravel(), cv=3, scoring ='accuracy')
lr_f1 = cross_val_score(lr,X_train,y_train.values.ravel(), cv=3, scoring ='f1')

print('lr_accuracy: ' + str(lr_accuracy))
print('lr F1 Score: ' + str(lr_f1))
print('lr_accuracy_avg: ' + str(lr_accuracy.mean()) + '  |  lr_f1_avg: ' + str(lr_f1.mean()) + '\n')


## KNN 
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline, Pipeline

knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3))
knn_accuracy = cross_val_score(knn,X_train,y_train.values.ravel(), cv=3, scoring ='accuracy')
knn_f1 = cross_val_score(knn,X_train,y_train.values.ravel(), cv=3, scoring ='f1')

print('knn_accuracy: ' + str(knn_accuracy))
print('knn F1 Score: ' + str(knn_f1))
print('knn_accuracy_avg: ' + str(knn_accuracy.mean()) + '  |  knn_f1_avg: ' + str(knn_f1.mean()))

dt_accuracy: [0.82508811 0.8866345  0.8839835 ]
dt F1 Score: [0.81301514 0.89198078 0.88957392]
dt_accuracy_avg: 0.8652353730820522  |  dt_f1_avg: 0.8648566148576409

lr_accuracy: [0.79171499 0.8250713  0.56066141]
lr F1 Score: [0.80549422 0.84384176 0.60709078]
lr_accuracy_avg: 0.7258159037084625  |  lr_f1_avg: 0.752142250293187

knn_accuracy: [0.80213149 0.87422325 0.87273766]
knn F1 Score: [0.78854111 0.88014651 0.87883444]
knn_accuracy_avg: 0.849697465521477  |  knn_f1_avg: 0.8491740207579683
  • Let’s optimize the KNN model by changing the K-parameter
#Manual Parameter Tuning
#Here we will loop through and see which value of k performs the best. 

for i in range(1,7):
    knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=i))
    knn_f1 = cross_val_score(knn,X_train,y_train.values.ravel(), cv=3, scoring ='f1')
    print('K ='+(str(i)) + (': ') + str(knn_f1.mean()))

K =1: 0.8419050988542729
K =2: 0.8177949676361784
K =3: 0.8491740207579683
K =4: 0.8352706032017148
K =5: 0.8504233464794012
K =6: 0.8405313324705315
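  • The same sweep can be written with GridSearchCV over the pipeline step's parameter (an equivalent sketch of the manual loop above)
from sklearn.model_selection import GridSearchCV

# make_pipeline names steps after the lowercased class, so the KNN parameter
# is addressed as 'kneighborsclassifier__n_neighbors'.
knn_pipe = make_pipeline(StandardScaler(), KNeighborsClassifier())
gs_knn = GridSearchCV(knn_pipe, {'kneighborsclassifier__n_neighbors': list(range(1, 7))},
                      cv=3, scoring='f1')
gs_knn.fit(X_train, y_train.values.ravel())
print(gs_knn.best_params_, gs_knn.best_score_)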
  • Implementing the Decision Tree model tuning using RandomizedSearchCV
#Randomized Parameter Tuning

from sklearn.model_selection import RandomizedSearchCV

dt = DecisionTreeClassifier(random_state = 42)

features = {'criterion': ['gini','entropy'],
            'splitter': ['best','random'],
            'max_depth': [2,5,10,20,40,None],
            'min_samples_split': [2,5,10,15],
            'max_features': ['auto','sqrt','log2',None]} # note: 'auto' was removed for trees in scikit-learn 1.3; drop it on newer versions

rs_dt = RandomizedSearchCV(estimator = dt, param_distributions =features, n_iter =100, cv = 3, random_state = 42, scoring ='f1')

rs_dt.fit(X_train, y_train.values.ravel())

RandomizedSearchCV(cv=3, estimator=DecisionTreeClassifier(random_state=42),
                   n_iter=100,
                   param_distributions={'criterion': ['gini', 'entropy'],
                                        'max_depth': [2, 5, 10, 20, 40, None],
                                        'max_features': ['auto', 'sqrt', 'log2',
                                                         None],
                                        'min_samples_split': [2, 5, 10, 15],
                                        'splitter': ['best', 'random']},
                   random_state=42, scoring='f1')

print('best score = ' + str(rs_dt.best_score_))
print('best params = ' + str(rs_dt.best_params_))

best score = 0.8628995334929804
best params = {'splitter': 'best', 'min_samples_split': 2, 'max_features': None, 'max_depth': 40, 'criterion': 'gini'}
  • Implementing the Decision Tree model tuning using GridSearchCV
#GridsearchCV (Exhaustive Parameter Tuning)
from sklearn.model_selection import GridSearchCV


features_gs = {'criterion': ['entropy'],
            'splitter': ['random'],
           'max_depth': np.arange(30,50,1), #getting more precise within range
           'min_samples_split': [2,3,4,5,6,7,8,9],
           'max_features': [None]}

gs_dt = GridSearchCV(estimator = dt, param_grid = features_gs, cv = 3, scoring ='f1') # no random_state needed here: grid search itself is exhaustive and deterministic

gs_dt.fit(X_train, y_train.values.ravel())

GridSearchCV(cv=3, estimator=DecisionTreeClassifier(random_state=42),
             param_grid={'criterion': ['entropy'],
                         'max_depth': array([30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46,
       47, 48, 49]),
                         'max_features': [None],
                         'min_samples_split': [2, 3, 4, 5, 6, 7, 8, 9],
                         'splitter': ['random']},
             scoring='f1')

print('best score = ' + str(gs_dt.best_score_))
print('best params = ' + str(gs_dt.best_params_))

best score = 0.848931249858977
best params = {'criterion': 'entropy', 'max_depth': 31, 'max_features': None, 'min_samples_split': 2, 'splitter': 'random'}
  • Implementing the Decision Tree model tuning using BayesSearchCV
#Bayesian Optimization
from skopt import BayesSearchCV
from skopt.space import Real, Categorical, Integer
from sklearn.model_selection import StratifiedKFold

# Choose cross validation method 
cv = StratifiedKFold(n_splits = 3)


bs_dt = BayesSearchCV(
    dt,
    {'criterion': Categorical(['gini','entropy']),
            'splitter': Categorical(['best','random']),
           'max_depth': Integer(10,50),
           'min_samples_split': Integer(2,15),
           'max_features': Categorical(['sqrt','log2',None])},
    random_state=42,
    n_iter= 100,
    cv= cv,
    scoring ='f1')
 
bs_dt.fit(X_train, y_train.values.ravel())

BayesSearchCV(cv=StratifiedKFold(n_splits=3, random_state=None, shuffle=False),
              estimator=DecisionTreeClassifier(random_state=42), n_iter=100,
              random_state=42, scoring='f1',
              search_spaces={'criterion': Categorical(categories=('gini', 'entropy'), prior=None),
                             'max_depth': Integer(low=10, high=50, prior='uniform', transform='normalize'),
                             'max_features': Categorical(categories=('sqrt', 'log2', None), prior=None),
                             'min_samples_split': Integer(low=2, high=15, prior='uniform', transform='normalize'),
                             'splitter': Categorical(categories=('best', 'random'), prior=None)})

print('best score = ' + str(bs_dt.best_score_))
print('best params = ' + str(bs_dt.best_params_))

best score = 0.8646572554143731
best params = OrderedDict([('criterion', 'entropy'), ('max_depth', 50), ('max_features', None), ('min_samples_split', 2), ('splitter', 'best')])
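  • None of the searches above ever touch the held-out test split; as a final check (illustrative, not part of the original run), the refitted best estimator can be scored on it
# best_estimator_ is refit on the full resampled training data by default (refit=True)
best_dt = bs_dt.best_estimator_
print(classification_report(y_test, best_dt.predict(X_test)))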
  • Implementing the Voting Classifier by optimizing the Logistic Regression, Decision Tree, and KNN models within the following pipeline
from sklearn.ensemble import VotingClassifier

dt_voting = DecisionTreeClassifier(**{'criterion': 'entropy', 'max_depth': 50, 'max_features': None, 'min_samples_split': 2, 'splitter': 'best'}) # ** allows you to pass in parameters as dict
knn_voting = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
lr_voting = LogisticRegression(random_state=32, max_iter = 2000, class_weight = 'balanced')

ens = VotingClassifier(estimators = [('dt', dt_voting), ('knn', knn_voting), ('lr',lr_voting)], voting = 'hard')

voting_accuracy = cross_val_score(ens,X_train,y_train.values.ravel(), cv=3, scoring ='accuracy')
voting_f1 = cross_val_score(ens,X_train,y_train.values.ravel(), cv=3, scoring ='f1')

print('voting_accuracy: ' + str(voting_accuracy))
print('voting F1 Score: ' + str(voting_f1))
print('voting_accuracy_avg: ' + str(voting_accuracy.mean()) + '  |  voting_f1_avg: ' + str(voting_f1.mean()))

voting_accuracy: [0.82490881 0.87750116 0.87486692]
voting F1 Score: [0.82273785 0.88742814 0.88349019]
voting_accuracy_avg: 0.8590922969268296  |  voting_f1_avg: 0.8645520579723294

ens = VotingClassifier(estimators = [('dt', dt_voting), ('knn', knn_voting), ('lr',lr_voting)], voting = 'soft')
voting_accuracy = cross_val_score(ens,X_train,y_train.values.ravel(), cv=3, scoring ='accuracy')
voting_f1 = cross_val_score(ens,X_train,y_train.values.ravel(), cv=3, scoring ='f1')

print('voting_accuracy: ' + str(voting_accuracy))
print('voting F1 Score: ' + str(voting_f1))
print('voting_accuracy_avg: ' + str(voting_accuracy.mean()) + '  |  voting_f1_avg: ' + str(voting_f1.mean()))

voting_accuracy: [0.82665703 0.88745258 0.88575415]
voting F1 Score: [0.81886784 0.89410882 0.89215842]
voting_accuracy_avg: 0.866621253516802  |  voting_f1_avg: 0.8683783597723743
  • Implementing the Stacking Classifier by cascading the Logistic Regression, Decision Tree, and GaussianNB models within the following pipeline
#Stacking Classifier
from sklearn.ensemble import StackingClassifier

ens_stack = StackingClassifier(estimators = [('dt', dt_voting), ('lr',lr_voting), ('nb',GaussianNB())], final_estimator = GaussianNB())

stack_accuracy = cross_val_score(ens_stack,X_train,y_train.values.ravel(), cv=3, scoring ='accuracy')
stack_f1 = cross_val_score(ens_stack,X_train,y_train.values.ravel(), cv=3, scoring ='f1')

print('stacking_accuracy: ' + str(stack_accuracy))
print('stacking F1 Score: ' + str(stack_f1))
print('stacking_accuracy_avg: ' + str(stack_accuracy.mean()) + '  |  stack_f1_avg: ' + str(stack_f1.mean()))

stacking_accuracy: [0.81573624 0.83978551 0.87037867]
stacking F1 Score: [0.80105482 0.85657788 0.87878055]
stacking_accuracy_avg: 0.841966806895526  |  stack_f1_avg: 0.8454710837994078
  • Let’s look at the Random Forest (RF) Classifier
#Ensemble Models
from sklearn.ensemble import RandomForestClassifier

#first let's try a non-tuned implementation 
rf = RandomForestClassifier(random_state=42)

rf_accuracy = cross_val_score(rf,X_train,y_train.values.ravel(), cv=3, scoring ='accuracy')
rf_f1 = cross_val_score(rf,X_train,y_train.values.ravel(), cv=3, scoring ='f1')

print('rf_accuracy: ' + str(rf_accuracy))
print('rf F1 Score: ' + str(rf_f1))
print('rf_accuracy_avg: ' + str(rf_accuracy.mean()) + '  |  rf_f1_avg: ' + str(rf_f1.mean()))

rf_accuracy: [0.81875641 0.89067447 0.88857822]
rf F1 Score: [0.80563167 0.89648843 0.89457472]
rf_accuracy_avg: 0.866003030584957  |  rf_f1_avg: 0.8655649389116813
  • Comparing the SciKit-Plot learning curves – Random Forest (RF) vs Decision Tree (DT)
import scikitplot as skplt

import sklearn

import matplotlib.pyplot as plt

import sys
import warnings
warnings.filterwarnings("ignore")

print("Scikit Plot Version : ", skplt.__version__)
print("Scikit Learn Version : ", sklearn.__version__)
print("Python Version : ", sys.version)

Scikit Plot Version :  0.3.7
Scikit Learn Version :  1.3.2
Python Version :  3.9.16 (main, Jan 11 2023, 16:16:36) [MSC v.1916 64 bit (AMD64)]
skplt.estimators.plot_learning_curve(rf,X_train,y_train.values.ravel(),
                                     cv=7, shuffle=True, scoring="accuracy",
                                     n_jobs=-1, figsize=(6,4), title_fontsize="large", text_fontsize="large",
                                     title="RF Learning Curve");
Random Forest Learning Curve
skplt.estimators.plot_learning_curve(dt,X_train,y_train.values.ravel(),
                                     cv=7, shuffle=True, scoring="accuracy",
                                     n_jobs=-1, figsize=(6,4), title_fontsize="large", text_fontsize="large",
                                     title="DT Learning Curve");
Decision Tree Learning Curve

HGBM Model Tuning, Validation & Interpretation

import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
plt.style.use('ggplot')
%matplotlib inline
import seaborn as sns
import scipy
import scipy.stats as stats
import sklearn
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve, auc
from sklearn.metrics import precision_score, recall_score, f1_score
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.calibration import CalibratedClassifierCV
from sklearn.calibration import calibration_curve
import pickle
import time
import shap


import warnings
warnings.simplefilter(action='ignore', category=UserWarning)
warnings.simplefilter(action='ignore', category=FutureWarning)

df = pd.read_csv('train.csv')

# Formatting features
df['Driving_License'] = df['Driving_License'].astype('object')
df['Region_Code'] = df['Region_Code'].astype('object')
df['Previously_Insured'] = df['Previously_Insured'].astype('object')
df['Policy_Sales_Channel'] = df['Policy_Sales_Channel'].astype('object')
df['Response'] = df['Response'].astype('object')

# Split data set between target variable and features
X_full = df.copy()
y = X_full.Response
X_full.drop(['Response'], axis=1, inplace=True)

# Summarize the class distribution 
count = pd.crosstab(index = y, columns="count")
percentage = pd.crosstab(index = y, columns="frequency")/pd.crosstab(index = y, columns="frequency").sum()
pd.concat([count, percentage], axis=1)

col_0	count	frequency
Response		
0	   334399	0.877437
1	   46710	0.122563
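  • The same class-distribution summary can be obtained with a one-liner (illustrative)
# value_counts with normalize=True reports relative frequencies directly
print(y.value_counts())
print(y.value_counts(normalize=True))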
  • Data preparation
# Select categorical columns with relatively low cardinality 
categorical_cols = [var for var in X_full.columns if
                    X_full[var].nunique() <= 15 and 
                    X_full[var].dtype == "object"]

cat = X_full[categorical_cols]
cat.columns
Index(['Gender', 'Driving_License', 'Previously_Insured', 'Vehicle_Age',
       'Vehicle_Damage'],
      dtype='object')

cat2 = pd.concat([y,cat], axis=1)

# Transform in integer binary variables
y = y.astype('int')
cat2['Response'] = cat2['Response'].astype('int')
cat2['Driving_License'] = cat2['Driving_License'].astype('int')
cat2['Previously_Insured'] = cat2['Previously_Insured'].astype('int')

cat2['Gender']=cat2['Gender'].map({'Female':0,'Male':1})
cat2['Vehicle_Damage']=cat2['Vehicle_Damage'].map({'No':0,'Yes':1})

# calculate the mean target value per category for each feature and capture the result in a dictionary 
Vehicle_Age_LABELS = cat2.groupby(['Vehicle_Age'])['Response'].mean().to_dict()
# replace for each feature the labels with the mean target values
cat2['Vehicle_Age'] = cat2['Vehicle_Age'].map(Vehicle_Age_LABELS)

# Look at the new subset
target_cat = cat2.drop(['Response'], axis=1)
target_cat.shape
(381109, 5)
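  • The cells below reference X_all, but the step that assembles it is missing from this write-up; a plausible reconstruction, assuming it joins the raw numeric columns with the encoded categoricals, follows
# Reconstructed (assumed) assembly of the full feature matrix:
# numeric columns + the encoded categorical columns from target_cat
num = X_full[['Age', 'Annual_Premium', 'Vintage']]
X_all = pd.concat([num, target_cat], axis=1)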

# Find features with variance equal zero 
to_drop = [col for col in X_all.columns if np.var(X_all[col]) == 0]
to_drop
[]

# Drop features 
X_all_v = X_all.drop(X_all[to_drop], axis=1)
  • Plotting the spearman correlation heatmap
# Correlation 
corr_matrix = X_all_v.corr(method ='spearman')
sns.heatmap(corr_matrix, square = True, annot=True, fmt='.2f')
plt.title('Correlation Heatmap on data set',size=15)
plt.yticks(fontsize="15")
plt.xticks(fontsize="15")
plt.show()
Spearman Correlation Heatmap
  • Dropping highly correlated features, then normalizing the data and splitting with test_size=0.3
# Upper triangle of the correlation matrix (the cell defining `upper` was omitted above)
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))

# Find index of feature columns with correlation greater than 0.80
to_drop = [column for column in upper.columns if any(upper[column].abs() > 0.80)]
to_drop
['Vehicle_Age', 'Vehicle_Damage']

# Drop features 
X_all_c = X_all_v.drop(X_all_v[to_drop], axis=1)

# Normalization 
scaling = MinMaxScaler()
# Normalization of numerical features
num_sc = pd.DataFrame(scaling.fit_transform(X_all_c[['Age','Annual_Premium','Vintage']]), columns= ['Age','Annual_Premium','Vintage'])
# Grasp all
X_all_sc = pd.concat([num_sc, X_all_c[['Gender','Driving_License','Previously_Insured']]], axis=1)

# Split data set
# Break off train and test set from data
X_train, X_test, y_train, y_test = train_test_split(X_all_sc, y, train_size=0.7, test_size=0.3,stratify=y,random_state=0)
  • Comparing several ML models in terms of the training time and AUC score
# LR model
start = time.time()
skf = StratifiedKFold(n_splits=5,random_state=0, shuffle=True)
LR = LogisticRegression(random_state=0)

param_grid = {}
LR_model = GridSearchCV(LR,param_grid,cv=skf)
LR_classifier = LR_model.fit(X_train, y_train)
predictions_tr = LR_classifier.predict_proba(X_train)[:, 1]
predictions_t = LR_classifier.predict_proba(X_test)[:, 1]
LR_auc_train = roc_auc_score(y_train, predictions_tr)  
LR_auc_test = roc_auc_score(y_test, predictions_t) 
score= {'model':['LR'], 'auc_train':[LR_auc_train],'auc_test':[LR_auc_test]}
LR_score= pd.DataFrame(score)
stop = time.time()
print(f"Training time: {stop - start}s")

Training time: 1.8592848777770996s
LR_score
model	auc_train	auc_test
0	LR	0.778631	0.780518
# GNB model
start = time.time()
skf = StratifiedKFold(n_splits=5,random_state=0, shuffle=True)
GNB= GaussianNB()
param_grid = {}
GNB_model = GridSearchCV(GNB,param_grid,cv=skf)
GNB_classifier = GNB_model.fit(X_train, y_train)
predictions_tr = GNB_classifier.predict_proba(X_train)[:, 1]
predictions_t = GNB_classifier.predict_proba(X_test)[:, 1]
GNB_auc_train = roc_auc_score(y_train, predictions_tr)  
GNB_auc_test = roc_auc_score(y_test, predictions_t) 
score= {'model':['GNB'], 'auc_train':[GNB_auc_train],'auc_test':[GNB_auc_test]}
GNB_score= pd.DataFrame(score)
stop = time.time()
print(f"Training time: {stop - start}s")

Training time: 0.478222131729126s

GNB_score

model	auc_train	auc_test
0	GNB	0.804705	0.804968
# HGBM model
start = time.time()
skf = StratifiedKFold(n_splits=5,random_state=0, shuffle=True)
HGBM= HistGradientBoostingClassifier(random_state=0)
param_grid = {}
HGBM_model = GridSearchCV(HGBM,param_grid,cv=skf)
HGBM_classifier = HGBM_model.fit(X_train, y_train)
predictions_tr = HGBM_classifier.predict_proba(X_train)[:, 1]
predictions_t = HGBM_classifier.predict_proba(X_test)[:, 1]
HGBM_auc_train = roc_auc_score(y_train, predictions_tr)  
HGBM_auc_test = roc_auc_score(y_test, predictions_t) 
score= {'model':['HGBM'], 'auc_train':[HGBM_auc_train],'auc_test':[HGBM_auc_test]}
HGBM_score= pd.DataFrame(score)
stop = time.time()
print(f"Training time: {stop - start}s")

Training time: 5.101318597793579s

HGBM_score

model	   auc_train	auc_test
0	HGBM	0.833247	0.826952
  • Comparing AUC scores for both train and test data
score_cal = pd.concat([LR_score, GNB_score, HGBM_score]) # DataFrame.append was removed in pandas 2.x
score_cal

model	   auc_train	auc_test
0	LR	    0.778631	0.780518
0	GNB	    0.804724	0.804944
0	HGBM	0.833247	0.826952
# Plot results for a graphical comparison
print("Spot Check Models")
plt.rcParams['figure.figsize']=(15,5)
plt.figure()
plt.subplot(1,2,1)  
sns.stripplot(x="model", y="auc_train",data=score_cal,size=15)
plt.xticks(rotation=45)
plt.title('Train results')
axes = plt.gca()
axes.set_ylim([0,1.1])
plt.subplot(1,2,2)
sns.stripplot(x="model", y="auc_test",data=score_cal,size=15)
plt.xticks(rotation=45)
plt.title('Test results')
axes = plt.gca()
axes.set_ylim([0,1.1])
plt.show()

Spot Check Models
Spot check models: AUC train/test for LR, GNB, and HGBM methods.
  • Implementing the HGBM model tuning and AUC, ROC, f1 validation
#Tuning
start = time.time()
# cross validation
skf = StratifiedKFold(n_splits=5, random_state=0, shuffle=True)
# define models and hyperparameters
HGBM = HistGradientBoostingClassifier(random_state=0)
# define grid search
hyp_space = {"max_depth": [10,21],
             "learning_rate": [0.02,0.5],
             "max_bins": [80, 195]}
# Tuning and fit the model 
HGBM_model_ = GridSearchCV(HGBM, hyp_space, n_jobs=-1, cv=skf, scoring='roc_auc', error_score=0).fit(X_train, y_train)
stop = time.time()
print(f"Training time: {stop-start}s")

Training time: 25.615475177764893s
def display(results):
    print(f'Best parameters are: {results.best_params_}')
    print("\n")
    mean_score = results.cv_results_['mean_test_score'] 
    std_score = results.cv_results_['std_test_score'] 
    params = results.cv_results_['params']
    for mean, std, p in zip(mean_score, std_score, params):
        print(f'{round(mean,3)} + or -{round(std,3)} for the {p}')
        
display(HGBM_model_)

Best parameters are: {'learning_rate': 0.02, 'max_bins': 80, 'max_depth': 10}


0.827 + or -0.002 for the {'learning_rate': 0.02, 'max_bins': 80, 'max_depth': 10}
0.827 + or -0.002 for the {'learning_rate': 0.02, 'max_bins': 80, 'max_depth': 21}
0.827 + or -0.002 for the {'learning_rate': 0.02, 'max_bins': 195, 'max_depth': 10}
0.827 + or -0.002 for the {'learning_rate': 0.02, 'max_bins': 195, 'max_depth': 21}
0.825 + or -0.002 for the {'learning_rate': 0.5, 'max_bins': 80, 'max_depth': 10}
0.825 + or -0.002 for the {'learning_rate': 0.5, 'max_bins': 80, 'max_depth': 21}
0.824 + or -0.002 for the {'learning_rate': 0.5, 'max_bins': 195, 'max_depth': 10}
0.825 + or -0.002 for the {'learning_rate': 0.5, 'max_bins': 195, 'max_depth': 21}
# HGBM Model Training
HGBM_ = HistGradientBoostingClassifier(random_state=0,learning_rate=0.02, max_bins=80, max_depth= 10)
                               
# fit the model
HGBM_tclassifier = HGBM_.fit(X_train, y_train)

start = time.time()
# prediction
predictions_tr = HGBM_tclassifier.predict_proba(X_train)[:, 1]
predictions_tr_ = pd.DataFrame(predictions_tr, columns=['y_train_pred'])
predictions_te = HGBM_tclassifier.predict_proba(X_test)[:, 1]
predictions_te_ = pd.DataFrame(predictions_te, columns=['y_test_pred'])
stop = time.time()
print(f"Training time: {stop-start}s")

Training time: 0.581667423248291s

auc_train = roc_auc_score(y_train, HGBM_tclassifier.predict_proba(X_train)[:, 1])  
auc_test = roc_auc_score(y_test, HGBM_tclassifier.predict_proba(X_test)[:, 1]) 

# metrics table
d1 = {'evaluation': ['AUC'],
     'model': ['HGBM'],
    'train': [auc_train],
    'test': [auc_test]
        }
df1 = pd.DataFrame(data=d1, columns=['model','evaluation','train','test'])
print('HGBM evaluation on cross-sell prediction')
df1

HGBM evaluation on cross-sell prediction

model	   evaluation	train	test
0	HGBM	AUC	      0.829556	0.827226

# Use f1_score to maximize
metric = f1_score

# Generate a range of classification thresholds to evaluate
thresholds = np.arange(0.0, 1.01, 0.01)

# Compute the metric for each threshold
metric_values = [metric(y_test, np.where(predictions_te >= threshold, 1, 0)) for threshold in thresholds]

# Find the best threshold that maximizes the metric
best_threshold = thresholds[np.argmax(metric_values)]

print("Best threshold:", best_threshold)

Best threshold: 0.21
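  • The same threshold search can be done in a vectorized way (illustrative, not part of the original run) with precision_recall_curve
from sklearn.metrics import precision_recall_curve

# One precision/recall pair per candidate threshold; F1 follows directly.
prec, rec, thr = precision_recall_curve(y_test, predictions_te)
f1_all = 2 * prec * rec / np.clip(prec + rec, 1e-12, None)
print('Best threshold:', thr[np.argmax(f1_all[:-1])])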
  • Plotting the HGBM ROC Curve
# compute the tpr and fpr from the prediction
fpr, tpr, thresholds = roc_curve(y_test, predictions_te)

# Plot the ROC curve
plt.rcParams['figure.figsize']=(10,5)
plt.plot(fpr, tpr, label='ROC Curve (AUC = %0.2f)' % auc(fpr, tpr))
plt.plot([0, 1], [0, 1], 'k--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) Curve')
plt.legend(loc="lower right")

# Adjust the threshold and compute the true positive rate (TPR) and false positive rate (FPR)
threshold = 0.21
y_pred = np.where(predictions_te >= threshold, 1, 0)
fpr_new, tpr_new, _ = roc_curve(y_test, y_pred)

# Plot the new point on the ROC curve
plt.scatter(fpr_new, tpr_new, c='r', label='New Threshold = %0.2f' % threshold)
plt.legend(loc="lower right")
print('ROC on test')
plt.show()

ROC on test
 HGBM ROC Curve for test
  • Comparing true vs predicted test values
# create a barplot for a comparison between test values and predicted values
y_test_= np.array(y_test)
y_test_ = y_test_.flatten()
y_pred = y_pred.flatten()
df_2 = pd.DataFrame({'Actual': y_test_, 'Predicted': y_pred})
sns.countplot(x='value', hue='variable', data=pd.melt(df_2))
plt.title('True vs Predicted Labels')
plt.show()
True vs Predicted test values
  • Plotting the SHAP HGBM model interpreter
HGBM_explainer = shap.TreeExplainer(HistGradientBoostingClassifier(random_state=0,learning_rate=0.02, max_bins=80, max_depth= 10).fit(X_train, y_train))
shap_values = HGBM_explainer.shap_values(X_test)
# Global SHAP on test
print("HGBM SHAP BARPLOT on test Values")
shap.summary_plot(shap_values, features=X_test, feature_names=X_test.columns,plot_type='bar')
The SHAP HGBM model interpreter
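  • Beyond the bar plot, the default beeswarm view also shows the direction of each feature's effect, not just its magnitude (an illustrative follow-up, not in the original run)
# Beeswarm summary: each point is one test row, colored by feature value
shap.summary_plot(shap_values, X_test)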

XGB Model Validation

  • Let’s train and validate the XGBoost model
# Import key libraries

import pandas as pd, numpy as np
import os
import math
from math import ceil, floor, log
import random

from sklearn.model_selection import KFold
import matplotlib.pyplot as plt
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score, roc_auc_score,
                             precision_recall_curve, auc, roc_curve, recall_score,
                             classification_report)
from sklearn.model_selection import train_test_split
import sklearn
from sklearn import metrics
from sklearn import preprocessing
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

import seaborn as sns

from yellowbrick.classifier import ClassificationReport
import scikitplot as skplt

from xgboost import XGBClassifier
import xgboost as xgb
from lightgbm import LGBMClassifier
import catboost
print(catboost.__version__)
from catboost import *
from catboost import datasets
from catboost import CatBoostClassifier

1.1.1
  • Data preparation
SEED = 1970
random.seed(SEED)
df_train = pd.read_csv("train.csv")
df_test = pd.read_csv("test.csv")

col_list = df_train.columns.to_list()[1:]
df_train_corr = df_train.copy().set_index('id')
df_train_ones = df_train_corr.loc[df_train_corr.Response == 1].copy()

categorical_features = ['Gender', 'Driving_License', 'Region_Code', 'Previously_Insured', 'Vehicle_Age', 'Vehicle_Damage','Policy_Sales_Channel']
text_features = ['Gender', 'Vehicle_Age', 'Vehicle_Damage']

# code text categorical features
le = preprocessing.LabelEncoder()
for f in text_features :
    df_train_corr[f] = le.fit_transform(df_train_corr[f])
# cast numeric categorical columns to int so CatBoost can treat them as categorical
df_train_corr.Region_Code = df_train_corr.Region_Code.astype('int32')
df_train_corr.Policy_Sales_Channel = df_train_corr.Policy_Sales_Channel.astype('int32')

def plot_ROC(fpr, tpr, m_name):
    roc_auc = sklearn.metrics.auc(fpr, tpr)
    plt.figure(figsize=(6, 6))
    lw = 2
    plt.plot(fpr, tpr, color='darkorange',
             lw=lw, label='ROC curve (area = %0.2f)' % roc_auc, alpha=0.5)
    
    plt.plot([0, 1], [0, 1], color='navy', lw=lw, linestyle='--', alpha=0.5)
    
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.05])
    plt.xticks(fontsize=16)
    plt.yticks(fontsize=16)
    plt.grid(True)
    plt.xlabel('False Positive Rate', fontsize=16)
    plt.ylabel('True Positive Rate', fontsize=16)
    plt.title('Receiver operating characteristic for %s'%m_name, fontsize=20)
    plt.legend(loc="lower right", fontsize=16)
    plt.show()

def upsample(df, u_feature, n_upsampling):
    ones = df.copy()
    for n in range(n_upsampling):
        if u_feature == 'Annual_Premium':
            df[u_feature] = ones[u_feature].apply(lambda x: x + random.randint(-1,1)* x *0.05) # perturb Annual_Premium by up to ±5%
        else:
            df[u_feature] = ones[u_feature].apply(lambda x: x + random.randint(-5,5)) # perturb Age/Vintage by up to ±5
                
        if n == 0:
            df_new = df.copy()
        else:
            df_new = pd.concat([df_new, df])
    return df_new

try:
    df_train_corr.drop(columns = ['bin_age'], inplace = True)
except KeyError:
    print('already deleted')

df_train_mod = df_train_corr.copy()
df_train_mod['old_damaged'] = df_train_mod.apply(lambda x: pow(2,x.Vehicle_Age)+pow(2,x.Vehicle_Damage), axis =1)

# we shall preserve validation set without augmentation/over-sampling
df_temp, X_valid, _, y_valid = train_test_split(df_train_mod, df_train_mod['Response'], train_size=0.8, random_state = SEED)
X_valid = X_valid.drop(columns = ['Response'])

# upsampling Positive Response class only
df_train_up_a = upsample(df_temp.loc[df_temp['Response'] == 1], 'Age', 1)
df_train_up_v = upsample(df_temp.loc[df_temp['Response'] == 1], 'Vintage', 1)

# NB: concatenating the full df_train_mod (which still contains the validation
# rows) leaks the validation set into training and inflates the AUC below;
# using df_temp instead would keep the split clean.
df_ext = pd.concat([df_train_mod,df_train_up_a])
df_ext = pd.concat([df_ext,df_train_up_v])
X_train = df_ext.drop(columns = ['Response'])
y_train = df_ext.Response
print('Train set target class count with over-sampling:')
print(y_train.value_counts())
print('Validation set target class count: ')
print(y_valid.value_counts())
X_train.head()

Train set target class count with over-sampling:
0    334399
1    121390
Name: Response, dtype: int64
Validation set target class count: 
0    66852
1     9370
Name: Response, dtype: int64
  • Fitting the XGBoost Classifier
XGB_model_l = XGBClassifier(random_state = SEED, max_depth = 8, 
                            n_estimators = 30000, 
                            reg_lambda = 1.2, reg_alpha = 1.2, 
                            min_child_weight = 1, 
                            objective = 'binary:logistic',
                            learning_rate = 0.15, gamma = 0.3, colsample_bytree = 0.5, eval_metric = 'auc')

XGB_model_l.fit(X_train, y_train,
                eval_set = [(X_valid, y_valid)],
                early_stopping_rounds=50,verbose = 1000)
[0]	validation_0-auc:0.75188
[1000]	validation_0-auc:0.90154
[2000]	validation_0-auc:0.92239
[3000]	validation_0-auc:0.93506
[4000]	validation_0-auc:0.94647
[5000]	validation_0-auc:0.95441
[6000]	validation_0-auc:0.96068
[7000]	validation_0-auc:0.96629
[8000]	validation_0-auc:0.97081
[9000]	validation_0-auc:0.97479
[10000]	validation_0-auc:0.97813
[11000]	validation_0-auc:0.98071
[12000]	validation_0-auc:0.98277
[13000]	validation_0-auc:0.98462
[14000]	validation_0-auc:0.98639
[15000]	validation_0-auc:0.98796
[16000]	validation_0-auc:0.98918
[17000]	validation_0-auc:0.99025
[18000]	validation_0-auc:0.99115
[19000]	validation_0-auc:0.99189
[20000]	validation_0-auc:0.99260
[21000]	validation_0-auc:0.99323
[22000]	validation_0-auc:0.99378
[23000]	validation_0-auc:0.99429
[24000]	validation_0-auc:0.99476
[25000]	validation_0-auc:0.99515
[26000]	validation_0-auc:0.99550
[27000]	validation_0-auc:0.99579
[28000]	validation_0-auc:0.99600
[29000]	validation_0-auc:0.99622
[29999]	validation_0-auc:0.99643

XGBClassifier(base_score=0.5, booster='gbtree', callbacks=None,
              colsample_bylevel=1, colsample_bynode=1, colsample_bytree=0.5,
              early_stopping_rounds=None, enable_categorical=False,
              eval_metric='auc', feature_types=None, gamma=0.3, gpu_id=-1,
              grow_policy='depthwise', importance_type=None,
              interaction_constraints='', learning_rate=0.15, max_bin=256,
              max_cat_threshold=64, max_cat_to_onehot=4, max_delta_step=0,
              max_depth=8, max_leaves=0, min_child_weight=1, missing=nan,
              monotone_constraints='()', n_estimators=30000, n_jobs=0,
              num_parallel_tree=1, predictor='auto', random_state=1970, ...)
XGB_preds_l = XGB_model_l.predict_proba(X_valid)
XGB_score_l = roc_auc_score(y_valid, XGB_preds_l[:,1])
XGB_class_l = XGB_model_l.predict(X_valid)
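  • Note that in XGBoost 2.0+ early_stopping_rounds moved from fit() to the estimator constructor; a sketch of the equivalent call on newer versions (same hyperparameters as above) is
# On XGBoost >= 2.0, early stopping is configured on the estimator itself
XGB_model_l = XGBClassifier(random_state = SEED, max_depth = 8,
                            n_estimators = 30000,
                            reg_lambda = 1.2, reg_alpha = 1.2,
                            min_child_weight = 1,
                            objective = 'binary:logistic',
                            learning_rate = 0.15, gamma = 0.3, colsample_bytree = 0.5,
                            eval_metric = 'auc', early_stopping_rounds = 50)
XGB_model_l.fit(X_train, y_train, eval_set = [(X_valid, y_valid)], verbose = 1000)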
  • XGBoost model validation
(fpr, tpr, thresholds) = roc_curve(y_valid, XGB_preds_l[:,1])
plt.rcParams.update({'font.size': 22})
plot_ROC(fpr, tpr,'XGBoost')

print('ROC AUC score for XGBoost model with over-sampling + 2 new features: %.4f'%XGB_score_l)
print('F1 score: %0.4f'%f1_score(y_valid, XGB_class_l))
skplt.metrics.plot_confusion_matrix(y_valid, XGB_class_l,
        figsize=(8,8))

xgb.plot_importance(XGB_model_l)
ROC for XGBoost
ROC AUC score for XGBoost model with over-sampling + 2 new features: 0.9964
F1 score: 0.9176
XGBoost confusion matrix
XGBoost feature importance
  • Plotting the XGBoost normalized confusion matrix
from sklearn.metrics import confusion_matrix
import seaborn as sns

target_names=['0','1']
plt.rcParams.update({'font.size': 22})
cm = confusion_matrix(y_valid, XGB_class_l)
# Normalise
cmn = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
fig, ax = plt.subplots(figsize=(10,10))
sns.heatmap(cmn, annot=True, fmt='.6f', xticklabels=target_names, yticklabels=target_names)
plt.ylabel('Actual',fontsize=18)
plt.xlabel('Predicted',fontsize=18)
plt.title('XGB Confusion Matrix',fontsize=18)
plt.show(block=False)
XGBoost normalized confusion matrix

IQR Filtering & RF Modeling

  • Following a recent study, we investigate the impact of outlier removal via IQR (interquartile range) filtering
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
df = pd.read_csv('train.csv')
onehots = pd.get_dummies(df['Vehicle_Age'], prefix='Vehicle_Age')
df = df.join(onehots)
onehots2 = pd.get_dummies(df['Gender'], prefix='Gender')
df = df.join(onehots2)
onehots3 = pd.get_dummies(df['Vehicle_Damage'], prefix='Vehicle_Damage')
df = df.join(onehots3)
df = df.drop(['id', 'Gender', 'Vehicle_Damage', 'Vehicle_Age'], axis=1) 
print(f'Count of rows before filtering outliers: {len(df)}')

filtered_entries = np.array([True] * len(df))
for col in ['Annual_Premium']:
    Q1 = df[col].quantile(0.25)
    Q3 = df[col].quantile(0.75)
    IQR = Q3 - Q1
    low_limit = Q1 - (IQR * 1.5)
    high_limit = Q3 + (IQR * 1.5)

    filtered_entries = ((df[col] >= low_limit) & (df[col] <= high_limit)) & filtered_entries
    
df = df[filtered_entries]

print(f'Count of rows after filtering outliers: {len(df)}')
Count of rows before filtering outliers: 369067
Count of rows after filtering outliers: 369039

df = df.drop(['Annual_Premium'], axis=1)
print(df['Response'].value_counts())
0    324365
1     44914
Name: Response, dtype: int64

X = df[[col for col in df.columns if (str(df[col].dtype) != 'object') and col not in ['Response']]]
y = df['Response'].values
print(X.shape)
print(y.shape)
(369279, 14)
(369279,)

from imblearn import over_sampling
X_over, y_over = over_sampling.RandomOverSampler().fit_resample(X, y)
df_y_over = pd.Series(y_over).value_counts()
df_y_over
1    324365
0    324365
dtype: int64

df.to_csv('train_pre_processed.csv')
  • Applying the RF Classifier to the filtered dataset
from sklearn.ensemble import RandomForestClassifier
# Split feature vector and label: use the oversampled data from above,
# which is consistent with the balanced class support in the report below
X = X_over  # all features except the target
y = y_over  # target / label

#Splitting the data into Train and Test
from sklearn.model_selection import train_test_split 
X_train, X_test,y_train,y_test = train_test_split(X,
                                                y,
                                                test_size = 0.3,
                                                random_state = 789)
rf = RandomForestClassifier(n_estimators= 400, max_depth=110, random_state=0)
rf.fit(X_train, y_train)
RandomForestClassifier(max_depth=110, n_estimators=400, random_state=0)

y_predicted = rf.predict(X_test)


  • Final RF classification report
# Outliers removed via IQR filtering
# Data oversampled on Response == 1

from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, precision_score
print('\nconfusion matrix') # generate the confusion matrix
print(confusion_matrix(y_test, y_predicted))

print('\naccuracy')
print(accuracy_score(y_test, y_predicted))
print('\nprecision')
print(precision_score(y_test, y_predicted))


print('\nclassification report')
print(classification_report(y_test, y_predicted))

confusion matrix
[[85995 11312]
 [  300 97012]]

accuracy
0.9403347052446062

precision
0.8955725416343562

classification report
              precision    recall  f1-score   support

           0       1.00      0.88      0.94     97307
           1       0.90      1.00      0.94     97312

    accuracy                           0.94    194619
   macro avg       0.95      0.94      0.94    194619
weighted avg       0.95      0.94      0.94    194619

print("train Accuracy : ",rf.score(X_train,y_train))
print("test Accuracy : ",rf.score(X_test,y_test))
train Accuracy :  0.9999229263329891
test Accuracy :  0.9403347052446062
from sklearn.metrics import roc_curve, auc
fpr, tpr, thresholds = roc_curve(y_test, y_predicted, pos_label=1) # pos_label: the label treated as positive
print('Area Under ROC Curve (AUC):', auc(fpr, tpr))

Area Under ROC Curve (AUC): 0.9403332515355389
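  • Because y_predicted holds hard 0/1 labels, the AUC above reflects a single operating point; an illustrative variant (not part of the original run) scores ranking quality from predicted probabilities instead
# Probability-based AUC: the usual threshold-independent ranking metric
from sklearn.metrics import roc_auc_score
y_scores = rf.predict_proba(X_test)[:, 1]
print('Probability-based AUC:', roc_auc_score(y_test, y_scores))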
plt.subplots(figsize=(10, 6))
plt.plot(fpr, tpr, 'o-', label="ROC curve")
plt.plot(np.linspace(0,1,10), np.linspace(0,1,10), label="diagonal")
for x, y, txt in zip(fpr, tpr, thresholds):
    plt.annotate(np.round(txt,2), (x, y-0.04))
plt.legend(loc="upper left")
plt.xlabel("FPR")
plt.ylabel("TPR")
RF ROC curve
# RF feature importance score

feat_importances = pd.Series(rf.feature_importances_, index=X.columns)
ax = feat_importances.nlargest(10).plot(kind='barh')
ax.invert_yaxis()
plt.xlabel('score')
plt.ylabel('feature')
plt.title('feature importance score')
RF feature importance score
  • Final RF normalized confusion matrix
from sklearn.metrics import confusion_matrix
import seaborn as sns

target_names=['0','1']

cm = confusion_matrix(y_test, y_predicted)
# Normalise
cmn = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
fig, ax = plt.subplots(figsize=(10,10))
sns.heatmap(cmn, annot=True, fmt='.6f', xticklabels=target_names, yticklabels=target_names)
plt.ylabel('Actual')
plt.xlabel('Predicted')
plt.show(block=False)
RF normalized confusion matrix

Conclusions

  • In this project, we have applied supervised binary classification and AI-powered predictive analytics to cross-selling insurance initiatives.
  • We have proposed an integrated ML approach for identifying cross-sell opportunities within insurance customer data using ML model tuning, cross-validation & interpretation.
  • This approach has been extensively tested and evaluated on real insurance data that had been provided by an insurance company.
  • Results show the ability of ML tuning pipelines, and ensemble models (HGBM, XGBoost, and Random Forest) to identify cross-sell customers.
  • This study will be integrated into a recommendation system that assigns cross-sell probability scores to current or new insurance customers, supporting advisors in improving cross-product selling.
