Breast Cancer ML Classification – Logistic Regression vs Gradient Boosting with Hyperparameter Optimization (HPO)

According to WHO, Breast Cancer (BC) is the leading cause of death among women worldwide. The present study optimizes the use of supervised Machine Learning (ML) algorithms for detecting, analyzing, and classifying BC. 

Following previous BC ML/AI case studies (cf. Read More), our objective is to compare Logistic Regression (LR) against Gradient Boosting (GB) Classifier within the Hyperparameter Optimization (HPO) loop given by GridSearchCV.

We use the publicly available BC dataset from the University of Wisconsin Hospitals, Madison, Wisconsin, USA. 

Let’s open the Jupyter IDE Notebook

and begin our project by setting the working directory YOURPATH

import os
os.chdir(‘YOURPATH’)
os. getcwd()

and importing the key libraries

import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
from collections import OrderedDict

from sklearn import datasets
from sklearn.preprocessing import label_binarize, LabelBinarizer
from sklearn.model_selection import train_test_split, GridSearchCV

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier

from sklearn.metrics import precision_recall_curve
from sklearn.metrics import classification_report
from sklearn.metrics import roc_curve, auc

DISPLAY_PRECISION = 4
pd.set_option(“display.precision”, DISPLAY_PRECISION)

Let’s load the BC dataset and print its detailed description

dat = datasets.load_breast_cancer()
print(dat.DESCR)

.. _breast_cancer_dataset:

Breast cancer wisconsin (diagnostic) dataset
--------------------------------------------

**Data Set Characteristics:**

    :Number of Instances: 569

    :Number of Attributes: 30 numeric, predictive attributes and the class

    :Attribute Information:
        - radius (mean of distances from center to points on the perimeter)
        - texture (standard deviation of gray-scale values)
        - perimeter
        - area
        - smoothness (local variation in radius lengths)
        - compactness (perimeter^2 / area - 1.0)
        - concavity (severity of concave portions of the contour)
        - concave points (number of concave portions of the contour)
        - symmetry
        - fractal dimension ("coastline approximation" - 1)

        The mean, standard error, and "worst" or largest (mean of the three
        worst/largest values) of these features were computed for each image,
        resulting in 30 features.  For instance, field 0 is Mean Radius, field
        10 is Radius SE, field 20 is Worst Radius.

        - class:
                - WDBC-Malignant
                - WDBC-Benign

    :Summary Statistics:

    ===================================== ====== ======
                                           Min    Max
    ===================================== ====== ======
    radius (mean):                        6.981  28.11
    texture (mean):                       9.71   39.28
    perimeter (mean):                     43.79  188.5
    area (mean):                          143.5  2501.0
    smoothness (mean):                    0.053  0.163
    compactness (mean):                   0.019  0.345
    concavity (mean):                     0.0    0.427
    concave points (mean):                0.0    0.201
    symmetry (mean):                      0.106  0.304
    fractal dimension (mean):             0.05   0.097
    radius (standard error):              0.112  2.873
    texture (standard error):             0.36   4.885
    perimeter (standard error):           0.757  21.98
    area (standard error):                6.802  542.2
    smoothness (standard error):          0.002  0.031
    compactness (standard error):         0.002  0.135
    concavity (standard error):           0.0    0.396
    concave points (standard error):      0.0    0.053
    symmetry (standard error):            0.008  0.079
    fractal dimension (standard error):   0.001  0.03
    radius (worst):                       7.93   36.04
    texture (worst):                      12.02  49.54
    perimeter (worst):                    50.41  251.2
    area (worst):                         185.2  4254.0
    smoothness (worst):                   0.071  0.223
    compactness (worst):                  0.027  1.058
    concavity (worst):                    0.0    1.252
    concave points (worst):               0.0    0.291
    symmetry (worst):                     0.156  0.664
    fractal dimension (worst):            0.055  0.208
    ===================================== ====== ======

    :Missing Attribute Values: None

    :Class Distribution: 212 - Malignant, 357 - Benign

    :Creator:  Dr. William H. Wolberg, W. Nick Street, Olvi L. Mangasarian

    :Donor: Nick Street

    :Date: November, 1995

This is a copy of UCI ML Breast Cancer Wisconsin (Diagnostic) datasets.
https://goo.gl/U2Uwz2

Features are computed from a digitized image of a fine needle
aspirate (FNA) of a breast mass.  They describe
characteristics of the cell nuclei present in the image.

Separating plane described above was obtained using
Multisurface Method-Tree (MSM-T) [K. P. Bennett, "Decision Tree
Construction Via Linear Programming." Proceedings of the 4th
Midwest Artificial Intelligence and Cognitive Science Society,
pp. 97-101, 1992], a classification method which uses linear
programming to construct a decision tree.  Relevant features
were selected using an exhaustive search in the space of 1-4
features and 1-3 separating planes.

The actual linear program used to obtain the separating plane
in the 3-dimensional space is that described in:
[K. P. Bennett and O. L. Mangasarian: "Robust Linear
Programming Discrimination of Two Linearly Inseparable Sets",
Optimization Methods and Software 1, 1992, 23-34].

This database is also available through the UW CS ftp server:

ftp ftp.cs.wisc.edu
cd math-prog/cpo-dataset/machine-learn/WDBC/

.. topic:: References

   - W.N. Street, W.H. Wolberg and O.L. Mangasarian. Nuclear feature extraction 
     for breast tumor diagnosis. IS&T/SPIE 1993 International Symposium on 
     Electronic Imaging: Science and Technology, volume 1905, pages 861-870,
     San Jose, CA, 1993.
   - O.L. Mangasarian, W.N. Street and W.H. Wolberg. Breast cancer diagnosis and 
     prognosis via linear programming. Operations Research, 43(4), pages 570-577, 
     July-August 1995.
   - W.H. Wolberg, W.N. Street, and O.L. Mangasarian. Machine learning techniques
     to diagnose breast cancer from fine-needle aspirates. Cancer Letters 77 (1994) 
     163-171.




Let’s look at the basic structure of our input data

print(“The sklearn breast cancer dataset keys:”)
print(dat.keys()) # dict_keys([‘target_names’, ‘target’, ‘feature_names’, ‘data’, ‘DESCR’])
print(“—“)

li_classes = [dat.target_names[1], dat.target_names[0]]
li_target = [1 if x==0 else 0 for x in list(dat.target)]
li_ftrs = list(dat.feature_names)

print(“There are 2 target classes:”)
print(“li_classes”, li_classes)
print(“—“)
print(“Target class distribution from a total of %d target values:” % len(li_target))
print(pd.Series(li_target).value_counts())
print(“—“)

df_all = pd.DataFrame(dat.data[:,:], columns=li_ftrs)
print(“Describe dataframe, first 6 columns:”)
print(df_all.iloc[:,:6].describe().to_string())

The sklearn breast cancer dataset keys:
dict_keys(['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names', 'filename', 'data_module'])
---
There are 2 target classes:
li_classes ['benign', 'malignant']
---
Target class distribution from a total of 569 target values:
0    357
1    212
dtype: int64
---
Describe dataframe, first 6 columns:
       mean radius  mean texture  mean perimeter  mean area  mean smoothness  mean compactness
count     569.0000      569.0000         569.000   569.0000         569.0000          569.0000
mean       14.1273       19.2896          91.969   654.8891           0.0964            0.1043
std         3.5240        4.3010          24.299   351.9141           0.0141            0.0528
min         6.9810        9.7100          43.790   143.5000           0.0526            0.0194
25%        11.7000       16.1700          75.170   420.3000           0.0864            0.0649
50%        13.3700       18.8400          86.240   551.1000           0.0959            0.0926
75%        15.7800       21.8000         104.100   782.7000           0.1053            0.1304
max        28.1100       39.2800         188.500  2501.0000           0.1634            0.3454

Let’s perform test/train data splitting with test_size=0.3

TEST_SIZE_RATIO = 0.3

Setup X and y:

X = df_all
y = pd.Series(li_target)

X_train_0, X_test_0, y_train, y_test = train_test_split(X, y, test_size=TEST_SIZE_RATIO, random_state=0)
print(“X_train_0.shape, y_train.shape”, X_train_0.shape, y_train.shape)
print(“X_test_0.shape, y_test.shape”, X_test_0.shape, y_test.shape)

X_train_0.shape, y_train.shape (398, 30) (398,)
X_test_0.shape, y_test.shape (171, 30) (171,)

Let’s build the correlation matrix for the train data using the function

def correlation_matrix(y, X, is_plot=False):
# Calculate and plot the correlation symmetrical matrix
# Return:
# yX – concatenated data
# yX_corr – correlation matrix, pearson correlation of values from -1 to +1
# yX_abs_corr – correlation matrix, absolute values

yX = pd.concat([y, X], axis=1)
yX = yX.rename(columns={0: ‘TARGET’}) # rename first column

print(“Function correlation_matrix: X.shape, y.shape, yX.shape:”, X.shape, y.shape, yX.shape)
print()

# Get feature correlations and transform to dataframe
yX_corr = yX.corr(method=’pearson’)

# Convert to abolute values
yX_abs_corr = np.abs(yX_corr)

if is_plot:
plt.figure(figsize=(10, 10))
plt.imshow(yX_abs_corr, cmap=’RdYlGn’, interpolation=’none’, aspect=’auto’)
plt.colorbar()
plt.xticks(range(len(yX_abs_corr)), yX_abs_corr.columns, rotation=’vertical’)
plt.yticks(range(len(yX_abs_corr)), yX_abs_corr.columns);
plt.suptitle(‘Pearson Correlation Heat Map (absolute values)’, fontsize=15, fontweight=’bold’)
#plt.show()
plt.savefig(‘corrmatrixafter.png’)

return yX, yX_corr, yX_abs_corr

Let’s call this function

yX, yX_corr, yX_abs_corr = correlation_matrix(y_train, X_train_0, is_plot=True)

Function correlation_matrix: X.shape, y.shape, yX.shape: (398, 30) (398,) (398, 31)
Pearson correlation heat map with all features

Let’s select the BC model features within the range (0.1, 0.8)

CORRELATION_MIN = 0.1

Sort features by their Pearson correlation with the target value

s_corr_target = yX_abs_corr[‘TARGET’]
s_corr_target_sort = s_corr_target.sort_values(ascending=False)

Only use features with a minimum Pearson correlation with the target of 0.1

s_low_correlation_ftrs = s_corr_target_sort[s_corr_target_sort <= CORRELATION_MIN]

Print

print(“Removed %d low correlation features:” % len(s_low_correlation_ftrs))
for i,v in enumerate(s_low_correlation_ftrs):
print(i,np.round(v, DISPLAY_PRECISION), s_low_correlation_ftrs.index[i])

print(“—“)

s_corr_target_sort = s_corr_target_sort[s_corr_target_sort > CORRELATION_MIN]

print(“Remaining %d feature correlations:” % (len(s_corr_target_sort)-1))
for i,v in enumerate(s_corr_target_sort):
ftr = s_corr_target_sort.index[i]
if ftr == ‘TARGET’:
continue

print(i,np.round(v, DISPLAY_PRECISION), ftr)

Removed 5 low correlation features:
0 0.0778 smoothness error
1 0.0622 fractal dimension error
2 0.0373 texture error
3 0.0187 mean fractal dimension
4 0.0154 symmetry error
---
Remaining 25 feature correlations:
1 0.8024 worst concave points
2 0.7794 mean concave points
3 0.7723 worst perimeter
4 0.7661 worst radius
5 0.7318 mean perimeter
6 0.7194 worst area
7 0.719 mean radius
8 0.7111 mean concavity
9 0.6947 worst concavity
10 0.6932 mean area
11 0.6196 mean compactness
12 0.6062 worst compactness
13 0.5574 radius error
14 0.5465 perimeter error
15 0.5264 area error
16 0.4347 worst texture
17 0.4324 worst smoothness
18 0.4255 worst symmetry
19 0.3902 concave points error
20 0.3883 mean texture
21 0.373 mean smoothness
22 0.3361 mean symmetry
23 0.3272 worst fractal dimension
24 0.2823 compactness error
25 0.2322 concavity error

CORRELATION_MAX = 0.8

Remove features that have a low correlation with the target

li_X1_cols = list(set(s_corr_target_sort.index) – set(s_low_correlation_ftrs.index))
li_X1_cols.remove(‘TARGET’)

Build the correlation matrix for the reduced X

X1 = X_train_0[li_X1_cols]
yX1, yX_corr1, yX_abs_corr1 = correlation_matrix(y_train, X1, is_plot=False)

Get all the feature pairs

Xcorr1 = yX_abs_corr1.iloc[1:,1:]
s_pairs = Xcorr1.unstack()
print(“s_pairs.shape”, s_pairs.shape)
s_pairs = np.round(s_pairs, decimals=DISPLAY_PRECISION)

Sort all the pairs by highest correlation values

s_pairs_sorted = s_pairs.sort_values(ascending=False)
s_pairs_sorted = s_pairs_sorted[(s_pairs_sorted != 1) & (s_pairs_sorted > CORRELATION_MAX)] # leave only the top matches that are not identical features

Convert to a list of name tuples e.g. (‘mean radius’, ‘mean perimeter’)

li_corr_pairs = s_pairs_sorted.index.tolist()

print(“len(li_corr_pairs):”, len(li_corr_pairs))
print(“li_corr_pairs[:10]”, li_corr_pairs[:10])

Remove features that have a low correlation with the target

li_X1_cols = list(set(s_corr_target_sort.index) – set(s_low_correlation_ftrs.index))
li_X1_cols.remove(‘TARGET’)

Build the correlation matrix for the reduced X

X1 = X_train_0[li_X1_cols]
yX1, yX_corr1, yX_abs_corr1 = correlation_matrix(y_train, X1, is_plot=False)

Get all the feature pairs

Xcorr1 = yX_abs_corr1.iloc[1:,1:]
s_pairs = Xcorr1.unstack()
print(“s_pairs.shape”, s_pairs.shape)
s_pairs = np.round(s_pairs, decimals=DISPLAY_PRECISION)

Sort all the pairs by highest correlation values

s_pairs_sorted = s_pairs.sort_values(ascending=False)
s_pairs_sorted = s_pairs_sorted[(s_pairs_sorted != 1) & (s_pairs_sorted > CORRELATION_MAX)] # leave only the top matches that are not identical features

Convert to a list of name tuples e.g. (‘mean radius’, ‘mean perimeter’)

li_corr_pairs = s_pairs_sorted.index.tolist()

print(“len(li_corr_pairs):”, len(li_corr_pairs))
print(“li_corr_pairs[:10]”, li_corr_pairs[:10])

Function correlation_matrix: X.shape, y.shape, yX.shape: (398, 25) (398,) (398, 26)

s_pairs.shape (625,)
len(li_corr_pairs): 80
li_corr_pairs[:10] [('mean radius', 'mean perimeter'), ('mean perimeter', 'mean radius'), ('worst perimeter', 'worst radius'), ('worst radius', 'worst perimeter'), ('mean area', 'mean radius'), ('mean radius', 'mean area'), ('mean area', 'mean perimeter'), ('mean perimeter', 'mean area'), ('worst area', 'worst radius'), ('worst radius', 'worst area')]

Let’s build the list of features to remove
li_remove_pair_ftrs = []
li_remove_scores = []
for tup in li_corr_pairs:
s0 = s_corr_target_sort.loc[tup[0]]
s1 = s_corr_target_sort.loc[tup[1]]
remove_ftr = tup[1] if s1 < s0 else tup[0] # get the feature that is less correlated with the target
if remove_ftr not in li_remove_pair_ftrs:
li_remove_pair_ftrs.append(remove_ftr)
di = {‘ftr_0’:tup[0], ‘ftr_1’:tup[1], ‘score_0’:s0, ‘score_1’:s1, ‘FEATURE_TO_REMOVE’:remove_ftr}
li_remove_scores.append(OrderedDict(di))

df_remove_scores = pd.DataFrame(li_remove_scores)
print(“Removing %d features (see last column):” % len(li_remove_pair_ftrs))
print(df_remove_scores.to_string())
print(“—“)

Remove the features that were found in the above procedure

li_X2_cols = list(set(li_X1_cols) – set(li_remove_pair_ftrs))
li_X2_cols.sort()

print(“Remaining %d features:” % (len(li_X2_cols)))
for i,v in enumerate(s_corr_target_sort):
ftr = s_corr_target_sort.index[i]

if ftr in li_X2_cols:
print(i,np.round(v, DISPLAY_PRECISION), ftr)

Removing 16 features (see last column):
                  ftr_0                    ftr_1  score_0  score_1        FEATURE_TO_REMOVE
0           mean radius           mean perimeter   0.7190   0.7318              mean radius
1       worst perimeter             worst radius   0.7723   0.7661             worst radius
2             mean area              mean radius   0.6932   0.7190                mean area
3            worst area             worst radius   0.7194   0.7661               worst area
4          radius error          perimeter error   0.5574   0.5465          perimeter error
5       worst perimeter           mean perimeter   0.7723   0.7318           mean perimeter
6          radius error               area error   0.5574   0.5264               area error
7        mean concavity      mean concave points   0.7111   0.7794           mean concavity
8          mean texture            worst texture   0.3883   0.4347             mean texture
9   mean concave points     worst concave points   0.7794   0.8024      mean concave points
10    worst compactness          worst concavity   0.6062   0.6947        worst compactness
11       mean concavity         mean compactness   0.7111   0.6196         mean compactness
12       mean concavity          worst concavity   0.7111   0.6947          worst concavity
13      worst perimeter      mean concave points   0.7723   0.7794          worst perimeter
14    worst compactness  worst fractal dimension   0.6062   0.3272  worst fractal dimension
15      mean smoothness         worst smoothness   0.3730   0.4324          mean smoothness
---
Remaining 9 features:
1 0.8024 worst concave points
13 0.5574 radius error
16 0.4347 worst texture
17 0.4324 worst smoothness
18 0.4255 worst symmetry
19 0.3902 concave points error
22 0.3361 mean symmetry
24 0.2823 compactness error
25 0.2322 concavity error
Let’s calculate the correlation matrix on the subset of features

X2 = X1[li_X2_cols]
print(“After the pair feature reduction, X2.shape:”, X2.shape)
yX2, yX_corr2, yX_abs_corr2 = correlation_matrix(y_train, X2)

Recalculate the correlation matrix to plot the TARGET values grouped by the order of correlation

s_X3_cols = yX_abs_corr2[‘TARGET’].sort_values(ascending=False)
li_X3_cols = s_X3_cols.index.tolist()
print(“Remaining features:”)
print(s_X3_cols)
print(“—“)
li_X3_cols.remove(‘TARGET’)

X3 = X2[li_X3_cols]
print(“After the pair feature reduction, X2.shape:”, X3.shape)
yX3, yX_corr3, yX_abs_corr3 = correlation_matrix(y_train, X3, is_plot=True)

X_train = X3
X_test = X_test_0[li_X3_cols]

After the pair feature reduction, X2.shape: (398, 9)
Function correlation_matrix: X.shape, y.shape, yX.shape: (398, 9) (398,) (398, 10)

Remaining features:
TARGET                  1.0000
worst concave points    0.8024
radius error            0.5574
worst texture           0.4347
worst smoothness        0.4324
worst symmetry          0.4255
concave points error    0.3902
mean symmetry           0.3361
compactness error       0.2823
concavity error         0.2322
Name: TARGET, dtype: float64
---
After the pair feature reduction, X2.shape: (398, 9)
Function correlation_matrix: X.shape, y.shape, yX.shape: (398, 9) (398,) (398, 10)
Pearson correlation heat map on the subset of features

Let’s create the sns joint plot for the 9 remaining dominant features as follows:

sns.jointplot(yX3[‘concavity error’], yX3[‘TARGET’], kind=’scatter’, marginal_kws=dict(bins=12, rug=True))
plt.savefig(‘snsscatterconcavityerror.png’)

the M and B distributions in the target-feature domain - concavity error

sns.jointplot(yX3[‘compactness error’], yX3[‘TARGET’], kind=’scatter’, marginal_kws=dict(bins=12, rug=True))
plt.savefig(‘snsscattercompactnesserror.png’)

the M and B distributions in the target-feature domain - compactness error

sns.jointplot(yX3[‘mean symmetry’], yX3[‘TARGET’], kind=’scatter’, marginal_kws=dict(bins=12, rug=True))
plt.savefig(‘snsscattermeansymmetry.png’)

the M and B distributions in the target-feature domain - mean symmetry

sns.jointplot(yX3[‘concave points error’], yX3[‘TARGET’], kind=’scatter’, marginal_kws=dict(bins=12, rug=True))
plt.savefig(‘snsscatterconcavepointserror.png’)

the M and B distributions in the target-feature domain - concave points error

sns.jointplot(yX3[‘worst symmetry’], yX3[‘TARGET’], kind=’scatter’, marginal_kws=dict(bins=12, rug=True))
plt.savefig(‘snsscatterworstsymmetry.png’)

the M and B distributions in the target-feature domain - worst symmetry

sns.jointplot(yX3[‘worst smoothness’], yX3[‘TARGET’], kind=’scatter’, marginal_kws=dict(bins=12, rug=True))
plt.savefig(‘snsscatterworstsmoothness.png’)

the M and B distributions in the target-feature domain - worst smoothness

sns.jointplot(yX3[‘worst texture’], yX3[‘TARGET’], kind=’scatter’, marginal_kws=dict(bins=12, rug=True))
plt.savefig(‘snsscatterworsttexture.png’)

the M and B distributions in the target-feature domain - worst texture

sns.jointplot(yX3[‘radius error’], yX3[‘TARGET’], kind=’scatter’, marginal_kws=dict(bins=12, rug=True))
plt.savefig(‘snsscatterradiuserror.png’)

the M and B distributions in the target-feature domain - radius error

sns.jointplot(yX3[‘worst concave points’], yX3[‘TARGET’], kind=’scatter’, marginal_kws=dict(bins=12, rug=True))
plt.savefig(‘snsscatterworstconcavepoints.png’)

the M and B distributions in the target-feature domain: worst concave points

We can see a significant overlap between the M and B distributions in the target-feature domain.

Let’s plot the joint scatter matrix of target and model features

color_map = {0: ‘#0392cf’, 1: ‘#7bc043’} # 0 (negative class): blue, 1 (positive class): green
colors = yX3[‘TARGET’].map(lambda x: color_map.get(x))
pd.plotting.scatter_matrix(yX3, alpha=0.5, color=colors, figsize=(16,18), diagonal=’hist’, hist_kwds={‘bins’:12})
plt.suptitle(‘Scatter matrix of target and model features’)

#plt.show()

plt.savefig(‘scattermatrixjointfeatures.png’)

Croos-plot joint correlation matrix grouped by target.

Let’s print the train data and the class distribution of TARGET for both train and test sets

print(“X_train.shape, y_train.shape”, X_train.shape, y_train.shape)

val_cnts = y_train.value_counts()
print(“Class distribution of positive and negative samples in the train set:”)
print(val_cnts)
print(“Percentage of positive class samples: %s” % “%2f%%” % (100 * val_cnts[1] / len(y_train)))

print(“—“)
print(“X_test.shape, y_test.shape”, X_test.shape, y_test.shape)

val_cnts = y_test.value_counts()
print(“Class distribution of positive and negative samples in the test set:”)
print(val_cnts)
print(“Percentage of positive class samples: %s” % “%2f%%” % (100 * val_cnts[1] / len(y_test)))

X_train.shape, y_train.shape (398, 9) (398,)
Class distribution of positive and negative samples in the train set:
0    249
1    149
dtype: int64
Percentage of positive class samples: 37.437186%
---
X_test.shape, y_test.shape (171, 9) (171,)
Class distribution of positive and negative samples in the test set:
0    108
1     63
dtype: int64
Percentage of positive class samples: 36.842105%

Let’s define the HPO plotting function:

Plot 2d grid search heatmap

Parameters:
grid_search: instance of sklearn’s GridSearchCV
grid_params: dictionary of grid search parameters
x_param: name of x-axis parameter in grid_params
y_param: name of y-axis parameter in grid_params
is_verbose (optional): print results

Return:
grid_search.best_score_: best score found
grid_search.best_estimator_: best estimator found

def plot_2d_grid_search_heatmap(grid_search, grid_params, x_param, y_param, is_verbose=True):

grid_params_x = grid_params[x_param]
grid_params_y = grid_params[y_param]

df_results = pd.DataFrame(grid_search.cv_results_)
ar_scores = np.array(df_results.mean_test_score).reshape(len(grid_params_y), len(grid_params_x))
sns.heatmap(ar_scores, annot=True, fmt=’.3f’, xticklabels=grid_params_x, yticklabels=grid_params_y)
print()
plt.suptitle(‘Grid search heatmap’)
plt.xlabel(x_param)
plt.ylabel(y_param)
plt.savefig(“gridsearchheatmapgbm.png”)

if is_verbose:
print(“grid_search.best_score_:”)
print(grid_search.best_score_)
print()
print(“grid_search.best_estimator_:”)
print(grid_search.best_estimator_)

return grid_search.best_score_, grid_search.best_estimator_

Let’s run the LR classifier with GridSearchCV

grid_lr = {‘C’: [0.001, 0.01, 0.1, 1, 10, 100, 1000], ‘penalty’: [‘l1’, ‘l2’]}
clf_lr = LogisticRegression(class_weight=’balanced’, dual=False,
fit_intercept=True, intercept_scaling=1, max_iter=200,
n_jobs=1, random_state=0, tol=0.0001, verbose=0, warm_start=False)

gs_lr = GridSearchCV(clf_lr, grid_lr, return_train_score=True)

gs_lr.fit(X_train, y_train)

best_score_lr, clf_lr = plot_2d_grid_search_heatmap(gs_lr, grid_lr, ‘C’, ‘penalty’)

print(“intercept_:”)
print(clf_lr.intercept_ )
print()
print(“coef_:”)
print(clf_lr.coef_)

grid_search.best_score_:
0.9548417721518987

grid_search.best_estimator_:
LogisticRegression(C=1000, class_weight='balanced', max_iter=200, n_jobs=1,
                   random_state=0)
intercept_:
[-20.60103]

coef_:
[[ 78.14404612  15.64423206   0.25956452 -10.02229477  14.54515794
  -13.02308961 -13.84545081 -47.97567754 -12.68950602]]
Grid search heatmap - LR

Let’s run the GB classifier with GridSearchCV

grid_gb = {‘min_samples_leaf’: [2, 4, 8, 16], ‘learning_rate’: [0.001, 0.01, 0.1]}

clf_gbm = GradientBoostingClassifier(criterion=’friedman_mse’,
init=None, loss=’deviance’, max_features=’sqrt’, max_leaf_nodes=None,
max_depth=3, min_impurity_decrease=0.0,
min_samples_split=2, min_weight_fraction_leaf=0.0, n_estimators=800,
random_state=0, subsample=1.0, verbose=0, warm_start=False)

gs_gb = GridSearchCV(clf_gbm, grid_gb, verbose=0, return_train_score=True)

gs_gb.fit(X_train, y_train)

best_score_gb, clf_gb = plot_2d_grid_search_heatmap(gs_gb, grid_gb, ‘min_samples_leaf’, ‘learning_rate’)

grid_search.best_score_:
0.9447468354430381

grid_search.best_estimator_:
GradientBoostingClassifier(learning_rate=0.01, max_features='sqrt',
                           min_samples_leaf=2, n_estimators=800,
                           random_state=0) 
Grid search heatmap - GBM

Let’s create the std train data frame

dfimp = pd.DataFrame(np.std(X_train, 0), columns=[‘std’])

and calculate the importance factors to rank them from high to low values of importance

Logistic regression:
dfimp[‘coef’] = clf_lr.coef_.T
dfimp[‘lr_sign_imp’] = (np.std(X_train, 0).ravel() * clf_lr.coef_).T
dfimp[‘lr_imp’] = np.abs(dfimp[‘lr_sign_imp’])

GBM:
dfimp[‘gb_imp’] = pd.Series(clf_gb.feature_importances_, X_train.columns)

Add a column for the correlation with target
dfimp[‘target_corr’] = s_X3_cols.drop(‘TARGET’)

and perform ranking

dfimp[‘lr_rank’] = dfimp[‘lr_imp’].rank(ascending=False)
dfimp[‘gb_rank’] = dfimp[‘gb_imp’].rank(ascending=False)
dfimp[‘target_corr_rank’] = dfimp[‘target_corr’].rank(ascending=False)

dfsort = dfimp.sort_values(‘target_corr_rank’, ascending=True)
print(dfsort.iloc[:20,:].to_string())

                         std     coef  lr_sign_imp  lr_imp  gb_imp  target_corr  lr_rank  gb_rank  target_corr_rank
worst concave points  0.0656  78.1440       5.1267  5.1267  0.4893       0.8024      1.0      1.0               1.0
radius error          0.2936  15.6442       4.5931  4.5931  0.2137       0.5574      2.0      2.0               2.0
worst texture         6.1397   0.2596       1.5937  1.5937  0.0878       0.4347      3.0      3.0               3.0
worst smoothness      0.0230 -10.0223      -0.2301  0.2301  0.0400       0.4324      8.0      6.0               4.0
worst symmetry        0.0634  14.5452       0.9216  0.9216  0.0286       0.4255      4.0      7.0               5.0
concave points error  0.0059 -13.0231      -0.0762  0.0762  0.0567       0.3902      9.0      5.0               6.0
mean symmetry         0.0280 -13.8455      -0.3880  0.3880  0.0092       0.3361      6.0      9.0               7.0
compactness error     0.0168 -47.9757      -0.8066  0.8066  0.0141       0.2823      5.0      8.0               8.0
concavity error       0.0285 -12.6895      -0.3622  0.3622  0.0605       0.2322      7.0      4.0               9.0

Let’s look at the test data

cols_top = dfsort.index[:7]
print(“[y_test, X_test]”)
yX_test = pd.concat([y_test.iloc[-10:], X_test[cols_top].iloc[-10:,:]], axis=1)
yX_test = yX_test.rename(columns={0: ‘TARGET’})
print(yX_test.to_string())

[y_test, X_test]
     TARGET  worst concave points  radius error  worst texture  worst smoothness  worst symmetry  concave points error  mean symmetry
334       0                0.0398        0.1840          28.46            0.1222          0.2554                0.0065         0.1539
440       0                0.1555        0.2574          26.87            0.1391          0.2540                0.0177         0.1489
441       1                0.1739        0.5100          35.46            0.1436          0.2500                0.0174         0.1467
137       0                0.0848        0.1759          22.02            0.1190          0.2676                0.0086         0.1734
230       1                0.2543        0.2959          24.89            0.1703          0.3109                0.0138         0.2131
7         1                0.1556        0.5835          28.14            0.1654          0.3196                0.0145         0.2196
408       1                0.1974        0.4537          25.41            0.1482          0.3060                0.0148         0.1992
523       0                0.1284        0.3191          25.63            0.1425          0.2849                0.0135         0.1714
361       0                0.0561        0.2621          29.20            0.1140          0.2637                0.0064         0.1815
553       0                0.0256        0.3013          25.05            0.1103          0.2435                0.0128         0.1692

Let’s created the integrated LR-GBM HPO performance classification report

li_clfs = [‘clf_lr’, ‘clf_gb’]
dfp = pd.DataFrame(index=[‘TARGET’], data=[y_test]).T

for s_clf in li_clfs:
print(“MODEL: ” + s_clf)
print(“————–“)
clf = eval(s_clf)

y_pred = clf.predict(X_test).astype(int)  # returns a class decision based on the value of the predicted probability
y_score = clf.predict_proba(X_test)  # returns the value of the predicted probability

s_class = "%s_class" % s_clf
s_proba = "%s_proba" % s_clf
s_rank = "%s_rank" % s_clf

dfp[s_class] = y_pred
dfp[s_proba] = y_score[:,1]
dfp[s_rank] = dfp[s_proba].rank(ascending=1).astype(int)

# Print confusion matrix & classification report
# from sklearn.metrics import confusion_matrix
# cm = confusion_matrix(y_test, y_pred)

# Pandas 'crosstab' displays a better formatted confusion matrix than the one in sklearn
cm = pd.crosstab(y_test, y_pred, rownames=['Reality'], colnames=['Predicted'], margins=True)
print(cm) 

print()
print("Classification report:")
print(classification_report(y_test, y_pred))
if s_clf == 'clf_lr':
    y_score_lr = y_score.copy() 
elif s_clf == 'clf_gb':
    y_score_gb = y_score.copy()
else:
    print('Error')
    break 
MODEL: clf_lr
--------------
Predicted    0   1  All
Reality                
0          101   7  108
1            4  59   63
All        105  66  171

Classification report:
              precision    recall  f1-score   support

           0       0.96      0.94      0.95       108
           1       0.89      0.94      0.91        63

    accuracy                           0.94       171
   macro avg       0.93      0.94      0.93       171
weighted avg       0.94      0.94      0.94       171

MODEL: clf_gb
--------------
Predicted    0   1  All
Reality                
0          103   5  108
1            7  56   63
All        110  61  171
Classification report:
              precision    recall  f1-score   support

           0       0.94      0.95      0.94       108
           1       0.92      0.89      0.90        63

    accuracy                           0.93       171
   macro avg       0.93      0.92      0.92       171
weighted avg       0.93      0.93      0.93       171

print(dfp.tail(10).to_string())

     TARGET  clf_lr_class  clf_lr_proba  clf_lr_rank  clf_gb_class  clf_gb_proba  clf_gb_rank
334       0             0        0.0006           30             0        0.0079           58
440       0             0        0.3821          101             1        0.7771          113
441       1             1        0.9994          137             1        0.9928          147
137       0             0        0.0020           44             0        0.0032            6
230       1             1        0.9998          147             1        0.9350          119
7         1             1        0.9975          131             1        0.9949          157
408       1             1        0.9990          135             1        0.9924          144
523       0             0        0.4748          105             0        0.2996          103
361       0             0        0.0035           52             0        0.0045           29
553       0             0        0.0001           13             0        0.0041           21

Let’s summarize the statistics

RANK_BINS = 10
rank_cols = [‘i_bin’, ‘i_min_bin’, ‘i_max_bin’, ‘score_min’, ‘score_max’, ‘bin_cnt’, ‘pos_cnt’, ‘pos_rate’]
rank_cols2 = [‘i_bin’, ‘score_min’, ‘score_max’, ‘tnr’, ‘fpr’, ‘fnr’, ‘tpr’, ‘tn’, ‘fp’, ‘fn’, ‘tp’]

print(dfp.shape)
len_test = dfp.shape[0]
len_bin = int(len_test / RANK_BINS)

for s_clf in li_clfs:
i_min_bin = 0
i_max_bin = 0
dfr = pd.DataFrame(columns=rank_cols)
dfr2 = pd.DataFrame(columns=rank_cols2)

s_class = "%s_class" % s_clf
s_proba = "%s_proba" % s_clf
s_rank = "%s_rank" % s_clf

for i in range(RANK_BINS):
    if i == RANK_BINS - 1:
        i_max_bin = len_test
    else:
        i_max_bin += len_bin

    # Range used for *each* score bin
    df_rng = dfp[(dfp[s_rank] >= i_min_bin) & (dfp[s_rank] < i_max_bin)]

    score_min = np.min(df_rng[s_proba])
    score_max = np.max(df_rng[s_proba])
    bin_cnt = df_rng.shape[0]

    pos_cnt = len(df_rng[df_rng['TARGET'] == 1])
    pos_rate = pos_cnt / bin_cnt

    # Range used for *all* score bins up to i_max_bin
    df_rng_0 = dfp[dfp[s_rank] < i_max_bin]
    df_rng_1 = dfp[dfp[s_rank] >= i_max_bin]

    # Positive targets (positive customers)
    tp = len(df_rng_1[df_rng_1['TARGET'] == 1])
    fn = len(df_rng_0[df_rng_0['TARGET'] == 1])
    pos = tp + fn
    tpr = tp / pos
    fnr = 1 - tpr

    # Negative targets
    tn = len(df_rng_0[df_rng_0['TARGET'] == 0])
    fp = len(df_rng_1[df_rng_1['TARGET'] == 0])
    neg = tn + fp
    fpr = fp / neg      
    tnr = 1 - fpr

    # Build the dataframe for the summary stats per bin
    row = [i, i_min_bin, i_max_bin, score_min, score_max, bin_cnt, pos_cnt, pos_rate]
    dfr.loc[i] = row

    # Build the dataframe for the ROC stats per bin
    row2 = [i, score_min, score_max, tnr, fpr, fnr, tpr, tn, fp, fn, tp]
    dfr2.loc[i] = row2

    # Prep for next iteration
    i_min_bin = i_max_bin 

if s_clf == 'clf_lr':
    dfr_lr = dfr.copy() 
    dfr2_lr = dfr2.copy() 
elif s_clf == 'clf_gb':
    dfr_gb = dfr.copy()
    dfr2_gb = dfr2.copy()
else:
    print('Error')
    break

print(“——-“)
print(“Results of positive count stats per bin”)
print(“Logistic Regression stats:”)
print(dfr_lr.to_string())
print()
print(“GBM stats:”)
print(dfr_gb.to_string())

print()
print(“—“)
print(“Overall stats in the test set:”)
pos_cnt = len(dfp[dfp[‘TARGET’] == 1])
pos_rate = pos_cnt / len_test

print(“pos_cnt”, pos_cnt)
print(“total rows”, len_test)
print(“pos_rate”, pos_rate)

(171, 7)
-------
Results of positive count stats per bin
Logistic Regression stats:
   i_bin  i_min_bin  i_max_bin   score_min  score_max  bin_cnt  pos_cnt  pos_rate
0    0.0        0.0       17.0  1.4302e-06     0.0002     16.0      0.0    0.0000
1    1.0       17.0       34.0  1.9229e-04     0.0007     17.0      0.0    0.0000
2    2.0       34.0       51.0  7.1944e-04     0.0033     17.0      0.0    0.0000
3    3.0       51.0       68.0  3.5092e-03     0.0103     17.0      0.0    0.0000
4    4.0       68.0       85.0  1.2065e-02     0.0481     17.0      1.0    0.0588
5    5.0       85.0      102.0  4.8669e-02     0.3821     17.0      3.0    0.1765
6    6.0      102.0      119.0  3.9533e-01     0.9029     17.0      6.0    0.3529
7    7.0      119.0      136.0  9.4799e-01     0.9990     17.0     17.0    1.0000
8    8.0      136.0      153.0  9.9925e-01     1.0000     17.0     17.0    1.0000
9    9.0      153.0      171.0  9.9999e-01     1.0000     18.0     18.0    1.0000

GBM stats:
   i_bin  i_min_bin  i_max_bin  score_min  score_max  bin_cnt  pos_cnt  pos_rate
0    0.0        0.0       17.0     0.0023     0.0040     16.0      0.0    0.0000
1    1.0       17.0       34.0     0.0040     0.0047     17.0      0.0    0.0000
2    2.0       34.0       51.0     0.0047     0.0058     17.0      0.0    0.0000
3    3.0       51.0       68.0     0.0059     0.0101     17.0      0.0    0.0000
4    4.0       68.0       85.0     0.0109     0.0341     17.0      2.0    0.1176
5    5.0       85.0      102.0     0.0362     0.2315     17.0      4.0    0.2353
6    6.0      102.0      119.0     0.2659     0.9298     17.0      5.0    0.2941
7    7.0      119.0      136.0     0.9350     0.9880     17.0     16.0    0.9412
8    8.0      136.0      153.0     0.9895     0.9945     17.0     17.0    1.0000
9    9.0      153.0      171.0     0.9946     0.9964     18.0     18.0    1.0000

---
Overall stats in the test set:
pos_cnt 63
total rows 171
pos_rate 0.3684210526315789

Let’s compare the percentiles for LR and GBM

s_titles = [‘Logistic Regression’, ‘GBM’]

plt.figure(figsize=(12, 4))
for i, s_clf in enumerate(li_clfs):
s_title = s_titles[i]

if s_clf == 'clf_lr':
    dfr = dfr_lr 
    dfr2 = dfr2_lr 
elif s_clf == 'clf_gb':
    dfr = dfr_gb
    dfr2 = dfr2_gb
else:
    print('Error')
    break

plt.subplot(1, 2, i+1)
plt.bar(dfr['i_bin'], dfr['pos_rate'])
plt.title('Percentiles for ' + s_title)

# Format charts
plt.grid()
plt.xlabel('Score bin (from low to high scoring bins)')
plt.ylabel('Ratio of positive classes per bin')
plt.legend(loc="lower right")

plt.xlim([0, RANK_BINS])
plt.ylim([0.0, 1])
#plt.show()

plt.savefig(“percentileslrgbm.png”)

Percentiles for Logistic Regression (LR) and GBM

It is clear that LR has the highest score and POS.

Let’s check the ROC curves

print(“Results of ROC stats per bin”)
print(“Logistic Regression stats:”)
print(dfr2_lr.to_string())
print()
print(“GBM stats:”)
print(dfr2_gb.to_string())

Results of ROC stats per bin
Logistic Regression stats:
   i_bin   score_min  score_max     tnr     fpr     fnr     tpr     tn    fp    fn    tp
0    0.0  1.4302e-06     0.0002  0.1481  0.8519  0.0000  1.0000   16.0  92.0   0.0  63.0
1    1.0  1.9229e-04     0.0007  0.3056  0.6944  0.0000  1.0000   33.0  75.0   0.0  63.0
2    2.0  7.1944e-04     0.0033  0.4630  0.5370  0.0000  1.0000   50.0  58.0   0.0  63.0
3    3.0  3.5092e-03     0.0103  0.6204  0.3796  0.0000  1.0000   67.0  41.0   0.0  63.0
4    4.0  1.2065e-02     0.0481  0.7685  0.2315  0.0159  0.9841   83.0  25.0   1.0  62.0
5    5.0  4.8669e-02     0.3821  0.8981  0.1019  0.0635  0.9365   97.0  11.0   4.0  59.0
6    6.0  3.9533e-01     0.9029  1.0000  0.0000  0.1587  0.8413  108.0   0.0  10.0  53.0
7    7.0  9.4799e-01     0.9990  1.0000  0.0000  0.4286  0.5714  108.0   0.0  27.0  36.0
8    8.0  9.9925e-01     1.0000  1.0000  0.0000  0.6984  0.3016  108.0   0.0  44.0  19.0
9    9.0  9.9999e-01     1.0000  1.0000  0.0000  0.9841  0.0159  108.0   0.0  62.0   1.0
GBM stats:
   i_bin  score_min  score_max     tnr     fpr     fnr     tpr     tn    fp    fn    tp
0    0.0     0.0023     0.0040  0.1481  0.8519  0.0000  1.0000   16.0  92.0   0.0  63.0
1    1.0     0.0040     0.0047  0.3056  0.6944  0.0000  1.0000   33.0  75.0   0.0  63.0
2    2.0     0.0047     0.0058  0.4630  0.5370  0.0000  1.0000   50.0  58.0   0.0  63.0
3    3.0     0.0059     0.0101  0.6204  0.3796  0.0000  1.0000   67.0  41.0   0.0  63.0
4    4.0     0.0109     0.0341  0.7593  0.2407  0.0317  0.9683   82.0  26.0   2.0  61.0
5    5.0     0.0362     0.2315  0.8796  0.1204  0.0952  0.9048   95.0  13.0   6.0  57.0
6    6.0     0.2659     0.9298  0.9907  0.0093  0.1746  0.8254  107.0   1.0  11.0  52.0
7    7.0     0.9350     0.9880  1.0000  0.0000  0.4286  0.5714  108.0   0.0  27.0  36.0
8    8.0     0.9895     0.9945  1.0000  0.0000  0.6984  0.3016  108.0   0.0  44.0  19.0
9    9.0     0.9946     0.9964  1.0000  0.0000  0.9841  0.0159  108.0   0.0  62.0   1.0

Let’s define and call the plot function

def plot_roc_and_precision_recall(y_true, y_score):
# Get ROC curve FPR and TPR from true labels vs score values
fpr, tpr, _ = roc_curve(y_true, y_score)

# Calculate ROC Area Under the Curve (AUC) from FPR and TPR data points
roc_auc = auc(fpr, tpr)

# Calculate precision and recall from true labels vs score values
precision, recall, _ = precision_recall_curve(y_true, y_score)

plt.figure(figsize=(8, 3))

plt.subplot(1,2,1)
lw = 2
plt.plot(fpr, tpr, color=’darkorange’, lw=lw, label=’ROC curve (AUC = %0.4f)’ % roc_auc)
plt.plot([0, 1], [0, 1], color=’navy’, lw=lw, linestyle=’–‘)
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel(‘False Positive Rate’)
plt.ylabel(‘True Positive Rate’)
plt.title(‘ROC Curve’)
plt.legend(loc=”lower right”)
plt.grid(True)

plt.subplot(1,2,2)
plt.step(recall, precision, color=’orange’, where=’post’)
# plt.fill_between(recall, precision, step=’post’, alpha=0.5, color=’orange’)
plt.xlabel(‘Recall’)
plt.ylabel(‘Precision’)
plt.ylim([0.0, 1.05])
plt.xlim([0.0, 1.0])
plt.title(‘Precision Recall Curve’)
plt.grid(True)

left = 0.125 # the left side of the subplots of the figure
right = 0.9 # the right side of the subplots of the figure
bottom = 0.1 # the bottom part of the subplots of the figure
top = 0.9 # the top part of the subplots of the figure
wspace = 0.5 # the amount of width reserved for blank space between subplots
hspace = 0.2 # the amount of height reserved for white space between subplots
plt.subplots_adjust(left, bottom, right, top, wspace, hspace)
plt.show()

for s_clf in li_clfs:
print(“MODEL: ” + s_clf)
print(“————–“)
clf = eval(s_clf)

y_pred = clf.predict(X_test).astype(int)
y_score = clf.predict_proba(X_test)

plot_roc_and_precision_recall(y_test, y_score[:,1])  # provide the column for the scores belonging only to the positive class
print()
MODEL: clf_lr
--------------
ROC and Precision-Recall Curves - LR
MODEL: clf_gb
--------------
ROC and Precision-Recall Curves - GBM

We can see that AUC(LR)>AUC(GBM).

Summary

  • Train/test data splitting ratio 70:30%
  • Feature engineering yields 9 dominant features

worst concave points 0.8024

radius error 0.5574

worst texture 0.4347

worst smoothness 0.4324

worst symmetry 0.4255

concave points error 0.3902

mean symmetry 0.3361

compactness error 0.2823

concavity error 0.2322

  • LR Classification report:
  • precision recall f1-score support
  • 0 0.96 0.94 0.95 108
  • 1 0.89 0.94 0.91 63
  • accuracy 0.94 171
  • GB Classification report:
  • precision recall f1-score support
  • 0 0.94 0.95 0.94 108
  • 1 0.92 0.89 0.90 63
  • accuracy 0.93 171
  • Percentiles for LR vs GBM: LR has the highest score.
  • AUC(LR)=0.987 > AUC(GBM)=0.975
  • Results are consistent with the previous ML R&D study.

Read More

Colab Research Example using scikit-learn: Breast cancer prediction

A Comparative Analysis of Breast Cancer ML/AI Binary Classifications

Supervised ML/AI Breast Cancer Diagnostics (BCD) – The Power of HealthTech

HealthTech ML/AI Use-Cases

How to find the importance of the features for a logistic regression model?

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

%d bloggers like this: