A Comparison of Scikit Learn Algorithms for Breast Cancer Classification – 2. Cross Validation vs Performance

Featured Photo by Tara Winstead @ Pexels.

This post continues the previous breast cancer (BC) study, comparing the available Scikit-Learn binary classifiers (Logistic Regression, GaussianNB, SVC, KNN, Random Forest, Extra Trees, and Gradient Boosting) in terms of cross-validation scores and model performance/scalability.

Contents:

  1. Cross-Validation Score
  2. Learning Curves
  3. Model Scalability
  4. Model Performance
  5. Summary
  6. Explore More
  7. Appendix

Let's set the working directory YOURPATH

import os
os.chdir('YOURPATH')
os.getcwd()

and import the key libraries

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

import matplotlib.pyplot as plt
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.model_selection import learning_curve
from sklearn.model_selection import ShuffleSplit

from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, ExtraTreesClassifier

and prepare the BC data

# get the BC data
cancer = load_breast_cancer()
X_cancer = cancer['data']
y_cancer = cancer['target']

# normalize the data
scaler = MinMaxScaler()
X_cancer_scaled = scaler.fit_transform(X_cancer)

Cross-Validation Score

Let's apply Logistic Regression and get its 5-fold CV score

clf = LogisticRegression()

cv_scores = cross_val_score(clf, X_cancer_scaled, y_cancer, cv = 5)

print('Cross validation scores (5 folds): {}'.format(cv_scores))
print('The average cross validation score (5 folds): {}'.format(np.mean(cv_scores)))

Cross validation scores (5 folds): [0.95614035 0.96491228 0.97368421 0.95614035 0.96460177]
The average cross validation score (5 folds): 0.9630957925787922
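For a classifier, passing an integer `cv` to `cross_val_score` uses stratified folds under the hood. A quick sketch of the equivalent explicit loop with `StratifiedKFold`, reproducing the same fold scores:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.preprocessing import MinMaxScaler

cancer = load_breast_cancer()
X = MinMaxScaler().fit_transform(cancer['data'])
y = cancer['target']

# cv=5 on a classifier means StratifiedKFold(n_splits=5) without shuffling
manual_scores = []
for train_idx, test_idx in StratifiedKFold(n_splits=5).split(X, y):
    clf = LogisticRegression().fit(X[train_idx], y[train_idx])
    manual_scores.append(clf.score(X[test_idx], y[test_idx]))

cv_scores = cross_val_score(LogisticRegression(), X, y, cv=5)
print(np.allclose(manual_scores, cv_scores))
```

This makes explicit that each fold's score comes from an estimator refit from scratch on the remaining four folds.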

estimator = GaussianNB()

cv_scores = cross_val_score(estimator, X_cancer_scaled, y_cancer, cv = 5)

print('Cross validation scores (5 folds): {}'.format(cv_scores))
print('The average cross validation score (5 folds): {}'.format(np.mean(cv_scores)))

Cross validation scores (5 folds): [0.90350877 0.9122807  0.95614035 0.94736842 0.92035398]
The average cross validation score (5 folds): 0.927930445582984

estimator = SVC()

cv_scores = cross_val_score(estimator, X_cancer_scaled, y_cancer, cv = 5)

print('Cross validation scores (5 folds): {}'.format(cv_scores))
print('The average cross validation score (5 folds): {}'.format(np.mean(cv_scores)))

Cross validation scores (5 folds): [0.96491228 0.96491228 0.99122807 0.96491228 0.98230088]
The average cross validation score (5 folds): 0.9736531594472908

from sklearn.neighbors import KNeighborsClassifier
estimator = KNeighborsClassifier(n_neighbors=6)
cv_scores = cross_val_score(estimator, X_cancer_scaled, y_cancer, cv = 5)

print('Cross validation scores (5 folds): {}'.format(cv_scores))
print('The average cross validation score (5 folds): {}'.format(np.mean(cv_scores)))

Cross validation scores (5 folds): [0.95614035 0.94736842 0.99122807 0.96491228 0.96460177]
The average cross validation score (5 folds): 0.9648501785437043

estimator = RandomForestClassifier()

cv_scores = cross_val_score(estimator, X_cancer_scaled, y_cancer, cv = 5)
print('Cross validation scores (5 folds): {}'.format(cv_scores))
print('The average cross validation score (5 folds): {}'.format(np.mean(cv_scores)))

Cross validation scores (5 folds): [0.92982456 0.94736842 0.99122807 0.97368421 0.97345133]
The average cross validation score (5 folds): 0.9631113181183046

estimator = GradientBoostingClassifier()

cv_scores = cross_val_score(estimator, X_cancer_scaled, y_cancer, cv = 5)
print('Cross validation scores (5 folds): {}'.format(cv_scores))
print('The average cross validation score (5 folds): {}'.format(np.mean(cv_scores)))

Cross validation scores (5 folds): [0.92982456 0.93859649 0.97368421 0.98245614 0.98230088]
The average cross validation score (5 folds): 0.9613724576929048

estimator = ExtraTreesClassifier()

cv_scores = cross_val_score(estimator, X_cancer_scaled, y_cancer, cv = 5)
print('Cross validation scores (5 folds): {}'.format(cv_scores))
print('The average cross validation score (5 folds): {}'.format(np.mean(cv_scores)))

Cross validation scores (5 folds): [0.94736842 0.96491228 0.98245614 0.97368421 0.96460177]
The average cross validation score (5 folds): 0.9666045645086166
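The per-model cells above can be condensed into a single loop. A minimal sketch (note that the tree-ensemble scores vary slightly from run to run unless random_state is fixed, which the cells above do not do):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import (ExtraTreesClassifier, GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

cancer = load_breast_cancer()
X = MinMaxScaler().fit_transform(cancer['data'])
y = cancer['target']

classifiers = {
    'LogisticRegression': LogisticRegression(),
    'GaussianNB': GaussianNB(),
    'SVC': SVC(),
    'KNeighborsClassifier': KNeighborsClassifier(n_neighbors=6),
    'RandomForestClassifier': RandomForestClassifier(random_state=0),
    'GradientBoostingClassifier': GradientBoostingClassifier(random_state=0),
    'ExtraTreesClassifier': ExtraTreesClassifier(random_state=0),
}

# mean 5-fold CV score per classifier, highest first
results = {name: cross_val_score(est, X, y, cv=5).mean()
           for name, est in classifiers.items()}
for name, score in sorted(results.items(), key=lambda kv: -kv[1]):
    print(f'{name}: {score:.3f}')
```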

Learning Curves

Learning curves show how a classifier's training and cross-validation scores change as the number of training examples grows. Here we compare GaussianNB and SVC(gamma=0.001) using the function plot_learning_curve (cf. Appendix)

fig, axes = plt.subplots(3, 2, figsize=(10, 15))

title = "Learning Curves (Naive Bayes)"

We define cross-validation with 50 iterations to get smoother mean test and train score curves, each time with 20% of the data randomly selected as a validation set

cv = ShuffleSplit(n_splits=50, test_size=0.2, random_state=0)

estimator = GaussianNB()
plot_learning_curve(
    estimator,
    title,
    X_cancer_scaled,
    y_cancer,
    axes=axes[:, 0],
    ylim=(0.7, 1.01),
    cv=cv,
    n_jobs=4,
    scoring="accuracy",
)

title = r"Learning Curves (SVM, RBF kernel, $\gamma=0.001$)"

# SVC is more expensive to fit, so we use fewer CV iterations

cv = ShuffleSplit(n_splits=5, test_size=0.2, random_state=0)
estimator = SVC(gamma=0.001)
plot_learning_curve(
    estimator, title, X_cancer_scaled, y_cancer, axes=axes[:, 1], ylim=(0.7, 1.01), cv=cv, n_jobs=4
)

plt.show()

We can see that Training_score(NB) < Training_score(SVM), whereas CV_score(NB) > CV_score(SVM).

Model Scalability

Note that the function in the Appendix also returns fit_times, the time it took to fit the model at each training-set size. As expected, the more data, the longer the model takes to fit.

[Figure: fit_times vs. training examples for Naive Bayes (left) and SVM (right)]

We can see that fit_times(NB) << fit_times(SVM) at 400 training examples.
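The fit-time comparison can also be reproduced numerically without plotting, since learning_curve(..., return_times=True) returns the raw fit_times array (one row per training size, one column per CV split). A minimal sketch:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import ShuffleSplit, learning_curve
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

cancer = load_breast_cancer()
X = MinMaxScaler().fit_transform(cancer['data'])
y = cancer['target']

cv = ShuffleSplit(n_splits=5, test_size=0.2, random_state=0)
fit_time = {}
for name, est in [('GaussianNB', GaussianNB()), ('SVC', SVC(gamma=0.001))]:
    sizes, _, _, fit_times, _ = learning_curve(
        est, X, y, cv=cv, train_sizes=[0.25, 0.5, 1.0], return_times=True)
    # mean seconds to fit at each training-set size
    fit_time[name] = fit_times.mean(axis=1)
    print(name, sizes, fit_time[name])
```

Absolute timings depend on the machine, but the gap between the two models at the largest training size is what the scalability panel visualizes.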

Model Performance

[Figure: score vs. fit_times for Naive Bayes (left) and SVM (right)]

We can see that Score(NB) > Score(SVM) for fit_times < 0.0005 s.

Summary

No  Method                      CV Score  Rank
1   LogisticRegression          0.963     3
2   GaussianNB                  0.928     4
3   SVC                         0.973     1
4   KNeighborsClassifier        0.965     2
5   RandomForestClassifier      0.963     3
6   GradientBoostingClassifier  0.961     3
7   ExtraTreesClassifier        0.966     2

However, the learning-curve results show that GaussianNB is more efficient than SVC in terms of model scalability (fit time) and score per unit of fit time.

Explore More

A Comparison of Binary Classifiers for Enhanced ML/AI Breast Cancer Diagnostics – 1. Scikit-Plot

ML/AI Breast Cancer Diagnosis with 98% Confidence

Breast Cancer ML Classification – Logistic Regression vs Gradient Boosting with Hyperparameter Optimization (HPO)

HealthTech ML/AI Q3 ’22 Round-Up

Supervised ML/AI Breast Cancer Diagnostics (BCD) – The Power of HealthTech

Appendix

def plot_learning_curve(
    estimator,
    title,
    X,
    y,
    axes=None,
    ylim=None,
    cv=None,
    n_jobs=None,
    scoring=None,
    train_sizes=np.linspace(0.1, 1.0, 5),
):
    if axes is None:
        _, axes = plt.subplots(1, 3, figsize=(20, 5))

    axes[0].set_title(title)
    if ylim is not None:
        axes[0].set_ylim(*ylim)
    axes[0].set_xlabel("Training examples")
    axes[0].set_ylabel("Score")

    train_sizes, train_scores, test_scores, fit_times, _ = learning_curve(
        estimator,
        X,
        y,
        scoring=scoring,
        cv=cv,
        n_jobs=n_jobs,
        train_sizes=train_sizes,
        return_times=True,
    )
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)
    fit_times_mean = np.mean(fit_times, axis=1)
    fit_times_std = np.std(fit_times, axis=1)

    # Plot learning curve
    axes[0].grid()
    axes[0].fill_between(
        train_sizes,
        train_scores_mean - train_scores_std,
        train_scores_mean + train_scores_std,
        alpha=0.1,
        color="r",
    )
    axes[0].fill_between(
        train_sizes,
        test_scores_mean - test_scores_std,
        test_scores_mean + test_scores_std,
        alpha=0.1,
        color="g",
    )
    axes[0].plot(
        train_sizes, train_scores_mean, "o-", color="r", label="Training score"
    )
    axes[0].plot(
        train_sizes, test_scores_mean, "o-", color="g", label="Cross-validation score"
    )
    axes[0].legend(loc="best")

    # Plot n_samples vs fit_times
    axes[1].grid()
    axes[1].plot(train_sizes, fit_times_mean, "o-")
    axes[1].fill_between(
        train_sizes,
        fit_times_mean - fit_times_std,
        fit_times_mean + fit_times_std,
        alpha=0.1,
    )
    axes[1].set_xlabel("Training examples")
    axes[1].set_ylabel("fit_times")
    axes[1].set_title("Scalability of the model")

    # Plot fit_time vs score
    fit_time_argsort = fit_times_mean.argsort()
    fit_time_sorted = fit_times_mean[fit_time_argsort]
    test_scores_mean_sorted = test_scores_mean[fit_time_argsort]
    test_scores_std_sorted = test_scores_std[fit_time_argsort]
    axes[2].grid()
    axes[2].plot(fit_time_sorted, test_scores_mean_sorted, "o-")
    axes[2].fill_between(
        fit_time_sorted,
        test_scores_mean_sorted - test_scores_std_sorted,
        test_scores_mean_sorted + test_scores_std_sorted,
        alpha=0.1,
    )
    axes[2].set_xlabel("fit_times")
    axes[2].set_ylabel("Score")
    axes[2].set_title("Performance of the model")

    return plt
