A Comparison of Scikit Learn Algorithms for Breast Cancer Classification – 2. Cross Validation vs Performance

Featured Photo by Tara Winstead @ Pexels.

This post continues the previous breast cancer (BC) study, comparing the available Scikit-Learn binary classifiers (Logistic Regression, GaussianNB, SVC, KNN, Random Forest, Extra Trees, and Gradient Boosting) in terms of cross-validation scores and model performance/scalability.

Contents:

  1. Cross-Validation Score
  2. Learning Curves
  3. Model Scalability
  4. Model Performance
  5. Summary
  6. Explore More
  7. Appendix

Let's set the working directory YOURPATH

import os
os.chdir('YOURPATH')
os.getcwd()

and import the key libraries

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

import matplotlib.pyplot as plt
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.model_selection import learning_curve
from sklearn.model_selection import ShuffleSplit

from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, ExtraTreesClassifier

and prepare the BC data

# get the BC data
cancer = load_breast_cancer()
X_cancer = cancer['data']
y_cancer = cancer['target']

# normalize the data
scaler = MinMaxScaler()
X_cancer_scaled = scaler.fit_transform(X_cancer)

Cross-Validation Score

Let's apply Logistic Regression and get its 5-fold CV score

clf = LogisticRegression()

cv_scores = cross_val_score(clf, X_cancer_scaled, y_cancer, cv = 5)

print('Cross validation scores (5 folds): {}'.format(cv_scores))
print('The average cross validation score (5 folds): {}'.format(np.mean(cv_scores)))

Cross validation scores (5 folds): [0.95614035 0.96491228 0.97368421 0.95614035 0.96460177]
The average cross validation score (5 folds): 0.9630957925787922
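For a classifier, passing an integer `cv` to `cross_val_score` uses stratified folds under the hood. A quick sketch of the equivalent explicit loop with `StratifiedKFold`, reproducing the same fold scores:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.preprocessing import MinMaxScaler

cancer = load_breast_cancer()
X = MinMaxScaler().fit_transform(cancer['data'])
y = cancer['target']

# cv=5 on a classifier means StratifiedKFold(n_splits=5) without shuffling
manual_scores = []
for train_idx, test_idx in StratifiedKFold(n_splits=5).split(X, y):
    clf = LogisticRegression().fit(X[train_idx], y[train_idx])
    manual_scores.append(clf.score(X[test_idx], y[test_idx]))

cv_scores = cross_val_score(LogisticRegression(), X, y, cv=5)
print(np.allclose(manual_scores, cv_scores))
```

This makes explicit that each fold's score comes from an estimator refit from scratch on the remaining four folds.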

estimator = GaussianNB()

cv_scores = cross_val_score(estimator, X_cancer_scaled, y_cancer, cv = 5)

print('Cross validation scores (5 folds): {}'.format(cv_scores))
print('The average cross validation score (5 folds): {}'.format(np.mean(cv_scores)))

Cross validation scores (5 folds): [0.90350877 0.9122807  0.95614035 0.94736842 0.92035398]
The average cross validation score (5 folds): 0.927930445582984

estimator = SVC()

cv_scores = cross_val_score(estimator, X_cancer_scaled, y_cancer, cv = 5)

print('Cross validation scores (5 folds): {}'.format(cv_scores))
print('The average cross validation score (5 folds): {}'.format(np.mean(cv_scores)))

Cross validation scores (5 folds): [0.96491228 0.96491228 0.99122807 0.96491228 0.98230088]
The average cross validation score (5 folds): 0.9736531594472908

from sklearn.neighbors import KNeighborsClassifier
estimator = KNeighborsClassifier(n_neighbors=6)
cv_scores = cross_val_score(estimator, X_cancer_scaled, y_cancer, cv = 5)

print('Cross validation scores (5 folds): {}'.format(cv_scores))
print('The average cross validation score (5 folds): {}'.format(np.mean(cv_scores)))

Cross validation scores (5 folds): [0.95614035 0.94736842 0.99122807 0.96491228 0.96460177]
The average cross validation score (5 folds): 0.9648501785437043

estimator = RandomForestClassifier()

cv_scores = cross_val_score(estimator, X_cancer_scaled, y_cancer, cv = 5)
print('Cross validation scores (5 folds): {}'.format(cv_scores))
print('The average cross validation score (5 folds): {}'.format(np.mean(cv_scores)))

Cross validation scores (5 folds): [0.92982456 0.94736842 0.99122807 0.97368421 0.97345133]
The average cross validation score (5 folds): 0.9631113181183046

estimator = GradientBoostingClassifier()

cv_scores = cross_val_score(estimator, X_cancer_scaled, y_cancer, cv = 5)
print('Cross validation scores (5 folds): {}'.format(cv_scores))
print('The average cross validation score (5 folds): {}'.format(np.mean(cv_scores)))

Cross validation scores (5 folds): [0.92982456 0.93859649 0.97368421 0.98245614 0.98230088]
The average cross validation score (5 folds): 0.9613724576929048

estimator = ExtraTreesClassifier()

cv_scores = cross_val_score(estimator, X_cancer_scaled, y_cancer, cv = 5)
print('Cross validation scores (5 folds): {}'.format(cv_scores))
print('The average cross validation score (5 folds): {}'.format(np.mean(cv_scores)))

Cross validation scores (5 folds): [0.94736842 0.96491228 0.98245614 0.97368421 0.96460177]
The average cross validation score (5 folds): 0.9666045645086166
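The per-model cells above can be condensed into a single loop. A minimal sketch (note that the tree-ensemble scores vary slightly from run to run unless random_state is fixed, which the cells above do not do):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import (ExtraTreesClassifier, GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

cancer = load_breast_cancer()
X = MinMaxScaler().fit_transform(cancer['data'])
y = cancer['target']

classifiers = {
    'LogisticRegression': LogisticRegression(),
    'GaussianNB': GaussianNB(),
    'SVC': SVC(),
    'KNeighborsClassifier': KNeighborsClassifier(n_neighbors=6),
    'RandomForestClassifier': RandomForestClassifier(random_state=0),
    'GradientBoostingClassifier': GradientBoostingClassifier(random_state=0),
    'ExtraTreesClassifier': ExtraTreesClassifier(random_state=0),
}

# mean 5-fold CV score per classifier, highest first
results = {name: cross_val_score(est, X, y, cv=5).mean()
           for name, est in classifiers.items()}
for name, score in sorted(results.items(), key=lambda kv: -kv[1]):
    print(f'{name}: {score:.3f}')
```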

Learning Curves

Learning curves show how a classifier's training and cross-validation scores change as the number of training examples grows. Here we compare GaussianNB and SVC(gamma=0.001) using the function plot_learning_curve (cf. Appendix)

fig, axes = plt.subplots(3, 2, figsize=(10, 15))

title = "Learning Curves (Naive Bayes)"

We define cross-validation with 50 iterations to get smoother mean test and train score curves, each time with 20% of the data randomly selected as a validation set

cv = ShuffleSplit(n_splits=50, test_size=0.2, random_state=0)

estimator = GaussianNB()
plot_learning_curve(
    estimator,
    title,
    X_cancer_scaled,
    y_cancer,
    axes=axes[:, 0],
    ylim=(0.7, 1.01),
    cv=cv,
    n_jobs=4,
    scoring="accuracy",
)

title = r"Learning Curves (SVM, RBF kernel, $\gamma=0.001$)"

# SVC is more expensive to fit, so we use fewer CV iterations

cv = ShuffleSplit(n_splits=5, test_size=0.2, random_state=0)
estimator = SVC(gamma=0.001)
plot_learning_curve(
    estimator, title, X_cancer_scaled, y_cancer, axes=axes[:, 1], ylim=(0.7, 1.01), cv=cv, n_jobs=4
)

plt.show()

We can see that Training_score(NB) < Training_score(SVM), whereas CV_score(NB) > CV_score(SVM).

Model Scalability

Note that the function in the Appendix also returns fit_times, the time it took to fit the model at each training-set size. As expected, the more data, the longer the model takes to fit.

[Figure: fit_times vs. training examples for Naive Bayes (left) and SVM (right)]

We can see that fit_times(NB) << fit_times(SVM) at 400 training examples.
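The fit-time comparison can also be reproduced numerically without plotting, since learning_curve(..., return_times=True) returns the raw fit_times array (one row per training size, one column per CV split). A minimal sketch:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import ShuffleSplit, learning_curve
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

cancer = load_breast_cancer()
X = MinMaxScaler().fit_transform(cancer['data'])
y = cancer['target']

cv = ShuffleSplit(n_splits=5, test_size=0.2, random_state=0)
fit_time = {}
for name, est in [('GaussianNB', GaussianNB()), ('SVC', SVC(gamma=0.001))]:
    sizes, _, _, fit_times, _ = learning_curve(
        est, X, y, cv=cv, train_sizes=[0.25, 0.5, 1.0], return_times=True)
    # mean seconds to fit at each training-set size
    fit_time[name] = fit_times.mean(axis=1)
    print(name, sizes, fit_time[name])
```

Absolute timings depend on the machine, but the gap between the two models at the largest training size is what the scalability panel visualizes.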

Model Performance

[Figure: score vs. fit_times for Naive Bayes (left) and SVM (right)]

We can see that Score(NB) > Score(SVM) for fit_times < 0.0005 s.

Summary

No  Method                      CV Score  Rank
1   LogisticRegression          0.963     3
2   GaussianNB                  0.928     4
3   SVC                         0.973     1
4   KNeighborsClassifier        0.965     2
5   RandomForestClassifier      0.963     3
6   GradientBoostingClassifier  0.961     3
7   ExtraTreesClassifier        0.966     2

However, the learning-curve results show that GaussianNB is more efficient than SVC in terms of model scalability (fit time) and score per unit of fit time.

Explore More

A Comparison of Binary Classifiers for Enhanced ML/AI Breast Cancer Diagnostics – 1. Scikit-Plot

ML/AI Breast Cancer Diagnosis with 98% Confidence

Breast Cancer ML Classification – Logistic Regression vs Gradient Boosting with Hyperparameter Optimization (HPO)

HealthTech ML/AI Q3 ’22 Round-Up

Supervised ML/AI Breast Cancer Diagnostics (BCD) – The Power of HealthTech

Appendix

def plot_learning_curve(
    estimator,
    title,
    X,
    y,
    axes=None,
    ylim=None,
    cv=None,
    n_jobs=None,
    scoring=None,
    train_sizes=np.linspace(0.1, 1.0, 5),
):
    if axes is None:
        _, axes = plt.subplots(1, 3, figsize=(20, 5))

    axes[0].set_title(title)
    if ylim is not None:
        axes[0].set_ylim(*ylim)
    axes[0].set_xlabel("Training examples")
    axes[0].set_ylabel("Score")

    train_sizes, train_scores, test_scores, fit_times, _ = learning_curve(
        estimator,
        X,
        y,
        scoring=scoring,
        cv=cv,
        n_jobs=n_jobs,
        train_sizes=train_sizes,
        return_times=True,
    )
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)
    fit_times_mean = np.mean(fit_times, axis=1)
    fit_times_std = np.std(fit_times, axis=1)

    # Plot learning curve
    axes[0].grid()
    axes[0].fill_between(
        train_sizes,
        train_scores_mean - train_scores_std,
        train_scores_mean + train_scores_std,
        alpha=0.1,
        color="r",
    )
    axes[0].fill_between(
        train_sizes,
        test_scores_mean - test_scores_std,
        test_scores_mean + test_scores_std,
        alpha=0.1,
        color="g",
    )
    axes[0].plot(
        train_sizes, train_scores_mean, "o-", color="r", label="Training score"
    )
    axes[0].plot(
        train_sizes, test_scores_mean, "o-", color="g", label="Cross-validation score"
    )
    axes[0].legend(loc="best")

    # Plot n_samples vs fit_times
    axes[1].grid()
    axes[1].plot(train_sizes, fit_times_mean, "o-")
    axes[1].fill_between(
        train_sizes,
        fit_times_mean - fit_times_std,
        fit_times_mean + fit_times_std,
        alpha=0.1,
    )
    axes[1].set_xlabel("Training examples")
    axes[1].set_ylabel("fit_times")
    axes[1].set_title("Scalability of the model")

    # Plot fit_time vs score
    fit_time_argsort = fit_times_mean.argsort()
    fit_time_sorted = fit_times_mean[fit_time_argsort]
    test_scores_mean_sorted = test_scores_mean[fit_time_argsort]
    test_scores_std_sorted = test_scores_std[fit_time_argsort]
    axes[2].grid()
    axes[2].plot(fit_time_sorted, test_scores_mean_sorted, "o-")
    axes[2].fill_between(
        fit_time_sorted,
        test_scores_mean_sorted - test_scores_std_sorted,
        test_scores_mean_sorted + test_scores_std_sorted,
        alpha=0.1,
    )
    axes[2].set_xlabel("fit_times")
    axes[2].set_ylabel("Score")
    axes[2].set_title("Performance of the model")

    return plt
