# A Comparison of Scikit-Learn Algorithms for Breast Cancer Classification – 2. Cross-Validation vs Performance

Featured Photo by Tara Winstead @ Pexels.

This post is a continuation of the previous breast cancer (BC) study focused on a comparison of available Scikit-Learn binary classifiers (Logistic Regression, GaussianNB, SVC, KNN, Random Forest, Extra Trees, and Gradient Boosting) in terms of cross validation and model performance/scalability scores.
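Before walking through each classifier one by one, here is a minimal end-to-end sketch of the whole comparison: it loads the breast cancer dataset, scales it, and loops over the seven classifiers computing a mean 5-fold cross-validation accuracy for each. The random states and the loop structure are illustrative choices, not part of the original study.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import (
    RandomForestClassifier,
    ExtraTreesClassifier,
    GradientBoostingClassifier,
)

# Load and min-max scale the breast cancer data once
cancer = load_breast_cancer()
X = MinMaxScaler().fit_transform(cancer["data"])
y = cancer["target"]

# Mean 5-fold CV accuracy for each candidate classifier
models = {
    "LogisticRegression": LogisticRegression(),
    "GaussianNB": GaussianNB(),
    "SVC": SVC(),
    "KNN": KNeighborsClassifier(n_neighbors=6),
    "RandomForest": RandomForestClassifier(random_state=0),
    "ExtraTrees": ExtraTreesClassifier(random_state=0),
    "GradientBoosting": GradientBoostingClassifier(random_state=0),
}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name:18s} mean CV accuracy: {scores.mean():.4f}")
```

The sections below run the same comparison step by step.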

Let's set the working directory YOURPATH

```python
import os
os.chdir('YOURPATH')
os.getcwd()
```

import the key libraries

```python
import numpy as np
import matplotlib.pyplot as plt

from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score, learning_curve, ShuffleSplit
from sklearn.ensemble import (
    RandomForestClassifier,
    GradientBoostingClassifier,
    ExtraTreesClassifier,
)
```

and prepare the BC data

```python
from sklearn.datasets import load_breast_cancer

# get the BC data
cancer = load_breast_cancer()
X_cancer = cancer['data']
y_cancer = cancer['target']

# normalize the data
scaler = MinMaxScaler()
X_cancer_scaled = scaler.fit_transform(X_cancer)
```

## Cross-Validation Score

Let's fit each classifier and compute its 5-fold cross-validation score, starting with LogisticRegression

```python
clf = LogisticRegression()

cv_scores = cross_val_score(clf, X_cancer_scaled, y_cancer, cv=5)

print('Cross validation scores (5 folds): {}'.format(cv_scores))
print('The average cross validation score (5 folds): {}'.format(np.mean(cv_scores)))
```

Cross validation scores (5 folds): [0.95614035 0.96491228 0.97368421 0.95614035 0.96460177]
The average cross validation score (5 folds): 0.9630957925787922

```python
estimator = GaussianNB()

cv_scores = cross_val_score(estimator, X_cancer_scaled, y_cancer, cv=5)

print('Cross validation scores (5 folds): {}'.format(cv_scores))
print('The average cross validation score (5 folds): {}'.format(np.mean(cv_scores)))
```

Cross validation scores (5 folds): [0.90350877 0.9122807  0.95614035 0.94736842 0.92035398]
The average cross validation score (5 folds): 0.927930445582984


```python
estimator = SVC()

cv_scores = cross_val_score(estimator, X_cancer_scaled, y_cancer, cv=5)

print('Cross validation scores (5 folds): {}'.format(cv_scores))
print('The average cross validation score (5 folds): {}'.format(np.mean(cv_scores)))
```

Cross validation scores (5 folds): [0.96491228 0.96491228 0.99122807 0.96491228 0.98230088]
The average cross validation score (5 folds): 0.9736531594472908

```python
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=6)
cv_scores = cross_val_score(knn, X_cancer_scaled, y_cancer, cv=5)

print('Cross validation scores (5 folds): {}'.format(cv_scores))
print('The average cross validation score (5 folds): {}'.format(np.mean(cv_scores)))
```

Cross validation scores (5 folds): [0.95614035 0.94736842 0.99122807 0.96491228 0.96460177]
The average cross validation score (5 folds): 0.9648501785437043

```python
estimator = RandomForestClassifier()

cv_scores = cross_val_score(estimator, X_cancer_scaled, y_cancer, cv=5)
print('Cross validation scores (5 folds): {}'.format(cv_scores))
print('The average cross validation score (5 folds): {}'.format(np.mean(cv_scores)))
```

Cross validation scores (5 folds): [0.92982456 0.94736842 0.99122807 0.97368421 0.97345133]
The average cross validation score (5 folds): 0.9631113181183046

```python
estimator = GradientBoostingClassifier()

cv_scores = cross_val_score(estimator, X_cancer_scaled, y_cancer, cv=5)
print('Cross validation scores (5 folds): {}'.format(cv_scores))
print('The average cross validation score (5 folds): {}'.format(np.mean(cv_scores)))
```

Cross validation scores (5 folds): [0.92982456 0.93859649 0.97368421 0.98245614 0.98230088]
The average cross validation score (5 folds): 0.9613724576929048

```python
estimator = ExtraTreesClassifier()

cv_scores = cross_val_score(estimator, X_cancer_scaled, y_cancer, cv=5)
print('Cross validation scores (5 folds): {}'.format(cv_scores))
print('The average cross validation score (5 folds): {}'.format(np.mean(cv_scores)))
```

Cross validation scores (5 folds): [0.94736842 0.96491228 0.98245614 0.97368421 0.96460177]
The average cross validation score (5 folds): 0.9666045645086166
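cross_val_score only returns test scores. A short sketch using cross_validate on the same data (variable names are illustrative) also records per-fold fit times, which previews the scalability comparison in the next sections:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import cross_validate
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

cancer = load_breast_cancer()
X = MinMaxScaler().fit_transform(cancer["data"])
y = cancer["target"]

for model in (GaussianNB(), SVC()):
    # cross_validate returns fit_time, score_time, and test_score per fold
    res = cross_validate(model, X, y, cv=5)
    print(
        f"{type(model).__name__:10s} "
        f"mean accuracy={res['test_score'].mean():.4f} "
        f"mean fit time={res['fit_time'].mean():.4f}s"
    )
```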

## Learning Curves

Learning curves show how a classifier's training and cross-validation scores evolve as the number of training examples grows. Here we compare GaussianNB and SVC(gamma=0.001) using the plot_learning_curve function listed in the Appendix.

```python
fig, axes = plt.subplots(3, 2, figsize=(10, 15))

title = "Learning Curves (Naive Bayes)"
```

We define cross-validation with 50 iterations to get smoother mean test and train score curves, each iteration randomly selecting 20% of the data as a validation set

```python
cv = ShuffleSplit(n_splits=50, test_size=0.2, random_state=0)

estimator = GaussianNB()
plot_learning_curve(
    estimator,
    title,
    X_cancer_scaled,
    y_cancer,
    axes=axes[:, 0],
    ylim=(0.7, 1.01),
    cv=cv,
    n_jobs=4,
    scoring="accuracy",
)
```

```python
title = r"Learning Curves (SVM, RBF kernel, $\gamma=0.001$)"

# SVC is more expensive, so we use a lower number of CV iterations
cv = ShuffleSplit(n_splits=5, test_size=0.2, random_state=0)
estimator = SVC(gamma=0.001)
plot_learning_curve(
    estimator, title, X_cancer_scaled, y_cancer,
    axes=axes[:, 1], ylim=(0.7, 1.01), cv=cv, n_jobs=4,
)

plt.show()
```

We can see that Training_score(NB) < Training_score(SVM), whereas CV_score(NB) > CV_score(SVM).

## Model Scalability

Note that the function in the Appendix also returns fit_times: the time it took to fit the model at each training-set size. As you would expect, the more data, the longer the model takes to fit.
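For reference, fit_times comes straight out of learning_curve when return_times=True. A minimal sketch with GaussianNB (the n_splits value here is an arbitrary illustration, smaller than the 50 used above):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import learning_curve, ShuffleSplit
from sklearn.naive_bayes import GaussianNB

cancer = load_breast_cancer()
X = MinMaxScaler().fit_transform(cancer["data"])
y = cancer["target"]

cv = ShuffleSplit(n_splits=10, test_size=0.2, random_state=0)
train_sizes, train_scores, test_scores, fit_times, score_times = learning_curve(
    GaussianNB(), X, y, cv=cv,
    train_sizes=np.linspace(0.1, 1.0, 5),
    return_times=True,
)
# One fit time per (train_size, CV split) pair
print(fit_times.shape)          # (5, 10)
print(fit_times.mean(axis=1))   # mean fit time typically grows with training-set size
```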

(Figure: "Scalability of the model" panels for Naive Bayes and SVM.)

We can see that fit_times(NB) << fit_times(SVM) at around 400 training examples.

## Model Performance

(Figure: "Performance of the model" panels for Naive Bayes and SVM.)

We can see that Score(NB) > Score(SVM) for fit_times < 0.0005 s.

## Summary

These results show that GaussianNB is more efficient than SVC in terms of both model scalability and performance.
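The efficiency claim is easy to spot-check by timing single fits on the full scaled dataset (a rough sketch; absolute times are machine-dependent):

```python
import time
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import MinMaxScaler
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

cancer = load_breast_cancer()
X = MinMaxScaler().fit_transform(cancer["data"])
y = cancer["target"]

for model in (GaussianNB(), SVC()):
    t0 = time.perf_counter()
    model.fit(X, y)
    elapsed = time.perf_counter() - t0
    print(f"{type(model).__name__}: fit in {elapsed:.4f}s")
```

On typical hardware the GaussianNB fit completes well under the SVC fit time, consistent with the scalability panels above.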

## Explore More

A Comparison of Binary Classifiers for Enhanced ML/AI Breast Cancer Diagnostics – 1. Scikit-Plot

ML/AI Breast Cancer Diagnosis with 98% Confidence

Breast Cancer ML Classification – Logistic Regression vs Gradient Boosting with Hyperparameter Optimization (HPO)

HealthTech ML/AI Q3 ’22 Round-Up

Supervised ML/AI Breast Cancer Diagnostics (BCD) – The Power of HealthTech

## Appendix

```python
def plot_learning_curve(
    estimator,
    title,
    X,
    y,
    axes=None,
    ylim=None,
    cv=None,
    n_jobs=None,
    scoring=None,
    train_sizes=np.linspace(0.1, 1.0, 5),
):
    if axes is None:
        _, axes = plt.subplots(1, 3, figsize=(20, 5))

    axes[0].set_title(title)
    if ylim is not None:
        axes[0].set_ylim(*ylim)
    axes[0].set_xlabel("Training examples")
    axes[0].set_ylabel("Score")

    train_sizes, train_scores, test_scores, fit_times, _ = learning_curve(
        estimator,
        X,
        y,
        scoring=scoring,
        cv=cv,
        n_jobs=n_jobs,
        train_sizes=train_sizes,
        return_times=True,
    )
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)
    fit_times_mean = np.mean(fit_times, axis=1)
    fit_times_std = np.std(fit_times, axis=1)

    # Plot learning curve
    axes[0].grid()
    axes[0].fill_between(
        train_sizes,
        train_scores_mean - train_scores_std,
        train_scores_mean + train_scores_std,
        alpha=0.1,
        color="r",
    )
    axes[0].fill_between(
        train_sizes,
        test_scores_mean - test_scores_std,
        test_scores_mean + test_scores_std,
        alpha=0.1,
        color="g",
    )
    axes[0].plot(
        train_sizes, train_scores_mean, "o-", color="r", label="Training score"
    )
    axes[0].plot(
        train_sizes, test_scores_mean, "o-", color="g", label="Cross-validation score"
    )
    axes[0].legend(loc="best")

    # Plot n_samples vs fit_times
    axes[1].grid()
    axes[1].plot(train_sizes, fit_times_mean, "o-")
    axes[1].fill_between(
        train_sizes,
        fit_times_mean - fit_times_std,
        fit_times_mean + fit_times_std,
        alpha=0.1,
    )
    axes[1].set_xlabel("Training examples")
    axes[1].set_ylabel("fit_times")
    axes[1].set_title("Scalability of the model")

    # Plot fit_time vs score
    fit_time_argsort = fit_times_mean.argsort()
    fit_time_sorted = fit_times_mean[fit_time_argsort]
    test_scores_mean_sorted = test_scores_mean[fit_time_argsort]
    test_scores_std_sorted = test_scores_std[fit_time_argsort]
    axes[2].grid()
    axes[2].plot(fit_time_sorted, test_scores_mean_sorted, "o-")
    axes[2].fill_between(
        fit_time_sorted,
        test_scores_mean_sorted - test_scores_std_sorted,
        test_scores_mean_sorted + test_scores_std_sorted,
        alpha=0.1,
    )
    axes[2].set_xlabel("fit_times")
    axes[2].set_ylabel("Score")
    axes[2].set_title("Performance of the model")

    return plt
```

