
Acknowledgements

With the ML/AI contribution hiscidatmlai.blogspot.com/2022/02/digita and @VismeApp #Graphics via ref visme.co/?ref=al24

Thanks to Mugdha Paithankar [1] and kaggle.com/uciml/breast-c [2] for the shared open-source content!

Introduction

Breast cancer (BC) continues to be the most frequent cancer in females, affecting about one in eight women and causing the highest number of cancer-related deaths in females worldwide, despite remarkable progress in early diagnosis, screening, and patient management [3]. The use of ML/AI models in combination with statistical exploratory data analysis (EDA) has become a predominant area of cancer research as part of HealthTech data science and analytics [1,2]. Specifically, existing Python deep learning binary classification algorithms allow the integration of different sources of data, such as medical images, laboratory results, clinical outcomes, biomarkers, and biological features, for better BC screening and diagnostics. The objective of this case study is to explore the application of ML/AI approaches to classify BC based on multiple feature values generated from a digitized image of a breast mass.

Input Data

We used a publicly available breast cancer dataset from the University of Wisconsin Hospitals, Madison, Wisconsin, USA. This dataset was created by Dr. William H. Wolberg (General Surgery Department, University of Wisconsin, Clinical Sciences Center, Madison, WI 53792). It consists of 569 records from breast cancer patients and is available in the UCI Machine Learning Repository [1-3].

Let us download the dataset from Kaggle. It contains 569 rows and 33 columns of tumor shape and specifications. The tumor is classified as benign or malignant based on its geometry and shape. Features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass, which is a type of biopsy procedure. They describe characteristics of the cell nuclei present in the image.

The linear program used to obtain the separating plane in the 3-dimensional space is that described in: [K. P. Bennett and O. L. Mangasarian: “Robust Linear Programming Discrimination of Two Linearly Inseparable Sets”, Optimization Methods and Software 1, 1992, 23-34].

This database is also available through the UW CS ftp server:
ftp ftp.cs.wisc.edu
cd math-prog/cpo-dataset/machine-learn/WDBC/

It can also be found in the UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29

Attribute Information:

1) ID number
2) Diagnosis (M = malignant, B = benign)
3-32) Ten real-valued features are computed for each cell nucleus:

a) radius (mean of distances from center to points on the perimeter)
b) texture (standard deviation of gray-scale values)
c) perimeter
d) area
e) smoothness (local variation in radius lengths)
f) compactness (perimeter^2 / area – 1.0)
g) concavity (severity of concave portions of the contour)
h) concave points (number of concave portions of the contour)
i) symmetry
j) fractal dimension (“coastline approximation” – 1)

The mean, standard error and “worst” or largest (mean of the three
largest values) of these features were computed for each image,
resulting in 30 features. For instance, field 3 is Mean Radius, field
13 is Radius SE, field 23 is Worst Radius.

All feature values are recoded with four significant digits.

Missing attribute values: none

Class distribution: 357 benign, 212 malignant.
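The field numbering described above (10 base features, each reported as mean, standard error and "worst") can be reproduced with a few lines of Python. This is only an illustrative sketch: the helper names feature_names and field_number are made up for this example, while the column spellings follow the Kaggle CSV header shown later in this post.

```python
# Base feature names, spelled as in the Kaggle CSV header.
BASE_FEATURES = [
    "radius", "texture", "perimeter", "area", "smoothness",
    "compactness", "concavity", "concave points", "symmetry",
    "fractal_dimension",
]
STATISTICS = ["mean", "se", "worst"]


def feature_names():
    """Return the 30 feature column names in file order (hypothetical helper)."""
    return [f"{base}_{stat}" for stat in STATISTICS for base in BASE_FEATURES]


def field_number(name):
    """1-based field number; fields 1 and 2 hold the ID and the diagnosis."""
    return feature_names().index(name) + 3


names = feature_names()
print(len(names))                    # 30
print(field_number("radius_mean"))   # 3
print(field_number("radius_se"))     # 13
print(field_number("radius_worst"))  # 23
```

The three printed field numbers match the examples given in the attribute description (Mean Radius, Radius SE, Worst Radius).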

#Step1: Importing all the necessary libraries

import pandas as pd

import numpy as np

import seaborn as sns

import matplotlib.pyplot as plt

#make a dataframe

df = pd.read_csv('C:/OneDrive/Documents/data.csv')

#examine the shape of the data

df.shape

(569, 33)

Of these 569 rows, 357 are benign and 212 are malignant, i.e. 63% and 37%, respectively.

#get the column names

df.columns

Index(['id', 'diagnosis', 'radius_mean', 'texture_mean', 'perimeter_mean',
       'area_mean', 'smoothness_mean', 'compactness_mean', 'concavity_mean',
       'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean',
       'radius_se', 'texture_se', 'perimeter_se', 'area_se', 'smoothness_se',
       'compactness_se', 'concavity_se', 'concave points_se', 'symmetry_se',
       'fractal_dimension_se', 'radius_worst', 'texture_worst',
       'perimeter_worst', 'area_worst', 'smoothness_worst',
       'compactness_worst', 'concavity_worst', 'concave points_worst',
       'symmetry_worst', 'fractal_dimension_worst', 'Unnamed: 32'],
      dtype='object')

#Drop the column with all missing values (na, NAN, NaN)

#NOTE: This drops the 'Unnamed: 32' column

df = df.dropna(axis=1)

#Get a count of the number of ‘M’ & ‘B’ cells

df['diagnosis'].value_counts()

B    357
M    212
Name: diagnosis, dtype: int64

#Visualize this count (recent seaborn versions expect the column as a keyword argument)
sns.countplot(x='diagnosis', data=df)

Violin Plots

Let us make a violin plot of different model features [4,5]. A violin plot is a visualization that shows the spread of quantitative data with different categorical variables. A violin plot uses kernel density estimation to display the underlying distribution. It is generally used in cases where multiple distributions of data are to be visualized [6,7].

Let us plot selected violins for the entire M+B dataset:

The width of each curve corresponds to the approximate frequency of data points in each region. Densities are frequently accompanied by an overlaid chart type, such as a box plot, with the rectangle showing the first and third quartiles and the central dot the median, to provide additional information [7]. The peaks, valleys, and tails of each feature’s density curve can be compared to see where features are similar or different. It appears that smoothness, texture and symmetry have similar density curves centered around the box plot median. Compactness, concavity and concave points have distinctly right-skewed distributions, whereas fractal dimension is only slightly right-skewed. A right-skewed distribution is one whose right side (or “tail”) is longer than its left side. Both radius and concave points appear to have right-skewed bimodal distributions.

At this stage, our data analysis objective is twofold: first, we select probability density distributions with most of the feature values clustered around the median and short tails toward the min/max; second, we need to transform multimodal distributions (those with multiple peaks) into a series of nearly normal probability density distributions with reasonable values of kurtosis and skewness. So, we need to create separate violin plots for the “M” and “B” parts of the dataset. Let us plot and compare selected violins for the M (green) and B (dark blue) data:
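A split violin plot of this kind could be produced along the following lines. This is a minimal sketch, not the original plotting code: the data are synthetic stand-ins for two columns of df (the means and spreads are assumed values), and the features are standardized so that they can share one axis.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch also runs headless
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in for two columns of the real dataframe (assumption).
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "diagnosis": ["B"] * 357 + ["M"] * 212,
    "radius_mean": np.concatenate([rng.normal(12.1, 1.8, 357),
                                   rng.normal(17.5, 3.2, 212)]),
    "concavity_mean": np.concatenate([rng.lognormal(-3.3, 0.8, 357),
                                      rng.lognormal(-1.9, 0.5, 212)]),
})

features = ["radius_mean", "concavity_mean"]
# Standardize so that features with different units share one axis.
scaled = (demo[features] - demo[features].mean()) / demo[features].std()
# Long format: one row per (sample, feature) pair, as seaborn expects.
long_form = pd.concat([demo["diagnosis"], scaled], axis=1).melt(
    id_vars="diagnosis", var_name="feature", value_name="value")

fig, ax = plt.subplots(figsize=(8, 5))
sns.violinplot(data=long_form, x="feature", y="value", hue="diagnosis",
               split=True, inner="box",
               palette={"M": "green", "B": "darkblue"}, ax=ax)
ax.set_title("Standardized features by diagnosis (synthetic data)")
```

With the real data, demo would simply be replaced by df and features by the columns of interest.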

The result is summarized in the table below:

We can see that features 1, 5 and 6 are quite suitable for BC classification, whereas other features should be used only in combination with more reliable model parameters.

Box Plots

To better understand the difference between the data distributions for the malignant and benign groups, we can compute box plots of the model features. A box plot (or boxplot) is a method for graphically depicting groups of numerical data through their quartiles. Box plots may also have lines extending from the boxes (whiskers) indicating variability outside the upper and lower quartiles, hence the terms box-and-whisker plot and box-and-whisker diagram.

I visualized some features via box plots and compared their interquartile ranges (IQR) as follows:

The following features are well separated for the M and B groups: radius, compactness, concavity, and concave points. Because of the noticeable differences between B and M tumors, these could be good features for BC classification (as with the violin plots described above).

Generally, the IQR is wider for malignant tumors. Fractal dimension means are almost the same for malignant and benign tumors. Texture means for malignant and benign tumors differ by about 3 units; the distributions look similar for both groups, but malignant tumors tend to have a higher texture mean than benign ones. For radius mean, the malignant group has a wider range of values and a wider IQR than the benign group.
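The IQR comparison behind these box plots can also be done numerically. The sketch below is illustrative only: it uses synthetic radius values (assumed means and spreads) instead of the real column, and iqr_by_group is a hypothetical helper name.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the radius_mean column (assumption).
rng = np.random.default_rng(1)
demo = pd.DataFrame({
    "diagnosis": ["B"] * 357 + ["M"] * 212,
    "radius_mean": np.concatenate([rng.normal(12.1, 1.8, 357),
                                   rng.normal(17.5, 3.2, 212)]),
})


def iqr_by_group(frame, feature, group="diagnosis"):
    """Return Q1, median, Q3 and IQR of `feature` for each group."""
    q = frame.groupby(group)[feature].quantile([0.25, 0.5, 0.75]).unstack()
    q.columns = ["q1", "median", "q3"]
    q["iqr"] = q["q3"] - q["q1"]
    return q


summary = iqr_by_group(demo, "radius_mean")
print(summary.round(2))
```

The matching picture for the same frame is a call such as sns.boxplot(data=demo, x="diagnosis", y="radius_mean").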

Correlation Matrix

To check the correlation or inter-dependency between the individual features, let us consider Pearson’s correlation matrix. The correlation matrix is simply a table of correlations, as shown below:

The means, standard errors and worst values of compactness, concavity and concave points are highly correlated with each other (correlation > 0.8). The means, standard errors and worst values of radius, perimeter and area have correlations of approximately 1. texture_mean and texture_worst have a correlation of 0.9, and area_worst and area_mean have a correlation of approximately 1.

By now we have a rough idea that many of the features are highly correlated with each other.

Bottom line: the above correlation matrix shows quite a few features with very high correlations. So we should drop one feature from each pair with a correlation greater than 0.95: ‘perimeter_mean’, ‘area_mean’, ‘perimeter_se’, ‘area_se’, ‘radius_worst’, ‘perimeter_worst’ and ‘area_worst’ are among the features that should be dropped.
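One common way to automate the rule "drop one feature from each pair with a correlation above 0.95" is to scan the upper triangle of the correlation matrix. The sketch below is an assumption about how this could be done, not the original code: drop_highly_correlated is a hypothetical helper and the data are synthetic.

```python
import numpy as np
import pandas as pd


def drop_highly_correlated(frame, threshold=0.95):
    """Drop the later column of every pair whose |Pearson r| exceeds threshold."""
    corr = frame.corr().abs()
    # Keep only the strict upper triangle so each pair is inspected once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return frame.drop(columns=to_drop), to_drop


# Synthetic stand-in: perimeter is an exact linear function of radius.
rng = np.random.default_rng(2)
radius = rng.normal(14.0, 3.5, 569)
demo = pd.DataFrame({
    "radius_mean": radius,
    "perimeter_mean": 2 * np.pi * radius,
    "texture_mean": rng.normal(19.0, 4.0, 569),
})

reduced, dropped = drop_highly_correlated(demo)
print(dropped)                # ['perimeter_mean']
print(list(reduced.columns))  # ['radius_mean', 'texture_mean']
```

Because the later column of each pair is dropped, the column order decides which feature survives; here radius_mean is kept and perimeter_mean is removed, in line with the list above.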

Statistical Significance Test

The null hypothesis must be rejected for the difference between the above M and B data to be deemed statistically significant. A p-value is a measure of the probability that an observed difference could have occurred just by random chance. When the p-value is sufficiently small (e.g., 5% or less), the results are not easily explained by chance alone and the null hypothesis can be rejected. When the p-value is large, the results in the data are explainable by chance alone, and the data are deemed consistent with (while not proving) the null hypothesis [7]. In addition, hypothesis tests use a test statistic, here the t statistic, to compare the data sample to the null hypothesis. If the test statistic is extreme enough, the data are so incompatible with the null hypothesis that it can be rejected.

Except for fractal dimension mean, the difference between the M and B groups is statistically significant for all the features in the table above, judging by the p-value and t statistic. For fractal dimension mean, the null hypothesis cannot be rejected, meaning there is no evidence of a difference in fractal dimension mean between M and B tumors.
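A two-sample t-test of this kind can be sketched with SciPy as follows. The post does not say which variant of the t-test was used, so Welch's version (equal_var=False) is an assumption here, and the two samples are synthetic stand-ins for one feature in the M and B groups.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
# Synthetic stand-ins for radius_mean in the two groups (assumed values).
benign = rng.normal(12.1, 1.8, 357)
malignant = rng.normal(17.5, 3.2, 212)

# Welch's t-test: does not assume equal variances in the two groups.
t_stat, p_value = stats.ttest_ind(malignant, benign, equal_var=False)

alpha = 0.05  # significance level (the "5% or less" mentioned above)
print(f"t = {t_stat:.2f}, p = {p_value:.3g}")
print("reject H0" if p_value < alpha else "fail to reject H0")
```

With the real data, the two arrays would be df.loc[df['diagnosis'] == 'M', feature] and the corresponding 'B' selection, repeated for each feature.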

Training ML Models

Let’s transform the categorical variable column (diagnosis) to a numeric type using sklearn’s LabelEncoder. The M and B variables are changed to 1 and 0 by the label encoder.

Then we split the data into the 60% training and 40% test subsets: test_size = 0.40, stratify=y, random_state = 17.

sklearn’s RobustScaler was used to scale the features of the dataset. The centering and scaling statistics of this scaler are based on percentiles and are therefore not influenced by a small number of very large marginal outliers.
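The three preprocessing steps (label encoding, stratified 60/40 split, robust scaling) might be wired together as in the sketch below. The split parameters are the ones quoted above; everything else is an assumption: the data are synthetic with the same size and class balance as the dataset, and fitting the scaler on the training subset only is a common practice rather than something the post states.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, RobustScaler

# Synthetic stand-in with the same size and class balance as the dataset.
rng = np.random.default_rng(4)
demo = pd.DataFrame({
    "diagnosis": ["B"] * 357 + ["M"] * 212,
    "radius_mean": np.concatenate([rng.normal(12.1, 1.8, 357),
                                   rng.normal(17.5, 3.2, 212)]),
    "texture_mean": rng.normal(19.0, 4.0, 569),
})

# LabelEncoder sorts the classes alphabetically: 'B' -> 0, 'M' -> 1.
encoder = LabelEncoder()
y = encoder.fit_transform(demo["diagnosis"])
X = demo.drop(columns="diagnosis")

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.40, stratify=y, random_state=17)

# Fit the scaler on the training subset only, then reuse it for the test subset.
scaler = RobustScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

print(X_train.shape, X_test.shape)  # (341, 2) (228, 2)
```

The 228 test rows agree with the support totals in the classification reports further down.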

Next, we define a function that trains several ML algorithms and reports their training accuracy:

[0]Logistic Regression Training Accuracy: 0.9794721407624634
[1]Support Vector Machine (Linear Classifier) Training Accuracy: 0.9794721407624634
[2]Support Vector Machine (RBF Classifier) Training Accuracy: 0.9824046920821115
[3]Decision Tree Classifier Training Accuracy: 1.0
[4]Random Forest Classifier Training Accuracy: 0.9912023460410557
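A training function that produces output in this format could look like the sketch below. The five model types are taken from the output above; the hyperparameters (max_iter, n_estimators, seeds), the function name train_models and the synthetic data from make_classification are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier


def train_models(X_train, y_train, seed=0):
    """Fit five classifiers and return them with their training accuracy."""
    candidates = {
        "Logistic Regression": LogisticRegression(max_iter=1000, random_state=seed),
        "Support Vector Machine (Linear Classifier)": SVC(kernel="linear", random_state=seed),
        "Support Vector Machine (RBF Classifier)": SVC(kernel="rbf", random_state=seed),
        "Decision Tree Classifier": DecisionTreeClassifier(random_state=seed),
        "Random Forest Classifier": RandomForestClassifier(n_estimators=10, random_state=seed),
    }
    fitted, scores = {}, {}
    for name, model in candidates.items():
        fitted[name] = model.fit(X_train, y_train)
        scores[name] = model.score(X_train, y_train)  # accuracy on the training set
    return fitted, scores


# Synthetic stand-in for the scaled training data (assumption).
X, y = make_classification(n_samples=569, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.40, stratify=y, random_state=17)

fitted, scores = train_models(X_train, y_train)
for i, (name, acc) in enumerate(scores.items()):
    print(f"[{i}]{name} Training Accuracy: {acc}")
```

Training accuracy alone is not a reliable guide: in the results of this post the decision tree reaches 1.0 on the training set but has the lowest test accuracy (0.917), which is the usual signature of overfitting.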

Error Analysis

We calculate the confusion matrix on the test set for each of the trained models:

[[142   1]
 [  2  83]]
Model[0] Testing Accuracy = "0.9868421052631579"
[[141   2]
 [  4  81]]
Model[1] Testing Accuracy = "0.9736842105263158"
[[141   2]
 [  3  82]]
Model[2] Testing Accuracy = "0.9780701754385965"
[[129  14]
 [  5  80]]
Model[3] Testing Accuracy = "0.9166666666666666"
[[139   4]
 [  6  79]]
Model[4] Testing Accuracy = "0.956140350877193"

and the classification report for these models as follows:

Model  0
              precision    recall  f1-score   support

           0       0.99      0.99      0.99       143
           1       0.99      0.98      0.98        85

    accuracy                           0.99       228
   macro avg       0.99      0.98      0.99       228
weighted avg       0.99      0.99      0.99       228

0.9868421052631579

Model  1
              precision    recall  f1-score   support

           0       0.97      0.99      0.98       143
           1       0.98      0.95      0.96        85

    accuracy                           0.97       228
   macro avg       0.97      0.97      0.97       228
weighted avg       0.97      0.97      0.97       228

0.9736842105263158

Model  2
              precision    recall  f1-score   support

           0       0.98      0.99      0.98       143
           1       0.98      0.96      0.97        85

    accuracy                           0.98       228
   macro avg       0.98      0.98      0.98       228
weighted avg       0.98      0.98      0.98       228

0.9780701754385965

Model  3
              precision    recall  f1-score   support

           0       0.96      0.90      0.93       143
           1       0.85      0.94      0.89        85

    accuracy                           0.92       228
   macro avg       0.91      0.92      0.91       228
weighted avg       0.92      0.92      0.92       228

0.9166666666666666

Model  4
              precision    recall  f1-score   support

           0       0.96      0.97      0.97       143
           1       0.95      0.93      0.94        85

    accuracy                           0.96       228
   macro avg       0.96      0.95      0.95       228
weighted avg       0.96      0.96      0.96       228

0.956140350877193
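The numbers in the reports can be traced back to the confusion matrices by hand. The sketch below (binary_metrics is a hypothetical helper) does this for Model 0, using the scikit-learn layout [[TN, FP], [FN, TP]] with malignant as the positive class.

```python
def binary_metrics(cm):
    """Derive metrics from a 2x2 confusion matrix [[TN, FP], [FN, TP]]."""
    (tn, fp), (fn, tp) = cm
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
        "false_negatives": fn,
    }


# Confusion matrix of Model 0 (logistic regression) from the output above.
model0 = binary_metrics([[142, 1], [2, 83]])
print(round(model0["accuracy"], 4))   # 0.9868
print(round(model0["precision"], 2))  # 0.99
print(round(model0["recall"], 2))     # 0.98
print(model0["false_negatives"])      # 2
```

The recall of 0.98 for class 1 corresponds to the 2 malignant tumors out of 85 that were predicted as benign.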

HPO

Hyperparameter optimization (HPO) is crucial because hyperparameters control the overall behavior of an ML model. In the context of BC classification, the objective is to minimize the misclassifications for the positive class (i.e. when the tumor is malignant, ‘M’). Misclassifications include both False Positives (FP) and False Negatives (FN). We focused on reducing the FNs, because malignant tumors should never be classified as benign, even if this means the model might classify a few benign tumors as malignant. Therefore, I used sklearn’s fbeta_score as the scoring function with GridSearchCV. A beta > 1 makes fbeta_score favor recall over precision.
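To see why beta = 2 favors recall, it helps to write the F-beta score in terms of confusion-matrix counts: F_beta = (1 + beta^2) TP / ((1 + beta^2) TP + beta^2 FN + FP). The sketch below evaluates it for the Model 0 counts from the Error Analysis section; fbeta_from_counts is a hypothetical helper, and the "swapped" case is a made-up comparison, not a result from this study.

```python
def fbeta_from_counts(tp, fp, fn, beta):
    """F-beta score written in terms of confusion-matrix counts."""
    b2 = beta ** 2
    return (1 + b2) * tp / ((1 + b2) * tp + b2 * fn + fp)


# Model 0 counts from the Error Analysis section: TP=83, FP=1, FN=2.
f1 = fbeta_from_counts(83, 1, 2, beta=1)
f2 = fbeta_from_counts(83, 1, 2, beta=2)

# Hypothetical comparison: swap FP and FN. F1 is unchanged, but F2 goes up
# because each false negative is weighted beta^2 = 4 times as much as a false positive.
f1_swapped = fbeta_from_counts(83, 2, 1, beta=1)
f2_swapped = fbeta_from_counts(83, 2, 1, beta=2)

print(round(f1, 4), round(f1_swapped, 4))
print(round(f2, 4), round(f2_swapped, 4))
```

A grid search scored with F2 therefore prefers models that trade a false negative for a false positive, which is the behavior wanted here.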

The steps are

from sklearn.linear_model import LogisticRegression

from sklearn.model_selection import GridSearchCV

#make the scoring function with a beta = 2

from sklearn.metrics import fbeta_score, make_scorer

ftwo_scorer = make_scorer(fbeta_score, beta=2)

# Create logistic regression (the liblinear solver supports both the l1 and l2 penalties)

logistic = LogisticRegression(solver='liblinear')

# Create regularization penalty space

penalty = ['l1', 'l2']

# Create regularization hyperparameter space (C must be strictly positive, so the grid starts above 0)

C = np.arange(0.001, 1, 0.001)

# Create hyperparameter options

hyperparameters = dict(C=C, penalty=penalty)

# Create grid search using 5-fold cross validation

clf = GridSearchCV(logistic, hyperparameters, cv=5, scoring=ftwo_scorer, verbose=0)

# Fit grid search

best_model = clf.fit(X_train, y_train)

# View best hyperparameters

print('Best Penalty:', best_model.best_estimator_.get_params()['penalty'])

print('Best C:', best_model.best_estimator_.get_params()['C'])

The output is

Accuracy score 0.986842
              precision    recall  f1-score   support

           0       0.99      0.99      0.99       143
           1       0.99      0.98      0.98        85

    accuracy                           0.99       228
   macro avg       0.99      0.98      0.99       228
weighted avg       0.99      0.99      0.99       228

[[142   1]
 [  2  83]]

After grid searching, the accuracy is unchanged (0.9868) and there are still 2 FNs.
Grid searching was also done on the SVC and Random Forest models, but the recall was best for logistic regression, which is why this study focuses on logistic regression.
Let's now use a custom decision threshold to increase recall. The default threshold for converting predicted probabilities to class labels is 0.5, and tuning this threshold is called threshold moving.
from sklearn.metrics import classification_report, confusion_matrix, precision_recall_curve
# Predicted probabilities of the positive (malignant) class
y_scores = best_model.predict_proba(X_test)[:, 1]
p, r, thresholds = precision_recall_curve(y_test, y_scores)
def adjusted_classes(y_scores, t):
    # Label a sample as positive when its score reaches the threshold t
    return [1 if y >= t else 0 for y in y_scores]
def precision_recall_threshold(p, r, thresholds, t=0.5):
    # Print the confusion matrix and classification report at threshold t
    y_pred_adj = adjusted_classes(y_scores, t)
    print(pd.DataFrame(confusion_matrix(y_test, y_pred_adj),
                       columns=['pred_neg', 'pred_pos'],
                       index=['neg', 'pos']))
    print(classification_report(y_test, y_pred_adj))
precision_recall_threshold(p, r, thresholds, 0.42)
 pred_neg  pred_pos
neg       141         2
pos         1        84
              precision    recall  f1-score   support

           0       0.99      0.99      0.99       143
           1       0.98      0.99      0.98        85

    accuracy                           0.99       228
   macro avg       0.98      0.99      0.99       228
weighted avg       0.99      0.99      0.99       228


Finally, the number of FNs dropped to 1 after manually setting a decision threshold of 0.42, at the cost of one additional FP (2 instead of 1).
Graph of recall and precision vs. threshold
Recall scores as a function of the decision threshold are shown below.



The line at the chosen decision threshold of 0.42 marks the point of maximum recall that can be achieved without compromising much on precision. Beyond that point, the precision starts to drop more steeply.
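A graph of this kind, together with a simple rule for picking the threshold, can be sketched as follows. The scores are synthetic stand-ins for best_model.predict_proba(X_test)[:, 1], the recall target of 0.98 is an arbitrary example value, and highest_threshold_with_recall is a hypothetical helper; the threshold found this way will not be 0.42.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch also runs headless
import matplotlib.pyplot as plt
import numpy as np
from sklearn.metrics import precision_recall_curve

# Synthetic scores with the same class sizes as the test set (assumption).
rng = np.random.default_rng(5)
y_true = np.array([0] * 143 + [1] * 85)
y_scores = np.clip(np.where(y_true == 1,
                            rng.normal(0.8, 0.15, y_true.size),
                            rng.normal(0.2, 0.15, y_true.size)), 0, 1)

p, r, thresholds = precision_recall_curve(y_true, y_scores)


def highest_threshold_with_recall(r, thresholds, min_recall):
    """Largest threshold whose recall is still at least min_recall."""
    ok = r[:-1] >= min_recall  # p and r have one more entry than thresholds
    return thresholds[ok].max()


t_star = highest_threshold_with_recall(r, thresholds, min_recall=0.98)

fig, ax = plt.subplots()
ax.plot(thresholds, p[:-1], label="precision")
ax.plot(thresholds, r[:-1], label="recall")
ax.axvline(t_star, linestyle="--", color="k", label=f"threshold = {t_star:.2f}")
ax.set_xlabel("decision threshold")
ax.set_ylabel("score")
ax.legend()
```

Choosing the largest threshold that still meets the recall target keeps precision as high as possible for that level of recall.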
Finally, we calculate the AUC score
from sklearn import metrics
from sklearn.metrics import roc_curve
# Compute predicted probabilities: y_pred_prob
y_pred_prob = best_model.predict_proba(X_test)[:,1]
# Generate ROC curve values: fpr, tpr, thresholds
fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob)
print(metrics.auc(fpr, tpr))
# Plot ROC curve
#plt.plot([0, 1], [0, 1], 'k--')
plt.plot(fpr, tpr)

AUC=0.9979432332373509
We plot the ROC curve for logistic regression, that is, the TP rate (TPR) versus the FP rate (FPR).
The AUC score tells us how good our model is at distinguishing between classes, in this case, predicting benign tumors as benign and malignant tumors as malignant.
The ROC curve is plotted with the TPR on the y-axis and the FPR on the x-axis. Here, the ROC curve looks almost ideal.
When the predicted-score distributions of the positive and negative classes don’t overlap at all, the model has an ideal measure of separability (AUC = 1), i.e. it is able to correctly classify positives as positives and negatives as negatives.
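The AUC also has a direct ranking interpretation: it equals the fraction of (malignant, benign) pairs in which the malignant case receives the higher score. The pure-Python sketch below illustrates this on made-up toy scores; pairwise_auc is a hypothetical helper, not part of scikit-learn.

```python
def pairwise_auc(y_true, y_score):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    A tie between a positive and a negative score counts as half a pair.
    """
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy example (made-up scores): one malignant case scores below one benign case.
y_true = [0, 0, 0, 1, 1, 1]
overlapping = [0.10, 0.30, 0.70, 0.60, 0.80, 0.90]
separated = [0.10, 0.20, 0.30, 0.70, 0.80, 0.90]

auc_overlapping = pairwise_auc(y_true, overlapping)  # 8 of 9 pairs ranked correctly
auc_separated = pairwise_auc(y_true, separated)      # all 9 pairs ranked correctly
print(auc_overlapping, auc_separated)
```

An AUC of 0.998, as obtained above, means that almost every malignant test case is scored higher than almost every benign one.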

Conclusion

BC is the leading cause of cancer-related death among women worldwide [3]. The present Python ML use case supports earlier studies [1-3] in that it demonstrates the potential of available ML/AI algorithms for detecting, analyzing, and classifying BC. The results show that ML is able to evaluate different features of a digitized image of a fine needle aspirate (FNA) of a breast mass. According to our statistical analysis, confidence intervals and correlation plots, the most important real-valued features are the radius, concavity and concave points of the cell nuclei in the image. Mean values of these features allow us to differentiate between benign and malignant tumors with a high degree of statistical confidence.

References

[1] https://medium.com/swlh/breast-cancer-classification-using-python-e83719e5f97d

[2] https://www.kaggle.com/uciml/breast-cancer-wisconsin-data

[3] https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8612371/

[4] Seaborn Violin Plot using sns.violinplot() Explained for Beginners – MLK – Machine Learning Knowledge

[5] https://seaborn.pydata.org/

[6] https://chartio.com/learn/charts/violin-plot-complete-guide

[7] https://www.investopedia.com/terms