Advanced Integrated Data Visualization (AIDV) in Python – 2. Dabl Auto EDA & ML

  • In this post, we will use Dabl for data pre-processing, advanced integrated visualisation, exploratory data analysis (EDA), and ML model development.

Table of Contents

  1. The Digits Classification Dataset
  2. HAR EDA
  3. The Mosaic Plot
  4. The Diamond Regression Dataset
  5. Age-Gender Histograms
  6. The Mfeat-Factors Dataset
  7. The Ames Housing Dataset
  8. The Adult Census Dataset
  9. The Wine Dataset
  10. The Titanic Dataset
  11. Australian Wildfires
  12. Mice Protein Expression Classification
  13. Eucalyptus Soil Conservation
  14. The ISOLET Dataset
  15. PC3 Software Defect Prediction
  16. Wall Robot Navigation
  17. Gesture Phase Segmentation Dataset
  18. DNA Sequence Dataset
  19. Bank Note Authentication – Classification
  20. Balance-Scale Analysis
  21. Churn Prediction
  22. Bank Marketing
  23. Plasma Retinol and Beta-Carotene
  24. The Lowbwt dataset
  25. The cps_85_wages Dataset
  26. Soil Compositions
  27. Relative CPU Performance Data
  28. Bank-Customers Simulations
  29. BNG(autoPrice) Data
  30. Bangladesh-Rainfall
  31. Brazilian Houses
  32. The 1000 Cameras Dataset
  33. Summary
  34. Explore More

First, let’s install dabl

!pip install dabl

and set the working directory DIR

import os
os.chdir('DIR')
os.getcwd()

The Digits Classification Dataset

Let’s run dabl.SimpleClassifier() as follows

import dabl
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_digits
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
sc = dabl.SimpleClassifier().fit(X_train, y_train)

Running DummyClassifier()
accuracy: 0.106 recall_macro: 0.100 precision_macro: 0.011 f1_macro: 0.019
=== new best DummyClassifier() (using recall_macro):
accuracy: 0.106 recall_macro: 0.100 precision_macro: 0.011 f1_macro: 0.019

Running GaussianNB()
accuracy: 0.835 recall_macro: 0.837 precision_macro: 0.855 f1_macro: 0.833
=== new best GaussianNB() (using recall_macro):
accuracy: 0.835 recall_macro: 0.837 precision_macro: 0.855 f1_macro: 0.833

Running MultinomialNB()
accuracy: 0.901 recall_macro: 0.902 precision_macro: 0.910 f1_macro: 0.901
=== new best MultinomialNB() (using recall_macro):
accuracy: 0.901 recall_macro: 0.902 precision_macro: 0.910 f1_macro: 0.901

Running DecisionTreeClassifier(class_weight='balanced', max_depth=1)
accuracy: 0.196 recall_macro: 0.199 precision_macro: 0.076 f1_macro: 0.099
Running DecisionTreeClassifier(class_weight='balanced', max_depth=10)
accuracy: 0.829 recall_macro: 0.830 precision_macro: 0.835 f1_macro: 0.829
Running DecisionTreeClassifier(class_weight='balanced', min_impurity_decrease=0.01)
accuracy: 0.780 recall_macro: 0.781 precision_macro: 0.794 f1_macro: 0.781
Running LogisticRegression(C=0.1, class_weight='balanced', max_iter=1000)
accuracy: 0.959 recall_macro: 0.959 precision_macro: 0.963 f1_macro: 0.960
=== new best LogisticRegression(C=0.1, class_weight='balanced', max_iter=1000) (using recall_macro):
accuracy: 0.959 recall_macro: 0.959 precision_macro: 0.963 f1_macro: 0.960

Running LogisticRegression(C=1, class_weight='balanced', max_iter=1000)
accuracy: 0.962 recall_macro: 0.962 precision_macro: 0.965 f1_macro: 0.963
=== new best LogisticRegression(C=1, class_weight='balanced', max_iter=1000) (using recall_macro):
accuracy: 0.962 recall_macro: 0.962 precision_macro: 0.965 f1_macro: 0.963


Best model:
LogisticRegression(C=1, class_weight='balanced', max_iter=1000)
Best Scores:
accuracy: 0.962 recall_macro: 0.962 precision_macro: 0.965 f1_macro: 0.963

print("Accuracy score", sc.score(X_test, y_test))

Accuracy score 0.98
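Since SimpleClassifier follows the familiar scikit-learn fit/predict API, we can also inspect its held-out predictions directly. A minimal sketch (not in the original post), assuming the fitted sc, X_test and y_test from above:

# predictions of the best model found by SimpleClassifier
y_pred = sc.predict(X_test)

from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))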

Let's load the dataset again and plot the input data as a sequence of 8×8 images

from sklearn.datasets import load_digits
digits = load_digits()
from matplotlib import pyplot as plt
fig = plt.figure(figsize=(6, 6)) # figure size in inches
fig.subplots_adjust(left=0, right=1, bottom=0, top=1, hspace=0.05, wspace=0.05)

for i in range(64):
    ax = fig.add_subplot(8, 8, i + 1, xticks=[], yticks=[])
    ax.imshow(digits.images[i], cmap=plt.cm.binary, interpolation='nearest')
    # label the image with the target value
    ax.text(0, 7, str(digits.target[i]))

Digits input images

Let's plot a projection onto the first two principal axes
plt.figure()

from sklearn.decomposition import PCA
pca = PCA(n_components=2)
proj = pca.fit_transform(digits.data)
plt.scatter(proj[:, 0], proj[:, 1], c=digits.target, cmap="Paired")
plt.colorbar()

A projection onto the first two principal axes

Let’s classify with the Gaussian Naive Bayes

from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(digits.data, digits.target)

clf = GaussianNB()
clf.fit(X_train, y_train)

predicted = clf.predict(X_test)
expected = y_test

fig = plt.figure(figsize=(6, 6)) # figure size in inches
fig.subplots_adjust(left=0, right=1, bottom=0, top=1, hspace=0.05, wspace=0.05)

for i in range(64):
    ax = fig.add_subplot(8, 8, i + 1, xticks=[], yticks=[])
    ax.imshow(X_test.reshape(-1, 8, 8)[i], cmap=plt.cm.binary,
              interpolation='nearest')

    # label the image with the predicted value: green if correct, red otherwise
    if predicted[i] == expected[i]:
        ax.text(0, 7, str(predicted[i]), color='green')
    else:
        ax.text(0, 7, str(predicted[i]), color='red')
Digits classification results

Let's print the classification report

from sklearn import metrics
print(metrics.classification_report(expected, predicted))

precision    recall  f1-score   support

           0       1.00      0.98      0.99        59
           1       0.86      0.80      0.83        45
           2       0.94      0.65      0.77        51
           3       0.92      0.82      0.87        44
           4       1.00      0.82      0.90        39
           5       0.85      0.94      0.89        36
           6       0.88      0.98      0.93        45
           7       0.79      0.95      0.86        43
           8       0.53      0.86      0.66        37
           9       0.88      0.73      0.80        51

    accuracy                           0.85       450
   macro avg       0.87      0.85      0.85       450
weighted avg       0.88      0.85      0.85       450

Let’s print the normalized confusion matrix

from sklearn.metrics import confusion_matrix
import seaborn as sns
import numpy as np

target_names = ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9']
cm = confusion_matrix(expected, predicted)

cmn = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
fig, ax = plt.subplots(figsize=(10, 10))
sns.heatmap(cmn, annot=True, fmt='.2f', xticklabels=target_names, yticklabels=target_names)
plt.ylabel('Actual')
plt.xlabel('Predicted')
plt.show(block=False)

Digits normalized confusion matrix

Let’s look at the ML performance using scikitplot

import scikitplot as skplt

import sklearn
from sklearn.datasets import load_digits, load_boston, load_breast_cancer
from sklearn.model_selection import train_test_split

from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor, GradientBoostingClassifier, ExtraTreesClassifier
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

import matplotlib.pyplot as plt

import sys
import warnings
warnings.filterwarnings("ignore")

print("Scikit Plot Version : ", skplt.__version__)
print("Scikit Learn Version : ", sklearn.__version__)
print("Python Version : ", sys.version)

%matplotlib inline

Scikit Plot Version :  0.3.7
Scikit Learn Version :  1.1.3
Python Version :  3.9.16 (main, Jan 11 2023, 16:16:36) [MSC v.1916 64 bit (AMD64)]

skplt.estimators.plot_learning_curve(clf, X_train, y_train,
    cv=7, shuffle=True, scoring="accuracy",
    n_jobs=-1, figsize=(6,4), title_fontsize="large", text_fontsize="large",
    title="GaussianNB() Digits Classification Learning Curve");

GaussianNB() Digits Classification Learning Curve

Let’s compare it against Logistic Regression

lr = LogisticRegression(C=1, class_weight='balanced', max_iter=1000)
skplt.estimators.plot_learning_curve(lr, X_train, y_train,
    cv=7, shuffle=True, scoring="accuracy",
    n_jobs=-1, figsize=(6,4), title_fontsize="large", text_fontsize="large",
    title="Logistic Regression Digits Classification Learning Curve");

Logistic Regression Digits Classification Learning Curve

Let’s plot the ROC curve

Y_test_probs = clf.predict_proba(X_test)

skplt.metrics.plot_roc_curve(y_test, Y_test_probs,
    title="GaussianNB() Digits ROC Curve", figsize=(12,6));

GaussianNB() Digits ROC Curve

Let’s plot the Precision-Recall Curve

skplt.metrics.plot_precision_recall_curve(y_test, Y_test_probs,
    title="GaussianNB() Digits Precision-Recall Curve", figsize=(12,6));

GaussianNB() Digits Precision-Recall Curve

Let’s plot the elbow plot

skplt.cluster.plot_elbow_curve(KMeans(random_state=1),
X_train,
cluster_ranges=range(2, 20),
figsize=(8,6));

Digits elbow plot

Let’s perform the KMeans silhouette analysis

kmeans = KMeans(n_clusters=10, random_state=1)
kmeans.fit(X_train, y_train)
cluster_labels = kmeans.predict(X_test)

skplt.metrics.plot_silhouette(X_test, cluster_labels,
figsize=(8,6));

The KMeans silhouette analysis

Let’s plot the pca_component_variance

pca = PCA(random_state=1)
pca.fit(X_train)

skplt.decomposition.plot_pca_component_variance(pca, figsize=(8,6));

PCA component explained variances

Let’s transform the classification report into the image

from yellowbrick.classifier import ClassificationReport

viz = ClassificationReport(clf,
classes=target_names,
support=True,
fig=plt.figure(figsize=(8,6)))

viz.fit(X_train, y_train)

viz.score(X_test, y_test)

viz.show();

GaussianNB classification report

Let’s plot the class prediction error for GaussianNB

from yellowbrick.classifier import ClassPredictionError

viz = ClassPredictionError(clf,
classes=target_names,
fig=plt.figure(figsize=(9,6)))

viz.fit(X_train, y_train)

viz.score(X_test, y_test)

viz.show();

Class prediction error for GaussianNB

HAR EDA

Let’s fetch available datasets from OpenML by name. Examples of using sklearn.datasets.fetch_openml can be found here.

Human Activity Recognition (HAR) is the problem of classifying sequences of accelerometer data recorded by specialized harnesses or smartphones into known well-defined movements.

import matplotlib.pyplot as plt
from sklearn.datasets import fetch_openml
from dabl import plot

X, y = fetch_openml('har', as_frame=True, return_X_y=True)

plot(X, y)
plt.show()

Target looks like classification
Showing only top 10 of 561 continuous features
Linear Discriminant Analysis training set score: 0.984
HAR target distribution
HAR classes
HAR top feature interactions

The Mosaic Plot

This is a nice illustration of the mosaic plot:

import matplotlib.pyplot as plt
from sklearn.datasets import fetch_openml
from dabl import plot

X, y = fetch_openml('splice', as_frame=True, return_X_y=True)

plot(X, y)
plt.show()

Target looks like classification
Showing only top 10 of 60 categorical features
Splice target distribution
Splice mosaic plot

A mosaic plot (also known as a Marimekko diagram) is essentially the graphical counterpart of Python's pd.crosstab() function: crosstab gives us a table of counts, whereas a mosaic plot renders the same cross-tabulation as a diagram that can go straight into a data analysis report.
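To make the crosstab/mosaic relationship concrete, here is a minimal sketch on a toy categorical dataset. It is not from the original post and uses statsmodels (not dabl) for the mosaic plot:

import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.graphics.mosaicplot import mosaic

# toy data invented for illustration only
df = pd.DataFrame({
    'gender': ['male', 'female', 'male', 'female', 'male', 'female'],
    'survived': ['yes', 'no', 'no', 'yes', 'no', 'yes'],
})

# numeric cross-tabulation ...
print(pd.crosstab(df['gender'], df['survived']))

# ... and the same information as a mosaic plot
mosaic(df, ['gender', 'survived'])
plt.show()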

The Diamond Regression Dataset

Let's apply regression analysis to predict diamond prices based on cut, color, and clarity.

import matplotlib.pyplot as plt
from sklearn.datasets import fetch_openml
from dabl import plot

X, y = fetch_openml('diamonds', as_frame=True, return_X_y=True)

plot(X, y)
plt.show()

Target looks like regression
Diamond price target distribution
Diamond price vs cut, color, and clarity

Age-Gender Histograms

How to Visualize Age/Sex Patterns with dabl?

Let’s compare the histograms of age per gender

import matplotlib.pyplot as plt
from dabl.datasets import load_adult
from dabl.plot import class_hists

data = load_adult()

class_hists(data, "age", "gender", legend=True)
plt.show()

Histograms age vs gender

A histogram is similar in appearance to a bar chart, but instead of comparing categories or looking for trends over time, it shows how the values of a single variable are distributed: each bar covers a continuous range of values, and its height gives the number of observations falling in that range.

Histograms are useful for showing the distribution of a single scale variable. Data are binned and summarized using a count or percentage statistic.
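As a quick illustration of the binning described above (not from the original post), here is how the counts and percentages behind a histogram can be computed directly with NumPy on some synthetic ages:

import numpy as np

rng = np.random.default_rng(0)
ages = rng.normal(loc=40, scale=12, size=500)  # synthetic data for illustration

# bin the values and summarize each bin as a count and a percentage
counts, bin_edges = np.histogram(ages, bins=10)
percentages = 100 * counts / counts.sum()

for lo, hi, c, p in zip(bin_edges[:-1], bin_edges[1:], counts, percentages):
    print(f'{lo:5.1f} - {hi:5.1f}: {c:3d} ({p:4.1f}%)')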

The Mfeat-Factors Dataset

The mfeat-factors (Multiple Features Dataset: Factors) is one of a set of 6 datasets describing features of handwritten numerals (0 – 9) extracted from a collection of Dutch utility maps. Corresponding patterns in different datasets correspond to the same original character. 200 instances per class (for a total of 2,000 instances) have been digitized in binary images.

import matplotlib.pyplot as plt
from sklearn.datasets import fetch_openml
from dabl import plot

X, y = fetch_openml('mfeat-factors', as_frame=True, return_X_y=True)

plot(X, y)
plt.show()

Target looks like classification
Showing only top 10 of 216 continuous features
Linear Discriminant Analysis training set score: 0.993
Mfeat-Factors target distribution
Mfeat-Factors classes 1-10
Mfeat-Factors LDA PCA

The Ames Housing Dataset

Let’s look at the “Ames housing” dataset. This dataset is similar to the “California housing” dataset. However, it is more complex to handle: it contains missing data and both numerical and categorical features.

from dabl import plot
from dabl.datasets import load_ames
import matplotlib.pyplot as plt

data = load_ames()

plot(data, 'SalePrice')
plt.show()

Target looks like regression
Showing only top 10 of 41 categorical features
Ames target distribution
Ames continuous/categorical feature distributions
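Since the Ames data mixes missing values with numerical and categorical columns, it is also a nice place to see how dabl's cleaning and type detection deal with such a table. A minimal sketch (not from the original post), assuming the data frame loaded above:

import dabl

ames_clean = dabl.clean(data)
# counts of columns per detected type (continuous, categorical, dirty_float, ...)
print(dabl.detect_types(ames_clean).sum())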

The Adult Census Dataset

The dataset is a collection of information related to a person. The prediction task is to predict whether a person earns more or less than $50K per year.

from dabl import plot
from dabl.datasets import load_adult
import matplotlib.pyplot as plt

data = load_adult()

plot(data, 'income', scatter_alpha=.1)
plt.show()

Target looks like classification
Linear Discriminant Analysis training set score: 0.530
The Adult Census target distribution
The Adult Census continuous features pairplot
The Adult Census discriminating PCA directions
The Adult Census linear discriminant
The Adult Census mosaic plots of categorical features vs target

data_clean = dabl.clean(data)  # create the cleaned frame used below

model = dabl.SimpleClassifier(random_state=0)
X = data_clean.drop("income", axis=1)
y = data_clean.income
model.fit(X, y)

Running DummyClassifier(random_state=0)
accuracy: 0.759 average_precision: 0.241 roc_auc: 0.500 recall_macro: 0.500 f1_macro: 0.432
=== new best DummyClassifier(random_state=0) (using recall_macro):
accuracy: 0.759 average_precision: 0.241 roc_auc: 0.500 recall_macro: 0.500 f1_macro: 0.432

Running GaussianNB()
accuracy: 0.408 average_precision: 0.296 roc_auc: 0.619 recall_macro: 0.595 f1_macro: 0.407
=== new best GaussianNB() (using recall_macro):
accuracy: 0.408 average_precision: 0.296 roc_auc: 0.619 recall_macro: 0.595 f1_macro: 0.407

Running MultinomialNB()
accuracy: 0.814 average_precision: 0.694 roc_auc: 0.881 recall_macro: 0.776 f1_macro: 0.760
=== new best MultinomialNB() (using recall_macro):
accuracy: 0.814 average_precision: 0.694 roc_auc: 0.881 recall_macro: 0.776 f1_macro: 0.760

Running DecisionTreeClassifier(class_weight='balanced', max_depth=1, random_state=0)
accuracy: 0.710 average_precision: 0.417 roc_auc: 0.759 recall_macro: 0.759 f1_macro: 0.682
Running DecisionTreeClassifier(class_weight='balanced', max_depth=5, random_state=0)
accuracy: 0.729 average_precision: 0.673 roc_auc: 0.870 recall_macro: 0.784 f1_macro: 0.702
=== new best DecisionTreeClassifier(class_weight='balanced', max_depth=5, random_state=0) (using recall_macro):
accuracy: 0.729 average_precision: 0.673 roc_auc: 0.870 recall_macro: 0.784 f1_macro: 0.702

Running DecisionTreeClassifier(class_weight='balanced', min_impurity_decrease=0.01,
                       random_state=0)
accuracy: 0.718 average_precision: 0.536 roc_auc: 0.810 recall_macro: 0.779 f1_macro: 0.693
Running LogisticRegression(C=0.1, class_weight='balanced', max_iter=1000,
                   random_state=0)
accuracy: 0.806 average_precision: 0.759 roc_auc: 0.904 recall_macro: 0.819 f1_macro: 0.769
=== new best LogisticRegression(C=0.1, class_weight='balanced', max_iter=1000,
                   random_state=0) (using recall_macro):
accuracy: 0.806 average_precision: 0.759 roc_auc: 0.904 recall_macro: 0.819 f1_macro: 0.769

Running LogisticRegression(C=1, class_weight='balanced', max_iter=1000, random_state=0)
accuracy: 0.808 average_precision: 0.760 roc_auc: 0.905 recall_macro: 0.820 f1_macro: 0.771
=== new best LogisticRegression(C=1, class_weight='balanced', max_iter=1000, random_state=0) (using recall_macro):
accuracy: 0.808 average_precision: 0.760 roc_auc: 0.905 recall_macro: 0.820 f1_macro: 0.771
Best model:
LogisticRegression(C=1, class_weight='balanced', max_iter=1000, random_state=0)
Best Scores:
accuracy: 0.808 average_precision: 0.760 roc_auc: 0.905 recall_macro: 0.820 f1_macro: 0.771

SimpleClassifier(random_state=0)

dabl.explain(model)

Feature coefficients

The Wine Dataset

Let’s load the wine dataset (classification).

import matplotlib.pyplot as plt
from sklearn.datasets import load_wine
from dabl import plot
from dabl.utils import data_df_from_bunch

wine_bunch = load_wine()
wine_df = data_df_from_bunch(wine_bunch)

plot(wine_df, 'target')
plt.show()

Target looks like classification
Linear Discriminant Analysis training set score: 1.000
Wine classification: target distribution
Wine classification: histograms
Wine classification: feature interactions
Wine classification: LDA directions

The Titanic Dataset

Let’s look at the classic Titanic dataset, otherwise known as the course material for Kaggle 101. 

import dabl
import pandas as pd
import matplotlib.pyplot as plt

titanic = pd.read_csv(dabl.datasets.data_path("titanic.csv"))

titanic.shape

(1309, 14)


titanic.head

<bound method NDFrame.head of       pclass  survived                                             name  \
0          1         1                    Allen, Miss. Elisabeth Walton   
1          1         1                   Allison, Master. Hudson Trevor   
2          1         0                     Allison, Miss. Helen Loraine   
3          1         0             Allison, Mr. Hudson Joshua Creighton   
4          1         0  Allison, Mrs. Hudson J C (Bessie Waldo Daniels)   
...      ...       ...                                              ...   
1304       3         0                             Zabour, Miss. Hileni   
1305       3         0                            Zabour, Miss. Thamine   
1306       3         0                        Zakarian, Mr. Mapriededer   
1307       3         0                              Zakarian, Mr. Ortin   
1308       3         0                               Zimmerman, Mr. Leo   

         sex     age  sibsp  parch  ticket      fare    cabin embarked boat  \
0     female      29      0      0   24160  211.3375       B5        S    2   
1       male  0.9167      1      2  113781    151.55  C22 C26        S   11   
2     female       2      1      2  113781    151.55  C22 C26        S    ?   
3       male      30      1      2  113781    151.55  C22 C26        S    ?   
4     female      25      1      2  113781    151.55  C22 C26        S    ?   
...      ...     ...    ...    ...     ...       ...      ...      ...  ...   
1304  female    14.5      1      0    2665   14.4542        ?        C    ?   
1305  female       ?      1      0    2665   14.4542        ?        C    ?   
1306    male    26.5      0      0    2656     7.225        ?        C    ?   
1307    male      27      0      0    2670     7.225        ?        C    ?   
1308    male      29      0      0  315082     7.875        ?        S    ?   

     body                        home.dest  
0       ?                     St Louis, MO  
1       ?  Montreal, PQ / Chesterville, ON  
2       ?  Montreal, PQ / Chesterville, ON  
3     135  Montreal, PQ / Chesterville, ON  
4       ?  Montreal, PQ / Chesterville, ON  
...   ...                              ...  
1304  328                                ?  
1305    ?                                ?  
1306  304                                ?  
1307    ?                                ?  
1308    ?                                ?  

[1309 rows x 14 columns]>

titanic_clean = dabl.clean(titanic, verbose=1)

types = dabl.detect_types(titanic_clean)
print (types)

dabl.plot(titanic, 'survived')

plt.show()


Detected feature types:
continuous                  0
dirty_float                 3
low_card_int_ordinal        2
low_card_int_categorical    0
categorical                 5
date                        0
free_string                 4
useless                     0
dtype: int64
                      continuous  dirty_float  low_card_int_ordinal  \
pclass                     False        False                 False   
survived                   False        False                 False   
name                       False        False                 False   
sex                        False        False                 False   
sibsp                      False        False                  True   
parch                      False        False                  True   
ticket                     False        False                 False   
cabin                      False        False                 False   
embarked                   False        False                 False   
boat                       False        False                 False   
home.dest                  False        False                 False   
age_?                      False        False                 False   
age_dabl_continuous         True        False                 False   
fare_?                     False        False                 False   
fare_dabl_continuous        True        False                 False   
body_?                     False        False                 False   
body_dabl_continuous        True        False                 False   
                      low_card_int_categorical  categorical   date  \
pclass                                   False         True  False   
survived                                 False         True  False   
name                                     False        False  False   
sex                                      False         True  False   
sibsp                                    False        False  False   
parch                                    False        False  False   
ticket                                   False        False  False   
cabin                                    False        False  False   
embarked                                 False         True  False   
boat                                     False         True  False   
home.dest                                False        False  False   
age_?                                    False         True  False   
age_dabl_continuous                      False        False  False   
fare_?                                   False        False  False   
fare_dabl_continuous                     False        False  False   
body_?                                   False         True  False   
body_dabl_continuous                     False        False  False   
                      free_string  useless  
pclass                      False    False  
survived                    False    False  
name                         True    False  
sex                         False    False  
sibsp                       False    False  
parch                       False    False  
ticket                       True    False  
cabin                        True    False  
embarked                    False    False  
boat                        False    False  
home.dest                    True    False  
age_?                       False    False  
age_dabl_continuous         False    False  
fare_?                      False     True  
fare_dabl_continuous        False    False  
body_?                      False    False  
body_dabl_continuous        False    False  
Target looks like classification
Linear Discriminant Analysis training set score: 0.578

Target Distribution

Continuous Features Pairplot

Titanic Continuous Features Pairplot
Titanic Continuous Features Pairplot
Titanic PCA

Linear Discriminant

Titanic Linear Discriminant

Categorical features vs Target

Titanic Categorical features vs Target
Titanic Categorical features vs Target

fc = dabl.SimpleClassifier(random_state=0)
X = titanic_clean.drop("survived", axis=1)
y = titanic_clean.survived
fc.fit(X, y)

Running DummyClassifier(random_state=0)
accuracy: 0.618 average_precision: 0.382 roc_auc: 0.500 recall_macro: 0.500 f1_macro: 0.382
=== new best DummyClassifier(random_state=0) (using recall_macro):
accuracy: 0.618 average_precision: 0.382 roc_auc: 0.500 recall_macro: 0.500 f1_macro: 0.382

Running GaussianNB()
accuracy: 0.970 average_precision: 0.979 roc_auc: 0.986 recall_macro: 0.963 f1_macro: 0.968
=== new best GaussianNB() (using recall_macro):
accuracy: 0.970 average_precision: 0.979 roc_auc: 0.986 recall_macro: 0.963 f1_macro: 0.968

Running MultinomialNB()
accuracy: 0.968 average_precision: 0.982 roc_auc: 0.984 recall_macro: 0.961 f1_macro: 0.966
Running DecisionTreeClassifier(class_weight='balanced', max_depth=1, random_state=0)
accuracy: 0.976 average_precision: 0.954 roc_auc: 0.971 recall_macro: 0.971 f1_macro: 0.974
=== new best DecisionTreeClassifier(class_weight='balanced', max_depth=1, random_state=0) (using recall_macro):
accuracy: 0.976 average_precision: 0.954 roc_auc: 0.971 recall_macro: 0.971 f1_macro: 0.974

Running DecisionTreeClassifier(class_weight='balanced', max_depth=5, random_state=0)
accuracy: 0.969 average_precision: 0.965 roc_auc: 0.983 recall_macro: 0.965 f1_macro: 0.967
Running DecisionTreeClassifier(class_weight='balanced', min_impurity_decrease=0.01,
                       random_state=0)
accuracy: 0.976 average_precision: 0.954 roc_auc: 0.971 recall_macro: 0.971 f1_macro: 0.974
Running LogisticRegression(C=0.1, class_weight='balanced', max_iter=1000,
                   random_state=0)
accuracy: 0.975 average_precision: 0.991 roc_auc: 0.993 recall_macro: 0.971 f1_macro: 0.973
Running LogisticRegression(C=1, class_weight='balanced', max_iter=1000, random_state=0)
accuracy: 0.976 average_precision: 0.991 roc_auc: 0.994 recall_macro: 0.971 f1_macro: 0.974

Best model:
DecisionTreeClassifier(class_weight='balanced', max_depth=1, random_state=0)
Best Scores:
accuracy: 0.976 average_precision: 0.954 roc_auc: 0.971 recall_macro: 0.971 f1_macro: 0.974


SimpleClassifier(random_state=0)

Australian Wildfires

Let’s look at the Kaggle Australian wildfire dataset.

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
data = pd.read_csv('fire_archive_V1_96617.csv')
data.columns

Index(['latitude', 'longitude', 'bright_ti4', 'scan', 'track', 'acq_date',
       'acq_time', 'satellite', 'instrument', 'confidence', 'version',
       'bright_ti5', 'frp', 'type'],
      dtype='object')

data.type.value_counts()

0    180150
3      2735
2      1893
Name: type, dtype: int64

import dabl
dabl.plot(data, target_col='type', type_hints={'type': 'categorical'})

Target looks like classification
Linear Discriminant Analysis training set score: 0.355
Wildfires: target
Wildfires: feature histograms
Wildfires: top feature interactions
Wildfires: LDA directions
Wildfires: Categorical features vs target

Mice Protein Expression Classification

Let's look at the Kaggle Mice Protein Expression Classification dataset.

from dabl import plot
from sklearn.datasets import fetch_openml

data = fetch_openml('MiceProtein', as_frame=True)

print(data.frame.shape)

(1080, 78)

target_col = data.target.name
plot(data.frame, target_col=target_col, type_hints={target_col: 'categorical'})

Target looks like classification
Showing only top 10 of 77 continuous features
Linear Discriminant Analysis training set score: 0.992
Mice Protein Expression target distribution
Mice Protein Expression feature histograms
Mice Protein Expression feature interactions
Mice Protein Expression LDA

Eucalyptus Soil Conservation

Let’s examine the Eucalyptus Soil Conservation dataset.

data = fetch_openml('eucalyptus', as_frame=True)
target_col = data.target.name
plot(data.frame, target_col=target_col, type_hints={target_col: 'categorical'})

Output:

Target looks like classification
Linear Discriminant Analysis training set score: 0.517
Eucalyptus target distribution
Eucalyptus feature histograms
Eucalyptus feature interactions
Eucalyptus categorical features vs target

The ISOLET Dataset

The ISOLET (Isolated Letter Speech Recognition) dataset was generated as follows: 150 subjects spoke the name of each letter of the alphabet twice, giving 52 training examples (26 letters × 2) per speaker. The speakers are grouped into sets of 30 speakers each; four groups serve as the training set and the last group as the test set. A total of 3 examples are missing; the authors dropped them due to difficulties in recording.

This is a good domain for a noisy, perceptual task. It is also a very good domain for testing the scaling abilities of algorithms.

data = fetch_openml('isolet', as_frame=True)
target_col = data.target.name
plot(data.frame, target_col=target_col, type_hints={target_col: 'categorical'})

Target looks like classification
Showing only top 10 of 613 continuous features
Linear Discriminant Analysis training set score: 0.969
ISOLET target distribution
ISOLET feature histograms
ISOLET feature histograms
ISOLET feature interactions
ISOLET categorical features vs target

PC3 Software Defect Prediction

This is one of the NASA Metrics Data Program defect data sets, containing data from flight software for an earth-orbiting satellite. The features come from the McCabe and Halstead source-code feature extractors, which were defined in the 1970s in an attempt to objectively characterize code properties associated with software quality.

data = fetch_openml('pc3', as_frame=True)
target_col = data.target.name
plot(data.frame, target_col=target_col, type_hints={target_col: 'categorical'})

Linear Discriminant Analysis training set score: 0.631
PC3 target distribution
PC3 feature histograms
PC3 linear discriminant
PC3 feature interactions
PC3 categorical features vs target

Wall Robot Navigation

The Wall-Following Robot Navigation dataset was collected as the SCITOS G5 robot navigated through a room following the wall in a clockwise direction, for 4 rounds, using 24 ultrasound sensors arranged circularly around its 'waist'.

The data consists of raw values of the measurements of all 24 ultrasound sensors and the corresponding class label. Sensor readings are sampled at a rate of 9 samples per second.

The class labels are:

  1. Move-Forward,
  2. Slight-Right-Turn,
  3. Sharp-Right-Turn,
  4. Slight-Left-Turn

It is worth mentioning that the 24 ultrasound readings and the simplified distances were collected at the same time step, so each file has the same number of rows (one for each sampling time step).

The wall-following task and data gathering were designed to test the hypothesis that this apparently simple navigation task is indeed a non-linearly separable classification task. Thus, linear classifiers, such as the Perceptron network, are not able to learn the task and command the robot around the room without collisions. Nonlinear neural classifiers, such as the MLP network, are able to learn the task and command the robot successfully without collisions.
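Before running dabl on this dataset, here is a minimal sketch (not from the original post) that illustrates the linear-vs-nonlinear claim above by comparing a linear Perceptron with an MLP classifier; the exact scores will vary with the split and hyperparameters:

from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Perceptron
from sklearn.neural_network import MLPClassifier

X, y = fetch_openml('wall-robot-navigation', as_frame=True, return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# a linear model vs a nonlinear one on the same standardized features
linear = make_pipeline(StandardScaler(), Perceptron(random_state=0))
nonlinear = make_pipeline(StandardScaler(), MLPClassifier(max_iter=500, random_state=0))

for name, model in [('Perceptron (linear)', linear), ('MLP (nonlinear)', nonlinear)]:
    model.fit(X_train, y_train)
    print(name, 'test accuracy:', round(model.score(X_test, y_test), 3))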

data = fetch_openml('wall-robot-navigation', as_frame=True)
target_col = data.target.name
plot(data.frame, target_col=target_col, type_hints={target_col: 'categorical'})

Target looks like classification
Showing only top 10 of 24 continuous features
Linear Discriminant Analysis training set score: 0.603
Wall Robot Navigation target distribution
Wall Robot histograms
Wall Robot Navigation feature interactions
Wall Robot Navigation LDA

Gesture Phase Segmentation Dataset

The dataset is composed of features extracted from 7 videos of people gesticulating, aimed at studying gesture phase segmentation.
Each video is represented by two files: a raw file, which contains the position of the hands, wrists, head and spine of the user in each frame, and a processed file, which contains the velocity and acceleration of the hands and wrists. See the dataset description for more information.

data = fetch_openml('GesturePhaseSegmentationProcessed', as_frame=True)
target_col = data.target.name
plot(data.frame, target_col=target_col, type_hints={target_col: 'categorical'})

Target looks like classification
Showing only top 10 of 32 continuous features
Linear Discriminant Analysis training set score: 0.366
Gesture Phase Segmentation target distribution
Gesture Phase Segmentation histograms
Gesture Phase Segmentation feature interactions

DNA Sequence Dataset

Let's see how DNA sequence data can be interpreted and how machine learning algorithms can be used to build a prediction model on it.

data = fetch_openml('dna', as_frame=True)
target_col = data.target.name
plot(data.frame, target_col=target_col, type_hints={target_col: 'categorical'})

Target looks like classification
Showing only top 10 of 180 categorical features
DNA target distribution
DNA mosaic plot

This is a good application of mosaic plots.
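As a possible next step (not shown in the original post), the same frame can be handed to dabl.SimpleClassifier, which takes care of encoding the 180 categorical features; a minimal sketch, assuming data and target_col from the cell above:

import dabl

# SimpleClassifier accepts a DataFrame plus the target column name
model = dabl.SimpleClassifier(random_state=0)
model.fit(data.frame, target_col=target_col)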

Bank Note Authentication – Classification

Let’s look at the Bank Note Authentication UCI data

data = fetch_openml('banknote-authentication', as_frame=True)
target_col = data.target.name
plot(data.frame, target_col=target_col, type_hints={target_col: 'categorical'})

Target looks like classification
Linear Discriminant Analysis training set score: 0.979
 Bank Note Authentication target distribution pairplot

Continuous features pairplot

 Bank Note Authentication
 Bank Note Authentication target distribution pairplot
 Bank Note Authentication PCA
Bank Note Authentication LDA

This is a perfect application of LDA.

Balance-Scale Analysis

This data set was generated to model psychological experimental results. Each example is classified as having the balance scale tip to the right, tip to the left, or be balanced. The attributes are the left weight, the left distance, the right weight, and the right distance. The class is found by comparing (left-distance × left-weight) with (right-distance × right-weight): the greater product wins, and if they are equal the scale is balanced (see the short sketch after the attribute list below).

Attribute Information:

  1. Class Name: 3 (L, B, R)
  2. Left-Weight: 5 (1, 2, 3, 4, 5)
  3. Left-Distance: 5 (1, 2, 3, 4, 5)
  4. Right-Weight: 5 (1, 2, 3, 4, 5)
  5. Right-Distance: 5 (1, 2, 3, 4, 5)
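The deterministic rule can be written down in a few lines of plain Python (a small illustration, not from the original post):

def balance_class(left_weight, left_distance, right_weight, right_distance):
    # compare the torque on each side of the scale
    left_torque = left_weight * left_distance
    right_torque = right_weight * right_distance
    if left_torque > right_torque:
        return 'L'   # tips to the left
    if right_torque > left_torque:
        return 'R'   # tips to the right
    return 'B'       # balanced

print(balance_class(3, 2, 1, 5))  # 6 vs 5 -> 'L'
print(balance_class(2, 2, 4, 1))  # 4 vs 4 -> 'B'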

data = fetch_openml('balance-scale', as_frame=True)
target_col = data.target.name
plot(data.frame, target_col=target_col,
     type_hints={target_col: 'categorical', 'left-distance': 'continuous',
                 'right-distance': 'continuous', 'left-weight': 'continuous',
                 'right-weight': 'continuous'})

Target looks like classification
Linear Discriminant Analysis training set score: 0.638
Balance-Scale target distribution
Balance-Scale histograms
Balance-Scale histograms
Balance-Scale PCA
Balance-Scale LDA

This plot shows the LDA structure of the data.

Churn Prediction

Telco Customer Churn: predicting which customers will leave, so that customer retention programs can be focused on them.

data = fetch_openml('churn', as_frame=True)
target_col = data.target.name
plot(data.frame, target_col=target_col, type_hints={target_col: 'categorical'})

Target looks like classification
Linear Discriminant Analysis training set score: 0.522
Churn target distribution
Churn histograms
Churn LDA
Churn PCA
Churn mosaic plot

Bank Marketing

The Bank Marketing Dataset is about predicting term deposit subscriptions. How can the financial institution improve the effectiveness of future marketing campaigns?

data = fetch_openml('bank-marketing', as_frame=True)

print(data.frame.shape)

(45211, 17)

target_col = data.target.name
plot(data.frame, target_col=target_col, type_hints={target_col: 'categorical'})

Bank marketing target distribution
Bank marketing histograms
Bank marketing LDA
Bank marketing feature interactions
Bank marketing categorical features vs target

Plasma Retinol and Beta-Carotene

The Plasma Retinol and Beta-Carotene Dataset – Determinants of Plasma Retinol and Beta-Carotene Levels. This datafile contains 315 observations on 14 variables. This data set can be used to demonstrate multiple regression, transformations, categorical variables, outliers, pooled tests of significance and model building strategies.

data = fetch_openml('plasma_retinol', as_frame=True)
target_col = data.target.name
plot(data.frame, target_col=target_col, type_hints={target_col: 'continuous'}, find_scatter_categoricals=True)

Target looks like regression
Plasma Retinol target distribution
Plasma Retinol continuous features vs target
Plasma Retinol categorical feature vs target

The Lowbwt dataset

Let's look at the low birth weight data. The goal of this study was to identify risk factors associated with giving birth to a low birth weight baby (weighing less than 2,500 grams). Data were collected on 189 women, 59 of whom had low birth weight babies and 130 of whom had normal birth weight babies. Four variables thought to be of importance were age, weight of the subject at her last menstrual period, race, and the number of physician visits during the first trimester of pregnancy.

Columns, variables, and abbreviations:

  • 2-4: Identification Code (ID)
  • 10: Low Birth Weight (0 = Birth Weight >= 2500 g, 1 = Birth Weight < 2500 g) (LOW)
  • 17-18: Age of the Mother in Years (AGE)
  • 23-25: Weight in Pounds at the Last Menstrual Period (LWT)
  • 32: Race (1 = White, 2 = Black, 3 = Other) (RACE)
  • 40: Smoking Status During Pregnancy (1 = Yes, 0 = No) (SMOKE)
  • 48: History of Premature Labor (0 = None, 1 = One, etc.) (PTL)
  • 55: History of Hypertension (1 = Yes, 0 = No) (HT)
  • 61: Presence of Uterine Irritability (1 = Yes, 0 = No) (UI)
  • 67: Number of Physician Visits During the First Trimester (0 = None, 1 = One, 2 = Two, etc.) (FTV)
  • 73-76: Birth Weight in Grams (BWT)

data = fetch_openml('lowbwt', as_frame=True)
target_col = data.target.name
plot(data.frame, target_col=target_col, type_hints={target_col: 'continuous'}, find_scatter_categoricals=True)

Target looks like regression
[<AxesSubplot: title={'center': 'Target distribution'}, xlabel='class', ylabel='frequency'>,
 array([[<AxesSubplot: title={'center': 'F=2.48E-01'}, xlabel='LWT', ylabel='class'>,
         <AxesSubplot: title={'center': 'F=6.11E-02'}, xlabel='AGE (jittered)'>]],
       dtype=object),
 array([[<AxesSubplot: title={'center': 'F=6.16E-01'}, xlabel='class', ylabel='LOW'>,
         <AxesSubplot: title={'center': 'F=1.31E-01'}, xlabel='class', ylabel='RACE'>,
         <AxesSubplot: title={'center': 'F=2.37E-02'}, xlabel='class', ylabel='SMOKE'>,
         <AxesSubplot: title={'center': 'F=2.10E-02'}, xlabel='class', ylabel='PTL'>],
        [<AxesSubplot: title={'center': 'F=1.31E-02'}, xlabel='class', ylabel='HT'>,
         <AxesSubplot: title={'center': 'F=1.22E-02'}, xlabel='class', ylabel='UI'>,
         <AxesSubplot: title={'center': 'F=2.92E-04'}, xlabel='class', ylabel='FTV'>,
         <AxesSubplot: >]], dtype=object)]
 Lowbwt target distribution
Lowbwt continuous feature
Lowbwt categorical feature vs target

The cps_85_wages Dataset

Determinants of Wages from the 1985 Current Population Survey.

The Current Population Survey (CPS) is used to supplement census information between census years. These data consist of a random sample of 534 persons from the CPS, with information on wages and other characteristics of the workers, including sex, number of years of education, years of work experience, occupational status, region of residence and union membership.

data = fetch_openml('cps_85_wages', as_frame=True)
target_col = data.target.name
plot(data.frame, target_col=target_col, type_hints={target_col: 'continuous'}, find_scatter_categoricals=True)

Target looks like regression
[<AxesSubplot: title={'center': 'Target distribution'}, xlabel='WAGE', ylabel='frequency'>,
 array([[<AxesSubplot: title={'center': 'F=2.58E-01'}, xlabel='AGE', ylabel='WAGE'>]],
       dtype=object),
 array([[<AxesSubplot: title={'center': 'F=1.55E-01'}, xlabel='WAGE', ylabel='EDUCATION'>,
         <AxesSubplot: title={'center': 'F=8.58E-02'}, xlabel='WAGE', ylabel='SOUTH'>,
         <AxesSubplot: title={'center': 'F=4.08E-02'}, xlabel='WAGE', ylabel='SEX'>,
         <AxesSubplot: title={'center': 'F=2.44E-02'}, xlabel='WAGE', ylabel='UNION'>],
        [<AxesSubplot: title={'center': 'F=1.97E-02'}, xlabel='WAGE', ylabel='RACE'>,
         <AxesSubplot: title={'center': 'F=1.26E-02'}, xlabel='WAGE', ylabel='OCCUPATION'>,
         <AxesSubplot: title={'center': 'F=0.00E+00'}, xlabel='WAGE', ylabel='SECTOR'>,
         <AxesSubplot: title={'center': 'F=0.00E+00'}, xlabel='WAGE', ylabel='MARR'>]],
       dtype=object)]
cps_85_wages target distribution and continuous feature vs target
cps_85_wages categorical feature vs target

Soil Compositions

Soil Compositions of Physical and Chemical Characteristics

OpenML visualizing_soil

data = fetch_openml('visualizing_soil', as_frame=True)
target_col = data.target.name
plot(data.frame, target_col=target_col, type_hints={target_col: 'continuous'}, find_scatter_categoricals=True)

Target looks like regression
[<AxesSubplot: title={'center': 'Target distribution'}, xlabel='track', ylabel='frequency'>,
 array([[<AxesSubplot: title={'center': 'F=6.77E-01'}, xlabel='northing', ylabel='track'>,
         <AxesSubplot: title={'center': 'F=1.13E-01'}, xlabel='easting'>,
         <AxesSubplot: title={'center': 'F=-6.80E-02'}, xlabel='resistivity'>]],
       dtype=object),
 array([[<AxesSubplot: title={'center': 'F=3.20E-01'}, xlabel='track', ylabel='isns'>]],
       dtype=object)]
visualizing_soil target distribution and continuous feature vs target
visualizing_soil boxplot

Relative CPU Performance Data

OpenML: machine_cpu

The problem concerns relative CPU performance data; more information can be obtained in the UCI Machine Learning repository. The attributes used are:

  • MYCT: machine cycle time in nanoseconds (integer)
  • MMIN: minimum main memory in kilobytes (integer)
  • MMAX: maximum main memory in kilobytes (integer)
  • CACH: cache memory in kilobytes (integer)
  • CHMIN: minimum channels in units (integer)
  • CHMAX: maximum channels in units (integer)
  • PRP: published relative performance (integer, target variable)

Original source: UCI machine learning repository. Source: collection of regression datasets by Luis Torgo (ltorgo@ncc.up.pt)

Characteristics: 209 cases; 6 continuous variables.

data = fetch_openml('machine_cpu', as_frame=True)
target_col = data.target.name
plot(data.frame, target_col=target_col, type_hints={target_col: 'continuous'}, find_scatter_categoricals=True)

Target looks like regression
[<AxesSubplot: title={'center': 'Target distribution'}, xlabel='class', ylabel='frequency'>,
 array([[<AxesSubplot: title={'center': 'F=-6.92E-01'}, xlabel='MYCT', ylabel='class'>]],
       dtype=object),
 array([[<AxesSubplot: title={'center': 'F=7.38E-01'}, xlabel='class', ylabel='MMIN'>,
         <AxesSubplot: title={'center': 'F=5.18E-01'}, xlabel='class', ylabel='MMAX'>,
         <AxesSubplot: title={'center': 'F=5.18E-01'}, xlabel='class', ylabel='CACH'>,
         <AxesSubplot: title={'center': 'F=4.04E-01'}, xlabel='class', ylabel='CHMIN'>,
         <AxesSubplot: title={'center': 'F=3.41E-01'}, xlabel='class', ylabel='CHMAX'>]],
       dtype=object)]
machine_cpu target and continuous feature vs target
machine_cpu categorical feature vs target

Bank-Customers Simulations

OpenML: bank32nh

This is a family of datasets synthetically generated from a simulation of how bank customers choose their banks; the tasks are based on predicting the fraction of customers who leave the bank because of full queues. The bank family of datasets is generated from a simplistic simulator of the queues in a series of banks, constructed with the explicit purpose of generating a family of datasets for DELVE. Customers come from several residential areas, choose their preferred bank depending on distances, and have tasks of varying complexity and various levels of patience. Each bank has several queues that open and close according to demand. The tellers have varying effectiveness, and customers may change queue if their patience expires. In the rej prototasks, the objective is to predict the rate of rejections, i.e. the fraction of customers that are turned away from the bank because all the open tellers have full queues. Source: collection of regression datasets by Luis Torgo (ltorgo@ncc.up.pt).

data = fetch_openml('bank32nh', as_frame=True)
target_col = data.target.name
plot(data.frame, target_col=target_col, type_hints={target_col: 'continuous'}, find_scatter_categoricals=True)

Target looks like regression
[<AxesSubplot: title={'center': 'Target distribution'}, xlabel='rej', ylabel='frequency'>,
 array([[<AxesSubplot: title={'center': 'F=5.01E-01'}, xlabel='a2pop', ylabel='rej'>,
         <AxesSubplot: title={'center': 'F=3.21E-01'}, xlabel='a1pop'>,
         <AxesSubplot: title={'center': 'F=3.20E-01'}, xlabel='a3pop'>,
         <AxesSubplot: title={'center': 'F=-9.23E-02'}, xlabel='b1eff'>,
         <AxesSubplot: title={'center': 'F=-6.77E-02'}, xlabel='temp'>],
        [<AxesSubplot: title={'center': 'F=-3.88E-02'}, xlabel='a2sy', ylabel='rej'>,
         <AxesSubplot: title={'center': 'F=-3.74E-02'}, xlabel='b1call (jittered)'>,
         <AxesSubplot: title={'center': 'F=-3.45E-02'}, xlabel='a2sx'>,
         <AxesSubplot: title={'center': 'F=-2.49E-02'}, xlabel='b2y'>,
         <AxesSubplot: title={'center': 'F=2.23E-02'}, xlabel='a3cy'>]],
       dtype=object),
 array([[<AxesSubplot: title={'center': 'F=2.78E-02'}, xlabel='rej', ylabel='mxql'>]],
       dtype=object)]
bank32nh target distribution
bank32nh continuous features
bank32nh boxplot

BNG(autoPrice) Data

OpenML: BNG(autoPrice)

data = fetch_openml('autoPrice', as_frame=True)
target_col = data.target.name
plot(data.frame, target_col=target_col, type_hints={target_col: 'continuous'}, find_scatter_categoricals=True)

Target looks like regression
[<AxesSubplot: title={'center': 'Target distribution'}, xlabel='class', ylabel='frequency'>,
 array([[<AxesSubplot: title={'center': 'F=9.26E-01'}, xlabel='curb-weight', ylabel='class'>,
         <AxesSubplot: title={'center': 'F=8.77E-01'}, xlabel='length'>,
         <AxesSubplot: title={'center': 'F=8.61E-01'}, xlabel='horsepower'>,
         <AxesSubplot: title={'center': 'F=-8.54E-01'}, xlabel='highway-mpg (jittered)'>],
        [<AxesSubplot: title={'center': 'F=8.39E-01'}, xlabel='width', ylabel='class'>,
         <AxesSubplot: title={'center': 'F=7.94E-01'}, xlabel='wheel-base'>,
         <AxesSubplot: title={'center': 'F=6.58E-01'}, xlabel='bore'>,
         <AxesSubplot: title={'center': 'F=3.64E-01'}, xlabel='height'>],
        [<AxesSubplot: title={'center': 'F=-2.35E-01'}, xlabel='symboling (jittered)', ylabel='class'>,
         <AxesSubplot: title={'center': 'F=-1.92E-01'}, xlabel='compression-ratio'>,
         <AxesSubplot: title={'center': 'F=1.68E-01'}, xlabel='normalized-losses'>,
         <AxesSubplot: title={'center': 'F=1.68E-01'}, xlabel='stroke'>]],
       dtype=object),
 array([[<AxesSubplot: title={'center': 'F=8.70E-01'}, xlabel='class', ylabel='engine-size'>,
         <AxesSubplot: title={'center': 'F=4.14E-01'}, xlabel='class', ylabel='peak-rpm'>]],
       dtype=object)]
autoPrice target distribution
autoPrice continuous feature
autoPrice categorical feature

Bangladesh-Rainfall

Let's analyze the rainfall in Bangladesh over the years and predict the amount of rainfall (in mm) in the future. This is a time-series dataset.

data = fetch_openml('rainfall_bangladesh', as_frame=True)
target_col = data.target.name
plot(data.frame, target_col=target_col, type_hints={target_col: 'continuous'}, find_scatter_categoricals=True)

Target looks like regression
[<AxesSubplot: title={'center': 'Target distribution'}, xlabel='Rainfall', ylabel='frequency'>,
 array([[<AxesSubplot: title={'center': 'F=1.67E-02'}, xlabel='Year (jittered)', ylabel='Rainfall'>]],
       dtype=object),
 array([[<AxesSubplot: title={'center': 'F=5.91E-01'}, xlabel='Rainfall', ylabel='Station'>,
         <AxesSubplot: title={'center': 'F=6.47E-02'}, xlabel='Rainfall', ylabel='Month'>]],
       dtype=object)]

Brazilian Houses

OpenML: Brazilian_houses

A dataset of houses to rent in different cities in Brazil.

This dataset contains 10962 houses to rent with 13 different features.

data = fetch_openml('Brazilian_houses', as_frame=True)
target_col = data.target.name
plot(data.frame, target_col=target_col, type_hints={target_col: 'continuous'}, find_scatter_categoricals=True)

Target looks like regression
[<AxesSubplot: title={'center': 'Target distribution'}, xlabel='total_(BRL)', ylabel='frequency'>,
 array([[<AxesSubplot: title={'center': 'F=9.69E-01'}, xlabel='rent_amount_(BRL)', ylabel='total_(BRL)'>,
         <AxesSubplot: title={'center': 'F=7.43E-01'}, xlabel='area'>,
         <AxesSubplot: title={'center': 'F=7.41E-01'}, xlabel='bathroom (jittered)'>,
         <AxesSubplot: title={'center': 'F=7.31E-01'}, xlabel='property_tax_(BRL)'>],
        [<AxesSubplot: title={'center': 'F=6.42E-01'}, xlabel='parking_spaces (jittered)', ylabel='total_(BRL)'>,
         <AxesSubplot: title={'center': 'F=6.23E-01'}, xlabel='rooms (jittered)'>,
         <AxesSubplot: title={'center': 'F=5.20E-01'}, xlabel='hoa_(BRL)'>,
         <AxesSubplot: >]], dtype=object),
 array([[<AxesSubplot: title={'center': 'F=1.18E-01'}, xlabel='total_(BRL)', ylabel='city'>,
         <AxesSubplot: title={'center': 'F=1.11E-01'}, xlabel='total_(BRL)', ylabel='floor'>,
         <AxesSubplot: title={'center': 'F=4.66E-02'}, xlabel='total_(BRL)', ylabel='animal'>,
         <AxesSubplot: title={'center': 'F=3.00E-02'}, xlabel='total_(BRL)', ylabel='furniture'>]],
       dtype=object)]
Brazilian_houses target distribution
Brazilian_houses  continuous and categorical features vs target

The 1000 Cameras Dataset

Data describing 1,000 cameras through 13 properties.

The 13 properties of each camera:

  • Model
  • Release date
  • Max resolution
  • Low resolution
  • Effective pixels
  • Zoom wide (W)
  • Zoom tele (T)
  • Normal focus range
  • Macro focus range
  • Storage included
  • Weight (inc. batteries)
  • Dimensions
  • Price

The original source can be found here.

data = fetch_openml('1000-Cameras-Dataset', as_frame=True)
target_col = data.target.name
plot(data.frame, target_col=target_col, type_hints={target_col: 'continuous'}, find_scatter_categoricals=True)

Target looks like regression
[<AxesSubplot: title={'center': 'Target distribution'}, xlabel='Price', ylabel='frequency'>,
 array([[<AxesSubplot: title={'center': 'F=-1.69E-01'}, xlabel='Storage_included', ylabel='Price'>,
         <AxesSubplot: title={'center': 'F=-1.49E-01'}, xlabel='Zoom_wide_(W) (jittered)'>,
         <AxesSubplot: title={'center': 'F=1.46E-01'}, xlabel='Low_resolution'>],
        [<AxesSubplot: title={'center': 'F=1.45E-01'}, xlabel='Dimensions', ylabel='Price'>,
         <AxesSubplot: title={'center': 'F=-1.39E-01'}, xlabel='Zoom_tele_(T)'>,
         <AxesSubplot: title={'center': 'F=1.34E-01'}, xlabel='Weight_(inc._batteries)'>],
        [<AxesSubplot: title={'center': 'F=-1.29E-01'}, xlabel='Macro_focus_range', ylabel='Price'>,
         <AxesSubplot: title={'center': 'F=9.34E-02'}, xlabel='Max_resolution'>,
         <AxesSubplot: title={'center': 'F=-1.22E-02'}, xlabel='Release_date (jittered)'>]],
       dtype=object),
 array([[<AxesSubplot: title={'center': 'F=5.70E-01'}, xlabel='Price', ylabel='Normal_focus_range'>]],
       dtype=object)]
1000-Cameras-Dataset target distribution
1000-Cameras-Dataset continuous feature vs target
1000-Cameras-Dataset categorical feature vs target

Summary

  • This case study represents a comprehensive test of the Dabl API for both classification and regression.
  • Dabl is an open-source library created by Andreas Mueller. It makes supervised machine learning more accessible for beginners and reduces the boilerplate of common machine learning tasks. Dabl takes inspiration from scikit-learn and auto-sklearn.
  • Dabl determines whether the target is categorical or continuous, plots the target distribution, and then calls the relevant plotting functions accordingly.
  • Dabl UI provides us with lots of information about what is happening in the different data columns.
  • The SimpleClassifier does all the supervised ML work for us. It implements the familiar scikit-learn API of fit and predict.
  • The real strength of Dabl is in providing simple interfaces for EDA and ML. 
  • In this article, we used Dabl for data pre-processing, visualisation and analysis as well as ML model development. 
  • Dabl offers ways of automating processes that otherwise take a lot of time and effort; faster data processing leads to faster model development and prototyping. Using Dabl not only makes data wrangling easier but also more efficient. The dabl documentation indicates that some useful features are still to come, including model explainers and tools for enhanced model building.
  • In this study, Dabl has been tested on 32 datasets, most of them from OpenML, covering widely varying business applications.
  • Results show that Dabl can be useful for various problems involving regression and classification of small and highly heterogeneous data sets.

Explore More

