Dabl, short for Data Analysis Baseline Library, is a high-level data exploration library in Python that can be used to automate many of the repetitive data wrangling tasks in the early stages of supervised machine learning (ML) model development. It is an open-source library, developed and maintained by Andreas Mueller and the scikit-learn community.
In this post, we will use Dabl for data pre-processing, advanced integrated visualisation, exploratory data analysis (EDA), and ML model development.
Table of Contents
- The Digits Classification Dataset
- HAR EDA
- The Mosaic Plot
- The Diamond Regression Dataset
- Age-Gender Histograms
- The Mfeat-Factors Dataset
- The Ames Housing Dataset
- The Adult Census Dataset
- The Wine Dataset
- The Titanic Dataset
- Australian Wildfires
- Mice Protein Expression Classification
- Eucalyptus Soil Conservation
- The ISOLET Dataset
- PC3 Software Defect Prediction
- Wall Robot Navigation
- Gesture Phase Segmentation Dataset
- DNA Sequence Dataset
- Bank Note Authentication – Classification
- Balance-Scale Analysis
- Churn Prediction
- Bank Marketing
- Plasma Retinol and Beta-Carotene
- The Lowbwt dataset
- The cps_85_wages Dataset
- Soil Compositions
- Relative CPU Performance Data
- Bank-Customers Simulations
- BNG(autoPrice) Data
- Bangladesh-Rainfall
- Brazilian Houses
- The 1000 Cameras Dataset
- Summary
- Explore More
First, let’s install dabl
!pip install dabl
and set the working directory DIR
import os
os.chdir('DIR')
os.getcwd()
The Digits Classification Dataset
Let’s run dabl.SimpleClassifier() as follows
import dabl
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_digits
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
sc = dabl.SimpleClassifier().fit(X_train, y_train)
Running DummyClassifier()
accuracy: 0.106 recall_macro: 0.100 precision_macro: 0.011 f1_macro: 0.019
=== new best DummyClassifier() (using recall_macro):
accuracy: 0.106 recall_macro: 0.100 precision_macro: 0.011 f1_macro: 0.019
Running GaussianNB()
accuracy: 0.835 recall_macro: 0.837 precision_macro: 0.855 f1_macro: 0.833
=== new best GaussianNB() (using recall_macro):
accuracy: 0.835 recall_macro: 0.837 precision_macro: 0.855 f1_macro: 0.833
Running MultinomialNB()
accuracy: 0.901 recall_macro: 0.902 precision_macro: 0.910 f1_macro: 0.901
=== new best MultinomialNB() (using recall_macro):
accuracy: 0.901 recall_macro: 0.902 precision_macro: 0.910 f1_macro: 0.901
Running DecisionTreeClassifier(class_weight='balanced', max_depth=1)
accuracy: 0.196 recall_macro: 0.199 precision_macro: 0.076 f1_macro: 0.099
Running DecisionTreeClassifier(class_weight='balanced', max_depth=10)
accuracy: 0.829 recall_macro: 0.830 precision_macro: 0.835 f1_macro: 0.829
Running DecisionTreeClassifier(class_weight='balanced', min_impurity_decrease=0.01)
accuracy: 0.780 recall_macro: 0.781 precision_macro: 0.794 f1_macro: 0.781
Running LogisticRegression(C=0.1, class_weight='balanced', max_iter=1000)
accuracy: 0.959 recall_macro: 0.959 precision_macro: 0.963 f1_macro: 0.960
=== new best LogisticRegression(C=0.1, class_weight='balanced', max_iter=1000) (using recall_macro):
accuracy: 0.959 recall_macro: 0.959 precision_macro: 0.963 f1_macro: 0.960
Running LogisticRegression(C=1, class_weight='balanced', max_iter=1000)
accuracy: 0.962 recall_macro: 0.962 precision_macro: 0.965 f1_macro: 0.963
=== new best LogisticRegression(C=1, class_weight='balanced', max_iter=1000) (using recall_macro):
accuracy: 0.962 recall_macro: 0.962 precision_macro: 0.965 f1_macro: 0.963

Best model:
LogisticRegression(C=1, class_weight='balanced', max_iter=1000)
Best Scores:
accuracy: 0.962 recall_macro: 0.962 precision_macro: 0.965 f1_macro: 0.963
print("Accuracy score", sc.score(X_test, y_test))
Accuracy score 0.98
Let’s load the digits data again and plot it as a sequence of 8×8 images
from sklearn.datasets import load_digits
digits = load_digits()
from matplotlib import pyplot as plt
fig = plt.figure(figsize=(6, 6)) # figure size in inches
fig.subplots_adjust(left=0, right=1, bottom=0, top=1, hspace=0.05, wspace=0.05)
for i in range(64):
    ax = fig.add_subplot(8, 8, i + 1, xticks=[], yticks=[])
    ax.imshow(digits.images[i], cmap=plt.cm.binary, interpolation='nearest')
    # label the image with the target value
    ax.text(0, 7, str(digits.target[i]))

Let’s plot a projection onto the first two principal axes
plt.figure()
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
proj = pca.fit_transform(digits.data)
plt.scatter(proj[:, 0], proj[:, 1], c=digits.target, cmap="Paired")
plt.colorbar()

Let’s classify with the Gaussian Naive Bayes classifier
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(digits.data, digits.target)
clf = GaussianNB()
clf.fit(X_train, y_train)
predicted = clf.predict(X_test)
expected = y_test
fig = plt.figure(figsize=(6, 6)) # figure size in inches
fig.subplots_adjust(left=0, right=1, bottom=0, top=1, hspace=0.05, wspace=0.05)
for i in range(64):
    ax = fig.add_subplot(8, 8, i + 1, xticks=[], yticks=[])
    ax.imshow(X_test.reshape(-1, 8, 8)[i], cmap=plt.cm.binary,
              interpolation='nearest')
    # label the image with the predicted value (green = correct, red = wrong)
    if predicted[i] == expected[i]:
        ax.text(0, 7, str(predicted[i]), color='green')
    else:
        ax.text(0, 7, str(predicted[i]), color='red')

Let’s print the classification report
from sklearn import metrics
print(metrics.classification_report(expected, predicted))
              precision    recall  f1-score   support

           0       1.00      0.98      0.99        59
           1       0.86      0.80      0.83        45
           2       0.94      0.65      0.77        51
           3       0.92      0.82      0.87        44
           4       1.00      0.82      0.90        39
           5       0.85      0.94      0.89        36
           6       0.88      0.98      0.93        45
           7       0.79      0.95      0.86        43
           8       0.53      0.86      0.66        37
           9       0.88      0.73      0.80        51

    accuracy                           0.85       450
   macro avg       0.87      0.85      0.85       450
weighted avg       0.88      0.85      0.85       450
Let’s print the normalized confusion matrix
from sklearn.metrics import confusion_matrix
import seaborn as sns
import numpy as np
target_names = ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9']
cm = confusion_matrix(expected, predicted)
cmn = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
fig, ax = plt.subplots(figsize=(10,10))
sns.heatmap(cmn, annot=True, fmt='.2f', xticklabels=target_names, yticklabels=target_names)
plt.ylabel('Actual')
plt.xlabel('Predicted')
plt.show(block=False)

Let’s look at the ML performance using scikitplot
import scikitplot as skplt
import sklearn
from sklearn.datasets import load_digits, load_boston, load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor, GradientBoostingClassifier, ExtraTreesClassifier
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
import sys
import warnings
warnings.filterwarnings("ignore")
print("Scikit Plot Version : ", skplt.__version__)
print("Scikit Learn Version : ", sklearn.__version__)
print("Python Version : ", sys.version)
%matplotlib inline
Scikit Plot Version : 0.3.7 Scikit Learn Version : 1.1.3 Python Version : 3.9.16 (main, Jan 11 2023, 16:16:36) [MSC v.1916 64 bit (AMD64)]
skplt.estimators.plot_learning_curve(clf, X_train, y_train,
                                     cv=7, shuffle=True, scoring="accuracy",
                                     n_jobs=-1, figsize=(6,4), title_fontsize="large", text_fontsize="large",
                                     title="GaussianNB() Digits Classification Learning Curve");

Let’s compare it against Logistic Regression
lr = LogisticRegression(C=1, class_weight='balanced', max_iter=1000)
skplt.estimators.plot_learning_curve(lr, X_train, y_train,
                                     cv=7, shuffle=True, scoring="accuracy",
                                     n_jobs=-1, figsize=(6,4), title_fontsize="large", text_fontsize="large",
                                     title="Logistic Regression Digits Classification Learning Curve");

Let’s plot the ROC curve
Y_test_probs = clf.predict_proba(X_test)
skplt.metrics.plot_roc_curve(y_test, Y_test_probs,
                             title="GaussianNB() Digits ROC Curve", figsize=(12,6));

Let’s plot the Precision-Recall Curve
skplt.metrics.plot_precision_recall_curve(y_test, Y_test_probs,
                                          title="GaussianNB() Digits Precision-Recall Curve", figsize=(12,6));

Let’s plot the elbow plot
skplt.cluster.plot_elbow_curve(KMeans(random_state=1),
X_train,
cluster_ranges=range(2, 20),
figsize=(8,6));

Let’s perform the KMeans silhouette analysis
kmeans = KMeans(n_clusters=10, random_state=1)
kmeans.fit(X_train, y_train)
cluster_labels = kmeans.predict(X_test)
skplt.metrics.plot_silhouette(X_test, cluster_labels,
figsize=(8,6));

Let’s plot the pca_component_variance
pca = PCA(random_state=1)
pca.fit(X_train)
skplt.decomposition.plot_pca_component_variance(pca, figsize=(8,6));

Let’s render the classification report as an image
from yellowbrick.classifier import ClassificationReport
viz = ClassificationReport(clf,
classes=target_names,
support=True,
fig=plt.figure(figsize=(8,6)))
viz.fit(X_train, y_train)
viz.score(X_test, y_test)
viz.show();

Let’s plot the class prediction error for GaussianNB
from yellowbrick.classifier import ClassPredictionError
viz = ClassPredictionError(clf,
classes=target_names,
fig=plt.figure(figsize=(9,6)))
viz.fit(X_train, y_train)
viz.score(X_test, y_test)
viz.show();

HAR EDA
Let’s fetch available datasets from OpenML by name. Examples of using sklearn.datasets.fetch_openml
can be found here.
Human Activity Recognition (HAR) is the problem of classifying sequences of accelerometer data recorded by specialized harnesses or smartphones into known well-defined movements.
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_openml
from dabl import plot
X, y = fetch_openml('har', as_frame=True, return_X_y=True)
plot(X, y)
plt.show()
Target looks like classification
Showing only top 10 of 561 continuous features
Linear Discriminant Analysis training set score: 0.984



The Mosaic Plot
This is a nice illustration of the mosaic plot:
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_openml
from dabl import plot
X, y = fetch_openml('splice', as_frame=True, return_X_y=True)
plot(X, y)
plt.show()
Target looks like classification
Showing only top 10 of 60 categorical features


A mosaic plot (also known as a Marimekko diagram) is essentially a graphical counterpart of the pd.crosstab() function in Python. Crosstab gives us a plain table of counts, whereas a mosaic plot turns the same table into a diagram of proportional tiles that we can use in a data analysis report.
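To make the connection concrete, here is a minimal sketch (my own illustration, not part of dabl) that builds the crosstab and the corresponding mosaic plot for the Titanic data used later in this post; it assumes statsmodels is installed for its mosaic() function.

# A minimal sketch (not dabl's implementation): the same two-way table shown
# first as counts via pd.crosstab, then as a mosaic plot via statsmodels.
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.graphics.mosaicplot import mosaic
import dabl
titanic = pd.read_csv(dabl.datasets.data_path("titanic.csv"))
# plain table of counts per (sex, survived) combination
print(pd.crosstab(titanic["sex"], titanic["survived"]))
# the same counts drawn as proportionally sized rectangles
mosaic(titanic, ["sex", "survived"])
plt.show()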
The Diamond Regression Dataset
Let’s apply regression analysis to predict diamond prices based on cut, color, and clarity.
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_openml
from dabl import plot
X, y = fetch_openml('diamonds', as_frame=True, return_X_y=True)
plot(X, y)
plt.show()
Target looks like regression


Age-Gender Histograms
How to Visualize Age/Sex Patterns with dabl?
Let’s compare the histograms of age per gender
import matplotlib.pyplot as plt
from dabl.datasets import load_adult
from dabl.plot import class_hists
data = load_adult()
class_hists(data, "age", "gender", legend=True)
plt.show()

A histogram is similar in appearance to a bar chart, but instead of comparing categories or looking for trends over time, it shows how the values of a single variable are distributed: each bar covers a continuous range of values (a bin), and its height is the number of observations falling into that range.
Histograms are useful for showing the distribution of a single scale variable. Data are binned and summarized using a count or percentage statistic.
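For comparison, here is a rough hand-rolled equivalent of the class_hists() call above, built directly with pandas and matplotlib rather than dabl; the 'age' and 'gender' column names come from dabl's adult dataset, and everything else is purely illustrative.

# A rough equivalent of dabl.plot.class_hists (not dabl's implementation):
# bin 'age' once, then draw one overlaid histogram per 'gender' value.
import numpy as np
import matplotlib.pyplot as plt
from dabl.datasets import load_adult
data = load_adult()
bins = np.histogram_bin_edges(data["age"], bins=30)   # shared bins for both groups
for gender, group in data.groupby("gender"):
    plt.hist(group["age"], bins=bins, alpha=0.5, label=str(gender))
plt.xlabel("age")
plt.ylabel("count")
plt.legend()
plt.show()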
The Mfeat-Factors Dataset
The mfeat-factors (Multiple Features Dataset: Factors) is one of a set of 6 datasets describing features of handwritten numerals (0 – 9) extracted from a collection of Dutch utility maps. Corresponding patterns in different datasets correspond to the same original character. 200 instances per class (for a total of 2,000 instances) have been digitized in binary images.
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_openml
from dabl import plot
X, y = fetch_openml('mfeat-factors', as_frame=True, return_X_y=True)
plot(X, y)
plt.show()
Target looks like classification
Showing only top 10 of 216 continuous features
Linear Discriminant Analysis training set score: 0.993



The Ames Housing Dataset
Let’s look at the “Ames housing” dataset. This dataset is similar to the “California housing” dataset. However, it is more complex to handle: it contains missing data and both numerical and categorical features.
from dabl import plot
from dabl.datasets import load_ames
import matplotlib.pyplot as plt
data = load_ames()
plot(data, 'SalePrice')
plt.show()
Target looks like regression
Showing only top 10 of 41 categorical features


The Adult Census Dataset
The dataset is a collection of information about individual persons. The prediction task is to predict whether a person earns more or less than $50k per year.
from dabl import plot
from dabl.datasets import load_adult
import matplotlib.pyplot as plt
data = load_adult()
plot(data, 'income', scatter_alpha=.1)
plt.show()
Target looks like classification
Linear Discriminant Analysis training set score: 0.530





data_clean = dabl.clean(data)
model = dabl.SimpleClassifier(random_state=0)
X = data_clean.drop("income", axis=1)
y = data_clean.income
model.fit(X, y)
Running DummyClassifier(random_state=0)
accuracy: 0.759 average_precision: 0.241 roc_auc: 0.500 recall_macro: 0.500 f1_macro: 0.432
=== new best DummyClassifier(random_state=0) (using recall_macro):
accuracy: 0.759 average_precision: 0.241 roc_auc: 0.500 recall_macro: 0.500 f1_macro: 0.432
Running GaussianNB()
accuracy: 0.408 average_precision: 0.296 roc_auc: 0.619 recall_macro: 0.595 f1_macro: 0.407
=== new best GaussianNB() (using recall_macro):
accuracy: 0.408 average_precision: 0.296 roc_auc: 0.619 recall_macro: 0.595 f1_macro: 0.407
Running MultinomialNB()
accuracy: 0.814 average_precision: 0.694 roc_auc: 0.881 recall_macro: 0.776 f1_macro: 0.760
=== new best MultinomialNB() (using recall_macro):
accuracy: 0.814 average_precision: 0.694 roc_auc: 0.881 recall_macro: 0.776 f1_macro: 0.760
Running DecisionTreeClassifier(class_weight='balanced', max_depth=1, random_state=0)
accuracy: 0.710 average_precision: 0.417 roc_auc: 0.759 recall_macro: 0.759 f1_macro: 0.682
Running DecisionTreeClassifier(class_weight='balanced', max_depth=5, random_state=0)
accuracy: 0.729 average_precision: 0.673 roc_auc: 0.870 recall_macro: 0.784 f1_macro: 0.702
=== new best DecisionTreeClassifier(class_weight='balanced', max_depth=5, random_state=0) (using recall_macro):
accuracy: 0.729 average_precision: 0.673 roc_auc: 0.870 recall_macro: 0.784 f1_macro: 0.702
Running DecisionTreeClassifier(class_weight='balanced', min_impurity_decrease=0.01, random_state=0)
accuracy: 0.718 average_precision: 0.536 roc_auc: 0.810 recall_macro: 0.779 f1_macro: 0.693
Running LogisticRegression(C=0.1, class_weight='balanced', max_iter=1000, random_state=0)
accuracy: 0.806 average_precision: 0.759 roc_auc: 0.904 recall_macro: 0.819 f1_macro: 0.769
=== new best LogisticRegression(C=0.1, class_weight='balanced', max_iter=1000, random_state=0) (using recall_macro):
accuracy: 0.806 average_precision: 0.759 roc_auc: 0.904 recall_macro: 0.819 f1_macro: 0.769
Running LogisticRegression(C=1, class_weight='balanced', max_iter=1000, random_state=0)
accuracy: 0.808 average_precision: 0.760 roc_auc: 0.905 recall_macro: 0.820 f1_macro: 0.771
=== new best LogisticRegression(C=1, class_weight='balanced', max_iter=1000, random_state=0) (using recall_macro):
accuracy: 0.808 average_precision: 0.760 roc_auc: 0.905 recall_macro: 0.820 f1_macro: 0.771

Best model:
LogisticRegression(C=1, class_weight='balanced', max_iter=1000, random_state=0)
Best Scores:
accuracy: 0.808 average_precision: 0.760 roc_auc: 0.905 recall_macro: 0.820 f1_macro: 0.771
SimpleClassifier(random_state=0)
dabl.explain(model)

The Wine Dataset
Let’s load the wine dataset (classification).
import matplotlib.pyplot as plt
from sklearn.datasets import load_wine
from dabl import plot
from dabl.utils import data_df_from_bunch
wine_bunch = load_wine()
wine_df = data_df_from_bunch(wine_bunch)
plot(wine_df, 'target')
plt.show()
Target looks like classification
Linear Discriminant Analysis training set score: 1.000




The Titanic Dataset
Let’s look at the classic Titanic dataset, otherwise known as the course material for Kaggle 101.
import dabl
import pandas as pd
import matplotlib.pyplot as plt
titanic = pd.read_csv(dabl.datasets.data_path("titanic.csv"))
titanic.shape
(1309, 14)
titanic.head()

   pclass  survived                                             name     sex     age  sibsp  parch  ticket      fare    cabin embarked boat body                        home.dest
0       1         1                    Allen, Miss. Elisabeth Walton  female      29      0      0   24160  211.3375       B5        S    2    ?                     St Louis, MO
1       1         1                   Allison, Master. Hudson Trevor    male  0.9167      1      2  113781    151.55  C22 C26        S   11    ?  Montreal, PQ / Chesterville, ON
2       1         0                     Allison, Miss. Helen Loraine  female       2      1      2  113781    151.55  C22 C26        S    ?    ?  Montreal, PQ / Chesterville, ON
3       1         0             Allison, Mr. Hudson Joshua Creighton    male      30      1      2  113781    151.55  C22 C26        S    ?  135  Montreal, PQ / Chesterville, ON
4       1         0  Allison, Mrs. Hudson J C (Bessie Waldo Daniels)  female      25      1      2  113781    151.55  C22 C26        S    ?    ?  Montreal, PQ / Chesterville, ON
titanic_clean = dabl.clean(titanic, verbose=1)
types = dabl.detect_types(titanic_clean)
print(types)
dabl.plot(titanic, 'survived')
plt.show()
Detected feature types:
continuous                  0
dirty_float                 3
low_card_int_ordinal        2
low_card_int_categorical    0
categorical                 5
date                        0
free_string                 4
useless                     0
dtype: int64

                       continuous  dirty_float  low_card_int_ordinal
pclass                      False        False                 False
survived                    False        False                 False
name                        False        False                 False
sex                         False        False                 False
sibsp                       False        False                  True
parch                       False        False                  True
ticket                      False        False                 False
cabin                       False        False                 False
embarked                    False        False                 False
boat                        False        False                 False
home.dest                   False        False                 False
age_?                       False        False                 False
age_dabl_continuous          True        False                 False
fare_?                      False        False                 False
fare_dabl_continuous         True        False                 False
body_?                      False        False                 False
body_dabl_continuous         True        False                 False

                       low_card_int_categorical  categorical   date
pclass                                    False         True  False
survived                                  False         True  False
name                                      False        False  False
sex                                       False         True  False
sibsp                                     False        False  False
parch                                     False        False  False
ticket                                    False        False  False
cabin                                     False        False  False
embarked                                  False         True  False
boat                                      False         True  False
home.dest                                 False        False  False
age_?                                     False         True  False
age_dabl_continuous                       False        False  False
fare_?                                    False        False  False
fare_dabl_continuous                      False        False  False
body_?                                    False         True  False
body_dabl_continuous

                       free_string  useless
pclass                       False    False
survived                     False    False
name                          True    False
sex                          False    False
sibsp                        False    False
parch                        False    False
ticket                        True    False
cabin                         True    False
embarked                     False    False
boat                         False    False
home.dest                     True    False
age_?                        False    False
age_dabl_continuous          False    False
fare_?                       False     True
fare_dabl_continuous         False    False
body_?                       False    False
body_dabl_continuous         False    False

Target looks like classification
Linear Discriminant Analysis training set score: 0.578
Target Distribution

Continuous Features Pairplot



Linear Discriminant

Categorical features vs Target


fc = dabl.SimpleClassifier(random_state=0)
X = titanic_clean.drop("survived", axis=1)
y = titanic_clean.survived
fc.fit(X, y)
Running DummyClassifier(random_state=0)
accuracy: 0.618 average_precision: 0.382 roc_auc: 0.500 recall_macro: 0.500 f1_macro: 0.382
=== new best DummyClassifier(random_state=0) (using recall_macro):
accuracy: 0.618 average_precision: 0.382 roc_auc: 0.500 recall_macro: 0.500 f1_macro: 0.382
Running GaussianNB()
accuracy: 0.970 average_precision: 0.979 roc_auc: 0.986 recall_macro: 0.963 f1_macro: 0.968
=== new best GaussianNB() (using recall_macro):
accuracy: 0.970 average_precision: 0.979 roc_auc: 0.986 recall_macro: 0.963 f1_macro: 0.968
Running MultinomialNB()
accuracy: 0.968 average_precision: 0.982 roc_auc: 0.984 recall_macro: 0.961 f1_macro: 0.966
Running DecisionTreeClassifier(class_weight='balanced', max_depth=1, random_state=0)
accuracy: 0.976 average_precision: 0.954 roc_auc: 0.971 recall_macro: 0.971 f1_macro: 0.974
=== new best DecisionTreeClassifier(class_weight='balanced', max_depth=1, random_state=0) (using recall_macro):
accuracy: 0.976 average_precision: 0.954 roc_auc: 0.971 recall_macro: 0.971 f1_macro: 0.974
Running DecisionTreeClassifier(class_weight='balanced', max_depth=5, random_state=0)
accuracy: 0.969 average_precision: 0.965 roc_auc: 0.983 recall_macro: 0.965 f1_macro: 0.967
Running DecisionTreeClassifier(class_weight='balanced', min_impurity_decrease=0.01, random_state=0)
accuracy: 0.976 average_precision: 0.954 roc_auc: 0.971 recall_macro: 0.971 f1_macro: 0.974
Running LogisticRegression(C=0.1, class_weight='balanced', max_iter=1000, random_state=0)
accuracy: 0.975 average_precision: 0.991 roc_auc: 0.993 recall_macro: 0.971 f1_macro: 0.973
Running LogisticRegression(C=1, class_weight='balanced', max_iter=1000, random_state=0)
accuracy: 0.976 average_precision: 0.991 roc_auc: 0.994 recall_macro: 0.971 f1_macro: 0.974

Best model:
DecisionTreeClassifier(class_weight='balanced', max_depth=1, random_state=0)
Best Scores:
accuracy: 0.976 average_precision: 0.954 roc_auc: 0.971 recall_macro: 0.971 f1_macro: 0.974
SimpleClassifier(random_state=0)
Australian Wildfires
Let’s look at the Kaggle Australian wildfire dataset.
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
data = pd.read_csv('fire_archive_V1_96617.csv')
data.columns
Index(['latitude', 'longitude', 'bright_ti4', 'scan', 'track', 'acq_date', 'acq_time', 'satellite', 'instrument', 'confidence', 'version', 'bright_ti5', 'frp', 'type'], dtype='object')
data.type.value_counts()
0    180150
3      2735
2      1893
Name: type, dtype: int64
import dabl
dabl.plot(data, target_col='type', type_hints={'type': 'categorical'})
Target looks like classification
Linear Discriminant Analysis training set score: 0.355





Mice Protein Expression Classification
Let’s look at the Kaggle Mice Protein Expression classification dataset.
from dabl import plot
from sklearn.datasets import fetch_openml
data = fetch_openml('MiceProtein', as_frame=True)
print(data.frame.shape)
(1080, 78)
target_col = data.target.name
plot(data.frame, target_col=target_col, type_hints={target_col: 'categorical'})
Target looks like classification
Showing only top 10 of 77 continuous features
Linear Discriminant Analysis training set score: 0.992




Eucalyptus Soil Conservation
Let’s examine the Eucalyptus Soil Conservation dataset.
data = fetch_openml('eucalyptus', as_frame=True)
target_col = data.target.name
plot(data.frame, target_col=target_col, type_hints={target_col: 'categorical'})
Output:
Target looks like classification
Linear Discriminant Analysis training set score: 0.517





The ISOLET Dataset
The ISOLET (Isolated Letter Speech Recognition) dataset was generated as follows: 150 subjects spoke the name of each letter of the alphabet twice. Hence, there are 52 training examples from each speaker. The speakers are grouped into sets of 30 speakers each, 4 groups can serve as training set, the last group as the test set. A total of 3 examples are missing, the authors dropped them due to difficulties in recording.
This is a good domain for a noisy, perceptual task. It is also a very good domain for testing the scaling abilities of algorithms.
data = fetch_openml('isolet', as_frame=True)
target_col = data.target.name
plot(data.frame, target_col=target_col, type_hints={target_col: 'categorical'})
Target looks like classification
Showing only top 10 of 613 continuous features
Linear Discriminant Analysis training set score: 0.969





PC3 Software Defect Prediction
One of the NASA Metrics Data Program defect data sets. Data from flight software for earth orbiting satellite. Data comes from McCabe and Halstead features extractors of source code. These features were defined in the 70s in an attempt to objectively characterize code features that are associated with software quality.
data = fetch_openml('pc3', as_frame=True)
target_col = data.target.name
plot(data.frame, target_col=target_col, type_hints={target_col: 'categorical'})
Linear Discriminant Analysis training set score: 0.631





Wall Robot Navigation
The Wall-Following Robot Navigation Data Set
The data were collected as the SCITOS G5 robot navigates through the room following the wall in a clockwise direction, for 4 rounds, using 24 ultrasound sensors arranged circularly around its ‘waist’.
The data consists of raw values of the measurements of all 24 ultrasound sensors and the corresponding class label. Sensor readings are sampled at a rate of 9 samples per second.
The class labels are:
- Move-Forward,
- Slight-Right-Turn,
- Sharp-Right-Turn,
- Slight-Left-Turn
It is worth mentioning that the 24 ultrasound readings and the simplified distances were collected at the same time step, so each file has the same number of rows (one for each sampling time step).
The wall-following task and data gathering were designed to test the hypothesis that this apparently simple navigation task is indeed a non-linearly separable classification task. Thus, linear classifiers, such as the Perceptron network, are not able to learn the task and command the robot around the room without collisions. Nonlinear neural classifiers, such as the MLP network, are able to learn the task and command the robot successfully without collisions.
data = fetch_openml('wall-robot-navigation', as_frame=True)
target_col = data.target.name
plot(data.frame, target_col=target_col, type_hints={target_col: 'categorical'})
Target looks like classification
Showing only top 10 of 24 continuous features
Linear Discriminant Analysis training set score: 0.603




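To probe the claim above that this navigation task is not linearly separable, here is a small side experiment (my own sketch, not part of the original study or of dabl) comparing a linear Perceptron with a small MLP on the same OpenML data; exact scores will vary with the split and hyperparameters.

# A small sketch (not from the original study): linear vs nonlinear classifier
# on the wall-following data fetched from OpenML.
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Perceptron
from sklearn.neural_network import MLPClassifier
X, y = fetch_openml('wall-robot-navigation', as_frame=True, return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
for name, clf in [("Perceptron (linear)", Perceptron(max_iter=1000)),
                  ("MLP (nonlinear)", MLPClassifier(hidden_layer_sizes=(50,), max_iter=500, random_state=0))]:
    model = make_pipeline(StandardScaler(), clf)   # scale the sensor readings, then classify
    model.fit(X_train, y_train)
    print(name, "test accuracy:", round(model.score(X_test, y_test), 3))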
Gesture Phase Segmentation Dataset
The dataset is composed of features extracted from 7 videos of people gesticulating, aimed at studying gesture phase segmentation.
Each video is represented by two files: a raw file, which contains the position of hands, wrists, head and spine of the user in each frame; and a processed file, which contains velocity and acceleration of hands and wrists. See the data set description for more information on the dataset.
data = fetch_openml('GesturePhaseSegmentationProcessed', as_frame=True)
target_col = data.target.name
plot(data.frame, target_col=target_col, type_hints={target_col: 'categorical'})
Target looks like classification
Showing only top 10 of 32 continuous features
Linear Discriminant Analysis training set score: 0.366



DNA Sequence Dataset
Let’s see how to interpret DNA sequence data and how machine learning algorithms can be used to build a prediction model on it.
data = fetch_openml('dna', as_frame=True)
target_col = data.target.name
plot(data.frame, target_col=target_col, type_hints={target_col: 'categorical'})
Target looks like classification
Showing only top 10 of 180 categorical features


This is a good application of mosaic plots.
Bank Note Authentication – Classification
Let’s look at the Bank Note Authentication UCI data
data = fetch_openml('banknote-authentication', as_frame=True)
target_col = data.target.name
plot(data.frame, target_col=target_col, type_hints={target_col: 'categorical'})
Target looks like classification
Linear Discriminant Analysis training set score: 0.979

Continuous features pairplot




This is a perfect application of LDA.
Balance-Scale Analysis
This data set was generated to model psychological experimental results. Each example is classified as having the balance scale tip to the right, tip to the left, or be balanced. The attributes are the left weight, the left distance, the right weight, and the right distance. The correct way to find the class is to compare (left-distance × left-weight) with (right-distance × right-weight): the greater product wins, and if they are equal the scale is balanced (a quick check of this rule follows the attribute list below).
Attribute Information:
- Class Name: 3 (L, B, R)
- Left-Weight: 5 (1, 2, 3, 4, 5)
- Left-Distance: 5 (1, 2, 3, 4, 5)
- Right-Weight: 5 (1, 2, 3, 4, 5)
- Right-Distance: 5 (1, 2, 3, 4, 5)
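Here is a small sketch of that labeling rule (my own check, not part of dabl), assuming the OpenML copy encodes the target with the L/B/R labels and the four attributes with the values 1–5 listed above; we then hand the data to dabl below.

# Torque on each side = distance * weight; the larger torque decides the class.
import numpy as np
from sklearn.datasets import fetch_openml
data = fetch_openml('balance-scale', as_frame=True)
frame = data.frame
left = frame['left-distance'].astype(float) * frame['left-weight'].astype(float)
right = frame['right-distance'].astype(float) * frame['right-weight'].astype(float)
derived = np.where(left > right, 'L', np.where(right > left, 'R', 'B'))
# fraction of rows where the rule reproduces the stored class label
print("rule matches target:", (derived == data.target.astype(str).values).mean())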
data = fetch_openml('balance-scale', as_frame=True)
target_col = data.target.name
plot(data.frame, target_col=target_col,
     type_hints={target_col: 'categorical', 'left-distance': 'continuous',
                 'right-distance': 'continuous', 'left-weight': 'continuous',
                 'right-weight': 'continuous'})
Target looks like classification
Linear Discriminant Analysis training set score: 0.638





This plot shows the LDA structure of the data.
Churn Prediction
Telco Customer Churn: predicting churn to support focused customer retention programs.
data = fetch_openml('churn', as_frame=True)
target_col = data.target.name
plot(data.frame, target_col=target_col, type_hints={target_col: 'categorical'})
Target looks like classification
Linear Discriminant Analysis training set score: 0.522





Bank Marketing
The Bank Marketing dataset is about predicting term deposit subscriptions. How can the financial institution achieve greater effectiveness in future marketing campaigns?
data = fetch_openml('bank-marketing', as_frame=True)
print(data.frame.shape)
(45211, 17)
target_col = data.target.name
plot(data.frame, target_col=target_col, type_hints={target_col: 'categorical'})





Plasma Retinol and Beta-Carotene
The Plasma Retinol and Beta-Carotene Dataset – Determinants of Plasma Retinol and Beta-Carotene Levels. This datafile contains 315 observations on 14 variables. This data set can be used to demonstrate multiple regression, transformations, categorical variables, outliers, pooled tests of significance and model building strategies.
data = fetch_openml('plasma_retinol', as_frame=True)
target_col = data.target.name
plot(data.frame, target_col=target_col, type_hints={target_col: 'continuous'}, find_scatter_categoricals=True)
Target looks like regression



The Lowbwt dataset
Let’s look at the low birth weight data. The goal of this study was to identify risk factors associated with giving birth to a low birth weight baby (weighing less than 2500 grams). Data were collected on 189 women, 59 of whom had low birth weight babies and 130 of whom had normal birth weight babies. Four variables thought to be of importance were age, weight of the subject at her last menstrual period, race, and the number of physician visits during the first trimester of pregnancy.
Columns, variables, and abbreviations:
- 2-4: Identification Code (ID)
- 10: Low Birth Weight; 0 = birth weight >= 2500 g, 1 = birth weight < 2500 g (LOW)
- 17-18: Age of the Mother in Years (AGE)
- 23-25: Weight in Pounds at the Last Menstrual Period (LWT)
- 32: Race; 1 = White, 2 = Black, 3 = Other (RACE)
- 40: Smoking Status During Pregnancy; 1 = Yes, 0 = No (SMOKE)
- 48: History of Premature Labor; 0 = None, 1 = One, etc. (PTL)
- 55: History of Hypertension; 1 = Yes, 0 = No (HT)
- 61: Presence of Uterine Irritability; 1 = Yes, 0 = No (UI)
- 67: Number of Physician Visits During the First Trimester; 0 = None, 1 = One, 2 = Two, etc. (FTV)
- 73-76: Birth Weight in Grams (BWT)
data = fetch_openml('lowbwt', as_frame=True)
target_col = data.target.name
plot(data.frame, target_col=target_col, type_hints={target_col: 'continuous'}, find_scatter_categoricals=True)
Target looks like regression
[<AxesSubplot: title={'center': 'Target distribution'}, xlabel='class', ylabel='frequency'>, array([[<AxesSubplot: title={'center': 'F=2.48E-01'}, xlabel='LWT', ylabel='class'>, <AxesSubplot: title={'center': 'F=6.11E-02'}, xlabel='AGE (jittered)'>]], dtype=object), array([[<AxesSubplot: title={'center': 'F=6.16E-01'}, xlabel='class', ylabel='LOW'>, <AxesSubplot: title={'center': 'F=1.31E-01'}, xlabel='class', ylabel='RACE'>, <AxesSubplot: title={'center': 'F=2.37E-02'}, xlabel='class', ylabel='SMOKE'>, <AxesSubplot: title={'center': 'F=2.10E-02'}, xlabel='class', ylabel='PTL'>], [<AxesSubplot: title={'center': 'F=1.31E-02'}, xlabel='class', ylabel='HT'>, <AxesSubplot: title={'center': 'F=1.22E-02'}, xlabel='class', ylabel='UI'>, <AxesSubplot: title={'center': 'F=2.92E-04'}, xlabel='class', ylabel='FTV'>, <AxesSubplot: >]], dtype=object)]



The cps_85_wages Dataset
Determinants of Wages from the 1985 Current Population Survey.
The Current Population Survey (CPS) is used to supplement census information between census years. These data consist of a random sample of 534 persons from the CPS, with information on wages and other characteristics of the workers, including sex, number of years of education, years of work experience, occupational status, region of residence and union membership.
data = fetch_openml('cps_85_wages', as_frame=True)
target_col = data.target.name
plot(data.frame, target_col=target_col, type_hints={target_col: 'continuous'}, find_scatter_categoricals=True)
Target looks like regression
[<AxesSubplot: title={'center': 'Target distribution'}, xlabel='WAGE', ylabel='frequency'>, array([[<AxesSubplot: title={'center': 'F=2.58E-01'}, xlabel='AGE', ylabel='WAGE'>]], dtype=object), array([[<AxesSubplot: title={'center': 'F=1.55E-01'}, xlabel='WAGE', ylabel='EDUCATION'>, <AxesSubplot: title={'center': 'F=8.58E-02'}, xlabel='WAGE', ylabel='SOUTH'>, <AxesSubplot: title={'center': 'F=4.08E-02'}, xlabel='WAGE', ylabel='SEX'>, <AxesSubplot: title={'center': 'F=2.44E-02'}, xlabel='WAGE', ylabel='UNION'>], [<AxesSubplot: title={'center': 'F=1.97E-02'}, xlabel='WAGE', ylabel='RACE'>, <AxesSubplot: title={'center': 'F=1.26E-02'}, xlabel='WAGE', ylabel='OCCUPATION'>, <AxesSubplot: title={'center': 'F=0.00E+00'}, xlabel='WAGE', ylabel='SECTOR'>, <AxesSubplot: title={'center': 'F=0.00E+00'}, xlabel='WAGE', ylabel='MARR'>]], dtype=object)]


Soil Compositions
Soil Compositions of Physical and Chemical Characteristics
data = fetch_openml('visualizing_soil', as_frame=True)
target_col = data.target.name
plot(data.frame, target_col=target_col, type_hints={target_col: 'continuous'}, find_scatter_categoricals=True)
Target looks like regression
[<AxesSubplot: title={'center': 'Target distribution'}, xlabel='track', ylabel='frequency'>, array([[<AxesSubplot: title={'center': 'F=6.77E-01'}, xlabel='northing', ylabel='track'>, <AxesSubplot: title={'center': 'F=1.13E-01'}, xlabel='easting'>, <AxesSubplot: title={'center': 'F=-6.80E-02'}, xlabel='resistivity'>]], dtype=object), array([[<AxesSubplot: title={'center': 'F=3.20E-01'}, xlabel='track', ylabel='isns'>]], dtype=object)]


Relative CPU Performance Data
The problem concerns Relative CPU Performance Data. More information can be obtained in the UCI Machine Learning Repository. The attributes used are:
- MYCT: machine cycle time in nanoseconds (integer)
- MMIN: minimum main memory in kilobytes (integer)
- MMAX: maximum main memory in kilobytes (integer)
- CACH: cache memory in kilobytes (integer)
- CHMIN: minimum channels in units (integer)
- CHMAX: maximum channels in units (integer)
- PRP: published relative performance (integer) (target variable)
Original source: UCI machine learning repository. Source: collection of regression datasets by Luis Torgo (ltorgo@ncc.up.pt)
Characteristics: 209 cases; 6 continuous variables.
data = fetch_openml('machine_cpu', as_frame=True)
target_col = data.target.name
plot(data.frame, target_col=target_col, type_hints={target_col: 'continuous'}, find_scatter_categoricals=True)
Target looks like regression
[<AxesSubplot: title={'center': 'Target distribution'}, xlabel='class', ylabel='frequency'>, array([[<AxesSubplot: title={'center': 'F=-6.92E-01'}, xlabel='MYCT', ylabel='class'>]], dtype=object), array([[<AxesSubplot: title={'center': 'F=7.38E-01'}, xlabel='class', ylabel='MMIN'>, <AxesSubplot: title={'center': 'F=5.18E-01'}, xlabel='class', ylabel='MMAX'>, <AxesSubplot: title={'center': 'F=5.18E-01'}, xlabel='class', ylabel='CACH'>, <AxesSubplot: title={'center': 'F=4.04E-01'}, xlabel='class', ylabel='CHMIN'>, <AxesSubplot: title={'center': 'F=3.41E-01'}, xlabel='class', ylabel='CHMAX'>]], dtype=object)]


Bank-Customers Simulations
A family of datasets synthetically generated from a simulation of how bank customers choose their banks. Tasks are based on predicting the fraction of bank customers who leave the bank because of full queues. The bank family of datasets is generated from a simplistic simulator, which simulates the queues in a series of banks. The simulator was constructed with the explicit purpose of generating a family of datasets for DELVE. Customers come from several residential areas, choose their preferred bank depending on distances, and have tasks of varying complexity and various levels of patience. Each bank has several queues that open and close according to demand. The tellers have various levels of effectiveness, and customers may change queue if their patience expires. In the rej prototasks, the object is to predict the rate of rejections, i.e. the fraction of customers that are turned away from the bank because all the open tellers have full queues. Source: collection of regression datasets by Luis Torgo (ltorgo@ncc.up.pt).
data = fetch_openml('bank32nh', as_frame=True)
target_col = data.target.name
plot(data.frame, target_col=target_col, type_hints={target_col: 'continuous'}, find_scatter_categoricals=True)
Target looks like regression
[<AxesSubplot: title={'center': 'Target distribution'}, xlabel='rej', ylabel='frequency'>, array([[<AxesSubplot: title={'center': 'F=5.01E-01'}, xlabel='a2pop', ylabel='rej'>, <AxesSubplot: title={'center': 'F=3.21E-01'}, xlabel='a1pop'>, <AxesSubplot: title={'center': 'F=3.20E-01'}, xlabel='a3pop'>, <AxesSubplot: title={'center': 'F=-9.23E-02'}, xlabel='b1eff'>, <AxesSubplot: title={'center': 'F=-6.77E-02'}, xlabel='temp'>], [<AxesSubplot: title={'center': 'F=-3.88E-02'}, xlabel='a2sy', ylabel='rej'>, <AxesSubplot: title={'center': 'F=-3.74E-02'}, xlabel='b1call (jittered)'>, <AxesSubplot: title={'center': 'F=-3.45E-02'}, xlabel='a2sx'>, <AxesSubplot: title={'center': 'F=-2.49E-02'}, xlabel='b2y'>, <AxesSubplot: title={'center': 'F=2.23E-02'}, xlabel='a3cy'>]], dtype=object), array([[<AxesSubplot: title={'center': 'F=2.78E-02'}, xlabel='rej', ylabel='mxql'>]], dtype=object)]



BNG(autoPrice) Data
data = fetch_openml('autoPrice', as_frame=True)
target_col = data.target.name
plot(data.frame, target_col=target_col, type_hints={target_col: 'continuous'}, find_scatter_categoricals=True)
Target looks like regression
[<AxesSubplot: title={'center': 'Target distribution'}, xlabel='class', ylabel='frequency'>, array([[<AxesSubplot: title={'center': 'F=9.26E-01'}, xlabel='curb-weight', ylabel='class'>, <AxesSubplot: title={'center': 'F=8.77E-01'}, xlabel='length'>, <AxesSubplot: title={'center': 'F=8.61E-01'}, xlabel='horsepower'>, <AxesSubplot: title={'center': 'F=-8.54E-01'}, xlabel='highway-mpg (jittered)'>], [<AxesSubplot: title={'center': 'F=8.39E-01'}, xlabel='width', ylabel='class'>, <AxesSubplot: title={'center': 'F=7.94E-01'}, xlabel='wheel-base'>, <AxesSubplot: title={'center': 'F=6.58E-01'}, xlabel='bore'>, <AxesSubplot: title={'center': 'F=3.64E-01'}, xlabel='height'>], [<AxesSubplot: title={'center': 'F=-2.35E-01'}, xlabel='symboling (jittered)', ylabel='class'>, <AxesSubplot: title={'center': 'F=-1.92E-01'}, xlabel='compression-ratio'>, <AxesSubplot: title={'center': 'F=1.68E-01'}, xlabel='normalized-losses'>, <AxesSubplot: title={'center': 'F=1.68E-01'}, xlabel='stroke'>]], dtype=object), array([[<AxesSubplot: title={'center': 'F=8.70E-01'}, xlabel='class', ylabel='engine-size'>, <AxesSubplot: title={'center': 'F=4.14E-01'}, xlabel='class', ylabel='peak-rpm'>]], dtype=object)]



Bangladesh-Rainfall
Analyzing the rainfall in Bangladesh over the years and predicting the amount of rainfall (in mm) in the future. This is a time series dataset.
data = fetch_openml('rainfall_bangladesh', as_frame=True)
target_col = data.target.name
plot(data.frame, target_col=target_col, type_hints={target_col: 'continuous'}, find_scatter_categoricals=True)
Target looks like regression
[<AxesSubplot: title={'center': 'Target distribution'}, xlabel='Rainfall', ylabel='frequency'>, array([[<AxesSubplot: title={'center': 'F=1.67E-02'}, xlabel='Year (jittered)', ylabel='Rainfall'>]], dtype=object), array([[<AxesSubplot: title={'center': 'F=5.91E-01'}, xlabel='Rainfall', ylabel='Station'>, <AxesSubplot: title={'center': 'F=6.47E-02'}, xlabel='Rainfall', ylabel='Month'>]],


Brazilian Houses
A dataset of houses to rent in different cities in Brazil.
This dataset contains 10962 houses to rent with 13 different features.
data = fetch_openml('Brazilian_houses', as_frame=True)
target_col = data.target.name
plot(data.frame, target_col=target_col, type_hints={target_col: 'continuous'}, find_scatter_categoricals=True)
Target looks like regression
[<AxesSubplot: title={'center': 'Target distribution'}, xlabel='total_(BRL)', ylabel='frequency'>, array([[<AxesSubplot: title={'center': 'F=9.69E-01'}, xlabel='rent_amount_(BRL)', ylabel='total_(BRL)'>, <AxesSubplot: title={'center': 'F=7.43E-01'}, xlabel='area'>, <AxesSubplot: title={'center': 'F=7.41E-01'}, xlabel='bathroom (jittered)'>, <AxesSubplot: title={'center': 'F=7.31E-01'}, xlabel='property_tax_(BRL)'>], [<AxesSubplot: title={'center': 'F=6.42E-01'}, xlabel='parking_spaces (jittered)', ylabel='total_(BRL)'>, <AxesSubplot: title={'center': 'F=6.23E-01'}, xlabel='rooms (jittered)'>, <AxesSubplot: title={'center': 'F=5.20E-01'}, xlabel='hoa_(BRL)'>, <AxesSubplot: >]], dtype=object), array([[<AxesSubplot: title={'center': 'F=1.18E-01'}, xlabel='total_(BRL)', ylabel='city'>, <AxesSubplot: title={'center': 'F=1.11E-01'}, xlabel='total_(BRL)', ylabel='floor'>, <AxesSubplot: title={'center': 'F=4.66E-02'}, xlabel='total_(BRL)', ylabel='animal'>, <AxesSubplot: title={'center': 'F=3.00E-02'}, xlabel='total_(BRL)', ylabel='furniture'>]], dtype=object)]


The 1000 Cameras Dataset
Data describing 1000 cameras with 13 properties.
The 13 properties of each camera:
- Model
- Release date
- Max resolution
- Low resolution
- Effective pixels
- Zoom wide
- Zoom tele (T)
- Normal focus range
- Macro focus range
- Storage included
- Weight (inc. batteries)
- Dimensions
- Price
The original source can be found here.
data = fetch_openml('1000-Cameras-Dataset', as_frame=True)
target_col = data.target.name
plot(data.frame, target_col=target_col, type_hints={target_col: 'continuous'}, find_scatter_categoricals=True)
Target looks like regression
[<AxesSubplot: title={'center': 'Target distribution'}, xlabel='Price', ylabel='frequency'>, array([[<AxesSubplot: title={'center': 'F=-1.69E-01'}, xlabel='Storage_included', ylabel='Price'>, <AxesSubplot: title={'center': 'F=-1.49E-01'}, xlabel='Zoom_wide_(W) (jittered)'>, <AxesSubplot: title={'center': 'F=1.46E-01'}, xlabel='Low_resolution'>], [<AxesSubplot: title={'center': 'F=1.45E-01'}, xlabel='Dimensions', ylabel='Price'>, <AxesSubplot: title={'center': 'F=-1.39E-01'}, xlabel='Zoom_tele_(T)'>, <AxesSubplot: title={'center': 'F=1.34E-01'}, xlabel='Weight_(inc._batteries)'>], [<AxesSubplot: title={'center': 'F=-1.29E-01'}, xlabel='Macro_focus_range', ylabel='Price'>, <AxesSubplot: title={'center': 'F=9.34E-02'}, xlabel='Max_resolution'>, <AxesSubplot: title={'center': 'F=-1.22E-02'}, xlabel='Release_date (jittered)'>]], dtype=object), array([[<AxesSubplot: title={'center': 'F=5.70E-01'}, xlabel='Price', ylabel='Normal_focus_range'>]], dtype=object)]



Summary
- This case study represents a comprehensive test of the Dabl API for both classification and regression.
- Dabl is open-source software created by Andreas Mueller. It makes supervised machine learning more accessible for beginners and reduces the boilerplate of common ML tasks. Dabl takes inspiration from scikit-learn and auto-sklearn.
- Dabl determines whether the target is categorical or continuous, plots the target distribution, and then calls the relevant plotting functions accordingly.
- Dabl UI provides us with lots of information about what is happening in the different data columns.
- The SimpleClassifier does all the supervised ML work for us. It implements the familiar scikit-learn API of fit and predict (a compact recap sketch follows this list).
- The real strength of Dabl is in providing simple interfaces for EDA and ML.
- In this article, we used Dabl for data pre-processing, visualisation and analysis as well as ML model development.
- Dabl offers ways of automating processes that otherwise take a lot of time and effort. Faster processing of data leads to faster model development and prototyping. Using Dabl not only makes data wrangling easier but also more efficient. The dabl documentation indicates that there are some useful features still to come, including model explainers and tools for enhanced model building.
- In this study, Dabl has been tested on more than 30 datasets, most of them fetched from OpenML. These cover widely varying business applications.
- Results show that Dabl can be useful for various problems involving regression and classification of small and highly heterogeneous data sets.
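As a compact recap of the workflow used throughout this post, the sketch below chains the main dabl calls together on the bundled adult dataset; treat it as a template rather than a definitive recipe, and swap in your own DataFrame and target column.

# Recap sketch of the dabl workflow shown in this post (clean -> plot -> fit -> explain).
import dabl
import matplotlib.pyplot as plt
from dabl.datasets import load_adult
df = load_adult()                          # stand-in for your own DataFrame
df_clean = dabl.clean(df)                  # basic cleaning and type handling
print(dabl.detect_types(df_clean))         # per-column detected feature types
dabl.plot(df_clean, target_col="income")   # automated EDA plots
plt.show()
model = dabl.SimpleClassifier(random_state=0)
model.fit(df_clean.drop("income", axis=1), df_clean["income"])
dabl.explain(model)                        # quick inspection of the chosen model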
Explore More
- ML/AI Prediction of Wine Quality
- Semantic Analysis and NLP Visualizations of Wine Reviews
- ML/AI Diamond Price Prediction with R
- Multi-Label Keras CNN Image Classification of MNIST Fashion Clothing
- ML/AI Wildfire Prediction using Remote Sensing Data
- Towards Optimized ML Wildfire Prediction
- Telco Customer Churn/Retention Rate ML/AI Strategies that Work!
- Real Estate Supervised ML/AI Linear Regression Revisited – USA House Price Prediction
- US Real Estate – Harnessing the Power of AI