Unsupervised ML, K-Means Clustering & Customer Segmentation

The goal of this post is to describe the most popular, easy-to-use, and effective data-driven techniques that help organizations streamline their operations, build effective relationships with with individual people (customers, service users, colleagues, or suppliers), increase sales, improve customer service, and increase profitability.
Customer Segmentation (CS) is a core concept in this technology. This is the process of tagging and grouping customers based on shared characteristics. This process also makes it easy to tailor and personalize your marketing, service, and sales efforts to the needs of specific groups. The result is a potential boost to customer loyalty and conversions.
This article represents a comprehensive step-by-step CS guide in Python that helps you manage customer relationships across the entire customer lifecycle, at every marketing, sales, e-commerce, and customer service interaction.
Our guide consists of hands-on case examples and best industry practices with open-source data and Python source codes: mall customers, online retail, marketing campaigns, supply chain, bank churn, groceries market, and e-commerce.

Table of Clickable Contents

Motivation

Uniform communication strategies that are not customer centric have failed to exploit millions of dollars in opportunities at the individual and segment level.

The more comprehensive CS-based modeling provides executives and marketers with the ability to identify high-profit customers and micro-segments that will drive increases in traffic, sales, profit and retention.

Methods

We will use the following techniques: unsupervised ML clustering, customer segmentation, cohort, market basket, bank churn, CRM data analytics, ABC & RFM analysis.

ABC analysis is a tool that allows companies to divide their inventory or customers into groups based on their respective values to the company. The goal of this type of analysis is to ensure that businesses are optimizing their time and resources when serving customers.
What is RFM (recency, frequency, monetary) analysis? RFM analysis is a marketing technique used to quantitatively rank and group customers based on the recency, frequency and monetary total of their recent transactions to identify the best customers and perform targeted marketing campaigns.
Cohort analysis is a type of behavioral analytics in which you take a group of users, and analyze their usage patterns based on their shared traits to better track and understand their actions. A cohort is simply a group of people with shared characteristics.
Market Basket Analysis is one of the key techniques used by large retailers to uncover associations between items. It works by looking for combinations of items that occur together frequently in transactions. To put it another way, it allows retailers to identify relationships between the items that people buy.
Customer Churn prediction means knowing which customers are likely to leave or unsubscribe from your service. Customers have different behaviors and preferences, and reasons for cancelling their subscriptions. Therefore, it is important to actively communicate with each of them to keep them on your customer list. You need to know which marketing activities are most effective for individual customers and when they are most effective.

Open-Source Datasets

Mall Customer Segmentation Data

This file contains the basic information (ID, age, gender, income, and spending score) about the customers.

Online Retail Dataset

Online retail is a transnational data set which contains all the transactions occurring between 01/12/2010 and 09/12/2011 for a UK-based and registered non-store online retail. The company mainly sells unique all-occasion gifts. Many customers of the company are wholesalers.

We will be using the online retail transnational dataset to build RFM clustering and choose the best set of customers which the company should target.

Online Retail Dataset II

Online Retail II dataset contains all the transactions occurring for a UK-based and registered, non-store online retail. This is a real online retail transaction data set of two years between 01/12/2009 and 09/12/2011.

The Marketing Campaign Dataset

The dataset was used in the Customer Personality Analysis (CPA). CPA is a detailed analysis of a company’s ideal customers. It helps a business to better understand its customers and makes it easier for them to modify products according to the specific needs, behaviours and concerns of different types of customers.

DataCo SMART SUPPLY CHAIN FOR BIG DATA ANALYSIS

SUPPLY CHAIN FOR BIG DATA ANALYSIS

Areas of important registered activities : Provisioning , Production , Sales , Commercial Distribution.

Type Data :
Structured Data : DataCoSupplyChainDataset.csv
Unstructured Data : tokenized_access_logs.csv (Clickstream)

Types of Products : Clothing , Sports , and Electronic Supplies

Additionally it is attached in another file called DescriptionDataCoSupplyChain.csv, the description of each of the variables of the DataCoSupplyChainDatasetc.csv.

Groceries Dataset Groceries_dataset.csv

3 columns: ID of customer, Date of purchase, and Description of product purchased.

Market Basket Analysis is one of the key techniques used by large retailers to uncover associations between items. It works by looking for combinations of items that occur together frequently in transactions.

Association Rules are widely used to analyze retail basket or transaction data and are intended to identify strong rules discovered in transaction data using measures of interestingness, based on the concept of strong rules.

Pakistan’s Largest E-Commerce Dataset

Half a million transaction record for e-commerce sales in Pakistan.

Attributes

People

ID: Customer’s unique identifier
Year_Birth: Customer’s birth year
Education: Customer’s education level
Marital_Status: Customer’s marital status
Income: Customer’s yearly household income
Kidhome: Number of children in customer’s household
Teenhome: Number of teenagers in customer’s household
Dt_Customer: Date of customer’s enrollment with the company
Recency: Number of days since customer’s last purchase
Complain: 1 if the customer complained in the last 2 years, 0 otherwise

Products

MntWines: Amount spent on wine in last 2 years
MntFruits: Amount spent on fruits in last 2 years
MntMeatProducts: Amount spent on meat in last 2 years
MntFishProducts: Amount spent on fish in last 2 years
MntSweetProducts: Amount spent on sweets in last 2 years
MntGoldProds: Amount spent on gold in last 2 years

Promotion

NumDealsPurchases: Number of purchases made with a discount
AcceptedCmp1: 1 if customer accepted the offer in the 1st campaign, 0 otherwise
AcceptedCmp2: 1 if customer accepted the offer in the 2nd campaign, 0 otherwise
AcceptedCmp3: 1 if customer accepted the offer in the 3rd campaign, 0 otherwise
AcceptedCmp4: 1 if customer accepted the offer in the 4th campaign, 0 otherwise
AcceptedCmp5: 1 if customer accepted the offer in the 5th campaign, 0 otherwise
Response: 1 if customer accepted the offer in the last campaign, 0 otherwise

Place

NumWebPurchases: Number of purchases made through the company’s website
NumCatalogPurchases: Number of purchases made using a catalogue
NumStorePurchases: Number of purchases made directly in stores
NumWebVisitsMonth: Number of visits to company’s website in the last month

Bank Churn Modelling Data

Scope: Bank Churn Modelling, Bank Customers Churn , Churn Prediction of bank customers

Mall Customer Segmentation

Let’s set the working directory YOURPATH

import os
os.chdir(‘YOURPATH’)
os. getcwd()

Importing libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

import plotly.express as px
import plotly.graph_objs as go
import missingno as msno

from sklearn.cluster import KMeans
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.cluster import AgglomerativeClustering
from sklearn.cluster import DBSCAN
from sklearn import metrics
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

import warnings
warnings.filterwarnings(‘ignore’)

plt.style.use(‘fivethirtyeight’)
%matplotlib inline

Let’s read the input dataset

df = pd.read_csv(‘Mall_Customers.csv’)

and check the content
df.head()

Let’s check the descriptive statistics

df.describe().T

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 5 columns):
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   CustomerID              200 non-null    int64 
 1   Gender                  200 non-null    object
 2   Age                     200 non-null    int64 
 3   Annual Income (k$)      200 non-null    int64 
 4   Spending Score (1-100)  200 non-null    int64 
dtypes: int64(4), object(1)
memory usage: 7.9+ KB

Let’s drop the column

df.drop(‘CustomerID’, axis = 1, inplace = True)

and plot the following distribution plots

plt.figure(figsize = (20, 8))
plotnumber = 1

for col in [‘Age’, ‘Annual Income (k$)’, ‘Spending Score (1-100)’]:
if plotnumber <= 3:
ax = plt.subplot(1, 3, plotnumber)
sns.distplot(df[col])

plotnumber += 1

plt.tight_layout()
plt.show()

Distributions: age, annual income, and spending score.

Let’s look at the Gender pie-plot

values = df[‘Gender’].value_counts()
labels = [‘Male’, ‘Female’]

fig, ax = plt.subplots(figsize = (4, 4), dpi = 100)
explode = (0, 0.06)

patches, texts, autotexts = ax.pie(values, labels = labels, autopct = ‘%1.2f%%’, shadow = True,
startangle = 90, explode = explode)

plt.setp(texts, color = ‘red’)
plt.setp(autotexts, size = 12, color = ‘white’)
autotexts[1].set_color(‘black’)
plt.show()

Let’s compare the corresponding violin plots

plt.figure(figsize = (20, 8))
plotnumber = 1

for col in [‘Age’, ‘Annual Income (k$)’, ‘Spending Score (1-100)’]:
if plotnumber <= 3:
ax = plt.subplot(1, 3, plotnumber)
sns.violinplot(x = col, y = ‘Gender’, data = df)

plotnumber += 1

plt.tight_layout()
plt.show()

Violin plots: gender vs age, annual income, and spending score.

Let’s compare different age groups

age_18_25 = df.Age[(df.Age >= 18) & (df.Age <= 25)] age_26_35 = df.Age[(df.Age >= 26) & (df.Age <= 35)] age_36_45 = df.Age[(df.Age >= 36) & (df.Age <= 45)] age_46_55 = df.Age[(df.Age >= 46) & (df.Age <= 55)] age_55above = df.Age[df.Age >= 55]

x_age = [’18-25′, ’26-35′, ’36-45′, ’46-55′, ’55+’]
y_age = [len(age_18_25.values), len(age_26_35.values), len(age_36_45.values), len(age_46_55.values),
len(age_55above.values)]

px.bar(data_frame = df, x = x_age, y = y_age, color = x_age,
title = ‘Number of customers per age group’)

Let’s plot the Relation between Annual Income and Spending Score

fig=px.scatter(data_frame = df, x = ‘Annual Income (k$)’, y = ‘Spending Score (1-100)’, color=”Age”,
title = ‘Relation between Annual Income and Spending Score’)
fig.update_traces(marker=dict(size=18,
line=dict(width=1,
color=’DarkSlateGrey’)),
selector=dict(mode=’markers’))

Let’s look at the Number of customers per Spending Score group

ss_1_20 = df[‘Spending Score (1-100)’][(df[‘Spending Score (1-100)’] >= 1) &
(df[‘Spending Score (1-100)’] <= 20)]

ss_21_40 = df[‘Spending Score (1-100)’][(df[‘Spending Score (1-100)’] >= 21) &
(df[‘Spending Score (1-100)’] <= 40)]

ss_41_60 = df[‘Spending Score (1-100)’][(df[‘Spending Score (1-100)’] >= 41) &
(df[‘Spending Score (1-100)’] <= 60)]

ss_61_80 = df[‘Spending Score (1-100)’][(df[‘Spending Score (1-100)’] >= 61) &
(df[‘Spending Score (1-100)’] <= 80)]

ss_81_100 = df[‘Spending Score (1-100)’][(df[‘Spending Score (1-100)’] >= 81) &
(df[‘Spending Score (1-100)’] <= 100)]

x_ss = [‘1-20′, ’21-40′, ’41-60′, ’61-80′, ’81-100’]
y_ss = [len(ss_1_20.values), len(ss_21_40.values), len(ss_41_60.values), len(ss_61_80.values),
len(ss_81_100.values)]

px.bar(data_frame = df, x = x_ss, y = y_ss, color = x_ss,
title = ‘Number of customers per Spending Score group’)

Let’s check the Number of customers per Annual Income group

ai_0_30 = df[‘Annual Income (k$)’][(df[‘Annual Income (k$)’] >= 0) & (df[‘Annual Income (k$)’] <= 30)] ai_31_60 = df[‘Annual Income (k$)’][(df[‘Annual Income (k$)’] >= 31)&(df[‘Annual Income (k$)’] <= 60)] ai_61_90 = df[‘Annual Income (k$)’][(df[‘Annual Income (k$)’] >= 61)&(df[‘Annual Income (k$)’] <= 90)] ai_91_120 = df[‘Annual Income (k$)’][(df[‘Annual Income (k$)’]>= 91)&(df[‘Annual Income (k$)’]<=120)] ai_121_150 = df[‘Annual Income (k$)’][(df[‘Annual Income (k$)’]>=121)&(df[‘Annual Income (k$)’]<=150)]

x_ai = [‘$ 0-30,000’, ‘$ 30,001-60,000’, ‘$ 60,001-90,000’, ‘$ 90,001-120,000’, ‘$ 120,000-150,000’]
y_ai = [len(ai_0_30.values) , len(ai_31_60.values) , len(ai_61_90.values) , len(ai_91_120.values),
len(ai_121_150.values)]

px.bar(data_frame = df, x = x_ai, y = y_ai, color = x_ai,
title = ‘Number of customers per Annual Income group’)

Let’s perform K-means clustering using Age and Spending Score columns

X1 = df.loc[:, [‘Age’, ‘Spending Score (1-100)’]].values

wcss= []
for k in range(1, 11):
kmeans = KMeans(n_clusters = k, init = ‘k-means++’)
kmeans.fit(X1)
wcss.append(kmeans.inertia_)

plt.figure(figsize = (12, 7))

plt.plot(range(1, 11), wcss, linewidth = 2, marker = ‘8’)
plt.title(‘Elbow Plot\n’, fontsize = 20)
plt.xlabel(‘K’)
plt.ylabel(‘WCSS’)
plt.show()

K-means clustering using Age and Spending Score columns: Elbow plot

Let’s set n_clusters = 4 and run

kmeans = KMeans(n_clusters = 4)
labels = kmeans.fit_predict(X1)

Let’s plot customer clusters in the domain Age vs Spending Score (1-100)

plt.figure(figsize = (14, 8))

plt.scatter(X1[:, 0], X1[:, 1], c = kmeans.labels_, s = 200)
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], color = ‘red’, s = 350)
plt.title(‘Clusters of Customers\n’, fontsize = 20)
plt.xlabel(‘Age’)
plt.ylabel(‘Spending Score (1-100)’)
plt.show()

Customer clusters in the domain Age vs Spending Score (1-100)

Let’s perform K-means clustering using Age and Annual Income columns

X2 = df.loc[:, [‘Age’, ‘Annual Income (k$)’]].values

wcss= []
for k in range(1, 11):
kmeans = KMeans(n_clusters = k, init = ‘k-means++’)
kmeans.fit(X2)
wcss.append(kmeans.inertia_)

plt.figure(figsize = (12, 7))

plt.plot(range(1, 11), wcss, linewidth = 2, marker = ‘8’)
plt.title(‘Elbow Plot\n’, fontsize = 20)
plt.xlabel(‘K’)
plt.ylabel(‘WCSS’)
plt.show()

K-means clustering using Age and Annual Income columns: Elbow Plot

Let’s set n_clusters = 5 and run

kmeans = KMeans(n_clusters = 5)
labels = kmeans.fit_predict(X2)

Let’s plot Customer Clusters in the domain Annual Income (k$) vs Spending Score (1-100)

Let’s perform K-means clustering using Age, Annual Score and Spending Score columns
X4 = df.iloc[:, 1:]

wcss= []
for k in range(1, 11):
kmeans = KMeans(n_clusters = k, init = ‘k-means++’)
kmeans.fit(X4)
wcss.append(kmeans.inertia_)

plt.figure(figsize = (12, 7))

plt.plot(range(1, 11), wcss, linewidth = 2, marker = ‘8’)
plt.title(‘Elbow Plot\n’, fontsize = 20)
plt.xlabel(‘K’)
plt.ylabel(‘WCSS’)
plt.show()

K-means clustering using Age, Annual Score and Spending Score columns: Elbow plot

Let’s set n_clusters = 6 and run

kmeans = KMeans(n_clusters = 6)
clusters = kmeans.fit_predict(X4)
X4[‘label’] = clusters

Let’s look at the corresponding 3D plot

fig = px.scatter_3d(X4, x=”Annual Income (k$)”, y=”Spending Score (1-100)”, z=”Age”,
color = ‘label’, size = ‘label’)
fig.show()

Customer clusters 3D plot: x="Annual Income (k$)", y="Spending Score (1-100)", z="Age"

Let’s plot the Hierarchical Clustering dendrogram
plt.figure(figsize = (22, 8))

dendo = dendrogram(linkage(X3, method = ‘ward’))
plt.title(‘Dendrogram’, fontsize = 15)
plt.show()

Let’s run Agglomerative Clustering with n_clusters = 5

agc = AgglomerativeClustering(n_clusters = 5, affinity = ‘euclidean’, linkage = ‘ward’)
labels = agc.fit_predict(X3)

plt.figure(figsize = (12, 8))
smb=200

plt.scatter(X3[labels == 0,0], X3[labels == 0,1], label = ‘Cluster 1’, s = smb)
plt.scatter(X3[labels == 1,0], X3[labels == 1,1], label = ‘Cluster 2’, s = smb)
plt.scatter(X3[labels == 2,0], X3[labels == 2,1], label = ‘Cluster 3’, s = smb)
plt.scatter(X3[labels == 3,0], X3[labels == 3,1], label = ‘Cluster 4’, s = smb)
plt.scatter(X3[labels == 4,0], X3[labels == 4,1], label = ‘Cluster 5’, s = smb)

plt.legend(loc = ‘best’)
plt.title(‘Clusters of Customers\n ‘, fontsize = 20)
plt.xlabel(‘Annual Income (k$)’)
plt.ylabel(‘Spending Score (1-100)’)
plt.show()

Clusters of Customers:
Annual Income (k$) vs
Spending Score (1-100)

Let’s compare these results with the DBSCAN clustering algorithm

centers = [[1, 1], [-1, -1], [1, -1]]
X, labels_true = make_blobs(n_samples=750, centers=centers, cluster_std=0.4,
random_state=0) # generate sample blobs

X = StandardScaler().fit_transform(X)

db = DBSCAN(eps=0.3, min_samples=10).fit(X)

Creating an array of true and false as the same size as db.labels
core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
core_samples_mask[db.core_sample_indices_] = True

(setting the indices of the core regions to True)
labels = db.labels_

(similar to the model.fit() method, it gives the labels of the clustered data).

The label -1 is considered as noise by the DBSCAN algorithm
n_clusters_ = len(set(labels)) – (1 if -1 in labels else 0)
n_noise_ = list(labels).count(-1) # calculating the number of clusters

print(‘Estimated number of clusters: %d’ % n_clusters_)
print(‘Estimated number of noise points: %d’ % n_noise_)

“””Homogeneity metric of a cluster labeling given a ground truth.

A clustering result satisfies homogeneity if all of its clusters
contain only data points which are members of a single class.”””

print(“Homogeneity: %0.3f” % metrics.homogeneity_score(labels_true, labels))

Estimated number of clusters: 3
Estimated number of noise points: 18
Homogeneity: 0.953

Let’s plot the result representing 3 clusters + noise

plt.figure(figsize = (10, 8))

unique_labels = set(labels) # identifying all the unique labels/clusters
colors = [plt.cm.Spectral(each)
# creating the list of colours, generating the colourmap
for each in np.linspace(0, 1, len(unique_labels))]

for k, col in zip(unique_labels, colors):
if k == -1:
# Black used for noise.
col = [0, 0, 0, 1]
class_member_mask = (labels == k) # assigning class members for each class
xy = X[class_member_mask & core_samples_mask] # creating the list of points for each class
plt.plot(xy[:, 0], xy[:, 1], ‘o’, markerfacecolor=tuple(col),markeredgecolor=’k’, markersize=14)
xy = X[class_member_mask & ~core_samples_mask] # creating the list of noise points
plt.plot(xy[:, 0], xy[:, 1], ‘o’, markerfacecolor=tuple(col), markeredgecolor=’k’, markersize=14)

plt.title(‘Clustering using DBSCAN\n’, fontsize = 15)
plt.show()

Here black symbols designate noise.

Read more here.

Online Retail K-Means Clustering

Let’s continue using the dataset Mall_Customers.csv

from sklearn.preprocessing import StandardScaler

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

import os
import warnings

warnings.filterwarnings(‘ignore’)

df = pd.read_csv(‘Mall_Customers.csv’)

Let’s rename the columns

df.rename(index=str, columns={‘Annual Income (k$)’: ‘Income’,
‘Spending Score (1-100)’: ‘Score’}, inplace=True)
df.head()

Let’s examine the input data using sns.pairplot
X = df.drop([‘CustomerID’, ‘Gender’], axis=1)
sns.pairplot(df.drop(‘CustomerID’, axis=1), hue=’Gender’, aspect=1.5)
plt.show()

Let’s look at the K-Means Elbow plot

from sklearn.cluster import KMeans

clusters = []

for i in range(1, 11):
km = KMeans(n_clusters=i).fit(X)
clusters.append(km.inertia_)

fig, ax = plt.subplots(figsize=(12, 8))
sns.lineplot(x=list(range(1, 11)), y=clusters, ax=ax,linewidth = 4)
ax.set_title(‘Searching for Elbow’)
ax.set_xlabel(‘Clusters’)
ax.set_ylabel(‘Inertia’)

ax.annotate(‘Possible Elbow Point’, xy=(3, 140000), xytext=(3, 50000), xycoords=’data’,
arrowprops=dict(arrowstyle=’->’, connectionstyle=’arc3′, color=’red’, lw=2))

ax.annotate(‘Possible Elbow Point’, xy=(5, 80000), xytext=(5, 150000), xycoords=’data’,
arrowprops=dict(arrowstyle=’->’, connectionstyle=’arc3′, color=’red’, lw=2))

plt.show()

Let’s plot K-means with 3 clusters Income vs Score

km3 = KMeans(n_clusters=3).fit(X)

X[‘Labels’] = km3.labels_
plt.figure(figsize=(12, 8))

sns.relplot(x=”Income”, y=”Score”, hue=”Labels”, size=”Income”,
sizes=(40, 400), alpha=.5, palette=”muted”,
height=6, data=X)
plt.title(‘KMeans with 3 Clusters’)
plt.show()

Let’s plot K-means with 3 clusters Income-Age vs Score

km3 = KMeans(n_clusters=3).fit(X)

X[‘Labels’] = km3.labels_
plt.figure(figsize=(12, 8))

sns.relplot(x=”Income”, y=”Score”, hue=”Labels”, size=”Age”,
sizes=(40, 400), alpha=.5, palette=”muted”,
height=6, data=X)
plt.title(‘KMeans with 3 Clusters’)
plt.show()

K-means with 3 clusters Income-Age vs Score

Let’s plot K-means with 5 clusters Income vs Score

km3 = KMeans(n_clusters=5).fit(X)

X[‘Labels’] = km3.labels_
plt.figure(figsize=(12, 8))
sns.relplot(x=”Income”, y=”Score”, hue=”Labels”, size=”Income”,
sizes=(40, 400), alpha=.5, palette=”muted”,
height=6, data=X)
plt.title(‘KMeans with 5 Clusters’)
plt.show()

Let’s plot K-means with 5 clusters Income-Age vs Score

km3 = KMeans(n_clusters=5).fit(X)

X[‘Labels’] = km3.labels_
plt.figure(figsize=(12, 8))
sns.relplot(x=”Income”, y=”Score”, hue=”Labels”, size=”Age”,
sizes=(40, 400), alpha=.5, palette=”muted”,
height=6, data=X)
plt.title(‘KMeans with 5 Clusters’)
plt.show()

K-means with 5 clusters Income-Age vs Score

Let’s invoke sns.swarmplot to compare Labels According to Annual Income and Scoring History

fig = plt.figure(figsize=(20,8))
sns.set(font_scale=2)
ax = fig.add_subplot(121)
sns.swarmplot(x=’Labels’, y=’Income’, data=X, ax=ax,hue=”Labels”, legend=False)
ax.set_title(‘Labels According to Annual Income’)

ax = fig.add_subplot(122)
sns.swarmplot(x=’Labels’, y=’Score’, data=X, ax=ax)
ax.set_title(‘Labels According to Scoring History’)

plt.show()

sns.swarmplot: Labels according to Annual Income and Scoring History

Hierarchical Clustering:
from sklearn.cluster import AgglomerativeClustering
sns.set(font_scale=1)
agglom = AgglomerativeClustering(n_clusters=5, linkage=’average’).fit(X)

X[‘Labels’] = agglom.labels_
plt.figure(figsize=(12, 8))

sns.relplot(x=”Income”, y=”Score”, hue=”Labels”, size=”Age”,
sizes=(40, 400), alpha=.5, palette=sns.color_palette(‘hls’, 5),
height=6, data=X)

plt.title(‘Agglomerative with 5 Clusters’)
plt.show()

Agglomerative Clustering with 5 clusters

Let’s compute the spatial distance matrix

from scipy.cluster import hierarchy
from scipy.spatial import distance_matrix

dist = distance_matrix(X, X)
Z = hierarchy.linkage(dist, ‘complete’)

and plot the dendrogram with dist=’complete’

plt.figure(figsize=(18, 50))
dendro = hierarchy.dendrogram(Z, leaf_rotation=0, leaf_font_size=12, orientation=’right’)

Let’s plot the dendrogram with dist=’average’

Z = hierarchy.linkage(dist, ‘average’)
plt.figure(figsize=(18, 50))
dendro = hierarchy.dendrogram(Z, leaf_rotation=0, leaf_font_size =12, orientation = ‘right’)

Density Based Clustering (DBSCAN):
from sklearn.cluster import DBSCAN

db = DBSCAN(eps=11, min_samples=6).fit(X)

X[‘Labels’] = db.labels_
plt.figure(figsize=(12, 8))

sns.relplot(x=”Income”, y=”Score”, hue=”Labels”, size=”Age”,
sizes=(40, 400), alpha=.5, palette=sns.color_palette(‘hls’, 5),
height=6, data=X)
plt.title(‘DBSCAN with epsilon 11, min samples 6’)
plt.show()

Read more here.

Online Retail Data Analytics

Let’s look at the OnlineRetail.csv dataset

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import missingno as msno
df = pd.read_csv(“OnlineRetail.csv”, delimiter=’,’, encoding = “ISO-8859-1”)

Let’s check the data structure

df.head()

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 541909 entries, 0 to 541908
Data columns (total 8 columns):
 #   Column       Non-Null Count   Dtype  
---  ------       --------------   -----  
 0   InvoiceNo    541909 non-null  object 
 1   StockCode    541909 non-null  object 
 2   Description  540455 non-null  object 
 3   Quantity     541909 non-null  int64  
 4   InvoiceDate  541909 non-null  object 
 5   UnitPrice    541909 non-null  float64
 6   CustomerID   406829 non-null  float64
 7   Country      541909 non-null  object 
dtypes: float64(2), int64(1), object(5)
memory usage: 33.1+ MB

df.describe()

Data Cleaning: Checking for Null Values
msno.bar(df)

Checking for Null Values before data cleaning

df.count()

InvoiceNo      541909
StockCode      541909
Description    540455
Quantity       541909
InvoiceDate    541909
UnitPrice      541909
CustomerID     406829
Country        541909
dtype: int64

df[df[‘CustomerID’].isnull()].count()

InvoiceNo      135080
StockCode      135080
Description    133626
Quantity       135080
InvoiceDate    135080
UnitPrice      135080
CustomerID          0
Country        135080
dtype: int64

Let’s check the percentage

100 – ((541909-135000)/541909 * 100)

24.911931708091203

So, approximately 25% of the data is missing.

We will proceed with dropping the missing rows now.
df.dropna(inplace=True)

msno.bar(df)

Checking for Null Values after data cleaning

Let’s perform further data editing

df[‘InvoiceDate’] = pd.to_datetime(df[‘InvoiceDate’], format=’%d-%m-%Y %H:%M’)

df[‘Total Amount Spent’]= df[‘Quantity’] * df[‘UnitPrice’]

total_amount = df[‘Total Amount Spent’].groupby(df[‘CustomerID’]).sum()
total_amount = pd.DataFrame(total_amount).reset_index()

transactions = df[‘InvoiceNo’].groupby(df[‘CustomerID’]).count()
transaction = pd.DataFrame(transactions).reset_index()

final = df[‘InvoiceDate’].max()
df[‘Last_transact’] = final – df[‘InvoiceDate’]
LT = df.groupby(df[‘CustomerID’]).min()[‘Last_transact’]
LT = pd.DataFrame(LT).reset_index()

df_new = pd.merge(total_amount, transaction, how=’inner’, on=’CustomerID’)
df_new = pd.merge(df_new, LT, how=’inner’, on=’CustomerID’)

df_new[‘Last_transact’] = df_new[‘Last_transact’].dt.days

Let’s run the K-Means Clustering Model
from sklearn.cluster import KMeans
kmeans= KMeans(n_clusters=2)

kmeans.fit(df_new[[‘Total Amount Spent’, ‘InvoiceNo’, ‘Last_transact’]])
pred = kmeans.predict(df_new[[‘Total Amount Spent’, ‘InvoiceNo’, ‘Last_transact’]])

df_new=df_new.join(pred, lsuffix=”_left”, rsuffix=”_right”)

error_rate = []
for clusters in range(1,16):
kmeans = KMeans(n_clusters = clusters)
kmeans.fit(df_new)
kmeans.predict(df_new)
error_rate.append(kmeans.inertia_)

error_rate = pd.DataFrame({‘Cluster’:range(1,16) , ‘Error’:error_rate})

Let’s plot the Error Rate and Clusters

plt.figure(figsize=(12,8))
p = sns.barplot(x=’Cluster’, y= ‘Error’, data= error_rate, palette=’coolwarm_r’)
sns.despine(left=True)
p.set_title(‘Error Rate and Clusters’)

Let’s drop 1 column

del df[‘InvoiceDate’]

df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 392692 entries, 0 to 541908
Data columns (total 9 columns):
 #   Column        Non-Null Count   Dtype  
---  ------        --------------   -----  
 0   InvoiceNo     392692 non-null  object 
 1   StockCode     392692 non-null  object 
 2   Description   392692 non-null  object 
 3   Quantity      392692 non-null  int64  
 4   UnitPrice     392692 non-null  float64
 5   CustomerID    392692 non-null  float64
 6   Country       392692 non-null  object 
 7   cancellation  392692 non-null  int64  
 8   TotalPrice    392692 non-null  float64
dtypes: float64(3), int64(2), object(4)
memory usage: 30.0+ MB

Let’s group the data

country_wise = df.groupby(‘Country’).sum()

and read the full country list

country_codes = pd.read_csv(‘wikipedia-iso-country-codes.csv’, names=[‘Country’, ‘two’, ‘three’, ‘numeric’, ‘ISO’])

country_codes.head()

Let’s merge df and country_codes

country_wise = pd.merge(country_codes,country_wise, on=’Country’)

Let’s invoke plotly for plotting purposes

from plotly import version
import cufflinks as cf
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
init_notebook_mode(connected=True)
cf.go_offline()
import plotly.graph_objs as go

Let’s plot European Countries According to Revenue

data = dict(type=’choropleth’,colorscale=’GnBu’, locations = country_wise[‘three’], locationmode = ‘ISO-3’, z= country_wise[‘Total Amount Spent’], text = country_wise[‘Country’], colorbar={‘title’:’Revenue’}, marker = dict(line=dict(width=0)))
layout = dict(title = ‘European Countries According to Revenue!’, geo = dict(scope=’europe’,showlakes=False, projection = {‘type’: ‘winkel tripel’}))
Choromaps2 = go.Figure(data=[data], layout=layout)
iplot(Choromaps2)

Let’s plot All Countries According to Revenue

data = dict(type=’choropleth’,colorscale=’rainbow’, locations = country_wise[‘three’], locationmode = ‘ISO-3’, z= country_wise[‘Total Amount Spent’], text = country_wise[‘Country’], colorbar={‘title’:’Revenue’}, marker = dict(line=dict(width=0)))
layout = dict(title = ‘All Countries According to Revenue!’, geo = dict(scope=’world’,showlakes=False, projection = {‘type’: ‘winkel tripel’}))
Choromaps2 = go.Figure(data=[data], layout=layout)
iplot(Choromaps2)

Let’s create the list of countries based upon the percentage of Total Price

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
df = pd.read_csv(“OnlineRetail.csv”, encoding= ‘unicode_escape’)

df = df.drop_duplicates()

df[‘cancellation’] = df.InvoiceNo.str.extract(‘([C])’).fillna(0).replace({‘C’:1})
df.cancellation.value_counts()

cancellation
0    527390
1      9251
Name: count, dtype: int64

df[df.cancellation == 1][‘CustomerID’].nunique() / df.CustomerID.nunique() * 100

36.34492223238792

df = df[df.CustomerID.notnull()]
df = df[(df.Quantity > 0) & (df.UnitPrice > 0)]

df = df.drop(‘cancellation’, axis = 1)

df[“TotalPrice”] = df.UnitPrice * df.Quantity

df2 = pd.DataFrame(df.groupby(‘Country’).TotalPrice.sum().apply(lambda x: round(x, 2))).sort_values(‘TotalPrice’, ascending = False)

df2[‘perc_of_TotalPrice’] = round(df2.TotalPrice / df2.TotalPrice.sum() * 100, 2)
df2

The list of countries based upon the percentage of Total Price

Read more here.

RFM Customer Segmentation

Let’s set the working directory YOURPATH

import os
os.chdir(‘YOURPATH’)
os. getcwd()

and import the key libraries

!pip install termcolor

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import ipywidgets
from ipywidgets import interact

import numpy as np
import pandas as pd

import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
import matplotlib.ticker as mticker
import squarify as sq

import scipy.stats as stats
import statsmodels.api as sm
import statsmodels.formula.api as smf
import missingno as msno

import datetime as dt
from datetime import datetime

from sklearn.preprocessing import scale, StandardScaler, MinMaxScaler, RobustScaler
from sklearn.preprocessing import PolynomialFeatures
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import PowerTransformer
from sklearn.preprocessing import LabelEncoder
from sklearn.compose import make_column_transformer
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

import plotly.express as px
import cufflinks as cf
import plotly.offline
cf.go_offline()
cf.set_config_file(offline=False, world_readable=True)

import colorama
from colorama import Fore, Style # maakes strings colored
from termcolor import colored
from termcolor import cprint

from wordcloud import WordCloud

import warnings
warnings.filterwarnings(“ignore”)
warnings.warn(“this will not show”)

plt.rcParams[“figure.figsize”] = (10,6)
pd.set_option(‘max_colwidth’,200)
pd.set_option(‘display.max_rows’, 1000)
pd.set_option(‘display.max_columns’, 200)
pd.set_option(‘display.float_format’, lambda x: ‘%.3f’ % x)

Let’s define the following functions

##########################

def missing_values(df):
missing_number = df.isnull().sum().sort_values(ascending=False)
missing_percent = (df.isnull().sum()/df.isnull().count()).sort_values(ascending=False)
missing_values = pd.concat([missing_number, missing_percent], axis=1, keys=[‘Missing_Number’, ‘Missing_Percent’])
return missing_values[missing_values[‘Missing_Number’]>0]

def first_looking(df):
print(colored(“Shape:”, attrs=[‘bold’]), df.shape,’\n’,
colored(‘‘100, ‘red’, attrs=[‘bold’]),
colored(“\nInfo:\n”, attrs=[‘bold’]), sep=”)
print(df.info(), ‘\n’,
colored(‘‘100, ‘red’, attrs=[‘bold’]), sep=”)
print(colored(“Number of Uniques:\n”, attrs=[‘bold’]), df.nunique(),’\n’,
colored(‘‘100, ‘red’, attrs=[‘bold’]), sep=”)
print(colored(“Missing Values:\n”, attrs=[‘bold’]), missing_values(df),’\n’,
colored(‘‘100, ‘red’, attrs=[‘bold’]), sep=”)
print(colored(“All Columns:”, attrs=[‘bold’]), list(df.columns),’\n’,
colored(‘‘100, ‘red’, attrs=[‘bold’]), sep=”)

df.columns= df.columns.str.lower().str.replace('&', '_').str.replace(' ', '_')
print(colored("Columns after rename:", attrs=['bold']), list(df.columns),'\n',
      colored('*'*100, 'red', attrs=['bold']), sep='')  
print(colored("Columns after rename:", attrs=['bold']), list(df.columns),'\n',
      colored('*'*100, 'red', attrs=['bold']), sep='')
print(colored("Descriptive Statistics \n", attrs=['bold']), df.describe().round(2),'\n',
      colored('*'*100, 'red', attrs=['bold']), sep='') # Gives a statstical breakdown of the data.
print(colored("Descriptive Statistics (Categorical Columns) \n", attrs=['bold']), df.describe(include=object).T,'\n',
      colored('*'*100, 'red', attrs=['bold']), sep='') # Gives a statstical breakdown of the data.

def multicolinearity_control(df):
feature =[]
collinear=[]
for col in df.corr().columns:
for i in df.corr().index:
if (abs(df.corr()[col][i])> .9 and abs(df.corr()[col][i]) < 1):
feature.append(col)
collinear.append(i)
print(colored(f”Multicolinearity alert in between:{col} – {i}”,
“red”, attrs=[‘bold’]), df.shape,’\n’,
colored(‘‘100, ‘red’, attrs=[‘bold’]), sep=”)

def duplicate_values(df):
print(colored(“Duplicate check…”, attrs=[‘bold’]), sep=”)
print(“There are”, df.duplicated(subset=None, keep=’first’).sum(), “duplicated observations in the dataset.”)

def drop_null(df, limit):
print(‘Shape:’, df.shape)
for i in df.isnull().sum().index:
if (df.isnull().sum()[i]/df.shape[0]*100)>limit:
print(df.isnull().sum()[i], ‘percent of’, i ,’null and were dropped’)
df.drop(i, axis=1, inplace=True)
print(‘new shape:’, df.shape)
print(‘New shape after missing value control:’, df.shape)

def first_look(col):
print(“column name : “, col)
print(“——————————–“)
print(“Per_of_Nulls : “, “%”, round(df[col].isnull().sum()/df.shape[0]*100, 2))
print(“Num_of_Nulls : “, df[col].isnull().sum())
print(“Num_of_Uniques : “, df[col].nunique())
print(“Duplicates : “, df.duplicated(subset=None, keep=’first’).sum())
print(df[col].value_counts(dropna = False))

def fill_most(df, group_col, col_name):
”’Fills the missing values with the most existing value (mode) in the relevant column according to single-stage grouping”’
for group in list(df[group_col].unique()):
cond = df[group_col]==group
mode = list(df[cond][col_name].mode())
if mode != []:
df.loc[cond, col_name] = df.loc[cond, col_name].fillna(df[cond][col_name].mode()[0])
else:
df.loc[cond, col_name] = df.loc[cond, col_name].fillna(df[col_name].mode()[0])
print(“Number of NaN : “,df[col_name].isnull().sum())
print(“——————“)
print(df[col_name].value_counts(dropna=False))

##########################

Let’s read and copy the input Online Retail dataset

df0=pd.read_csv(‘Online Retail.csv’,index_col=0)
df = df0.copy()

Let’s check the dataset content

first_looking(df)
duplicate_values(df)

Shape:(541909, 8)
****************************************************************************************************
Info:

<class 'pandas.core.frame.DataFrame'>
Index: 541909 entries, 0 to 541908
Data columns (total 8 columns):
 #   Column       Non-Null Count   Dtype  
---  ------       --------------   -----  
 0   InvoiceNo    541909 non-null  object 
 1   StockCode    541909 non-null  object 
 2   Description  540455 non-null  object 
 3   Quantity     541909 non-null  float64
 4   InvoiceDate  541909 non-null  object 
 5   UnitPrice    541909 non-null  float64
 6   CustomerID   406829 non-null  float64
 7   Country      541909 non-null  object 
dtypes: float64(3), object(5)
memory usage: 37.2+ MB
None
****************************************************************************************************
Number of Uniques:
InvoiceNo      25900
StockCode       4070
Description     4223
Quantity         722
InvoiceDate    23260
UnitPrice       1630
CustomerID      4372
Country           38
dtype: int64
****************************************************************************************************
Missing Values:
             Missing_Number  Missing_Percent
CustomerID           135080            0.249
Description            1454            0.003
****************************************************************************************************
All Columns:['InvoiceNo', 'StockCode', 'Description', 'Quantity', 'InvoiceDate', 'UnitPrice', 'CustomerID', 'Country']
****************************************************************************************************
Columns after rename:['invoiceno', 'stockcode', 'description', 'quantity', 'invoicedate', 'unitprice', 'customerid', 'country']
****************************************************************************************************
Columns after rename:['invoiceno', 'stockcode', 'description', 'quantity', 'invoicedate', 'unitprice', 'customerid', 'country']
****************************************************************************************************
Descriptive Statistics 
        quantity  unitprice  customerid
count 541909.000 541909.000  406829.000
mean       9.550      4.610   15287.690
std      218.080     96.760    1713.600
min   -80995.000 -11062.060   12346.000
25%        1.000      1.250   13953.000
50%        3.000      2.080   15152.000
75%       10.000      4.130   16791.000
max    80995.000  38970.000   18287.000
****************************************************************************************************
Descriptive Statistics (Categorical Columns) 
              count unique                                 top    freq
invoiceno    541909  25900                            573585.0    1114
stockcode    541909   4070                              85123A    2313
description  540455   4223  WHITE HANGING HEART T-LIGHT HOLDER    2369
invoicedate  541909  23260                 2011-10-31 14:41:00    1114
country      541909     38                      United Kingdom  495478
****************************************************************************************************
Duplicate check...
There are 5268 duplicated observations in the dataset.

Let’s calculate the total price

df[‘total_price’] = df[‘quantity’] * df[‘unitprice’]
print(“There is”, df.shape[0], “observations and”, df.shape[1], “columns in the dataset”)

There is 541909 observations and 9 columns in the dataset

Let’s plot Invoice Counts Per Country

fig = px.histogram(df, x = df.groupby(‘country’)[‘invoiceno’].nunique().index,
y = df.groupby(‘country’)[‘invoiceno’].nunique().values,
title = ‘Invoice Counts Per Country’,
labels = dict(x = “Countries”, y =”Invoice”))
fig.show();

Let’s group the data

df.groupby([‘invoiceno’,’customerid’, ‘country’])[‘total_price’].sum().sort_values().head()

invoiceno  customerid  country       
C581484    16446.000   United Kingdom   -168469.600
C541433    12346.000   United Kingdom    -77183.600
C556445    15098.000   United Kingdom    -38970.000
C550456    15749.000   United Kingdom    -22998.400
C570556    16029.000   United Kingdom    -11816.640
Name: total_price, dtype: float64

Let’s plot Customer Counts by Country

fig = px.histogram(df, x = df.groupby(‘country’)[‘customerid’].nunique().index,
y = df.groupby(‘country’)[‘customerid’].nunique().values,
title = ‘Customer Counts by Country’,
labels = dict(x = “Countries”, y =”Customer”))
fig.show()

Let’s check missing values

df.isnull().melt(value_name=”missing”)

Clean the Data from the Noise and Missing Values:
missing_values(df)

Drop observations with missing values:
df = df.dropna(subset=[“customerid”])

Get rid of cancellations and negative values:
df = df[(df[‘unitprice’] > 0) & (df[‘quantity’] > 0)]

Check for duplicates and get rid of them:
print(“There are”, df.duplicated(subset=None, keep=’first’).sum(), “duplicated observations in the dataset.”)
print(df.duplicated(subset=None, keep=’first’).sum(), “Duplicated observations are dropped!”)
df.drop_duplicates(keep=’first’, inplace=True)

print(“There are”, df.duplicated(subset=None, keep=’first’).sum(), “duplicated observations in the dataset.”)
print(df.duplicated(subset=None, keep=’first’).sum(), “Duplicated observations are dropped!”)
df.drop_duplicates(keep=’first’, inplace=True)

There are 5192 duplicated observations in the dataset.
5192 Duplicated observations are dropped!

df.columns

Index(['invoiceno', 'stockcode', 'description', 'quantity', 'invoicedate',
       'unitprice', 'customerid', 'country', 'total_price'],
      dtype='object')

Explore the Orders:
cprint(“Unique number of invoice per customer”,’red’)
df.groupby(“customerid”)[“invoiceno”].nunique().sort_values(ascending = False)

Unique number of invoice per customer
customerid
12748.000    209
14911.000    201
17841.000    124
13089.000     97
14606.000     93
            ... 
15314.000      1
15313.000      1
15308.000      1
15307.000      1
15300.000      1
Name: invoiceno, Length: 4338, dtype: int64

Let’s plot the country-based wordcloud

from wordcloud import WordCloud

country_text = df[“country”].str.split(” “).str.join(“_”)
all_countries = ” “.join(country_text)

wc = WordCloud(background_color=”red”,
max_words=250,
max_font_size=256,
random_state=42,
width=800, height=400)
wc.generate(all_countries)
plt.figure(figsize = (12, 10))
plt.imshow(wc)
plt.axis(‘off’)
plt.show()

Let’s focus on the UK online retail market

df_uk = df[df[“country”]==”United Kingdom”]

first_looking(df_uk)
duplicate_values(df_uk)

Shape:(349203, 9)
****************************************************************************************************
Info:

<class 'pandas.core.frame.DataFrame'>
Index: 349203 entries, 0 to 541893
Data columns (total 9 columns):
 #   Column       Non-Null Count   Dtype  
---  ------       --------------   -----  
 0   invoiceno    349203 non-null  object 
 1   stockcode    349203 non-null  object 
 2   description  349203 non-null  object 
 3   quantity     349203 non-null  float64
 4   invoicedate  349203 non-null  object 
 5   unitprice    349203 non-null  float64
 6   customerid   349203 non-null  float64
 7   country      349203 non-null  object 
 8   total_price  349203 non-null  float64
dtypes: float64(4), object(5)
memory usage: 26.6+ MB
None
****************************************************************************************************
Number of Uniques:
invoiceno      16646
stockcode       3645
description     3844
quantity         293
invoicedate    15612
unitprice        402
customerid      3920
country            1
total_price     2792
dtype: int64
****************************************************************************************************
Missing Values:
Empty DataFrame
Columns: [Missing_Number, Missing_Percent]
Index: []
****************************************************************************************************
All Columns:['invoiceno', 'stockcode', 'description', 'quantity', 'invoicedate', 'unitprice', 'customerid', 'country', 'total_price']
****************************************************************************************************
Columns after rename:['invoiceno', 'stockcode', 'description', 'quantity', 'invoicedate', 'unitprice', 'customerid', 'country', 'total_price']
****************************************************************************************************
Columns after rename:['invoiceno', 'stockcode', 'description', 'quantity', 'invoicedate', 'unitprice', 'customerid', 'country', 'total_price']
****************************************************************************************************
Descriptive Statistics 
        quantity  unitprice  customerid  total_price
count 349203.000 349203.000  349203.000   349203.000
mean      12.150      2.970   15548.380       20.860
std      190.630     17.990    1594.380      328.420
min        1.000      0.000   12346.000        0.000
25%        2.000      1.250   14191.000        4.200
50%        4.000      1.950   15518.000       10.200
75%       12.000      3.750   16931.000       17.850
max    80995.000   8142.750   18287.000   168469.600
****************************************************************************************************
Descriptive Statistics (Categorical Columns) 
              count unique                                 top    freq
invoiceno    349203  16646                            576339.0     542
stockcode    349203   3645                              85123A    1936
description  349203   3844  WHITE HANGING HEART T-LIGHT HOLDER    1929
invoicedate  349203  15612                 2011-11-14 15:27:00     542
country      349203      1                      United Kingdom  349203
****************************************************************************************************
Duplicate check...
There are 0 duplicated observations in the dataset.

Let’s prepare the date for the RFM analysis

import datetime as dt
from datetime import datetime
import datetime as dt

df_uk[“ref_date”] = pd.to_datetime(df_uk[‘invoicedate’]).dt.date.max()

df_uk[‘ref_date’] = pd.to_datetime(df_uk[‘ref_date’])

df_uk[“date”] = pd.to_datetime(df_uk[“invoicedate”]).dt.date

df_uk[‘date’] = pd.to_datetime(df_uk[‘date’])

df_uk[‘last_purchase_date’] = df_uk.groupby(‘customerid’)[‘invoicedate’].transform(max)
df_uk[‘last_purchase_date’] = pd.to_datetime(df_uk[‘last_purchase_date’]).dt.date
df_uk[‘last_purchase_date’] = pd.to_datetime(df_uk[‘last_purchase_date’])

df_uk.groupby(‘customerid’)[[‘last_purchase_date’]].max().sample(5)

df_uk[“customer_recency”] = df_uk[“ref_date”] – df_uk[“last_purchase_date”]

df_uk[‘recency_value’] = pd.to_numeric(df_uk[‘customer_recency’].dt.days.astype(‘int64’))

customer_recency = pd.DataFrame(df_uk.groupby(‘customerid’)[‘recency_value’].min())
customer_recency.rename(columns={‘recency_value’:’recency’}, inplace=True)
customer_recency.reset_index(inplace=True)

df_uk.drop(‘last_purchase_date’, axis=1, inplace=True)

Let’s plot Customer Regency Distribution

fig = px.histogram(df_uk, x = ‘recency_value’, title = ‘Customer Regency Distribution’)
fig.show()

Let’s compute Frequency – Number of purchases

df_uk1 = df_uk.copy()
customer_frequency = pd.DataFrame(df_uk.groupby(‘customerid’)[‘invoiceno’].nunique())
customer_frequency.rename(columns={‘invoiceno’:’frequency’}, inplace=True)
customer_frequency.reset_index(inplace=True)

df_uk[‘customer_frequency’] = df_uk.groupby(‘customerid’)[‘invoiceno’].transform(‘count’)

Let’s plot Customer Frequency Distribution

fig = px.histogram(df_uk, x = ‘customer_frequency’, title = ‘Customer Frequency Distribution’)
fig.show()

Let’s calculate Monetary

customer_monetary = pd.DataFrame(df_uk.groupby(‘customerid’)[‘total_price’].sum())
customer_monetary.rename(columns={‘total_price’:’monetary’}, inplace=True)
customer_monetary.reset_index(inplace=True)

df_uk[‘customer_monetary’] = df_uk.groupby(‘customerid’)[‘total_price’].transform(‘sum’)

Let’s plot Customer Monetary Distribution

fig = px.histogram(df_uk, x = ‘customer_monetary’, title = ‘Customer Monetary Distribution’)
fig.show()

Let’s create the RFM Table
customer_rfm = pd.merge(pd.merge(customer_recency, customer_frequency, on=’customerid’), customer_monetary, on=’customerid’)
customer_rfm.head()

customer_rfm.set_index(‘customerid’, inplace = True)

quantiles = customer_rfm.quantile(q = [0.25,0.50,0.75])
quantiles

3 quantiles of recency, frequency and monetary columns

Let’s introduce several functions creating the RFM Segmentation Table

def rec_score(x):
if x < 17.000:
return 4
elif 17.000 <= x < 50.000:
return 3
elif 50.000 <= x < 142.000:
return 2
else:
return 1

def freq_score(x):
if x > 5.000:
return 4
elif 5.000 >= x > 2.000:
return 3
elif 2.000 >= x > 1:
return 2
else:
return 1

def mon_score(x):
if x > 1571.285:
return 4
elif 1571.285 >= x > 644.975:
return 3
elif 644.975 >= x > 298.185:
return 2
else:
return 1

Let’s calculate the RFM score

customer_rfm[‘recency_score’] = customer_rfm[‘recency’].apply(rec_score)
customer_rfm[‘frequency_score’] = customer_rfm[‘frequency’].apply(freq_score)
customer_rfm[‘monetary_score’] = customer_rfm[‘monetary’].apply(mon_score)

customer_rfm.info()

<class 'pandas.core.frame.DataFrame'>
Index: 3920 entries, 12346.0 to 18287.0
Data columns (total 6 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   recency          3920 non-null   int64  
 1   frequency        3920 non-null   int64  
 2   monetary         3920 non-null   float64
 3   recency_score    3920 non-null   int64  
 4   frequency_score  3920 non-null   int64  
 5   monetary_score   3920 non-null   int64  
dtypes: float64(1), int64(5)
memory usage: 214.4 KB

Let’s calculate the total RFM score

customer_rfm[‘RFM_score’] = (customer_rfm[‘recency_score’].astype(str) + customer_rfm[‘frequency_score’].astype(str) +
customer_rfm[‘monetary_score’].astype(str))
customer_rfm.sample(5)

Let’s plot Customer_RFM Distribution

fig = px.histogram(customer_rfm, x = customer_rfm[‘RFM_score’].value_counts().index,
y = customer_rfm[‘RFM_score’].value_counts().values,
title = ‘Customer_RFM Distribution’,
labels = dict(x = “RFM_score”, y =”counts”))
fig.show()

Let’s calculate the RFM_level

customer_rfm[‘RFM_level’] = customer_rfm[‘recency_score’] + customer_rfm[‘frequency_score’] + customer_rfm[‘monetary_score’]
customer_rfm.sample(5)

print(‘Min value for RFM_level : ‘, customer_rfm[‘RFM_level’].min())
print(‘Max value for RFM_level : ‘, customer_rfm[‘RFM_level’].max())

Min value for RFM_level :  3
Max value for RFM_level :  12

Let’s introduce the RFM segment function

def segments(df_rfm):
if df_rfm[‘RFM_level’] == 12 :
return ‘champion’
elif (df_rfm[‘RFM_level’] == 11) or (df_rfm[‘RFM_level’] == 10 ):
return ‘loyal_customer’
elif (df_rfm[‘RFM_level’] == 9) or (df_rfm[‘RFM_level’] == 8 ):
return ‘promising’
elif (df_rfm[‘RFM_level’] == 7) or (df_rfm[‘RFM_level’] == 6 ):
return ‘need_attention’
elif (df_rfm[‘RFM_level’] == 5) or (df_rfm[‘RFM_level’] == 4 ):
return ‘hibernating’
else:
return ‘almost_lost’

Let’s call this function to compute customer_segment

customer_rfm[‘customer_segment’] = customer_rfm.apply(segments,axis=1)
customer_rfm.sample(5)

Let’s plot the customer_segment distribution

fig = px.histogram(customer_rfm, x = customer_rfm[‘customer_segment’].value_counts().index,
y = customer_rfm[‘customer_segment’].value_counts().values,
title = ‘Customer_Segment Distribution’,
labels = dict(x = “customer_segment”, y =”counts”))
fig.show()

Let’s calculate the RFM_level Mean Values

avg_rfm_segment = customer_rfm.groupby(‘customer_segment’).RFM_level.mean().sort_values(ascending=False)
avg_rfm_segment

customer_segment
champion         12.000
loyal_customer   10.482
promising         8.530
need_attention    6.479
hibernating       4.499
almost_lost       3.000
Name: RFM_level, dtype: float64

Let’s plot the Customer_Segment RFM_level Mean Values

fig = px.histogram(customer_rfm, x = customer_rfm.groupby(‘customer_segment’).RFM_level.mean().sort_values(ascending=False).index,
y = customer_rfm.groupby(‘customer_segment’).RFM_level.mean().sort_values(ascending=False).values,
title = ‘Customer_Segment RFM_level Mean Values’,
labels = dict(x = “customer_segment”, y =”RFM_level Mean Values”))
fig.show()

Let’s calculate the size of RFM_Segment

size_rfm_segment = customer_rfm[‘customer_segment’].value_counts()
size_rfm_segment

hibernating       918
promising         783
need_attention    771
loyal_customer    650
champion          402
almost_lost       396
Name: count, dtype: int64

Let’s plot Size RFM_Segment

fig = px.histogram(customer_rfm, x = customer_rfm[‘customer_segment’].value_counts().index,
y = customer_rfm[‘customer_segment’].value_counts().values,
title = ‘Size RFM_Segment’,
labels = dict(x = “customer_segment”, y =”Size RFM_Segment”))
fig.show()

Final output table:

customer_segment = pd.DataFrame(pd.concat([avg_rfm_segment, size_rfm_segment], axis=1))
customer_segment.rename(columns={‘RFM_level’: ‘avg_RFM_level’, ‘customer_segment’: ‘segment_size’}, inplace=True)
customer_segment

Let’s plot the wordcloud of RFM Segments

from wordcloud import WordCloud

segment_text = customer_rfm[“customer_segment”].str.split(” “).str.join(“_”)
all_segments = ” “.join(segment_text)

wc = WordCloud(background_color=”orange”,
max_words=250,
max_font_size=256,
random_state=42,
width=800, height=400)
wc.generate(all_segments)
plt.figure(figsize = (16, 15))
plt.imshow(wc)
plt.title(“RFM Segments”,fontsize=18,fontweight=”bold”)
plt.axis(‘off’)
plt.show()

Let’s check the structure and content of customer_rfm

first_looking(customer_rfm)
duplicate_values(customer_rfm)

Shape:(3920, 9)
****************************************************************************************************
Info:

<class 'pandas.core.frame.DataFrame'>
Index: 3920 entries, 12346.0 to 18287.0
Data columns (total 9 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   recency           3920 non-null   int64  
 1   frequency         3920 non-null   int64  
 2   monetary          3920 non-null   float64
 3   recency_score     3920 non-null   int64  
 4   frequency_score   3920 non-null   int64  
 5   monetary_score    3920 non-null   int64  
 6   RFM_score         3920 non-null   object 
 7   RFM_level         3920 non-null   int64  
 8   customer_segment  3920 non-null   object 
dtypes: float64(1), int64(6), object(2)
memory usage: 306.2+ KB
None
****************************************************************************************************
Number of Uniques:
recency              302
frequency             57
monetary            3849
recency_score          4
frequency_score        4
monetary_score         4
RFM_score             61
RFM_level             10
customer_segment       6
dtype: int64
****************************************************************************************************
Missing Values:
Empty DataFrame
Columns: [Missing_Number, Missing_Percent]
Index: []
****************************************************************************************************
All Columns:['recency', 'frequency', 'monetary', 'recency_score', 'frequency_score', 'monetary_score', 'RFM_score', 'RFM_level', 'customer_segment']
****************************************************************************************************
Columns after rename:['recency', 'frequency', 'monetary', 'recency_score', 'frequency_score', 'monetary_score', 'rfm_score', 'rfm_level', 'customer_segment']
****************************************************************************************************
Columns after rename:['recency', 'frequency', 'monetary', 'recency_score', 'frequency_score', 'monetary_score', 'rfm_score', 'rfm_level', 'customer_segment']
****************************************************************************************************
Descriptive Statistics 
       recency  frequency   monetary  recency_score  frequency_score   
count 3920.000   3920.000   3920.000       3920.000         3920.000  \
mean    91.740      4.250   1858.420          2.480            2.320   
std     99.530      7.200   7478.630          1.110            1.150   
min      0.000      1.000      3.750          1.000            1.000   
25%     17.000      1.000    298.180          1.000            1.000   
50%     50.000      2.000    644.970          2.000            2.000   
75%    142.000      5.000   1571.280          3.000            3.000   
max    373.000    209.000 259657.300          4.000            4.000   

       monetary_score  rfm_level  
count        3920.000   3920.000  
mean            2.500      7.300  
std             1.120      2.880  
min             1.000      3.000  
25%             1.750      5.000  
50%             2.500      7.000  
75%             3.250     10.000  
max             4.000     12.000  
****************************************************************************************************
Descriptive Statistics (Categorical Columns) 
                 count unique          top freq
rfm_score         3920     61          444  402
customer_segment  3920      6  hibernating  918
****************************************************************************************************
Duplicate check...
There are 0 duplicated observations in the dataset.

customer_rfm.groupby(‘rfm_level’).agg({‘recency’: [‘mean’,’min’,’max’,’count’],
‘frequency’: [‘mean’,’min’,’max’,’count’],
‘monetary’: [‘mean’,’min’,’max’,’count’] }).round(1)

Let’s check RFM correlations

plt.figure(figsize = (20,20))
sns.pairplot(customer_rfm[[‘recency’, ‘frequency’, ‘monetary’,’customer_segment’]], hue = ‘customer_segment’);

Let’s compute the RFM correlation matrix

matrix = np.triu(customer_rfm[[‘recency’,’frequency’,’monetary’]].corr())
fig, ax = plt.subplots(figsize=(14,10))
sns.heatmap (customer_rfm[[‘recency’,’frequency’,’monetary’]].corr(), annot=True, fmt= ‘.2f’, vmin=-1, vmax=1, center=0, cmap=’coolwarm’,mask=matrix, ax=ax);

Read more here.

RFM TreeMap: Online Retail Dataset II

Let’s import libraries and read the Online Retail II dataset Year 2010-2011

import datetime as dt
import pandas as pd
pd.set_option(‘display.max_columns’, None)
pd.set_option(‘display.max_rows’, None)
pd.set_option(‘display.float_format’, lambda x: ‘%.2f’ % x)

df_ = pd.read_excel(“online_retail_II.xlsx”,sheet_name=”Year 2010-2011″ )
df = df_.copy()

Let’s check the overall data structure

def check_df(dataframe):
print(“################ Shape ####################”)
print(dataframe.shape)
print(“############### Columns ###################”)
print(dataframe.columns)
print(“############### Types #####################”)
print(dataframe.dtypes)
print(“############### Head ######################”)
print(dataframe.head())
print(“############### Tail ######################”)
print(dataframe.tail())
print(“############### Describe ###################”)
print(dataframe.describe().T)

check_df(df)

################ Shape ####################
(541910, 8)
############### Columns ###################
Index(['Invoice', 'StockCode', 'Description', 'Quantity', 'InvoiceDate',
       'Price', 'Customer ID', 'Country'],
      dtype='object')
############### Types #####################
Invoice                object
StockCode              object
Description            object
Quantity                int64
InvoiceDate    datetime64[ns]
Price                 float64
Customer ID           float64
Country                object
dtype: object
############### Head ######################
  Invoice StockCode                          Description  Quantity   
0  536365    85123A   WHITE HANGING HEART T-LIGHT HOLDER         6  \
1  536365     71053                  WHITE METAL LANTERN         6   
2  536365    84406B       CREAM CUPID HEARTS COAT HANGER         8   
3  536365    84029G  KNITTED UNION FLAG HOT WATER BOTTLE         6   
4  536365    84029E       RED WOOLLY HOTTIE WHITE HEART.         6   

          InvoiceDate  Price  Customer ID         Country  
0 2010-12-01 08:26:00   2.55     17850.00  United Kingdom  
1 2010-12-01 08:26:00   3.39     17850.00  United Kingdom  
2 2010-12-01 08:26:00   2.75     17850.00  United Kingdom  
3 2010-12-01 08:26:00   3.39     17850.00  United Kingdom  
4 2010-12-01 08:26:00   3.39     17850.00  United Kingdom  
############### Tail ######################
       Invoice StockCode                      Description  Quantity   
541905  581587     22899     CHILDREN'S APRON DOLLY GIRL          6  \
541906  581587     23254    CHILDRENS CUTLERY DOLLY GIRL          4   
541907  581587     23255  CHILDRENS CUTLERY CIRCUS PARADE         4   
541908  581587     22138    BAKING SET 9 PIECE RETROSPOT          3   
541909  581587      POST                          POSTAGE         1   

               InvoiceDate  Price  Customer ID Country  
541905 2011-12-09 12:50:00   2.10     12680.00  France  
541906 2011-12-09 12:50:00   4.15     12680.00  France  
541907 2011-12-09 12:50:00   4.15     12680.00  France  
541908 2011-12-09 12:50:00   4.95     12680.00  France  
541909 2011-12-09 12:50:00  18.00     12680.00  France  
############### Describe ###################
                count                           mean                  min   
Quantity    541910.00                           9.55            -80995.00  \
InvoiceDate    541910  2011-07-04 13:35:22.342307584  2010-12-01 08:26:00   
Price       541910.00                           4.61            -11062.06   
Customer ID 406830.00                       15287.68             12346.00   

                             25%                  50%                  75%   
Quantity                    1.00                 3.00                10.00  \
InvoiceDate  2011-03-28 11:34:00  2011-07-19 17:17:00  2011-10-19 11:27:00   
Price                       1.25                 2.08                 4.13   
Customer ID             13953.00             15152.00             16791.00   

                             max     std  
Quantity                80995.00  218.08  
InvoiceDate  2011-12-09 12:50:00     NaN  
Price                   38970.00   96.76  
Customer ID             18287.00 1713.60

####### DATA PREPARATION

df.isnull().any()
df.isnull().sum()
df.dropna(inplace=True)

df[“StockCode”].nunique()

df.groupby(“StockCode”)[“Quantity”].sum()

df.groupby(“StockCode”).agg({“Quantity”:”sum”}).sort_values(by=”Quantity”, ascending=False).head()

df = df[~df[“Invoice”].str.contains(“C”, na=False)]

df[“TotalPrice”] = df[“Price”] * df[“Quantity”]

df[“InvoiceDate”].max()
today_date = dt.datetime(2011, 12, 11)

rfm = df.groupby(“Customer ID”).agg({“InvoiceDate”: lambda InvıiceDate: (today_date- InvıiceDate.max()).days,
“Invoice”: lambda Invoice: Invoice.nunique(),
“TotalPrice”: lambda TotalPrice: TotalPrice.sum()})

rfm.columns = [“recency”,”frequency”,”monetary”]
rfm = rfm[rfm[“monetary”] > 0]

####### RFM CALCULATION

rfm[“recency_score”] = pd.qcut(rfm[‘recency’], 5, labels=[5, 4, 3, 2, 1])

rfm[“frequency_score”] = pd.qcut(rfm[‘frequency’].rank(method=”first”), 5, labels=[1, 2, 3, 4, 5])

rfm[“monetary_score”] = pd.qcut(rfm[‘monetary’], 5, labels=[1, 2, 3, 4, 5])

rfm[“RFM_SCORE”] = (rfm[‘recency_score’].astype(str) + rfm[‘frequency_score’].astype(str))
rfm.head()

rfm.describe().T

rfm[rfm[“RFM_SCORE”] == “55”].head() # champions

rfm[rfm[“RFM_SCORE”] == “11”].head() # hibernating

#######CUSTOMER SEGMENTATION

seg_map = {
r'[1-2][1-2]’: ‘hibernating’,
r'[1-2][3-4]’: ‘at_Risk’,
r'[1-2]5′: ‘cant_loose’,
r’3[1-2]’: ‘about_to_sleep’,
r’33’: ‘need_attention’,
r'[3-4][4-5]’: ‘loyal_customers’,
r’41’: ‘promising’,
r’51’: ‘new_customers’,
r'[4-5][2-3]’: ‘potential_loyalists’,
r’5[4-5]’: ‘champions’
}

rfm[‘segment’] = rfm[‘RFM_SCORE’].replace(seg_map, regex=True)

rfm[[“segment”, “recency”, “frequency”, “monetary”]].groupby(“segment”).agg([“mean”, “count”])

rfm[rfm[“segment”] == “need_attention”].head()

rfm[rfm[“segment”] == “new_customers”].index

seg_list = [“at_Risk”, “hibernating”, “cant_loose”, “loyal_customers”]
for i in seg_list:
print(F” {i.upper()} “.center(50, “*”))
print(rfm[rfm[“segment”]==i].describe().T)

rfm[rfm[“segment”] == “loyal_customers”].index
loyalCus_df = pd.DataFrame()
loyalCus_df[“loyal_customersId”] = rfm[rfm[“segment”] == “loyal_customers”].index
loyalCus_df.head()

loyalCus_df.to_csv(“loyal_customers.csv”)

******************** AT_RISK *********************
           count    mean     std   min    25%    50%     75%      max
recency   593.00  153.79   68.62 73.00  96.00 139.00  195.00   374.00
frequency 593.00    2.88    0.95  2.00   2.00   3.00    3.00     6.00
monetary  593.00 1084.54 2562.07 52.00 412.78 678.25 1200.62 44534.30
****************** HIBERNATING *******************
            count   mean     std   min    25%    50%    75%      max
recency   1071.00 217.61   92.01 73.00 135.00 219.00 289.50   374.00
frequency 1071.00   1.10    0.30  1.00   1.00   1.00   1.00     2.00
monetary  1071.00 488.64 2419.68  3.75 155.11 296.25 457.93 77183.60
******************* CANT_LOOSE *******************
           count    mean     std   min     25%     50%     75%      max
recency    63.00  132.97   65.25 73.00   89.00  108.00  161.00   373.00
frequency  63.00    8.38    4.29  6.00    6.00    7.00    9.00    34.00
monetary   63.00 2796.16 2090.49 70.02 1137.51 2225.97 3532.24 10254.18
**************** LOYAL_CUSTOMERS *****************
           count    mean     std   min    25%     50%     75%       max
recency   819.00   33.61   15.58 15.00  20.00   30.00   44.00     72.00
frequency 819.00    6.48    4.55  3.00   4.00    5.00    8.00     63.00
monetary  819.00 2864.25 6007.06 36.56 991.80 1740.48 3052.91 124914.53

#####CREATE RFM

def create_rfm(dataframe):

dataframe["TotalPrice"] = dataframe["Quantity"] * dataframe["Price"]
dataframe.dropna(inplace=True)
dataframe = dataframe[~dataframe["Invoice"].str.contains("C", na=False)]

today_date = dt.datetime(2011, 12, 11)
rfm = dataframe.groupby('Customer ID').agg({'InvoiceDate': lambda date: (today_date - date.max()).days,
                                            'Invoice': lambda num: num.nunique(),
                                            "TotalPrice": lambda price: price.sum()})
rfm.columns = ['recency', 'frequency', "monetary"]
rfm = rfm[(rfm['monetary'] > 0)]


rfm["recency_score"] = pd.qcut(rfm['recency'], 5, labels=[5, 4, 3, 2, 1])
rfm["frequency_score"] = pd.qcut(rfm["frequency"].rank(method="first"), 5, labels=[1, 2, 3, 4, 5])
rfm["monetary_score"] = pd.qcut(rfm['monetary'], 5, labels=[1, 2, 3, 4, 5])


rfm["RFM_SCORE"] = (rfm['recency_score'].astype(str) +
                    rfm['frequency_score'].astype(str))

rfm.head()

rfm[‘score’]=rfm[‘recency’]rfm[‘frequency’]rfm[‘monetary’]

Statistical informations of segments
rfm[[“segment”, “recency”, “frequency”, “monetary”]].groupby(“segment”).agg([“mean”, “count”])

How many segments are there
segments_counts = rfm[‘segment’].value_counts().sort_values(ascending=True)

Let’s plot our RFM segments

fig, ax = plt.subplots()

bars = ax.barh(range(len(segments_counts)),
segments_counts,
color=’lightcoral’)
ax.set_frame_on(False)
ax.tick_params(left=False,
bottom=False,
labelbottom=False)
ax.set_yticks(range(len(segments_counts)))
ax.set_yticklabels(segments_counts.index)

for i, bar in enumerate(bars):
value = bar.get_width()
if segments_counts.index[i] in [‘Can\’t loose’]:
bar.set_color(‘firebrick’)
ax.text(value,
bar.get_y() + bar.get_height()/2,
‘{:,} ({:}%)’.format(int(value),
int(value*100/segments_counts.sum())),
va=’center’,
ha=’left’
)

plt.show(block=True)

Let’s plot the RFM segmentation treemap

import matplotlib.pyplot as plt
import squarify
sizes=[24, 18, 14, 13,11,8,4,3,2,1]
label=[“hibernating”, “loyal_customers”, ‘champions’, ‘at_risk’,’potential_loyalists’,’about_to_sleep’,’need_attention’,’promising’,’CL’,’NC’]

colors = [‘#91DCEA’, ‘#64CDCC’, ‘#5FBB68’,
‘#F9D23C’, ‘#F9A729’, ‘#FD6F30′,’grey’,’red’,’blue’,’cyan’]
squarify.plot(sizes=sizes, label=label, color=colors, alpha=0.6 )
plt.show()
plt.axis(“off”)

where CL = cant_loose and NC = new_cutomers.

Read more here.

CRM Analytics: CLTV

Let’s look at the CRM Analytics – CLTV (Customer Lifetime Value).

Importing libraries:
import datetime as dt
import pandas as pd
import warnings
warnings.filterwarnings(‘ignore’)
import matplotlib.pyplot as plt

and set display setting
pd.set_option(“display.max_columns”, None)
pd.set_option(“display.float_format”, lambda x: “%.3f” % x)

####### DATA PREPARATION

Reading the Online Retail II dataset and creating a copy of it
df_ = pd.read_excel(“online_retail.xlsx”)
df = df_.copy()
df.head()

df.shape

(525461, 8)

Number of unique products
df[“Description”].nunique()

value_counts() of each product
df[“Description”].value_counts().head()

WHITE HANGING HEART T-LIGHT HOLDER    3549
REGENCY CAKESTAND 3 TIER              2212
STRAWBERRY CERAMIC TRINKET BOX        1843
PACK OF 72 RETRO SPOT CAKE CASES      1466
ASSORTED COLOUR BIRD ORNAMENT         1457
Name: Description, dtype: int64

What is the most ordered product
df.groupby(“Description”).agg({“Quantity”: “sum”}).head()

Missing value check
df.isnull().sum

Customers with missing Customer ID must be removed from the dataset
df.dropna(inplace =True)

Adding the total price on the basis of products to the data set as a variable
df[“TotalPrice”] = df[“Quantity”] * df[“Price”]

Let’s remove attribute values started with “C” (returned products)

df = df[~df[“Invoice”].str.contains(“C”, na=False)]

with the descriptive statistics of the dataset
df.describe().T

######### CALCULATION OF RFM METRICS

Last date in the InvoiceDate
df[“InvoiceDate”].max()

Timestamp('2010-12-09 20:01:00')

Let’s set 2 days after the last date
analysis_date = dt.datetime(2010, 12, 11)

Let’s introduce the following RFM columns

rfm = df.groupby(“Customer ID”).agg({“InvoiceDate”: lambda invoice_date: (analysis_date – invoice_date.max()).days,
“Invoice”: lambda invoice: invoice.nunique(),
“TotalPrice”: lambda total_price: total_price.sum()})
rfm.columns = [“recency”, “frequency”, “monetary”]
rfm.head()

while excluding monetary<=0

rfm = rfm[rfm[“monetary”] > 0]

and checking descriptive statistics

rfm.describe().T

RECENCY SCORE
rfm[“recency_score”] = pd.qcut(rfm[“recency”], 5, labels=[5, 4, 3, 2, 1])

FREQUENCY SCORE
rfm[“frequency_score”] = pd.qcut(rfm[“frequency”].rank(method=”first”), 5, labels=[1, 2, 3, 4, 5])

MONETARY SCORE
rfm[“monetary_score”] = pd.qcut(rfm[“monetary”], 5, labels=[1, 2, 3, 4, 5])

RF SCORE
rfm[“RF_SCORE”] = (rfm[“recency_score”].astype(str) + rfm[“frequency_score”].astype(str))

rfm.head()

Let’s convert RF_SCORE to segment with regex=True

rfm[‘segment’] = rfm[‘RF_SCORE’].replace(seg_map, regex=True)

rfm.head()

Let’s compute mean and count of our RFM columns sorted by segment in the ascending order

rfm[[“segment”, “recency”, “frequency”, “monetary”]].groupby(“segment”).agg([“mean”, “count”])

segments_counts = rfm[‘segment’].value_counts().sort_values(ascending=True)

Let’s plot our RFM segments

fig, ax = plt.subplots()

plt.show(block=True)

########## CLTV CALCULATION

Let’s prepare our input data

cltv_c = df.groupby(‘Customer ID’).agg({“Invoice”: lambda x: x.nunique(),
“Quantity”: lambda x: x.sum(),
“TotalPrice”: lambda x: x.sum()})

cltv_c.columns = [“total_transaction”, “total_unit”, “total_price”]

Let’s calculate the following metrics:

profit margin:
cltv_c[“profit_margin”] = cltv_c[“total_price”] * 0.10

Churn Rate:

repeat_rate = cltv_c[cltv_c[“total_transaction”] > 1].shape[0] / cltv_c.shape[0]

churn_rate = 1 – repeat_rate

Purchase Frequency:
cltv_c[“purchase_frequency”] = cltv_c[“total_transaction”] / cltv_c.shape[0]

Average Order Value:
cltv_c[“average_order_value”] = cltv_c[“total_price”] / cltv_c[“total_transaction”]

Customer Value:
cltv_c[“customer_value”] = cltv_c[“average_order_value”] * cltv_c[“purchase_frequency”]

CLTV (Customer Lifetime Value):
cltv_c[“cltv”] = (cltv_c[“customer_value”] / churn_rate) * cltv_c[“profit_margin”]

Sorting our dataset by cltv in the descending order
cltv_c.sort_values(by = “cltv”, ascending=False).head()

Eliminating outliers with interquantile_range
def outlier_threshold(dataframe, variable):
quartile1 = dataframe[variable].quantile(0.01)
quartile3 = dataframe[variable].quantile(0.99)
interquantile_range = quartile3 – quartile1
up_limit = quartile3 + 1.5 * interquantile_range
low_limit = quartile1 – 1.5 * interquantile_range
return low_limit, up_limit

def replace_with_thresholds(dataframe, variable):
low_limit, up_limit = outlier_threshold(dataframe, variable)
# dataframe.loc[(dataframe[variable] < low_limit), variable] = low_limit dataframe.loc[(dataframe[variable] > up_limit), variable] = up_limit

replace_with_thresholds(df, “Quantity”)
replace_with_thresholds(df, “Price”)

Define the analysis date

df[“InvoiceDate”].max()
analysis_date = dt.datetime(2011, 12, 11)

############ Preparation of Lifetime Data Structure

cltv_df = df.groupby(“Customer ID”).agg({“InvoiceDate”: [lambda InvoiceDate: (InvoiceDate.max() – InvoiceDate.min()).days / 7,
lambda InvoiceDate: (analysis_date – InvoiceDate.min()).days / 7],
“Invoice”: lambda Invoice: Invoice.nunique(),
“TotalPrice”: lambda TotalPrice: TotalPrice.sum()})

cltv_df.columns = cltv_df.columns.droplevel(0)

cltv_df.columns = [“recency”, “T”, “frequency”, “monetary”]

cltv_df[“monetary”] = cltv_df[“monetary”] / cltv_df[“frequency”]

cltv_df = cltv_df[cltv_df[“frequency”] > 1]

cltv_df.head()

Data columns recency, T, frequency, and monetary.

cltv_df.describe().T

Descriptive statistics of data columns recency, T, frequency, and monetary.

#######CLTV MODELLING

Let’s install and import the following libraries

!pip install Lifetimes

from lifetimes import BetaGeoFitter
from lifetimes import BetaGeoFitter
from lifetimes import GammaGammaFitter
from lifetimes.plotting import plot_period_transactions

import datetime as dt
import pandas as pd
import warnings
warnings.filterwarnings(‘ignore’)
import matplotlib.pyplot as plt
pd.set_option(‘display.max_columns’, None)
pd.set_option(‘display.float_format’, lambda x: ‘%.5f’ % x)
pd.set_option(‘display.width’, 500)

######BG/NBD Model

bgf = BetaGeoFitter(penalizer_coef=0.001)

bgf.fit(cltv_df[“frequency”],
cltv_df[“recency”],
cltv_df[“T”])

<lifetimes.BetaGeoFitter: fitted with 2893 subjects, a: 1.93, alpha: 9.47, b: 6.27, r: 2.22>

Let’s look at the first 10 customers with the highest purchase:
bgf.conditional_expected_number_of_purchases_up_to_time(1, # number of weeks
cltv_df[“frequency”],
cltv_df[“recency”],
cltv_df[“T”]).sort_values(ascending=False).head(10)

Customer ID
15989.00000   0.00756
16720.00000   0.00722
14119.00000   0.00720
16204.00000   0.00715
17591.00000   0.00714
15169.00000   0.00700
17193.00000   0.00698
17251.00000   0.00696
17411.00000   0.00690
17530.00000   0.00680
dtype: float64

We can do same thing with predict:
bgf.predict(1, #number of weeks
cltv_df[“frequency”],
cltv_df[“recency”],
cltv_df[“T”]).sort_values(ascending=False).head(10)

Let’s apply predict for 1 month:
bgf.predict(4, # 4 weeks = 1 month
cltv_df[“frequency”],
cltv_df[“recency”],
cltv_df[“T”]).sort_values(ascending=False).head(10)

Customer ID
15989.00000   0.02986
16720.00000   0.02849
14119.00000   0.02841
16204.00000   0.02825
17591.00000   0.02816
15169.00000   0.02763
17193.00000   0.02756
17251.00000   0.02748
17411.00000   0.02726
17530.00000   0.02682
dtype: float64

Let’s compare expected purchases for several time intervals:

Expected purchases for 1 months
cltv_df[“expected_purchase_1_month”] = bgf.predict(4,
cltv_df[“frequency”],
cltv_df[“recency”],
cltv_df[“T”])

Expected purchases for 3 months
cltv_df[“expected_purchase_3_month”] = bgf.predict(4 * 3,
cltv_df[“frequency”],
cltv_df[“recency”],
cltv_df[“T”])

Expected purchases for 1 year
cltv_df[“expected_purchase_1_year”] = bgf.predict(4 * 12,
cltv_df[“frequency”],
cltv_df[“recency”],
cltv_df[“T”])
cltv_df.head()

Expected purchases for 1, 3, and 12 months

Evaluation of predicted results:
plot_period_transactions(bgf)
plt.show(block=True)

Frequency of repeat transactions: actual vs model.

Let’s plot the FR matrix

from lifetimes.plotting import plot_frequency_recency_matrix
import matplotlib.pyplot as plt
fig = plt.figure(figsize=(12,8))
plot_frequency_recency_matrix(bgf)
plt.show(block=True)

Expected number of future purchases for 1 unit of time, by frequency and recency of a customer

############ Gamma-Gamma Submodel

ggf = GammaGammaFitter(penalizer_coef=0.01)

ggf.fit(cltv_df[“frequency”], cltv_df[“monetary”])

cltv_df[“expected_average_profit”] = ggf.conditional_expected_average_profit(cltv_df[“frequency”],
cltv_df[“monetary”])

cltv_df.sort_values(“expected_average_profit”, ascending=False).head(10)

######### Calculation of CLTV with BG/NBD and GG Model

cltv = ggf.customer_lifetime_value(bgf,
cltv_df[“frequency”],
cltv_df[“recency”],
cltv_df[“T”],
cltv_df[“monetary”],
time=3, # 3 month
freq=”W”, # frequency of T
discount_rate=0.01)

cltv = cltv.reset_index()

cltv_final = cltv_df.merge(cltv, on=”Customer ID”, how=”left”)

cltv_final.sort_values(by=”clv”, ascending=False).head(10)

########## Segmentation according to CLTV

cltv_final[“segment”] = pd.qcut(cltv_final[“clv”], 4, labels=[“D”, “C”, “B”, “A”])

cltv_final.groupby(“segment”).agg({“count”, “mean”, “sum”})

_df_plt = cltv_final[[“expected_average_profit”, “segment”]]
plot_df = _df_plt.groupby(“segment”).agg(“mean”)
plot_df.plot(kind=”bar”, color=”tomato”,figsize=(15, 8))
plt.ylabel(“mean”)

plt.show(block=True)

Expected average profit for CLTV segments A-D

Read more here.

Cohort Analysis in Online Retail

Let’s dive into Cohort Analysis using the Online Retail Dataset.

Let’s install and import standard libraries

!pip install sklearn-pandas

import pandas as pd
import numpy as np
import datetime as dt
import seaborn as sns

from sklearn import preprocessing
import sklearn
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import make_classification
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn_pandas import DataFrameMapper
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import Normalizer
from matplotlib import pyplot as plt

Reading the dataset
df = pd.read_excel(“online_retail.xlsx”)

#############DATA PREPARATION

df.shape

(525461, 8)

print(‘Duplicate entries: {}’.format(df.duplicated().sum()))
print(‘{}% rows are duplicate.’.format(round((df.duplicated().sum()/df.shape[0])*100),2))
df.drop_duplicates(inplace = True)

Duplicate entries: 6865
1% rows are duplicate.

pd.DataFrame([{‘products’: len(df[‘StockCode’].value_counts()),
‘transactions’: len(df[‘Invoice’].value_counts()),
‘customers’: len(df[‘Customer ID’].value_counts()),
}], columns = [‘products’, ‘transactions’, ‘customers’], index = [‘quantity’])

products transactions customers

quantity 4632 28816 4383

temp = df.groupby([‘Country’],as_index=False).agg({‘Invoice’:’nunique’}).rename(columns = {‘Invoice’:’Orders’})
total = temp[‘Orders’].sum(axis=0)
temp[‘%Orders’] = round((temp[‘Orders’]/total)*100,4)

temp.sort_values(by=[‘%Orders’],ascending=False,inplace=True)
temp.reset_index(drop=True,inplace=True)

Let’s plot %Orders vs Country

plt.figure(figsize=(13,6))
splot=sns.barplot(x=”Country”,y=”%Orders”,data=temp[:10])
for p in splot.patches:
splot.annotate(format(p.get_height(), ‘.1f’),
(p.get_x() + p.get_width() / 2., p.get_height()),
ha = ‘center’, va = ‘center’,
xytext = (0, 9),
textcoords = ‘offset points’)
plt.xlabel(“Country”, size=14)
plt.ylabel(“%Orders”, size=14)

invoices = df[‘Invoice’]

x = invoices.str.contains(‘C’, regex=True)
x.fillna(0, inplace=True)

x = x.astype(int)

x.value_counts()

0    508414
1     10182
Name: Invoice, dtype: int64

df[‘order_canceled’] = x
#df.head()

n1 = df[‘order_canceled’].value_counts()[1]
n2 = df.shape[0]
print(‘Number of orders canceled: {}/{} ({:.2f}%) ‘.format(n1, n2, n1/n2*100))

Number of orders canceled: 10182/518596 (1.96%)

df = df.loc[df[‘order_canceled’] == 0,:]

df.reset_index(drop=True,inplace=True)

df.loc[df[‘Quantity’] < 0,:]

df = df[df[‘Customer ID’].notna()]

df.reset_index(drop=True,inplace=True)

#############NL COHORT INDEX ANALYSIS

df_uk = df[df.Country == ‘Netherlands’]

df_uk.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2729 entries, 4346 to 387139
Data columns (total 9 columns):
 #   Column          Non-Null Count  Dtype         
---  ------          --------------  -----         
 0   Invoice         2729 non-null   object        
 1   StockCode       2729 non-null   object        
 2   Description     2729 non-null   object        
 3   Quantity        2729 non-null   int64         
 4   InvoiceDate     2729 non-null   datetime64[ns]
 5   Price           2729 non-null   float64       
 6   Customer ID     2729 non-null   float64       
 7   Country         2729 non-null   object        
 8   order_canceled  2729 non-null   int32         
dtypes: datetime64[ns](1), float64(2), int32(1), int64(1), object(4)
memory usage: 202.5+ KB

products transactions customers

Quantity 4017 19215 4314

cohort_data = df_uk[[‘Invoice’,’StockCode’,’Description’,’Quantity’,’InvoiceDate’,’Price’,’Customer ID’,’Country’]]

Checking for nulls in the data
cohort_data.isnull().sum()

Invoice        0
StockCode      0
Description    0
Quantity       0
InvoiceDate    0
Price          0
Customer ID    0
Country        0
dtype: int64

all_dates = (pd.to_datetime(cohort_data[‘InvoiceDate’])).apply(lambda x:x.date())

(all_dates.max() – all_dates.min()).days

Start and end dates:
print(‘Start date: {}’.format(all_dates.min()))
print(‘End date: {}’.format(all_dates.max()))

Start date: 2009-12-02
End date: 2010-12-01

cohort_data.head()

Invoice	StockCode	Description	Quantity	InvoiceDate	Price	Customer ID	Country

def get_month(x):
return dt.datetime(x.year, x.month, 1)

cohort_data[‘InvoiceMonth’] = cohort_data[‘InvoiceDate’].apply(get_month)
grouping = cohort_data.groupby(‘Customer ID’)[‘InvoiceMonth’]
cohort_data[‘CohortMonth’] = grouping.transform(‘min’)
cohort_data.head()

Invoice	StockCode	Description	Quantity	InvoiceDate	Price	Customer ID	Country	InvoiceMonth	CohortMonth

def get_date_int(df, column):
year = df[column].dt.year
month = df[column].dt.month
day = df[column].dt.day
return year, month, day

invoice_year, invoice_month, _ = get_date_int(cohort_data, ‘InvoiceMonth’)
cohort_year, cohort_month, _ = get_date_int(cohort_data, ‘CohortMonth’)
years_diff = invoice_year – cohort_year
months_diff = invoice_month – cohort_month
cohort_data[‘CohortIndex’] = years_diff * 12 + months_diff
#cohort_data.head()

grouping = cohort_data.groupby([‘CohortMonth’, ‘CohortIndex’])

cohort_data = grouping[‘Quantity’].mean()
cohort_data = cohort_data.reset_index()
average_quantity = cohort_data.pivot(index=’CohortMonth’,columns=’CohortIndex’,values=’Quantity’)
average_quantity.round(1)
#average_quantity

Let’s plot retention rates heatmap

plt.figure(figsize=(10, 8))
plt.title(‘Retention rates’)
sns.heatmap(data = retention,annot = True,fmt = ‘.0%’,vmin = 0.0,vmax = 0.5,cmap = ‘BuGn’)
plt.show()

Retention rates vs Cohort Month/Index heatmap

Let’s compute TotalPrice

grdf=grouping.apply(pd.DataFrame)

grdf[‘TotalPrice’]=grdf[‘Quantity’]*grdf[‘Price’]

grdf[grdf[‘TotalPrice’]>139.2].sort_values(‘TotalPrice’,ascending=False)

#############RFM SCORE ANALYSIS

from dateutil.relativedelta import relativedelta
start_date = all_dates.max()-relativedelta(months=12,days=-1)
print(‘Start date: {}’.format(start_date))
print(‘End date: {}’.format(all_dates.max()))

Start date: 2009-12-02
End date: 2010-12-01

data_rfm = grdf[grdf[‘InvoiceDate’] >= pd.to_datetime(start_date)]
data_rfm.reset_index(drop=True,inplace=True)
data_rfm.head()

Invoice	StockCode	Description	Quantity	InvoiceDate	Price	Customer ID	Country	InvoiceMonth	CohortMonth	CohortIndex	TotalPrice

snapshot_date = max(data_rfm.InvoiceDate) + dt.timedelta(days=1)
print(‘Snapshot date: {}’.format(snapshot_date.date()))

Snapshot date: 2010-12-02

Aggregate data on a customer level
data = data_rfm.groupby([‘Customer ID’],as_index=False).agg({‘InvoiceDate’: lambda x: (snapshot_date – x.max()).days,
‘Invoice’: ‘count’,
‘TotalPrice’: ‘sum’}).rename(columns = {‘InvoiceDate’: ‘Recency’,
‘Invoice’: ‘Frequency’,
‘TotalPrice’: ‘MonetaryValue’})

r_labels = range(4, 0, -1)
r_quartiles = pd.qcut(data[‘Recency’], 4, labels = r_labels)
data = data.assign(R = r_quartiles.values)

f_labels = range(1,5)
m_labels = range(1,5)
f_quartiles = pd.qcut(data[‘Frequency’], 4, labels = f_labels)
m_quartiles = pd.qcut(data[‘MonetaryValue’], 4, labels = m_labels)
data = data.assign(F = f_quartiles.values)
data = data.assign(M = m_quartiles.values)
#data.head()

def join_rfm(x):
return str(x[‘R’]) + str(x[‘F’]) + str(x[‘M’])

data[‘RFM_Segment’] = data.apply(join_rfm, axis=1)
data[‘RFM_Score’] = data[[‘R’,’F’,’M’]].sum(axis=1)
data.head()

data.groupby(‘RFM_Score’).agg({‘Recency’: ‘mean’,
‘Frequency’: ‘mean’,
‘MonetaryValue’: [‘mean’, ‘count’] }).round(1)

Let’s create segment

def create_segment(df):
if df[‘RFM_Score’] >= 9:
return ‘Top’
elif (df[‘RFM_Score’] >= 5) and (df[‘RFM_Score’] < 9):
return ‘Middle’
else:
return ‘Low’

data[‘General_Segment’] = data.apply(create_segment, axis=1)
data.groupby(‘General_Segment’).agg({‘Recency’: ‘mean’,
‘Frequency’: ‘mean’,
‘MonetaryValue’: [‘mean’, ‘count’]}).round(1)

Plotting the distributions of Recency, Frequency and Monetary Value variables

plt.figure(figsize=(12,10))

plt.subplot(3, 1, 1); sns.distplot(data[‘Recency’])

plt.subplot(3, 1, 2); sns.distplot(data[‘Frequency’])

plt.subplot(3, 1, 3); sns.distplot(data[‘MonetaryValue’])

Checking for constant mean and variance.
data[[‘Recency’,’Frequency’,’MonetaryValue’]].describe()

data[data[‘MonetaryValue’] == 0]

raw_data = data[[‘Recency’,’Frequency’,’MonetaryValue’]]

#############RFM SCORE TRANSFORMATIONS

Let’s unskew the data
data_log = np.log(raw_data)

Initialize a standard scaler and fit it
scaler = StandardScaler()
scaler.fit(data_log)

Scale and center the data
data_normalized = scaler.transform(data_log)

Create a pandas DataFrame
data_norm = pd.DataFrame(data=data_log, index=raw_data.index, columns=raw_data.columns)

Let’s plot the transformed RFM distributions

plt.figure(figsize=(12,10))

plt.subplot(3, 1, 1); sns.distplot(data_norm[‘Recency’])

plt.subplot(3, 1, 2); sns.distplot(data_norm[‘Frequency’])

plt.subplot(3, 1, 3); sns.distplot(data_norm[‘MonetaryValue’])

plt.show()

#############K-MEANS CLUSTERING

from sklearn.cluster import KMeans

sse = {}

Fit KMeans and calculate SSE for each k
for k in range(1, 21):

# Initialize KMeans with k clusters
kmeans = KMeans(n_clusters=k, random_state=1)

# Fit KMeans on the normalized dataset
kmeans.fit(data_norm)

# Assign sum of squared distances to k element of dictionary
sse[k] = kmeans.inertia_

Let’s look at the Elbow plot

plt.figure(figsize=(12,8))

plt.title(‘The Elbow Method’)
plt.xlabel(‘k’);
plt.ylabel(‘Sum of squared errors’)
sns.pointplot(x=list(sse.keys()), y=list(sse.values()))
plt.show()

Let’s apply Kmeans with n_clusters=3

kmeans = KMeans(n_clusters=3, random_state=1)

Compute k-means clustering on pre-processed data
kmeans.fit(data_norm)

Extract cluster labels from labels_ attribute
cluster_labels = kmeans.labels_

Create a cluster label column in the original DataFrame
data_norm_k3 = data_norm.assign(Cluster = cluster_labels)
data_k3 = raw_data.assign(Cluster = cluster_labels)

Calculate average RFM values and size for each cluster
summary_k3 = data_k3.groupby([‘Cluster’]).agg({‘Recency’: ‘mean’,
‘Frequency’: ‘mean’,
‘MonetaryValue’: [‘mean’, ‘count’],}).round(0)

summary_k3

Let’s apply Kmeans with n_clusters=4

kmeans = KMeans(n_clusters=4, random_state=1)

Compute k-means clustering on pre-processed data
kmeans.fit(data_norm)

Extract cluster labels from labels_ attribute
cluster_labels = kmeans.labels_

Create a cluster label column in the original DataFrame
data_norm_k4 = data_norm.assign(Cluster = cluster_labels)
data_k4 = raw_data.assign(Cluster = cluster_labels)

Calculate average RFM values and size for each cluster
summary_k4 = data_k4.groupby([‘Cluster’]).agg({‘Recency’: ‘mean’,
‘Frequency’: ‘mean’,
‘MonetaryValue’: [‘mean’, ‘count’],}).round(0)

summary_k4

data_norm_k4.index = data[‘Customer ID’].astype(int)

Let’s invoke melt to massage the data as follows
data_melt = pd.melt(data_norm_k4.reset_index(),
id_vars=[‘Customer ID’, ‘Cluster’],
value_vars=[‘Recency’, ‘Frequency’, ‘MonetaryValue’],
var_name=’Attribute’,
value_name=’Value’)

Let’s look at the RFM snake plot for 4 clusters

plt.title(‘Snake plot of standardized variables’)
sns.lineplot(x=”Attribute”, y=”Value”, hue=’Cluster’, data=data_melt)

######RFM Data Wrangling:

data_k4.index = data[‘Customer ID’].astype(int)

raw_data.index = data[‘Customer ID’].astype(int)

cluster_avg = data_k4.groupby([‘Cluster’]).mean()
population_avg = raw_data.head().mean()

population_avg

Recency            88.400
Frequency          65.800
MonetaryValue    1224.666
dtype: float64

relative_imp = cluster_avg / population_avg – 1

Let’s plot Relative importance of attributes as a sns heatmap

plt.figure(figsize=(8, 4))
plt.title(‘Relative importance of attributes’)
sns.heatmap(data=relative_imp, annot=True, fmt=’.2f’, cmap=’RdYlGn’)
plt.show()

Read more here.

Customer Segmentation, RFM, K-Means and Cohort Analysis in Online Retail

Let’s set the working directory YOURPATH

import os
os.chdir(‘YOURPATH’)
os. getcwd()

and import standard libraries with optional settings

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import scipy.stats as stats
%matplotlib inline
import statsmodels.api as sm
import statsmodels.formula.api as smf
import missingno as msno

from sklearn.compose import make_column_transformer

from sklearn.preprocessing import scale
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import PolynomialFeatures
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import PowerTransformer
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import RobustScaler

import cufflinks as cf
import plotly.offline
cf.go_offline()
cf.set_config_file(offline=False, world_readable=True)

import warnings
warnings.filterwarnings(“ignore”)
warnings.warn(“this will not show”)

import colorama
from colorama import Fore, Style
from termcolor import colored

import ipywidgets
from ipywidgets import interact

sns.set_style(“whitegrid”)

Let’s read the input dataset

df=pd.read_csv(‘Online Retail.csv’,index_col=0)

and define some useful functions

#######################

def first_looking(df):
print(colored(“Shape:”, attrs=[‘bold’]), df.shape,’\n’,
colored(‘-‘79, ‘red’, attrs=[‘bold’]), colored(“\nInfo:\n”, attrs=[‘bold’]), sep=”) print(df.info(), ‘\n’, colored(‘-‘79, ‘red’, attrs=[‘bold’]), sep=”)
print(colored(“Number of Uniques:\n”, attrs=[‘bold’]), df.nunique(),’\n’,
colored(‘-‘79, ‘red’, attrs=[‘bold’]), sep=”) print(colored(“Missing Values:\n”, attrs=[‘bold’]), missing_values(df),’\n’, colored(‘-‘79, ‘red’, attrs=[‘bold’]), sep=”)
print(colored(“All Columns:”, attrs=[‘bold’]), list(df.columns),’\n’,
colored(‘-‘*79, ‘red’, attrs=[‘bold’]), sep=”)

df.columns= df.columns.str.lower().str.replace('&', '_').str.replace(' ', '_')

print(colored("Columns after rename:", attrs=['bold']), list(df.columns),'\n',
          colored('-'*79, 'red', attrs=['bold']), sep='')

def duplicate_values(df):
print(colored(“Duplicate check…”, attrs=[‘bold’]), sep=”)
duplicate_values = df.duplicated(subset=None, keep=’first’).sum()
if duplicate_values > 0:
df.drop_duplicates(keep=’first’, inplace=True)
print(duplicate_values, colored(” Duplicates were dropped!”),’\n’,
colored(‘-‘79, ‘red’, attrs=[‘bold’]), sep=”) else: print(colored(“There are no duplicates”),’\n’, colored(‘-‘79, ‘red’, attrs=[‘bold’]), sep=”)

def drop_columns(df, drop_columns):
if drop_columns !=[]:
df.drop(drop_columns, axis=1, inplace=True)
print(drop_columns, ‘were dropped’)
else:
print(colored(‘We will now check the missing values and if necessary will drop realted columns!’, attrs=[‘bold’]),’\n’,
colored(‘-‘*79, ‘red’, attrs=[‘bold’]), sep=”)

#######################

Let’s check the structure and content of our input data

first_looking(df)
duplicate_values(df)
drop_columns(df,[])
drop_null(df, 90)
df.head()
df.tail()
df.sample(5)
df.describe().T
df.describe(include=object).T

Shape:(541909, 8)
-------------------------------------------------------------------------------
Info:

<class 'pandas.core.frame.DataFrame'>
Index: 541909 entries, 0 to 541908
Data columns (total 8 columns):
 #   Column       Non-Null Count   Dtype  
---  ------       --------------   -----  
 0   InvoiceNo    541909 non-null  object 
 1   StockCode    541909 non-null  object 
 2   Description  540455 non-null  object 
 3   Quantity     541909 non-null  float64
 4   InvoiceDate  541909 non-null  object 
 5   UnitPrice    541909 non-null  float64
 6   CustomerID   406829 non-null  float64
 7   Country      541909 non-null  object 
dtypes: float64(3), object(5)
memory usage: 37.2+ MB
None
-------------------------------------------------------------------------------
Number of Uniques:
InvoiceNo      25900
StockCode       4070
Description     4223
Quantity         722
InvoiceDate    23260
UnitPrice       1630
CustomerID      4372
Country           38
dtype: int64
-------------------------------------------------------------------------------
Missing Values:
             Missing_Number  Missing_Percent
CustomerID           135080            0.249
Description            1454            0.003
-------------------------------------------------------------------------------
All Columns:['InvoiceNo', 'StockCode', 'Description', 'Quantity', 'InvoiceDate', 'UnitPrice', 'CustomerID', 'Country']
-------------------------------------------------------------------------------
Columns after rename:['invoiceno', 'stockcode', 'description', 'quantity', 'invoicedate', 'unitprice', 'customerid', 'country']
-------------------------------------------------------------------------------
Duplicate check...
5268 Duplicates were dropped!
-------------------------------------------------------------------------------
We will now check the missing values and if necessary will drop realted columns!
-------------------------------------------------------------------------------
Shape: (536641, 8)
New shape after missing value control: (536641, 8)

######## DATA WRANGLING

df_model = df.copy()

df.isnull().sum()

invoiceno           0
stockcode           0
description      1454
quantity            0
invoicedate         0
unitprice           0
customerid     135037
country             0
dtype: int64

We will drop NaN values of customerid
df = df.dropna(subset=[‘customerid’])
df.shape

(401604, 8)

df[‘cancelled’] = df[‘invoiceno’].str.contains(‘C’)

Let’s drop cancelled
df = df.drop([‘cancelled’],axis=1)

Clean the Data from the Noise and Missing Values
df = df[(df.quantity > 0) & (df.unitprice > 0)]
df.shape

(392692, 8)

Let’s look at the treemap Number of Customers Per Country Without UK

fig = px.treemap(df.groupby(‘country’)[[“customerid”]].nunique().sort_values(by=”customerid”, ascending=False).iloc[1:],
path=[df.groupby(‘country’)[[“customerid”]].nunique().sort_values(by=”customerid”, ascending=False).iloc[1:].index],
values=’customerid’,
width=1000,
height=600)
fig.update_layout(title_text=’Number of Customers Per Country Without UK’,
title_x = 0.5, title_font = dict(size=20)
)
fig.update_layout(margin = dict(t=50, l=25, r=25, b=25))
fig.show()

#######TOTAL PRICE aka REVENUE

The calculation of total price
df[‘total_price’] = df[‘quantity’] * df[‘unitprice’]

Let’s save this dataframe for future analysis
df_total_price = pd.DataFrame(df.groupby(‘country’)[“total_price”].sum().sort_values(ascending=False))

Let’s plot the treemap Total Prices By Countries Without UK

fig = px.treemap(df_total_price.iloc[1:],
path=[df_total_price.iloc[1:].index],
values=’total_price’,
width=1000,
height=600)
fig.update_layout(title_text=’Total Prices By Countries Without UK’,
title_x = 0.5, title_font = dict(size=20)
)
fig.update_layout(margin = dict(t=50, l=25, r=25, b=25))
fig.show()

#########UK RFM ANALYSIS

df_uk = df[df[“country”] == ‘United Kingdom’]

Top 15 most owned products
df_uk[“stockcode”].value_counts().head(15).plot(kind=”bar”, width=0.5, color=’pink’, edgecolor=’purple’, figsize=(16,9))
plt.xticks(rotation=45);

last_invoice = max(df_uk[‘invoicedate’])
last_invoice

'2011-12-09 12:49:00'

first_invoice = min(df_uk[‘invoicedate’])
first_invoice

'2010-12-01 08:26:00'

df_uk[‘invoicedate’] = pd.to_datetime(df_uk[‘invoicedate’])

df_uk[“date”] = df_uk[‘invoicedate’].dt.date

df_uk[“last_purchased_date”] = df_uk.groupby(“customerid”).date.transform(max)

last_invoice = pd.to_datetime(last_invoice).date()

df_uk.groupby(‘customerid’)[‘last_purchased_date’].apply(lambda x: last_invoice – x)

# Recency

df_recency = df_uk.groupby(by=’customerid’,
as_index=False)[‘date’].max()
df_recency.columns = [‘customerid’, ‘last_purchased_date’]
recent_date = df_recency[‘last_purchased_date’].max()
df_recency[‘recency’] = df_recency[‘last_purchased_date’].apply(
lambda x: (recent_date – x).days)
#df_recency.head()

Let’s plot recency

plt.figure(figsize=(16,9))
sns.distplot(df_recency[‘recency’], bins=35);

# Frequency: Number of purchases

df_uk[‘frequency’] = df_uk.groupby(‘customerid’)[“invoiceno”].transform(‘count’)

plt.figure(figsize=(16,9))
sns.distplot(df_uk[‘frequency’], bins=50);

########### Monetary: Total amount of money spent

df_uk[“monetary”] = df_uk.groupby(‘customerid’)[“total_price”].transform(‘sum’)

plt.figure(figsize=(16,9))
sns.distplot(df_uk[‘monetary’], bins=50);

############ UK RFM Table

df_uk.columns

Index(['invoiceno', 'stockcode', 'description', 'quantity', 'invoicedate',
       'unitprice', 'customerid', 'country', 'total_price', 'date',
       'last_purchased_date', 'frequency', 'monetary'],
      dtype='object')

df_recency.columns

Index(['customerid', 'last_purchased_date', 'recency'], dtype='object')

df_uk[‘recency’] = df_recency[‘recency’]

df_rfm_table = df_uk[[‘customerid’, ‘recency’, ‘frequency’, ‘monetary’]]

Customer Segmentation with RFM Scores
df_rfm_table = df_rfm_table.set_index(‘customerid’)

df_rfm_table.reset_index().duplicated().sum()

df_rfm_table.drop_duplicates(inplace=True)
df_rfm_table.shape

(6050, 3)

########### Recency Scoring

df_rfm_table[“recency”].quantile(q = [.25,.5,.75])

0.250    21.000
0.500    58.000
0.750   163.000
Name: recency, dtype: float64

def recency_scoring(data):
if data[“recency”] <= 17.000:
return 4
elif data[“recency”] <= 50.000:
return 3
elif data[“recency”] <= 142.000:
return 2
else:
return 1

df_rfm_table[‘recency_quantile’] = df_rfm_table.apply(recency_scoring, axis =1)
#df_rfm_table.head()

########### Frequency Scoring

df_rfm_table[“frequency”].quantile(q = [.25,.5,.75])

0.250    23.000
0.500    65.000
0.750   147.000
Name: frequency, dtype: float64

def frequency_scoring(data):
if data.frequency <= 17.000:
return 1
elif data.frequency <= 40.000:
return 2
elif data.frequency <= 98.000:
return 3
else:
return 4

df_rfm_table[‘frequency_quantile’] = df_rfm_table.apply(frequency_scoring, axis =1)
#df_rfm_table.head()

########### Monetary Scoring

df_rfm_table[“monetary”].quantile(q = [.25,.5,.75])

def monetary_scoring(data):
if data.monetary <= 298.185:
return 1
elif data.monetary <= 644.975:
return 2
elif data.monetary <= 1571.285:
return 3
else:
return 4

df_rfm_table[‘monetary_quantile’] = df_rfm_table.apply(monetary_scoring, axis =1)
#df_rfm_table.head()

############### RFM Scoring

def rfm_scoring(data):
return str(int(data[‘recency_quantile’])) + str(int(data[‘frequency_quantile’])) + str(int(data[‘monetary_quantile’]))

df_rfm_table[‘rfm_score’] = df_rfm_table.apply(rfm_scoring, axis=1)
#df_rfm_table.head()

df_rfm_table[‘rfm_level’] = df_rfm_table[‘recency_quantile’] + df_rfm_table[‘frequency_quantile’] + df_rfm_table[‘monetary_quantile’]
#df_rfm_table.head()

def segments1(data):
if data[‘rfm_level’] >= 10 :
return ‘Gold’
elif (data[‘rfm_level’] >= 6) and (data[‘rfm_level’] < 10 ):
return ‘Silver’
else:
return ‘Bronze’

df_rfm_table [‘segments1’] = df_rfm_table.apply(segments1,axis=1)
#df_rfm_table.head()

segments2 = {
‘Customer Segment’:
[‘Champions’,
‘Loyal Customers’,
‘Potential Loyalist’,
‘Recent Customers’,
‘Customers Needing Attention’,
‘Still Got Hope’,
‘Need to Get Them Back’,
‘Lost’, ‘Give it a Try’],\
‘RFM’:
[‘(3|4)-(3|4)-(3|4)’,
‘(2|3|4)-(3|4)-(1|2|3|4)’,
‘(3|4)-(2|3)-(1|2|3|4)’,
‘(4)-(1)-(1|2|3|4)’,
‘(2|3)-(2|3)-(2|3)’,
‘(2|3)-(1|2)-(1|2|3|4)’,
‘(1|2)-(3|4)-(2|3|4)’,
‘(1|2)-(1|2)-(1|2)’,
‘(1|2)-(1|2|3)-(1|2|3|4)’]
}
pd.DataFrame(segments2)

def categorizer(rfm):
if (rfm[0] in [‘3’, ‘4’]) & (rfm[1] in [‘3’, ‘4’]) & (rfm[2] in [‘3’, ‘4’]):
rfm = ‘Champions’

elif (rfm[0] in ['2', '3', '4']) & (rfm[1] in ['3', '4']) & (rfm[2] in ['1', '2', '3', '4']):
    rfm = 'Loyal Customers'

elif (rfm[0] in ['3', '4']) & (rfm[1] in ['2', '3']) & (rfm[2] in ['1', '2', '3', '4']):
    rfm = 'Potential Loyalist'

elif (rfm[0] in ['4']) & (rfm[1] in ['1']) & (rfm[2] in ['1', '2', '3', '4']):
    rfm = 'Recent Customers'

elif (rfm[0] in ['2', '3']) & (rfm[1] in ['2', '3']) & (rfm[2] in ['2', '3']):
    rfm = 'Customers Needing Attention'

elif (rfm[0] in ['2', '3']) & (rfm[1] in ['1', '2']) & (rfm[2] in ['1', '2', '3', '4']):
    rfm = 'Still Got Hope'

elif (rfm[0] in ['1', '2']) & (rfm[1] in ['3', '4']) & (rfm[2] in ['2', '3', '4']):
    rfm = 'Need to Get Them Back'

elif (rfm[0] in ['1', '2']) & (rfm[1] in ['1', '2']) & (rfm[2] in ['1', '2']):
    rfm = 'Lost'

elif (rfm[0] in ['1', '2']) & (rfm[1] in ['1', '2', '3']) & (rfm[2] in ['1', '2', '3', '4']):
    rfm = 'Give it a Try'

return rfm

df_rfm_table[‘segments2’] = df_rfm_table[“rfm_score”].apply(categorizer)
#df_rfm_table.head()

df_rfm_table[“segments2”].value_counts(dropna=False)

segments2
Need to Get Them Back          2343
Lost                           1666
Champions                       660
Loyal Customers                 602
Give it a Try                   483
Potential Loyalist              115
Still Got Hope                   93
Recent Customers                 46
Customers Needing Attention      40
141                               2
Name: count, dtype: int64

df_rfm_table[“segments1”].info()

<class 'pandas.core.series.Series'>
Index: 6050 entries, 17850.0 to 14569.0
Series name: segments1
Non-Null Count  Dtype 
--------------  ----- 
6050 non-null   object
dtypes: object(1)
memory usage: 94.5+ KB

Plot RFM Segments
df_plot1 = pd.DataFrame(df_rfm_table[“segments1”].value_counts(dropna=False).sort_values(ascending=False)).reset_index().rename(columns={‘index’:’Segments’, ‘segments1′:’Customers’})
df_plot1

import matplotlib.pyplot as plt
import squarify

############################# segment1

sizes=[3200, 1944, 906]
label=[“Silver”, “Bronze”, “Gold”]
color=[‘yellow’,’blue’,’orange’]
squarify.plot(sizes=sizes, label=label, color=color,alpha=0.8 , text_kwargs={‘fontsize’:16})
plt.axis(‘off’)
plt.show()

################ segment2

import matplotlib.pyplot as plt
import seaborn as sns

data = [2343, 1666, 660, 602, 483, 115]
labels = [‘Need to Get Them Back’, ‘Lost’,’Champions’,’Loyal Customers’,’Give it a Try’,’Potential Loyalist’]

colors = sns.color_palette(‘bright’)[0:6]

plt.pie(data, labels = labels, colors = colors, autopct=’%.0f%%’)
plt.show()

Read more here.

Customer Clustering PCA, K-Means and Agglomerative for Marketing Campaigns

Let’s set the working directory

import os
os.chdir(‘YOURPATH’)
os. getcwd()

and read the dataset

import pandas as pd
import numpy as np
df = pd.read_csv(‘marketing_campaign.csv’, sep=”\t”)

Get the shape of the dataframe
print(“Shape of the dataframe:”, df.shape)

Shape of the dataframe: (2240, 29)

Get the information about the dataframe
print(“\nInformation about the dataframe:”)
print(df.info())

Information about the dataframe:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2240 entries, 0 to 2239
Data columns (total 29 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   ID                   2240 non-null   int64  
 1   Year_Birth           2240 non-null   int64  
 2   Education            2240 non-null   object 
 3   Marital_Status       2240 non-null   object 
 4   Income               2216 non-null   float64
 5   Kidhome              2240 non-null   int64  
 6   Teenhome             2240 non-null   int64  
 7   Dt_Customer          2240 non-null   object 
 8   Recency              2240 non-null   int64  
 9   MntWines             2240 non-null   int64  
 10  MntFruits            2240 non-null   int64  
 11  MntMeatProducts      2240 non-null   int64  
 12  MntFishProducts      2240 non-null   int64  
 13  MntSweetProducts     2240 non-null   int64  
 14  MntGoldProds         2240 non-null   int64  
 15  NumDealsPurchases    2240 non-null   int64  
 16  NumWebPurchases      2240 non-null   int64  
 17  NumCatalogPurchases  2240 non-null   int64  
 18  NumStorePurchases    2240 non-null   int64  
 19  NumWebVisitsMonth    2240 non-null   int64  
 20  AcceptedCmp3         2240 non-null   int64  
 21  AcceptedCmp4         2240 non-null   int64  
 22  AcceptedCmp5         2240 non-null   int64  
 23  AcceptedCmp1         2240 non-null   int64  
 24  AcceptedCmp2         2240 non-null   int64  
 25  Complain             2240 non-null   int64  
 26  Z_CostContact        2240 non-null   int64  
 27  Z_Revenue            2240 non-null   int64  
 28  Response             2240 non-null   int64  
dtypes: float64(1), int64(25), object(3)
memory usage: 507.6+ KB
None

Get the summary statistics of the dataframe
print(“\nSummary statistics of the dataframe:”)
df.describe().T

############# DATA PREPARATION

Check for missing data
print(“Missing data in the dataframe:”)
print(df.isnull().sum())

Missing data in the dataframe:
ID                      0
Year_Birth              0
Education               0
Marital_Status          0
Income                 24
Kidhome                 0
Teenhome                0
Dt_Customer             0
Recency                 0
MntWines                0
MntFruits               0
MntMeatProducts         0
MntFishProducts         0
MntSweetProducts        0
MntGoldProds            0
NumDealsPurchases       0
NumWebPurchases         0
NumCatalogPurchases     0
NumStorePurchases       0
NumWebVisitsMonth       0
AcceptedCmp3            0
AcceptedCmp4            0
AcceptedCmp5            0
AcceptedCmp1            0
AcceptedCmp2            0
Complain                0
Z_CostContact           0
Z_Revenue               0
Response                0
dtype: int64

Let’s drop duplicates

df = df.dropna()
df.duplicated().sum()

########### Outliers

Calculate the IQR for the Income column
Q1 = df[‘Income’].quantile(0.25)
Q3 = df[‘Income’].quantile(0.75)
IQR = Q3 – Q1

Identify the outliers in the Income column
outliers = df[(df[‘Income’] < (Q1 – 1.5 * IQR)) | (df[‘Income’] > (Q3 + 1.5 * IQR))]

Print the number of outliers
print(“Number of outliers in the Income column:”, len(outliers))

Number of outliers in the Income column: 8

Remove the outliers in the Income column
df = df[~((df[‘Income’] < (Q1 – 1.5 * IQR)) | (df[‘Income’] > (Q3 + 1.5 * IQR)))]

Print the updated shape of the dataframe
print(“Updated shape of the dataframe:”, df.shape)

############## Feature engineering
print(“Unique values in Education column:”, df[‘Education’].unique())
print(“Unique values in Marital_Status column:”, df[‘Marital_Status’].unique())

Unique values in Education column: ['Graduation' 'PhD' 'Master' 'Basic' '2n Cycle']
Unique values in Marital_Status column: ['Single' 'Together' 'Married' 'Divorced' 'Widow' 'Alone' 'Absurd' 'YOLO']

Let’s define the following new columns

def education_level(education):
if education in [‘Graduation’, ‘PhD’, ‘Master’]:
return ‘High’
elif education in [‘Basic’]:
return ‘Middle’
else:
return ‘Low’

df[‘Education_Level’] = df[‘Education’].apply(education_level)

def living_status(marital_status):
if marital_status in [‘Alone’, ‘Absurd’, ‘YOLO’]:
return ‘Living Alone’
else:
return ‘Living with Others’

df[‘Living_Status’] = df[‘Marital_Status’].apply(living_status)

df[‘Age’] = 2022 – df[‘Year_Birth’]

df[‘Total_Campaigns_Accepted’] = df[[‘AcceptedCmp1’, ‘AcceptedCmp2’, ‘AcceptedCmp3’, ‘AcceptedCmp4’, ‘AcceptedCmp5’]].sum(axis=1)

df[‘Average_Spend’] = (df[[‘MntWines’, ‘MntFruits’, ‘MntMeatProducts’, ‘MntFishProducts’, ‘MntSweetProducts’, ‘MntGoldProds’]].sum(axis=1)) / df[‘NumDealsPurchases’]

df[‘Spent’] = df[‘MntWines’]+df[“MntWines”] +df[‘MntFruits’]+ df[‘MntMeatProducts’] +df[‘MntFishProducts’]+df[‘MntSweetProducts’]+ df[‘MntGoldProds’]

df[‘Is_Parent’] = (df[‘Kidhome’] + df[‘Teenhome’] > 0).astype(int)

Create the new feature for total spending in the last 2 years
df[‘total_spending’] = df[‘MntWines’] + df[‘MntFruits’] + df[‘MntMeatProducts’] + df[‘MntFishProducts’] + df[‘MntSweetProducts’] + df[‘MntGoldProds’]

Create new feature for average monthly visits to the company’s website
df[‘avg_web_visits’] = df[‘NumWebVisitsMonth’] / 12

Create new feature for the ratio of online purchases to total purchases
df[‘online_purchase_ratio’] = df[‘NumWebPurchases’] / (df[‘NumWebPurchases’] + df[‘NumCatalogPurchases’] + df[‘NumStorePurchases’])

Let’s drop a few columns

to_drop = [‘Dt_Customer’, ‘Z_CostContact’, ‘Z_Revenue’, ‘Year_Birth’, ‘ID’]
df = df.drop(to_drop, axis=1)

df.dtypes

Education                    object
Marital_Status               object
Income                      float64
Kidhome                       int64
Teenhome                      int64
Recency                       int64
MntWines                      int64
MntFruits                     int64
MntMeatProducts               int64
MntFishProducts               int64
MntSweetProducts              int64
MntGoldProds                  int64
NumDealsPurchases             int64
NumWebPurchases               int64
NumCatalogPurchases           int64
NumStorePurchases             int64
NumWebVisitsMonth             int64
AcceptedCmp3                  int64
AcceptedCmp4                  int64
AcceptedCmp5                  int64
AcceptedCmp1                  int64
AcceptedCmp2                  int64
Complain                      int64
Response                      int64
Education_Level              object
Living_Status                object
Age                           int64
Total_Campaigns_Accepted      int64
Average_Spend               float64
Spent                         int64
Is_Parent                     int32
total_spending                int64
avg_web_visits              float64
online_purchase_ratio       float64
dtype: object

########### PLOTLY EXPLORATORY DATA ANALYSIS (EDA)

import plotly.express as px

fig0 = px.histogram(df, x=”Income”, nbins=50)
fig0.show()

fig1 = px.histogram(df, x=”Age”, nbins=30, color=’Age’, title=”Distribution of Age”)
fig1.show()

fig2 = px.histogram(df, x=’Marital_Status’, nbins=5, title=”Marital Status Distribution”)
fig2.show()

fig3 = px.histogram(df, x=’Education_Level’, nbins=5, title=”Education Level Distribution”)
fig3.show()

import plotly.express as px

df_plot = df.groupby([‘Marital_Status’])[‘Average_Spend’].mean().reset_index()

fig4 = px.bar(df_plot, x=’Marital_Status’, y=’Average_Spend’, color=’Marital_Status’)

fig4.show()

Marital Status vs Average Spend Distribution

Plot the distribution of number of children in household
fig6 = px.histogram(df, x=’Kidhome’)
fig6.show()

Total Campaigns Accepted Distribution
fig8 = px.histogram(df, x=’Total_Campaigns_Accepted’, nbins=20, title=”Total Campaigns Accepted Distribution”)
fig8.show()

Average Spend per Purchase Distribution
fig9 = px.histogram(df, x=’Average_Spend’, nbins=20, title=”Average Spend per Purchase Distribution”)
fig9.show()

Spending distribution by marital status, education, and is_parent

fig10 = px.histogram(df, x=’total_spending’, color=’Marital_Status’, nbins=50,
title=’Spending Distribution by Marital Status’)
fig11 = px.histogram(df, x=’total_spending’, color=’Education_Level’, nbins=50,
title=’Spending Distribution by Education Level’)
fig12 = px.histogram(df, x=’total_spending’, color=’Is_Parent’, nbins=50,
title=’Spending Distribution by Is_Parent’)

fig10.show()
fig11.show()
fig12.show()

Spending distribution by education level

Plot the Distribution of Online Purchase Ratio
fig13 = px.histogram(df, x=’online_purchase_ratio’)
fig13.show()

Plot the Distribution of Number of Web Visits per Month
fig14 = px.histogram(df, x=’NumWebVisitsMonth’)
fig14.show()

Plot the Distribution of Number of Web Purchases
fig15 = px.histogram(df, x=’NumWebPurchases’)
fig15.show()

Plot the Distribution of Number of Catalog Purchases
fig16 = px.histogram(df, x=’NumCatalogPurchases’)
fig16.show()

Plot the Distribution of Number of Store Purchases
fig17 = px.histogram(df, x=’NumStorePurchases’)
fig17.show()

Box plot of NumWebPurchases vs NumStorePurchases
fig18 = px.box(df, x=”NumWebPurchases”, y=”NumStorePurchases”)
fig18.show()

Scatter plot of “MntWines” vs “MntSweetProducts” with a third variable represented by size or color
fig22 = px.scatter(df, x=”MntWines”, y=”MntSweetProducts”, size=”NumWebVisitsMonth”, color=”Income”, size_max=50)
fig22.show()

fig22 = px.scatter(df, x="MntWines", y="MntSweetProducts", size="NumWebVisitsMonth", color="Income", size_max=50)

########### CLUSTERING ALGORITHMS

import plotly.express as px
import plotly.io as pio

from sklearn.preprocessing import StandardScaler, MinMaxScaler

######## Data Preparation

df = pd.read_csv(‘marketing_campaign.csv’, sep=”\t”)

df = df.dropna()

df.duplicated().sum()

Calculate the IQR for the Income column
Q1 = df[‘Income’].quantile(0.25)
Q3 = df[‘Income’].quantile(0.75)
IQR = Q3 – Q1

Identify the outliers in the Income column
outliers = df[(df[‘Income’] < (Q1 – 1.5 * IQR)) | (df[‘Income’] > (Q3 + 1.5 * IQR))]

Remove the outliers in the Income column
df = df[~((df[‘Income’] < (Q1 – 1.5 * IQR)) | (df[‘Income’] > (Q3 + 1.5 * IQR)))]

df[‘Age’] = 2022 – df[‘Year_Birth’]
df[‘Total_Campaigns_Accepted’] = df[[‘AcceptedCmp1’, ‘AcceptedCmp2’, ‘AcceptedCmp3’, ‘AcceptedCmp4’, ‘AcceptedCmp5’]].sum(axis=1)

df[‘Average_Spend’] = (df[[‘MntWines’, ‘MntFruits’, ‘MntMeatProducts’, ‘MntFishProducts’, ‘MntSweetProducts’, ‘MntGoldProds’]].sum(axis=1)) / df[‘NumDealsPurchases’]

df[‘Spent’] = df[‘MntWines’]+df[“MntWines”] +df[‘MntFruits’]+ df[‘MntMeatProducts’] +df[‘MntFishProducts’]+df[‘MntSweetProducts’]+ df[‘MntGoldProds’]

df[‘Is_Parent’] = (df[‘Kidhome’] + df[‘Teenhome’] > 0).astype(int)

df[‘total_spending’] = df[‘MntWines’] + df[‘MntFruits’] + df[‘MntMeatProducts’] + df[‘MntFishProducts’] + df[‘MntSweetProducts’]

df[‘avg_web_visits’] = df[‘NumWebVisitsMonth’] / 12

df[‘online_purchase_ratio’] = df[‘NumWebPurchases’] / (df[‘NumWebPurchases’] + df[‘NumCatalogPurchases’] + df[‘NumStorePurchases’])

to_drop = [‘Dt_Customer’, ‘Z_CostContact’, ‘Z_Revenue’, ‘Year_Birth’, ‘ID’]
df = df.drop(to_drop, axis=1)

One-hot encode the categorical variables

Select the numerical columns to scale
num_cols = [‘Income’, ‘Kidhome’, ‘Teenhome’, ‘Recency’, ‘MntWines’, ‘MntFruits’,
‘MntMeatProducts’, ‘MntFishProducts’, ‘MntSweetProducts’, ‘MntGoldProds’,
‘NumDealsPurchases’, ‘NumWebPurchases’, ‘NumCatalogPurchases’,
‘NumStorePurchases’, ‘NumWebVisitsMonth’, ‘AcceptedCmp3’, ‘AcceptedCmp4’,
‘AcceptedCmp5’, ‘AcceptedCmp1’, ‘AcceptedCmp2’, ‘Complain’, ‘Response’,
‘total_spending’, ‘avg_web_visits’, ‘online_purchase_ratio’, ‘Age’,
‘Total_Campaigns_Accepted’, ‘Is_Parent’]

Initialize the StandardScaler
scaler = StandardScaler()

Fit the scaler to the numerical columns
df[num_cols] = scaler.fit_transform(df[num_cols])

df.isnull().sum().sum()

al=df[‘online_purchase_ratio’].mean()
print(al)

4.1834490782976914e-17

df[‘online_purchase_ratio’] = df[‘online_purchase_ratio’].fillna(0.33)

df.replace([np.inf, -np.inf], np.nan, inplace=True)

al=df[‘Average_Spend’].mean()
print(al)

412.04631023016225

df[‘Average_Spend’] = df[‘Average_Spend’].fillna(al)

######## K-means Clusters

from sklearn.decomposition import PCA

Initialize the PCA model
pca = PCA(n_components=8)

Fit and transform the data
df_pca = pca.fit_transform(df)

Determining the optimal number of clusters using Silhouette Score
import numpy as np
from sklearn.metrics import silhouette_score
from sklearn.cluster import KMeans

list_k = list(range(2, 10))
silhouette_scores = []
for k in list_k:
km = KMeans(n_clusters=k)
preds = km.fit_predict(df_pca)
silhouette_scores.append(silhouette_score(df_pca, preds))

best_k = list_k[np.argmax(silhouette_scores)]

best_k

Let’s fit the KMeans model with the number of clusters set to 4

import matplotlib.pyplot as plt
from sklearn.metrics import silhouette_score
from sklearn.metrics import calinski_harabasz_score,davies_bouldin_score

kmeans = KMeans(n_clusters=4)
kmeans.fit(df_pca)

predictions = kmeans.predict(df_pca)

silhouette_score_value = silhouette_score(df_pca, predictions)

calinski_harabasz_score_value = calinski_harabasz_score(df_pca, predictions)

davies_bouldin_score_value = davies_bouldin_score(df_pca, predictions)

plt.scatter(df_pca[:, 0], df_pca[:, 1], c=predictions, cmap=’viridis’)
plt.xlabel(‘Feature 1’)
plt.ylabel(‘Feature 2’)
plt.title(‘KMeans Clustering Results in 2D\nSilhouette Score: {0:.3f}’.format(silhouette_score_value))
plt.show()

fig = plt.figure()
ax = fig.add_subplot(111, projection=’3d’)
ax.scatter(df_pca[:, 0], df_pca[:, 1], df_pca[:, 2], c=predictions, cmap=’viridis’)
ax.set_xlabel(‘Feature 1’)
ax.set_ylabel(‘Feature 2’)
ax.set_zlabel(‘Feature 3’)
ax.set_title(‘KMeans Clustering Results in 3D\nSilhouette Score: {0:.3f}’.format(silhouette_score_value))
plt.show()

####### Agglomerative Clustering

import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
from sklearn.cluster import AgglomerativeClustering
from scipy.spatial.distance import pdist, squareform
from sklearn.metrics import davies_bouldin_score

Generate sample data
X = df_pca

Compute the pairwise distances between samples
dist_matrix = squareform(pdist(X))

Fit the Agglomerative Clustering model
agg_cluster = AgglomerativeClustering(n_clusters=4)
agg_cluster.fit(X)

Plotting

plt.scatter(X[:, 0], X[:, 1], c=agg_cluster.labels_, cmap=’viridis’)
plt.show()

fig = plt.figure()
ax = fig.add_subplot(111, projection=’3d’)
ax.scatter(X[:, 0], X[:, 1], X[:, 2], c=agg_cluster.labels_, cmap=’viridis’)
plt.show()

davies_bouldin_index = davies_bouldin_score(X, agg_cluster.labels_)
print(“Davies-Bouldin Index:”, davies_bouldin_index)

silhouette_score_value = silhouette_score(X, agg_cluster.labels_)

calinski_harabasz_score_value = calinski_harabasz_score(X, agg_cluster.labels_)

print(“Silhouette Score:”, silhouette_score_value)
print(“Calinski-Harabasz Index:”, calinski_harabasz_score_value)

Read more here.

Supply Chain RFM & ABC Analysis

Let’s apply the RFM/ABC analysis to the Data Co Supply Chain Dataset. Recency, Frequency, And Monetary Analysis(RFM) and ABC are techniques would give us an idea about important factors in our business and how we can optimize the future business growth.

The Recency-Frequency-Monetary analysis helps demand side sector of an organization by segmenting customers and providing more insights to marketing department and sales department about the customers who buy products at the store.
The ABC analysis helps the supply side sector of an organization by segmenting products based on revenue they generate. It helps manage inventory more efficiently.

Let’s set the working directory

import os
os.chdir(‘YOURPATH’)
os. getcwd()

and import standard libraries

import pandas as pd
import dateutil.parser
import numpy as np
import datetime
from statsmodels.stats.outliers_influence import variance_inflation_factor
import seaborn as sns;
import matplotlib.pyplot as plt
from plotly import graph_objects as go
import plotly.express as px

Let’s read the input data

dFppsc= pd.read_csv(“DataCoSupplyChainDataset.csv”, encoding_errors=”ignore”)
dFppsc.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 180519 entries, 0 to 180518
Data columns (total 53 columns):
 #   Column                         Non-Null Count   Dtype  
---  ------                         --------------   -----  
 0   Type                           180519 non-null  object 
 1   Days for shipping (real)       180519 non-null  int64  
 2   Days for shipment (scheduled)  180519 non-null  int64  
 3   Benefit per order              180519 non-null  float64
 4   Sales per customer             180519 non-null  float64
 5   Delivery Status                180519 non-null  object 
 6   Late_delivery_risk             180519 non-null  int64  
 7   Category Id                    180519 non-null  int64  
 8   Category Name                  180519 non-null  object 
 9   Customer City                  180519 non-null  object 
 10  Customer Country               180519 non-null  object 
 11  Customer Email                 180519 non-null  object 
 12  Customer Fname                 180519 non-null  object 
 13  Customer Id                    180519 non-null  int64  
 14  Customer Lname                 180511 non-null  object 
 15  Customer Password              180519 non-null  object 
 16  Customer Segment               180519 non-null  object 
 17  Customer State                 180519 non-null  object 
 18  Customer Street                180519 non-null  object 
 19  Customer Zipcode               180516 non-null  float64
 20  Department Id                  180519 non-null  int64  
 21  Department Name                180519 non-null  object 
 22  Latitude                       180519 non-null  float64
 23  Longitude                      180519 non-null  float64
 24  Market                         180519 non-null  object 
 25  Order City                     180519 non-null  object 
 26  Order Country                  180519 non-null  object 
 27  Order Customer Id              180519 non-null  int64  
 28  order date (DateOrders)        180519 non-null  object 
 29  Order Id                       180519 non-null  int64  
 30  Order Item Cardprod Id         180519 non-null  int64  
 31  Order Item Discount            180519 non-null  float64
 32  Order Item Discount Rate       180519 non-null  float64
 33  Order Item Id                  180519 non-null  int64  
 34  Order Item Product Price       180519 non-null  float64
 35  Order Item Profit Ratio        180519 non-null  float64
 36  Order Item Quantity            180519 non-null  int64  
 37  Sales                          180519 non-null  float64
 38  Order Item Total               180519 non-null  float64
 39  Order Profit Per Order         180519 non-null  float64
 40  Order Region                   180519 non-null  object 
 41  Order State                    180519 non-null  object 
 42  Order Status                   180519 non-null  object 
 43  Order Zipcode                  24840 non-null   float64
 44  Product Card Id                180519 non-null  int64  
 45  Product Category Id            180519 non-null  int64  
 46  Product Description            0 non-null       float64
 47  Product Image                  180519 non-null  object 
 48  Product Name                   180519 non-null  object 
 49  Product Price                  180519 non-null  float64
 50  Product Status                 180519 non-null  int64  
 51  shipping date (DateOrders)     180519 non-null  object 
 52  Shipping Mode                  180519 non-null  object 
dtypes: float64(15), int64(14), object(24)
memory usage: 73.0+ MB

############ Data Cleaning for Future Regression Models
dFppsc.drop(“Product Description”,axis=1,inplace=True)

dFppsc.drop(“Product Description”,axis=1,inplace=True)

dFppsc[“order date (DateOrders)”]=pd.to_datetime(dFppsc[“order date (DateOrders)”])
dFppsc[“shipping date (DateOrders)”]=pd.to_datetime(dFppsc[“shipping date (DateOrders)”])
dFppsc=dFppsc.sort_values(by=”order date (DateOrders)”)

dFppsc.drop([“Benefit per order”,”Sales per customer”,”Order Item Cardprod Id”,”Order Item Product Price”,”Product Category Id”,”Order Customer Id”],axis=1,inplace=True)

######### Variance Inflation Factor (VIF)
vif=pd.DataFrame()
vif[“columns”]=[‘Order Item Discount’,
‘Order Item Discount Rate’, ‘Order Item Profit Ratio’,
‘Order Item Quantity’, ‘Sales’, ‘Order Item Total’,
‘Order Profit Per Order’,’Product Price’]
vif[“vif value”] = [variance_inflation_factor(dFppsc[[‘Order Item Discount’,
‘Order Item Discount Rate’, ‘Order Item Profit Ratio’,
‘Order Item Quantity’, ‘Sales’, ‘Order Item Total’,
‘Order Profit Per Order’,’Product Price’]].values, i) for i in range(len(vif[“columns”]))]
vif.T

df1=dFppsc.drop([“Order Item Total”,”Product Price”,”Order Item Discount Rate”,”Order Profit Per Order”],axis=1)

vif=pd.DataFrame()
vif[“columns”]=[‘Sales’,
‘Order Item Quantity’,’Order Item Discount’,’Order Item Profit Ratio’]
vif[“data”] = [variance_inflation_factor(df1[[‘Sales’,
‘Order Item Quantity’,’Order Item Discount’,’Order Item Profit Ratio’]].values, i) for i in range(len(vif[“columns”]))]
dFppsc[“Delivery Status”].unique()

array(['Advance shipping', 'Late delivery', 'Shipping on time',
       'Shipping canceled'], dtype=object)

dFppsc.groupby(“Delivery Status”)[“Order Id”].count()

Delivery Status
Advance shipping     41592
Late delivery        98977
Shipping canceled     7754
Shipping on time     32196
Name: Order Id, dtype: int64

dFppsc[dFppsc[“Delivery Status”]==”Shipping canceled”].groupby(“Order Status”).agg(tc=(“Order Id”,”count”))


Order Status	tc
CANCELED	3692
SUSPECTED_FRAUD	4062

dF_clean=dFppsc[dFppsc[“Delivery Status”]!=”Shipping canceled”].copy()

############# RFM Analysis
dF_frequency=dF_clean.groupby([“Customer Id”]).agg(Total_count=(“Order Id”,”nunique”))

dF_clean[“Year”]=dF_clean[“order date (DateOrders)”].dt.year
dF_frequency_year=dF_clean.groupby([“Customer Id”,”Year”],as_index=False)[“Order Id”].nunique()

dF_frequency_year=dF_frequency_year.pivot_table(index=”Customer Id”,columns=”Year”,values=”Order Id”,fill_value=0)

pd.set_option(“display.max_columns”,None)
dF_clean[dF_clean[“Customer Id”]==12436]

dt1=datetime.datetime.strptime(’02-10-2017 12:46:00′, ‘%d-%m-%Y %H:%M:%S’)
dt2=dt1-datetime.timedelta(days=5)

df_exploring=dF_clean[dF_clean[“order date (DateOrders)”]>=dt2]
df_exploring.to_excel(“afterdt1.xlsx”)
df_exploring[df_exploring[“order date (DateOrders)”]>=dt1]

7964 rows × 47 columns

dF_clean[“Customer Full Name”]=dF_clean[“Customer Fname”]+” “+dF_clean[“Customer Lname”]
datetime_val = datetime.datetime.strptime(’02-10-2017 12:46:00′, ‘%d-%m-%Y %H:%M:%S’)
Customer_Names_After_Dt=dF_clean[dF_clean[“order date (DateOrders)”]>=datetime_val][[“Customer Full Name”,”Customer City”]].drop_duplicates()

Customer_Name_Common_Before_After_dt=pd.merge(dF_clean[(dF_clean[“order date (DateOrders)”]<=datetime_val)],Customer_Names_After_Dt,how=”inner”,on=[“Customer Full Name”,”Customer City”])
Records=pd.merge(Customer_Name_Common_Before_After_dt[[“Customer Full Name”,”Customer City”]].drop_duplicates(),dF_clean,how=”inner”,on=[“Customer Full Name”,”Customer City”])

dF_clean=dF_clean[dF_clean[“order date (DateOrders)”]<datetime_val].copy()

####### FREQUENCY

dF_frequency=dF_clean[[“Customer Id”,”Order Id”]].groupby(“Customer Id”,as_index=False).nunique()
dF_frequency.columns=[“Customer Id”,”Frequency”]
sns.histplot(dF_frequency.Frequency,bins=15,kde=False)

Frequency Histogram:

dF_frequency[“F”],Intervals_Frequency=pd.qcut(dF_frequency[“Frequency”],q=3,labels=[1,2,3],retbins=True)

########### RECENCY

dF_recency=dF_clean[[“Customer Id”,”order date (DateOrders)”]].groupby(“Customer Id”,as_index=False).max()
dF_recency.rename(columns ={“order date (DateOrders)”:”last_purchase_date”},inplace=True)
max_date=dF_recency[“last_purchase_date”].max()
dF_recency[“recency”]=max_date-dF_recency[“last_purchase_date”]
dF_recency[“recency”]=dF_recency[“recency”].dt.days

sns.displot(dF_recency.recency,bins=8,kde=False)

Recency Histogram

dF_recency[‘R’],Intervals_Recency=pd.qcut(dF_recency[“recency”],q=3,labels=[3,2,1],retbins=True)

######MONETARY

dF_monetory=dF_clean[[“Customer Id”,”Order Item Total”]].groupby(“Customer Id”,as_index=False).sum()
dF_monetory.columns=[“Customer Id”,”Sum_of_Sales”]
dF_monetory[“M”],Intervals_Monetory=pd.qcut(dF_monetory[“Sum_of_Sales”],q=3,labels=[1,2,3],retbins=True)

sns.displot(dF_monetory.Sum_of_Sales,kde=True)

Monetary histogram

####RFM SCORE

dF_rfm=pd.merge(dF_recency[[“Customer Id”,”R”]],dF_monetory[[“Customer Id”,”M”]],on=”Customer Id”,how=”inner”)
dF_rfm=pd.merge(dF_rfm,dF_frequency[[“Customer Id”,”F”]],on=”Customer Id”,how=”inner”)
dF_rfm[“RFM”]=(dF_rfm[“R”]).astype(str)+(dF_rfm[“F”]).astype(str)+(dF_rfm[“M”]).astype(str)

######SEGMENT STRATEGY

Let’s read the input data

pd.set_option(‘display.max_colwidth’, None)
Segment_Strategy=pd.read_excel(“Segments_Strategy.xlsx”)
Segment_Strategy

#####Customer Segment Data preparation

def Customer_Segment(data):
if data[“R”]==1 and data[“F”] in [1,2,3] and (data[“M”]==3):
return “Lost Customers – Big Spenders”
elif data[“R”]== 1 and data[“F”] in [1,2] and data[“M”] in [1,2]:
return “Lost Customers – Bargain”
elif data[“R”] in [1,2] and data[“F”]==3 and data[“M”] in [1,2]:
return “Lost/Almost Lost Customers – Loyal”
elif (data[“R”]==3) and (data[“F”]==3) and data[“M”] in [1,2]:
return “Loyal Customers”
elif (data[“R”]==3) and data[“F”] in [3,2] and data[“M”]==3:
return “Big Spenders”
elif (data[“R”]==3) and (data[“F”]==1) and data[“M”] in [1,2,3]:
return “New Customers”
elif (data[“R”]==3) and (data[“F”]==2) and data[“M”] in [1,2]:
return “Bargain Customers”
elif (data[“R”]==2) and data[“F”]==2 and data[“M”] in [1,2]:
return “Occasional Customers-Bargain”
elif (data[“R”]==2) and data[“F”] in [2,3] and data[“M”]==3:
return “Occasional Customers- Big Spenders”
elif (data[“R”]==2) and data[“F”]==1 and data [“M”] in [1,2,3]:
return “Unsatisfied Customers”
else:
return “No Segment”

dF_rfm[“R”]=dF_rfm[“R”].astype(“category”)
dF_rfm[“F”]=dF_rfm[“F”].astype(“category”)
dF_rfm[“M”]=dF_rfm[“M”].astype(“category”)
dF_rfm[“Segment”]=dF_rfm.apply(Customer_Segment,axis=1)

Segment_count=(dF_rfm.groupby(“Segment”,as_index=False).agg(Total_Count=(“Customer Id”,”count”))).sort_values(by=”Total_Count”,ascending=False)

Let’s plot Number of Customer in Each Segment

fig2=go.Figure()
fig2.add_trace(go.Bar(x=Segment_count.Segment,
y=Segment_count.Total_Count,
hovertemplate =”%{label}
Number of Customers:%{value}”,
texttemplate = “%{value}”))
fig2.update_layout(title=”Number of Customers in Each Segment”,
xaxis=dict(title=”Customer Segment”),
yaxis=dict(title=”Number Of Customers”),width=800)

Let’s plot Revenue Generated by Each Segment

dF_clean=pd.merge(dF_clean,dF_rfm[[“Customer Id”,”RFM”,”Segment”]],on=”Customer Id”,how=”left”)
dF_segment_revenue=dF_clean.groupby(“Segment”,as_index=False).agg(Total_Revenue=(“Order Item Total”,”sum”)).sort_values(by=”Total_Revenue”,ascending=False)
fig3=go.Figure()
fig3.add_trace(go.Bar(x=dF_segment_revenue.Segment,
y=dF_segment_revenue.Total_Revenue,
hovertemplate =”%{label}
Revenue:%{value}”,
texttemplate = “%{value}”))
fig3.update_layout(title=”Revenue Generated by Each Segment”,xaxis=dict(title=”Customer Segment”),
yaxis=dict(title=”Revenue”))

######TEXT WRAPS & SEGMENT TREE MAPS

def wrap(a):
str1=a.split(” “)
final_str=””
for c in str1:
final_str=final_str+c+’
‘
return final_str

Df1=dF_clean.groupby([“Segment”,”Department Name”,”Product Name”],
as_index=False
).agg(Total_Revenue=(“Order Item Total”,”sum”),Total_Quantity=(“Order Item Quantity”,”sum”))
Df1[“Product Name”]=Df1[“Product Name”].apply(wrap) #Applying text wrap because our product names are too long
segment=[“Big Spenders”,”Bargain Customers”,”Loyal Customers”,”New Customers”]
def type1(data):
if “Big Spenders” in data[“Segment”]:
return “Big spenders”
elif “Bargain” in data[“Segment”]:
return “Bargain”
elif “Loyal” in data[“Segment”]:
return “Loyal”
elif “New” in data[“Segment”]:
return “New Customers”
else:
return “Unsatisfied”

Df1[“Customer Type”]=Df1.apply(type1,axis=1)

data_tm=list(Df1.groupby(“Customer Type”))+[(‘All’,Df1)]

traces=[]
buttons=[]
visible=[]
for i,d in enumerate(data_tm):
visible=[False]*len(data_tm)
visible[i]=True
title=d[0]
traces.append((px.treemap(d[1],
path=[px.Constant(“All”),”Customer Type”,”Segment”,”Department Name”,”Product Name”],
values=”Total_Revenue”).update_traces(visible= True if i==0 else False)).data[0])
buttons.append(dict(label=title,
method=”update”,
args=[{
“visible”:visible},
{“title”:f”{title}”}
]))

updatemenus=[{“active”:0,”buttons”:buttons}]
fig11=go.Figure(data=traces,layout=dict(updatemenus=updatemenus))
fig11.update_layout(title=data_tm[0][0],title_x=0.6)
fig11.update_coloraxes(colorbar_title_text=”Total
Quantity”)
fig11.show()

#####ABC ANALYSIS

Total_Products=dF_clean[“Product Name”].nunique()
print(“Total Number of products: “+f”{Total_Products}”)

Total Number of products: 100

Revenue_ABC=dF_clean.groupby([“Department Name”,”Product Name”]).agg(Total_Revenue=(“Order Item Total”,”sum”)).sort_values(by=”Total_Revenue”,ascending=False).reset_index()
Revenue_ABC[“cum_sum”]=Revenue_ABC[“Total_Revenue”].cumsum()
Revenue_ABC[“cum_per”]=Revenue_ABC[“cum_sum”]/Revenue_ABC[“Total_Revenue”].sum()*100
Revenue_ABC[“per”]=Revenue_ABC[“cum_per”]-Revenue_ABC[“cum_per”].shift(1)
Revenue_ABC.loc[0,”per”]=Revenue_ABC[“cum_per”][0]

def ABC(data):
if data[“cum_per”]<=75: return “A” elif data[“cum_per”]>75 and data[“cum_per”]<=95: return “B” elif data[“cum_per”]>95:
return “C”

Revenue_ABC[“ABC_Revenue”]=Revenue_ABC.apply(ABC,axis=1)

Bar_graph_Abc=Revenue_ABC[[“ABC_Revenue”,”Product Name”,”Total_Revenue”]].groupby(“ABC_Revenue”).agg(Revenue=(“Total_Revenue”,”sum”),count=(“Product Name”,”count”))

##########Bar_graph_Abc
fig2=go.Figure(go.Bar(x=Bar_graph_Abc.index,
y=Bar_graph_Abc[“Revenue”],
hovertemplate =”%{label}
Revenue:%{value}”,
texttemplate = “Revenue
%{value}”,
marker_color=[“orange”,”lightgreen”,”red”],
showlegend=False))
fig2.add_trace(
go.Scatter(
x=Bar_graph_Abc.index,
y=Bar_graph_Abc[“count”],
name=”Number Of Products”,
mode=’lines’,
line = dict(color=’blue’, width=3),
yaxis=”y2″,
marker_line_width = 0
))

fig2.update_layout(
title=”Revenue Generated By Products in Different ABC Segments”,
xaxis=dict(title=”Segment” ),
yaxis=dict(title=”Revenue”,showgrid=False),
yaxis2=dict(title=”Number Of Products”, anchor=”x”, overlaying=”y”,side=”right”,dtick=10),
legend = dict(x = 1.05, y = 1))
fig2.show()

######A Segment Products in Top 6 Cities

Top6Cities=dF_clean[[“Order City”,”Order Item Total”]].groupby(“Order City”).sum().sort_values(by=”Order Item Total”,ascending=False).head(6)

data=dF_clean[dF_clean[“Order City”].isin(Top6Cities.index.tolist())].groupby([“Order City”,”Product Name”],as_index=False).agg(Total_Revenue=(“Order Item Total”,”sum”)).sort_values(by=[“Order City”,”Total_Revenue”],ascending=False)
data[“sum”]=data.groupby(“Order City”)[“Total_Revenue”].transform(‘sum’)
data[“cum_per”]=data.groupby(“Order City”)[“Total_Revenue”].cumsum()/data[“sum”]*100
data[“segment”]=data.apply(ABC,axis=1)
data[“Product Name”]=data[“Product Name”].apply(wrap)
data.sort_values(by=[“sum”,”Total_Revenue”],inplace=True,ascending=False)

fig = px.bar(data[data[“segment”]==”A”], x=’Product Name’, y=’Total_Revenue’,
facet_col=’Order City’,facet_col_wrap=2,facet_row_spacing=0.12,
height=1500,title=”A Segment Products in Top 6 Cities”)
fig.for_each_annotation(lambda a: a.update(text=a.text.split(“=”)[-1]))
fig.update_xaxes(showticklabels=True,tickangle=0)
fig.update_layout(width=950)
fig.show()

Read more here.

Bank Churn ML Prediction

Let’s set the working directory

import os
os.chdir(‘YOURPATH’)
os. getcwd()

Basic Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

Processing libraries
from sklearn.preprocessing import MinMaxScaler
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif, chi2

Model libraries
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import GradientBoostingClassifier

from sklearn.model_selection import train_test_split
from sklearn import metrics

Let’s read the input dataset

df = pd.read_csv(‘BankChurners.csv’)

df = df.drop(‘Naive_Bayes_Classifier_Attrition_Flag_Card_Category_Contacts_Count_12_mon_Dependent_count_Education_Level_Months_Inactive_12_mon_2’, axis = 1)
df = df.drop(‘Naive_Bayes_Classifier_Attrition_Flag_Card_Category_Contacts_Count_12_mon_Dependent_count_Education_Level_Months_Inactive_12_mon_1’, axis = 1

Let’s check the statistics

df.describe().T

Number of churn and non-churn
counts = df.Attrition_Flag.value_counts()
perc_churn = (counts[1] / (counts[0] + counts[1])) * 100

No. of duplicates
duplicates = len(df[df.duplicated()])

No. of missing values
missing_values = df.isnull().sum().sum()

Data types in dataset
types = df.dtypes.value_counts()
print(“Churn Rate = %.1f %%”%(perc_churn))
print(‘Number of Duplicate Entries: %d’%(duplicates))
print(‘Number of Missing Values: %d’%(missing_values))
print(‘Number of Features: %d’%(df.shape[1]))
print(‘Number of Customers: %d’%(df.shape[0]))
print(‘Data Types and Frequency in Dataset:’)
print(types)

Churn Rate = 16.1 %
Number of Duplicate Entries: 0
Number of Missing Values: 0
Number of Features: 21
Number of Customers: 10127
Data Types and Frequency in Dataset:
int64      10
object      6
float64     5
Name: count, dtype: int64

####### Data Pre-processing

Make gender and outcome numerical

df[‘Gender’] = df[‘Gender’].map({‘M’: 1, ‘F’: 0})
df[‘Attrition_Flag’] = df[‘Attrition_Flag’].map({‘Attrited Customer’: 1, ‘Existing Customer’: 0})

drop client id
df = df.drop(‘CLIENTNUM’, axis = 1)

catcols = df.select_dtypes(exclude = [‘int64′,’float64’]).columns
intcols = df.select_dtypes(include = [‘int64’]).columns
floatcols = df.select_dtypes(include = [‘Float64’]).columns

One-hot encoding on categorical columns
df = pd.get_dummies(df, columns = catcols)

Minmax scaling numeric features
for col in df[floatcols]:
df[col] = MinMaxScaler().fit_transform(df[[col]])

for col in df[intcols]:
df[col] = MinMaxScaler().fit_transform(df[[col]])

print(‘New Number of Features: %d’%(df.shape[1]))

New Number of Features: 37

Split into X and y
X = df.drop(‘Attrition_Flag’, axis = 1)
y = df[‘Attrition_Flag’]

##################Correlation between numerical features

heat = df.corr()
plt.figure(figsize=[16,8])
plt.title(“Correlation between numerical features”, size = 25, pad = 20, color = ‘#8cabb6’)
sns.heatmap(heat,cmap = sns.diverging_palette(20, 220, n = 200), annot=False)
plt.show()

#######Feature Engineering

print(“Correlation Coefficient of all the Features”)
corr = df.corr()
corr.sort_values([“Attrition_Flag”], ascending = False, inplace = True)
correlations = corr.Attrition_Flag
a = correlations[correlations > 0.1]
b = correlations[correlations < -0.1]
top_corr_features = a._append(b)
top_corr_features

Correlation Coefficient of all the Features

Out[12]:

Attrition_Flag              1.000000
Contacts_Count_12_mon       0.204491
Months_Inactive_12_mon      0.152449
Total_Amt_Chng_Q4_Q1       -0.131063
Total_Relationship_Count   -0.150005
Total_Trans_Amt            -0.168598
Avg_Utilization_Ratio      -0.178410
Total_Revolving_Bal        -0.263053
Total_Ct_Chng_Q4_Q1        -0.290054
Total_Trans_Ct             -0.371403
Name: Attrition_Flag, dtype: float64

Let’s define the following functions

def plot_importances(model, model_name, features_to_plot, feature_names):
#fit model and performances
model.fit(X,y)
importances = model.feature_importances_

# sort and rank importances
indices = np.argsort(importances)
best_features = np.array(feature_names)[indices][-features_to_plot:]
values = importances[indices][-features_to_plot:]

# plot a graph
y_ticks = np.arange(0, features_to_plot)
fig, ax = plt.subplots()
ax.barh(y_ticks, values, color = '#b2c4cc')
ax.set_yticklabels(best_features)
ax.set_yticks(y_ticks)
ax.set_title("%s Feature Importances"%(model_name))
fig.tight_layout()
plt.show()

def best_features(model, features_to_plot, feature_names):
# get list of best features
model.fit(X,y)
importances = model.feature_importances_

indices = np.argsort(importances)
best_features = np.array(feature_names)[indices][-features_to_plot:]
return best_features

Let’s plot feature importance weights for the following 3 models

feature_names = list(X.columns)

model1 = RandomForestClassifier(random_state = 1234)
plot_importances(model1, ‘Random Forest’, 10, feature_names)

model2 = GradientBoostingClassifier(n_estimators = 100, learning_rate = 1.0, max_depth = 1, random_state = 0)
plot_importances(model2, ‘XGBoost’, 10, feature_names)

model3 = AdaBoostClassifier(n_estimators = 100, learning_rate = 1.0, random_state = 0)
plot_importances(model3, ‘AdaBoost’, 10, feature_names)

Looking at the F-value between label/feature for classification tasks

f_selector = SelectKBest(f_classif, k = 10)
f_selector.fit_transform(X, y)
f_selector_best = f_selector.get_feature_names_out()
print(f_selector_best)

['Gender' 'Total_Relationship_Count' 'Months_Inactive_12_mon'
 'Contacts_Count_12_mon' 'Total_Revolving_Bal' 'Total_Amt_Chng_Q4_Q1'
 'Total_Trans_Amt' 'Total_Trans_Ct' 'Total_Ct_Chng_Q4_Q1'
 'Avg_Utilization_Ratio']

forest_best = list(best_features(model1, 10, feature_names))
XG_best = list(best_features(model2, 10, feature_names))
ada_best = list(best_features(model3, 10, feature_names))
top_corr_features = list(top_corr_features.index[1:])
f_selector_best = list(f_selector_best)

best_features_overall = forest_best + XG_best + ada_best + top_corr_features + f_selector_best

# create a dictionary with the number of times features appear
from collections import Counter
count_best_features = dict(Counter(best_features_overall))

# list of the features without any repeatitions
features_no_repeats = list(dict.fromkeys(best_features_overall))

display(count_best_features)

{'Customer_Age': 3,
 'Avg_Open_To_Buy': 3,
 'Credit_Limit': 2,
 'Total_Amt_Chng_Q4_Q1': 5,
 'Avg_Utilization_Ratio': 3,
 'Total_Relationship_Count': 4,
 'Total_Revolving_Bal': 5,
 'Total_Ct_Chng_Q4_Q1': 5,
 'Total_Trans_Ct': 5,
 'Total_Trans_Amt': 5,
 'Months_Inactive_12_mon': 4,
 'Contacts_Count_12_mon': 4,
 'Gender': 1}

# get list of features with high counts in the dictionary
def get_features(threshold):
# remove features below a certain number of appearances
chosen_features = []
for i in features_no_repeats:
if count_best_features[i] > threshold:
chosen_features.append(i)
return chosen_features

chosen_features = get_features(2)
chosen_features.remove(‘Avg_Open_To_Buy’)
chosen_features.remove(‘Avg_Utilization_Ratio’)
chosen_features

['Customer_Age',
 'Total_Amt_Chng_Q4_Q1',
 'Total_Relationship_Count',
 'Total_Revolving_Bal',
 'Total_Ct_Chng_Q4_Q1',
 'Total_Trans_Ct',
 'Total_Trans_Amt',
 'Months_Inactive_12_mon',
 'Contacts_Count_12_mon']

######### ML Model Evaluation

def eval_model(model, model_name, X, y, threshold):
# make X the chosen subset
chosen_features = get_features(threshold)
X = X[chosen_features]

train_x, test_x, train_y, test_y = train_test_split(X, y, test_size = 0.25, random_state = 42)

# fit model
model.fit(train_x,train_y)
model.score(test_x, test_y)
pred_test = model.predict(test_x)

# get metrics
f1 = metrics.f1_score(test_y, pred_test)
test_acc = metrics.accuracy_score(test_y, pred_test)
con = metrics.confusion_matrix(test_y, pred_test)

print(con,'%s model with %s threshold: %.4f F1-score and %.4f accuracy'%(model_name, threshold, f1, test_acc))

Let’s use the full dataset to perform train/test data splitting with test_size = 0.25

train_x, test_x, train_y, test_y = train_test_split(X, y, test_size = 0.25, random_state = 42)

####Model1 Fitting

model1.fit(train_x,train_y)
model1.score(test_x, test_y)
pred_test = model1.predict(test_x)

f1 = metrics.f1_score(test_y, pred_test)
test_acc = metrics.accuracy_score(test_y, pred_test)
con = metrics.confusion_matrix(test_y, pred_test)

print(con,f1,test_acc)

[[2094   19]
 [  96  323]] 0.8488830486202367 0.9545813586097947

Let’s look at the reduced dataset and repeat model1 fitting

chosen_features = get_features(2)
chosen_features.remove(‘Avg_Open_To_Buy’)
chosen_features.remove(‘Avg_Utilization_Ratio’)
Xnew = X[chosen_features]

train_x, test_x, train_y, test_y = train_test_split(Xnew, y, test_size = 0.25, random_state = 42)

model1.fit(train_x,train_y)
model1.score(test_x, test_y)
pred_test = model1.predict(test_x)

f1 = metrics.f1_score(test_y, pred_test)
test_acc = metrics.accuracy_score(test_y, pred_test)
con = metrics.confusion_matrix(test_y, pred_test)

print(con,f1,test_acc)

[[2086   27]
 [  60  359]] 0.8919254658385094 0.9656398104265402

######More ML Classifiers

from sklearn.ensemble import BaggingClassifier
from sklearn.neighbors import KNeighborsClassifier
model4 = BaggingClassifier(KNeighborsClassifier(n_neighbors = 7), max_samples = 0.8, max_features = 0.8)

eval_model(model4, ‘KNN’, X, y, 2)

[[2074   39]
 [ 176  243]] KNN model with 2 threshold: 0.6933 F1-score and 0.9151 accuracy

from sklearn.linear_model import LogisticRegression

model5 = LogisticRegression(random_state=1)
eval_model(model5, ‘Logistic’, X, y, 2)

[[2059   54]
 [ 217  202]] Logistic model with 2 threshold: 0.5985 F1-score and 0.8930 accuracy

# lists of possible parameters
n = [400, 450, 500, 550, 600, 650, 700, 750, 800, 850, 900, 950, 1000]
depth = [8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20]
rand = [600, 650, 700, 750, 800, 850, 900, 950, 1000, 1050, 1100, 1150, 1200, 1250]

forest = RandomForestClassifier(n_estimators = 100, max_depth = 15, random_state = 750)
eval_model(forest, ‘forest’, X, y, 2)

[[2090   23]
 [  64  355]] forest model with 2 threshold: 0.8908 F1-score and 0.9656 accuracy

####Further Evaluation of Random Forest

def eval_forest(model, model_name, X, y, threshold, n, depth, rand):
# create subset from feature selection
chosen_features = get_features(threshold)
chosen_features.remove(‘Avg_Open_To_Buy’)
chosen_features.remove(‘Avg_Utilization_Ratio’)
X = X[chosen_features]

train_x, test_x, train_y, test_y = train_test_split(X, y, test_size = 0.25, random_state = 42)

model.fit(train_x,train_y)
model.score(test_x, test_y)
pred_test = model.predict(test_x)

f1 = metrics.f1_score(test_y, pred_test)
test_acc = metrics.accuracy_score(test_y, pred_test)
con = metrics.confusion_matrix(test_y, pred_test)

print('Model: %s Threshold: %s F1-Score %.4f Accuracy: %.4f n_estimators: %s depth: %s rand: %s'%(model_name, threshold, f1, test_acc,n,depth,rand))

Let’s define the model

forest = RandomForestClassifier(n_estimators = 850, max_depth = 19, random_state = 1200)

and train/test this model using 2 selected features

chosen_features = get_features(2)
chosen_features.remove(‘Avg_Open_To_Buy’)
chosen_features.remove(‘Avg_Utilization_Ratio’)
X_new = X[chosen_features]

train_x, test_x, train_y, test_y = train_test_split(X_new, y, test_size = 0.25, random_state = 42)

forest.fit(train_x,train_y)
forest.score(test_x, test_y)
pred_test = forest.predict(test_x)

f1 = metrics.f1_score(test_y, pred_test)
test_acc = metrics.accuracy_score(test_y, pred_test)
con = metrics.confusion_matrix(test_y, pred_test)
precision = metrics.precision_score(test_y, pred_test)
recall = metrics.recall_score(test_y, pred_test)
roc = metrics.roc_auc_score(test_y, pred_test)
print(‘Accuracy Score’, test_acc)
print(‘Precision’, precision)
print(‘Recall’, recall)
print(‘F1-Score’, f1)
print(‘ROC Score’, roc)
print(con)

Accuracy Score 0.9664296998420221
Precision 0.9304123711340206
Recall 0.8615751789976134
F1-Score 0.8946716232961587
ROC Score 0.9243985691485936
[[2086   27]
 [  58  361]]

#####CROSS-VALIDATION

from sklearn.model_selection import cross_validate

cv_results = cross_validate(forest, X_new, y, scoring = (‘f1’, ‘accuracy’, ‘roc_auc’), cv = 8)
sorted(cv_results.keys())

['fit_time', 'score_time', 'test_accuracy', 'test_f1', 'test_roc_auc']

cv_results[‘test_roc_auc’]

array([0.91056078, 0.96866615, 0.98541168, 0.9918462 , 0.99539807,
       0.99796213, 0.8989536 , 0.90523967])

cv_results[‘test_accuracy’]

array([0.88230648, 0.94233807, 0.96366509, 0.96366509, 0.96840442,
       0.97630332, 0.92969984, 0.92885375])

#####ML PERFORMANCE ANALYSIS USING Scikit Plot

Import standard libraries

import scikitplot as skplt

import sklearn

from sklearn.model_selection import train_test_split

from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor, GradientBoostingClassifier, ExtraTreesClassifier
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

import matplotlib.pyplot as plt

import sys
import warnings
warnings.filterwarnings(“ignore”)

print(“Scikit Plot Version : “, skplt.version)
print(“Scikit Learn Version : “, sklearn.version)
print(“Python Version : “, sys.version)

%matplotlib inline

Scikit Plot Version :  0.3.7
Scikit Learn Version :  1.2.2
Python Version :  3.9.16 (main, Jan 11 2023, 16:16:36) [MSC v.1916 64 bit (AMD64)]

#####RFC Learning Curve

skplt.estimators.plot_learning_curve(forest, train_x, train_y,
cv=7, shuffle=True, scoring=”accuracy”,
n_jobs=-1, figsize=(6,4), title_fontsize=”large”, text_fontsize=”large”,
title=”RFC Learning Curve”);

######KNN Classification Learning Curve

skplt.estimators.plot_learning_curve(model4, train_x, train_y,
cv=7, shuffle=True, scoring=”accuracy”,
n_jobs=-1, figsize=(6,4), title_fontsize=”large”, text_fontsize=”large”,
title=”KNN Classification Learning Curve”);

####ETC Learning Curve

skplt.estimators.plot_learning_curve(ExtraTreesClassifier(), train_x, train_y,
cv=7, shuffle=True, scoring=”accuracy”,
n_jobs=-1, figsize=(6,4), title_fontsize=”large”, text_fontsize=”large”,
title=”ETC Learning Curve”);

#####GBC Learning Curve

skplt.estimators.plot_learning_curve(GradientBoostingClassifier(), train_x, train_y,
cv=7, shuffle=True, scoring=”accuracy”,
n_jobs=-1, figsize=(6,4), title_fontsize=”large”, text_fontsize=”large”,
title=”GBC Learning Curve”);

#######DTC Learning Curve

from sklearn.tree import DecisionTreeClassifier
skplt.estimators.plot_learning_curve(DecisionTreeClassifier(), train_x, train_y,
cv=7, shuffle=True, scoring=”accuracy”,
n_jobs=-1, figsize=(6,4), title_fontsize=”large”, text_fontsize=”large”,
title=”DTC Learning Curve”);

######ABC Learning Curve

from sklearn.ensemble import AdaBoostClassifier
skplt.estimators.plot_learning_curve(AdaBoostClassifier(), train_x, train_y,
cv=7, shuffle=True, scoring=”accuracy”,
n_jobs=-1, figsize=(6,4), title_fontsize=”large”, text_fontsize=”large”,
title=”ABC Learning Curve”);

########MLPC Learning Curve

from sklearn.neural_network import MLPClassifier
skplt.estimators.plot_learning_curve(MLPClassifier(), train_x, train_y,
cv=7, shuffle=True, scoring=”accuracy”,
n_jobs=-1, figsize=(6,4), title_fontsize=”large”, text_fontsize=”large”,
title=”MLPC Learning Curve”);

#######GNB Learning Curve

from sklearn.naive_bayes import GaussianNB
skplt.estimators.plot_learning_curve(GaussianNB(), train_x, train_y,
cv=7, shuffle=True, scoring=”accuracy”,
n_jobs=-1, figsize=(6,4), title_fontsize=”large”, text_fontsize=”large”,
title=”GNB Learning Curve”);

######LR Learning Curve

skplt.estimators.plot_learning_curve(LogisticRegression(), train_x, train_y,
cv=7, shuffle=True, scoring=”accuracy”,
n_jobs=-1, figsize=(6,4), title_fontsize=”large”, text_fontsize=”large”,
title=”LR Learning Curve”);

#######QDA Learning Curve

from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
skplt.estimators.plot_learning_curve(QuadraticDiscriminantAnalysis(), train_x, train_y,
cv=7, shuffle=True, scoring=”accuracy”,
n_jobs=-1, figsize=(6,4), title_fontsize=”large”, text_fontsize=”large”,
title=”QDA Learning Curve”);

########SVC Learning Curve

from sklearn.svm import SVC
skplt.estimators.plot_learning_curve(SVC(), train_x, train_y,
cv=7, shuffle=True, scoring=”accuracy”,
n_jobs=-1, figsize=(6,4), title_fontsize=”large”, text_fontsize=”large”,
title=”SVC Learning Curve”);

######CALIBRATION PLOTS

lr_probas = QuadraticDiscriminantAnalysis().fit(train_x, train_y).predict_proba(test_x)
rf_probas = RandomForestClassifier().fit(train_x, train_y).predict_proba(test_x)
gb_probas = GradientBoostingClassifier().fit(train_x, train_y).predict_proba(test_x)
et_scores = AdaBoostClassifier().fit(train_x, train_y).predict_proba(test_x)

probas_list = [lr_probas, rf_probas, gb_probas, et_scores]
clf_names = [‘QDA’, ‘RFC’, ‘GBC’, ‘ABC’]

skplt.metrics.plot_calibration_curve(test_y,
probas_list,
clf_names, n_bins=15,
figsize=(12,6)
);

######GBC Normalized Confusion Matrix

from sklearn.metrics import confusion_matrix
import seaborn as sns

model=GradientBoostingClassifier()

model.fit(train_x,train_y)
pred_test = model.predict(test_x)

cm = confusion_matrix(test_y, pred_test)

target_names=[‘0′,’1’]

cmn = cm.astype(‘float’) / cm.sum(axis=1)[:, np.newaxis]
fig, ax = plt.subplots(figsize=(10,10))
sns.heatmap(cmn, annot=True, fmt=’.2f’, xticklabels=target_names, yticklabels=target_names)
plt.ylabel(‘Actual’)
plt.xlabel(‘Predicted’)
plt.title(‘GBC Normalized Confusion Matrix’)

########RFC Normalized Confusion Matrix

from sklearn.metrics import confusion_matrix
import seaborn as sns

forest.fit(train_x,train_y)

pred_test = forest.predict(test_x)

forest.fit(train_x,train_y)
pred_test = forest.predict(test_x)

cm = confusion_matrix(test_y, pred_test)

target_names=[‘0′,’1’]

########RFC ROC Curve

Y_test_probs = forest.predict_proba(test_x)

skplt.metrics.plot_roc_curve(test_y, Y_test_probs,
title=”RFC ROC Curve”, figsize=(12,6));

##########GBC ROC Curve

model=GradientBoostingClassifier()
model.fit(train_x,train_y)
Y_test_probs = model.predict_proba(test_x)

skplt.metrics.plot_roc_curve(test_y, Y_test_probs,
title=”GBC ROC Curve”, figsize=(12,6));

##########RFC Precision-Recall Curve

Y_test_probs = forest.predict_proba(test_x)
skplt.metrics.plot_precision_recall_curve(test_y, Y_test_probs,
title=”RFC Precision-Recall Curve”, figsize=(12,6));

########GBC Precision-Recall Curve

model=GradientBoostingClassifier()
model.fit(train_x,train_y)
Y_test_probs = model.predict_proba(test_x)
skplt.metrics.plot_precision_recall_curve(test_y, Y_test_probs,
title=”GBC Precision-Recall Curve”, figsize=(12,6));

#####K-Means Elbow Plot

skplt.cluster.plot_elbow_curve(KMeans(random_state=1),
train_x,
cluster_ranges=range(2, 20),
figsize=(8,6));

######Silhouette Analysis

kmeans = KMeans(n_clusters=10, random_state=1)
kmeans.fit(train_x, train_y)
cluster_labels = kmeans.predict(test_x)

skplt.metrics.plot_silhouette(test_x, cluster_labels,
figsize=(8,6));

#######PCA Component Variance

pca = PCA(random_state=1)
pca.fit(train_x)

skplt.decomposition.plot_pca_component_variance(pca, figsize=(8,6));

#####PCA 2-D Projection

skplt.decomposition.plot_pca_2d_projection(pca, train_x, train_y,
figsize=(10,10),
cmap=”tab10″);

######Classification Reports via yellowbrick

import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import sklearn
import yellowbrick

pd.set_option(“display.max_columns”, 35)

import warnings
warnings.filterwarnings(“ignore”)

from yellowbrick.classifier import ClassificationReport

viz = ClassificationReport(forest,
classes=target_names,
support=True,
fig=plt.figure(figsize=(8,6)))

viz.fit(train_x, train_y)

viz.score(test_x, test_y)

viz.show();

from yellowbrick.classifier import ClassificationReport

model=GradientBoostingClassifier()
model.fit(train_x,train_y)

viz = ClassificationReport(model,
classes=target_names,
support=True,
fig=plt.figure(figsize=(8,6)))

viz.fit(train_x, train_y)

viz.score(test_x, test_y)

viz.show();

######CLASS PREDICTION ERROR

from yellowbrick.classifier import ClassPredictionError

viz = ClassPredictionError(forest,
classes=target_names,
fig=plt.figure(figsize=(9,6)))

viz.fit(train_x, train_y)

viz.score(test_x, test_y)

viz.show();

model=GradientBoostingClassifier()
model.fit(train_x,train_y)
viz = ClassPredictionError(model,
classes=target_names,
fig=plt.figure(figsize=(9,6)))

viz.fit(train_x, train_y)

viz.score(test_x, test_y)

viz.show();

Read more here.

Groceries Market Basket & RFM Analysis

Let’s set the working directory

import os
os.chdir(‘YOURPATH’)
os. getcwd()

and import/install standard libraries

!pip install apriori

import os
import plotly.express as px
import numpy as np
import pandas as pd
import plotly.graph_objects as go
import matplotlib.pyplot as plt
import seaborn as sns
import operator as op
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules

Loading the input dataset
data = pd.read_csv(‘Groceries_dataset.csv’)
data.head()

data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 38765 entries, 0 to 38764
Data columns (total 3 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   Member_number    38765 non-null  int64 
 1   Date             38765 non-null  object
 2   itemDescription  38765 non-null  object
dtypes: int64(1), object(2)
memory usage: 908.7+ KB

Renaming the columns
data.columns = [‘memberID’, ‘Date’, ‘itemName’]

Checking for the missing values
nan_values = data.isna().sum()
nan_values

memberID    0
Date        0
itemName    0
dtype: int64

Converting Date column into correct data type which is datetime
data.Date = pd.to_datetime(data.Date)
data.memberID = data[‘memberID’].astype(‘str’)
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 38765 entries, 0 to 38764
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype         
---  ------    --------------  -----         
 0   memberID  38765 non-null  object        
 1   Date      38765 non-null  datetime64[ns]
 2   itemName  38765 non-null  object        
dtypes: datetime64[ns](1), object(2)
memory usage: 908.7+ KB

#######EXPLORATORY DATA ANALYSIS (EDA)

Sales_weekly = data.resample(‘w’, on=’Date’).size()
fig = px.line(data, x=Sales_weekly.index, y=Sales_weekly,
labels={‘y’: ‘Number of Sales’,
‘x’: ‘Date’})
fig.update_layout(title_text=’Number of Sales Weekly’,
title_x=0.5, title_font=dict(size=18))
fig.show()

####Number of Sales Weekly

#########Number of Customers Weekly

Unique_customer_weekly = data.resample(‘w’, on=’Date’).memberID.nunique()
fig = px.line(Unique_customer_weekly, x=Unique_customer_weekly.index, y=Unique_customer_weekly,
labels={‘y’: ‘Number of Customers’})
fig.update_layout(title_text=’Number of Customers Weekly’,
title_x=0.5, title_font=dict(size=18))
fig.show()

#####Sales per Customer Ratio Weekly

Sales_per_Customer = Sales_weekly / Unique_customer_weekly
fig = px.line(Sales_per_Customer, x=Sales_per_Customer.index, y=Sales_per_Customer,
labels={‘y’: ‘Sales per Customer Weekly’})
fig.update_layout(title_text=’Sales per Customer Weekly’,
title_x=0.5, title_font=dict(size=18))
fig.update_yaxes(rangemode=”tozero”)
fig.show()

#######Frequency of the Items Sold

Frequency_of_items = data.groupby(pd.Grouper(key=’itemName’)).size().reset_index(name=’count’)
fig = px.treemap(Frequency_of_items, path=[‘itemName’], values=’count’)
fig.update_layout(title_text=’Frequency of the Items Sold’,
title_x=0.5, title_font=dict(size=18)
)
fig.update_traces(textinfo=”label+value”)
fig.show()

########Top 20 Customers regarding Number of Items Bought

user_item = data.groupby(pd.Grouper(key=’memberID’)).size().reset_index(name=’count’) \
.sort_values(by=’count’, ascending=False)
fig = px.bar(user_item.head(25), x=’memberID’, y=’count’,
labels={‘y’: ‘Number of Sales’,
‘count’: ‘Number of Items Bought’},
color=’count’)
fig.update_layout(title_text=’Top 20 Customers regarding Number of Items Bought’,
title_x=0.5, title_font=dict(size=18))
fig.update_traces(marker=dict(line=dict(color=’#000000′, width=1)))
fig.show()

##########Number of Sales per Discrete Week Days

day = data.groupby(data[‘Date’].dt.strftime(‘%A’))[‘itemName’].count()
fig = px.bar(day, x=day.index, y=day, color=day,
labels={‘y’: ‘Number of Sales’,
‘Date’: ‘Week Days’})
fig.update_layout(title_text=’Number of Sales per Discrete Week Days’,
title_x=0.5, title_font=dict(size=18))
fig.update_traces(marker=dict(line=dict(color=’#000000′, width=1)))
fig.show()

##############Number of Sales per Discrete Months

month = data.groupby(data[‘Date’].dt.strftime(‘%m’))[‘itemName’].count()
fig = px.bar(month, x=month.index, y=month, color=month,
labels={‘y’: ‘Number of Sales’,
‘Date’: ‘Months’})
fig.update_layout(title_text=’Number of Sales per Discrete Months’,
title_x=0.5, title_font=dict(size=18))
fig.update_traces(marker=dict(line=dict(color=’#000000′, width=1)))
fig.show()

###########Number of Sales per Discrete Month Days

month_day = data.groupby(data[‘Date’].dt.strftime(‘%d’))[‘itemName’].count()
fig = px.bar(month_day, x=month_day.index, y=month_day, color=month_day,
labels={‘y’: ‘Number of Sales’,
‘Date’: ‘Month Days’})
fig.update_layout(title_text=’Number of Sales per Discrete Month Days’,
title_x=0.5, title_font=dict(size=18))
fig.update_traces(marker=dict(line=dict(color=’#000000′, width=1)))
fig.show()

###########Market Basket Analysis
baskets = data.groupby([‘memberID’, ‘itemName’])[‘itemName’].count().unstack().fillna(0).reset_index()
#baskets.head()

Let’s check the most sold -item which is whole milk- has the same number of sales as we discussed above in the treemap.
baskets[‘whole milk’].sum()

2502.0

Encoding the items that sold more than 1
def one_hot_encoder(k):
if k <= 0: return 0 if k >= 1:
return 1

baskets_final = baskets.iloc[:, 1:baskets.shape[1]].applymap(one_hot_encoder)
#baskets_final.head()

Finding the most frequent items sold together
frequent_itemsets = apriori(baskets_final, min_support=0.025, use_colnames=True, max_len=3).sort_values(by=’support’)
#frequent_itemsets.head(25)

Creating association rules for indicating astecedent and consequent items
rules = association_rules(frequent_itemsets, metric=”lift”, min_threshold=1).sort_values(‘lift’, ascending=False)
rules = rules[[‘antecedents’, ‘consequents’, ‘support’, ‘confidence’, ‘lift’]]
rules.head(25)

###########RFM Analysis

Finding last purchase date of each customer
Recency = data.groupby(by=’memberID’)[‘Date’].max().reset_index()
Recency.columns = [‘memberID’, ‘LastDate’]
#Recency.head()

Finding last date for our dataset
last_date_dataset = Recency[‘LastDate’].max()
last_date_dataset

Timestamp('2015-12-30 00:00:00')

Calculating Recency by subtracting (last transaction date of dataset) and (last purchase date of each customer)
Recency[‘Recency’] = Recency[‘LastDate’].apply(lambda x: (last_date_dataset – x).days)
#Recency.head()

##########Recency Distribution of the Customers

fig = px.histogram(Recency, x=’Recency’, opacity=0.85, marginal=’box’)
fig.update_traces(marker=dict(line=dict(color=’#000000′, width=1)))
fig.update_layout(title_text=’Recency Distribution of the Customers’,
title_x=0.5, title_font=dict(size=20))
fig.show()

###########Frequency of the customer visits
Frequency = data.drop_duplicates([‘Date’, ‘memberID’]).groupby(by=[‘memberID’])[‘Date’].count().reset_index()
Frequency.columns = [‘memberID’, ‘Visit_Frequency’]
#Frequency.head()

#######Visit Frequency Distribution of the Customers

fig = px.histogram(Frequency, x=’Visit_Frequency’, opacity=0.85, marginal=’box’)
fig.update_traces(marker=dict(line=dict(color=’#000000′, width=1)))
fig.update_layout(title_text=’Visit Frequency Distribution of the Customers’,
title_x=0.5, title_font=dict(size=20))
fig.show()

#####MONETARY

Monetary = data.groupby(by=”memberID”)[‘itemName’].count().reset_index()
Monetary.columns = [‘memberID’, ‘Monetary’]
#Monetary.head()

fig = px.histogram(Monetary, x=’Monetary’, opacity=0.85, marginal=’box’,
labels={‘itemName’: ‘Monetary’})
fig.update_traces(marker=dict(line=dict(color=’#000000′, width=1)))
fig.update_layout(title_text=’Monetary Distribution of the Customers’,
title_x=0.5, title_font=dict(size=20))
fig.show()

#######RFM Score

Combining all scores into one DataFrame
RFM = pd.concat([Recency[‘memberID’], Recency[‘Recency’], Frequency[‘Visit_Frequency’], Monetary[‘Monetary’]], axis=1)
#RFM.head()

5-5 score = the best customers
RFM[‘Recency_quartile’] = pd.qcut(RFM[‘Recency’], 5, [5, 4, 3, 2, 1])
RFM[‘Frequency_quartile’] = pd.qcut(RFM[‘Visit_Frequency’], 5, [1, 2, 3, 4, 5])

RFM[‘RF_Score’] = RFM[‘Recency_quartile’].astype(str) + RFM[‘Frequency_quartile’].astype(str)
#RFM.head()

segt_map = { # Segmentation Map [Ref]
r'[1-2][1-2]’: ‘hibernating’,
r'[1-2][3-4]’: ‘at risk’,
r'[1-2]5′: ‘can\’t loose’,
r’3[1-2]’: ‘about to sleep’,
r’33’: ‘need attention’,
r'[3-4][4-5]’: ‘loyal customers’,
r’41’: ‘promising’,
r’51’: ‘new customers’,
r'[4-5][2-3]’: ‘potential loyalists’,
r’5[4-5]’: ‘champions’
}

RFM[‘RF_Segment’] = RFM[‘RF_Score’].replace(segt_map, regex=True)
#RFM.head()

x = RFM.RF_Segment.value_counts()
fig = px.treemap(x, path=[x.index], values=x)
fig.update_layout(title_text=’Distribution of the RFM Segments’, title_x=0.5,
title_font=dict(size=20))
fig.update_traces(textinfo=”label+value+percent root”)
fig.show()

###########Relationship between Visit_Frequency and Recency

fig = px.scatter(RFM, x=”Visit_Frequency”, y=”Recency”, color=’RF_Segment’,
labels={“math score”: “Math Score”,
“writing score”: “Writing Score”})
fig.update_layout(title_text=’Relationship between Visit_Frequency and Recency’,
title_x=0.5, title_font=dict(size=20))
fig.show()

Read more here.

Largest E-Commerce Showcase in Pakistan

Let’s apply the customer segmentation RFM analysis to the Pakistan’s largest e-commerce dataset.

Let’s set the working directory

import os
os.chdir(‘YOURPATH’)
os. getcwd()

and import standard libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import opendatasets as od
import os

import plotly.express as px

%matplotlib inline

import matplotlib.style as style

style.use(‘seaborn-poster’)
style.use(“fivethirtyeight”)
plt.rcParams[‘font.family’] = ‘serif’
sns.set(rc={‘axes.facecolor’:’black’, ‘figure.facecolor’:’black’, ‘axes.grid’ : True})

Let’s read the input dataset

ecommerce = pd.read_csv(‘Pakistan Largest Ecommerce Dataset.csv’,low_memory=False,
comment=’,’ )

ecommerce.shape

(584524, 26)

#ecommerce.head()

ecommerce = ecommerce.dropna(how=’all’, axis=1)

ecommerce.describe().T

#########Order Status

total = ecommerce[ecommerce.grand_total < 0].status.value_counts().to_dict()

total_df = pd.DataFrame(list(total.items()), columns=[‘Order Status’, ‘Counts’])

plt.figure(figsize=(20,10))
ax = sns.barplot(data=total_df, x=’Order Status’, y=’Counts’, palette=’bright’)
ax.bar_label(container = ax.containers[0], padding = 0, fontsize = 22, color=’white’)
plt.xticks( fontsize=20)
plt.yticks( fontsize=20)
plt.xlabel(‘Order Status’, fontsize=24, labelpad=24, color=’white’)
plt.ylabel(‘Counts’, fontsize=24, labelpad=24, color=’white’);
ax.tick_params(axis=’x’, colors=’white’)
ax.tick_params(axis=’y’, colors=’white’);

#########Data Editing

ecommerce = ecommerce[ecommerce.grand_total > 0]

ecommerce.status = ecommerce.status.replace({‘complete’: ‘Completed’,
‘received’: ‘Completed’,
‘cod’: ‘Completed’,
‘paid’: ‘Completed’,
‘closed’: ‘Completed’,
‘exchange’: ‘Completed’,
‘canceled’: ‘Canceled’,
‘order_refunded’: ‘Canceled’,
‘refund’: ‘Canceled’,
‘fraud’: ‘Canceled’,
‘payment_review’: ‘Pending’,
‘pending’: ‘Pending’,
‘processing’: ‘Pending’,
‘holded’: ‘Pending’,
‘pending_paypal’: ‘Pending’})

ecommerce = ecommerce.drop([‘created_at’, ‘sku’, ‘sales_commission_code’, ‘Customer Since’, ‘M-Y’], axis=1)

ecommerce = ecommerce.dropna()

##########Number of Orders: One Order vs Orders > 1

number_of_orders = ecommerce.groupby(‘Customer ID’)[‘increment_id’].nunique().sort_values(ascending=False)

number_of_orders_df = pd.DataFrame(list(number_of_orders.items()), columns=[‘Customer ID’, ‘Number of Orders’])

a = number_of_orders_df[number_of_orders_df[‘Number of Orders’] == 1].value_counts().sum()
b = number_of_orders_df[number_of_orders_df[‘Number of Orders’] != 1].value_counts().sum()

data = {‘Order’: [‘One Order’, ‘More than One Order’], ‘Customer_Counts’: [a, b]}

order_counts = pd.DataFrame.from_dict(data)

fig = px.pie(order_counts,
values = order_counts.Customer_Counts,
names = order_counts.Order,
template = ‘plotly_dark’)
fig.update_traces(textposition=’inside’, textinfo=’percent+label’, textfont_size=20,
marker = dict(line = dict(color = ‘white’, width = 6)))
fig.show()

Pie Chart: One Order, More than One Order

########Number of Products: One Category, More than One Category

number_of_prod = ecommerce.groupby(‘Customer ID’)[‘category_name_1’].nunique().sort_values(ascending=False)

number_of_prod_df = pd.DataFrame(list(number_of_prod.items()), columns=[‘Customer ID’, ‘Number of Products’])

a = number_of_prod_df[number_of_prod_df[‘Number of Products’] == 1].value_counts().sum()
b = number_of_prod_df[number_of_prod_df[‘Number of Products’] != 1].value_counts().sum()

data = {‘Order’: [‘One Category’, ‘More than One Category’], ‘Customer_Counts’: [a, b]}

category_counts = pd.DataFrame.from_dict(data)

fig = px.pie(category_counts,
values = category_counts.Customer_Counts,
names = category_counts.Order,
template = ‘plotly_dark’)
fig.update_traces(textposition=’inside’, textinfo=’percent+label’, textfont_size=20,
marker = dict(line = dict(color = ‘white’, width = 6)))
fig.show()

######Canceled Orders

a = ecommerce[‘Customer ID’].nunique() + canceled [‘Customer ID’].nunique()
b = canceled [‘Customer ID’].nunique()

data = {‘Customers’: [‘Total Customers’, ‘Those Who Canceled’], ‘Customer_Counts’: [a, b]}

customer_counts = pd.DataFrame.from_dict(data)

c = ecommerce[‘increment_id’].nunique() + canceled [‘increment_id’].nunique()
d = canceled [‘increment_id’].nunique()

data = {‘Orders’: [‘Total Orders’, ‘Canceled Orders’], ‘Order_Counts’: [c, d]}

order_counts = pd.DataFrame.from_dict(data)

sns.set(rc={‘axes.facecolor’:’black’, ‘figure.facecolor’:’black’, ‘axes.grid’ : True})

fig, ax = plt.subplots(1,2, figsize = (20,10))
sns.barplot(ax = ax[0], data = customer_counts, x=customer_counts.Customers, y=customer_counts.Customer_Counts, palette= ‘bright’)
sns.barplot(ax = ax[1], data = order_counts, x=order_counts.Orders, y=order_counts.Order_Counts, palette= ‘bright’)

ax[0].set_title(“Customers Who Canceled”, fontsize = 26, pad = 30, color=’red’, fontweight=’bold’)
ax[0].set_xlabel(“Customers”, fontsize = 24, labelpad = 15, color=’white’)
ax[0].set_ylabel(“Counts”, fontsize = 24, labelpad = 15, color=’white’)
ax[0].tick_params(axis=’x’, colors=’white’, labelsize=20)
ax[0].tick_params(axis=’y’, colors=’white’, labelsize=20)

ax[1].set_title(“Canceled Orders”, fontsize = 28, pad = 30, color=’red’, fontweight=’bold’)
ax[1].set_xlabel(“Orders”, fontsize = 24, labelpad = 15, color=’white’)
ax[1].set_ylabel(“Counts”, fontsize = 24, labelpad = 15, color=’white’)
ax[1].tick_params(axis=’x’, colors=’white’, labelsize=20)
ax[1].tick_params(axis=’y’, colors=’white’, labelsize=20);

plt.tight_layout(pad=2);

#######Cancelled Categories

canceled = canceled.dropna()
fig = px.treemap(canceled,
path=[‘category_name_1′], template=’plotly_dark’)
fig.update_traces(textfont_color=’yellow’,textfont_size=16, selector=dict(type=’treemap’))
fig.show()

######Converting Date column into correct data type which is datetime

data.Date = pd.to_datetime(data.Date)

ecommerce[‘Working Date’] = pd.to_datetime(ecommerce[‘Working Date’])

ecommerce.info()

<class 'pandas.core.frame.DataFrame'>
Index: 308167 entries, 0 to 584523
Data columns (total 18 columns):
 #   Column           Non-Null Count   Dtype         
---  ------           --------------   -----         
 0   item_id          308167 non-null  int64         
 1   status           308167 non-null  object        
 2   price            308167 non-null  float64       
 3   qty_ordered      308167 non-null  int64         
 4   grand_total      308167 non-null  float64       
 5   increment_id     308167 non-null  object        
 6   category_name_1  308167 non-null  object        
 7   discount_amount  308167 non-null  float64       
 8   payment_method   308167 non-null  object        
 9   Working Date     308167 non-null  datetime64[ns]
 10  BI Status        308167 non-null  object        
 11   MV              308167 non-null  object        
 12  Year             308167 non-null  int64         
 13  Month            308167 non-null  int64         
 14  FY               308167 non-null  object        
 15  Customer ID      308167 non-null  float64       
 16  Frequency        308167 non-null  int64         
 17  Monetary         308167 non-null  float64       
dtypes: datetime64[ns](1), float64(5), int64(5), object(7)
memory usage: 44.7+ MB

########Recency Frequency Monetary Columns

ref_date = ecommerce[‘Working Date’].max()
ecommerce[‘Date’] = ecommerce[‘Working Date’].apply(lambda x: x.date())
ecommerce[‘Most_Recent_Purchase’] = ecommerce.groupby(‘Customer ID’)[‘Date’].transform(max)

ecommerce[‘Recency’] = ref_date.date() – ecommerce.Most_Recent_Purchase

ecommerce = ecommerce.drop(‘Most_Recent_Purchase’, axis=1)

ecommerce[‘Frequency’] = ecommerce.groupby(‘Customer ID’)[‘increment_id’].transform(‘nunique’)

ecommerce[‘Monetary’] = ecommerce.groupby(‘Customer ID’)[‘grand_total’].transform(sum)

RFM_df = ecommerce[[‘Customer ID’, ‘Recency’, ‘Frequency’, ‘Monetary’]].drop_duplicates()
#RFM_df.head()

####RFM Quantiles & Total Score

RFM_quantiles = RFM_df[[‘Recency’, ‘Frequency’, ‘Monetary’]].quantile(q = [0.25, 0.5, 0.75])

RFM_df.loc[(RFM_df[‘Recency’] <= RFM_quantiles.Recency[0.25]), ‘Rscore’] = 4.0 RFM_df.loc[(RFM_df[‘Recency’] > RFM_quantiles.Recency[0.25]) & (RFM_df.Recency <= RFM_quantiles.Recency[0.5]), ‘Rscore’] = 3.0 RFM_df.loc[(RFM_df[‘Recency’] > RFM_quantiles.Recency[0.5]) & (RFM_df.Recency <= RFM_quantiles.Recency[0.75]), ‘Rscore’] = 2.0 RFM_df.loc[(RFM_df[‘Recency’] > RFM_quantiles.Recency[0.75]), ‘Rscore’] = 1.0

RFM_df.loc[(RFM_df[‘Frequency’] <= RFM_quantiles.Frequency[0.25]), ‘Fscore’] = 1.0 RFM_df.loc[(RFM_df[‘Frequency’] > RFM_quantiles.Frequency[0.25]) & (RFM_df[‘Frequency’] <= RFM_quantiles.Frequency[0.5]), ‘Fscore’] = 2.0 RFM_df.loc[(RFM_df[‘Frequency’] > RFM_quantiles.Frequency[0.5]) & (RFM_df[‘Frequency’] <= RFM_quantiles.Frequency[0.75]), ‘Fscore’] = 3.0 RFM_df.loc[(RFM_df[‘Frequency’] > RFM_quantiles.Frequency[0.75]), ‘Fscore’] = 4.0

RFM_df.loc[(RFM_df[‘Monetary’] <= RFM_quantiles.Monetary[0.25]), ‘Mscore’] = 1.0 RFM_df.loc[(RFM_df[‘Monetary’] > RFM_quantiles.Monetary[0.25]) & (RFM_df[‘Monetary’] <= RFM_quantiles.Monetary[0.5]), ‘Mscore’] = 2.0 RFM_df.loc[(RFM_df[‘Monetary’] > RFM_quantiles.Monetary[0.5]) & (RFM_df[‘Monetary’] <= RFM_quantiles.Monetary[0.75]), ‘Mscore’] = 3.0 RFM_df.loc[(RFM_df[‘Monetary’] > RFM_quantiles.Monetary[0.75]), ‘Mscore’] = 4.0

RFM_df[‘RFM_score’] = RFM_df.Rscore + RFM_df.Fscore + RFM_df.Mscore

RFM_df.loc[(RFM_df[‘RFM_score’] == 12.0), ‘Cluster’] = ‘Champions’
RFM_df.loc[(RFM_df[‘RFM_score’] == 11.0), ‘Cluster’] = ‘Loyal Customers’
RFM_df.loc[(RFM_df[‘RFM_score’] >= 9.0) & (RFM_df[‘RFM_score’] <= 10.0), ‘Cluster’] = ‘Promising Customers’ RFM_df.loc[(RFM_df[‘RFM_score’] >= 7.0) & (RFM_df[‘RFM_score’] <= 8.0), ‘Cluster’] = ‘At Risk’ RFM_df.loc[(RFM_df[‘RFM_score’] >= 4.0) & (RFM_df[‘RFM_score’] <= 6.0), ‘Cluster’] = ‘Hibernating’
RFM_df.loc[(RFM_df[‘RFM_score’] == 3.0), ‘Cluster’] = ‘Lost Customers’

######Customer Segments/Clusters vs Number of Purchases

frequency = RFM_df.groupby(‘Cluster’)[‘Frequency’].sum()
frequency_df = pd.DataFrame(list(frequency.items()), columns=[‘Cluster’, ‘Number_of_Purchases’])

plt.figure(figsize=(20,10))

ax = sns.barplot(data=frequency_df, x=’Cluster’, y=’Number_of_Purchases’, palette=’bright’)
ax.bar_label(container = ax.containers[0], padding = 0, fontsize = 22, color=’white’)
plt.xticks( fontsize=20)
plt.yticks( fontsize=20)
plt.xlabel(‘Clusters’, fontsize=30, labelpad=24, color=’white’)
plt.ylabel(‘No. of Purchases’, fontsize=30, labelpad=24, color=’white’);
ax.tick_params(axis=’x’, colors=’white’)
ax.tick_params(axis=’y’, colors=’white’);

######Customer Segments/Clusters vs Spendings

frequency = RFM_df.groupby(‘Cluster’)[‘Monetary’].mean()
frequency_df = pd.DataFrame(list(frequency.items()), columns=[‘Cluster’, ‘Spendings’])

plt.figure(figsize=(20,10))

ax = sns.barplot(data=frequency_df, x=’Cluster’, y=’Spendings’, palette=’bright’)
ax.bar_label(container = ax.containers[0], padding = 0, fontsize = 22, color=’white’)
plt.xticks( fontsize=20)
plt.yticks( fontsize=20)
plt.xlabel(‘Clusters’, fontsize=30, labelpad=24, color=’white’)
plt.ylabel(‘Spendings’, fontsize=30, labelpad=24, color=’white’);
ax.tick_params(axis=’x’, colors=’white’)
ax.tick_params(axis=’y’, colors=’white’);

######Customer Segments/Clusters vs RFM Score

cluster_mapping = pd.Series(RFM_df.Cluster.values, index=RFM_df[‘Customer ID’]).to_dict()

ecommerce[‘Cluster’] = ecommerce[‘Customer ID’].map(cluster_mapping)

cluster_counts = RFM_df.Cluster.value_counts()

cluster_counts_df = pd.DataFrame(list(cluster_counts.items()), columns=[‘Cluster’, ‘Number_of_Customers’])

fig = px.pie(cluster_counts_df,
values = cluster_counts_df.Number_of_Customers,
names = cluster_counts_df.Cluster,
template = ‘plotly_dark’)
fig.update_traces(textposition=’inside’, textinfo=’percent+label’, textfont_size=20,
marker = dict(line = dict(color = ‘white’, width = 6)))
fig.show()

fig = px.treemap(ecommerce,
path=[‘Cluster’,’category_name_1′], template=’plotly_dark’, height=1000)
fig.update_traces(textfont_color=’yellow’, textfont_size=16, selector=dict(type=’treemap’))
fig.show()

Treemap of customer segments vs Categories.

######6 Clusters vs Categories

cluster1 = ecommerce[ecommerce.Cluster == ‘Champions’].groupby(‘category_name_1’)[‘grand_total’].sum().to_frame().reset_index()
cluster1.columns = [‘Category’, ‘Grand_Total_Cluster1’]

cluster2 = ecommerce[ecommerce.Cluster == ‘Loyal Customers’].groupby(‘category_name_1’)[‘grand_total’].sum().to_frame().reset_index()
cluster2.columns = [‘Category’, ‘Grand_Total_Cluster2’]

cluster3 = ecommerce[ecommerce.Cluster == ‘Promising Customers’].groupby(‘category_name_1’)[‘grand_total’].sum().to_frame().reset_index()
cluster3.columns = [‘Category’, ‘Grand_Total_Cluster3’]

cluster4 = ecommerce[ecommerce.Cluster == ‘At Risk’].groupby(‘category_name_1’)[‘grand_total’].sum().to_frame().reset_index()
cluster4.columns = [‘Category’, ‘Grand_Total_Cluster4’]

cluster5 = ecommerce[ecommerce.Cluster == ‘Hibernating’].groupby(‘category_name_1’)[‘grand_total’].sum().to_frame().reset_index()
cluster5.columns = [‘Category’, ‘Grand_Total_Cluster5’]

cluster6 = ecommerce[ecommerce.Cluster == ‘Lost Customers’].groupby(‘category_name_1’)[‘grand_total’].sum().to_frame().reset_index()
cluster6.columns = [‘Category’, ‘Grand_Total_Cluster6’]

merged = cluster1.merge(cluster2, how=’left’).merge(cluster3, how=’left’).merge(cluster4, how=’left’).merge(cluster5, how=’left’).merge(cluster6, how=’left’)

sns.set(rc={‘axes.facecolor’:’black’, ‘figure.facecolor’:’black’, ‘axes.grid’ : True})

fig, ax = plt.subplots(3,2, figsize = (20,20))
sns.barplot(ax = ax[0,0], data = cluster1, y=cluster1.Category, x=cluster1.Grand_Total_Cluster1, palette= ‘bright’)
sns.barplot(ax = ax[0,1], data = cluster2, y=cluster2.Category, x=cluster2.Grand_Total_Cluster2, palette= ‘bright’)
sns.barplot(ax = ax[1,0], data = cluster3, y=cluster3.Category, x=cluster3.Grand_Total_Cluster3, palette= ‘bright’)
sns.barplot(ax = ax[1,1], data = cluster4, y=cluster4.Category, x=cluster4.Grand_Total_Cluster4, palette= ‘bright’)
sns.barplot(ax = ax[2,0], data = cluster5, y=cluster5.Category, x=cluster5.Grand_Total_Cluster5, palette= ‘bright’)
sns.barplot(ax = ax[2,1], data = cluster6, y=cluster6.Category, x=cluster6.Grand_Total_Cluster6, palette= ‘bright’)

ax[0,0].set_title(“Cluster 1 – Champions”, fontsize = 26, pad = 30, color=’red’, fontweight=’bold’)
ax[0,0].set_xlabel(“Amount Spent”, fontsize = 20, labelpad = 15, color=’white’)
ax[0,0].set_ylabel(“Category”, fontsize = 20, labelpad = 15, color=’white’)
ax[0,0].tick_params(axis=’x’, colors=’white’)
ax[0,0].tick_params(axis=’y’, colors=’white’)

ax[0,1].set_title(“Cluster 2 – Loyal Customers”, fontsize = 26, pad = 30, color=’red’, fontweight=’bold’)
ax[0,1].set_xlabel(“Amount Spent”, fontsize = 20, labelpad = 15, color=’white’)
ax[0,1].set_ylabel(“Category”, fontsize = 20, labelpad = 15, color=’white’)
ax[0,1].tick_params(axis=’x’, colors=’white’)
ax[0,1].tick_params(axis=’y’, colors=’white’)

ax[1,0].set_title(“Cluster 3 – Promising Customers”, fontsize = 26, pad = 30, color=’red’, fontweight=’bold’)
ax[1,0].set_xlabel(“Amount Spent”, fontsize = 20, labelpad = 15, color=’white’)
ax[1,0].set_ylabel(“Category”, fontsize = 20, labelpad = 15, color=’white’)
ax[1,0].tick_params(axis=’x’, colors=’white’)
ax[1,0].tick_params(axis=’y’, colors=’white’)

ax[1,1].set_title(“Cluster 4 – At Risk”, fontsize = 26, pad = 30, color=’red’, fontweight=’bold’)
ax[1,1].set_xlabel(“Amount Spent”, fontsize = 20, labelpad = 15, color=’white’)
ax[1,1].set_ylabel(“Category”, fontsize = 20, labelpad = 15, color=’white’)
ax[1,1].tick_params(axis=’x’, colors=’white’)
ax[1,1].tick_params(axis=’y’, colors=’white’)

ax[2,0].set_title(“Cluster 5 – Hibernating”, fontsize = 26, pad = 30, color=’red’, fontweight=’bold’)
ax[2,0].set_xlabel(“Amount Spent”, fontsize = 20, labelpad = 15, color=’white’)
ax[2,0].set_ylabel(“Category”, fontsize = 20, labelpad = 15, color=’white’)
ax[2,0].tick_params(axis=’x’, colors=’white’)
ax[2,0].tick_params(axis=’y’, colors=’white’)

ax[2,1].set_title(“Cluster 6 – Lost Customers”, fontsize = 26, pad = 30, color=’red’, fontweight=’bold’)
ax[2,1].set_xlabel(“Amount Spent”, fontsize = 20, labelpad = 15, color=’white’)
ax[2,1].set_ylabel(“Category”, fontsize = 20, labelpad = 15, color=’white’)
ax[2,1].tick_params(axis=’x’, colors=’white’)
ax[2,1].tick_params(axis=’y’, colors=’white’)

plt.tight_layout(pad=2);

######6 Clusters vs Payment Method

cluster1_payment = ecommerce[ecommerce.Cluster == ‘Champions’].payment_method.value_counts().to_frame().reset_index()
cluster1_payment.columns = [‘Category’, ‘Payment_Method_Cluster1’]

cluster2_payment = ecommerce[ecommerce.Cluster == ‘Loyal Customers’].payment_method.value_counts().to_frame().reset_index()
cluster2_payment.columns = [‘Category’, ‘Payment_Method_Cluster2’]

cluster3_payment = ecommerce[ecommerce.Cluster == ‘Promising Customers’].payment_method.value_counts().to_frame().reset_index()
cluster3_payment.columns = [‘Category’, ‘Payment_Method_Cluster3’]

cluster4_payment = ecommerce[ecommerce.Cluster == ‘At Risk’].payment_method.value_counts().to_frame().reset_index()
cluster4_payment.columns = [‘Category’, ‘Payment_Method_Cluster4’]

cluster5_payment = ecommerce[ecommerce.Cluster == ‘Hibernating’].payment_method.value_counts().to_frame().reset_index()
cluster5_payment.columns = [‘Category’, ‘Payment_Method_Cluster5’]

cluster6_payment = ecommerce[ecommerce.Cluster == ‘Lost Customers’].payment_method.value_counts().to_frame().reset_index()
cluster6_payment.columns = [‘Category’, ‘Payment_Method_Cluster6’]

merged_payment = cluster1_payment.merge(cluster2_payment, how=’left’).merge(cluster3_payment, how=’left’).merge(cluster4_payment, how=’left’).merge(cluster5_payment, how=’left’).merge(cluster6_payment, how=’left’)

sns.set(rc={‘axes.facecolor’:’black’, ‘figure.facecolor’:’black’, ‘axes.grid’ : True})

fig, ax = plt.subplots(3,2, figsize = (20,20))
sns.barplot(ax = ax[0,0], data = cluster1_payment, y=cluster1_payment.Category, x=cluster1_payment.Payment_Method_Cluster1, palette= ‘bright’)
sns.barplot(ax = ax[0,1], data = cluster2_payment, y=cluster2_payment.Category, x=cluster2_payment.Payment_Method_Cluster2, palette= ‘bright’)
sns.barplot(ax = ax[1,0], data = cluster3_payment, y=cluster3_payment.Category, x=cluster3_payment.Payment_Method_Cluster3, palette= ‘bright’)
sns.barplot(ax = ax[1,1], data = cluster4_payment, y=cluster4_payment.Category, x=cluster4_payment.Payment_Method_Cluster4, palette= ‘bright’)
sns.barplot(ax = ax[2,0], data = cluster5_payment, y=cluster5_payment.Category, x=cluster5_payment.Payment_Method_Cluster5, palette= ‘bright’)
sns.barplot(ax = ax[2,1], data = cluster6_payment, y=cluster6_payment.Category, x=cluster6_payment.Payment_Method_Cluster6, palette= ‘bright’)

ax[0,0].set_title(“Cluster 1 – Champions”, fontsize = 26, pad = 30, color=’red’, fontweight=’bold’)
ax[0,0].set_xlabel(“Payment Method”, fontsize = 20, labelpad = 15, color=’white’)
ax[0,0].set_ylabel(“Counts”, fontsize = 20, labelpad = 15, color=’white’)
ax[0,0].tick_params(axis=’x’, colors=’white’)
ax[0,0].tick_params(axis=’y’, colors=’white’)

ax[0,1].set_title(“Cluster 2 – Loyal Customers”, fontsize = 26, pad = 30, color=’red’, fontweight=’bold’)
ax[0,1].set_xlabel(“Payment Method”, fontsize = 20, labelpad = 15, color=’white’)
ax[0,1].set_ylabel(“Counts”, fontsize = 20, labelpad = 15, color=’white’)
ax[0,1].tick_params(axis=’x’, colors=’white’)
ax[0,1].tick_params(axis=’y’, colors=’white’)

ax[1,0].set_title(“Cluster 3 – Promising Customers”, fontsize = 26, pad = 30, color=’red’, fontweight=’bold’)
ax[1,0].set_xlabel(“Payment Method”, fontsize = 20, labelpad = 15, color=’white’)
ax[1,0].set_ylabel(“Counts”, fontsize = 20, labelpad = 15, color=’white’)
ax[1,0].tick_params(axis=’x’, colors=’white’)
ax[1,0].tick_params(axis=’y’, colors=’white’)

ax[1,1].set_title(“Cluster 4 – At Risk”, fontsize = 26, pad = 30, color=’red’, fontweight=’bold’)
ax[1,1].set_xlabel(“Payment Method”, fontsize = 20, labelpad = 15, color=’white’)
ax[1,1].set_ylabel(“Counts”, fontsize = 20, labelpad = 15, color=’white’)
ax[1,1].tick_params(axis=’x’, colors=’white’)
ax[1,1].tick_params(axis=’y’, colors=’white’)

ax[2,0].set_title(“Cluster 5 – Hibernating”, fontsize = 26, pad = 30, color=’red’, fontweight=’bold’)
ax[2,0].set_xlabel(“Payment Method”, fontsize = 20, labelpad = 15, color=’white’)
ax[2,0].set_ylabel(“Counts”, fontsize = 20, labelpad = 15, color=’white’)
ax[2,0].tick_params(axis=’x’, colors=’white’)
ax[2,0].tick_params(axis=’y’, colors=’white’)

ax[2,1].set_title(“Cluster 6 – Lost Customers”, fontsize = 26, pad = 30, color=’red’, fontweight=’bold’)
ax[2,1].set_xlabel(“Payment Method”, fontsize = 20, labelpad = 15, color=’white’)
ax[2,1].set_ylabel(“Counts”, fontsize = 20, labelpad = 15, color=’white’)
ax[2,1].tick_params(axis=’x’, colors=’white’)
ax[2,1].tick_params(axis=’y’, colors=’white’)

plt.tight_layout(pad=2);

############# All Clusters Pending vs Completed Orders

from plotly.subplots import make_subplots
import plotly.graph_objects as go

completed_orders = ecommerce[ecommerce.status == ‘Completed’].Cluster.value_counts().to_frame().reset_index()
completed_orders.columns = [‘Cluster’, ‘Order_Counts’]

pending_orders = ecommerce[ecommerce.status == ‘Pending’].Cluster.value_counts().to_frame().reset_index()
pending_orders.columns = [‘Cluster’, ‘Order_Counts’]

fig = make_subplots(rows=1, cols=2, specs=[[{‘type’:’domain’}, {‘type’:’domain’}]], subplot_titles=[‘Completed Orders’, ‘Pending Orders’])

fig.add_trace(go.Pie(labels=completed_orders.Cluster, values=completed_orders.Order_Counts),1, 1)

fig.add_trace(go.Pie(labels=pending_orders.Cluster, values=pending_orders.Order_Counts),1, 2)

fig.update_traces(textposition=’inside’, textinfo=’percent+label’, textfont_size=20,
marker = dict(line = dict(color = ‘white’, width = 6)))

fig.update_layout(template=’plotly_dark’, height=600, width=1000)
fig.show()

All Clusters Pending vs Completed Orders

Read more here.

Conclusions

Using market segmentation, companies are able to identify their target audiences and personalize marketing campaigns more effectively. This is why market segmentation is key to staying competitive. It allows you to understand your customers, anticipate their needs, and seize growth opportunities. This powerful technique allows you to improve your decision-making, marketing efforts, and improve your company’s bottom line.

The key to successful market segmentation remains data quality; therefore, you need to pick your data provider after doing your due diligence, ensuring that you have access to the latest industry information in accessible and easy-to-understand formats.

The main benefits of market segmentation include: personalized, more efficient and cost-effective marketing communications, the ability to select your clientele, customer retention and loyalty, and identification of growth opportunities.

The 5-Step Roadmap:

Identify the market that interests you
Conduct research and analyze your market
Segment the market into several parts using a proposed combination of criteria
Create and test your strategy using the insights from the previous step
Select a target market that will be the recipient of your strategy.

Explore More

E-Commerce Cohort Analysis in Python

E-Commerce Data Science Use-Case

E-Commerce ML/AI Classification

AI-Powered Customer Churn Prediction

Telco Customer Churn/Retention Rate ML/AI Strategies that Work!

A K-means Cluster Cohort E-Commerce

Brazilian E-Commerce Showcase

Simple E-Commerce Sales BI Analytics

Top E-Commerce Trends in Q1’23

Start Your E-Commerce with Shopify

Applications of ML/AI in HR – Predicting Employee Attrition

← Back

Thank you for your response. ✨

Make a one-time donation

Make a monthly donation

Make a yearly donation

Choose an amount

€5.00

€15.00

€100.00

€5.00

€15.00

€100.00

€5.00

€15.00

€100.00

Or enter a custom amount

€

Your contribution is appreciated.

Donate Donate monthly Donate yearly

Unsupervised ML, K-Means Clustering & Customer Segmentation

Motivation

Methods

Open-Source Datasets

Mall Customer Segmentation

Online Retail K-Means Clustering

Online Retail Data Analytics

RFM Customer Segmentation

RFM TreeMap: Online Retail Dataset II

####### DATA PREPARATION

CRM Analytics: CLTV

Cohort Analysis in Online Retail

Customer Segmentation, RFM, K-Means and Cohort Analysis in Online Retail

############################# segment1

Customer Clustering PCA, K-Means and Agglomerative for Marketing Campaigns

Supply Chain RFM & ABC Analysis

Bank Churn ML Prediction

Groceries Market Basket & RFM Analysis

Largest E-Commerce Showcase in Pakistan

Conclusions

Explore More

Thank you for your response. ✨

Make a one-time donation

Make a monthly donation

Make a yearly donation

Share this:

Discover more from Our Blogs

Leave a comment Cancel reply

Discover more from Our Blogs