Unsupervised ML Clustering, Customer Segmentation, Cohort, Market Basket, Bank Churn, CRM, ABC & RFM Analysis – A Comprehensive Guide in Python

  • The goal of this post is to describe the most popular, easy-to-use, and effective data-driven techniques that help organisations streamline their operations, build effective relationships with with individual people (customers, service users, colleagues, or suppliers), increase sales, improve customer service, and increase profitability.
  • Customer Segmentation (CS) is a core concept in this technology. This is the process of tagging and grouping customers based on shared characteristics. This process also makes it easy to tailor and personalize your marketing, service, and sales efforts to the needs of specific groups. The result is a potential boost to customer loyalty and conversions.
  • This article represents a comprehensive step-by-step CS guide in Python that helps you manage customer relationships across the entire customer lifecycle, at every marketingsalese-commerce, and customer service interaction.
  • Our guide consists of hands-on case examples and best industry practices with open-source data and Python source codes: mall customers, online retail, marketing campaigns, supply chain, bank churn, groceries market, and e-commerce.

Table of Clickable Contents

  1. Motivation
  2. Methods
  3. Open-Source Datasets
  4. Mall Customer Segmentation
  5. Online Retail K-Means Clustering
  6. Online Retail Data Analytics
  7. RFM Customer Segmentation
  8. RFM TreeMap: Online Retail Dataset II
    1. ####### DATA PREPARATION
  9. CRM Analytics: CLTV
  10. Cohort Analysis in Online Retail
  11. Customer Segmentation, RFM, K-Means and Cohort Analysis in Online Retail
    1. ############################# segment1
  12. Customer Clustering PCA, K-Means and Agglomerative for Marketing Campaigns
  13. Supply Chain RFM & ABC Analysis
  14. Bank Churn ML Prediction
  15. Groceries Market Basket & RFM Analysis
  16. Largest E-Commerce Showcase in Pakistan
  17. Conclusions
  18. Explore More

Motivation

  • Uniform communication strategies that are not customer centric have failed to exploit millions of dollars in opportunities at the individual and segment level. 
  • The more comprehensive  CS-based modeling provides executives and marketers with the ability  to identify high-profit customers and micro-segments that will drive increases in traffic, sales, profit and retention.

Methods

  • We will use the following techniques: unsupervised ML clustering, customer segmentation, cohort, market basket, bank churn, CRM data analytics, ABC & RFM analysis.
  • ABC analysis is a tool that allows companies to divide their inventory or customers into groups based on their respective values to the company. The goal of this type of analysis is to ensure that businesses are optimizing their time and resources when serving customers.
  • What is RFM (recency, frequency, monetary) analysis? RFM analysis is a marketing technique used to quantitatively rank and group customers based on the recency, frequency and monetary total of their recent transactions to identify the best customers and perform targeted marketing campaigns.
  • Cohort analysis is a type of behavioral analytics in which you take a group of users, and analyze their usage patterns based on their shared traits to better track and understand their actions. A cohort is simply a group of people with shared characteristics.
  • Market Basket Analysis is one of the key techniques used by large retailers to uncover associations between items. It works by looking for combinations of items that occur together frequently in transactions. To put it another way, it allows retailers to identify relationships between the items that people buy.
  • Customer Churn prediction means knowing which customers are likely to leave or unsubscribe from your service. Customers have different behaviors and preferences, and reasons for cancelling their subscriptions. Therefore, it is important to actively communicate with each of them to keep them on your customer list. You need to know which marketing activities are most effective for individual customers and when they are most effective.

Open-Source Datasets

This file contains the basic information (ID, age, gender, income, and spending score) about the customers.

Online retail is a transnational data set which contains all the transactions occurring between 01/12/2010 and 09/12/2011 for a UK-based and registered non-store online retail. The company mainly sells unique all-occasion gifts. Many customers of the company are wholesalers.

We will be using the online retail transnational dataset to build RFM clustering and choose the best set of customers which the company should target.

Online Retail II dataset contains all the transactions occurring for a UK-based and registered, non-store online retail. This is a real online retail transaction data set of two years between 01/12/2009 and 09/12/2011.

The dataset was used in the Customer Personality Analysis (CPA). CPA is a detailed analysis of a company’s ideal customers. It helps a business to better understand its customers and makes it easier for them to modify products according to the specific needs, behaviours and concerns of different types of customers.

SUPPLY CHAIN FOR BIG DATA ANALYSIS

Areas of important registered activities : Provisioning , Production , Sales , Commercial Distribution.

Type Data :
Structured Data : DataCoSupplyChainDataset.csv
Unstructured Data : tokenized_access_logs.csv (Clickstream)

Types of Products : Clothing , Sports , and Electronic Supplies

Additionally it is attached in another file called DescriptionDataCoSupplyChain.csv, the description of each of the variables of the DataCoSupplyChainDatasetc.csv.

3 columns: ID of customer, Date of purchase, and Description of product purchased.

Market Basket Analysis is one of the key techniques used by large retailers to uncover associations between items. It works by looking for combinations of items that occur together frequently in transactions. 

Association Rules are widely used to analyze retail basket or transaction data and are intended to identify strong rules discovered in transaction data using measures of interestingness, based on the concept of strong rules.

Half a million transaction record for e-commerce sales in Pakistan.

Attributes

People

  • ID: Customer’s unique identifier
  • Year_Birth: Customer’s birth year
  • Education: Customer’s education level
  • Marital_Status: Customer’s marital status
  • Income: Customer’s yearly household income
  • Kidhome: Number of children in customer’s household
  • Teenhome: Number of teenagers in customer’s household
  • Dt_Customer: Date of customer’s enrollment with the company
  • Recency: Number of days since customer’s last purchase
  • Complain: 1 if the customer complained in the last 2 years, 0 otherwise

Products

  • MntWines: Amount spent on wine in last 2 years
  • MntFruits: Amount spent on fruits in last 2 years
  • MntMeatProducts: Amount spent on meat in last 2 years
  • MntFishProducts: Amount spent on fish in last 2 years
  • MntSweetProducts: Amount spent on sweets in last 2 years
  • MntGoldProds: Amount spent on gold in last 2 years

Promotion

  • NumDealsPurchases: Number of purchases made with a discount
  • AcceptedCmp1: 1 if customer accepted the offer in the 1st campaign, 0 otherwise
  • AcceptedCmp2: 1 if customer accepted the offer in the 2nd campaign, 0 otherwise
  • AcceptedCmp3: 1 if customer accepted the offer in the 3rd campaign, 0 otherwise
  • AcceptedCmp4: 1 if customer accepted the offer in the 4th campaign, 0 otherwise
  • AcceptedCmp5: 1 if customer accepted the offer in the 5th campaign, 0 otherwise
  • Response: 1 if customer accepted the offer in the last campaign, 0 otherwise

Place

  • NumWebPurchases: Number of purchases made through the company’s website
  • NumCatalogPurchases: Number of purchases made using a catalogue
  • NumStorePurchases: Number of purchases made directly in stores
  • NumWebVisitsMonth: Number of visits to company’s website in the last month

Scope:  Bank Churn ModellingBank Customers Churn Churn Prediction of bank customers

Mall Customer Segmentation

Let’s set the working directory YOURPATH

import os
os.chdir(‘YOURPATH’)
os. getcwd()

Importing libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

import plotly.express as px
import plotly.graph_objs as go
import missingno as msno

from sklearn.cluster import KMeans
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.cluster import AgglomerativeClustering
from sklearn.cluster import DBSCAN
from sklearn import metrics
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

import warnings
warnings.filterwarnings(‘ignore’)

plt.style.use(‘fivethirtyeight’)
%matplotlib inline

Let’s read the input dataset

df = pd.read_csv(‘Mall_Customers.csv’)

and check the content
df.head()

Mall_Customers input table

Let’s check the descriptive statistics

df.describe().T

Mall data descriptive statistics

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 5 columns):
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   CustomerID              200 non-null    int64 
 1   Gender                  200 non-null    object
 2   Age                     200 non-null    int64 
 3   Annual Income (k$)      200 non-null    int64 
 4   Spending Score (1-100)  200 non-null    int64 
dtypes: int64(4), object(1)
memory usage: 7.9+ KB

Let’s drop the column

df.drop(‘CustomerID’, axis = 1, inplace = True)

and plot the following distribution plots

plt.figure(figsize = (20, 8))
plotnumber = 1

for col in [‘Age’, ‘Annual Income (k$)’, ‘Spending Score (1-100)’]:
if plotnumber <= 3:
ax = plt.subplot(1, 3, plotnumber)
sns.distplot(df[col])

plotnumber += 1

plt.tight_layout()
plt.show()

Distributions: age, annual income, and spending score.

Let’s look at the Gender pie-plot

values = df[‘Gender’].value_counts()
labels = [‘Male’, ‘Female’]

fig, ax = plt.subplots(figsize = (4, 4), dpi = 100)
explode = (0, 0.06)

patches, texts, autotexts = ax.pie(values, labels = labels, autopct = ‘%1.2f%%’, shadow = True,
startangle = 90, explode = explode)

plt.setp(texts, color = ‘red’)
plt.setp(autotexts, size = 12, color = ‘white’)
autotexts[1].set_color(‘black’)
plt.show()

Gender pie-plot

Let’s compare the corresponding violin plots

plt.figure(figsize = (20, 8))
plotnumber = 1

for col in [‘Age’, ‘Annual Income (k$)’, ‘Spending Score (1-100)’]:
if plotnumber <= 3:
ax = plt.subplot(1, 3, plotnumber)
sns.violinplot(x = col, y = ‘Gender’, data = df)

plotnumber += 1

plt.tight_layout()
plt.show()

Violin plots: gender vs age, annual income, and spending score.

Let’s compare different age groups

age_18_25 = df.Age[(df.Age >= 18) & (df.Age <= 25)] age_26_35 = df.Age[(df.Age >= 26) & (df.Age <= 35)] age_36_45 = df.Age[(df.Age >= 36) & (df.Age <= 45)] age_46_55 = df.Age[(df.Age >= 46) & (df.Age <= 55)] age_55above = df.Age[df.Age >= 55]

x_age = [’18-25′, ’26-35′, ’36-45′, ’46-55′, ’55+’]
y_age = [len(age_18_25.values), len(age_26_35.values), len(age_36_45.values), len(age_46_55.values),
len(age_55above.values)]

px.bar(data_frame = df, x = x_age, y = y_age, color = x_age,
title = ‘Number of customers per age group’)

Number of customers per age group

Let’s plot the Relation between Annual Income and Spending Score

fig=px.scatter(data_frame = df, x = ‘Annual Income (k$)’, y = ‘Spending Score (1-100)’, color=”Age”,
title = ‘Relation between Annual Income and Spending Score’)
fig.update_traces(marker=dict(size=18,
line=dict(width=1,
color=’DarkSlateGrey’)),
selector=dict(mode=’markers’))

Relation between Annual Income and Spending Score

Let’s look at the Number of customers per Spending Score group

ss_1_20 = df[‘Spending Score (1-100)’][(df[‘Spending Score (1-100)’] >= 1) &
(df[‘Spending Score (1-100)’] <= 20)]

ss_21_40 = df[‘Spending Score (1-100)’][(df[‘Spending Score (1-100)’] >= 21) &
(df[‘Spending Score (1-100)’] <= 40)]

ss_41_60 = df[‘Spending Score (1-100)’][(df[‘Spending Score (1-100)’] >= 41) &
(df[‘Spending Score (1-100)’] <= 60)]

ss_61_80 = df[‘Spending Score (1-100)’][(df[‘Spending Score (1-100)’] >= 61) &
(df[‘Spending Score (1-100)’] <= 80)]

ss_81_100 = df[‘Spending Score (1-100)’][(df[‘Spending Score (1-100)’] >= 81) &
(df[‘Spending Score (1-100)’] <= 100)]

x_ss = [‘1-20′, ’21-40′, ’41-60′, ’61-80′, ’81-100’]
y_ss = [len(ss_1_20.values), len(ss_21_40.values), len(ss_41_60.values), len(ss_61_80.values),
len(ss_81_100.values)]

px.bar(data_frame = df, x = x_ss, y = y_ss, color = x_ss,
title = ‘Number of customers per Spending Score group’)

Number of customers per Spending Score group

Let’s check the Number of customers per Annual Income group

ai_0_30 = df[‘Annual Income (k$)’][(df[‘Annual Income (k$)’] >= 0) & (df[‘Annual Income (k$)’] <= 30)] ai_31_60 = df[‘Annual Income (k$)’][(df[‘Annual Income (k$)’] >= 31)&(df[‘Annual Income (k$)’] <= 60)] ai_61_90 = df[‘Annual Income (k$)’][(df[‘Annual Income (k$)’] >= 61)&(df[‘Annual Income (k$)’] <= 90)] ai_91_120 = df[‘Annual Income (k$)’][(df[‘Annual Income (k$)’]>= 91)&(df[‘Annual Income (k$)’]<=120)] ai_121_150 = df[‘Annual Income (k$)’][(df[‘Annual Income (k$)’]>=121)&(df[‘Annual Income (k$)’]<=150)]

x_ai = [‘$ 0-30,000’, ‘$ 30,001-60,000’, ‘$ 60,001-90,000’, ‘$ 90,001-120,000’, ‘$ 120,000-150,000’]
y_ai = [len(ai_0_30.values) , len(ai_31_60.values) , len(ai_61_90.values) , len(ai_91_120.values),
len(ai_121_150.values)]

px.bar(data_frame = df, x = x_ai, y = y_ai, color = x_ai,
title = ‘Number of customers per Annual Income group’)

Number of customers per Annual Income group

Let’s perform K-means clustering using Age and Spending Score columns

X1 = df.loc[:, [‘Age’, ‘Spending Score (1-100)’]].values

wcss= []
for k in range(1, 11):
kmeans = KMeans(n_clusters = k, init = ‘k-means++’)
kmeans.fit(X1)
wcss.append(kmeans.inertia_)

plt.figure(figsize = (12, 7))

plt.plot(range(1, 11), wcss, linewidth = 2, marker = ‘8’)
plt.title(‘Elbow Plot\n’, fontsize = 20)
plt.xlabel(‘K’)
plt.ylabel(‘WCSS’)
plt.show()

K-means clustering using Age and Spending Score columns: Elbow plot

Let’s set n_clusters = 4 and run

kmeans = KMeans(n_clusters = 4)
labels = kmeans.fit_predict(X1)

Let’s plot customer clusters in the domain Age vs Spending Score (1-100)

plt.figure(figsize = (14, 8))

plt.scatter(X1[:, 0], X1[:, 1], c = kmeans.labels_, s = 200)
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], color = ‘red’, s = 350)
plt.title(‘Clusters of Customers\n’, fontsize = 20)
plt.xlabel(‘Age’)
plt.ylabel(‘Spending Score (1-100)’)
plt.show()

Customer clusters in the domain Age vs Spending Score (1-100)

Let’s perform K-means clustering using Age and Annual Income columns

X2 = df.loc[:, [‘Age’, ‘Annual Income (k$)’]].values

wcss= []
for k in range(1, 11):
kmeans = KMeans(n_clusters = k, init = ‘k-means++’)
kmeans.fit(X2)
wcss.append(kmeans.inertia_)

plt.figure(figsize = (12, 7))

plt.plot(range(1, 11), wcss, linewidth = 2, marker = ‘8’)
plt.title(‘Elbow Plot\n’, fontsize = 20)
plt.xlabel(‘K’)
plt.ylabel(‘WCSS’)
plt.show()

K-means clustering using Age and Annual Income columns: Elbow Plot

Let’s set n_clusters = 5 and run

kmeans = KMeans(n_clusters = 5)
labels = kmeans.fit_predict(X2)

Let’s plot Customer Clusters in the domain Annual Income (k$) vs Spending Score (1-100)

Customer Clusters in the domain Annual Income (k$) vs Spending Score (1-100)

Let’s perform K-means clustering using Age, Annual Score and Spending Score columns
X4 = df.iloc[:, 1:]

wcss= []
for k in range(1, 11):
kmeans = KMeans(n_clusters = k, init = ‘k-means++’)
kmeans.fit(X4)
wcss.append(kmeans.inertia_)

plt.figure(figsize = (12, 7))

plt.plot(range(1, 11), wcss, linewidth = 2, marker = ‘8’)
plt.title(‘Elbow Plot\n’, fontsize = 20)
plt.xlabel(‘K’)
plt.ylabel(‘WCSS’)
plt.show()

K-means clustering using Age, Annual Score and Spending Score columns: Elbow plot

Let’s set n_clusters = 6 and run

kmeans = KMeans(n_clusters = 6)
clusters = kmeans.fit_predict(X4)
X4[‘label’] = clusters

Let’s look at the corresponding 3D plot

fig = px.scatter_3d(X4, x=”Annual Income (k$)”, y=”Spending Score (1-100)”, z=”Age”,
color = ‘label’, size = ‘label’)
fig.show()

Customer clusters 3D plot:  x="Annual Income (k$)", y="Spending Score (1-100)", z="Age"

Let’s plot the Hierarchical Clustering dendrogram
plt.figure(figsize = (22, 8))

dendo = dendrogram(linkage(X3, method = ‘ward’))
plt.title(‘Dendrogram’, fontsize = 15)
plt.show()

Hierarchical Clustering dendrogram

Let’s run Agglomerative Clustering with n_clusters = 5

agc = AgglomerativeClustering(n_clusters = 5, affinity = ‘euclidean’, linkage = ‘ward’)
labels = agc.fit_predict(X3)

plt.figure(figsize = (12, 8))
smb=200

plt.scatter(X3[labels == 0,0], X3[labels == 0,1], label = ‘Cluster 1’, s = smb)
plt.scatter(X3[labels == 1,0], X3[labels == 1,1], label = ‘Cluster 2’, s = smb)
plt.scatter(X3[labels == 2,0], X3[labels == 2,1], label = ‘Cluster 3’, s = smb)
plt.scatter(X3[labels == 3,0], X3[labels == 3,1], label = ‘Cluster 4’, s = smb)
plt.scatter(X3[labels == 4,0], X3[labels == 4,1], label = ‘Cluster 5’, s = smb)

plt.legend(loc = ‘best’)
plt.title(‘Clusters of Customers\n ‘, fontsize = 20)
plt.xlabel(‘Annual Income (k$)’)
plt.ylabel(‘Spending Score (1-100)’)
plt.show()

Clusters of Customers:
Annual Income (k$) vs
Spending Score (1-100)

Let’s compare these results with the DBSCAN clustering algorithm

centers = [[1, 1], [-1, -1], [1, -1]]
X, labels_true = make_blobs(n_samples=750, centers=centers, cluster_std=0.4,
random_state=0) # generate sample blobs

X = StandardScaler().fit_transform(X)

db = DBSCAN(eps=0.3, min_samples=10).fit(X)

Creating an array of true and false as the same size as db.labels
core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
core_samples_mask[db.core_sample_indices_] = True

(setting the indices of the core regions to True)
labels = db.labels_

(similar to the model.fit() method, it gives the labels of the clustered data).

The label -1 is considered as noise by the DBSCAN algorithm
n_clusters_ = len(set(labels)) – (1 if -1 in labels else 0)
n_noise_ = list(labels).count(-1) # calculating the number of clusters

print(‘Estimated number of clusters: %d’ % n_clusters_)
print(‘Estimated number of noise points: %d’ % n_noise_)

“””Homogeneity metric of a cluster labeling given a ground truth.

A clustering result satisfies homogeneity if all of its clusters
contain only data points which are members of a single class.”””

print(“Homogeneity: %0.3f” % metrics.homogeneity_score(labels_true, labels))

Estimated number of clusters: 3
Estimated number of noise points: 18
Homogeneity: 0.953

Let’s plot the result representing 3 clusters + noise

plt.figure(figsize = (10, 8))

unique_labels = set(labels) # identifying all the unique labels/clusters
colors = [plt.cm.Spectral(each)
# creating the list of colours, generating the colourmap
for each in np.linspace(0, 1, len(unique_labels))]

for k, col in zip(unique_labels, colors):
if k == -1:
# Black used for noise.
col = [0, 0, 0, 1]
class_member_mask = (labels == k) # assigning class members for each class
xy = X[class_member_mask & core_samples_mask] # creating the list of points for each class
plt.plot(xy[:, 0], xy[:, 1], ‘o’, markerfacecolor=tuple(col),markeredgecolor=’k’, markersize=14)
xy = X[class_member_mask & ~core_samples_mask] # creating the list of noise points
plt.plot(xy[:, 0], xy[:, 1], ‘o’, markerfacecolor=tuple(col), markeredgecolor=’k’, markersize=14)

plt.title(‘Clustering using DBSCAN\n’, fontsize = 15)
plt.show()

Clustering using DBSCAN

Here black symbols designate noise.

Read more here.

Online Retail K-Means Clustering

Let’s continue using the dataset Mall_Customers.csv

from sklearn.preprocessing import StandardScaler

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

import os
import warnings

warnings.filterwarnings(‘ignore’)

df = pd.read_csv(‘Mall_Customers.csv’)

Let’s rename the columns

df.rename(index=str, columns={‘Annual Income (k$)’: ‘Income’,
‘Spending Score (1-100)’: ‘Score’}, inplace=True)
df.head()

The input data table

Let’s examine the input data using sns.pairplot
X = df.drop([‘CustomerID’, ‘Gender’], axis=1)
sns.pairplot(df.drop(‘CustomerID’, axis=1), hue=’Gender’, aspect=1.5)
plt.show()

Input data pairplot

Let’s look at the K-Means Elbow plot

from sklearn.cluster import KMeans

clusters = []

for i in range(1, 11):
km = KMeans(n_clusters=i).fit(X)
clusters.append(km.inertia_)

fig, ax = plt.subplots(figsize=(12, 8))
sns.lineplot(x=list(range(1, 11)), y=clusters, ax=ax,linewidth = 4)
ax.set_title(‘Searching for Elbow’)
ax.set_xlabel(‘Clusters’)
ax.set_ylabel(‘Inertia’)

ax.annotate(‘Possible Elbow Point’, xy=(3, 140000), xytext=(3, 50000), xycoords=’data’,
arrowprops=dict(arrowstyle=’->’, connectionstyle=’arc3′, color=’red’, lw=2))

ax.annotate(‘Possible Elbow Point’, xy=(5, 80000), xytext=(5, 150000), xycoords=’data’,
arrowprops=dict(arrowstyle=’->’, connectionstyle=’arc3′, color=’red’, lw=2))

plt.show()

Searching for Elbow

Let’s plot K-means with 3 clusters Income vs Score

km3 = KMeans(n_clusters=3).fit(X)

X[‘Labels’] = km3.labels_
plt.figure(figsize=(12, 8))

sns.relplot(x=”Income”, y=”Score”, hue=”Labels”, size=”Income”,
sizes=(40, 400), alpha=.5, palette=”muted”,
height=6, data=X)
plt.title(‘KMeans with 3 Clusters’)
plt.show()

K-means with 3 clusters Income vs Score

Let’s plot K-means with 3 clusters Income-Age vs Score

km3 = KMeans(n_clusters=3).fit(X)

X[‘Labels’] = km3.labels_
plt.figure(figsize=(12, 8))

sns.relplot(x=”Income”, y=”Score”, hue=”Labels”, size=”Age”,
sizes=(40, 400), alpha=.5, palette=”muted”,
height=6, data=X)
plt.title(‘KMeans with 3 Clusters’)
plt.show()

K-means with 3 clusters Income-Age vs Score

Let’s plot K-means with 5 clusters Income vs Score

km3 = KMeans(n_clusters=5).fit(X)

X[‘Labels’] = km3.labels_
plt.figure(figsize=(12, 8))
sns.relplot(x=”Income”, y=”Score”, hue=”Labels”, size=”Income”,
sizes=(40, 400), alpha=.5, palette=”muted”,
height=6, data=X)
plt.title(‘KMeans with 5 Clusters’)
plt.show()

K-means with 5 clusters Income vs Score

Let’s plot K-means with 5 clusters Income-Age vs Score

km3 = KMeans(n_clusters=5).fit(X)

X[‘Labels’] = km3.labels_
plt.figure(figsize=(12, 8))
sns.relplot(x=”Income”, y=”Score”, hue=”Labels”, size=”Age”,
sizes=(40, 400), alpha=.5, palette=”muted”,
height=6, data=X)
plt.title(‘KMeans with 5 Clusters’)
plt.show()

K-means with 5 clusters Income-Age vs Score

Let’s invoke sns.swarmplot to compare Labels According to Annual Income and Scoring History

fig = plt.figure(figsize=(20,8))
sns.set(font_scale=2)
ax = fig.add_subplot(121)
sns.swarmplot(x=’Labels’, y=’Income’, data=X, ax=ax,hue=”Labels”, legend=False)
ax.set_title(‘Labels According to Annual Income’)

ax = fig.add_subplot(122)
sns.swarmplot(x=’Labels’, y=’Score’, data=X, ax=ax)
ax.set_title(‘Labels According to Scoring History’)

plt.show()

sns.swarmplot: Labels according to Annual Income and Scoring History

Hierarchical Clustering:
from sklearn.cluster import AgglomerativeClustering
sns.set(font_scale=1)
agglom = AgglomerativeClustering(n_clusters=5, linkage=’average’).fit(X)

X[‘Labels’] = agglom.labels_
plt.figure(figsize=(12, 8))

sns.relplot(x=”Income”, y=”Score”, hue=”Labels”, size=”Age”,
sizes=(40, 400), alpha=.5, palette=sns.color_palette(‘hls’, 5),
height=6, data=X)

plt.title(‘Agglomerative with 5 Clusters’)
plt.show()

Agglomerative Clustering with 5 clusters

Let’s compute the spatial distance matrix

from scipy.cluster import hierarchy
from scipy.spatial import distance_matrix

dist = distance_matrix(X, X)
Z = hierarchy.linkage(dist, ‘complete’)

and plot the dendrogram with dist=’complete’

plt.figure(figsize=(18, 50))
dendro = hierarchy.dendrogram(Z, leaf_rotation=0, leaf_font_size=12, orientation=’right’)

The dendrogram with dist='complete'

Let’s plot the dendrogram with dist=’average’

Z = hierarchy.linkage(dist, ‘average’)
plt.figure(figsize=(18, 50))
dendro = hierarchy.dendrogram(Z, leaf_rotation=0, leaf_font_size =12, orientation = ‘right’)

The dendrogram with dist='average'

Density Based Clustering (DBSCAN):
from sklearn.cluster import DBSCAN

db = DBSCAN(eps=11, min_samples=6).fit(X)

X[‘Labels’] = db.labels_
plt.figure(figsize=(12, 8))

sns.relplot(x=”Income”, y=”Score”, hue=”Labels”, size=”Age”,
sizes=(40, 400), alpha=.5, palette=sns.color_palette(‘hls’, 5),
height=6, data=X)
plt.title(‘DBSCAN with epsilon 11, min samples 6’)
plt.show()

DBSCAN with epsilon 11, min samples 6

Read more here.

Online Retail Data Analytics

Let’s look at the OnlineRetail.csv dataset

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import missingno as msno
df = pd.read_csv(“OnlineRetail.csv”, delimiter=’,’, encoding = “ISO-8859-1”)

Let’s check the data structure

df.head()

Online Retail input data table

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 541909 entries, 0 to 541908
Data columns (total 8 columns):
 #   Column       Non-Null Count   Dtype  
---  ------       --------------   -----  
 0   InvoiceNo    541909 non-null  object 
 1   StockCode    541909 non-null  object 
 2   Description  540455 non-null  object 
 3   Quantity     541909 non-null  int64  
 4   InvoiceDate  541909 non-null  object 
 5   UnitPrice    541909 non-null  float64
 6   CustomerID   406829 non-null  float64
 7   Country      541909 non-null  object 
dtypes: float64(2), int64(1), object(5)
memory usage: 33.1+ MB

df.describe()

Online Retail descriptive statistics

Data Cleaning: Checking for Null Values
msno.bar(df)

Checking for Null Values before data cleaning

df.count()

InvoiceNo      541909
StockCode      541909
Description    540455
Quantity       541909
InvoiceDate    541909
UnitPrice      541909
CustomerID     406829
Country        541909
dtype: int64

df[df[‘CustomerID’].isnull()].count()

InvoiceNo      135080
StockCode      135080
Description    133626
Quantity       135080
InvoiceDate    135080
UnitPrice      135080
CustomerID          0
Country        135080
dtype: int64

Let’s check the percentage

100 – ((541909-135000)/541909 * 100)

24.911931708091203

So, approximately 25% of the data is missing.

We will proceed with dropping the missing rows now.
df.dropna(inplace=True)

msno.bar(df)

Checking for Null Values after data cleaning

Let’s perform further data editing

df[‘InvoiceDate’] = pd.to_datetime(df[‘InvoiceDate’], format=’%d-%m-%Y %H:%M’)

df[‘Total Amount Spent’]= df[‘Quantity’] * df[‘UnitPrice’]

total_amount = df[‘Total Amount Spent’].groupby(df[‘CustomerID’]).sum()
total_amount = pd.DataFrame(total_amount).reset_index()

transactions = df[‘InvoiceNo’].groupby(df[‘CustomerID’]).count()
transaction = pd.DataFrame(transactions).reset_index()

final = df[‘InvoiceDate’].max()
df[‘Last_transact’] = final – df[‘InvoiceDate’]
LT = df.groupby(df[‘CustomerID’]).min()[‘Last_transact’]
LT = pd.DataFrame(LT).reset_index()

df_new = pd.merge(total_amount, transaction, how=’inner’, on=’CustomerID’)
df_new = pd.merge(df_new, LT, how=’inner’, on=’CustomerID’)

df_new[‘Last_transact’] = df_new[‘Last_transact’].dt.days

Let’s run the K-Means Clustering Model
from sklearn.cluster import KMeans
kmeans= KMeans(n_clusters=2)

kmeans.fit(df_new[[‘Total Amount Spent’, ‘InvoiceNo’, ‘Last_transact’]])
pred = kmeans.predict(df_new[[‘Total Amount Spent’, ‘InvoiceNo’, ‘Last_transact’]])

df_new=df_new.join(pred, lsuffix=”_left”, rsuffix=”_right”)

error_rate = []
for clusters in range(1,16):
kmeans = KMeans(n_clusters = clusters)
kmeans.fit(df_new)
kmeans.predict(df_new)
error_rate.append(kmeans.inertia_)

error_rate = pd.DataFrame({‘Cluster’:range(1,16) , ‘Error’:error_rate})

Let’s plot the Error Rate and Clusters

plt.figure(figsize=(12,8))
p = sns.barplot(x=’Cluster’, y= ‘Error’, data= error_rate, palette=’coolwarm_r’)
sns.despine(left=True)
p.set_title(‘Error Rate and Clusters’)

Error Rate and Clusters

Let’s drop 1 column

del df[‘InvoiceDate’]

df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 392692 entries, 0 to 541908
Data columns (total 9 columns):
 #   Column        Non-Null Count   Dtype  
---  ------        --------------   -----  
 0   InvoiceNo     392692 non-null  object 
 1   StockCode     392692 non-null  object 
 2   Description   392692 non-null  object 
 3   Quantity      392692 non-null  int64  
 4   UnitPrice     392692 non-null  float64
 5   CustomerID    392692 non-null  float64
 6   Country       392692 non-null  object 
 7   cancellation  392692 non-null  int64  
 8   TotalPrice    392692 non-null  float64
dtypes: float64(3), int64(2), object(4)
memory usage: 30.0+ MB

Let’s group the data

country_wise = df.groupby(‘Country’).sum()

and read the full country list

country_codes = pd.read_csv(‘wikipedia-iso-country-codes.csv’, names=[‘Country’, ‘two’, ‘three’, ‘numeric’, ‘ISO’])

country_codes.head()

Full country list

Let’s merge df and country_codes

country_wise = pd.merge(country_codes,country_wise, on=’Country’)

Let’s invoke plotly for plotting purposes

from plotly import version
import cufflinks as cf
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
init_notebook_mode(connected=True)
cf.go_offline()
import plotly.graph_objs as go

Let’s plot European Countries According to Revenue

data = dict(type=’choropleth’,colorscale=’GnBu’, locations = country_wise[‘three’], locationmode = ‘ISO-3’, z= country_wise[‘Total Amount Spent’], text = country_wise[‘Country’], colorbar={‘title’:’Revenue’}, marker = dict(line=dict(width=0)))
layout = dict(title = ‘European Countries According to Revenue!’, geo = dict(scope=’europe’,showlakes=False, projection = {‘type’: ‘winkel tripel’}))
Choromaps2 = go.Figure(data=[data], layout=layout)
iplot(Choromaps2)

European Countries According to Revenue

Let’s plot All Countries According to Revenue

data = dict(type=’choropleth’,colorscale=’rainbow’, locations = country_wise[‘three’], locationmode = ‘ISO-3’, z= country_wise[‘Total Amount Spent’], text = country_wise[‘Country’], colorbar={‘title’:’Revenue’}, marker = dict(line=dict(width=0)))
layout = dict(title = ‘All Countries According to Revenue!’, geo = dict(scope=’world’,showlakes=False, projection = {‘type’: ‘winkel tripel’}))
Choromaps2 = go.Figure(data=[data], layout=layout)
iplot(Choromaps2)

All Countries According to Revenue

Let’s create the list of countries based upon the percentage of Total Price

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
df = pd.read_csv(“OnlineRetail.csv”, encoding= ‘unicode_escape’)

df = df.drop_duplicates()

df[‘cancellation’] = df.InvoiceNo.str.extract(‘([C])’).fillna(0).replace({‘C’:1})
df.cancellation.value_counts()

cancellation
0    527390
1      9251
Name: count, dtype: int64

df[df.cancellation == 1][‘CustomerID’].nunique() / df.CustomerID.nunique() * 100

36.34492223238792

df = df[df.CustomerID.notnull()]
df = df[(df.Quantity > 0) & (df.UnitPrice > 0)]

df = df.drop(‘cancellation’, axis = 1)

df[“TotalPrice”] = df.UnitPrice * df.Quantity

df2 = pd.DataFrame(df.groupby(‘Country’).TotalPrice.sum().apply(lambda x: round(x, 2))).sort_values(‘TotalPrice’, ascending = False)

df2[‘perc_of_TotalPrice’] = round(df2.TotalPrice / df2.TotalPrice.sum() * 100, 2)
df2

The list of countries based upon the percentage of Total Price

Read more here.

RFM Customer Segmentation

Let’s set the working directory YOURPATH

import os
os.chdir(‘YOURPATH’)
os. getcwd()

and import the key libraries

!pip install termcolor

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import ipywidgets
from ipywidgets import interact

import numpy as np
import pandas as pd

import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
import matplotlib.ticker as mticker
import squarify as sq

import scipy.stats as stats
import statsmodels.api as sm
import statsmodels.formula.api as smf
import missingno as msno

import datetime as dt
from datetime import datetime

from sklearn.preprocessing import scale, StandardScaler, MinMaxScaler, RobustScaler
from sklearn.preprocessing import PolynomialFeatures
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import PowerTransformer
from sklearn.preprocessing import LabelEncoder
from sklearn.compose import make_column_transformer
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

import plotly.express as px
import cufflinks as cf
import plotly.offline
cf.go_offline()
cf.set_config_file(offline=False, world_readable=True)

import colorama
from colorama import Fore, Style # maakes strings colored
from termcolor import colored
from termcolor import cprint

from wordcloud import WordCloud

import warnings
warnings.filterwarnings(“ignore”)
warnings.warn(“this will not show”)

plt.rcParams[“figure.figsize”] = (10,6)
pd.set_option(‘max_colwidth’,200)
pd.set_option(‘display.max_rows’, 1000)
pd.set_option(‘display.max_columns’, 200)
pd.set_option(‘display.float_format’, lambda x: ‘%.3f’ % x)

Let’s define the following functions

##########################

def missing_values(df):
missing_number = df.isnull().sum().sort_values(ascending=False)
missing_percent = (df.isnull().sum()/df.isnull().count()).sort_values(ascending=False)
missing_values = pd.concat([missing_number, missing_percent], axis=1, keys=[‘Missing_Number’, ‘Missing_Percent’])
return missing_values[missing_values[‘Missing_Number’]>0]

def first_looking(df):
print(colored(“Shape:”, attrs=[‘bold’]), df.shape,’\n’,
colored(‘100, ‘red’, attrs=[‘bold’]),
colored(“\nInfo:\n”, attrs=[‘bold’]), sep=”)
print(df.info(), ‘\n’,
colored(‘100, ‘red’, attrs=[‘bold’]), sep=”)
print(colored(“Number of Uniques:\n”, attrs=[‘bold’]), df.nunique(),’\n’,
colored(‘100, ‘red’, attrs=[‘bold’]), sep=”)
print(colored(“Missing Values:\n”, attrs=[‘bold’]), missing_values(df),’\n’,
colored(‘100, ‘red’, attrs=[‘bold’]), sep=”)
print(colored(“All Columns:”, attrs=[‘bold’]), list(df.columns),’\n’,
colored(‘100, ‘red’, attrs=[‘bold’]), sep=”)

df.columns= df.columns.str.lower().str.replace('&', '_').str.replace(' ', '_')
print(colored("Columns after rename:", attrs=['bold']), list(df.columns),'\n',
      colored('*'*100, 'red', attrs=['bold']), sep='')  
print(colored("Columns after rename:", attrs=['bold']), list(df.columns),'\n',
      colored('*'*100, 'red', attrs=['bold']), sep='')
print(colored("Descriptive Statistics \n", attrs=['bold']), df.describe().round(2),'\n',
      colored('*'*100, 'red', attrs=['bold']), sep='') # Gives a statstical breakdown of the data.
print(colored("Descriptive Statistics (Categorical Columns) \n", attrs=['bold']), df.describe(include=object).T,'\n',
      colored('*'*100, 'red', attrs=['bold']), sep='') # Gives a statstical breakdown of the data.

def multicolinearity_control(df):
feature =[]
collinear=[]
for col in df.corr().columns:
for i in df.corr().index:
if (abs(df.corr()[col][i])> .9 and abs(df.corr()[col][i]) < 1):
feature.append(col)
collinear.append(i)
print(colored(f”Multicolinearity alert in between:{col} – {i}”,
“red”, attrs=[‘bold’]), df.shape,’\n’,
colored(‘100, ‘red’, attrs=[‘bold’]), sep=”)

def duplicate_values(df):
print(colored(“Duplicate check…”, attrs=[‘bold’]), sep=”)
print(“There are”, df.duplicated(subset=None, keep=’first’).sum(), “duplicated observations in the dataset.”)

def drop_null(df, limit):
print(‘Shape:’, df.shape)
for i in df.isnull().sum().index:
if (df.isnull().sum()[i]/df.shape[0]*100)>limit:
print(df.isnull().sum()[i], ‘percent of’, i ,’null and were dropped’)
df.drop(i, axis=1, inplace=True)
print(‘new shape:’, df.shape)
print(‘New shape after missing value control:’, df.shape)

def first_look(col):
print(“column name : “, col)
print(“——————————–“)
print(“Per_of_Nulls : “, “%”, round(df[col].isnull().sum()/df.shape[0]*100, 2))
print(“Num_of_Nulls : “, df[col].isnull().sum())
print(“Num_of_Uniques : “, df[col].nunique())
print(“Duplicates : “, df.duplicated(subset=None, keep=’first’).sum())
print(df[col].value_counts(dropna = False))

def fill_most(df, group_col, col_name):
”’Fills the missing values with the most existing value (mode) in the relevant column according to single-stage grouping”’
for group in list(df[group_col].unique()):
cond = df[group_col]==group
mode = list(df[cond][col_name].mode())
if mode != []:
df.loc[cond, col_name] = df.loc[cond, col_name].fillna(df[cond][col_name].mode()[0])
else:
df.loc[cond, col_name] = df.loc[cond, col_name].fillna(df[col_name].mode()[0])
print(“Number of NaN : “,df[col_name].isnull().sum())
print(“——————“)
print(df[col_name].value_counts(dropna=False))

##########################

Let’s read and copy the input Online Retail dataset

df0=pd.read_csv(‘Online Retail.csv’,index_col=0)
df = df0.copy()

Let’s check the dataset content

first_looking(df)
duplicate_values(df)

Shape:(541909, 8)
****************************************************************************************************
Info:

<class 'pandas.core.frame.DataFrame'>
Index: 541909 entries, 0 to 541908
Data columns (total 8 columns):
 #   Column       Non-Null Count   Dtype  
---  ------       --------------   -----  
 0   InvoiceNo    541909 non-null  object 
 1   StockCode    541909 non-null  object 
 2   Description  540455 non-null  object 
 3   Quantity     541909 non-null  float64
 4   InvoiceDate  541909 non-null  object 
 5   UnitPrice    541909 non-null  float64
 6   CustomerID   406829 non-null  float64
 7   Country      541909 non-null  object 
dtypes: float64(3), object(5)
memory usage: 37.2+ MB
None
****************************************************************************************************
Number of Uniques:
InvoiceNo      25900
StockCode       4070
Description     4223
Quantity         722
InvoiceDate    23260
UnitPrice       1630
CustomerID      4372
Country           38
dtype: int64
****************************************************************************************************
Missing Values:
             Missing_Number  Missing_Percent
CustomerID           135080            0.249
Description            1454            0.003
****************************************************************************************************
All Columns:['InvoiceNo', 'StockCode', 'Description', 'Quantity', 'InvoiceDate', 'UnitPrice', 'CustomerID', 'Country']
****************************************************************************************************
Columns after rename:['invoiceno', 'stockcode', 'description', 'quantity', 'invoicedate', 'unitprice', 'customerid', 'country']
****************************************************************************************************
Columns after rename:['invoiceno', 'stockcode', 'description', 'quantity', 'invoicedate', 'unitprice', 'customerid', 'country']
****************************************************************************************************
Descriptive Statistics 
        quantity  unitprice  customerid
count 541909.000 541909.000  406829.000
mean       9.550      4.610   15287.690
std      218.080     96.760    1713.600
min   -80995.000 -11062.060   12346.000
25%        1.000      1.250   13953.000
50%        3.000      2.080   15152.000
75%       10.000      4.130   16791.000
max    80995.000  38970.000   18287.000
****************************************************************************************************
Descriptive Statistics (Categorical Columns) 
              count unique                                 top    freq
invoiceno    541909  25900                            573585.0    1114
stockcode    541909   4070                              85123A    2313
description  540455   4223  WHITE HANGING HEART T-LIGHT HOLDER    2369
invoicedate  541909  23260                 2011-10-31 14:41:00    1114
country      541909     38                      United Kingdom  495478
****************************************************************************************************
Duplicate check...
There are 5268 duplicated observations in the dataset.

Let’s calculate the total price

df[‘total_price’] = df[‘quantity’] * df[‘unitprice’]
print(“There is”, df.shape[0], “observations and”, df.shape[1], “columns in the dataset”)

There is 541909 observations and 9 columns in the dataset

Let’s plot Invoice Counts Per Country

fig = px.histogram(df, x = df.groupby(‘country’)[‘invoiceno’].nunique().index,
y = df.groupby(‘country’)[‘invoiceno’].nunique().values,
title = ‘Invoice Counts Per Country’,
labels = dict(x = “Countries”, y =”Invoice”))
fig.show();

Invoice Counts Per Country

Let’s group the data

df.groupby([‘invoiceno’,’customerid’, ‘country’])[‘total_price’].sum().sort_values().head()

invoiceno  customerid  country       
C581484    16446.000   United Kingdom   -168469.600
C541433    12346.000   United Kingdom    -77183.600
C556445    15098.000   United Kingdom    -38970.000
C550456    15749.000   United Kingdom    -22998.400
C570556    16029.000   United Kingdom    -11816.640
Name: total_price, dtype: float64

Let’s plot Customer Counts by Country

fig = px.histogram(df, x = df.groupby(‘country’)[‘customerid’].nunique().index,
y = df.groupby(‘country’)[‘customerid’].nunique().values,
title = ‘Customer Counts by Country’,
labels = dict(x = “Countries”, y =”Customer”))
fig.show()

Customer Counts by Country

Let’s check missing values

df.isnull().melt(value_name=”missing”)

Clean the Data from the Noise and Missing Values:
missing_values(df)

Missing numbers

Drop observations with missing values:
df = df.dropna(subset=[“customerid”])

Get rid of cancellations and negative values:
df = df[(df[‘unitprice’] > 0) & (df[‘quantity’] > 0)]

Check for duplicates and get rid of them:
print(“There are”, df.duplicated(subset=None, keep=’first’).sum(), “duplicated observations in the dataset.”)
print(df.duplicated(subset=None, keep=’first’).sum(), “Duplicated observations are dropped!”)
df.drop_duplicates(keep=’first’, inplace=True)

print(“There are”, df.duplicated(subset=None, keep=’first’).sum(), “duplicated observations in the dataset.”)
print(df.duplicated(subset=None, keep=’first’).sum(), “Duplicated observations are dropped!”)
df.drop_duplicates(keep=’first’, inplace=True)

There are 5192 duplicated observations in the dataset.
5192 Duplicated observations are dropped!

df.columns

Index(['invoiceno', 'stockcode', 'description', 'quantity', 'invoicedate',
       'unitprice', 'customerid', 'country', 'total_price'],
      dtype='object')

Explore the Orders:
cprint(“Unique number of invoice per customer”,’red’)
df.groupby(“customerid”)[“invoiceno”].nunique().sort_values(ascending = False)

Unique number of invoice per customer
customerid
12748.000    209
14911.000    201
17841.000    124
13089.000     97
14606.000     93
            ... 
15314.000      1
15313.000      1
15308.000      1
15307.000      1
15300.000      1
Name: invoiceno, Length: 4338, dtype: int64

Let’s plot the country-based wordcloud

from wordcloud import WordCloud

country_text = df[“country”].str.split(” “).str.join(“_”)
all_countries = ” “.join(country_text)

wc = WordCloud(background_color=”red”,
max_words=250,
max_font_size=256,
random_state=42,
width=800, height=400)
wc.generate(all_countries)
plt.figure(figsize = (12, 10))
plt.imshow(wc)
plt.axis(‘off’)
plt.show()

Country-based wordcloud

Let’s focus on the UK online retail market

df_uk = df[df[“country”]==”United Kingdom”]

first_looking(df_uk)
duplicate_values(df_uk)

Shape:(349203, 9)
****************************************************************************************************
Info:

<class 'pandas.core.frame.DataFrame'>
Index: 349203 entries, 0 to 541893
Data columns (total 9 columns):
 #   Column       Non-Null Count   Dtype  
---  ------       --------------   -----  
 0   invoiceno    349203 non-null  object 
 1   stockcode    349203 non-null  object 
 2   description  349203 non-null  object 
 3   quantity     349203 non-null  float64
 4   invoicedate  349203 non-null  object 
 5   unitprice    349203 non-null  float64
 6   customerid   349203 non-null  float64
 7   country      349203 non-null  object 
 8   total_price  349203 non-null  float64
dtypes: float64(4), object(5)
memory usage: 26.6+ MB
None
****************************************************************************************************
Number of Uniques:
invoiceno      16646
stockcode       3645
description     3844
quantity         293
invoicedate    15612
unitprice        402
customerid      3920
country            1
total_price     2792
dtype: int64
****************************************************************************************************
Missing Values:
Empty DataFrame
Columns: [Missing_Number, Missing_Percent]
Index: []
****************************************************************************************************
All Columns:['invoiceno', 'stockcode', 'description', 'quantity', 'invoicedate', 'unitprice', 'customerid', 'country', 'total_price']
****************************************************************************************************
Columns after rename:['invoiceno', 'stockcode', 'description', 'quantity', 'invoicedate', 'unitprice', 'customerid', 'country', 'total_price']
****************************************************************************************************
Columns after rename:['invoiceno', 'stockcode', 'description', 'quantity', 'invoicedate', 'unitprice', 'customerid', 'country', 'total_price']
****************************************************************************************************
Descriptive Statistics 
        quantity  unitprice  customerid  total_price
count 349203.000 349203.000  349203.000   349203.000
mean      12.150      2.970   15548.380       20.860
std      190.630     17.990    1594.380      328.420
min        1.000      0.000   12346.000        0.000
25%        2.000      1.250   14191.000        4.200
50%        4.000      1.950   15518.000       10.200
75%       12.000      3.750   16931.000       17.850
max    80995.000   8142.750   18287.000   168469.600
****************************************************************************************************
Descriptive Statistics (Categorical Columns) 
              count unique                                 top    freq
invoiceno    349203  16646                            576339.0     542
stockcode    349203   3645                              85123A    1936
description  349203   3844  WHITE HANGING HEART T-LIGHT HOLDER    1929
invoicedate  349203  15612                 2011-11-14 15:27:00     542
country      349203      1                      United Kingdom  349203
****************************************************************************************************
Duplicate check...
There are 0 duplicated observations in the dataset.

Let’s prepare the date for the RFM analysis

import datetime as dt
from datetime import datetime
import datetime as dt

df_uk[“ref_date”] = pd.to_datetime(df_uk[‘invoicedate’]).dt.date.max()

df_uk[‘ref_date’] = pd.to_datetime(df_uk[‘ref_date’])

df_uk[“date”] = pd.to_datetime(df_uk[“invoicedate”]).dt.date

df_uk[‘date’] = pd.to_datetime(df_uk[‘date’])

df_uk[‘last_purchase_date’] = df_uk.groupby(‘customerid’)[‘invoicedate’].transform(max)
df_uk[‘last_purchase_date’] = pd.to_datetime(df_uk[‘last_purchase_date’]).dt.date
df_uk[‘last_purchase_date’] = pd.to_datetime(df_uk[‘last_purchase_date’])

df_uk.groupby(‘customerid’)[[‘last_purchase_date’]].max().sample(5)

Last purchase date

df_uk[“customer_recency”] = df_uk[“ref_date”] – df_uk[“last_purchase_date”]

df_uk[‘recency_value’] = pd.to_numeric(df_uk[‘customer_recency’].dt.days.astype(‘int64’))

customer_recency = pd.DataFrame(df_uk.groupby(‘customerid’)[‘recency_value’].min())
customer_recency.rename(columns={‘recency_value’:’recency’}, inplace=True)
customer_recency.reset_index(inplace=True)

df_uk.drop(‘last_purchase_date’, axis=1, inplace=True)

Let’s plot Customer Regency Distribution

fig = px.histogram(df_uk, x = ‘recency_value’, title = ‘Customer Regency Distribution’)
fig.show()

Customer Regency Distribution

Let’s compute Frequency – Number of purchases

df_uk1 = df_uk.copy()
customer_frequency = pd.DataFrame(df_uk.groupby(‘customerid’)[‘invoiceno’].nunique())
customer_frequency.rename(columns={‘invoiceno’:’frequency’}, inplace=True)
customer_frequency.reset_index(inplace=True)

df_uk[‘customer_frequency’] = df_uk.groupby(‘customerid’)[‘invoiceno’].transform(‘count’)

Let’s plot Customer Frequency Distribution

fig = px.histogram(df_uk, x = ‘customer_frequency’, title = ‘Customer Frequency Distribution’)
fig.show()

Customer Frequency Distribution

Let’s calculate Monetary

customer_monetary = pd.DataFrame(df_uk.groupby(‘customerid’)[‘total_price’].sum())
customer_monetary.rename(columns={‘total_price’:’monetary’}, inplace=True)
customer_monetary.reset_index(inplace=True)

df_uk[‘customer_monetary’] = df_uk.groupby(‘customerid’)[‘total_price’].transform(‘sum’)

Let’s plot Customer Monetary Distribution

fig = px.histogram(df_uk, x = ‘customer_monetary’, title = ‘Customer Monetary Distribution’)
fig.show()

Customer Monetary Distribution

Let’s create the RFM Table
customer_rfm = pd.merge(pd.merge(customer_recency, customer_frequency, on=’customerid’), customer_monetary, on=’customerid’)
customer_rfm.head()

Recency, frequency and monetary columns

customer_rfm.set_index(‘customerid’, inplace = True)

quantiles = customer_rfm.quantile(q = [0.25,0.50,0.75])
quantiles

3 quantiles of recency, frequency and monetary columns

Let’s introduce several functions creating the RFM Segmentation Table

def rec_score(x):
if x < 17.000:
return 4
elif 17.000 <= x < 50.000:
return 3
elif 50.000 <= x < 142.000:
return 2
else:
return 1

def freq_score(x):
if x > 5.000:
return 4
elif 5.000 >= x > 2.000:
return 3
elif 2.000 >= x > 1:
return 2
else:
return 1

def mon_score(x):
if x > 1571.285:
return 4
elif 1571.285 >= x > 644.975:
return 3
elif 644.975 >= x > 298.185:
return 2
else:
return 1

Let’s calculate the RFM score

customer_rfm[‘recency_score’] = customer_rfm[‘recency’].apply(rec_score)
customer_rfm[‘frequency_score’] = customer_rfm[‘frequency’].apply(freq_score)
customer_rfm[‘monetary_score’] = customer_rfm[‘monetary’].apply(mon_score)

customer_rfm.info()

<class 'pandas.core.frame.DataFrame'>
Index: 3920 entries, 12346.0 to 18287.0
Data columns (total 6 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   recency          3920 non-null   int64  
 1   frequency        3920 non-null   int64  
 2   monetary         3920 non-null   float64
 3   recency_score    3920 non-null   int64  
 4   frequency_score  3920 non-null   int64  
 5   monetary_score   3920 non-null   int64  
dtypes: float64(1), int64(5)
memory usage: 214.4 KB

Let’s calculate the total RFM score

customer_rfm[‘RFM_score’] = (customer_rfm[‘recency_score’].astype(str) + customer_rfm[‘frequency_score’].astype(str) +
customer_rfm[‘monetary_score’].astype(str))
customer_rfm.sample(5)

RFM score table

Let’s plot Customer_RFM Distribution

fig = px.histogram(customer_rfm, x = customer_rfm[‘RFM_score’].value_counts().index,
y = customer_rfm[‘RFM_score’].value_counts().values,
title = ‘Customer_RFM Distribution’,
labels = dict(x = “RFM_score”, y =”counts”))
fig.show()

Customer_RFM Distribution

Let’s calculate the RFM_level

customer_rfm[‘RFM_level’] = customer_rfm[‘recency_score’] + customer_rfm[‘frequency_score’] + customer_rfm[‘monetary_score’]
customer_rfm.sample(5)

RFM_level

print(‘Min value for RFM_level : ‘, customer_rfm[‘RFM_level’].min())
print(‘Max value for RFM_level : ‘, customer_rfm[‘RFM_level’].max())

Min value for RFM_level :  3
Max value for RFM_level :  12

Let’s introduce the RFM segment function

def segments(df_rfm):
if df_rfm[‘RFM_level’] == 12 :
return ‘champion’
elif (df_rfm[‘RFM_level’] == 11) or (df_rfm[‘RFM_level’] == 10 ):
return ‘loyal_customer’
elif (df_rfm[‘RFM_level’] == 9) or (df_rfm[‘RFM_level’] == 8 ):
return ‘promising’
elif (df_rfm[‘RFM_level’] == 7) or (df_rfm[‘RFM_level’] == 6 ):
return ‘need_attention’
elif (df_rfm[‘RFM_level’] == 5) or (df_rfm[‘RFM_level’] == 4 ):
return ‘hibernating’
else:
return ‘almost_lost’

Let’s call this function to compute customer_segment

customer_rfm[‘customer_segment’] = customer_rfm.apply(segments,axis=1)
customer_rfm.sample(5)

customer_segment

Let’s plot the customer_segment distribution

fig = px.histogram(customer_rfm, x = customer_rfm[‘customer_segment’].value_counts().index,
y = customer_rfm[‘customer_segment’].value_counts().values,
title = ‘Customer_Segment Distribution’,
labels = dict(x = “customer_segment”, y =”counts”))
fig.show()

customer_segment distribution

Let’s calculate the RFM_level Mean Values

avg_rfm_segment = customer_rfm.groupby(‘customer_segment’).RFM_level.mean().sort_values(ascending=False)
avg_rfm_segment

customer_segment
champion         12.000
loyal_customer   10.482
promising         8.530
need_attention    6.479
hibernating       4.499
almost_lost       3.000
Name: RFM_level, dtype: float64

Let’s plot the Customer_Segment RFM_level Mean Values

fig = px.histogram(customer_rfm, x = customer_rfm.groupby(‘customer_segment’).RFM_level.mean().sort_values(ascending=False).index,
y = customer_rfm.groupby(‘customer_segment’).RFM_level.mean().sort_values(ascending=False).values,
title = ‘Customer_Segment RFM_level Mean Values’,
labels = dict(x = “customer_segment”, y =”RFM_level Mean Values”))
fig.show()

Customer_Segment RFM_level Mean Values

Let’s calculate the size of RFM_Segment

size_rfm_segment = customer_rfm[‘customer_segment’].value_counts()
size_rfm_segment

hibernating       918
promising         783
need_attention    771
loyal_customer    650
champion          402
almost_lost       396
Name: count, dtype: int64

Let’s plot Size RFM_Segment

fig = px.histogram(customer_rfm, x = customer_rfm[‘customer_segment’].value_counts().index,
y = customer_rfm[‘customer_segment’].value_counts().values,
title = ‘Size RFM_Segment’,
labels = dict(x = “customer_segment”, y =”Size RFM_Segment”))
fig.show()

Size RFM_Segment

Final output table:

customer_segment = pd.DataFrame(pd.concat([avg_rfm_segment, size_rfm_segment], axis=1))
customer_segment.rename(columns={‘RFM_level’: ‘avg_RFM_level’, ‘customer_segment’: ‘segment_size’}, inplace=True)
customer_segment

avg_RFM_level count

Let’s plot the wordcloud of RFM Segments

from wordcloud import WordCloud

segment_text = customer_rfm[“customer_segment”].str.split(” “).str.join(“_”)
all_segments = ” “.join(segment_text)

wc = WordCloud(background_color=”orange”,
max_words=250,
max_font_size=256,
random_state=42,
width=800, height=400)
wc.generate(all_segments)
plt.figure(figsize = (16, 15))
plt.imshow(wc)
plt.title(“RFM Segments”,fontsize=18,fontweight=”bold”)
plt.axis(‘off’)
plt.show()

Wordcloud of RFM Segments

Let’s check the structure and content of customer_rfm

first_looking(customer_rfm)
duplicate_values(customer_rfm)

Shape:(3920, 9)
****************************************************************************************************
Info:

<class 'pandas.core.frame.DataFrame'>
Index: 3920 entries, 12346.0 to 18287.0
Data columns (total 9 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   recency           3920 non-null   int64  
 1   frequency         3920 non-null   int64  
 2   monetary          3920 non-null   float64
 3   recency_score     3920 non-null   int64  
 4   frequency_score   3920 non-null   int64  
 5   monetary_score    3920 non-null   int64  
 6   RFM_score         3920 non-null   object 
 7   RFM_level         3920 non-null   int64  
 8   customer_segment  3920 non-null   object 
dtypes: float64(1), int64(6), object(2)
memory usage: 306.2+ KB
None
****************************************************************************************************
Number of Uniques:
recency              302
frequency             57
monetary            3849
recency_score          4
frequency_score        4
monetary_score         4
RFM_score             61
RFM_level             10
customer_segment       6
dtype: int64
****************************************************************************************************
Missing Values:
Empty DataFrame
Columns: [Missing_Number, Missing_Percent]
Index: []
****************************************************************************************************
All Columns:['recency', 'frequency', 'monetary', 'recency_score', 'frequency_score', 'monetary_score', 'RFM_score', 'RFM_level', 'customer_segment']
****************************************************************************************************
Columns after rename:['recency', 'frequency', 'monetary', 'recency_score', 'frequency_score', 'monetary_score', 'rfm_score', 'rfm_level', 'customer_segment']
****************************************************************************************************
Columns after rename:['recency', 'frequency', 'monetary', 'recency_score', 'frequency_score', 'monetary_score', 'rfm_score', 'rfm_level', 'customer_segment']
****************************************************************************************************
Descriptive Statistics 
       recency  frequency   monetary  recency_score  frequency_score   
count 3920.000   3920.000   3920.000       3920.000         3920.000  \
mean    91.740      4.250   1858.420          2.480            2.320   
std     99.530      7.200   7478.630          1.110            1.150   
min      0.000      1.000      3.750          1.000            1.000   
25%     17.000      1.000    298.180          1.000            1.000   
50%     50.000      2.000    644.970          2.000            2.000   
75%    142.000      5.000   1571.280          3.000            3.000   
max    373.000    209.000 259657.300          4.000            4.000   

       monetary_score  rfm_level  
count        3920.000   3920.000  
mean            2.500      7.300  
std             1.120      2.880  
min             1.000      3.000  
25%             1.750      5.000  
50%             2.500      7.000  
75%             3.250     10.000  
max             4.000     12.000  
****************************************************************************************************
Descriptive Statistics (Categorical Columns) 
                 count unique          top freq
rfm_score         3920     61          444  402
customer_segment  3920      6  hibernating  918
****************************************************************************************************
Duplicate check...
There are 0 duplicated observations in the dataset.

customer_rfm.groupby(‘rfm_level’).agg({‘recency’: [‘mean’,’min’,’max’,’count’],
‘frequency’: [‘mean’,’min’,’max’,’count’],
‘monetary’: [‘mean’,’min’,’max’,’count’] }).round(1)

RFM summary table

Let’s check RFM correlations

plt.figure(figsize = (20,20))
sns.pairplot(customer_rfm[[‘recency’, ‘frequency’, ‘monetary’,’customer_segment’]], hue = ‘customer_segment’);

RFM pairplots

Let’s compute the RFM correlation matrix

matrix = np.triu(customer_rfm[[‘recency’,’frequency’,’monetary’]].corr())
fig, ax = plt.subplots(figsize=(14,10))
sns.heatmap (customer_rfm[[‘recency’,’frequency’,’monetary’]].corr(), annot=True, fmt= ‘.2f’, vmin=-1, vmax=1, center=0, cmap=’coolwarm’,mask=matrix, ax=ax);

RFM correlation matrix

Read more here.

RFM TreeMap: Online Retail Dataset II

Let’s import libraries and read the Online Retail II dataset Year 2010-2011

import datetime as dt
import pandas as pd
pd.set_option(‘display.max_columns’, None)
pd.set_option(‘display.max_rows’, None)
pd.set_option(‘display.float_format’, lambda x: ‘%.2f’ % x)

df_ = pd.read_excel(“online_retail_II.xlsx”,sheet_name=”Year 2010-2011″ )
df = df_.copy()

Let’s check the overall data structure

def check_df(dataframe):
print(“################ Shape ####################”)
print(dataframe.shape)
print(“############### Columns ###################”)
print(dataframe.columns)
print(“############### Types #####################”)
print(dataframe.dtypes)
print(“############### Head ######################”)
print(dataframe.head())
print(“############### Tail ######################”)
print(dataframe.tail())
print(“############### Describe ###################”)
print(dataframe.describe().T)

check_df(df)

################ Shape ####################
(541910, 8)
############### Columns ###################
Index(['Invoice', 'StockCode', 'Description', 'Quantity', 'InvoiceDate',
       'Price', 'Customer ID', 'Country'],
      dtype='object')
############### Types #####################
Invoice                object
StockCode              object
Description            object
Quantity                int64
InvoiceDate    datetime64[ns]
Price                 float64
Customer ID           float64
Country                object
dtype: object
############### Head ######################
  Invoice StockCode                          Description  Quantity   
0  536365    85123A   WHITE HANGING HEART T-LIGHT HOLDER         6  \
1  536365     71053                  WHITE METAL LANTERN         6   
2  536365    84406B       CREAM CUPID HEARTS COAT HANGER         8   
3  536365    84029G  KNITTED UNION FLAG HOT WATER BOTTLE         6   
4  536365    84029E       RED WOOLLY HOTTIE WHITE HEART.         6   

          InvoiceDate  Price  Customer ID         Country  
0 2010-12-01 08:26:00   2.55     17850.00  United Kingdom  
1 2010-12-01 08:26:00   3.39     17850.00  United Kingdom  
2 2010-12-01 08:26:00   2.75     17850.00  United Kingdom  
3 2010-12-01 08:26:00   3.39     17850.00  United Kingdom  
4 2010-12-01 08:26:00   3.39     17850.00  United Kingdom  
############### Tail ######################
       Invoice StockCode                      Description  Quantity   
541905  581587     22899     CHILDREN'S APRON DOLLY GIRL          6  \
541906  581587     23254    CHILDRENS CUTLERY DOLLY GIRL          4   
541907  581587     23255  CHILDRENS CUTLERY CIRCUS PARADE         4   
541908  581587     22138    BAKING SET 9 PIECE RETROSPOT          3   
541909  581587      POST                          POSTAGE         1   

               InvoiceDate  Price  Customer ID Country  
541905 2011-12-09 12:50:00   2.10     12680.00  France  
541906 2011-12-09 12:50:00   4.15     12680.00  France  
541907 2011-12-09 12:50:00   4.15     12680.00  France  
541908 2011-12-09 12:50:00   4.95     12680.00  France  
541909 2011-12-09 12:50:00  18.00     12680.00  France  
############### Describe ###################
                count                           mean                  min   
Quantity    541910.00                           9.55            -80995.00  \
InvoiceDate    541910  2011-07-04 13:35:22.342307584  2010-12-01 08:26:00   
Price       541910.00                           4.61            -11062.06   
Customer ID 406830.00                       15287.68             12346.00   

                             25%                  50%                  75%   
Quantity                    1.00                 3.00                10.00  \
InvoiceDate  2011-03-28 11:34:00  2011-07-19 17:17:00  2011-10-19 11:27:00   
Price                       1.25                 2.08                 4.13   
Customer ID             13953.00             15152.00             16791.00   

                             max     std  
Quantity                80995.00  218.08  
InvoiceDate  2011-12-09 12:50:00     NaN  
Price                   38970.00   96.76  
Customer ID             18287.00 1713.60  
####### DATA PREPARATION

df.isnull().any()
df.isnull().sum()
df.dropna(inplace=True)

df[“StockCode”].nunique()

df.groupby(“StockCode”)[“Quantity”].sum()

df.groupby(“StockCode”).agg({“Quantity”:”sum”}).sort_values(by=”Quantity”, ascending=False).head()

df = df[~df[“Invoice”].str.contains(“C”, na=False)]

df[“TotalPrice”] = df[“Price”] * df[“Quantity”]

df[“InvoiceDate”].max()
today_date = dt.datetime(2011, 12, 11)

rfm = df.groupby(“Customer ID”).agg({“InvoiceDate”: lambda InvıiceDate: (today_date- InvıiceDate.max()).days,
“Invoice”: lambda Invoice: Invoice.nunique(),
“TotalPrice”: lambda TotalPrice: TotalPrice.sum()})

rfm.columns = [“recency”,”frequency”,”monetary”]
rfm = rfm[rfm[“monetary”] > 0]

####### RFM CALCULATION

rfm[“recency_score”] = pd.qcut(rfm[‘recency’], 5, labels=[5, 4, 3, 2, 1])

rfm[“frequency_score”] = pd.qcut(rfm[‘frequency’].rank(method=”first”), 5, labels=[1, 2, 3, 4, 5])

rfm[“monetary_score”] = pd.qcut(rfm[‘monetary’], 5, labels=[1, 2, 3, 4, 5])

rfm[“RFM_SCORE”] = (rfm[‘recency_score’].astype(str) + rfm[‘frequency_score’].astype(str))
rfm.head()

rfm.describe().T

rfm[rfm[“RFM_SCORE”] == “55”].head() # champions

rfm[rfm[“RFM_SCORE”] == “11”].head() # hibernating

RFM score table

#######CUSTOMER SEGMENTATION

seg_map = {
r'[1-2][1-2]’: ‘hibernating’,
r'[1-2][3-4]’: ‘at_Risk’,
r'[1-2]5′: ‘cant_loose’,
r’3[1-2]’: ‘about_to_sleep’,
r’33’: ‘need_attention’,
r'[3-4][4-5]’: ‘loyal_customers’,
r’41’: ‘promising’,
r’51’: ‘new_customers’,
r'[4-5][2-3]’: ‘potential_loyalists’,
r’5[4-5]’: ‘champions’
}

rfm[‘segment’] = rfm[‘RFM_SCORE’].replace(seg_map, regex=True)

rfm[[“segment”, “recency”, “frequency”, “monetary”]].groupby(“segment”).agg([“mean”, “count”])

rfm[rfm[“segment”] == “need_attention”].head()

rfm[rfm[“segment”] == “new_customers”].index

seg_list = [“at_Risk”, “hibernating”, “cant_loose”, “loyal_customers”]
for i in seg_list:
print(F” {i.upper()} “.center(50, “*”))
print(rfm[rfm[“segment”]==i].describe().T)

rfm[rfm[“segment”] == “loyal_customers”].index
loyalCus_df = pd.DataFrame()
loyalCus_df[“loyal_customersId”] = rfm[rfm[“segment”] == “loyal_customers”].index
loyalCus_df.head()

loyalCus_df.to_csv(“loyal_customers.csv”)

******************** AT_RISK *********************
           count    mean     std   min    25%    50%     75%      max
recency   593.00  153.79   68.62 73.00  96.00 139.00  195.00   374.00
frequency 593.00    2.88    0.95  2.00   2.00   3.00    3.00     6.00
monetary  593.00 1084.54 2562.07 52.00 412.78 678.25 1200.62 44534.30
****************** HIBERNATING *******************
            count   mean     std   min    25%    50%    75%      max
recency   1071.00 217.61   92.01 73.00 135.00 219.00 289.50   374.00
frequency 1071.00   1.10    0.30  1.00   1.00   1.00   1.00     2.00
monetary  1071.00 488.64 2419.68  3.75 155.11 296.25 457.93 77183.60
******************* CANT_LOOSE *******************
           count    mean     std   min     25%     50%     75%      max
recency    63.00  132.97   65.25 73.00   89.00  108.00  161.00   373.00
frequency  63.00    8.38    4.29  6.00    6.00    7.00    9.00    34.00
monetary   63.00 2796.16 2090.49 70.02 1137.51 2225.97 3532.24 10254.18
**************** LOYAL_CUSTOMERS *****************
           count    mean     std   min    25%     50%     75%       max
recency   819.00   33.61   15.58 15.00  20.00   30.00   44.00     72.00
frequency 819.00    6.48    4.55  3.00   4.00    5.00    8.00     63.00
monetary  819.00 2864.25 6007.06 36.56 991.80 1740.48 3052.91 124914.53

#####CREATE RFM

def create_rfm(dataframe):

dataframe["TotalPrice"] = dataframe["Quantity"] * dataframe["Price"]
dataframe.dropna(inplace=True)
dataframe = dataframe[~dataframe["Invoice"].str.contains("C", na=False)]

today_date = dt.datetime(2011, 12, 11)
rfm = dataframe.groupby('Customer ID').agg({'InvoiceDate': lambda date: (today_date - date.max()).days,
                                            'Invoice': lambda num: num.nunique(),
                                            "TotalPrice": lambda price: price.sum()})
rfm.columns = ['recency', 'frequency', "monetary"]
rfm = rfm[(rfm['monetary'] > 0)]


rfm["recency_score"] = pd.qcut(rfm['recency'], 5, labels=[5, 4, 3, 2, 1])
rfm["frequency_score"] = pd.qcut(rfm["frequency"].rank(method="first"), 5, labels=[1, 2, 3, 4, 5])
rfm["monetary_score"] = pd.qcut(rfm['monetary'], 5, labels=[1, 2, 3, 4, 5])


rfm["RFM_SCORE"] = (rfm['recency_score'].astype(str) +
                    rfm['frequency_score'].astype(str))

rfm.head()

RFM segment

rfm[‘score’]=rfm[‘recency’]rfm[‘frequency’]rfm[‘monetary’]

Statistical informations of segments
rfm[[“segment”, “recency”, “frequency”, “monetary”]].groupby(“segment”).agg([“mean”, “count”])

How many segments are there
segments_counts = rfm[‘segment’].value_counts().sort_values(ascending=True)

Let’s plot our RFM segments

fig, ax = plt.subplots()

bars = ax.barh(range(len(segments_counts)),
segments_counts,
color=’lightcoral’)
ax.set_frame_on(False)
ax.tick_params(left=False,
bottom=False,
labelbottom=False)
ax.set_yticks(range(len(segments_counts)))
ax.set_yticklabels(segments_counts.index)

for i, bar in enumerate(bars):
value = bar.get_width()
if segments_counts.index[i] in [‘Can\’t loose’]:
bar.set_color(‘firebrick’)
ax.text(value,
bar.get_y() + bar.get_height()/2,
‘{:,} ({:}%)’.format(int(value),
int(value*100/segments_counts.sum())),
va=’center’,
ha=’left’
)

plt.show(block=True)

RFM segments bar plot

Let’s plot the RFM segmentation treemap

import matplotlib.pyplot as plt
import squarify
sizes=[24, 18, 14, 13,11,8,4,3,2,1]
label=[“hibernating”, “loyal_customers”, ‘champions’, ‘at_risk’,’potential_loyalists’,’about_to_sleep’,’need_attention’,’promising’,’CL’,’NC’]

colors = [‘#91DCEA’, ‘#64CDCC’, ‘#5FBB68’,
‘#F9D23C’, ‘#F9A729’, ‘#FD6F30′,’grey’,’red’,’blue’,’cyan’]
squarify.plot(sizes=sizes, label=label, color=colors, alpha=0.6 )
plt.show()
plt.axis(“off”)

RFM segmentation treemap

where CL = cant_loose and NC = new_cutomers.

Read more here.

CRM Analytics: CLTV

Let’s look at the CRM Analytics – CLTV (Customer Lifetime Value).

Importing libraries:
import datetime as dt
import pandas as pd
import warnings
warnings.filterwarnings(‘ignore’)
import matplotlib.pyplot as plt

and set display setting
pd.set_option(“display.max_columns”, None)
pd.set_option(“display.float_format”, lambda x: “%.3f” % x)

####### DATA PREPARATION

Reading the Online Retail II dataset and creating a copy of it
df_ = pd.read_excel(“online_retail.xlsx”)
df = df_.copy()
df.head()

Online Retail dataset II

df.shape

(525461, 8)

Number of unique products
df[“Description”].nunique()

4681

value_counts() of each product
df[“Description”].value_counts().head()

WHITE HANGING HEART T-LIGHT HOLDER    3549
REGENCY CAKESTAND 3 TIER              2212
STRAWBERRY CERAMIC TRINKET BOX        1843
PACK OF 72 RETRO SPOT CAKE CASES      1466
ASSORTED COLOUR BIRD ORNAMENT         1457
Name: Description, dtype: int64

What is the most ordered product
df.groupby(“Description”).agg({“Quantity”: “sum”}).head()

Missing value check
df.isnull().sum

Customers with missing Customer ID must be removed from the dataset
df.dropna(inplace =True)

Adding the total price on the basis of products to the data set as a variable
df[“TotalPrice”] = df[“Quantity”] * df[“Price”]

Let’s remove attribute values started with “C” (returned products)

df = df[~df[“Invoice”].str.contains(“C”, na=False)]

with the descriptive statistics of the dataset
df.describe().T

Descriptive statistics of the dataset

######### CALCULATION OF RFM METRICS

Last date in the InvoiceDate
df[“InvoiceDate”].max()

Timestamp('2010-12-09 20:01:00')

Let’s set 2 days after the last date
analysis_date = dt.datetime(2010, 12, 11)

Let’s introduce the following RFM columns

rfm = df.groupby(“Customer ID”).agg({“InvoiceDate”: lambda invoice_date: (analysis_date – invoice_date.max()).days,
“Invoice”: lambda invoice: invoice.nunique(),
“TotalPrice”: lambda total_price: total_price.sum()})
rfm.columns = [“recency”, “frequency”, “monetary”]
rfm.head()

while excluding monetary<=0

rfm = rfm[rfm[“monetary”] > 0]

and checking descriptive statistics

rfm.describe().T

RFM descriptive statistics

RECENCY SCORE
rfm[“recency_score”] = pd.qcut(rfm[“recency”], 5, labels=[5, 4, 3, 2, 1])

FREQUENCY SCORE
rfm[“frequency_score”] = pd.qcut(rfm[“frequency”].rank(method=”first”), 5, labels=[1, 2, 3, 4, 5])

MONETARY SCORE
rfm[“monetary_score”] = pd.qcut(rfm[“monetary”], 5, labels=[1, 2, 3, 4, 5])

RF SCORE
rfm[“RF_SCORE”] = (rfm[“recency_score”].astype(str) + rfm[“frequency_score”].astype(str))

rfm.head()

RFM and  RF_score columns

Let’s convert RF_SCORE to segment with regex=True

seg_map = {
r'[1-2][1-2]’: ‘hibernating’,
r'[1-2][3-4]’: ‘at_Risk’,
r'[1-2]5′: ‘cant_loose’,
r’3[1-2]’: ‘about_to_sleep’,
r’33’: ‘need_attention’,
r'[3-4][4-5]’: ‘loyal_customers’,
r’41’: ‘promising’,
r’51’: ‘new_customers’,
r'[4-5][2-3]’: ‘potential_loyalists’,
r’5[4-5]’: ‘champions’
}

rfm[‘segment’] = rfm[‘RF_SCORE’].replace(seg_map, regex=True)

rfm.head()

RFM, RF_scoe and segment columns

Let’s compute mean and count of our RFM columns sorted by segment in the ascending order

rfm[[“segment”, “recency”, “frequency”, “monetary”]].groupby(“segment”).agg([“mean”, “count”])

segments_counts = rfm[‘segment’].value_counts().sort_values(ascending=True)

Let’s plot our RFM segments

fig, ax = plt.subplots()

bars = ax.barh(range(len(segments_counts)),
segments_counts,
color=’lightcoral’)
ax.set_frame_on(False)
ax.tick_params(left=False,
bottom=False,
labelbottom=False)
ax.set_yticks(range(len(segments_counts)))
ax.set_yticklabels(segments_counts.index)

for i, bar in enumerate(bars):
value = bar.get_width()
if segments_counts.index[i] in [‘Can\’t loose’]:
bar.set_color(‘firebrick’)
ax.text(value,
bar.get_y() + bar.get_height()/2,
‘{:,} ({:}%)’.format(int(value),
int(value*100/segments_counts.sum())),
va=’center’,
ha=’left’
)

plt.show(block=True)

Barplot of RFM segments

########## CLTV CALCULATION

Let’s prepare our input data

cltv_c = df.groupby(‘Customer ID’).agg({“Invoice”: lambda x: x.nunique(),
“Quantity”: lambda x: x.sum(),
“TotalPrice”: lambda x: x.sum()})

cltv_c.columns = [“total_transaction”, “total_unit”, “total_price”]

Let’s calculate the following metrics:

profit margin:
cltv_c[“profit_margin”] = cltv_c[“total_price”] * 0.10

Churn Rate:

repeat_rate = cltv_c[cltv_c[“total_transaction”] > 1].shape[0] / cltv_c.shape[0]

churn_rate = 1 – repeat_rate

Purchase Frequency:
cltv_c[“purchase_frequency”] = cltv_c[“total_transaction”] / cltv_c.shape[0]

Average Order Value:
cltv_c[“average_order_value”] = cltv_c[“total_price”] / cltv_c[“total_transaction”]

Customer Value:
cltv_c[“customer_value”] = cltv_c[“average_order_value”] * cltv_c[“purchase_frequency”]

CLTV (Customer Lifetime Value):
cltv_c[“cltv”] = (cltv_c[“customer_value”] / churn_rate) * cltv_c[“profit_margin”]

Sorting our dataset by cltv in the descending order
cltv_c.sort_values(by = “cltv”, ascending=False).head()

Input data with cltv

Eliminating outliers with interquantile_range
def outlier_threshold(dataframe, variable):
quartile1 = dataframe[variable].quantile(0.01)
quartile3 = dataframe[variable].quantile(0.99)
interquantile_range = quartile3 – quartile1
up_limit = quartile3 + 1.5 * interquantile_range
low_limit = quartile1 – 1.5 * interquantile_range
return low_limit, up_limit

def replace_with_thresholds(dataframe, variable):
low_limit, up_limit = outlier_threshold(dataframe, variable)
# dataframe.loc[(dataframe[variable] < low_limit), variable] = low_limit dataframe.loc[(dataframe[variable] > up_limit), variable] = up_limit

replace_with_thresholds(df, “Quantity”)
replace_with_thresholds(df, “Price”)

Define the analysis date

df[“InvoiceDate”].max()
analysis_date = dt.datetime(2011, 12, 11)

############ Preparation of Lifetime Data Structure

cltv_df = df.groupby(“Customer ID”).agg({“InvoiceDate”: [lambda InvoiceDate: (InvoiceDate.max() – InvoiceDate.min()).days / 7,
lambda InvoiceDate: (analysis_date – InvoiceDate.min()).days / 7],
“Invoice”: lambda Invoice: Invoice.nunique(),
“TotalPrice”: lambda TotalPrice: TotalPrice.sum()})

cltv_df.columns = cltv_df.columns.droplevel(0)

cltv_df.columns = [“recency”, “T”, “frequency”, “monetary”]

cltv_df[“monetary”] = cltv_df[“monetary”] / cltv_df[“frequency”]

cltv_df = cltv_df[cltv_df[“frequency”] > 1]

cltv_df.head()

Data columns recency, T, frequency, and monetary.

cltv_df.describe().T

Descriptive statistics of data columns recency, T, frequency, and monetary.

#######CLTV MODELLING

Let’s install and import the following libraries

!pip install Lifetimes

from lifetimes import BetaGeoFitter
from lifetimes import BetaGeoFitter
from lifetimes import GammaGammaFitter
from lifetimes.plotting import plot_period_transactions

import datetime as dt
import pandas as pd
import warnings
warnings.filterwarnings(‘ignore’)
import matplotlib.pyplot as plt
pd.set_option(‘display.max_columns’, None)
pd.set_option(‘display.float_format’, lambda x: ‘%.5f’ % x)
pd.set_option(‘display.width’, 500)

######BG/NBD Model

bgf = BetaGeoFitter(penalizer_coef=0.001)

bgf.fit(cltv_df[“frequency”],
cltv_df[“recency”],
cltv_df[“T”])

<lifetimes.BetaGeoFitter: fitted with 2893 subjects, a: 1.93, alpha: 9.47, b: 6.27, r: 2.22>

Let’s look at the first 10 customers with the highest purchase:
bgf.conditional_expected_number_of_purchases_up_to_time(1, # number of weeks
cltv_df[“frequency”],
cltv_df[“recency”],
cltv_df[“T”]).sort_values(ascending=False).head(10)

Customer ID
15989.00000   0.00756
16720.00000   0.00722
14119.00000   0.00720
16204.00000   0.00715
17591.00000   0.00714
15169.00000   0.00700
17193.00000   0.00698
17251.00000   0.00696
17411.00000   0.00690
17530.00000   0.00680
dtype: float64

We can do same thing with predict:
bgf.predict(1, #number of weeks
cltv_df[“frequency”],
cltv_df[“recency”],
cltv_df[“T”]).sort_values(ascending=False).head(10)

Let’s apply predict for 1 month:
bgf.predict(4, # 4 weeks = 1 month
cltv_df[“frequency”],
cltv_df[“recency”],
cltv_df[“T”]).sort_values(ascending=False).head(10)

Customer ID
15989.00000   0.02986
16720.00000   0.02849
14119.00000   0.02841
16204.00000   0.02825
17591.00000   0.02816
15169.00000   0.02763
17193.00000   0.02756
17251.00000   0.02748
17411.00000   0.02726
17530.00000   0.02682
dtype: float64

Let’s compare expected purchases for several time intervals:

Expected purchases for 1 months
cltv_df[“expected_purchase_1_month”] = bgf.predict(4,
cltv_df[“frequency”],
cltv_df[“recency”],
cltv_df[“T”])

Expected purchases for 3 months
cltv_df[“expected_purchase_3_month”] = bgf.predict(4 * 3,
cltv_df[“frequency”],
cltv_df[“recency”],
cltv_df[“T”])

Expected purchases for 1 year
cltv_df[“expected_purchase_1_year”] = bgf.predict(4 * 12,
cltv_df[“frequency”],
cltv_df[“recency”],
cltv_df[“T”])
cltv_df.head()

Expected purchases for 1, 3, and 12 months

Evaluation of predicted results:
plot_period_transactions(bgf)
plt.show(block=True)

Frequency of repeat transactions: actual vs model.

Let’s plot the FR matrix

from lifetimes.plotting import plot_frequency_recency_matrix
import matplotlib.pyplot as plt
fig = plt.figure(figsize=(12,8))
plot_frequency_recency_matrix(bgf)
plt.show(block=True)

Expected number of future purchases for 1 unit of time, by frequency and recency of a customer

############ Gamma-Gamma Submodel

ggf = GammaGammaFitter(penalizer_coef=0.01)

ggf.fit(cltv_df[“frequency”], cltv_df[“monetary”])

cltv_df[“expected_average_profit”] = ggf.conditional_expected_average_profit(cltv_df[“frequency”],
cltv_df[“monetary”])

cltv_df.sort_values(“expected_average_profit”, ascending=False).head(10)

######### Calculation of CLTV with BG/NBD and GG Model

cltv = ggf.customer_lifetime_value(bgf,
cltv_df[“frequency”],
cltv_df[“recency”],
cltv_df[“T”],
cltv_df[“monetary”],
time=3, # 3 month
freq=”W”, # frequency of T
discount_rate=0.01)

cltv = cltv.reset_index()

cltv_final = cltv_df.merge(cltv, on=”Customer ID”, how=”left”)

cltv_final.sort_values(by=”clv”, ascending=False).head(10)

########## Segmentation according to CLTV

cltv_final[“segment”] = pd.qcut(cltv_final[“clv”], 4, labels=[“D”, “C”, “B”, “A”])

cltv_final.groupby(“segment”).agg({“count”, “mean”, “sum”})

_df_plt = cltv_final[[“expected_average_profit”, “segment”]]
plot_df = _df_plt.groupby(“segment”).agg(“mean”)
plot_df.plot(kind=”bar”, color=”tomato”,figsize=(15, 8))
plt.ylabel(“mean”)

plt.show(block=True)

Expected average profit for CLTV segments A-D

Read more here.

Cohort Analysis in Online Retail

Let’s dive into Cohort Analysis using the Online Retail Dataset.

Let’s install and import standard libraries

!pip install sklearn-pandas

import pandas as pd
import numpy as np
import datetime as dt
import seaborn as sns

from sklearn import preprocessing
import sklearn
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import make_classification
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn_pandas import DataFrameMapper
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import Normalizer
from matplotlib import pyplot as plt

Reading the dataset
df = pd.read_excel(“online_retail.xlsx”)

#############DATA PREPARATION

df.shape

(525461, 8)

print(‘Duplicate entries: {}’.format(df.duplicated().sum()))
print(‘{}% rows are duplicate.’.format(round((df.duplicated().sum()/df.shape[0])*100),2))
df.drop_duplicates(inplace = True)

Duplicate entries: 6865
1% rows are duplicate.

pd.DataFrame([{‘products’: len(df[‘StockCode’].value_counts()),
‘transactions’: len(df[‘Invoice’].value_counts()),
‘customers’: len(df[‘Customer ID’].value_counts()),
}], columns = [‘products’, ‘transactions’, ‘customers’], index = [‘quantity’])

products transactions customers

quantity 4632 28816 4383

temp = df.groupby([‘Country’],as_index=False).agg({‘Invoice’:’nunique’}).rename(columns = {‘Invoice’:’Orders’})
total = temp[‘Orders’].sum(axis=0)
temp[‘%Orders’] = round((temp[‘Orders’]/total)*100,4)

temp.sort_values(by=[‘%Orders’],ascending=False,inplace=True)
temp.reset_index(drop=True,inplace=True)

Let’s plot %Orders vs Country

plt.figure(figsize=(13,6))
splot=sns.barplot(x=”Country”,y=”%Orders”,data=temp[:10])
for p in splot.patches:
splot.annotate(format(p.get_height(), ‘.1f’),
(p.get_x() + p.get_width() / 2., p.get_height()),
ha = ‘center’, va = ‘center’,
xytext = (0, 9),
textcoords = ‘offset points’)
plt.xlabel(“Country”, size=14)
plt.ylabel(“%Orders”, size=14)

%Orders vs Country

invoices = df[‘Invoice’]

x = invoices.str.contains(‘C’, regex=True)
x.fillna(0, inplace=True)

x = x.astype(int)

x.value_counts()

0    508414
1     10182
Name: Invoice, dtype: int64

df[‘order_canceled’] = x
#df.head()

n1 = df[‘order_canceled’].value_counts()[1]
n2 = df.shape[0]
print(‘Number of orders canceled: {}/{} ({:.2f}%) ‘.format(n1, n2, n1/n2*100))

Number of orders canceled: 10182/518596 (1.96%) 

df = df.loc[df[‘order_canceled’] == 0,:]

df.reset_index(drop=True,inplace=True)

df.reset_index(drop=True,inplace=True)

df.loc[df[‘Quantity’] < 0,:]

df = df[df[‘Customer ID’].notna()]

df.reset_index(drop=True,inplace=True)

#############NL COHORT INDEX ANALYSIS

df_uk = df[df.Country == ‘Netherlands’]

df_uk.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2729 entries, 4346 to 387139
Data columns (total 9 columns):
 #   Column          Non-Null Count  Dtype         
---  ------          --------------  -----         
 0   Invoice         2729 non-null   object        
 1   StockCode       2729 non-null   object        
 2   Description     2729 non-null   object        
 3   Quantity        2729 non-null   int64         
 4   InvoiceDate     2729 non-null   datetime64[ns]
 5   Price           2729 non-null   float64       
 6   Customer ID     2729 non-null   float64       
 7   Country         2729 non-null   object        
 8   order_canceled  2729 non-null   int32         
dtypes: datetime64[ns](1), float64(2), int32(1), int64(1), object(4)
memory usage: 202.5+ KB

pd.DataFrame([{‘products’: len(df[‘StockCode’].value_counts()),
‘transactions’: len(df[‘Invoice’].value_counts()),
‘customers’: len(df[‘Customer ID’].value_counts()),
}], columns = [‘products’, ‘transactions’, ‘customers’], index = [‘Quantity’])

products transactions customers

Quantity 4017 19215 4314

cohort_data = df_uk[[‘Invoice’,’StockCode’,’Description’,’Quantity’,’InvoiceDate’,’Price’,’Customer ID’,’Country’]]

Checking for nulls in the data
cohort_data.isnull().sum()

Invoice        0
StockCode      0
Description    0
Quantity       0
InvoiceDate    0
Price          0
Customer ID    0
Country        0
dtype: int64

all_dates = (pd.to_datetime(cohort_data[‘InvoiceDate’])).apply(lambda x:x.date())

(all_dates.max() – all_dates.min()).days

364

Start and end dates:
print(‘Start date: {}’.format(all_dates.min()))
print(‘End date: {}’.format(all_dates.max()))

Start date: 2009-12-02
End date: 2010-12-01

cohort_data.head()

InvoiceStockCodeDescriptionQuantityInvoiceDatePriceCustomer IDCountry

def get_month(x):
return dt.datetime(x.year, x.month, 1)

cohort_data[‘InvoiceMonth’] = cohort_data[‘InvoiceDate’].apply(get_month)
grouping = cohort_data.groupby(‘Customer ID’)[‘InvoiceMonth’]
cohort_data[‘CohortMonth’] = grouping.transform(‘min’)
cohort_data.head()

InvoiceStockCodeDescriptionQuantityInvoiceDatePriceCustomer IDCountryInvoiceMonthCohortMonth

def get_date_int(df, column):
year = df[column].dt.year
month = df[column].dt.month
day = df[column].dt.day
return year, month, day

invoice_year, invoice_month, _ = get_date_int(cohort_data, ‘InvoiceMonth’)
cohort_year, cohort_month, _ = get_date_int(cohort_data, ‘CohortMonth’)
years_diff = invoice_year – cohort_year
months_diff = invoice_month – cohort_month
cohort_data[‘CohortIndex’] = years_diff * 12 + months_diff
#cohort_data.head()

grouping = cohort_data.groupby([‘CohortMonth’, ‘CohortIndex’])

cohort_data = grouping[‘Quantity’].mean()
cohort_data = cohort_data.reset_index()
average_quantity = cohort_data.pivot(index=’CohortMonth’,columns=’CohortIndex’,values=’Quantity’)
average_quantity.round(1)
#average_quantity

Let’s plot retention rates heatmap

plt.figure(figsize=(10, 8))
plt.title(‘Retention rates’)
sns.heatmap(data = retention,annot = True,fmt = ‘.0%’,vmin = 0.0,vmax = 0.5,cmap = ‘BuGn’)
plt.show()

Retention rates vs Cohort Month/Index heatmap

Let’s compute TotalPrice

grdf=grouping.apply(pd.DataFrame)

grdf[‘TotalPrice’]=grdf[‘Quantity’]*grdf[‘Price’]

grdf[grdf[‘TotalPrice’]>139.2].sort_values(‘TotalPrice’,ascending=False)

#############RFM SCORE ANALYSIS

from dateutil.relativedelta import relativedelta
start_date = all_dates.max()-relativedelta(months=12,days=-1)
print(‘Start date: {}’.format(start_date))
print(‘End date: {}’.format(all_dates.max()))

Start date: 2009-12-02
End date: 2010-12-01

data_rfm = grdf[grdf[‘InvoiceDate’] >= pd.to_datetime(start_date)]
data_rfm.reset_index(drop=True,inplace=True)
data_rfm.head()

InvoiceStockCodeDescriptionQuantityInvoiceDatePriceCustomer IDCountryInvoiceMonthCohortMonthCohortIndexTotalPrice

snapshot_date = max(data_rfm.InvoiceDate) + dt.timedelta(days=1)
print(‘Snapshot date: {}’.format(snapshot_date.date()))

Snapshot date: 2010-12-02

Aggregate data on a customer level
data = data_rfm.groupby([‘Customer ID’],as_index=False).agg({‘InvoiceDate’: lambda x: (snapshot_date – x.max()).days,
‘Invoice’: ‘count’,
‘TotalPrice’: ‘sum’}).rename(columns = {‘InvoiceDate’: ‘Recency’,
‘Invoice’: ‘Frequency’,
‘TotalPrice’: ‘MonetaryValue’})

r_labels = range(4, 0, -1)
r_quartiles = pd.qcut(data[‘Recency’], 4, labels = r_labels)
data = data.assign(R = r_quartiles.values)

f_labels = range(1,5)
m_labels = range(1,5)
f_quartiles = pd.qcut(data[‘Frequency’], 4, labels = f_labels)
m_quartiles = pd.qcut(data[‘MonetaryValue’], 4, labels = m_labels)
data = data.assign(F = f_quartiles.values)
data = data.assign(M = m_quartiles.values)
#data.head()

def join_rfm(x):
return str(x[‘R’]) + str(x[‘F’]) + str(x[‘M’])

data[‘RFM_Segment’] = data.apply(join_rfm, axis=1)
data[‘RFM_Score’] = data[[‘R’,’F’,’M’]].sum(axis=1)
data.head()

RFM score vs segment

data.groupby(‘RFM_Score’).agg({‘Recency’: ‘mean’,
‘Frequency’: ‘mean’,
‘MonetaryValue’: [‘mean’, ‘count’] }).round(1)

Let’s create segment

def create_segment(df):
if df[‘RFM_Score’] >= 9:
return ‘Top’
elif (df[‘RFM_Score’] >= 5) and (df[‘RFM_Score’] < 9):
return ‘Middle’
else:
return ‘Low’

data[‘General_Segment’] = data.apply(create_segment, axis=1)
data.groupby(‘General_Segment’).agg({‘Recency’: ‘mean’,
‘Frequency’: ‘mean’,
‘MonetaryValue’: [‘mean’, ‘count’]}).round(1)

Plotting the distributions of Recency, Frequency and Monetary Value variables

plt.figure(figsize=(12,10))

plt.subplot(3, 1, 1); sns.distplot(data[‘Recency’])

plt.subplot(3, 1, 2); sns.distplot(data[‘Frequency’])

plt.subplot(3, 1, 3); sns.distplot(data[‘MonetaryValue’])

RFM distributions

Checking for constant mean and variance.
data[[‘Recency’,’Frequency’,’MonetaryValue’]].describe()

data[data[‘MonetaryValue’] == 0]

raw_data = data[[‘Recency’,’Frequency’,’MonetaryValue’]]

#############RFM SCORE TRANSFORMATIONS

Let’s unskew the data
data_log = np.log(raw_data)

Initialize a standard scaler and fit it
scaler = StandardScaler()
scaler.fit(data_log)

Scale and center the data
data_normalized = scaler.transform(data_log)

Create a pandas DataFrame
data_norm = pd.DataFrame(data=data_log, index=raw_data.index, columns=raw_data.columns)

Let’s plot the transformed RFM distributions

plt.figure(figsize=(12,10))

plt.subplot(3, 1, 1); sns.distplot(data_norm[‘Recency’])

plt.subplot(3, 1, 2); sns.distplot(data_norm[‘Frequency’])

plt.subplot(3, 1, 3); sns.distplot(data_norm[‘MonetaryValue’])

plt.show()

The transformed RFM distributions

#############K-MEANS CLUSTERING

from sklearn.cluster import KMeans

sse = {}

Fit KMeans and calculate SSE for each k
for k in range(1, 21):

# Initialize KMeans with k clusters
kmeans = KMeans(n_clusters=k, random_state=1)

# Fit KMeans on the normalized dataset
kmeans.fit(data_norm)

# Assign sum of squared distances to k element of dictionary
sse[k] = kmeans.inertia_

Let’s look at the Elbow plot

plt.figure(figsize=(12,8))

plt.title(‘The Elbow Method’)
plt.xlabel(‘k’);
plt.ylabel(‘Sum of squared errors’)
sns.pointplot(x=list(sse.keys()), y=list(sse.values()))
plt.show()

The Elbow plot

Let’s apply Kmeans with n_clusters=3

kmeans = KMeans(n_clusters=3, random_state=1)

Compute k-means clustering on pre-processed data
kmeans.fit(data_norm)

Extract cluster labels from labels_ attribute
cluster_labels = kmeans.labels_

Create a cluster label column in the original DataFrame
data_norm_k3 = data_norm.assign(Cluster = cluster_labels)
data_k3 = raw_data.assign(Cluster = cluster_labels)

Calculate average RFM values and size for each cluster
summary_k3 = data_k3.groupby([‘Cluster’]).agg({‘Recency’: ‘mean’,
‘Frequency’: ‘mean’,
‘MonetaryValue’: [‘mean’, ‘count’],}).round(0)

summary_k3

RFM mean count for 3 clusters

Let’s apply Kmeans with n_clusters=4

kmeans = KMeans(n_clusters=4, random_state=1)

Compute k-means clustering on pre-processed data
kmeans.fit(data_norm)

Extract cluster labels from labels_ attribute
cluster_labels = kmeans.labels_

Create a cluster label column in the original DataFrame
data_norm_k4 = data_norm.assign(Cluster = cluster_labels)
data_k4 = raw_data.assign(Cluster = cluster_labels)

Calculate average RFM values and size for each cluster
summary_k4 = data_k4.groupby([‘Cluster’]).agg({‘Recency’: ‘mean’,
‘Frequency’: ‘mean’,
‘MonetaryValue’: [‘mean’, ‘count’],}).round(0)

summary_k4

RFM mean count for 4 clusters

data_norm_k4.index = data[‘Customer ID’].astype(int)

Let’s invoke melt to massage the data as follows
data_melt = pd.melt(data_norm_k4.reset_index(),
id_vars=[‘Customer ID’, ‘Cluster’],
value_vars=[‘Recency’, ‘Frequency’, ‘MonetaryValue’],
var_name=’Attribute’,
value_name=’Value’)

Let’s look at the RFM snake plot for 4 clusters

plt.title(‘Snake plot of standardized variables’)
sns.lineplot(x=”Attribute”, y=”Value”, hue=’Cluster’, data=data_melt)

RFm snake plot for 4 clusters

######RFM Data Wrangling:

data_k4.index = data[‘Customer ID’].astype(int)

raw_data.index = data[‘Customer ID’].astype(int)

cluster_avg = data_k4.groupby([‘Cluster’]).mean()
population_avg = raw_data.head().mean()

population_avg

Recency            88.400
Frequency          65.800
MonetaryValue    1224.666
dtype: float64

relative_imp = cluster_avg / population_avg – 1

Let’s plot Relative importance of attributes as a sns heatmap

plt.figure(figsize=(8, 4))
plt.title(‘Relative importance of attributes’)
sns.heatmap(data=relative_imp, annot=True, fmt=’.2f’, cmap=’RdYlGn’)
plt.show()

Relative importance of attributes

Read more here.

Customer Segmentation, RFM, K-Means and Cohort Analysis in Online Retail

Let’s set the working directory YOURPATH

import os
os.chdir(‘YOURPATH’)
os. getcwd()

and import standard libraries with optional settings

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import scipy.stats as stats
%matplotlib inline
import statsmodels.api as sm
import statsmodels.formula.api as smf
import missingno as msno

from sklearn.compose import make_column_transformer

from sklearn.preprocessing import scale
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import PolynomialFeatures
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import PowerTransformer
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import RobustScaler

import cufflinks as cf
import plotly.offline
cf.go_offline()
cf.set_config_file(offline=False, world_readable=True)

import warnings
warnings.filterwarnings(“ignore”)
warnings.warn(“this will not show”)

plt.rcParams[“figure.figsize”] = (10,6)
pd.set_option(‘max_colwidth’,200)
pd.set_option(‘display.max_rows’, 1000)
pd.set_option(‘display.max_columns’, 200)
pd.set_option(‘display.float_format’, lambda x: ‘%.3f’ % x)

import colorama
from colorama import Fore, Style
from termcolor import colored

import ipywidgets
from ipywidgets import interact

sns.set_style(“whitegrid”)

Let’s read the input dataset

df=pd.read_csv(‘Online Retail.csv’,index_col=0)

and define some useful functions

#######################

def missing_values(df):
missing_number = df.isnull().sum().sort_values(ascending=False)
missing_percent = (df.isnull().sum()/df.isnull().count()).sort_values(ascending=False)
missing_values = pd.concat([missing_number, missing_percent], axis=1, keys=[‘Missing_Number’, ‘Missing_Percent’])
return missing_values[missing_values[‘Missing_Number’]>0]

def first_looking(df):
print(colored(“Shape:”, attrs=[‘bold’]), df.shape,’\n’,
colored(‘-‘79, ‘red’, attrs=[‘bold’]), colored(“\nInfo:\n”, attrs=[‘bold’]), sep=”) print(df.info(), ‘\n’, colored(‘-‘79, ‘red’, attrs=[‘bold’]), sep=”)
print(colored(“Number of Uniques:\n”, attrs=[‘bold’]), df.nunique(),’\n’,
colored(‘-‘79, ‘red’, attrs=[‘bold’]), sep=”) print(colored(“Missing Values:\n”, attrs=[‘bold’]), missing_values(df),’\n’, colored(‘-‘79, ‘red’, attrs=[‘bold’]), sep=”)
print(colored(“All Columns:”, attrs=[‘bold’]), list(df.columns),’\n’,
colored(‘-‘*79, ‘red’, attrs=[‘bold’]), sep=”)

df.columns= df.columns.str.lower().str.replace('&', '_').str.replace(' ', '_')

print(colored("Columns after rename:", attrs=['bold']), list(df.columns),'\n',
          colored('-'*79, 'red', attrs=['bold']), sep='')

def multicolinearity_control(df):
feature =[]
collinear=[]
for col in df.corr().columns:
for i in df.corr().index:
if (abs(df.corr()[col][i])> .9 and abs(df.corr()[col][i]) < 1):
feature.append(col)
collinear.append(i)
print(colored(f”Multicolinearity alert in between:{col} – {i}”,
“red”, attrs=[‘bold’]), df.shape,’\n’,
colored(‘-‘*79, ‘red’, attrs=[‘bold’]), sep=”)

def duplicate_values(df):
print(colored(“Duplicate check…”, attrs=[‘bold’]), sep=”)
duplicate_values = df.duplicated(subset=None, keep=’first’).sum()
if duplicate_values > 0:
df.drop_duplicates(keep=’first’, inplace=True)
print(duplicate_values, colored(” Duplicates were dropped!”),’\n’,
colored(‘-‘79, ‘red’, attrs=[‘bold’]), sep=”) else: print(colored(“There are no duplicates”),’\n’, colored(‘-‘79, ‘red’, attrs=[‘bold’]), sep=”)

def drop_columns(df, drop_columns):
if drop_columns !=[]:
df.drop(drop_columns, axis=1, inplace=True)
print(drop_columns, ‘were dropped’)
else:
print(colored(‘We will now check the missing values and if necessary will drop realted columns!’, attrs=[‘bold’]),’\n’,
colored(‘-‘*79, ‘red’, attrs=[‘bold’]), sep=”)

def drop_null(df, limit):
print(‘Shape:’, df.shape)
for i in df.isnull().sum().index:
if (df.isnull().sum()[i]/df.shape[0]*100)>limit:
print(df.isnull().sum()[i], ‘percent of’, i ,’null and were dropped’)
df.drop(i, axis=1, inplace=True)
print(‘new shape:’, df.shape)
print(‘New shape after missing value control:’, df.shape)

#######################

Let’s check the structure and content of our input data

first_looking(df)
duplicate_values(df)
drop_columns(df,[])
drop_null(df, 90)
df.head()
df.tail()
df.sample(5)
df.describe().T
df.describe(include=object).T

Shape:(541909, 8)
-------------------------------------------------------------------------------
Info:

<class 'pandas.core.frame.DataFrame'>
Index: 541909 entries, 0 to 541908
Data columns (total 8 columns):
 #   Column       Non-Null Count   Dtype  
---  ------       --------------   -----  
 0   InvoiceNo    541909 non-null  object 
 1   StockCode    541909 non-null  object 
 2   Description  540455 non-null  object 
 3   Quantity     541909 non-null  float64
 4   InvoiceDate  541909 non-null  object 
 5   UnitPrice    541909 non-null  float64
 6   CustomerID   406829 non-null  float64
 7   Country      541909 non-null  object 
dtypes: float64(3), object(5)
memory usage: 37.2+ MB
None
-------------------------------------------------------------------------------
Number of Uniques:
InvoiceNo      25900
StockCode       4070
Description     4223
Quantity         722
InvoiceDate    23260
UnitPrice       1630
CustomerID      4372
Country           38
dtype: int64
-------------------------------------------------------------------------------
Missing Values:
             Missing_Number  Missing_Percent
CustomerID           135080            0.249
Description            1454            0.003
-------------------------------------------------------------------------------
All Columns:['InvoiceNo', 'StockCode', 'Description', 'Quantity', 'InvoiceDate', 'UnitPrice', 'CustomerID', 'Country']
-------------------------------------------------------------------------------
Columns after rename:['invoiceno', 'stockcode', 'description', 'quantity', 'invoicedate', 'unitprice', 'customerid', 'country']
-------------------------------------------------------------------------------
Duplicate check...
5268 Duplicates were dropped!
-------------------------------------------------------------------------------
We will now check the missing values and if necessary will drop realted columns!
-------------------------------------------------------------------------------
Shape: (536641, 8)
New shape after missing value control: (536641, 8)
Input data statistics

######## DATA WRANGLING

df_model = df.copy()

df.isnull().sum()

invoiceno           0
stockcode           0
description      1454
quantity            0
invoicedate         0
unitprice           0
customerid     135037
country             0
dtype: int64

We will drop NaN values of customerid
df = df.dropna(subset=[‘customerid’])
df.shape

(401604, 8)

df[‘cancelled’] = df[‘invoiceno’].str.contains(‘C’)

Let’s drop cancelled
df = df.drop([‘cancelled’],axis=1)

Clean the Data from the Noise and Missing Values
df = df[(df.quantity > 0) & (df.unitprice > 0)]
df.shape

(392692, 8)

Let’s look at the treemap Number of Customers Per Country Without UK

fig = px.treemap(df.groupby(‘country’)[[“customerid”]].nunique().sort_values(by=”customerid”, ascending=False).iloc[1:],
path=[df.groupby(‘country’)[[“customerid”]].nunique().sort_values(by=”customerid”, ascending=False).iloc[1:].index],
values=’customerid’,
width=1000,
height=600)
fig.update_layout(title_text=’Number of Customers Per Country Without UK’,
title_x = 0.5, title_font = dict(size=20)
)
fig.update_layout(margin = dict(t=50, l=25, r=25, b=25))
fig.show()

Number of Customers Per Country Without UK

#######TOTAL PRICE aka REVENUE

The calculation of total price
df[‘total_price’] = df[‘quantity’] * df[‘unitprice’]

Let’s save this dataframe for future analysis
df_total_price = pd.DataFrame(df.groupby(‘country’)[“total_price”].sum().sort_values(ascending=False))

Let’s plot the treemap Total Prices By Countries Without UK

fig = px.treemap(df_total_price.iloc[1:],
path=[df_total_price.iloc[1:].index],
values=’total_price’,
width=1000,
height=600)
fig.update_layout(title_text=’Total Prices By Countries Without UK’,
title_x = 0.5, title_font = dict(size=20)
)
fig.update_layout(margin = dict(t=50, l=25, r=25, b=25))
fig.show()

Total Prices By Countries Without UK

#########UK RFM ANALYSIS

df_uk = df[df[“country”] == ‘United Kingdom’]

Top 15 most owned products
df_uk[“stockcode”].value_counts().head(15).plot(kind=”bar”, width=0.5, color=’pink’, edgecolor=’purple’, figsize=(16,9))
plt.xticks(rotation=45);

Top 15 most owned products in UK

last_invoice = max(df_uk[‘invoicedate’])
last_invoice

'2011-12-09 12:49:00'

first_invoice = min(df_uk[‘invoicedate’])
first_invoice

'2010-12-01 08:26:00'

df_uk[‘invoicedate’] = pd.to_datetime(df_uk[‘invoicedate’])

df_uk[“date”] = df_uk[‘invoicedate’].dt.date

df_uk[“last_purchased_date”] = df_uk.groupby(“customerid”).date.transform(max)

last_invoice = pd.to_datetime(last_invoice).date()

df_uk.groupby(‘customerid’)[‘last_purchased_date’].apply(lambda x: last_invoice – x)

# Recency

df_recency = df_uk.groupby(by=’customerid’,
as_index=False)[‘date’].max()
df_recency.columns = [‘customerid’, ‘last_purchased_date’]
recent_date = df_recency[‘last_purchased_date’].max()
df_recency[‘recency’] = df_recency[‘last_purchased_date’].apply(
lambda x: (recent_date – x).days)
#df_recency.head()

Let’s plot recency

plt.figure(figsize=(16,9))
sns.distplot(df_recency[‘recency’], bins=35);

UK Recency

# Frequency: Number of purchases

df_uk[‘frequency’] = df_uk.groupby(‘customerid’)[“invoiceno”].transform(‘count’)

plt.figure(figsize=(16,9))
sns.distplot(df_uk[‘frequency’], bins=50);

UK Frequency

########### Monetary: Total amount of money spent

df_uk[“monetary”] = df_uk.groupby(‘customerid’)[“total_price”].transform(‘sum’)

plt.figure(figsize=(16,9))
sns.distplot(df_uk[‘monetary’], bins=50);

UK Monetary

############ UK RFM Table

df_uk.columns

Index(['invoiceno', 'stockcode', 'description', 'quantity', 'invoicedate',
       'unitprice', 'customerid', 'country', 'total_price', 'date',
       'last_purchased_date', 'frequency', 'monetary'],
      dtype='object')

df_recency.columns

Index(['customerid', 'last_purchased_date', 'recency'], dtype='object')

df_uk[‘recency’] = df_recency[‘recency’]

df_rfm_table = df_uk[[‘customerid’, ‘recency’, ‘frequency’, ‘monetary’]]

Customer Segmentation with RFM Scores
df_rfm_table = df_rfm_table.set_index(‘customerid’)

df_rfm_table.reset_index().duplicated().sum()

df_rfm_table.drop_duplicates(inplace=True)
df_rfm_table.shape

(6050, 3)

########### Recency Scoring

df_rfm_table[“recency”].quantile(q = [.25,.5,.75])

0.250    21.000
0.500    58.000
0.750   163.000
Name: recency, dtype: float64

def recency_scoring(data):
if data[“recency”] <= 17.000:
return 4
elif data[“recency”] <= 50.000:
return 3
elif data[“recency”] <= 142.000:
return 2
else:
return 1

df_rfm_table[‘recency_quantile’] = df_rfm_table.apply(recency_scoring, axis =1)
#df_rfm_table.head()

########### Frequency Scoring

df_rfm_table[“frequency”].quantile(q = [.25,.5,.75])

0.250    23.000
0.500    65.000
0.750   147.000
Name: frequency, dtype: float64

def frequency_scoring(data):
if data.frequency <= 17.000:
return 1
elif data.frequency <= 40.000:
return 2
elif data.frequency <= 98.000:
return 3
else:
return 4

df_rfm_table[‘frequency_quantile’] = df_rfm_table.apply(frequency_scoring, axis =1)
#df_rfm_table.head()

########### Monetary Scoring

df_rfm_table[“monetary”].quantile(q = [.25,.5,.75])

def monetary_scoring(data):
if data.monetary <= 298.185:
return 1
elif data.monetary <= 644.975:
return 2
elif data.monetary <= 1571.285:
return 3
else:
return 4

df_rfm_table[‘monetary_quantile’] = df_rfm_table.apply(monetary_scoring, axis =1)
#df_rfm_table.head()

############### RFM Scoring

def rfm_scoring(data):
return str(int(data[‘recency_quantile’])) + str(int(data[‘frequency_quantile’])) + str(int(data[‘monetary_quantile’]))

df_rfm_table[‘rfm_score’] = df_rfm_table.apply(rfm_scoring, axis=1)
#df_rfm_table.head()

df_rfm_table[‘rfm_level’] = df_rfm_table[‘recency_quantile’] + df_rfm_table[‘frequency_quantile’] + df_rfm_table[‘monetary_quantile’]
#df_rfm_table.head()

def segments1(data):
if data[‘rfm_level’] >= 10 :
return ‘Gold’
elif (data[‘rfm_level’] >= 6) and (data[‘rfm_level’] < 10 ):
return ‘Silver’
else:
return ‘Bronze’

df_rfm_table [‘segments1’] = df_rfm_table.apply(segments1,axis=1)
#df_rfm_table.head()

segments2 = {
‘Customer Segment’:
[‘Champions’,
‘Loyal Customers’,
‘Potential Loyalist’,
‘Recent Customers’,
‘Customers Needing Attention’,
‘Still Got Hope’,
‘Need to Get Them Back’,
‘Lost’, ‘Give it a Try’],\
‘RFM’:
[‘(3|4)-(3|4)-(3|4)’,
‘(2|3|4)-(3|4)-(1|2|3|4)’,
‘(3|4)-(2|3)-(1|2|3|4)’,
‘(4)-(1)-(1|2|3|4)’,
‘(2|3)-(2|3)-(2|3)’,
‘(2|3)-(1|2)-(1|2|3|4)’,
‘(1|2)-(3|4)-(2|3|4)’,
‘(1|2)-(1|2)-(1|2)’,
‘(1|2)-(1|2|3)-(1|2|3|4)’]
}
pd.DataFrame(segments2)

def categorizer(rfm):
if (rfm[0] in [‘3’, ‘4’]) & (rfm[1] in [‘3’, ‘4’]) & (rfm[2] in [‘3’, ‘4’]):
rfm = ‘Champions’

elif (rfm[0] in ['2', '3', '4']) & (rfm[1] in ['3', '4']) & (rfm[2] in ['1', '2', '3', '4']):
    rfm = 'Loyal Customers'

elif (rfm[0] in ['3', '4']) & (rfm[1] in ['2', '3']) & (rfm[2] in ['1', '2', '3', '4']):
    rfm = 'Potential Loyalist'

elif (rfm[0] in ['4']) & (rfm[1] in ['1']) & (rfm[2] in ['1', '2', '3', '4']):
    rfm = 'Recent Customers'

elif (rfm[0] in ['2', '3']) & (rfm[1] in ['2', '3']) & (rfm[2] in ['2', '3']):
    rfm = 'Customers Needing Attention'

elif (rfm[0] in ['2', '3']) & (rfm[1] in ['1', '2']) & (rfm[2] in ['1', '2', '3', '4']):
    rfm = 'Still Got Hope'

elif (rfm[0] in ['1', '2']) & (rfm[1] in ['3', '4']) & (rfm[2] in ['2', '3', '4']):
    rfm = 'Need to Get Them Back'

elif (rfm[0] in ['1', '2']) & (rfm[1] in ['1', '2']) & (rfm[2] in ['1', '2']):
    rfm = 'Lost'

elif (rfm[0] in ['1', '2']) & (rfm[1] in ['1', '2', '3']) & (rfm[2] in ['1', '2', '3', '4']):
    rfm = 'Give it a Try'

return rfm 

df_rfm_table[‘segments2’] = df_rfm_table[“rfm_score”].apply(categorizer)
#df_rfm_table.head()

df_rfm_table[“segments2”].value_counts(dropna=False)

segments2
Need to Get Them Back          2343
Lost                           1666
Champions                       660
Loyal Customers                 602
Give it a Try                   483
Potential Loyalist              115
Still Got Hope                   93
Recent Customers                 46
Customers Needing Attention      40
141                               2
Name: count, dtype: int64

df_rfm_table[“segments1”].info()

<class 'pandas.core.series.Series'>
Index: 6050 entries, 17850.0 to 14569.0
Series name: segments1
Non-Null Count  Dtype 
--------------  ----- 
6050 non-null   object
dtypes: object(1)
memory usage: 94.5+ KB

Plot RFM Segments
df_plot1 = pd.DataFrame(df_rfm_table[“segments1”].value_counts(dropna=False).sort_values(ascending=False)).reset_index().rename(columns={‘index’:’Segments’, ‘segments1′:’Customers’})
df_plot1

import matplotlib.pyplot as plt
import squarify

############################# segment1

sizes=[3200, 1944, 906]
label=[“Silver”, “Bronze”, “Gold”]
color=[‘yellow’,’blue’,’orange’]
squarify.plot(sizes=sizes, label=label, color=color,alpha=0.8 , text_kwargs={‘fontsize’:16})
plt.axis(‘off’)
plt.show()

UK RFM segment1

################ segment2

import matplotlib.pyplot as plt
import seaborn as sns

data = [2343, 1666, 660, 602, 483, 115]
labels = [‘Need to Get Them Back’, ‘Lost’,’Champions’,’Loyal Customers’,’Give it a Try’,’Potential Loyalist’]

colors = sns.color_palette(‘bright’)[0:6]

plt.pie(data, labels = labels, colors = colors, autopct=’%.0f%%’)
plt.show()

UK RFM segment2

Read more here.

Customer Clustering PCA, K-Means and Agglomerative for Marketing Campaigns

Let’s set the working directory

import os
os.chdir(‘YOURPATH’)
os. getcwd()

and read the dataset

import pandas as pd
import numpy as np
df = pd.read_csv(‘marketing_campaign.csv’, sep=”\t”)

Get the shape of the dataframe
print(“Shape of the dataframe:”, df.shape)

Shape of the dataframe: (2240, 29)

Get the information about the dataframe
print(“\nInformation about the dataframe:”)
print(df.info())

Information about the dataframe:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2240 entries, 0 to 2239
Data columns (total 29 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   ID                   2240 non-null   int64  
 1   Year_Birth           2240 non-null   int64  
 2   Education            2240 non-null   object 
 3   Marital_Status       2240 non-null   object 
 4   Income               2216 non-null   float64
 5   Kidhome              2240 non-null   int64  
 6   Teenhome             2240 non-null   int64  
 7   Dt_Customer          2240 non-null   object 
 8   Recency              2240 non-null   int64  
 9   MntWines             2240 non-null   int64  
 10  MntFruits            2240 non-null   int64  
 11  MntMeatProducts      2240 non-null   int64  
 12  MntFishProducts      2240 non-null   int64  
 13  MntSweetProducts     2240 non-null   int64  
 14  MntGoldProds         2240 non-null   int64  
 15  NumDealsPurchases    2240 non-null   int64  
 16  NumWebPurchases      2240 non-null   int64  
 17  NumCatalogPurchases  2240 non-null   int64  
 18  NumStorePurchases    2240 non-null   int64  
 19  NumWebVisitsMonth    2240 non-null   int64  
 20  AcceptedCmp3         2240 non-null   int64  
 21  AcceptedCmp4         2240 non-null   int64  
 22  AcceptedCmp5         2240 non-null   int64  
 23  AcceptedCmp1         2240 non-null   int64  
 24  AcceptedCmp2         2240 non-null   int64  
 25  Complain             2240 non-null   int64  
 26  Z_CostContact        2240 non-null   int64  
 27  Z_Revenue            2240 non-null   int64  
 28  Response             2240 non-null   int64  
dtypes: float64(1), int64(25), object(3)
memory usage: 507.6+ KB
None

Get the summary statistics of the dataframe
print(“\nSummary statistics of the dataframe:”)
df.describe().T

Summary statistics of the dataframe

############# DATA PREPARATION

Check for missing data
print(“Missing data in the dataframe:”)
print(df.isnull().sum())

Missing data in the dataframe:
ID                      0
Year_Birth              0
Education               0
Marital_Status          0
Income                 24
Kidhome                 0
Teenhome                0
Dt_Customer             0
Recency                 0
MntWines                0
MntFruits               0
MntMeatProducts         0
MntFishProducts         0
MntSweetProducts        0
MntGoldProds            0
NumDealsPurchases       0
NumWebPurchases         0
NumCatalogPurchases     0
NumStorePurchases       0
NumWebVisitsMonth       0
AcceptedCmp3            0
AcceptedCmp4            0
AcceptedCmp5            0
AcceptedCmp1            0
AcceptedCmp2            0
Complain                0
Z_CostContact           0
Z_Revenue               0
Response                0
dtype: int64

Let’s drop duplicates

df = df.dropna()
df.duplicated().sum()

0

########### Outliers

Calculate the IQR for the Income column
Q1 = df[‘Income’].quantile(0.25)
Q3 = df[‘Income’].quantile(0.75)
IQR = Q3 – Q1

Identify the outliers in the Income column
outliers = df[(df[‘Income’] < (Q1 – 1.5 * IQR)) | (df[‘Income’] > (Q3 + 1.5 * IQR))]

Print the number of outliers
print(“Number of outliers in the Income column:”, len(outliers))

Number of outliers in the Income column: 8

Remove the outliers in the Income column
df = df[~((df[‘Income’] < (Q1 – 1.5 * IQR)) | (df[‘Income’] > (Q3 + 1.5 * IQR)))]

Print the updated shape of the dataframe
print(“Updated shape of the dataframe:”, df.shape)

############## Feature engineering
print(“Unique values in Education column:”, df[‘Education’].unique())
print(“Unique values in Marital_Status column:”, df[‘Marital_Status’].unique())

Unique values in Education column: ['Graduation' 'PhD' 'Master' 'Basic' '2n Cycle']
Unique values in Marital_Status column: ['Single' 'Together' 'Married' 'Divorced' 'Widow' 'Alone' 'Absurd' 'YOLO']

Let’s define the following new columns

def education_level(education):
if education in [‘Graduation’, ‘PhD’, ‘Master’]:
return ‘High’
elif education in [‘Basic’]:
return ‘Middle’
else:
return ‘Low’

df[‘Education_Level’] = df[‘Education’].apply(education_level)

def living_status(marital_status):
if marital_status in [‘Alone’, ‘Absurd’, ‘YOLO’]:
return ‘Living Alone’
else:
return ‘Living with Others’

df[‘Living_Status’] = df[‘Marital_Status’].apply(living_status)

df[‘Age’] = 2022 – df[‘Year_Birth’]

df[‘Total_Campaigns_Accepted’] = df[[‘AcceptedCmp1’, ‘AcceptedCmp2’, ‘AcceptedCmp3’, ‘AcceptedCmp4’, ‘AcceptedCmp5’]].sum(axis=1)

df[‘Average_Spend’] = (df[[‘MntWines’, ‘MntFruits’, ‘MntMeatProducts’, ‘MntFishProducts’, ‘MntSweetProducts’, ‘MntGoldProds’]].sum(axis=1)) / df[‘NumDealsPurchases’]

df[‘Spent’] = df[‘MntWines’]+df[“MntWines”] +df[‘MntFruits’]+ df[‘MntMeatProducts’] +df[‘MntFishProducts’]+df[‘MntSweetProducts’]+ df[‘MntGoldProds’]

df[‘Is_Parent’] = (df[‘Kidhome’] + df[‘Teenhome’] > 0).astype(int)

Create the new feature for total spending in the last 2 years
df[‘total_spending’] = df[‘MntWines’] + df[‘MntFruits’] + df[‘MntMeatProducts’] + df[‘MntFishProducts’] + df[‘MntSweetProducts’] + df[‘MntGoldProds’]

Create new feature for average monthly visits to the company’s website
df[‘avg_web_visits’] = df[‘NumWebVisitsMonth’] / 12

Create new feature for the ratio of online purchases to total purchases
df[‘online_purchase_ratio’] = df[‘NumWebPurchases’] / (df[‘NumWebPurchases’] + df[‘NumCatalogPurchases’] + df[‘NumStorePurchases’])

Let’s drop a few columns

to_drop = [‘Dt_Customer’, ‘Z_CostContact’, ‘Z_Revenue’, ‘Year_Birth’, ‘ID’]
df = df.drop(to_drop, axis=1)

df.dtypes

Education                    object
Marital_Status               object
Income                      float64
Kidhome                       int64
Teenhome                      int64
Recency                       int64
MntWines                      int64
MntFruits                     int64
MntMeatProducts               int64
MntFishProducts               int64
MntSweetProducts              int64
MntGoldProds                  int64
NumDealsPurchases             int64
NumWebPurchases               int64
NumCatalogPurchases           int64
NumStorePurchases             int64
NumWebVisitsMonth             int64
AcceptedCmp3                  int64
AcceptedCmp4                  int64
AcceptedCmp5                  int64
AcceptedCmp1                  int64
AcceptedCmp2                  int64
Complain                      int64
Response                      int64
Education_Level              object
Living_Status                object
Age                           int64
Total_Campaigns_Accepted      int64
Average_Spend               float64
Spent                         int64
Is_Parent                     int32
total_spending                int64
avg_web_visits              float64
online_purchase_ratio       float64
dtype: object

########### PLOTLY EXPLORATORY DATA ANALYSIS (EDA)

import plotly.express as px

fig0 = px.histogram(df, x=”Income”, nbins=50)
fig0.show()

Income histogram

fig1 = px.histogram(df, x=”Age”, nbins=30, color=’Age’, title=”Distribution of Age”)
fig1.show()

Age histogram

fig2 = px.histogram(df, x=’Marital_Status’, nbins=5, title=”Marital Status Distribution”)
fig2.show()

Marital Status Distribution

fig3 = px.histogram(df, x=’Education_Level’, nbins=5, title=”Education Level Distribution”)
fig3.show()

Education Level Distribution

import plotly.express as px

df_plot = df.groupby([‘Marital_Status’])[‘Average_Spend’].mean().reset_index()

fig4 = px.bar(df_plot, x=’Marital_Status’, y=’Average_Spend’, color=’Marital_Status’)

fig4.show()

Marital Status vs Average Spend Distribution

Marital Status vs Average Spend Distribution

Plot the distribution of number of children in household
fig6 = px.histogram(df, x=’Kidhome’)
fig6.show()

 Distribution of number of children in household

Total Campaigns Accepted Distribution
fig8 = px.histogram(df, x=’Total_Campaigns_Accepted’, nbins=20, title=”Total Campaigns Accepted Distribution”)
fig8.show()

Total Campaigns Accepted Distribution

Average Spend per Purchase Distribution
fig9 = px.histogram(df, x=’Average_Spend’, nbins=20, title=”Average Spend per Purchase Distribution”)
fig9.show()

Average Spend per Purchase Distribution

Spending distribution by marital status, education, and is_parent

fig10 = px.histogram(df, x=’total_spending’, color=’Marital_Status’, nbins=50,
title=’Spending Distribution by Marital Status’)
fig11 = px.histogram(df, x=’total_spending’, color=’Education_Level’, nbins=50,
title=’Spending Distribution by Education Level’)
fig12 = px.histogram(df, x=’total_spending’, color=’Is_Parent’, nbins=50,
title=’Spending Distribution by Is_Parent’)

fig10.show()
fig11.show()
fig12.show()

Spending distribution by marital status
Spending distribution by education level
Spending distribution is_parent

Plot the Distribution of Online Purchase Ratio
fig13 = px.histogram(df, x=’online_purchase_ratio’)
fig13.show()

Online purchase ratio

Plot the Distribution of Number of Web Visits per Month
fig14 = px.histogram(df, x=’NumWebVisitsMonth’)
fig14.show()

Number of Web Visits per Month

Plot the Distribution of Number of Web Purchases
fig15 = px.histogram(df, x=’NumWebPurchases’)
fig15.show()

Number of Web Purchases

Plot the Distribution of Number of Catalog Purchases
fig16 = px.histogram(df, x=’NumCatalogPurchases’)
fig16.show()

Number of Catalog Purchases

Plot the Distribution of Number of Store Purchases
fig17 = px.histogram(df, x=’NumStorePurchases’)
fig17.show()

Number of Store Purchases

Box plot of NumWebPurchases vs NumStorePurchases
fig18 = px.box(df, x=”NumWebPurchases”, y=”NumStorePurchases”)
fig18.show()

Box plot of NumWebPurchases vs NumStorePurchases

Scatter plot of “MntWines” vs “MntSweetProducts” with a third variable represented by size or color
fig22 = px.scatter(df, x=”MntWines”, y=”MntSweetProducts”, size=”NumWebVisitsMonth”, color=”Income”, size_max=50)
fig22.show()

fig22 = px.scatter(df, x="MntWines", y="MntSweetProducts", size="NumWebVisitsMonth", color="Income", size_max=50)

########### CLUSTERING ALGORITHMS

import plotly.express as px
import plotly.io as pio

from sklearn.preprocessing import StandardScaler, MinMaxScaler

######## Data Preparation

df = pd.read_csv(‘marketing_campaign.csv’, sep=”\t”)

df = df.dropna()

df.duplicated().sum()

0

Calculate the IQR for the Income column
Q1 = df[‘Income’].quantile(0.25)
Q3 = df[‘Income’].quantile(0.75)
IQR = Q3 – Q1

Identify the outliers in the Income column
outliers = df[(df[‘Income’] < (Q1 – 1.5 * IQR)) | (df[‘Income’] > (Q3 + 1.5 * IQR))]

Remove the outliers in the Income column
df = df[~((df[‘Income’] < (Q1 – 1.5 * IQR)) | (df[‘Income’] > (Q3 + 1.5 * IQR)))]

df[‘Age’] = 2022 – df[‘Year_Birth’]
df[‘Total_Campaigns_Accepted’] = df[[‘AcceptedCmp1’, ‘AcceptedCmp2’, ‘AcceptedCmp3’, ‘AcceptedCmp4’, ‘AcceptedCmp5’]].sum(axis=1)

df[‘Average_Spend’] = (df[[‘MntWines’, ‘MntFruits’, ‘MntMeatProducts’, ‘MntFishProducts’, ‘MntSweetProducts’, ‘MntGoldProds’]].sum(axis=1)) / df[‘NumDealsPurchases’]

df[‘Spent’] = df[‘MntWines’]+df[“MntWines”] +df[‘MntFruits’]+ df[‘MntMeatProducts’] +df[‘MntFishProducts’]+df[‘MntSweetProducts’]+ df[‘MntGoldProds’]

df[‘Is_Parent’] = (df[‘Kidhome’] + df[‘Teenhome’] > 0).astype(int)

df[‘total_spending’] = df[‘MntWines’] + df[‘MntFruits’] + df[‘MntMeatProducts’] + df[‘MntFishProducts’] + df[‘MntSweetProducts’]

df[‘avg_web_visits’] = df[‘NumWebVisitsMonth’] / 12

df[‘online_purchase_ratio’] = df[‘NumWebPurchases’] / (df[‘NumWebPurchases’] + df[‘NumCatalogPurchases’] + df[‘NumStorePurchases’])

to_drop = [‘Dt_Customer’, ‘Z_CostContact’, ‘Z_Revenue’, ‘Year_Birth’, ‘ID’]
df = df.drop(to_drop, axis=1)

One-hot encode the categorical variables

Select the numerical columns to scale
num_cols = [‘Income’, ‘Kidhome’, ‘Teenhome’, ‘Recency’, ‘MntWines’, ‘MntFruits’,
‘MntMeatProducts’, ‘MntFishProducts’, ‘MntSweetProducts’, ‘MntGoldProds’,
‘NumDealsPurchases’, ‘NumWebPurchases’, ‘NumCatalogPurchases’,
‘NumStorePurchases’, ‘NumWebVisitsMonth’, ‘AcceptedCmp3’, ‘AcceptedCmp4’,
‘AcceptedCmp5’, ‘AcceptedCmp1’, ‘AcceptedCmp2’, ‘Complain’, ‘Response’,
‘total_spending’, ‘avg_web_visits’, ‘online_purchase_ratio’, ‘Age’,
‘Total_Campaigns_Accepted’, ‘Is_Parent’]

Initialize the StandardScaler
scaler = StandardScaler()

Fit the scaler to the numerical columns
df[num_cols] = scaler.fit_transform(df[num_cols])

df.isnull().sum().sum()

0

al=df[‘online_purchase_ratio’].mean()
print(al)

4.1834490782976914e-17

df[‘online_purchase_ratio’] = df[‘online_purchase_ratio’].fillna(0.33)

df.replace([np.inf, -np.inf], np.nan, inplace=True)

al=df[‘Average_Spend’].mean()
print(al)

412.04631023016225

df[‘Average_Spend’] = df[‘Average_Spend’].fillna(al)

######## K-means Clusters

from sklearn.decomposition import PCA

Initialize the PCA model
pca = PCA(n_components=8)

Fit and transform the data
df_pca = pca.fit_transform(df)

Determining the optimal number of clusters using Silhouette Score
import numpy as np
from sklearn.metrics import silhouette_score
from sklearn.cluster import KMeans

list_k = list(range(2, 10))
silhouette_scores = []
for k in list_k:
km = KMeans(n_clusters=k)
preds = km.fit_predict(df_pca)
silhouette_scores.append(silhouette_score(df_pca, preds))

best_k = list_k[np.argmax(silhouette_scores)]

best_k

2

Let’s fit the KMeans model with the number of clusters set to 4

import matplotlib.pyplot as plt
from sklearn.metrics import silhouette_score
from sklearn.metrics import calinski_harabasz_score,davies_bouldin_score

kmeans = KMeans(n_clusters=4)
kmeans.fit(df_pca)

predictions = kmeans.predict(df_pca)

silhouette_score_value = silhouette_score(df_pca, predictions)

calinski_harabasz_score_value = calinski_harabasz_score(df_pca, predictions)

davies_bouldin_score_value = davies_bouldin_score(df_pca, predictions)

plt.scatter(df_pca[:, 0], df_pca[:, 1], c=predictions, cmap=’viridis’)
plt.xlabel(‘Feature 1’)
plt.ylabel(‘Feature 2’)
plt.title(‘KMeans Clustering Results in 2D\nSilhouette Score: {0:.3f}’.format(silhouette_score_value))
plt.show()

fig = plt.figure()
ax = fig.add_subplot(111, projection=’3d’)
ax.scatter(df_pca[:, 0], df_pca[:, 1], df_pca[:, 2], c=predictions, cmap=’viridis’)
ax.set_xlabel(‘Feature 1’)
ax.set_ylabel(‘Feature 2’)
ax.set_zlabel(‘Feature 3’)
ax.set_title(‘KMeans Clustering Results in 3D\nSilhouette Score: {0:.3f}’.format(silhouette_score_value))
plt.show()

KMeans clustering result in 2D
KMeans clustering result in 3D

####### Agglomerative Clustering

import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
from sklearn.cluster import AgglomerativeClustering
from scipy.spatial.distance import pdist, squareform
from sklearn.metrics import davies_bouldin_score

Generate sample data
X = df_pca

Compute the pairwise distances between samples
dist_matrix = squareform(pdist(X))

Fit the Agglomerative Clustering model
agg_cluster = AgglomerativeClustering(n_clusters=4)
agg_cluster.fit(X)

Plotting

plt.scatter(X[:, 0], X[:, 1], c=agg_cluster.labels_, cmap=’viridis’)
plt.show()

fig = plt.figure()
ax = fig.add_subplot(111, projection=’3d’)
ax.scatter(X[:, 0], X[:, 1], X[:, 2], c=agg_cluster.labels_, cmap=’viridis’)
plt.show()

davies_bouldin_index = davies_bouldin_score(X, agg_cluster.labels_)
print(“Davies-Bouldin Index:”, davies_bouldin_index)

silhouette_score_value = silhouette_score(X, agg_cluster.labels_)

calinski_harabasz_score_value = calinski_harabasz_score(X, agg_cluster.labels_)

print(“Silhouette Score:”, silhouette_score_value)
print(“Calinski-Harabasz Index:”, calinski_harabasz_score_value)

Agglomerative Clustering 2D
Agglomerative Clustering 3D

Read more here.

Supply Chain RFM & ABC Analysis

  • Let’s apply the RFM/ABC analysis to the Data Co Supply Chain Dataset. Recency, Frequency, And Monetary Analysis(RFM) and ABC are techniques would give us an idea about important factors in our business and how we can optimize the future business growth.
  • The Recency-Frequency-Monetary analysis helps demand side sector of an organization by segmenting customers and providing more insights to marketing department and sales department about the customers who buy products at the store.
  • The ABC analysis helps the supply side sector of an organization by segmenting products based on revenue they generate. It helps manage inventory more efficiently.

Let’s set the working directory

import os
os.chdir(‘YOURPATH’)
os. getcwd()

and import standard libraries

import pandas as pd
import dateutil.parser
import numpy as np
import datetime
from statsmodels.stats.outliers_influence import variance_inflation_factor
import seaborn as sns;
import matplotlib.pyplot as plt
from plotly import graph_objects as go
import plotly.express as px

Let’s read the input data

dFppsc= pd.read_csv(“DataCoSupplyChainDataset.csv”, encoding_errors=”ignore”)
dFppsc.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 180519 entries, 0 to 180518
Data columns (total 53 columns):
 #   Column                         Non-Null Count   Dtype  
---  ------                         --------------   -----  
 0   Type                           180519 non-null  object 
 1   Days for shipping (real)       180519 non-null  int64  
 2   Days for shipment (scheduled)  180519 non-null  int64  
 3   Benefit per order              180519 non-null  float64
 4   Sales per customer             180519 non-null  float64
 5   Delivery Status                180519 non-null  object 
 6   Late_delivery_risk             180519 non-null  int64  
 7   Category Id                    180519 non-null  int64  
 8   Category Name                  180519 non-null  object 
 9   Customer City                  180519 non-null  object 
 10  Customer Country               180519 non-null  object 
 11  Customer Email                 180519 non-null  object 
 12  Customer Fname                 180519 non-null  object 
 13  Customer Id                    180519 non-null  int64  
 14  Customer Lname                 180511 non-null  object 
 15  Customer Password              180519 non-null  object 
 16  Customer Segment               180519 non-null  object 
 17  Customer State                 180519 non-null  object 
 18  Customer Street                180519 non-null  object 
 19  Customer Zipcode               180516 non-null  float64
 20  Department Id                  180519 non-null  int64  
 21  Department Name                180519 non-null  object 
 22  Latitude                       180519 non-null  float64
 23  Longitude                      180519 non-null  float64
 24  Market                         180519 non-null  object 
 25  Order City                     180519 non-null  object 
 26  Order Country                  180519 non-null  object 
 27  Order Customer Id              180519 non-null  int64  
 28  order date (DateOrders)        180519 non-null  object 
 29  Order Id                       180519 non-null  int64  
 30  Order Item Cardprod Id         180519 non-null  int64  
 31  Order Item Discount            180519 non-null  float64
 32  Order Item Discount Rate       180519 non-null  float64
 33  Order Item Id                  180519 non-null  int64  
 34  Order Item Product Price       180519 non-null  float64
 35  Order Item Profit Ratio        180519 non-null  float64
 36  Order Item Quantity            180519 non-null  int64  
 37  Sales                          180519 non-null  float64
 38  Order Item Total               180519 non-null  float64
 39  Order Profit Per Order         180519 non-null  float64
 40  Order Region                   180519 non-null  object 
 41  Order State                    180519 non-null  object 
 42  Order Status                   180519 non-null  object 
 43  Order Zipcode                  24840 non-null   float64
 44  Product Card Id                180519 non-null  int64  
 45  Product Category Id            180519 non-null  int64  
 46  Product Description            0 non-null       float64
 47  Product Image                  180519 non-null  object 
 48  Product Name                   180519 non-null  object 
 49  Product Price                  180519 non-null  float64
 50  Product Status                 180519 non-null  int64  
 51  shipping date (DateOrders)     180519 non-null  object 
 52  Shipping Mode                  180519 non-null  object 
dtypes: float64(15), int64(14), object(24)
memory usage: 73.0+ MB

############ Data Cleaning for Future Regression Models
dFppsc.drop(“Product Description”,axis=1,inplace=True)

dFppsc.drop(“Product Description”,axis=1,inplace=True)

dFppsc[“order date (DateOrders)”]=pd.to_datetime(dFppsc[“order date (DateOrders)”])
dFppsc[“shipping date (DateOrders)”]=pd.to_datetime(dFppsc[“shipping date (DateOrders)”])
dFppsc=dFppsc.sort_values(by=”order date (DateOrders)”)

dFppsc.drop([“Benefit per order”,”Sales per customer”,”Order Item Cardprod Id”,”Order Item Product Price”,”Product Category Id”,”Order Customer Id”],axis=1,inplace=True)

######### Variance Inflation Factor (VIF)
vif=pd.DataFrame()
vif[“columns”]=[‘Order Item Discount’,
‘Order Item Discount Rate’, ‘Order Item Profit Ratio’,
‘Order Item Quantity’, ‘Sales’, ‘Order Item Total’,
‘Order Profit Per Order’,’Product Price’]
vif[“vif value”] = [variance_inflation_factor(dFppsc[[‘Order Item Discount’,
‘Order Item Discount Rate’, ‘Order Item Profit Ratio’,
‘Order Item Quantity’, ‘Sales’, ‘Order Item Total’,
‘Order Profit Per Order’,’Product Price’]].values, i) for i in range(len(vif[“columns”]))]
vif.T

Variance Inflation Factor (VIF)

df1=dFppsc.drop([“Order Item Total”,”Product Price”,”Order Item Discount Rate”,”Order Profit Per Order”],axis=1)

vif=pd.DataFrame()
vif[“columns”]=[‘Sales’,
‘Order Item Quantity’,’Order Item Discount’,’Order Item Profit Ratio’]
vif[“data”] = [variance_inflation_factor(df1[[‘Sales’,
‘Order Item Quantity’,’Order Item Discount’,’Order Item Profit Ratio’]].values, i) for i in range(len(vif[“columns”]))]
dFppsc[“Delivery Status”].unique()

array(['Advance shipping', 'Late delivery', 'Shipping on time',
       'Shipping canceled'], dtype=object)

dFppsc.groupby(“Delivery Status”)[“Order Id”].count()

Delivery Status
Advance shipping     41592
Late delivery        98977
Shipping canceled     7754
Shipping on time     32196
Name: Order Id, dtype: int64

dFppsc[dFppsc[“Delivery Status”]==”Shipping canceled”].groupby(“Order Status”).agg(tc=(“Order Id”,”count”))


Order Statustc
CANCELED3692
SUSPECTED_FRAUD4062

dF_clean=dFppsc[dFppsc[“Delivery Status”]!=”Shipping canceled”].copy()

############# RFM Analysis
dF_frequency=dF_clean.groupby([“Customer Id”]).agg(Total_count=(“Order Id”,”nunique”))

dF_clean[“Year”]=dF_clean[“order date (DateOrders)”].dt.year
dF_frequency_year=dF_clean.groupby([“Customer Id”,”Year”],as_index=False)[“Order Id”].nunique()

dF_frequency_year=dF_frequency_year.pivot_table(index=”Customer Id”,columns=”Year”,values=”Order Id”,fill_value=0)

pd.set_option(“display.max_columns”,None)
dF_clean[dF_clean[“Customer Id”]==12436]

dt1=datetime.datetime.strptime(’02-10-2017 12:46:00′, ‘%d-%m-%Y %H:%M:%S’)
dt2=dt1-datetime.timedelta(days=5)

df_exploring=dF_clean[dF_clean[“order date (DateOrders)”]>=dt2]
df_exploring.to_excel(“afterdt1.xlsx”)
df_exploring[df_exploring[“order date (DateOrders)”]>=dt1]

7964 rows × 47 columns

dF_clean[“Customer Full Name”]=dF_clean[“Customer Fname”]+” “+dF_clean[“Customer Lname”]
datetime_val = datetime.datetime.strptime(’02-10-2017 12:46:00′, ‘%d-%m-%Y %H:%M:%S’)
Customer_Names_After_Dt=dF_clean[dF_clean[“order date (DateOrders)”]>=datetime_val][[“Customer Full Name”,”Customer City”]].drop_duplicates()

Customer_Name_Common_Before_After_dt=pd.merge(dF_clean[(dF_clean[“order date (DateOrders)”]<=datetime_val)],Customer_Names_After_Dt,how=”inner”,on=[“Customer Full Name”,”Customer City”])
Records=pd.merge(Customer_Name_Common_Before_After_dt[[“Customer Full Name”,”Customer City”]].drop_duplicates(),dF_clean,how=”inner”,on=[“Customer Full Name”,”Customer City”])

dF_clean=dF_clean[dF_clean[“order date (DateOrders)”]<datetime_val].copy()

####### FREQUENCY

dF_frequency=dF_clean[[“Customer Id”,”Order Id”]].groupby(“Customer Id”,as_index=False).nunique()
dF_frequency.columns=[“Customer Id”,”Frequency”]
sns.histplot(dF_frequency.Frequency,bins=15,kde=False)

Frequency Histogram:

Frequency histogram

dF_frequency[“F”],Intervals_Frequency=pd.qcut(dF_frequency[“Frequency”],q=3,labels=[1,2,3],retbins=True)

########### RECENCY

dF_recency=dF_clean[[“Customer Id”,”order date (DateOrders)”]].groupby(“Customer Id”,as_index=False).max()
dF_recency.rename(columns ={“order date (DateOrders)”:”last_purchase_date”},inplace=True)
max_date=dF_recency[“last_purchase_date”].max()
dF_recency[“recency”]=max_date-dF_recency[“last_purchase_date”]
dF_recency[“recency”]=dF_recency[“recency”].dt.days

sns.displot(dF_recency.recency,bins=8,kde=False)

Recency Histogram

Recency histogram

dF_recency[‘R’],Intervals_Recency=pd.qcut(dF_recency[“recency”],q=3,labels=[3,2,1],retbins=True)

######MONETARY

dF_monetory=dF_clean[[“Customer Id”,”Order Item Total”]].groupby(“Customer Id”,as_index=False).sum()
dF_monetory.columns=[“Customer Id”,”Sum_of_Sales”]
dF_monetory[“M”],Intervals_Monetory=pd.qcut(dF_monetory[“Sum_of_Sales”],q=3,labels=[1,2,3],retbins=True)

sns.displot(dF_monetory.Sum_of_Sales,kde=True)

Monetary histogram

Monetary histogram

####RFM SCORE

dF_rfm=pd.merge(dF_recency[[“Customer Id”,”R”]],dF_monetory[[“Customer Id”,”M”]],on=”Customer Id”,how=”inner”)
dF_rfm=pd.merge(dF_rfm,dF_frequency[[“Customer Id”,”F”]],on=”Customer Id”,how=”inner”)
dF_rfm[“RFM”]=(dF_rfm[“R”]).astype(str)+(dF_rfm[“F”]).astype(str)+(dF_rfm[“M”]).astype(str)

######SEGMENT STRATEGY

Let’s read the input data

pd.set_option(‘display.max_colwidth’, None)
Segment_Strategy=pd.read_excel(“Segments_Strategy.xlsx”)
Segment_Strategy

Segment Strategy table

#####Customer Segment Data preparation

def Customer_Segment(data):
if data[“R”]==1 and data[“F”] in [1,2,3] and (data[“M”]==3):
return “Lost Customers – Big Spenders”
elif data[“R”]== 1 and data[“F”] in [1,2] and data[“M”] in [1,2]:
return “Lost Customers – Bargain”
elif data[“R”] in [1,2] and data[“F”]==3 and data[“M”] in [1,2]:
return “Lost/Almost Lost Customers – Loyal”
elif (data[“R”]==3) and (data[“F”]==3) and data[“M”] in [1,2]:
return “Loyal Customers”
elif (data[“R”]==3) and data[“F”] in [3,2] and data[“M”]==3:
return “Big Spenders”
elif (data[“R”]==3) and (data[“F”]==1) and data[“M”] in [1,2,3]:
return “New Customers”
elif (data[“R”]==3) and (data[“F”]==2) and data[“M”] in [1,2]:
return “Bargain Customers”
elif (data[“R”]==2) and data[“F”]==2 and data[“M”] in [1,2]:
return “Occasional Customers-Bargain”
elif (data[“R”]==2) and data[“F”] in [2,3] and data[“M”]==3:
return “Occasional Customers- Big Spenders”
elif (data[“R”]==2) and data[“F”]==1 and data [“M”] in [1,2,3]:
return “Unsatisfied Customers”
else:
return “No Segment”

dF_rfm[“R”]=dF_rfm[“R”].astype(“category”)
dF_rfm[“F”]=dF_rfm[“F”].astype(“category”)
dF_rfm[“M”]=dF_rfm[“M”].astype(“category”)
dF_rfm[“Segment”]=dF_rfm.apply(Customer_Segment,axis=1)

Segment_count=(dF_rfm.groupby(“Segment”,as_index=False).agg(Total_Count=(“Customer Id”,”count”))).sort_values(by=”Total_Count”,ascending=False)

Let’s plot Number of Customer in Each Segment

fig2=go.Figure()
fig2.add_trace(go.Bar(x=Segment_count.Segment,
y=Segment_count.Total_Count,
hovertemplate =”%{label}
Number of Customers:%{value}”,
texttemplate = “%{value}”))
fig2.update_layout(title=”Number of Customers in Each Segment”,
xaxis=dict(title=”Customer Segment”),
yaxis=dict(title=”Number Of Customers”),width=800)

Number of Customers vs Customer Segment

Let’s plot Revenue Generated by Each Segment

dF_clean=pd.merge(dF_clean,dF_rfm[[“Customer Id”,”RFM”,”Segment”]],on=”Customer Id”,how=”left”)
dF_segment_revenue=dF_clean.groupby(“Segment”,as_index=False).agg(Total_Revenue=(“Order Item Total”,”sum”)).sort_values(by=”Total_Revenue”,ascending=False)
fig3=go.Figure()
fig3.add_trace(go.Bar(x=dF_segment_revenue.Segment,
y=dF_segment_revenue.Total_Revenue,
hovertemplate =”%{label}
Revenue:%{value}”,
texttemplate = “%{value}”))
fig3.update_layout(title=”Revenue Generated by Each Segment”,xaxis=dict(title=”Customer Segment”),
yaxis=dict(title=”Revenue”))

Revenue Generated by Each Segment

######TEXT WRAPS & SEGMENT TREE MAPS

def wrap(a):
str1=a.split(” “)
final_str=””
for c in str1:
final_str=final_str+c+’

return final_str

Df1=dF_clean.groupby([“Segment”,”Department Name”,”Product Name”],
as_index=False
).agg(Total_Revenue=(“Order Item Total”,”sum”),Total_Quantity=(“Order Item Quantity”,”sum”))
Df1[“Product Name”]=Df1[“Product Name”].apply(wrap) #Applying text wrap because our product names are too long
segment=[“Big Spenders”,”Bargain Customers”,”Loyal Customers”,”New Customers”]
def type1(data):
if “Big Spenders” in data[“Segment”]:
return “Big spenders”
elif “Bargain” in data[“Segment”]:
return “Bargain”
elif “Loyal” in data[“Segment”]:
return “Loyal”
elif “New” in data[“Segment”]:
return “New Customers”
else:
return “Unsatisfied”

Df1[“Customer Type”]=Df1.apply(type1,axis=1)

data_tm=list(Df1.groupby(“Customer Type”))+[(‘All’,Df1)]

traces=[]
buttons=[]
visible=[]
for i,d in enumerate(data_tm):
visible=[False]*len(data_tm)
visible[i]=True
title=d[0]
traces.append((px.treemap(d[1],
path=[px.Constant(“All”),”Customer Type”,”Segment”,”Department Name”,”Product Name”],
values=”Total_Revenue”).update_traces(visible= True if i==0 else False)).data[0])
buttons.append(dict(label=title,
method=”update”,
args=[{
“visible”:visible},
{“title”:f”{title}”}
]))

updatemenus=[{“active”:0,”buttons”:buttons}]
fig11=go.Figure(data=traces,layout=dict(updatemenus=updatemenus))
fig11.update_layout(title=data_tm[0][0],title_x=0.6)
fig11.update_coloraxes(colorbar_title_text=”Total
Quantity”)
fig11.show()

Bargain tree map
Big spenders tree map
Loyal tree map
New Customers tree map
Unsatisfied tree map
All tree map

#####ABC ANALYSIS

Total_Products=dF_clean[“Product Name”].nunique()
print(“Total Number of products: “+f”{Total_Products}”)

Total Number of products: 100

Revenue_ABC=dF_clean.groupby([“Department Name”,”Product Name”]).agg(Total_Revenue=(“Order Item Total”,”sum”)).sort_values(by=”Total_Revenue”,ascending=False).reset_index()
Revenue_ABC[“cum_sum”]=Revenue_ABC[“Total_Revenue”].cumsum()
Revenue_ABC[“cum_per”]=Revenue_ABC[“cum_sum”]/Revenue_ABC[“Total_Revenue”].sum()*100
Revenue_ABC[“per”]=Revenue_ABC[“cum_per”]-Revenue_ABC[“cum_per”].shift(1)
Revenue_ABC.loc[0,”per”]=Revenue_ABC[“cum_per”][0]

def ABC(data):
if data[“cum_per”]<=75: return “A” elif data[“cum_per”]>75 and data[“cum_per”]<=95: return “B” elif data[“cum_per”]>95:
return “C”

Revenue_ABC[“ABC_Revenue”]=Revenue_ABC.apply(ABC,axis=1)

Bar_graph_Abc=Revenue_ABC[[“ABC_Revenue”,”Product Name”,”Total_Revenue”]].groupby(“ABC_Revenue”).agg(Revenue=(“Total_Revenue”,”sum”),count=(“Product Name”,”count”))

##########Bar_graph_Abc
fig2=go.Figure(go.Bar(x=Bar_graph_Abc.index,
y=Bar_graph_Abc[“Revenue”],
hovertemplate =”%{label}
Revenue:%{value}”,
texttemplate = “Revenue
%{value}”,
marker_color=[“orange”,”lightgreen”,”red”],
showlegend=False))
fig2.add_trace(
go.Scatter(
x=Bar_graph_Abc.index,
y=Bar_graph_Abc[“count”],
name=”Number Of Products”,
mode=’lines’,
line = dict(color=’blue’, width=3),
yaxis=”y2″,
marker_line_width = 0
))

fig2.update_layout(
title=”Revenue Generated By Products in Different ABC Segments”,
xaxis=dict(title=”Segment” ),
yaxis=dict(title=”Revenue”,showgrid=False),
yaxis2=dict(title=”Number Of Products”, anchor=”x”, overlaying=”y”,side=”right”,dtick=10),
legend = dict(x = 1.05, y = 1))
fig2.show()

Revenue Generated By Products in Different ABC Segments

######A Segment Products in Top 6 Cities

Top6Cities=dF_clean[[“Order City”,”Order Item Total”]].groupby(“Order City”).sum().sort_values(by=”Order Item Total”,ascending=False).head(6)

data=dF_clean[dF_clean[“Order City”].isin(Top6Cities.index.tolist())].groupby([“Order City”,”Product Name”],as_index=False).agg(Total_Revenue=(“Order Item Total”,”sum”)).sort_values(by=[“Order City”,”Total_Revenue”],ascending=False)
data[“sum”]=data.groupby(“Order City”)[“Total_Revenue”].transform(‘sum’)
data[“cum_per”]=data.groupby(“Order City”)[“Total_Revenue”].cumsum()/data[“sum”]*100
data[“segment”]=data.apply(ABC,axis=1)
data[“Product Name”]=data[“Product Name”].apply(wrap)
data.sort_values(by=[“sum”,”Total_Revenue”],inplace=True,ascending=False)

fig = px.bar(data[data[“segment”]==”A”], x=’Product Name’, y=’Total_Revenue’,
facet_col=’Order City’,facet_col_wrap=2,facet_row_spacing=0.12,
height=1500,title=”A Segment Products in Top 6 Cities”)
fig.for_each_annotation(lambda a: a.update(text=a.text.split(“=”)[-1]))
fig.update_xaxes(showticklabels=True,tickangle=0)
fig.update_layout(width=950)
fig.show()

A Segment Products in Top 6 Cities
A Segment Products in Top 6 Cities
A Segment Products in Top 6 Cities

Read more here.

Bank Churn ML Prediction

Let’s set the working directory

import os
os.chdir(‘YOURPATH’)
os. getcwd()

Basic Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

Processing libraries
from sklearn.preprocessing import MinMaxScaler
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif, chi2

Model libraries
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import GradientBoostingClassifier

from sklearn.model_selection import train_test_split
from sklearn import metrics

Let’s read the input dataset

df = pd.read_csv(‘BankChurners.csv’)

df = df.drop(‘Naive_Bayes_Classifier_Attrition_Flag_Card_Category_Contacts_Count_12_mon_Dependent_count_Education_Level_Months_Inactive_12_mon_2’, axis = 1)
df = df.drop(‘Naive_Bayes_Classifier_Attrition_Flag_Card_Category_Contacts_Count_12_mon_Dependent_count_Education_Level_Months_Inactive_12_mon_1’, axis = 1

Let’s check the statistics

df.describe().T

Descriptive statistics

Number of churn and non-churn
counts = df.Attrition_Flag.value_counts()
perc_churn = (counts[1] / (counts[0] + counts[1])) * 100

No. of duplicates
duplicates = len(df[df.duplicated()])

No. of missing values
missing_values = df.isnull().sum().sum()

Data types in dataset
types = df.dtypes.value_counts()
print(“Churn Rate = %.1f %%”%(perc_churn))
print(‘Number of Duplicate Entries: %d’%(duplicates))
print(‘Number of Missing Values: %d’%(missing_values))
print(‘Number of Features: %d’%(df.shape[1]))
print(‘Number of Customers: %d’%(df.shape[0]))
print(‘Data Types and Frequency in Dataset:’)
print(types)

Churn Rate = 16.1 %
Number of Duplicate Entries: 0
Number of Missing Values: 0
Number of Features: 21
Number of Customers: 10127
Data Types and Frequency in Dataset:
int64      10
object      6
float64     5
Name: count, dtype: int64

####### Data Pre-processing

Make gender and outcome numerical

df[‘Gender’] = df[‘Gender’].map({‘M’: 1, ‘F’: 0})
df[‘Attrition_Flag’] = df[‘Attrition_Flag’].map({‘Attrited Customer’: 1, ‘Existing Customer’: 0})

drop client id
df = df.drop(‘CLIENTNUM’, axis = 1)

catcols = df.select_dtypes(exclude = [‘int64′,’float64’]).columns
intcols = df.select_dtypes(include = [‘int64’]).columns
floatcols = df.select_dtypes(include = [‘Float64’]).columns

One-hot encoding on categorical columns
df = pd.get_dummies(df, columns = catcols)

Minmax scaling numeric features
for col in df[floatcols]:
df[col] = MinMaxScaler().fit_transform(df[[col]])

for col in df[intcols]:
df[col] = MinMaxScaler().fit_transform(df[[col]])

print(‘New Number of Features: %d’%(df.shape[1]))

New Number of Features: 37

Split into X and y
X = df.drop(‘Attrition_Flag’, axis = 1)
y = df[‘Attrition_Flag’]

##################Correlation between numerical features

heat = df.corr()
plt.figure(figsize=[16,8])
plt.title(“Correlation between numerical features”, size = 25, pad = 20, color = ‘#8cabb6’)
sns.heatmap(heat,cmap = sns.diverging_palette(20, 220, n = 200), annot=False)
plt.show()

Correlation between numerical features.

#######Feature Engineering

print(“Correlation Coefficient of all the Features”)
corr = df.corr()
corr.sort_values([“Attrition_Flag”], ascending = False, inplace = True)
correlations = corr.Attrition_Flag
a = correlations[correlations > 0.1]
b = correlations[correlations < -0.1]
top_corr_features = a._append(b)
top_corr_features

Correlation Coefficient of all the Features

Out[12]:

Attrition_Flag              1.000000
Contacts_Count_12_mon       0.204491
Months_Inactive_12_mon      0.152449
Total_Amt_Chng_Q4_Q1       -0.131063
Total_Relationship_Count   -0.150005
Total_Trans_Amt            -0.168598
Avg_Utilization_Ratio      -0.178410
Total_Revolving_Bal        -0.263053
Total_Ct_Chng_Q4_Q1        -0.290054
Total_Trans_Ct             -0.371403
Name: Attrition_Flag, dtype: float64

Let’s define the following functions

def plot_importances(model, model_name, features_to_plot, feature_names):
#fit model and performances
model.fit(X,y)
importances = model.feature_importances_

# sort and rank importances
indices = np.argsort(importances)
best_features = np.array(feature_names)[indices][-features_to_plot:]
values = importances[indices][-features_to_plot:]

# plot a graph
y_ticks = np.arange(0, features_to_plot)
fig, ax = plt.subplots()
ax.barh(y_ticks, values, color = '#b2c4cc')
ax.set_yticklabels(best_features)
ax.set_yticks(y_ticks)
ax.set_title("%s Feature Importances"%(model_name))
fig.tight_layout()
plt.show()

def best_features(model, features_to_plot, feature_names):
# get list of best features
model.fit(X,y)
importances = model.feature_importances_

indices = np.argsort(importances)
best_features = np.array(feature_names)[indices][-features_to_plot:]
return best_features

Let’s plot feature importance weights for the following 3 models

feature_names = list(X.columns)

model1 = RandomForestClassifier(random_state = 1234)
plot_importances(model1, ‘Random Forest’, 10, feature_names)

model2 = GradientBoostingClassifier(n_estimators = 100, learning_rate = 1.0, max_depth = 1, random_state = 0)
plot_importances(model2, ‘XGBoost’, 10, feature_names)

model3 = AdaBoostClassifier(n_estimators = 100, learning_rate = 1.0, random_state = 0)
plot_importances(model3, ‘AdaBoost’, 10, feature_names)

Random Forest Feature Importances
XGBoost Feature Importances
AdaBoost Feature Importances

Looking at the F-value between label/feature for classification tasks

f_selector = SelectKBest(f_classif, k = 10)
f_selector.fit_transform(X, y)
f_selector_best = f_selector.get_feature_names_out()
print(f_selector_best)

['Gender' 'Total_Relationship_Count' 'Months_Inactive_12_mon'
 'Contacts_Count_12_mon' 'Total_Revolving_Bal' 'Total_Amt_Chng_Q4_Q1'
 'Total_Trans_Amt' 'Total_Trans_Ct' 'Total_Ct_Chng_Q4_Q1'
 'Avg_Utilization_Ratio']

forest_best = list(best_features(model1, 10, feature_names))
XG_best = list(best_features(model2, 10, feature_names))
ada_best = list(best_features(model3, 10, feature_names))
top_corr_features = list(top_corr_features.index[1:])
f_selector_best = list(f_selector_best)

best_features_overall = forest_best + XG_best + ada_best + top_corr_features + f_selector_best

# create a dictionary with the number of times features appear
from collections import Counter
count_best_features = dict(Counter(best_features_overall))

# list of the features without any repeatitions
features_no_repeats = list(dict.fromkeys(best_features_overall))

display(count_best_features)

{'Customer_Age': 3,
 'Avg_Open_To_Buy': 3,
 'Credit_Limit': 2,
 'Total_Amt_Chng_Q4_Q1': 5,
 'Avg_Utilization_Ratio': 3,
 'Total_Relationship_Count': 4,
 'Total_Revolving_Bal': 5,
 'Total_Ct_Chng_Q4_Q1': 5,
 'Total_Trans_Ct': 5,
 'Total_Trans_Amt': 5,
 'Months_Inactive_12_mon': 4,
 'Contacts_Count_12_mon': 4,
 'Gender': 1}

# get list of features with high counts in the dictionary
def get_features(threshold):
# remove features below a certain number of appearances
chosen_features = []
for i in features_no_repeats:
if count_best_features[i] > threshold:
chosen_features.append(i)
return chosen_features

chosen_features = get_features(2)
chosen_features.remove(‘Avg_Open_To_Buy’)
chosen_features.remove(‘Avg_Utilization_Ratio’)
chosen_features

['Customer_Age',
 'Total_Amt_Chng_Q4_Q1',
 'Total_Relationship_Count',
 'Total_Revolving_Bal',
 'Total_Ct_Chng_Q4_Q1',
 'Total_Trans_Ct',
 'Total_Trans_Amt',
 'Months_Inactive_12_mon',
 'Contacts_Count_12_mon']

######### ML Model Evaluation

def eval_model(model, model_name, X, y, threshold):
# make X the chosen subset
chosen_features = get_features(threshold)
X = X[chosen_features]

train_x, test_x, train_y, test_y = train_test_split(X, y, test_size = 0.25, random_state = 42)

# fit model
model.fit(train_x,train_y)
model.score(test_x, test_y)
pred_test = model.predict(test_x)

# get metrics
f1 = metrics.f1_score(test_y, pred_test)
test_acc = metrics.accuracy_score(test_y, pred_test)
con = metrics.confusion_matrix(test_y, pred_test)

print(con,'%s model with %s threshold: %.4f F1-score and %.4f accuracy'%(model_name, threshold, f1, test_acc))

Let’s use the full dataset to perform train/test data splitting with test_size = 0.25

train_x, test_x, train_y, test_y = train_test_split(X, y, test_size = 0.25, random_state = 42)

####Model1 Fitting

model1.fit(train_x,train_y)
model1.score(test_x, test_y)
pred_test = model1.predict(test_x)

f1 = metrics.f1_score(test_y, pred_test)
test_acc = metrics.accuracy_score(test_y, pred_test)
con = metrics.confusion_matrix(test_y, pred_test)

print(con,f1,test_acc)

[[2094   19]
 [  96  323]] 0.8488830486202367 0.9545813586097947

Let’s look at the reduced dataset and repeat model1 fitting

chosen_features = get_features(2)
chosen_features.remove(‘Avg_Open_To_Buy’)
chosen_features.remove(‘Avg_Utilization_Ratio’)
Xnew = X[chosen_features]

train_x, test_x, train_y, test_y = train_test_split(Xnew, y, test_size = 0.25, random_state = 42)

model1.fit(train_x,train_y)
model1.score(test_x, test_y)
pred_test = model1.predict(test_x)

f1 = metrics.f1_score(test_y, pred_test)
test_acc = metrics.accuracy_score(test_y, pred_test)
con = metrics.confusion_matrix(test_y, pred_test)

print(con,f1,test_acc)

[[2086   27]
 [  60  359]] 0.8919254658385094 0.9656398104265402

######More ML Classifiers

from sklearn.ensemble import BaggingClassifier
from sklearn.neighbors import KNeighborsClassifier
model4 = BaggingClassifier(KNeighborsClassifier(n_neighbors = 7), max_samples = 0.8, max_features = 0.8)

eval_model(model4, ‘KNN’, X, y, 2)

[[2074   39]
 [ 176  243]] KNN model with 2 threshold: 0.6933 F1-score and 0.9151 accuracy

from sklearn.linear_model import LogisticRegression

model5 = LogisticRegression(random_state=1)
eval_model(model5, ‘Logistic’, X, y, 2)

[[2059   54]
 [ 217  202]] Logistic model with 2 threshold: 0.5985 F1-score and 0.8930 accuracy

# lists of possible parameters
n = [400, 450, 500, 550, 600, 650, 700, 750, 800, 850, 900, 950, 1000]
depth = [8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20]
rand = [600, 650, 700, 750, 800, 850, 900, 950, 1000, 1050, 1100, 1150, 1200, 1250]

forest = RandomForestClassifier(n_estimators = 100, max_depth = 15, random_state = 750)
eval_model(forest, ‘forest’, X, y, 2)

[[2090   23]
 [  64  355]] forest model with 2 threshold: 0.8908 F1-score and 0.9656 accuracy

####Further Evaluation of Random Forest

def eval_forest(model, model_name, X, y, threshold, n, depth, rand):
# create subset from feature selection
chosen_features = get_features(threshold)
chosen_features.remove(‘Avg_Open_To_Buy’)
chosen_features.remove(‘Avg_Utilization_Ratio’)
X = X[chosen_features]

train_x, test_x, train_y, test_y = train_test_split(X, y, test_size = 0.25, random_state = 42)

model.fit(train_x,train_y)
model.score(test_x, test_y)
pred_test = model.predict(test_x)

f1 = metrics.f1_score(test_y, pred_test)
test_acc = metrics.accuracy_score(test_y, pred_test)
con = metrics.confusion_matrix(test_y, pred_test)

print('Model: %s Threshold: %s F1-Score %.4f Accuracy: %.4f n_estimators: %s depth: %s rand: %s'%(model_name, threshold, f1, test_acc,n,depth,rand))

Let’s define the model

forest = RandomForestClassifier(n_estimators = 850, max_depth = 19, random_state = 1200)

and train/test this model using 2 selected features

chosen_features = get_features(2)
chosen_features.remove(‘Avg_Open_To_Buy’)
chosen_features.remove(‘Avg_Utilization_Ratio’)
X_new = X[chosen_features]

train_x, test_x, train_y, test_y = train_test_split(X_new, y, test_size = 0.25, random_state = 42)

forest.fit(train_x,train_y)
forest.score(test_x, test_y)
pred_test = forest.predict(test_x)

f1 = metrics.f1_score(test_y, pred_test)
test_acc = metrics.accuracy_score(test_y, pred_test)
con = metrics.confusion_matrix(test_y, pred_test)
precision = metrics.precision_score(test_y, pred_test)
recall = metrics.recall_score(test_y, pred_test)
roc = metrics.roc_auc_score(test_y, pred_test)
print(‘Accuracy Score’, test_acc)
print(‘Precision’, precision)
print(‘Recall’, recall)
print(‘F1-Score’, f1)
print(‘ROC Score’, roc)
print(con)

Accuracy Score 0.9664296998420221
Precision 0.9304123711340206
Recall 0.8615751789976134
F1-Score 0.8946716232961587
ROC Score 0.9243985691485936
[[2086   27]
 [  58  361]]

#####CROSS-VALIDATION

from sklearn.model_selection import cross_validate

cv_results = cross_validate(forest, X_new, y, scoring = (‘f1’, ‘accuracy’, ‘roc_auc’), cv = 8)
sorted(cv_results.keys())

['fit_time', 'score_time', 'test_accuracy', 'test_f1', 'test_roc_auc']

cv_results[‘test_roc_auc’]

array([0.91056078, 0.96866615, 0.98541168, 0.9918462 , 0.99539807,
       0.99796213, 0.8989536 , 0.90523967])

cv_results[‘test_accuracy’]

array([0.88230648, 0.94233807, 0.96366509, 0.96366509, 0.96840442,
       0.97630332, 0.92969984, 0.92885375])

#####ML PERFORMANCE ANALYSIS USING Scikit Plot

Import standard libraries

import scikitplot as skplt

import sklearn

from sklearn.model_selection import train_test_split

from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor, GradientBoostingClassifier, ExtraTreesClassifier
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

import matplotlib.pyplot as plt

import sys
import warnings
warnings.filterwarnings(“ignore”)

print(“Scikit Plot Version : “, skplt.version)
print(“Scikit Learn Version : “, sklearn.version)
print(“Python Version : “, sys.version)

%matplotlib inline

Scikit Plot Version :  0.3.7
Scikit Learn Version :  1.2.2
Python Version :  3.9.16 (main, Jan 11 2023, 16:16:36) [MSC v.1916 64 bit (AMD64)]

#####RFC Learning Curve

skplt.estimators.plot_learning_curve(forest, train_x, train_y,
cv=7, shuffle=True, scoring=”accuracy”,
n_jobs=-1, figsize=(6,4), title_fontsize=”large”, text_fontsize=”large”,
title=”RFC Learning Curve”);

RFC Learning Curve

######KNN Classification Learning Curve

skplt.estimators.plot_learning_curve(model4, train_x, train_y,
cv=7, shuffle=True, scoring=”accuracy”,
n_jobs=-1, figsize=(6,4), title_fontsize=”large”, text_fontsize=”large”,
title=”KNN Classification Learning Curve”);

KNN Classification Learning Curve

####ETC Learning Curve

skplt.estimators.plot_learning_curve(ExtraTreesClassifier(), train_x, train_y,
cv=7, shuffle=True, scoring=”accuracy”,
n_jobs=-1, figsize=(6,4), title_fontsize=”large”, text_fontsize=”large”,
title=”ETC Learning Curve”);

ETC Learning Curve

#####GBC Learning Curve

skplt.estimators.plot_learning_curve(GradientBoostingClassifier(), train_x, train_y,
cv=7, shuffle=True, scoring=”accuracy”,
n_jobs=-1, figsize=(6,4), title_fontsize=”large”, text_fontsize=”large”,
title=”GBC Learning Curve”);

GBC Learning Curve

#######DTC Learning Curve

from sklearn.tree import DecisionTreeClassifier
skplt.estimators.plot_learning_curve(DecisionTreeClassifier(), train_x, train_y,
cv=7, shuffle=True, scoring=”accuracy”,
n_jobs=-1, figsize=(6,4), title_fontsize=”large”, text_fontsize=”large”,
title=”DTC Learning Curve”);

DTC Learning Curve

######ABC Learning Curve

from sklearn.ensemble import AdaBoostClassifier
skplt.estimators.plot_learning_curve(AdaBoostClassifier(), train_x, train_y,
cv=7, shuffle=True, scoring=”accuracy”,
n_jobs=-1, figsize=(6,4), title_fontsize=”large”, text_fontsize=”large”,
title=”ABC Learning Curve”);

ABC Learning Curve

########MLPC Learning Curve

from sklearn.neural_network import MLPClassifier
skplt.estimators.plot_learning_curve(MLPClassifier(), train_x, train_y,
cv=7, shuffle=True, scoring=”accuracy”,
n_jobs=-1, figsize=(6,4), title_fontsize=”large”, text_fontsize=”large”,
title=”MLPC Learning Curve”);

MLPC Learning Curve

#######GNB Learning Curve

from sklearn.naive_bayes import GaussianNB
skplt.estimators.plot_learning_curve(GaussianNB(), train_x, train_y,
cv=7, shuffle=True, scoring=”accuracy”,
n_jobs=-1, figsize=(6,4), title_fontsize=”large”, text_fontsize=”large”,
title=”GNB Learning Curve”);

GNB Learning Curve

######LR Learning Curve

skplt.estimators.plot_learning_curve(LogisticRegression(), train_x, train_y,
cv=7, shuffle=True, scoring=”accuracy”,
n_jobs=-1, figsize=(6,4), title_fontsize=”large”, text_fontsize=”large”,
title=”LR Learning Curve”);

LR Learning Curve

#######QDA Learning Curve

from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
skplt.estimators.plot_learning_curve(QuadraticDiscriminantAnalysis(), train_x, train_y,
cv=7, shuffle=True, scoring=”accuracy”,
n_jobs=-1, figsize=(6,4), title_fontsize=”large”, text_fontsize=”large”,
title=”QDA Learning Curve”);

QDA Learning Curve

########SVC Learning Curve

from sklearn.svm import SVC
skplt.estimators.plot_learning_curve(SVC(), train_x, train_y,
cv=7, shuffle=True, scoring=”accuracy”,
n_jobs=-1, figsize=(6,4), title_fontsize=”large”, text_fontsize=”large”,
title=”SVC Learning Curve”);

SVC Learning Curve

######CALIBRATION PLOTS

lr_probas = QuadraticDiscriminantAnalysis().fit(train_x, train_y).predict_proba(test_x)
rf_probas = RandomForestClassifier().fit(train_x, train_y).predict_proba(test_x)
gb_probas = GradientBoostingClassifier().fit(train_x, train_y).predict_proba(test_x)
et_scores = AdaBoostClassifier().fit(train_x, train_y).predict_proba(test_x)

probas_list = [lr_probas, rf_probas, gb_probas, et_scores]
clf_names = [‘QDA’, ‘RFC’, ‘GBC’, ‘ABC’]

skplt.metrics.plot_calibration_curve(test_y,
probas_list,
clf_names, n_bins=15,
figsize=(12,6)
);

CALIBRATION PLOTS
######GBC Normalized Confusion Matrix

from sklearn.metrics import confusion_matrix
import seaborn as sns

model=GradientBoostingClassifier()

model.fit(train_x,train_y)
pred_test = model.predict(test_x)

cm = confusion_matrix(test_y, pred_test)

target_names=[‘0′,’1’]

cmn = cm.astype(‘float’) / cm.sum(axis=1)[:, np.newaxis]
fig, ax = plt.subplots(figsize=(10,10))
sns.heatmap(cmn, annot=True, fmt=’.2f’, xticklabels=target_names, yticklabels=target_names)
plt.ylabel(‘Actual’)
plt.xlabel(‘Predicted’)
plt.title(‘GBC Normalized Confusion Matrix’)

GBC Normalized Confusion Matrix

########RFC Normalized Confusion Matrix

from sklearn.metrics import confusion_matrix
import seaborn as sns

forest.fit(train_x,train_y)

pred_test = forest.predict(test_x)

forest.fit(train_x,train_y)
pred_test = forest.predict(test_x)

cm = confusion_matrix(test_y, pred_test)

target_names=[‘0′,’1’]

cmn = cm.astype(‘float’) / cm.sum(axis=1)[:, np.newaxis]
fig, ax = plt.subplots(figsize=(10,10))
sns.heatmap(cmn, annot=True, fmt=’.2f’, xticklabels=target_names, yticklabels=target_names)
plt.ylabel(‘Actual’)
plt.xlabel(‘Predicted’)
plt.title(‘RFC Normalized Confusion Matrix’)

RFC Normalized Confusion Matrix

########RFC ROC Curve

Y_test_probs = forest.predict_proba(test_x)

skplt.metrics.plot_roc_curve(test_y, Y_test_probs,
title=”RFC ROC Curve”, figsize=(12,6));

RFC ROC Curve

##########GBC ROC Curve

model=GradientBoostingClassifier()
model.fit(train_x,train_y)
Y_test_probs = model.predict_proba(test_x)

skplt.metrics.plot_roc_curve(test_y, Y_test_probs,
title=”GBC ROC Curve”, figsize=(12,6));

GBC ROC Curve

##########RFC Precision-Recall Curve

Y_test_probs = forest.predict_proba(test_x)
skplt.metrics.plot_precision_recall_curve(test_y, Y_test_probs,
title=”RFC Precision-Recall Curve”, figsize=(12,6));

RFC Precision-Recall Curve

########GBC Precision-Recall Curve

model=GradientBoostingClassifier()
model.fit(train_x,train_y)
Y_test_probs = model.predict_proba(test_x)
skplt.metrics.plot_precision_recall_curve(test_y, Y_test_probs,
title=”GBC Precision-Recall Curve”, figsize=(12,6));

GBC Precision-Recall Curve

#####K-Means Elbow Plot

skplt.cluster.plot_elbow_curve(KMeans(random_state=1),
train_x,
cluster_ranges=range(2, 20),
figsize=(8,6));

Elbow Plot

######Silhouette Analysis

kmeans = KMeans(n_clusters=10, random_state=1)
kmeans.fit(train_x, train_y)
cluster_labels = kmeans.predict(test_x)

skplt.metrics.plot_silhouette(test_x, cluster_labels,
figsize=(8,6));

Silhouette Analysis

#######PCA Component Variance

pca = PCA(random_state=1)
pca.fit(train_x)

skplt.decomposition.plot_pca_component_variance(pca, figsize=(8,6));

PCA Component Explained Variances

#####PCA 2-D Projection

skplt.decomposition.plot_pca_2d_projection(pca, train_x, train_y,
figsize=(10,10),
cmap=”tab10″);

PCA 2-D Projection

######Classification Reports via yellowbrick

import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import sklearn
import yellowbrick

pd.set_option(“display.max_columns”, 35)

import warnings
warnings.filterwarnings(“ignore”)

from yellowbrick.classifier import ClassificationReport

viz = ClassificationReport(forest,
classes=target_names,
support=True,
fig=plt.figure(figsize=(8,6)))

viz.fit(train_x, train_y)

viz.score(test_x, test_y)

viz.show();

RFC Classification report

from yellowbrick.classifier import ClassificationReport

model=GradientBoostingClassifier()
model.fit(train_x,train_y)

viz = ClassificationReport(model,
classes=target_names,
support=True,
fig=plt.figure(figsize=(8,6)))

viz.fit(train_x, train_y)

viz.score(test_x, test_y)

viz.show();

GBC Classification Report

######CLASS PREDICTION ERROR

from yellowbrick.classifier import ClassPredictionError

viz = ClassPredictionError(forest,
classes=target_names,
fig=plt.figure(figsize=(9,6)))

viz.fit(train_x, train_y)

viz.score(test_x, test_y)

viz.show();

CLASS PREDICTION ERROR: RFC

model=GradientBoostingClassifier()
model.fit(train_x,train_y)
viz = ClassPredictionError(model,
classes=target_names,
fig=plt.figure(figsize=(9,6)))

viz.fit(train_x, train_y)

viz.score(test_x, test_y)

viz.show();

CLASS PREDICTION ERROR: GBC

Read more here.

Groceries Market Basket & RFM Analysis

Let’s set the working directory

import os
os.chdir(‘YOURPATH’)
os. getcwd()

and import/install standard libraries

!pip install apriori

import os
import plotly.express as px
import numpy as np
import pandas as pd
import plotly.graph_objects as go
import matplotlib.pyplot as plt
import seaborn as sns
import operator as op
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules

Loading the input dataset
data = pd.read_csv(‘Groceries_dataset.csv’)
data.head()

Grocery dataset

data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 38765 entries, 0 to 38764
Data columns (total 3 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   Member_number    38765 non-null  int64 
 1   Date             38765 non-null  object
 2   itemDescription  38765 non-null  object
dtypes: int64(1), object(2)
memory usage: 908.7+ KB

Renaming the columns
data.columns = [‘memberID’, ‘Date’, ‘itemName’]

Checking for the missing values
nan_values = data.isna().sum()
nan_values

memberID    0
Date        0
itemName    0
dtype: int64

Converting Date column into correct data type which is datetime
data.Date = pd.to_datetime(data.Date)
data.memberID = data[‘memberID’].astype(‘str’)
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 38765 entries, 0 to 38764
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype         
---  ------    --------------  -----         
 0   memberID  38765 non-null  object        
 1   Date      38765 non-null  datetime64[ns]
 2   itemName  38765 non-null  object        
dtypes: datetime64[ns](1), object(2)
memory usage: 908.7+ KB

#######EXPLORATORY DATA ANALYSIS (EDA)

Sales_weekly = data.resample(‘w’, on=’Date’).size()
fig = px.line(data, x=Sales_weekly.index, y=Sales_weekly,
labels={‘y’: ‘Number of Sales’,
‘x’: ‘Date’})
fig.update_layout(title_text=’Number of Sales Weekly’,
title_x=0.5, title_font=dict(size=18))
fig.show()

####Number of Sales Weekly

Number of sales weekly

#########Number of Customers Weekly

Unique_customer_weekly = data.resample(‘w’, on=’Date’).memberID.nunique()
fig = px.line(Unique_customer_weekly, x=Unique_customer_weekly.index, y=Unique_customer_weekly,
labels={‘y’: ‘Number of Customers’})
fig.update_layout(title_text=’Number of Customers Weekly’,
title_x=0.5, title_font=dict(size=18))
fig.show()

Number of Customers Weekly'

#####Sales per Customer Ratio Weekly

Sales_per_Customer = Sales_weekly / Unique_customer_weekly
fig = px.line(Sales_per_Customer, x=Sales_per_Customer.index, y=Sales_per_Customer,
labels={‘y’: ‘Sales per Customer Weekly’})
fig.update_layout(title_text=’Sales per Customer Weekly’,
title_x=0.5, title_font=dict(size=18))
fig.update_yaxes(rangemode=”tozero”)
fig.show()

Sales per Customer Ratio Weekly

#######Frequency of the Items Sold

Frequency_of_items = data.groupby(pd.Grouper(key=’itemName’)).size().reset_index(name=’count’)
fig = px.treemap(Frequency_of_items, path=[‘itemName’], values=’count’)
fig.update_layout(title_text=’Frequency of the Items Sold’,
title_x=0.5, title_font=dict(size=18)
)
fig.update_traces(textinfo=”label+value”)
fig.show()

Frequency of the Items Sold

########Top 20 Customers regarding Number of Items Bought

user_item = data.groupby(pd.Grouper(key=’memberID’)).size().reset_index(name=’count’) \
.sort_values(by=’count’, ascending=False)
fig = px.bar(user_item.head(25), x=’memberID’, y=’count’,
labels={‘y’: ‘Number of Sales’,
‘count’: ‘Number of Items Bought’},
color=’count’)
fig.update_layout(title_text=’Top 20 Customers regarding Number of Items Bought’,
title_x=0.5, title_font=dict(size=18))
fig.update_traces(marker=dict(line=dict(color=’#000000′, width=1)))
fig.show()

Top 20 Customers regarding Number of Items Bought

##########Number of Sales per Discrete Week Days

day = data.groupby(data[‘Date’].dt.strftime(‘%A’))[‘itemName’].count()
fig = px.bar(day, x=day.index, y=day, color=day,
labels={‘y’: ‘Number of Sales’,
‘Date’: ‘Week Days’})
fig.update_layout(title_text=’Number of Sales per Discrete Week Days’,
title_x=0.5, title_font=dict(size=18))
fig.update_traces(marker=dict(line=dict(color=’#000000′, width=1)))
fig.show()

Number of Sales per Discrete Week Days

##############Number of Sales per Discrete Months

month = data.groupby(data[‘Date’].dt.strftime(‘%m’))[‘itemName’].count()
fig = px.bar(month, x=month.index, y=month, color=month,
labels={‘y’: ‘Number of Sales’,
‘Date’: ‘Months’})
fig.update_layout(title_text=’Number of Sales per Discrete Months’,
title_x=0.5, title_font=dict(size=18))
fig.update_traces(marker=dict(line=dict(color=’#000000′, width=1)))
fig.show()

Number of Sales per Discrete Months

###########Number of Sales per Discrete Month Days

month_day = data.groupby(data[‘Date’].dt.strftime(‘%d’))[‘itemName’].count()
fig = px.bar(month_day, x=month_day.index, y=month_day, color=month_day,
labels={‘y’: ‘Number of Sales’,
‘Date’: ‘Month Days’})
fig.update_layout(title_text=’Number of Sales per Discrete Month Days’,
title_x=0.5, title_font=dict(size=18))
fig.update_traces(marker=dict(line=dict(color=’#000000′, width=1)))
fig.show()

Number of Sales per Discrete Month Days

###########Market Basket Analysis
baskets = data.groupby([‘memberID’, ‘itemName’])[‘itemName’].count().unstack().fillna(0).reset_index()
#baskets.head()

Let’s check the most sold -item which is whole milk- has the same number of sales as we discussed above in the treemap.
baskets[‘whole milk’].sum()

2502.0

Encoding the items that sold more than 1
def one_hot_encoder(k):
if k <= 0: return 0 if k >= 1:
return 1

baskets_final = baskets.iloc[:, 1:baskets.shape[1]].applymap(one_hot_encoder)
#baskets_final.head()

Finding the most frequent items sold together
frequent_itemsets = apriori(baskets_final, min_support=0.025, use_colnames=True, max_len=3).sort_values(by=’support’)
#frequent_itemsets.head(25)

Creating association rules for indicating astecedent and consequent items
rules = association_rules(frequent_itemsets, metric=”lift”, min_threshold=1).sort_values(‘lift’, ascending=False)
rules = rules[[‘antecedents’, ‘consequents’, ‘support’, ‘confidence’, ‘lift’]]
rules.head(25)

Market Basket table

###########RFM Analysis

Finding last purchase date of each customer
Recency = data.groupby(by=’memberID’)[‘Date’].max().reset_index()
Recency.columns = [‘memberID’, ‘LastDate’]
#Recency.head()

Finding last date for our dataset
last_date_dataset = Recency[‘LastDate’].max()
last_date_dataset

Timestamp('2015-12-30 00:00:00')

Calculating Recency by subtracting (last transaction date of dataset) and (last purchase date of each customer)
Recency[‘Recency’] = Recency[‘LastDate’].apply(lambda x: (last_date_dataset – x).days)
#Recency.head()

##########Recency Distribution of the Customers

fig = px.histogram(Recency, x=’Recency’, opacity=0.85, marginal=’box’)
fig.update_traces(marker=dict(line=dict(color=’#000000′, width=1)))
fig.update_layout(title_text=’Recency Distribution of the Customers’,
title_x=0.5, title_font=dict(size=20))
fig.show()

Recency Distribution of the Customers

###########Frequency of the customer visits
Frequency = data.drop_duplicates([‘Date’, ‘memberID’]).groupby(by=[‘memberID’])[‘Date’].count().reset_index()
Frequency.columns = [‘memberID’, ‘Visit_Frequency’]
#Frequency.head()

#######Visit Frequency Distribution of the Customers

fig = px.histogram(Frequency, x=’Visit_Frequency’, opacity=0.85, marginal=’box’)
fig.update_traces(marker=dict(line=dict(color=’#000000′, width=1)))
fig.update_layout(title_text=’Visit Frequency Distribution of the Customers’,
title_x=0.5, title_font=dict(size=20))
fig.show()

Visit Frequency Distribution of the Customers

#####MONETARY

Monetary = data.groupby(by=”memberID”)[‘itemName’].count().reset_index()
Monetary.columns = [‘memberID’, ‘Monetary’]
#Monetary.head()

fig = px.histogram(Monetary, x=’Monetary’, opacity=0.85, marginal=’box’,
labels={‘itemName’: ‘Monetary’})
fig.update_traces(marker=dict(line=dict(color=’#000000′, width=1)))
fig.update_layout(title_text=’Monetary Distribution of the Customers’,
title_x=0.5, title_font=dict(size=20))
fig.show()

Monetary Distribution of the Customers

#######RFM Score

Combining all scores into one DataFrame
RFM = pd.concat([Recency[‘memberID’], Recency[‘Recency’], Frequency[‘Visit_Frequency’], Monetary[‘Monetary’]], axis=1)
#RFM.head()

5-5 score = the best customers
RFM[‘Recency_quartile’] = pd.qcut(RFM[‘Recency’], 5, [5, 4, 3, 2, 1])
RFM[‘Frequency_quartile’] = pd.qcut(RFM[‘Visit_Frequency’], 5, [1, 2, 3, 4, 5])

RFM[‘RF_Score’] = RFM[‘Recency_quartile’].astype(str) + RFM[‘Frequency_quartile’].astype(str)
#RFM.head()

segt_map = { # Segmentation Map [Ref]
r'[1-2][1-2]’: ‘hibernating’,
r'[1-2][3-4]’: ‘at risk’,
r'[1-2]5′: ‘can\’t loose’,
r’3[1-2]’: ‘about to sleep’,
r’33’: ‘need attention’,
r'[3-4][4-5]’: ‘loyal customers’,
r’41’: ‘promising’,
r’51’: ‘new customers’,
r'[4-5][2-3]’: ‘potential loyalists’,
r’5[4-5]’: ‘champions’
}

RFM[‘RF_Segment’] = RFM[‘RF_Score’].replace(segt_map, regex=True)
#RFM.head()

x = RFM.RF_Segment.value_counts()
fig = px.treemap(x, path=[x.index], values=x)
fig.update_layout(title_text=’Distribution of the RFM Segments’, title_x=0.5,
title_font=dict(size=20))
fig.update_traces(textinfo=”label+value+percent root”)
fig.show()

Distribution of the RFM Segments

###########Relationship between Visit_Frequency and Recency

fig = px.scatter(RFM, x=”Visit_Frequency”, y=”Recency”, color=’RF_Segment’,
labels={“math score”: “Math Score”,
“writing score”: “Writing Score”})
fig.update_layout(title_text=’Relationship between Visit_Frequency and Recency’,
title_x=0.5, title_font=dict(size=20))
fig.show()

Relationship between Visit_Frequency and Recency

Read more here.

Largest E-Commerce Showcase in Pakistan

Let’s apply the customer segmentation RFM analysis to the Pakistan’s largest e-commerce dataset.

Let’s set the working directory

import os
os.chdir(‘YOURPATH’)
os. getcwd()

and import standard libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import opendatasets as od
import os

import plotly.express as px

%matplotlib inline

import matplotlib.style as style

style.use(‘seaborn-poster’)
style.use(“fivethirtyeight”)
plt.rcParams[‘font.family’] = ‘serif’
sns.set(rc={‘axes.facecolor’:’black’, ‘figure.facecolor’:’black’, ‘axes.grid’ : True})

Let’s read the input dataset

ecommerce = pd.read_csv(‘Pakistan Largest Ecommerce Dataset.csv’,low_memory=False,
comment=’,’ )

ecommerce.shape

(584524, 26)

#ecommerce.head()

ecommerce = ecommerce.dropna(how=’all’, axis=1)

ecommerce.describe().T

Descriptive statistics of input data

#########Order Status

total = ecommerce[ecommerce.grand_total < 0].status.value_counts().to_dict()

total_df = pd.DataFrame(list(total.items()), columns=[‘Order Status’, ‘Counts’])

plt.figure(figsize=(20,10))
ax = sns.barplot(data=total_df, x=’Order Status’, y=’Counts’, palette=’bright’)
ax.bar_label(container = ax.containers[0], padding = 0, fontsize = 22, color=’white’)
plt.xticks( fontsize=20)
plt.yticks( fontsize=20)
plt.xlabel(‘Order Status’, fontsize=24, labelpad=24, color=’white’)
plt.ylabel(‘Counts’, fontsize=24, labelpad=24, color=’white’);
ax.tick_params(axis=’x’, colors=’white’)
ax.tick_params(axis=’y’, colors=’white’);

Order Status

#########Data Editing

ecommerce = ecommerce[ecommerce.grand_total > 0]

ecommerce.status = ecommerce.status.replace({‘complete’: ‘Completed’,
‘received’: ‘Completed’,
‘cod’: ‘Completed’,
‘paid’: ‘Completed’,
‘closed’: ‘Completed’,
‘exchange’: ‘Completed’,
‘canceled’: ‘Canceled’,
‘order_refunded’: ‘Canceled’,
‘refund’: ‘Canceled’,
‘fraud’: ‘Canceled’,
‘payment_review’: ‘Pending’,
‘pending’: ‘Pending’,
‘processing’: ‘Pending’,
‘holded’: ‘Pending’,
‘pending_paypal’: ‘Pending’})

ecommerce = ecommerce.drop([‘created_at’, ‘sku’, ‘sales_commission_code’, ‘Customer Since’, ‘M-Y’], axis=1)

ecommerce = ecommerce.dropna()

##########Number of Orders: One Order vs Orders > 1

number_of_orders = ecommerce.groupby(‘Customer ID’)[‘increment_id’].nunique().sort_values(ascending=False)

number_of_orders_df = pd.DataFrame(list(number_of_orders.items()), columns=[‘Customer ID’, ‘Number of Orders’])

a = number_of_orders_df[number_of_orders_df[‘Number of Orders’] == 1].value_counts().sum()
b = number_of_orders_df[number_of_orders_df[‘Number of Orders’] != 1].value_counts().sum()

data = {‘Order’: [‘One Order’, ‘More than One Order’], ‘Customer_Counts’: [a, b]}

order_counts = pd.DataFrame.from_dict(data)

fig = px.pie(order_counts,
values = order_counts.Customer_Counts,
names = order_counts.Order,
template = ‘plotly_dark’)
fig.update_traces(textposition=’inside’, textinfo=’percent+label’, textfont_size=20,
marker = dict(line = dict(color = ‘white’, width = 6)))
fig.show()

Pie Chart: One Order, More than One Order

########Number of Products: One Category, More than One Category

number_of_prod = ecommerce.groupby(‘Customer ID’)[‘category_name_1’].nunique().sort_values(ascending=False)

number_of_prod_df = pd.DataFrame(list(number_of_prod.items()), columns=[‘Customer ID’, ‘Number of Products’])

a = number_of_prod_df[number_of_prod_df[‘Number of Products’] == 1].value_counts().sum()
b = number_of_prod_df[number_of_prod_df[‘Number of Products’] != 1].value_counts().sum()

data = {‘Order’: [‘One Category’, ‘More than One Category’], ‘Customer_Counts’: [a, b]}

category_counts = pd.DataFrame.from_dict(data)

fig = px.pie(category_counts,
values = category_counts.Customer_Counts,
names = category_counts.Order,
template = ‘plotly_dark’)
fig.update_traces(textposition=’inside’, textinfo=’percent+label’, textfont_size=20,
marker = dict(line = dict(color = ‘white’, width = 6)))
fig.show()

Number of Categories

######Canceled Orders

a = ecommerce[‘Customer ID’].nunique() + canceled [‘Customer ID’].nunique()
b = canceled [‘Customer ID’].nunique()

data = {‘Customers’: [‘Total Customers’, ‘Those Who Canceled’], ‘Customer_Counts’: [a, b]}

customer_counts = pd.DataFrame.from_dict(data)

c = ecommerce[‘increment_id’].nunique() + canceled [‘increment_id’].nunique()
d = canceled [‘increment_id’].nunique()

data = {‘Orders’: [‘Total Orders’, ‘Canceled Orders’], ‘Order_Counts’: [c, d]}

order_counts = pd.DataFrame.from_dict(data)

sns.set(rc={‘axes.facecolor’:’black’, ‘figure.facecolor’:’black’, ‘axes.grid’ : True})

fig, ax = plt.subplots(1,2, figsize = (20,10))
sns.barplot(ax = ax[0], data = customer_counts, x=customer_counts.Customers, y=customer_counts.Customer_Counts, palette= ‘bright’)
sns.barplot(ax = ax[1], data = order_counts, x=order_counts.Orders, y=order_counts.Order_Counts, palette= ‘bright’)

ax[0].set_title(“Customers Who Canceled”, fontsize = 26, pad = 30, color=’red’, fontweight=’bold’)
ax[0].set_xlabel(“Customers”, fontsize = 24, labelpad = 15, color=’white’)
ax[0].set_ylabel(“Counts”, fontsize = 24, labelpad = 15, color=’white’)
ax[0].tick_params(axis=’x’, colors=’white’, labelsize=20)
ax[0].tick_params(axis=’y’, colors=’white’, labelsize=20)

ax[1].set_title(“Canceled Orders”, fontsize = 28, pad = 30, color=’red’, fontweight=’bold’)
ax[1].set_xlabel(“Orders”, fontsize = 24, labelpad = 15, color=’white’)
ax[1].set_ylabel(“Counts”, fontsize = 24, labelpad = 15, color=’white’)
ax[1].tick_params(axis=’x’, colors=’white’, labelsize=20)
ax[1].tick_params(axis=’y’, colors=’white’, labelsize=20);

plt.tight_layout(pad=2);

Canceled Orders

#######Cancelled Categories

canceled = canceled.dropna()
fig = px.treemap(canceled,
path=[‘category_name_1′], template=’plotly_dark’)
fig.update_traces(textfont_color=’yellow’,textfont_size=16, selector=dict(type=’treemap’))
fig.show()

Cancelled Categories treemap

######Converting Date column into correct data type which is datetime

data.Date = pd.to_datetime(data.Date)

ecommerce[‘Working Date’] = pd.to_datetime(ecommerce[‘Working Date’])

ecommerce.info()

<class 'pandas.core.frame.DataFrame'>
Index: 308167 entries, 0 to 584523
Data columns (total 18 columns):
 #   Column           Non-Null Count   Dtype         
---  ------           --------------   -----         
 0   item_id          308167 non-null  int64         
 1   status           308167 non-null  object        
 2   price            308167 non-null  float64       
 3   qty_ordered      308167 non-null  int64         
 4   grand_total      308167 non-null  float64       
 5   increment_id     308167 non-null  object        
 6   category_name_1  308167 non-null  object        
 7   discount_amount  308167 non-null  float64       
 8   payment_method   308167 non-null  object        
 9   Working Date     308167 non-null  datetime64[ns]
 10  BI Status        308167 non-null  object        
 11   MV              308167 non-null  object        
 12  Year             308167 non-null  int64         
 13  Month            308167 non-null  int64         
 14  FY               308167 non-null  object        
 15  Customer ID      308167 non-null  float64       
 16  Frequency        308167 non-null  int64         
 17  Monetary         308167 non-null  float64       
dtypes: datetime64[ns](1), float64(5), int64(5), object(7)
memory usage: 44.7+ MB

########Recency Frequency Monetary Columns

ref_date = ecommerce[‘Working Date’].max()
ecommerce[‘Date’] = ecommerce[‘Working Date’].apply(lambda x: x.date())
ecommerce[‘Most_Recent_Purchase’] = ecommerce.groupby(‘Customer ID’)[‘Date’].transform(max)

ecommerce[‘Recency’] = ref_date.date() – ecommerce.Most_Recent_Purchase

ecommerce = ecommerce.drop(‘Most_Recent_Purchase’, axis=1)

ecommerce[‘Frequency’] = ecommerce.groupby(‘Customer ID’)[‘increment_id’].transform(‘nunique’)

ecommerce[‘Monetary’] = ecommerce.groupby(‘Customer ID’)[‘grand_total’].transform(sum)

RFM_df = ecommerce[[‘Customer ID’, ‘Recency’, ‘Frequency’, ‘Monetary’]].drop_duplicates()
#RFM_df.head()

####RFM Quantiles & Total Score

RFM_quantiles = RFM_df[[‘Recency’, ‘Frequency’, ‘Monetary’]].quantile(q = [0.25, 0.5, 0.75])

RFM_df.loc[(RFM_df[‘Recency’] <= RFM_quantiles.Recency[0.25]), ‘Rscore’] = 4.0 RFM_df.loc[(RFM_df[‘Recency’] > RFM_quantiles.Recency[0.25]) & (RFM_df.Recency <= RFM_quantiles.Recency[0.5]), ‘Rscore’] = 3.0 RFM_df.loc[(RFM_df[‘Recency’] > RFM_quantiles.Recency[0.5]) & (RFM_df.Recency <= RFM_quantiles.Recency[0.75]), ‘Rscore’] = 2.0 RFM_df.loc[(RFM_df[‘Recency’] > RFM_quantiles.Recency[0.75]), ‘Rscore’] = 1.0

RFM_df.loc[(RFM_df[‘Frequency’] <= RFM_quantiles.Frequency[0.25]), ‘Fscore’] = 1.0 RFM_df.loc[(RFM_df[‘Frequency’] > RFM_quantiles.Frequency[0.25]) & (RFM_df[‘Frequency’] <= RFM_quantiles.Frequency[0.5]), ‘Fscore’] = 2.0 RFM_df.loc[(RFM_df[‘Frequency’] > RFM_quantiles.Frequency[0.5]) & (RFM_df[‘Frequency’] <= RFM_quantiles.Frequency[0.75]), ‘Fscore’] = 3.0 RFM_df.loc[(RFM_df[‘Frequency’] > RFM_quantiles.Frequency[0.75]), ‘Fscore’] = 4.0

RFM_df.loc[(RFM_df[‘Monetary’] <= RFM_quantiles.Monetary[0.25]), ‘Mscore’] = 1.0 RFM_df.loc[(RFM_df[‘Monetary’] > RFM_quantiles.Monetary[0.25]) & (RFM_df[‘Monetary’] <= RFM_quantiles.Monetary[0.5]), ‘Mscore’] = 2.0 RFM_df.loc[(RFM_df[‘Monetary’] > RFM_quantiles.Monetary[0.5]) & (RFM_df[‘Monetary’] <= RFM_quantiles.Monetary[0.75]), ‘Mscore’] = 3.0 RFM_df.loc[(RFM_df[‘Monetary’] > RFM_quantiles.Monetary[0.75]), ‘Mscore’] = 4.0

RFM_df[‘RFM_score’] = RFM_df.Rscore + RFM_df.Fscore + RFM_df.Mscore

RFM_df.loc[(RFM_df[‘RFM_score’] == 12.0), ‘Cluster’] = ‘Champions’
RFM_df.loc[(RFM_df[‘RFM_score’] == 11.0), ‘Cluster’] = ‘Loyal Customers’
RFM_df.loc[(RFM_df[‘RFM_score’] >= 9.0) & (RFM_df[‘RFM_score’] <= 10.0), ‘Cluster’] = ‘Promising Customers’ RFM_df.loc[(RFM_df[‘RFM_score’] >= 7.0) & (RFM_df[‘RFM_score’] <= 8.0), ‘Cluster’] = ‘At Risk’ RFM_df.loc[(RFM_df[‘RFM_score’] >= 4.0) & (RFM_df[‘RFM_score’] <= 6.0), ‘Cluster’] = ‘Hibernating’
RFM_df.loc[(RFM_df[‘RFM_score’] == 3.0), ‘Cluster’] = ‘Lost Customers’

######Customer Segments/Clusters vs Number of Purchases

frequency = RFM_df.groupby(‘Cluster’)[‘Frequency’].sum()
frequency_df = pd.DataFrame(list(frequency.items()), columns=[‘Cluster’, ‘Number_of_Purchases’])

plt.figure(figsize=(20,10))

ax = sns.barplot(data=frequency_df, x=’Cluster’, y=’Number_of_Purchases’, palette=’bright’)
ax.bar_label(container = ax.containers[0], padding = 0, fontsize = 22, color=’white’)
plt.xticks( fontsize=20)
plt.yticks( fontsize=20)
plt.xlabel(‘Clusters’, fontsize=30, labelpad=24, color=’white’)
plt.ylabel(‘No. of Purchases’, fontsize=30, labelpad=24, color=’white’);
ax.tick_params(axis=’x’, colors=’white’)
ax.tick_params(axis=’y’, colors=’white’);

Customer Segments/Clusters

######Customer Segments/Clusters vs Spendings

frequency = RFM_df.groupby(‘Cluster’)[‘Monetary’].mean()
frequency_df = pd.DataFrame(list(frequency.items()), columns=[‘Cluster’, ‘Spendings’])

plt.figure(figsize=(20,10))

ax = sns.barplot(data=frequency_df, x=’Cluster’, y=’Spendings’, palette=’bright’)
ax.bar_label(container = ax.containers[0], padding = 0, fontsize = 22, color=’white’)
plt.xticks( fontsize=20)
plt.yticks( fontsize=20)
plt.xlabel(‘Clusters’, fontsize=30, labelpad=24, color=’white’)
plt.ylabel(‘Spendings’, fontsize=30, labelpad=24, color=’white’);
ax.tick_params(axis=’x’, colors=’white’)
ax.tick_params(axis=’y’, colors=’white’);

Customer Segments/Clusters vs Spendings

######Customer Segments/Clusters vs RFM Score

cluster_mapping = pd.Series(RFM_df.Cluster.values, index=RFM_df[‘Customer ID’]).to_dict()

ecommerce[‘Cluster’] = ecommerce[‘Customer ID’].map(cluster_mapping)

cluster_counts = RFM_df.Cluster.value_counts()

cluster_counts_df = pd.DataFrame(list(cluster_counts.items()), columns=[‘Cluster’, ‘Number_of_Customers’])

fig = px.pie(cluster_counts_df,
values = cluster_counts_df.Number_of_Customers,
names = cluster_counts_df.Cluster,
template = ‘plotly_dark’)
fig.update_traces(textposition=’inside’, textinfo=’percent+label’, textfont_size=20,
marker = dict(line = dict(color = ‘white’, width = 6)))
fig.show()

Customer Segments/Clusters vs RFM Score

fig = px.treemap(ecommerce,
path=[‘Cluster’,’category_name_1′], template=’plotly_dark’, height=1000)
fig.update_traces(textfont_color=’yellow’, textfont_size=16, selector=dict(type=’treemap’))
fig.show()

Treemap of customer segments vs Categories.

######6 Clusters vs Categories

cluster1 = ecommerce[ecommerce.Cluster == ‘Champions’].groupby(‘category_name_1’)[‘grand_total’].sum().to_frame().reset_index()
cluster1.columns = [‘Category’, ‘Grand_Total_Cluster1’]

cluster2 = ecommerce[ecommerce.Cluster == ‘Loyal Customers’].groupby(‘category_name_1’)[‘grand_total’].sum().to_frame().reset_index()
cluster2.columns = [‘Category’, ‘Grand_Total_Cluster2’]

cluster3 = ecommerce[ecommerce.Cluster == ‘Promising Customers’].groupby(‘category_name_1’)[‘grand_total’].sum().to_frame().reset_index()
cluster3.columns = [‘Category’, ‘Grand_Total_Cluster3’]

cluster4 = ecommerce[ecommerce.Cluster == ‘At Risk’].groupby(‘category_name_1’)[‘grand_total’].sum().to_frame().reset_index()
cluster4.columns = [‘Category’, ‘Grand_Total_Cluster4’]

cluster5 = ecommerce[ecommerce.Cluster == ‘Hibernating’].groupby(‘category_name_1’)[‘grand_total’].sum().to_frame().reset_index()
cluster5.columns = [‘Category’, ‘Grand_Total_Cluster5’]

cluster6 = ecommerce[ecommerce.Cluster == ‘Lost Customers’].groupby(‘category_name_1’)[‘grand_total’].sum().to_frame().reset_index()
cluster6.columns = [‘Category’, ‘Grand_Total_Cluster6’]

merged = cluster1.merge(cluster2, how=’left’).merge(cluster3, how=’left’).merge(cluster4, how=’left’).merge(cluster5, how=’left’).merge(cluster6, how=’left’)

sns.set(rc={‘axes.facecolor’:’black’, ‘figure.facecolor’:’black’, ‘axes.grid’ : True})

fig, ax = plt.subplots(3,2, figsize = (20,20))
sns.barplot(ax = ax[0,0], data = cluster1, y=cluster1.Category, x=cluster1.Grand_Total_Cluster1, palette= ‘bright’)
sns.barplot(ax = ax[0,1], data = cluster2, y=cluster2.Category, x=cluster2.Grand_Total_Cluster2, palette= ‘bright’)
sns.barplot(ax = ax[1,0], data = cluster3, y=cluster3.Category, x=cluster3.Grand_Total_Cluster3, palette= ‘bright’)
sns.barplot(ax = ax[1,1], data = cluster4, y=cluster4.Category, x=cluster4.Grand_Total_Cluster4, palette= ‘bright’)
sns.barplot(ax = ax[2,0], data = cluster5, y=cluster5.Category, x=cluster5.Grand_Total_Cluster5, palette= ‘bright’)
sns.barplot(ax = ax[2,1], data = cluster6, y=cluster6.Category, x=cluster6.Grand_Total_Cluster6, palette= ‘bright’)

ax[0,0].set_title(“Cluster 1 – Champions”, fontsize = 26, pad = 30, color=’red’, fontweight=’bold’)
ax[0,0].set_xlabel(“Amount Spent”, fontsize = 20, labelpad = 15, color=’white’)
ax[0,0].set_ylabel(“Category”, fontsize = 20, labelpad = 15, color=’white’)
ax[0,0].tick_params(axis=’x’, colors=’white’)
ax[0,0].tick_params(axis=’y’, colors=’white’)

ax[0,1].set_title(“Cluster 2 – Loyal Customers”, fontsize = 26, pad = 30, color=’red’, fontweight=’bold’)
ax[0,1].set_xlabel(“Amount Spent”, fontsize = 20, labelpad = 15, color=’white’)
ax[0,1].set_ylabel(“Category”, fontsize = 20, labelpad = 15, color=’white’)
ax[0,1].tick_params(axis=’x’, colors=’white’)
ax[0,1].tick_params(axis=’y’, colors=’white’)

ax[1,0].set_title(“Cluster 3 – Promising Customers”, fontsize = 26, pad = 30, color=’red’, fontweight=’bold’)
ax[1,0].set_xlabel(“Amount Spent”, fontsize = 20, labelpad = 15, color=’white’)
ax[1,0].set_ylabel(“Category”, fontsize = 20, labelpad = 15, color=’white’)
ax[1,0].tick_params(axis=’x’, colors=’white’)
ax[1,0].tick_params(axis=’y’, colors=’white’)

ax[1,1].set_title(“Cluster 4 – At Risk”, fontsize = 26, pad = 30, color=’red’, fontweight=’bold’)
ax[1,1].set_xlabel(“Amount Spent”, fontsize = 20, labelpad = 15, color=’white’)
ax[1,1].set_ylabel(“Category”, fontsize = 20, labelpad = 15, color=’white’)
ax[1,1].tick_params(axis=’x’, colors=’white’)
ax[1,1].tick_params(axis=’y’, colors=’white’)

ax[2,0].set_title(“Cluster 5 – Hibernating”, fontsize = 26, pad = 30, color=’red’, fontweight=’bold’)
ax[2,0].set_xlabel(“Amount Spent”, fontsize = 20, labelpad = 15, color=’white’)
ax[2,0].set_ylabel(“Category”, fontsize = 20, labelpad = 15, color=’white’)
ax[2,0].tick_params(axis=’x’, colors=’white’)
ax[2,0].tick_params(axis=’y’, colors=’white’)

ax[2,1].set_title(“Cluster 6 – Lost Customers”, fontsize = 26, pad = 30, color=’red’, fontweight=’bold’)
ax[2,1].set_xlabel(“Amount Spent”, fontsize = 20, labelpad = 15, color=’white’)
ax[2,1].set_ylabel(“Category”, fontsize = 20, labelpad = 15, color=’white’)
ax[2,1].tick_params(axis=’x’, colors=’white’)
ax[2,1].tick_params(axis=’y’, colors=’white’)

plt.tight_layout(pad=2);

Clusters 1-4 vs Categories
Clusters 5-6 vs Categories

######6 Clusters vs Payment Method

cluster1_payment = ecommerce[ecommerce.Cluster == ‘Champions’].payment_method.value_counts().to_frame().reset_index()
cluster1_payment.columns = [‘Category’, ‘Payment_Method_Cluster1’]

cluster2_payment = ecommerce[ecommerce.Cluster == ‘Loyal Customers’].payment_method.value_counts().to_frame().reset_index()
cluster2_payment.columns = [‘Category’, ‘Payment_Method_Cluster2’]

cluster3_payment = ecommerce[ecommerce.Cluster == ‘Promising Customers’].payment_method.value_counts().to_frame().reset_index()
cluster3_payment.columns = [‘Category’, ‘Payment_Method_Cluster3’]

cluster4_payment = ecommerce[ecommerce.Cluster == ‘At Risk’].payment_method.value_counts().to_frame().reset_index()
cluster4_payment.columns = [‘Category’, ‘Payment_Method_Cluster4’]

cluster5_payment = ecommerce[ecommerce.Cluster == ‘Hibernating’].payment_method.value_counts().to_frame().reset_index()
cluster5_payment.columns = [‘Category’, ‘Payment_Method_Cluster5’]

cluster6_payment = ecommerce[ecommerce.Cluster == ‘Lost Customers’].payment_method.value_counts().to_frame().reset_index()
cluster6_payment.columns = [‘Category’, ‘Payment_Method_Cluster6’]

merged_payment = cluster1_payment.merge(cluster2_payment, how=’left’).merge(cluster3_payment, how=’left’).merge(cluster4_payment, how=’left’).merge(cluster5_payment, how=’left’).merge(cluster6_payment, how=’left’)

sns.set(rc={‘axes.facecolor’:’black’, ‘figure.facecolor’:’black’, ‘axes.grid’ : True})

fig, ax = plt.subplots(3,2, figsize = (20,20))
sns.barplot(ax = ax[0,0], data = cluster1_payment, y=cluster1_payment.Category, x=cluster1_payment.Payment_Method_Cluster1, palette= ‘bright’)
sns.barplot(ax = ax[0,1], data = cluster2_payment, y=cluster2_payment.Category, x=cluster2_payment.Payment_Method_Cluster2, palette= ‘bright’)
sns.barplot(ax = ax[1,0], data = cluster3_payment, y=cluster3_payment.Category, x=cluster3_payment.Payment_Method_Cluster3, palette= ‘bright’)
sns.barplot(ax = ax[1,1], data = cluster4_payment, y=cluster4_payment.Category, x=cluster4_payment.Payment_Method_Cluster4, palette= ‘bright’)
sns.barplot(ax = ax[2,0], data = cluster5_payment, y=cluster5_payment.Category, x=cluster5_payment.Payment_Method_Cluster5, palette= ‘bright’)
sns.barplot(ax = ax[2,1], data = cluster6_payment, y=cluster6_payment.Category, x=cluster6_payment.Payment_Method_Cluster6, palette= ‘bright’)

ax[0,0].set_title(“Cluster 1 – Champions”, fontsize = 26, pad = 30, color=’red’, fontweight=’bold’)
ax[0,0].set_xlabel(“Payment Method”, fontsize = 20, labelpad = 15, color=’white’)
ax[0,0].set_ylabel(“Counts”, fontsize = 20, labelpad = 15, color=’white’)
ax[0,0].tick_params(axis=’x’, colors=’white’)
ax[0,0].tick_params(axis=’y’, colors=’white’)

ax[0,1].set_title(“Cluster 2 – Loyal Customers”, fontsize = 26, pad = 30, color=’red’, fontweight=’bold’)
ax[0,1].set_xlabel(“Payment Method”, fontsize = 20, labelpad = 15, color=’white’)
ax[0,1].set_ylabel(“Counts”, fontsize = 20, labelpad = 15, color=’white’)
ax[0,1].tick_params(axis=’x’, colors=’white’)
ax[0,1].tick_params(axis=’y’, colors=’white’)

ax[1,0].set_title(“Cluster 3 – Promising Customers”, fontsize = 26, pad = 30, color=’red’, fontweight=’bold’)
ax[1,0].set_xlabel(“Payment Method”, fontsize = 20, labelpad = 15, color=’white’)
ax[1,0].set_ylabel(“Counts”, fontsize = 20, labelpad = 15, color=’white’)
ax[1,0].tick_params(axis=’x’, colors=’white’)
ax[1,0].tick_params(axis=’y’, colors=’white’)

ax[1,1].set_title(“Cluster 4 – At Risk”, fontsize = 26, pad = 30, color=’red’, fontweight=’bold’)
ax[1,1].set_xlabel(“Payment Method”, fontsize = 20, labelpad = 15, color=’white’)
ax[1,1].set_ylabel(“Counts”, fontsize = 20, labelpad = 15, color=’white’)
ax[1,1].tick_params(axis=’x’, colors=’white’)
ax[1,1].tick_params(axis=’y’, colors=’white’)

ax[2,0].set_title(“Cluster 5 – Hibernating”, fontsize = 26, pad = 30, color=’red’, fontweight=’bold’)
ax[2,0].set_xlabel(“Payment Method”, fontsize = 20, labelpad = 15, color=’white’)
ax[2,0].set_ylabel(“Counts”, fontsize = 20, labelpad = 15, color=’white’)
ax[2,0].tick_params(axis=’x’, colors=’white’)
ax[2,0].tick_params(axis=’y’, colors=’white’)

ax[2,1].set_title(“Cluster 6 – Lost Customers”, fontsize = 26, pad = 30, color=’red’, fontweight=’bold’)
ax[2,1].set_xlabel(“Payment Method”, fontsize = 20, labelpad = 15, color=’white’)
ax[2,1].set_ylabel(“Counts”, fontsize = 20, labelpad = 15, color=’white’)
ax[2,1].tick_params(axis=’x’, colors=’white’)
ax[2,1].tick_params(axis=’y’, colors=’white’)

plt.tight_layout(pad=2);

Clusters 1-4 vs Payment method
Clusters 5-6 vs Payment Method

############# All Clusters Pending vs Completed Orders

from plotly.subplots import make_subplots
import plotly.graph_objects as go

completed_orders = ecommerce[ecommerce.status == ‘Completed’].Cluster.value_counts().to_frame().reset_index()
completed_orders.columns = [‘Cluster’, ‘Order_Counts’]

pending_orders = ecommerce[ecommerce.status == ‘Pending’].Cluster.value_counts().to_frame().reset_index()
pending_orders.columns = [‘Cluster’, ‘Order_Counts’]

fig = make_subplots(rows=1, cols=2, specs=[[{‘type’:’domain’}, {‘type’:’domain’}]], subplot_titles=[‘Completed Orders’, ‘Pending Orders’])

fig.add_trace(go.Pie(labels=completed_orders.Cluster, values=completed_orders.Order_Counts),1, 1)

fig.add_trace(go.Pie(labels=pending_orders.Cluster, values=pending_orders.Order_Counts),1, 2)

fig.update_traces(textposition=’inside’, textinfo=’percent+label’, textfont_size=20,
marker = dict(line = dict(color = ‘white’, width = 6)))

fig.update_layout(template=’plotly_dark’, height=600, width=1000)
fig.show()

All Clusters Pending vs Completed Orders

Read more here.

Conclusions

Using market segmentation, companies are able to identify their target audiences and personalize marketing campaigns more effectively. This is why market segmentation is key to staying competitive. It allows you to understand your customers, anticipate their needs, and seize growth opportunities. This powerful technique allows you to improve your decision-making, marketing efforts, and improve your company’s bottom line. 

The key to successful market segmentation remains data quality; therefore, you need to pick your data provider after doing your due diligence, ensuring that you have access to the latest industry information in accessible and easy-to-understand formats. 

The main benefits of market segmentation include: personalized, more efficient and cost-effective marketing communications, the ability to select your clientele, customer retention and loyalty, and identification of growth opportunities.

The 5-Step Roadmap:

  1. Identify the market that interests you
  2. Conduct research and analyze your market
  3. Segment the market into several parts using a proposed combination of criteria
  4. Create and test your strategy using the insights from the previous step
  5. Select a target market that will be the recipient of your strategy.

Explore More

E-Commerce Cohort Analysis in Python

E-Commerce Data Science Use-Case

E-Commerce ML/AI Classification

AI-Powered Customer Churn Prediction

Telco Customer Churn/Retention Rate ML/AI Strategies that Work!

A K-means Cluster Cohort E-Commerce

Brazilian E-Commerce Showcase

Simple E-Commerce Sales BI Analytics

Top E-Commerce Trends in Q1’23

Start Your E-Commerce with Shopify

Applications of ML/AI in HR – Predicting Employee Attrition


One-Time
Monthly
Yearly

Make a one-time donation

Make a monthly donation

Make a yearly donation

Choose an amount

$5.00
$15.00
$100.00
$5.00
$15.00
$100.00
$5.00
$15.00
$100.00

Or enter a custom amount

$

Your contribution is appreciated.

Your contribution is appreciated.

Your contribution is appreciated.

DonateDonate monthlyDonate yearly
Advertisement

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

%d bloggers like this: