A K-means Cluster Cohort E-Commerce

Understanding who your customers are and what they want is a fundamental part of any successful business. Yet as a business grows, so does the customer base, and it can become increasingly challenging to create a one-size-fits-all customer profile. This is where the concept of cluster-based cohort analysis comes in. 

A Cluster-Based Customer Segmentation = Better Marketing


  1. Introduction
  2. Algorithm
  3. ETL Pipeline
  4. Interpretation
  5. Conclusions


Introduction

Cohort analysis is a strategy that groups customers into smaller clusters based on the characteristics they have in common. Customer segmentation types fall into three main categories, which together cover a variety of client personas:

  • Market-based segments: Based on observable demographic, geographic, or other firmographic traits.
  • Needs-based segments: Based on business use and needs, including frequency of service interaction. These needs are verified through market research. 
  • Value-based segments: Based on economic value, revenue, and the customer’s willingness to buy.

Cohort analysis describes the process of identifying groups (or segments) of a company’s customers that are similar in terms of one or more specific characteristics or factors. The goal of this categorization is to optimize marketing to each group, such that individual customers receive the most appropriate and relevant communications, and so as to maximize the value of each customer to your business.

The potential characteristics or factors that can be used to segment customers are nearly unlimited, but the most common (and easily accessible) include: 

  • Personal characteristics such as age, stage of life (retired, new parents, students, etc.), gender, marital status
  • Geographic factors such as location, urban/suburban/rural areas
  • Buying behavior, including purchase history (value, frequency, type of products purchased) and responses to marketing communications or social media promotions.

Airbnb Cohort Targeting Example:

  • Location
  • Adventure
  • Price
  • Vacation
  • Family
  • Students

Bottom Line:

Before you can determine which groups to focus on in your marketing, you will need to segment your market into cohorts that share certain characteristics.


Algorithm

Let’s look at cluster-based customer segmentation (CCS). It focuses on multiple possible factors at once, as identified through mathematical analysis.

Advantages: It reduces bias through the use of objective data and significantly expands the segmentation possibilities.

Challenges: It requires market research and further statistical analysis, most likely requiring third-party involvement (higher investment costs).

We address these challenges by implementing CCS using K-means clustering.

The K-means clustering algorithm and its implementation in Python consist of the following steps:
  1. Specify the number of clusters K.
  2. Initialize the centroids by shuffling the dataset and then randomly selecting K data points as centroids, without replacement.
  3. Assign each data point to its nearest centroid, then recompute each centroid as the mean of the points assigned to it.
  4. Repeat step 3 until the centroids no longer change, i.e. the assignment of data points to clusters stops changing.
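
The steps above can be sketched in plain NumPy. This is a minimal illustration, not the code used in the rest of this post (which relies on scikit-learn's KMeans with k-means++ initialization). The function name kmeans_sketch, its parameters n_iter and seed, and the toy array X_toy are all made up for this example.

```python
import numpy as np

def kmeans_sketch(X, k, n_iter=100, seed=0):
    """Plain Lloyd-style K-means; returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # Pick k distinct data points as the initial centroids (no replacement).
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    labels = None
    for _ in range(n_iter):
        # Assign every point to its nearest centroid (squared Euclidean distance).
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        new_labels = dists.argmin(axis=1)
        # Stop once the assignments no longer change.
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Move each centroid to the mean of its points (unchanged if the cluster is empty).
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:
                centroids[j] = members.mean(axis=0)
    return centroids, labels

# Two well-separated toy groups (illustrative data, not the customer file).
X_toy = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.3],
                  [5.0, 5.0], [5.2, 4.9], [4.9, 5.1]])
centroids, labels = kmeans_sketch(X_toy, k=2)
print(labels)
```

The first three points should end up in one cluster and the last three in the other; which cluster gets the label 0 depends on the random initialization.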

ETL Pipeline

Let’s load relevant libraries

# Data analysis and manipulation

import pandas as pd
import numpy as np

# Data visualization

import matplotlib.pyplot as plt
import seaborn as sns
sns.set()  # apply seaborn's default plot styling

# Data scaling, K-means and PCA

from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Notebook-only shell command; the cluster package is not used in the code shown in this post
!pip install cluster

import cluster

and set our working directory

import os

Let’s read the input customer transaction history data

data = pd.read_csv('data.csv', encoding='unicode_escape')

customer transaction history data

This data will be used later in conjunction with e-commerce CS applications.

Let’s focus on the customer personal information data

df = pd.read_csv('segmentationdata.csv', index_col=0)


 customer personal information data

Column description:

  • Sex: 0, 1 (male/female)
  • Marital status: 0, 1 (single/non-single)
  • Age: 18-76
  • Education: 0, 1, 2, 3 (other, high school, university, graduate school)
  • Income: 35832-309364 USD
  • Occupation: 0, 1, 2 (unemployed, skilled, manager)
  • Settlement size: 0, 1, 2 (small, mid-sized and big city)

Let’s check the basic statistics

descriptive statistics table

and general column info


<class 'pandas.core.frame.DataFrame'>
Int64Index: 2000 entries, 100000001 to 100002000
Data columns (total 7 columns):
 #   Column           Non-Null Count  Dtype
---  ------           --------------  -----
 0   Sex              2000 non-null   int64
 1   Marital status   2000 non-null   int64
 2   Age              2000 non-null   int64
 3   Education        2000 non-null   int64
 4   Income           2000 non-null   int64
 5   Occupation       2000 non-null   int64
 6   Settlement size  2000 non-null   int64
dtypes: int64(7)
memory usage: 125.0 KB
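
The two summaries above are the kind of output that the pandas describe() and info() methods produce. The calls themselves are not shown in this post, so the sketch below is an assumption about how they were obtained. It uses a hypothetical three-row frame, df_demo, that borrows two of the column names.

```python
import pandas as pd

# Hypothetical three-row frame with two of the post's column names.
df_demo = pd.DataFrame(
    {"Age": [22, 35, 67], "Income": [40000, 120000, 95000]},
    index=[100000001, 100000002, 100000003],
)

stats = df_demo.describe()  # count, mean, std, min, quartiles and max per column
print(stats)
df_demo.info()              # dtypes, non-null counts and memory usage
```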

Let’s plot the triangle correlation heatmap

corr = df.corr()
mask = np.triu(np.ones_like(corr, dtype=bool))  # the builtin bool replaces np.bool, which NumPy 1.24 removed
heatmap = sns.heatmap(corr, mask=mask, vmin=-1, vmax=1, annot=True, cmap='BrBG')
heatmap.set_title('Triangle Correlation Heatmap', fontdict={'fontsize': 18}, pad=16)

triangle correlation heatmap
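
To see what the mask does, here is a small sketch on a hypothetical numeric frame (df_demo and its columns a, b, c are made up). A correlation matrix is symmetric, so hiding the diagonal and the upper triangle removes only redundant cells.

```python
import numpy as np
import pandas as pd

# Hypothetical numeric frame standing in for the customer table.
df_demo = pd.DataFrame({"a": [1, 2, 3, 4], "b": [2, 4, 6, 9], "c": [4, 1, 3, 2]})
corr = df_demo.corr()

# True marks the cells to hide: the diagonal and everything above it.
mask = np.triu(np.ones_like(corr, dtype=bool))
print(mask)

# The cells that remain visible form the strictly lower triangle.
visible = corr.where(~mask)
print(visible)
```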

Let’s apply standard data scaling and subsequent K-means cluster analysis

scaler = StandardScaler()
df_std = scaler.fit_transform(df)

df_std = pd.DataFrame(data = df_std,columns = df.columns)
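
Scaling matters here because K-means relies on Euclidean distances, so a column measured in tens of thousands (income) would otherwise dominate one measured in tens (age). The sketch below uses a hypothetical array X_demo to show that StandardScaler gives every column zero mean and unit variance.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical values for two features on very different scales (age, income).
X_demo = np.array([[25, 40000], [40, 120000], [60, 95000], [33, 60000]], dtype=float)

scaled = StandardScaler().fit_transform(X_demo)

# The same result by hand: subtract the column mean and divide by the
# population standard deviation (ddof=0), which is what StandardScaler uses.
manual = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

print(np.allclose(scaled, manual))
print(scaled.mean(axis=0).round(6), scaled.std(axis=0).round(6))
```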

wcss = []
for i in range(1, 11):
    kmeans_pca = KMeans(n_clusters=i, init='k-means++', random_state=42)
    kmeans_pca.fit(df_std)
    wcss.append(kmeans_pca.inertia_)  # inertia_ is the WCSS for this value of K

Let’s plot the outcome

plt.figure(figsize=(10, 8))
plt.plot(range(1, 11), wcss, marker='o', linestyle='-.', color='red')
plt.xlabel('Number of Clusters')
plt.ylabel('WCSS')
plt.title('K-means Clustering')
plt.show()



WCSS plot

Here, WCSS (within-cluster sum of squares) is the sum of squared distances between each point and the centroid of its cluster. Plotted against K, the curve looks like an elbow: WCSS is largest at K = 1, as shown above, and decreases as the number of clusters increases.
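
As a sanity check on that definition, the sketch below computes WCSS by hand and compares it with the inertia_ attribute that scikit-learn reports. The synthetic points X_demo and the helper name wcss_by_hand are assumptions made for this example; n_init is set explicitly only to keep the behaviour the same across scikit-learn versions.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic 2-D points around three centres (illustrative only).
rng = np.random.default_rng(0)
centres = np.array([[0, 0], [6, 0], [0, 6]])
X_demo = np.vstack([c + rng.normal(scale=0.5, size=(40, 2)) for c in centres])

def wcss_by_hand(X, model):
    """Sum of squared distances from each point to its assigned centroid."""
    assigned = model.cluster_centers_[model.labels_]
    return float(((X - assigned) ** 2).sum())

values, matches = [], []
for k in range(1, 7):
    km = KMeans(n_clusters=k, init="k-means++", n_init=10, random_state=42).fit(X_demo)
    values.append(wcss_by_hand(X_demo, km))
    # scikit-learn exposes the same quantity as the inertia_ attribute.
    matches.append(bool(np.isclose(values[-1], km.inertia_)))

print(all(matches))
print([round(v, 1) for v in values])
```

On this toy data the value should drop sharply up to K = 3, the number of centres used to generate the points, and change little afterwards.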

The elbow in the graph is at the 4-cluster mark: the curve declines steeply up to that point and flattens out afterward.

Let’s apply K-means by setting n_clusters = 4

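kmeans = KMeans(n_clusters=4, init='k-means++', random_state=42)
kmeans.fit(df_std)


KMeans(n_clusters=4, random_state=42)

df_segm_kmeans = df_std.copy()
# note: the label column is added to df_std itself, so it is also part of the PCA input later on
df_std['Segment K-means'] = kmeans.labels_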

Let’s group the customers by clusters and see the average values for each variable

df_segm_analysis = df_std.groupby(['Segment K-means']).mean()

3:’career focused’})
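
The averages in df_segm_analysis are computed on the standardized columns, so they are expressed in standard deviations rather than in years or dollars. One way to get a profile in the original units is to attach the labels to the unscaled table before grouping. The sketch below is an illustration under that assumption; df_demo, the column name segment and the n_customers count are hypothetical.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical customers: two obvious groups (young/low income, older/high income).
df_demo = pd.DataFrame({
    "Age":    [22, 25, 27, 58, 61, 65],
    "Income": [40000, 45000, 42000, 150000, 160000, 155000],
})

X_std = StandardScaler().fit_transform(df_demo)
km = KMeans(n_clusters=2, init="k-means++", n_init=10, random_state=42).fit(X_std)

# Attach the labels to the unscaled frame so the means stay in years and dollars.
profile = (
    df_demo.assign(segment=km.labels_)
           .groupby("segment")
           .agg(Age=("Age", "mean"), Income=("Income", "mean"), n_customers=("Age", "size"))
)
print(profile)
```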

Let’s plot the specific content

df0_std = df_std[df_std['Segment K-means'] == 0]

x_axis = df0_std['Age']
y_axis = df0_std['Income']
plt.figure(figsize=(10, 8))
sns.scatterplot(x=x_axis, y=y_axis)  # recent seaborn versions require keyword x/y
plt.title('Segmentation K-means')

income and age cluster 0

df0_std = df_std[df_std['Segment K-means'] == 1]
x_axis = df0_std['Age']
y_axis = df0_std['Income']

#df0_std.plot.scatter(x_axis, y_axis)

plt.figure(figsize=(10, 8))
sns.scatterplot(x='Age', y='Income',
                hue='Segment K-means',
                data=df_std, palette=['g', 'r', 'c', 'm'])
plt.title('Segmentation K-means')



income and age clusters 0-3

We can see that the green segment (well off) is clearly separated, as it is the highest in both age and income, whereas the other three segments overlap.

df0_std = df_std[df_std['Segment K-means'] == 2]
x_axis = df0_std['Age']
y_axis = df0_std['Income']
plt.figure(figsize=(10, 8))
sns.scatterplot(x=x_axis, y=y_axis)
plt.title('Segmentation K-means')

income and age cluster 2

df0_std = df_std[df_std['Segment K-means'] == 3]
x_axis = df0_std['Age']
y_axis = df0_std['Income']
plt.figure(figsize=(10, 8))
sns.scatterplot(x=x_axis, y=y_axis)
plt.title('Segmentation K-means')

income and age cluster 3

df0_std = df_std[df_std['Segment K-means'] == 1]
x_axis = df0_std['Age']
y_axis = df0_std['Income']
plt.figure(figsize=(10, 8))
sns.scatterplot(x=x_axis, y=y_axis)
plt.title('Segmentation K-means')

income and age cluster 1

Let’s apply the PCA explained variance ratio method

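pca = PCA()
pca.fit(df_std)
pca.explained_variance_ratio_

array([0.31670682, 0.27749602, 0.18635691, 0.07264964, 0.05009459,
       0.0448685 , 0.03460207, 0.01722544])

Eight ratios are reported because df_std now also contains the Segment K-means column alongside the seven customer features.

exp_var_pca = pca.explained_variance_ratio_

Let’s calculate the cumulative sum of the explained variance ratios to visualize how much variance the leading principal components explain together

cum_sum_eigenvalues = np.cumsum(exp_var_pca)

This is needed to create the step plot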

plt.figure(figsize=(10, 8))
plt.bar(range(0, len(exp_var_pca)), exp_var_pca, alpha=0.5, align='center', label='Individual explained variance')
plt.step(range(0, len(cum_sum_eigenvalues)), cum_sum_eigenvalues, where='mid', label='Cumulative explained variance')
plt.ylabel('Explained variance ratio')
plt.xlabel('Principal component index')
plt.legend(loc='best')  # show the two labels defined above
plt.show()



cumulative explained variance
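
The cumulative curve can also be read off numerically. The sketch below reuses the explained variance ratios printed earlier in this post and asks how many components are needed to reach a given share of the variance. The helper n_components_for and the 0.75 target are illustrative assumptions, not part of the original analysis.

```python
import numpy as np

# Explained variance ratios as printed earlier in this post.
exp_var = np.array([0.31670682, 0.27749602, 0.18635691, 0.07264964,
                    0.05009459, 0.0448685, 0.03460207, 0.01722544])
cum_var = np.cumsum(exp_var)

def n_components_for(threshold, cumulative):
    """Smallest number of components whose cumulative ratio reaches the threshold."""
    idx = int(np.searchsorted(cumulative, threshold))
    return min(idx + 1, len(cumulative))

print(cum_var.round(3))
print(n_components_for(0.75, cum_var))  # 0.75 is an arbitrary illustrative target
```

With these ratios, the first three components together explain about 78% of the variance.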

Let’s check the first three components

pca = PCA(n_components=3)
pca.fit(df_std)
pca.components_

array([[-0.36277879, -0.23285431,  0.2028735 ,  0.01012855,  0.48812776,
         0.50055445,  0.49207793,  0.20480728],
       [ 0.19903017,  0.20918309,  0.50222575,  0.61853334,  0.21367707,
         0.04531451, -0.03116446, -0.48283818],
       [-0.46722452, -0.63600769,  0.36489191, -0.07750792, -0.15798026,
        -0.30506618, -0.21375638, -0.27262971]])

Let’s get PCA scores

pca_scores = PCA().fit_transform(df_std)
pca = PCA(svd_solver='auto', whiten=True)
pca.fit(df_std)
print(pca.components_)

[[-0.36277879 -0.23285431  0.2028735   0.01012855  0.48812776  0.50055445
   0.49207793  0.20480728]
 [ 0.19903017  0.20918309  0.50222575  0.61853334  0.21367707  0.04531451
  -0.03116446 -0.48283818]
 [-0.46722452 -0.63600769  0.36489191 -0.07750792 -0.15798026 -0.30506618
  -0.21375638 -0.27262971]
 [ 0.26565759 -0.24117282 -0.21799604 -0.33058964  0.45883995  0.35753911
  -0.48336281 -0.3774149 ]
 [-0.67566647  0.4762232  -0.09339521  0.11959267  0.10415854  0.13501925
  -0.50880933  0.07546584]
 [-0.28227197  0.17175885 -0.39235969 -0.12583693 -0.14901693  0.02734829
   0.45041949 -0.70371065]
 [-0.04528281  0.12362945 -0.08693096 -0.11089852  0.66568059 -0.71101518
   0.11670231  0.02273963]
 [-0.04065294 -0.4096063  -0.59642055  0.67792412  0.06630117 -0.04440195
  -0.04661133  0.08204538]]
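
The relation between the scores and the component matrix can be checked directly: with the default settings (whiten=False), each score is the centred row projected onto the component vectors. The sketch below uses synthetic data (X_demo and pca_demo are hypothetical names); note that the second PCA object above sets whiten=True, which additionally rescales the scores.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in for the standardized customer table (200 rows, 4 columns).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4)) @ rng.normal(size=(4, 4))

pca_demo = PCA(n_components=2).fit(X_demo)
scores = pca_demo.transform(X_demo)

# Each score is the centred row projected onto the component (loading) vectors.
manual = (X_demo - pca_demo.mean_) @ pca_demo.components_.T

print(scores.shape, pca_demo.components_.shape)
print(np.allclose(scores, manual))
```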

Let’s plot the 2-D projection of our clusters using only 2 PCs

x_new = pca_scores
def myplot(score, coeff, labels=None):
    xs = score[:, 0]
    ys = score[:, 1]
    n = coeff.shape[0]
    # rescale the scores so that the points and the loading arrows share one plot
    scalex = 1.0 / (xs.max() - xs.min())
    scaley = 1.0 / (ys.max() - ys.min())
    plt.scatter(xs * scalex, ys * scaley)
    for i in range(n):
        # one arrow per original variable, pointing along its loadings on the two PCs
        plt.arrow(0, 0, coeff[i, 0], coeff[i, 1], color='r', alpha=0.5)
        if labels is None:
            plt.text(coeff[i, 0] * 1.15, coeff[i, 1] * 1.15, "Var" + str(i + 1), color='g', ha='center', va='center')
        else:
            plt.text(coeff[i, 0] * 1.15, coeff[i, 1] * 1.15, labels[i], color='g', ha='center', va='center')

Let’s call the function and use only the 2 PCs

plt.figure(figsize = (10, 8))
myplot(x_new[:,0:2],np.transpose(pca.components_[0:2, :]))



PCA 2-D clusters


Interpretation

  • Strong correlations: education-age and income-occupation.
  • Cluster 1 has almost the same number of men and women, with an average age of 56 (the oldest age group).
  • Cluster 2: customers live almost exclusively in small cities and have the lowest annual salary.
  • Cluster 3 represents the youngest age group (29), with medium levels of education and income.
  • Cluster 4 consists of men, fewer than 20% of whom are in a relationship; they live in big or mid-sized cities and have a relatively low level of education and high levels of income and occupation.


Conclusions

The correlation matrix is a very useful tool for analyzing the relationships between features. We have used K-means together with PCA to group data points into distinct segments. The major business impact of K-means CCS is a better understanding of consumer behaviour, which in turn can be used to improve the marketing strategy.



