E-Commerce Cohort Analysis in Python

Cohort Analysis in E-Commerce Marketing

Cohort analysis (CA) is a method by which you can monitor the overall health of your marketing campaigns in terms of the customer retention rate. CA offers a unique opportunity to judge the efficiency of marketing efforts. It can also enlighten marketers as to which cohorts (i.e. groups of customers and/or contacts) are the most valuable to your brand. Here are a few good examples:

  • How long does it take subscribers to become customers? 
  • How long does it take for a customer to return?
  • Is each stage of the customer lifecycle being nurtured effectively?
  • What are the long-term purchasing habits of different demographic segments?
  • Which channels are driving the best results?
  • Do I have many seasonal shoppers?
  • Are those subscribed spending more than those unsubscribed?
  • Different stores, different results?

Bottom line: CA is the gift that keeps on giving.

Customer Transaction Showcase

Let’s apply CA ML to customer transaction data. For test purposes, we consider the KPMG Data Analytics Virtual Internship Dataset representing the 2Mb MS Excel spreadsheet:

TableTotal RowsTotal Columns
Title Sheet989
Source: Kaggle

Step 1: Data Reading and Manipulation

import pandas as pd

!pip install openpyxl

df = pd.read_excel(‘Your_path/KPMG_VI_New_raw_data_update_final.xlsx’, sheet_name = ‘Transactions’)

df.columns = df.iloc[0]
df.drop(df.index[0],inplace=True, axis = 0)

11229502017-02-25 00:00:00FalseApprovedSolexStandardmediummedium71.4953.6241245
22331202017-05-21 00:00:00TrueApprovedTrek BicyclesStandardmediumlarge2091.47388.9241701
33374022017-10-16 00:00:00FalseApprovedOHM CyclesStandardlowmedium1793.43248.8236361
448831352017-08-31 00:00:00FalseApprovedNorco BicyclesStandardmediummedium1198.46381.136145
55787872017-10-01 00:00:00TrueApprovedGiant BicyclesStandardmediumlarge1765.3709.4842226
#Get the necessary columns for Cohort Analysis

df_final = df[[‘customer_id’,’transaction_date’,’online_order’,’order_status’]]

df_final = df_final[df_final[‘order_status’] == ‘Approved’]
df_final = df_final[~df_final.duplicated()]

<class 'pandas.core.frame.DataFrame'>
Int64Index: 19761 entries, 1 to 20000
Data columns (total 4 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   customer_id       19761 non-null  object
 1   transaction_date  19761 non-null  object
 2   online_order      19407 non-null  object
 3   order_status      19761 non-null  object
dtypes: object(4)
memory usage: 771.9+ KB
#Get Transaction Month for the dataframe

import datetime as dt
def get_month(x) :
return dt.datetime(x.year, x.month,1)
df_final[‘transaction_date’] = pd.to_datetime(df[‘transaction_date’])
df_final[‘transaction_month’] = df[‘transaction_date’].apply(get_month)

Step 2: Create Cohort Month and Index

#Create Cohort Month per Rows

group = df_final.groupby(‘customer_id’)[‘transaction_month’]
df_final[‘cohort_month’] = group.transform(‘min’)

#Calculate Cohort Index for Each Rows

def get_date_int(df, column) :
year = df[column].dt.year
month = df[column].dt.month
day = df[column].dt.day

return year, month, day

transaction_year, transaction_month, transaction_day = get_date_int(df_final, ‘transaction_month’)
cohort_year, cohort_month, cohort_day = get_date_int(df_final,’cohort_month’)

#Calculate Year Differences

years_diff = transaction_year – cohort_year

#Calculate Month Differences

months_diff = transaction_month – cohort_month
df_final[‘cohort_index’] = years_diff*12 + months_diff + 1

#Final Grouping to Calculate Total Unique Users in Each Cohort

cohort_group = df_final.groupby([‘cohort_month’,’cohort_index’])
cohort_data = cohort_group[‘customer_id’].apply(pd.Series.nunique)
cohort_data = cohort_data.reset_index()
cohort_counts = cohort_data.pivot_table(index = ‘cohort_month’,
columns = ‘cohort_index’,
values = ‘customer_id’

Step 3: Generate User Retention Summary Table

#Calculate Retention rate per Month Index

cohort_size = cohort_counts.iloc[:,0]
retention = cohort_counts.divide(cohort_size, axis = 0)
retention = retention.round(3)*100
retention.index = retention.index.strftime(‘%Y-%m’)

#Plotting Heatmap for Retention Table

import matplotlib.pyplot as plt
import seaborn as sns
plt.figure(figsize = (16,10))
plt.title(‘MoM Retention Rate for Customer Transaction Data’)
sns.heatmap(retention, annot = True, cmap=”YlGnBu”, fmt=’g’)
plt.xlabel(‘Cohort Index’)
plt.ylabel(‘Cohort Month’)
plt.yticks(rotation = ‘360’)

Customer Retention Rate versus Cohort Month/Index

Thus, we have processed transaction data to calculate the customer retention rate per cohort index. Many tech companies use the above heat map to determine whether the campaign is successful or not.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

%d bloggers like this: