Brazilian E-Commerce Showcase

This is a hands-on end-to-end Brazilian e-commerce use-case with detailed Exploratory Data Analysis (EDA) steps and business action items.


  1. Business Case
  2. Objectives
  3. Input Dataset
  4. Read Data
  5. Customers
  6. Orders
  7. RFM
  8. Clusters
  9. Summary
  10. References
Brazilian E-Commerce Showcase
Brazilian ecommerce grew 18% last year and is not slowing down!

Business Case

Brazil is the largest ecommerce market in Latin America and the 13th largest worldwide. In 2019, the country registered 12.5 billion EUR in total ecommerce sales, up 16% year-on-year and last year we saw higher growth (18%) as more consumers shifted to online shopping during the coronavirus pandemic.

Global growth will continue over the next few years. The eCommerce market includes online sales of physical goods to a private end user (B2C). 

he biggest player in the Brazilian eCommerce Market is The store had a revenue of US$3.3 billion in 2021. It is followed by with US$2.6 billion revenue and with US$2.4 billion revenue. Altogether, the top three stores account for 30% of online revenue in Brazil.


The goal of this showcase is to apply the RFM Segmentation and Customer Analysis to the Brazilian E-Commerce Public Dataset by Olist. By definition, the term RFM stands for Recency, Frequency, and Monetary Value. It focuses on the lifetime value of customers, and it’s the preferred customer segmentation methodology for eCommerce businesses that focus on retention strategies more than on client acquisition.

RFM segmentation is a three-fold customer segmentation approach:

  1. Recency (R): When was the customer’s most recent transaction?
  2. Frequency (F): How often does the customer transact?
  3. Monetary (M): What is the size of the customer’s transaction?

The end goal is to answer the following question: what would be an optimal CRM campaign strategy based on the given dataset?

Input Dataset

This is a Brazilian e-commerce public dataset of orders made at Olist Store. The dataset has information of 100k orders from 2016 to 2018 made at multiple marketplaces in Brazil. Its features allows viewing an order from multiple dimensions: from order status, price, payment and freight performance to customer location, product attributes and finally reviews written by customers. We also released a geolocation dataset that relates Brazilian zip codes to lat/lng coordinates.

This is the real commercial dataset, but it has been anonymised, and references to the companies and partners in the review text have been replaced with the names of Game of Thrones great houses.

Read Data

Let’s set the working directory

import os
data_dir = “YOURPATH”
os.makedirs(data_dir, exist_ok=True)

and import all relevant libraries

import pandas as pd
import urllib
import json
import unidecode
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
import datetime
import numpy as np
from statsmodels.tsa.seasonal import seasonal_decompose
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

Let’s read the data

customers = pd.read_csv(‘olist_customers_dataset.csv’)
geolocation = pd.read_csv(‘olist_geolocation_dataset.csv’)
orders = pd.read_csv(‘olist_orders_dataset.csv’)
order_items = pd.read_csv(‘olist_order_items_dataset.csv’)

and check the content



Let’s check if there are null values for customers


customer_id                 0.0
customer_unique_id          0.0
customer_zip_code_prefix    0.0
customer_city               0.0
customer_state              0.0
dtype: float64

Let’s group the data by customer_unique_id


8d50f5eadf50201ccdcedfb9e2ac8455    17
3e43e6105506432c953e165fb2acf44c     9
6469f99c1f9dfae7733b25662e7f1782     7
ca77025e7201e3b30c44b472ff346268     7
1b6c7548a2a1f9037c1fd3ddfed95f33     7
5657dfebff5868c4dc7e8355fea865c4     1
5657596addb4d7b07b32cd330614bdf8     1
5656eb169546146caeab56c3ffc3d268     1
5656a8fabc8629ff96b2bc14f8c09a27     1
ffffd2657e2aad2907e67c3e9daecbeb     1
Length: 96096, dtype: int64

Let’s check the content of geolocation

Let’s see if there are null values


geolocation_zip_code_prefix    0.0
geolocation_lat                0.0
geolocation_lng                0.0
geolocation_city               0.0
geolocation_state              0.0
dtype: float64

Let’s check duplicates in geolocation_city


array(['sao paulo', 'são paulo', 'sao bernardo do campo', ..., 'ciríaco',
       'estação', 'vila lângaro'], dtype=object)

Let’s modify geolocation strings as follows:

def pretty_string(column):
column_space = ‘ ‘.join(column.split())
return unidecode.unidecode(column_space.lower())

geolocation[‘geolocation_city’] = geolocation[‘geolocation_city’].apply(pretty_string)


24220    1146
24230    1102
38400     965
35500     907
11680     879
20056       1
76370       1
63012       1
76372       1
32635       1
Length: 19015, dtype: int64

Let’s check a couple of zipcodes

geolocation[geolocation[‘geolocation_zip_code_prefix’] == 24220].head()

geolocation[geolocation[‘geolocation_zip_code_prefix’] == 11680].head()

Let’s group our data by geolocation_zip_code_prefix

other_state_geolocation = geolocation.groupby([‘geolocation_zip_code_prefix’])[‘geolocation_state’].nunique().reset_index(name=’count’)
other_state_geolocation[other_state_geolocation[‘count’]>= 2].shape
max_state = geolocation.groupby([‘geolocation_zip_code_prefix’,’geolocation_state’]).size().reset_index(name=’count’).drop_duplicates(subset = ‘geolocation_zip_code_prefix’).drop(‘count’,axis=1)

geolocation_silver = geolocation.groupby([‘geolocation_zip_code_prefix’,’geolocation_city’,’geolocation_state’])[[‘geolocation_lat’,’geolocation_lng’]].median().reset_index()
geolocation_silver = geolocation_silver.merge(max_state,on=[‘geolocation_zip_code_prefix’,’geolocation_state’],how=’inner’)

customers_silver = customers.merge(geolocation_silver,left_on=’customer_zip_code_prefix’,right_on=’geolocation_zip_code_prefix’,how=’inner’)

Let’s plot all available geolocations

def plot_brasil_map(data):
brazil = mpimg.imread(urllib.request.urlopen(‘’),’jpg’)
ax = data.plot(kind=”scatter”, x=”geolocation_lng”, y=”geolocation_lat”, figsize=(10,10), alpha=0.3,s=0.3,c=’blue’)
plt.imshow(brazil, extent=[-73.98283055, -33.8,-33.75116944,5.4])



Let’s look at the content of


and see if there are null values


order_id               0.0
order_item_id          0.0
product_id             0.0
seller_id              0.0
shipping_limit_date    0.0
price                  0.0
freight_value          0.0
dtype: float64

Let’s group our data by order_id


8272b63d03f5f79c56e9e4120aec44ef    21
1b15974a0141d54e36626dca3fdc731a    20
ab14fdcfbe524636d65ee38360e22ce8    20
9ef13efd6949e4573a18964dd1bbe7f5    15
428a2f660dc84138d969ccd69a0ab6d5    15
5a0911d70c1f85d3bed0df1bf693a6dd     1
5a082b558a3798d3e36d93bfa8ca1eae     1
5a07264682e0b8fbb3f166edbbffc6e8     1
5a071192a28951b76774e5a760c8c9b7     1
fffe41c64501cc87c801fd61db3f6244     1
Length: 98666, dtype: int64

Since we have more than 1 product per order, we need to sum the price and the shipping value and get the maximum value of the shipping_limit_date for our analysis

order_items_silver = order_items.groupby(‘order_id’).agg({‘price’:sum,’freight_value’:sum,’shipping_limit_date’:max }).reset_index()

Let’s plot the result


The plot shows some outliers as price>5000 and freight_value>500.

Let’s check the statistics of order_items_silver

Let’s exclude outliers

percentil_freight_value = order_items_silver[‘freight_value’].quantile(0.99)
order_items_silver = order_items_silver[(order_items_silver[‘price’] <= 5000) & (order_items_silver[‘freight_value’] <= percentil_freight_value)]

and check our statistics again


Let’s check the content


as the table 97679 rows × 4 columns

and check the info

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 99441 entries, 0 to 99440
Data columns (total 8 columns):
 #   Column                         Non-Null Count  Dtype 
---  ------                         --------------  ----- 
 0   order_id                       99441 non-null  object
 1   customer_id                    99441 non-null  object
 2   order_status                   99441 non-null  object
 3   order_purchase_timestamp       99441 non-null  object
 4   order_approved_at              99281 non-null  object
 5   order_delivered_carrier_date   97658 non-null  object
 6   order_delivered_customer_date  96476 non-null  object
 7   order_estimated_delivery_date  99441 non-null  object
dtypes: object(8)
memory usage: 6.1+ MB

Let’s create the column


as follows:

columns_timestamp = [‘order_purchase_timestamp’,’order_approved_at’,
‘order_delivered_carrier_date’, ‘order_delivered_customer_date’, ‘order_estimated_delivery_date’]
for column in columns_timestamp:
orders[column] = pd.to_datetime(orders[column])

orders[‘diff_delivery_days’] = (orders[‘order_estimated_delivery_date’] – orders[‘order_delivered_customer_date’]).dt.days

Let’s group the data

orders.groupby(orders[‘diff_delivery_days’] < 0).size()

False    91614
True      7827
dtype: int64

Only 8,54% of orders is above of estimated delivery.

Let’s print statistics and plot the histogram of diff_delivery_days

sns.histplot(x=’diff_delivery_days’, data=orders, kde=True,binwidth=5, color=”purple”,shrink=.8)
plt.xlim(-50, 50)

Let’s exclude min/max values

orders[(orders[‘diff_delivery_days’] > min(orders[‘diff_delivery_days’])) & (orders[‘diff_delivery_days’] < max(orders[‘diff_delivery_days’])) ]

as the 96474 rows × 9 columns table

Let’s perform the inner merge of orders based on order_id

orders_silver = orders.merge(order_items_silver,on=’order_id’,how=’inner’)

Let’s define the plotting function

def plot_ts_decompose(decompose,figsize=(12,8)):
fig, ax = plt.subplots(4,1,figsize=figsize)
sns.lineplot(data=decompose.observed,x=decompose.observed.index,y=decompose.observed,ax=ax[0],color=’blue’, linewidth=2.5)
sns.lineplot(data=decompose.trend,x=decompose.trend.index,y=decompose.trend,ax=ax[1],color=’red’, linewidth=2.5)
sns.lineplot(data=decompose.seasonal,x=decompose.seasonal.index,y=decompose.seasonal,ax=ax[2],color=’green’, linewidth=2.5)
sns.lineplot(data=decompose.resid,x=decompose.resid.index,y=decompose.resid,ax=ax[3],color=’black’, linewidth=2.5)

to plot the decomposed time stamps as observed, trend, seasonal and residual

orders_silver[‘month_year_purchase’] = orders_silver[‘order_purchase_timestamp’].dt.to_period(‘M’)
order_purchase_timestamp = orders_silver.groupby(‘month_year_purchase’).size()
order_purchase_timestamp.index = order_purchase_timestamp.index.astype(‘datetime64[ns]’)

decompose = seasonal_decompose(order_purchase_timestamp,model=’additive’,period=12, extrapolate_trend=12)


Let’s create rfm_data as follows:

orders_customers = customers.merge(orders_silver, on=’customer_id’, how=’inner’)

max_date = max(orders_customers[‘order_purchase_timestamp’]) + datetime.timedelta(days=1)

rfm_data = orders_customers.groupby(‘customer_unique_id’).agg({
‘order_purchase_timestamp’: lambda x: (max_date – x.max()).days,
rfm_data.columns =[‘customer_id’,’recency’,’frequency’,’monetary’]

Let’s print statistics and plot the histogram of frequency

print(rfm_data[rfm_data[‘frequency’]>1].shape[0] / rfm_data.shape[0])
sns.histplot(x=’frequency’, data=rfm_data, kde=True,color=”purple”,shrink=.8,binwidth=1)

count    94488.000000
mean         1.033771
std          0.210110
min          1.000000
25%          1.000000
50%          1.000000
75%          1.000000
max         16.000000
Name: frequency, dtype: float64

Similarly, let’s print statistics and plot the histogram of monetary

sns.histplot(x=’monetary’, data=rfm_data, kde=True,color=”purple”,shrink=.8,binwidth=10)
plt.xlim(0, 750)
plt.ylim(0, 8000)

count    94488.000000
mean       136.506113
std        190.946953
min          0.850000
25%         47.000000
50%         89.000000
75%        150.000000
max       4690.000000
Name: monetary, dtype: float64

Let’s look at recency

sns.histplot(x=’recency’, data=rfm_data, kde=True,color=”purple”)

count    94488.000000
mean       243.851198
std        153.165787
min          1.000000
25%        120.000000
50%        224.000000
75%        353.000000
max        729.000000
Name: recency, dtype: float64

As we can see the 25% of customers have a recency of 3.9 months with an average of 8 months. With a purchase frequency equivalent to 1 and with this recency this indicates that customers make very specific purchases.

The above RFM plots can be used to define the customer score that ranges from 1 to 5, where the higher this number, the better. This score is assigned for each acronym independently:

  • The more recent the customer’s purchase the higher the Recency (R) score
  • The more purchases the customer makes, the higher the Frequency score (F)
  • The more the customer spends on purchases, the higher the score the customer will have Monetarity(M).

This definition of each score can be given through quintile.


We use the K-means cluster analysis to group our data according to their similar characteristics (aka cohorts)

def k_means_group(data, n_clusters, random_state, asc=False, log_transf=False, standard_tranf=False):

data_temp = data.copy()
if log_transf:
    data_temp = np.log(data_temp) + 1

if standard_tranf:
    scaler = StandardScaler()
    scaler =
    data_temp = scaler.transform(data_temp)

kmeans_sel = KMeans(n_clusters=n_clusters, random_state=random_state).fit(data_temp)

cluster_group = data.assign(cluster = kmeans_sel.labels_)

mean_group = cluster_group.groupby('cluster').mean().reset_index()
mean_group = mean_group.sort_values(by=mean_group.columns[1],ascending=asc)
mean_group['cluster_set'] = [i for i in range(n_clusters, 0, -1) ]

cluster_map = mean_group.set_index('cluster').to_dict()['cluster_set']

return cluster_group['cluster'].map(cluster_map)

Let’s introduce the RFM labels

r_labels = k_means_group(rfm_data[[‘recency’]],6,1,asc=True)
f_labels = k_means_group(rfm_data[[‘frequency’]],6,1)
m_labels = k_means_group(rfm_data[[‘monetary’]],6,1)

rfm_data = rfm_data.assign(R = r_labels, F = f_labels, M = m_labels)

rfm_data[‘R’] = rfm_data[‘R’] – 1
rfm_data[‘R’] = rfm_data[‘F’] – 1
rfm_data[‘R’] = rfm_data[‘M’] – 1

to perform RFM grouping




Let’s define our data segments baed on the customer scores

def get_segment(data):
mean_fm = (data[‘F’] + data[‘M’]) / 2

if (data['R'] >= 4 and data['R'] <= 5) and (mean_fm >= 4 and mean_fm <= 5):
    return 'Champions'
if (data['R'] >= 2 and data['R'] <= 5) and (mean_fm >= 3 and mean_fm <= 5):
    return 'Loyal Customers'
if (data['R'] >= 3 and data['R'] <= 5) and (mean_fm >= 1 and mean_fm <= 3):
    return 'Potential Loyslist'
if (data['R'] >= 4 and data['R'] <= 5) and (mean_fm >= 0 and mean_fm <= 1):
    return 'New Customers'
if (data['R'] >= 3 and data['R'] <= 4) and (mean_fm >= 0 and mean_fm <= 1):
    return 'Promising'
if (data['R'] >= 2 and data['R'] <= 3) and (mean_fm >= 2 and mean_fm <= 3):
    return 'Customer Needing Attention'
if (data['R'] >= 2 and data['R'] <= 3) and (mean_fm >= 0 and mean_fm <= 2):
    return 'About to Sleep'
if (data['R'] >= 0 and data['R'] <= 2) and (mean_fm >= 2 and mean_fm <= 5):
    return 'At Risk'
if (data['R'] >= 0 and data['R'] <= 1) and (mean_fm >= 4 and mean_fm <= 5):
    return "Can't Lose Then"
if (data['R'] >= 1 and data['R'] <= 2) and (mean_fm >= 1 and mean_fm <= 2):
    return 'Hibernating'
return 'Lost'

rfm_data[‘segment’] = rfm_data.apply(get_segment,axis=1)

Let’s plot the result

percentage = (rfm_data[‘segment’].value_counts(normalize=True)* 100).reset_index(name=’percentage’)
g = sns.barplot(x=percentage[‘percentage’],y=percentage[‘index’], data=percentage,palette=”GnBu_d”)
sns.despine(bottom = True, left = True)
for i, v in enumerate(percentage[‘percentage’]):
g.text(v,i+0.20,” {:.2f}”.format(v)+”%”, color=’black’, ha=”left”)


NoCohortPercentageCRM Strategy
Low frequency of purchase, recency and spending.
58.61* Reviving interest with outreach campaigns
These are customers who have bought a long time ago, only a few times and have spent little
28.23* Standard communication for sending offers;
* Offer relevant products and good deals.
3Need Attention
These are customers who have recently purchased, however are still in doubt whether they will make their next purchase from the company or a competitor
8.29* Promotional campaigns for a limited time;
* Product recommendations based on their behavior;
* Show the importance of buying with the company.
4Potential Customer
These are recent buyers, spend a good amount and have bought more than once
2.31* Offer a loyalty program;
* Keep them engaged;
* Personalized and other product recommendations.
5Customer At Risk
These are customers who have spent very little money and buy frequently, but have not bought for a long time
1.32* Send personalized communications and other messages to reconnect;
* Offer good deals.
6Loyal Customer
These are customers who spend well and often
1.21* Personalized communication;
* Avoid mass mailing of offers;
* Offer few products, but present products that they are likely to be interested in;
* Ask for product reviews.
These are customers who have bought recently, buy often, and spend a lot.
0.03* Special offers, products and discounts for these customers so they feel valued;
* Ask for reviews and feedbacks constantly;
* Avoid sending massive amounts of offers;
* Personalized communication;
* Give rewards/gifts.


Kaggle RFM Segmentation and Customer Analysis

Matt Clarke (2021) How to assign RFM scores with quantile-based discretization

Paulo Vasconsellos (2017) O que é RFM e como aplica-lo ao seu time de Customer Success

Leif Arne Bakker (2020)  Know your customers with RFM.


Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

%d bloggers like this: