Simple E-Commerce Sales BI Analytics

Good businesses learn from previous efforts and test future ideas using e-commerce analytics (ECA). ECA lets you dig deep into historical BI data and build forecasts so that you can make better-informed business decisions.

The key benefits of ECA are as follows:

  • Improve your ROI while lowering your CAC.
  • Measure the effectiveness of your marketing/sales campaigns.
  • Create a BI insight into your actual budget spend and forecast position.
  • Optimize pricing, up-sell and online inventory performance.
  • Integrate data from multiple domains and improve feature engineering.
  • Help optimally position your products and optimize the customer experience.

Let’s look at the warehouse optimization problem by analyzing Kaggle sales data of a UK online retailer.

Let’s import the libraries first

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import matplotlib
import warnings
warnings.filterwarnings("ignore")

and set the working directory YOURPATH

import os
os.chdir("YOURPATH")

Let’s read the Kaggle dataset

data = pd.read_csv("data.csv", encoding="ISO-8859-1", dtype={'CustomerID': str})

and check the data structure

data.shape

(541909, 8)


input data table

We can see that the data consists of 541909 entries and 8 features.

Let’s check the percentage of missing values in the above table

missing_percentage = data.isnull().sum() / data.shape[0] * 100
missing_percentage

InvoiceNo       0.000000
StockCode       0.000000
Description     0.268311
Quantity        0.000000
InvoiceDate     0.000000
UnitPrice       0.000000
CustomerID     24.926694
Country         0.000000
dtype: float64

We can see that 24.9% of records have an undefined CustomerID, and 0.27% of the descriptions are missing.
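As a quick sanity check, the same missing-percentage computation can be reproduced on a tiny synthetic frame (hypothetical values, not the Kaggle file):

```python
import pandas as pd

# Hypothetical mini-frame standing in for the Kaggle data
df = pd.DataFrame({
    "CustomerID": ["12345", None, "17850", None],
    "Description": ["WHITE HANGING HEART", "POSTAGE", None, "RED WOOLLY HOTTIE"],
})

# Same computation as above: per-column share of missing values, in percent
missing_percentage = df.isnull().sum() / df.shape[0] * 100
print(missing_percentage)  # CustomerID 50.0, Description 25.0
```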

Let’s look at the missing descriptions

data.loc[data.Description.isnull()].head()

missing descriptions table

Let’s check how many of these records also have a missing CustomerID

data.loc[data.Description.isnull()].CustomerID.isnull().value_counts()

True    1454
Name: CustomerID, dtype: int64

and the unit price of these records

data.loc[data.Description.isnull()].UnitPrice.value_counts()

0.0    1454
Name: UnitPrice, dtype: int64

Let’s look at the missing customers

data.loc[data.CustomerID.isnull()].head()

missing customer IDs

Let’s check the statistics of missing customers

data.loc[data.CustomerID.isnull(), ["UnitPrice", "Quantity"]].describe()

missing customers statistics UnitPrice and Quantity

We can see extreme negative min outliers in the records of missing customers.

Let’s check if we have lowercase “nan”-strings instead of actual nan-values in the descriptions

data.loc[data.Description.isnull()==False, "lowercase_descriptions"] = data.loc[
    data.Description.isnull()==False, "Description"
].apply(lambda l: l.lower())

data.lowercase_descriptions.dropna().apply(
    lambda l: np.where("nan" in l, True, False)).value_counts()

False    539724
True        731
Name: lowercase_descriptions, dtype: int64

Let’s check if we have empty “”-strings as well

data.lowercase_descriptions.dropna().apply(
    lambda l: np.where("" == l, True, False)).value_counts()

False    540455
Name: lowercase_descriptions, dtype: int64

Let’s transform “nan”-strings into nan-variables and drop all missing values

data.loc[data.lowercase_descriptions.isnull()==False, "lowercase_descriptions"] = data.loc[
    data.lowercase_descriptions.isnull()==False, "lowercase_descriptions"
].apply(lambda l: np.where("nan" in l, None, l))

data = data.loc[(data.CustomerID.isnull()==False) & (data.lowercase_descriptions.isnull()==False)].copy()

data.isnull().sum().sum()

0



So there are no missing values left in our data.

Let’s check the min/max time period

data["InvoiceDate"] = pd.to_datetime(data.InvoiceDate, cache=True)

data.InvoiceDate.max() - data.InvoiceDate.min()

Timedelta('373 days 04:24:00')

print("Datafile starts with timepoint {}".format(data.InvoiceDate.min()))
print("Datafile ends with timepoint {}".format(data.InvoiceDate.max()))

Datafile starts with timepoint 2010-12-01 08:26:00
Datafile ends with timepoint 2011-12-09 12:50:00

Let’s count all unique invoice numbers

data.InvoiceNo.nunique()

Let’s create the new feature “IsCancelled” to flag cancelled transactions, whose invoice numbers start with the capital letter “C”

data["IsCancelled"] = np.where(data.InvoiceNo.apply(lambda l: l[0]=="C"), True, False)
data.IsCancelled.value_counts() / data.shape[0] * 100

False    97.81007
True      2.18993
Name: IsCancelled, dtype: float64

We can see that 2.2% of total transactions are cancelled.
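The first-character test behind “IsCancelled” can also be written with pandas’ string accessor; a minimal sketch on hypothetical invoice numbers (synthetic values, not the real file):

```python
import numpy as np
import pandas as pd

# Hypothetical invoice numbers; cancellations start with "C" as in the dataset
invoices = pd.Series(["536365", "C536379", "536520", "C537630"])

is_cancelled = np.where(invoices.str.startswith("C"), True, False)
print(is_cancelled.mean() * 100)  # percentage of cancelled invoices: 50.0
```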

Let’s check the statistics of these transactions

data.loc[data.IsCancelled==True, ["UnitPrice", "Quantity"]].describe()

We can see that all cancelled transactions have Quantity<0 and UnitPrice>0.

Let’s drop these transactions

data = data.loc[data.IsCancelled==False].copy()
data = data.drop("IsCancelled", axis=1)



data.StockCode.nunique()

3663

So we have 3663 unique stock codes. Let’s look at the most and least common codes by plotting histograms

stockcode_counts = data.StockCode.value_counts().sort_values(ascending=False)
fig, ax = plt.subplots(2,1,figsize=(20,20))
sns.barplot(stockcode_counts.iloc[:20].index, stockcode_counts.iloc[:20].values, ax=ax[0], palette="Blues_r")
ax[0].set_title("Which stockcodes are most common?");
sns.distplot(stockcode_counts / data.shape[0] * 100, kde=False, ax=ax[1], color="red")
ax[1].set_title("How seldom are stockcodes?")
ax[1].set_xlabel("% of data with this stockcode")

most common stock codes

We can see that 85123A is the most common stock code, whereas 22457 is the least common one.

how seldom are stockcodes

Since most stock codes are quite seldom, we conclude that the retailer sells a wide variety of products without strong specialization.
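One way to put a number on this claim is a concentration ratio, i.e. the share of all transactions covered by the top codes. A sketch on hypothetical counts (for the real data, substitute `data.StockCode.value_counts()`):

```python
import pandas as pd

# Hypothetical per-stockcode transaction counts, sorted descending
counts = pd.Series([500, 300, 200, 50, 30, 20, 10, 5, 5, 5])

# Share of all transactions covered by the three most common codes
top3_share = counts.iloc[:3].sum() / counts.sum() * 100
print(round(top3_share, 1))  # 88.9
```

A low top-N share on the real series would confirm that sales are spread across many products rather than concentrated in a few bestsellers.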

Let’s count the length and the number of numeric chars in our stock codes

def count_numeric_chars(l):
    return sum(1 for c in l if c.isdigit())

data["StockCodeLength"] = data.StockCode.apply(lambda l: len(l))
data["nNumericStockCode"] = data.StockCode.apply(lambda l: count_numeric_chars(l))

fig, ax = plt.subplots(1,2,figsize=(20,10))
sns.countplot(data["StockCodeLength"], palette="Blues_r", ax=ax[0])
sns.countplot(data["nNumericStockCode"], palette="Blues_r", ax=ax[1])
ax[0].set_xlabel("Length of stockcode")
ax[1].set_xlabel("Number of numeric chars in the stockcode");

length and number of numeric chars in stock codes

It is clear that the majority of codes consist of 5 numeric chars, but there are other occurrences.

Let’s check these outliers

data.loc[data.nNumericStockCode < 5].lowercase_descriptions.value_counts()

postage                       1099
manual                         290
carriage                       133
dotcom postage                  16
bank charges                    12
pads to match all cushions       4
Name: lowercase_descriptions, dtype: int64

Let’s drop all of these occurrences:

data = data.loc[(data.nNumericStockCode == 5) & (data.StockCodeLength==5)].copy()


Let’s drop the helper columns and count the number of unique descriptions

data = data.drop(["nNumericStockCode", "StockCodeLength"], axis=1)

data.Description.nunique()

Let’s plot the histogram of product descriptions

description_counts = data.Description.value_counts().sort_values(ascending=False).iloc[0:30]
sns.barplot(description_counts.index, description_counts.values, palette="Greens_r")
plt.title("Which product descriptions are most common?");

descriptions histogram

We can see that all descriptions consist of uppercase chars. The most common descriptions confirm that the retailer sells many different kinds of products.

Let’s count the length and the number of lowercase chars if any.

def count_lower_chars(l):
    return sum(1 for c in l if c.islower())

data["DescriptionLength"] = data.Description.apply(lambda l: len(l))
data["LowCharsInDescription"] = data.Description.apply(lambda l: count_lower_chars(l))

fig, ax = plt.subplots(1,2,figsize=(20,10))
sns.countplot(data.DescriptionLength, ax=ax[0], color="Red")
sns.countplot(data.LowCharsInDescription, ax=ax[1], color="Red")

description length and lowercase chars histogram

It appears that almost all descriptions contain no lowercase chars.

Let’s plot the histogram of descriptions with lowercase chars

lowchar_counts = data.loc[data.LowCharsInDescription > 0].Description.value_counts()

sns.barplot(lowchar_counts.index, lowchar_counts.values, palette="Greens_r")

descriptions lowercase chars histogram

We can see two mixed-case descriptions: Next Day Carriage and High Resolution Image.

Let’s count the uppercase letters in each description

def count_upper_chars(l):
    return sum(1 for c in l if c.isupper())

data["UpCharsInDescription"] = data.Description.apply(lambda l: count_upper_chars(l))

data.UpCharsInDescription.describe()

count    362522.000000
mean         22.572291
std           4.354845
min           3.000000
25%          20.000000
50%          23.000000
75%          26.000000
max          32.000000
Name: UpCharsInDescription, dtype: float64

data.loc[data.UpCharsInDescription <=5].Description.value_counts()

Next Day Carriage        79
High Resolution Image     3
Name: Description, dtype: int64

Let’s drop these outliers

data = data.loc[data.UpCharsInDescription > 5].copy()

Let’s plot the histogram of DescriptionLength < 14

dlength_counts = data.loc[data.DescriptionLength < 14].Description.value_counts()

sns.barplot(dlength_counts.index, dlength_counts.values, palette="Reds_r")

descriptions with the length less than 14 histogram

Descriptions with length < 14 look valid, so we should not drop them.

Let’s count unique stock codes and descriptions in our data

data.StockCode.nunique()

data.Description.nunique()

Let’s find out why we have more descriptions than stock codes

data.groupby("StockCode").Description.nunique().sort_values(ascending=False).head(10)

23236    4
23196    4
23413    3
23244    3
23126    3
23203    3
23209    3
23366    3
23131    3
23535    3
Name: Description, dtype: int64

We still have stock codes with multiple descriptions, for example

data.loc[data.StockCode == "23244"].Description.value_counts()

Name: Description, dtype: int64

These stock codes look fine.
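If the description variants ever did become a problem, one possible cleanup (not applied in this analysis, shown on hypothetical rows) is to map every stock code to its most frequent description:

```python
import pandas as pd

# Hypothetical rows with inconsistent descriptions for the same stock code
df = pd.DataFrame({
    "StockCode": ["23244", "23244", "23244", "21844"],
    "Description": ["ROUND STORAGE TIN", "STORAGE TIN ROUND",
                    "ROUND STORAGE TIN", "RED RETROSPOT MUG"],
})

# Canonical description = the most frequent one per stock code
canonical = df.groupby("StockCode").Description.agg(lambda s: s.value_counts().idxmax())
df["Description"] = df.StockCode.map(canonical)
print(df.Description.nunique())  # 2
```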

Let’s count the number of unique customer IDs

data.CustomerID.nunique()

and find out which customers are most common

customer_counts = data.CustomerID.value_counts().sort_values(ascending=False).iloc[0:20]
sns.barplot(customer_counts.index, customer_counts.values, order=customer_counts.index, palette="Oranges_r")
plt.title("Which customers are most common?");

which customers are most common?

Let’s check the country list

data.Country.unique()

And which countries are most common?

which countries are most common?

We can see that the retailer sells predominantly within the UK, followed by many EU countries.

Let’s check the percentage of UK transactions

data.loc[data.Country=="United Kingdom"].shape[0] / data.shape[0] * 100


Let’s look at the unit price statistics

data.UnitPrice.describe()

count    362440.000000
mean          2.885355
std           4.361812
min           0.000000
25%           1.250000
50%           1.790000
75%           3.750000
max         649.500000
Name: UnitPrice, dtype: float64

We can see that some unit prices are zero.

Let’s check the unit price data content

data.loc[data.UnitPrice == 0].sort_values(by="Quantity", ascending=False).head()

unit price data table

It is unclear whether these zero unit prices are gifts to customers. Let’s drop them

data = data.loc[data.UnitPrice > 0].copy()

and plot the Log-Unit-Price histogram

sns.distplot(np.log(data.UnitPrice), bins=30, kde=False, color="green");

log-unit-price histogram

Let’s focus on lower unit prices

data = data.loc[(data.UnitPrice > 0.1) & (data.UnitPrice < 20)].copy()

and look at the quantities

data.Quantity.describe()

count    361608.000000
mean         13.024112
std         187.566510
min           1.000000
25%           2.000000
50%           6.000000
75%          12.000000
max       80995.000000
Name: Quantity, dtype: float64

by plotting the quantity histogram

fig, ax = plt.subplots(1,2,figsize=(20,10))
sns.distplot(data.Quantity, ax=ax[0], kde=False, color="Blue");
sns.distplot(np.log(data.Quantity), ax=ax[1], bins=20, kde=False, color="Blue");
ax[0].set_title("Quantity distribution")
ax[1].set_title("Log-Quantity distribution")
ax[1].set_xlabel("Natural-Log Quantity");

log-quantity distribution

Let’s exclude Quantity outliers

data = data.loc[data.Quantity < 55].copy()

and focus on daily product sales

data["Revenue"] = data.Quantity * data.UnitPrice

data["Year"] = data.InvoiceDate.dt.year
data["Quarter"] = data.InvoiceDate.dt.quarter
data["Month"] = data.InvoiceDate.dt.month
data["Week"] = data.InvoiceDate.dt.week
data["Weekday"] = data.InvoiceDate.dt.weekday
data["Day"] = data.InvoiceDate.dt.day
data["Dayofyear"] = data.InvoiceDate.dt.dayofyear
data["Date"] = pd.to_datetime(data[['Year', 'Month', 'Day']])

grouped_features = ["Date", "Year", "Quarter", "Month", "Week", "Weekday", "Dayofyear", "Day", "StockCode"]

daily_data = pd.DataFrame(data.groupby(grouped_features).Quantity.sum())
daily_data["Revenue"] = data.groupby(grouped_features).Revenue.sum()
daily_data = daily_data.reset_index()

Daily sales table

Let’s check the Quantity-Revenue statistics of our daily sales data

daily_data.loc[:, [“Quantity”, “Revenue”]].describe()

daily sales statistics

We still need to exclude outliers by keeping only the daily sales between the 1st and 99th percentiles (about 98% of the data)

low_quantity = daily_data.Quantity.quantile(0.01)
high_quantity = daily_data.Quantity.quantile(0.99)
print((low_quantity, high_quantity))

(1.0, 88.48000000001048)

low_revenue = daily_data.Revenue.quantile(0.01)
high_revenue = daily_data.Revenue.quantile(0.99)
print((low_revenue, high_revenue))

(0.78, 204.0)

samples = daily_data.shape[0]

daily_data = daily_data.loc[
    (daily_data.Quantity >= low_quantity) & (daily_data.Quantity <= high_quantity)]
daily_data = daily_data.loc[
    (daily_data.Revenue >= low_revenue) & (daily_data.Revenue <= high_revenue)]

samples – daily_data.shape[0]


That’s the amount of lost data entries.
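The two-sided quantile filter used above generalizes to any skewed series; a minimal sketch on synthetic quantities (hypothetical values, same 1%/99% cut):

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)
# Hypothetical daily quantities with a few extreme values mixed in
quantity = pd.Series(np.concatenate([rng.poisson(10, 1000), [500, 800, 1200]]))

# Keep only rows between the 1st and 99th percentiles
low, high = quantity.quantile(0.01), quantity.quantile(0.99)
clipped = quantity[(quantity >= low) & (quantity <= high)]
print(len(quantity) - len(clipped))  # rows lost to the cut
```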

Let’s look at the sales data without outliers

fig, ax = plt.subplots(1,2,figsize=(20,10))
sns.distplot(daily_data.Quantity.values, kde=True, ax=ax[0], color="Green", bins=30);
sns.distplot(np.log(daily_data.Quantity.values), kde=True, ax=ax[1], color="Green", bins=30);
ax[0].set_xlabel("Number of daily product sales");
ax[0].set_title("How many products are sold per day?");

How many products are sold per day?

We can see that the daily sales distribution is right-skewed: lower values (small product quantities) are more common. In addition, multiple modes indicate several periodic sales anomalies that should be examined separately.
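This right skew is exactly the case where a log transform helps; a sketch on synthetic lognormal quantities (hypothetical data standing in for the real sales, skewness via pandas):

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(42)
# Hypothetical right-skewed sales quantities (lognormal stands in for the real data)
sales = pd.Series(rng.lognormal(mean=2.0, sigma=0.8, size=5000))

raw_skew = sales.skew()          # strongly positive for a right-skewed series
log_skew = np.log(sales).skew()  # close to zero once the tail is compressed
print(raw_skew > 1.0, abs(log_skew) < 0.2)
```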


We have discussed a retail-analytics use case and the related BI insights for analyzing sales performance and optimizing processes. These analytics offer the following opportunities for online sales:

  • Using these insights, retailers can improve both sales-force performance and overall profitability.
  • It accounts for seasonality, executed ad campaigns, running sales-promotion events, and competitor activity that can impact sales; these insights help retailers stock their inventory accordingly.
  • It uncovers associations between items, which may arise because the products are complementary or tend to be bought together. With these insights, store owners can create a tailored assortment of products and manage margins by pairing expensive (high-margin) items with slower-selling discounted items. Identifying cross-sell and up-sell opportunities is key to driving sales.
  • Customer trend identification drives the pricing and promotion plan, allowing e-commerce players to push relevant offers to each customer at every stage of the buyer’s journey.
  • Analytics is also crucial for analysing the success factors behind a product’s sales; benchmarking these factors can go a long way toward building a successful new-product strategy.
  • It can reveal whether a competitor enjoys a unique advantage or has a gap in their product portfolio, and a BI system can draw insights from impact analysis of competitors’ pricing and promotion events.

