A Comparison of Automated EDA Tools in Python: Pandas-Profiling vs SweetViz
- Exploratory Data Analysis (EDA) is a crucial step in any data science project.
- Its primary goals include discovering patterns, identifying anomalies, and finding relationships between variables.
- EDA can be broken down into different types of descriptive statistical analysis, such as univariate, bivariate, and multivariate analysis.
- Automation can be used to conduct EDA, allowing for faster and more efficient analysis of data.
- Data scientists can utilize automated EDA tools to accelerate the time-consuming data preparation phase, minimize ETL errors, and gain a more complete, end-to-end understanding of the input data.
- This post will compare some popular Python libraries to automate EDA. The goal is to examine every aspect of the EDA procedure, including data cleaning, data visualization, and statistical analysis. Specifically, we will compare the types and quality of visualizations each tool supports.
Pandas-Profiling
Let’s set the working directory
import os
os.chdir('YOURPATH')
os.getcwd()
and import the key libraries and load the input data
import pandas as pd
from pandas_profiling import ProfileReport

df = pd.read_csv("https://people.sc.fsu.edu/~jburkardt/data/csv/airtravel.csv")
Let’s generate the HTML report
profile = ProfileReport(df, title="Pandas Profiling Report")
profile.to_file("report.html")
Let's look at the report
The report includes the following sections: Interactions, Correlations (heatmap), and Correlations (table).
SweetViz
Let’s import the key libraries
import sweetviz as sv
import pandas as pd
and proceed with the same dataset as above, viz.
df = pd.read_csv("https://people.sc.fsu.edu/~jburkardt/data/csv/airtravel.csv")
Let’s create and display an analysis report for our data
report = sv.analyze(df)
Done! Use ‘show’ commands to display/save
report.show_html()
Report SWEETVIZ_REPORT.html was generated! NOTEBOOK/COLAB USERS: the web browser MAY not pop up, regardless, the report IS saved in your notebook/colab files.
Let’s look at the report
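Beyond analyze(), SweetViz can also profile a DataFrame against a target column or compare two DataFrames side by side. A minimal, hedged sketch using the same air-travel data (the "1958" target column and the row split are illustrative choices; verify the keyword names against your installed SweetViz version):

# Hedged SweetViz sketch: target analysis and a side-by-side comparison (assumed API; verify versions)
target_report = sv.analyze(df, target_feat="1958")   # "1958" is one of the numeric columns in airtravel.csv
target_report.show_html("SWEETVIZ_TARGET.html")

# compare two subsets of the same data side by side
compare_report = sv.compare([df.iloc[:6], "First half"], [df.iloc[6:], "Second half"])
compare_report.show_html("SWEETVIZ_COMPARE.html")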
Other Libraries
Other automation libraries that can assist in EDA are as follows (cf. References):
Dataprep, D-tale, Pandas GUI, Dabl, Bamboolib, AutoViz, Dora, Visidata, Scattertext, QuickDA, ExploriPy, Rath, and Lux.
Summary
- Automated EDA libraries can perform data analysis, wrangling, editing, cleaning, visualization, and ETL transformations in a few lines of Python code.
- Pandas-Profiling creates an interactive HTML report that displays various summary statistics and visualizations of a given Pandas DataFrame.
- SweetViz also creates an HTML report with visualizations that provide insights into the data, including distributions of features, missing values, and correlations between features (Associations).
- The visualizations and report contents can be customized and fine-tuned as needed to best suit your EDA needs (a configuration sketch follows this list).
- These tools are especially useful for marketers, salespeople, and BI analysts looking to analyze data and present it to stakeholders.
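As a hedged illustration of that customization (the keyword arguments below follow the pandas-profiling documentation and are assumptions to verify against your installed version), a minimal sketch:

# Hedged pandas-profiling customization sketch (assumes the airtravel df loaded earlier)
from pandas_profiling import ProfileReport

profile = ProfileReport(
    df,
    title="Pandas Profiling Report (minimal)",
    minimal=True,   # skip the more expensive computations such as interactions
)
profile.to_file("report_minimal.html")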
Explore More
- Advanced Integrated Data Visualization (AIDV) in Python – 2. Dabl Auto EDA & ML
- Joint Analysis of Bitcoin, Gold and Crude Oil Prices with Optimized Risk/Return in 2023
References
- How to automate the Exploratory Data Analysis process?
- Automatic EDA – Python
- Step By Step Guide To Automated Exploratory Data Analysis Using Sweetviz in Python
- Top Python Libraries to Automate Exploratory Data Analysis in 2021
- 10 automated EDA libraries in one place
- Comparing the Five Most Popular EDA Tools
- 10 Automated EDA Tools That Will Save You Hours Of Work
- Best Libraries That Will Assist You In EDA
Exploratory Data Analysis (EDA) and NLP Visualization of Restaurant Guest Reviews on Tripadvisor – Eating Out in Europe
Photo by Pablo Merchán Montes on Unsplash
Tripadvisor, Inc. is an American internet-based travel company headquartered in Needham, MA. Its branded sites and forums operate as online travel guides offering free user-generated reviews of travel-related content, tools for price comparisons, and online travel booking services. As of 2022, Tripadvisor's total number of user reviews and ratings reached approximately one billion, covering listings for restaurants, hotels, vacation rentals, experiences, and attractions.
Following recent Kaggle studies, let's examine restaurant guest reviews on Tripadvisor (TA) to understand guests' experiences of eating out in Europe.
In this post, we will perform sentiment classification of TA restaurant reviews in Europe and develop a recommendation system based on restaurant descriptions, summaries, and user comments.
Goal: we will explore the restaurant reviews and the ratings based on customer experience.
How Important Are Online Reviews?
In an ultra-competitive market like Europe, a restaurant's online reviews can make or break its business. Restaurant owners are fully aware of the importance of customer reviews and are doing everything possible to boost their online image.
We will answer the following relevant questions:
- Are there differences in average ratings and number of ratings between cities?
- Are there more vegetarian-friendly cities and if so, are they locally concentrated?
- Is local cuisine rated better than foreign cuisine, and if so, is there a difference between cities?
Disclaimer:
Review websites such as TA are built on credibility. Biased or inflated reviews can have a detrimental effect on user trust.
About Dataset
This Kaggle dataset has been obtained by scraping TA ratings and reviews for restaurants across 31 European cities.
The dataset contains restaurant information for 31 cities in Europe: Amsterdam (NL), Athens (GR), Barcelona (ES), Berlin (DE), Bratislava (SK), Bruxelles (BE), Budapest (HU), Copenhagen (DK), Dublin (IE), Edinburgh (UK), Geneva (CH), Helsinki (FI), Hamburg (DE), Krakow (PL), Lisbon (PT), Ljubljana (SI), London (UK), Luxembourg (LU), Madrid (ES), Lyon (FR), Milan (IT), Munich (DE), Oporto (PT), Oslo (NO), Paris (FR), Prague (CZ), Rome (IT), Stockholm (SE), Vienna (AT), Warsaw (PL), Zurich (CH).
It is structured as follows:
- Name: name of the restaurant
- City: city location of the restaurant
- Cuisine Style: cuisine style(s) of the restaurant, in a Python list object (94 046 non-null)
- Ranking: rank of the restaurant among the total number of restaurants in the city as a float object (115 645 non-null)
- Rating: rate of the restaurant on a scale from 1 to 5, as a float object (115 658 non-null)
- Price Range: price range of the restaurant among 3 categories , as a categorical type (77 555 non-null)
- Number of Reviews: number of reviews that customers have left for the restaurant, as a float object (108 020 non-null)
- Reviews: 2 reviews that are displayed on the restaurant's scrolling page of the city, as a list-of-lists object where the first list contains the 2 reviews and the second the dates when these reviews were written (115 673 non-null); a parsing sketch follows this list
- URL_TA: part of the URL of the detailed restaurant page that comes after ‘www.tripadvisor.com’ as a string object (124 995 non-null)
- ID_TA: identification of the restaurant in the TA database, constructed as one letter followed by a number (124 995 non-null).
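Because the Reviews column stores that nested list as a plain string, a small parsing helper makes the format concrete. This is a hypothetical sketch, not part of the original pipeline; the word-cloud section below parses the same field with plain string splitting instead:

# Hypothetical helper illustrating the Reviews format "[['text1', 'text2'], ['date1', 'date2']]"
import ast
import pandas as pd

def split_reviews(cell):
    """Return (review_texts, review_dates) from one Reviews cell; empty lists if missing/empty."""
    if pd.isna(cell) or cell == '[[], []]':
        return [], []
    try:
        texts, dates = ast.literal_eval(cell)
        return texts, dates
    except (ValueError, SyntaxError):
        # some cells contain unescaped quotes and cannot be parsed as a literal
        return [], []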
Data Preparation
Let’s set the working directory, import the key libraries, load the input dataset and examine the data structure
import os
os.chdir('YOURPATH')
os.getcwd()

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sb
import re
from collections import Counter
from PIL import Image

%matplotlib inline

df_ta_reviews = pd.read_csv('TA_restaurants_curated.csv', encoding='utf8', index_col=0)
df_ta_reviews.head()
df_ta_reviews.shape
(125527, 10)
df_ta_reviews.info()
<class 'pandas.core.frame.DataFrame'> Index: 125527 entries, 0 to 1666 Data columns (total 10 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Name 125527 non-null object 1 City 125527 non-null object 2 Cuisine Style 94176 non-null object 3 Ranking 115876 non-null float64 4 Rating 115897 non-null float64 5 Price Range 77672 non-null object 6 Number of Reviews 108183 non-null float64 7 Reviews 115911 non-null object 8 URL_TA 125527 non-null object 9 ID_TA 125527 non-null object dtypes: float64(3), object(7) memory usage: 10.5+ MB
Let’s perform the following data wrangling:
- Fill NaN with zero
df_ta_reviews['Ranking'] = df_ta_reviews['Ranking'].astype('category')
df_ta_reviews['Number of Reviews'] = df_ta_reviews['Number of Reviews'].fillna(0)
df_ta_reviews['Number of Reviews'] = df_ta_reviews['Number of Reviews'].round(0).astype('int')
- Remove duplicates
print(df_ta_reviews[df_ta_reviews.ID_TA.duplicated() == True].ID_TA.count())
df_ta_reviews = df_ta_reviews.drop_duplicates('ID_TA', keep='first')
df_ta_reviews.info()201 <class 'pandas.core.frame.DataFrame'> Index: 125326 entries, 0 to 1666 Data columns (total 10 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Name 125326 non-null object 1 City 125326 non-null object 2 Cuisine Style 94047 non-null object 3 Ranking 115710 non-null category 4 Rating 115734 non-null float64 5 Price Range 77574 non-null object 6 Number of Reviews 125326 non-null int32 7 Reviews 115745 non-null object 8 URL_TA 125326 non-null object 9 ID_TA 125326 non-null object dtypes: category(1), float64(1), int32(1), object(7) memory usage: 10.0+ MB
- Rename columns
df_ta_reviews.rename(columns={'Name': 'name',
                              'City': 'city',
                              'Ranking': 'ranking',
                              'Rating': 'rating',
                              'Reviews': 'reviews',
                              'Cuisine Style': 'cuisine_style',
                              'Price Range': 'price_range',
                              'Number of Reviews': 'reviews_number'}, inplace=True)
- Fill negative ratings with zero
print(df_ta_reviews[df_ta_reviews.rating == -1.0].city.count())
df_ta_reviews.rating.replace(-1, 0, inplace=True)

41
- Create the data copy
ta_reviews = df_ta_reviews.copy()
- Count the unique restaurants and cities
print("Single Restaurants: {}".format(ta_reviews.shape[0]))
print("Cities: {}".format(ta_reviews.city.nunique()))

Single Restaurants: 125326
Cities: 31
Basic Bar Plots
- Number of restaurants in each city
import matplotlib
matplotlib.rc('xtick', labelsize=20)
matplotlib.rc('ytick', labelsize=20)

fig = plt.figure(figsize=(20, 9))
ax = plt.subplot()

df_restaurants_in_cities = ta_reviews.groupby('city').name.count().sort_values(ascending=False)

plt.bar(x=df_restaurants_in_cities.index, height=df_restaurants_in_cities, color="#4ECDC4");
plt.xticks(rotation='vertical');
plt.ylabel('Number of Restaurants', fontsize=20);
ax.spines['right'].set_visible(False);
ax.spines['top'].set_visible(False);
ax.spines['left'].set_visible(False);
plt.title('Number of restaurants in each city', fontsize=20);
ax.tick_params(direction='out', length=0, width=0, colors='black');
plt.grid(color='black', linestyle='--', linewidth=0.5)
- Number of reviews for each city
import matplotlib
matplotlib.rc('xtick', labelsize=20)
matplotlib.rc('ytick', labelsize=20)

df_reviews_count = ta_reviews.groupby('city').reviews_number.sum().sort_values(ascending=False)
count_millions = np.arange(0, 2.14e6, 20e4)
count = np.arange(0, 2.6, 0.25)

fig = plt.figure(figsize=(20, 9))
ax = plt.subplot()

plt.bar(x=df_reviews_count.index, height=df_reviews_count, color="#4ECDC4");
plt.xticks(rotation='vertical');
plt.yticks(count_millions, count);
plt.ylabel('Total Number of Reviews (Millions)', fontsize=20);
ax.spines['right'].set_visible(False);
ax.spines['top'].set_visible(False);
ax.spines['left'].set_visible(False);
plt.title('Number of reviews for each city', fontsize=20);
ax.tick_params(direction='out', length=0, width=0, colors='black');
plt.grid(color='black', linestyle='--', linewidth=0.5)
- Average Number of reviews for each city
import matplotlib
matplotlib.rc('xtick', labelsize=20)
matplotlib.rc('ytick', labelsize=20)

tot_reviews_city = pd.DataFrame(ta_reviews.groupby('city').reviews_number.sum())
tot_places_city = pd.DataFrame(ta_reviews.groupby('city').name.count())

reviews_per_city = pd.merge(tot_reviews_city, tot_places_city, how='outer', on='city')
reviews_per_city.rename(columns={'name': 'number_of_places'}, inplace=True)
reviews_per_city['avg_reviews'] = round(reviews_per_city.reviews_number / reviews_per_city.number_of_places, 2)
reviews_per_city.sort_values(by='avg_reviews', ascending=False, inplace=True)

fig = plt.figure(figsize=(20, 9))
ax = plt.subplot()

plt.bar(x=reviews_per_city.index, height=reviews_per_city.avg_reviews, color="#4ECDC4");
plt.xticks(rotation='vertical', fontsize=20);
ax.spines['right'].set_visible(False);
ax.spines['top'].set_visible(False);
ax.spines['left'].set_visible(False);
plt.title('Average Number of reviews for each city', fontsize=20);
ax.tick_params(direction='out', length=0, width=0, colors='black');
plt.grid(color='black', linestyle='--', linewidth=0.5)
- Let's create the new data frame
cuisines = ta_reviews.loc[:, ['city', 'name', 'cuisine_style', 'rating', 'reviews_number', 'reviews']]
cuisines.cuisine_style = cuisines.cuisine_style.str.replace('[', '', regex=False)
cuisines.cuisine_style = cuisines.cuisine_style.str.replace(']', '', regex=False)
cuisines.columns = ['city', 'place', 'cuisine_style', 'rating', 'reviews_number', 'reviews']

all_cuisines = cuisines[cuisines.cuisine_style.isna() == False]
dic = []
for i in all_cuisines.iterrows():
    for j in range(0, len(i[1].cuisine_style.split(', '))):
        dic.append({
            'city': i[1].city,
            'place': i[1].place,
            'cuisine_style': i[1].cuisine_style.split(', ')[j].replace('\'', ''),
            'rating': i[1].rating,
            'reviews_number': i[1].reviews_number
        })

cuisines_list = pd.DataFrame(data=dic)
cuisines_list.cuisine_style.nunique()
127
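The same flattening can be done without an explicit loop. A hedged pandas sketch using explode() (it should reproduce the 127 unique styles, but verify on your copy of the data):

# Vectorized alternative to the iterrows() loop above (a sketch, not the original author's code)
cuisines_alt = cuisines.dropna(subset=['cuisine_style']).copy()
cuisines_alt['cuisine_style'] = (
    cuisines_alt['cuisine_style']
    .str.replace("'", "", regex=False)   # drop the quote characters left from the list literal
    .str.split(', ')                     # one list of styles per restaurant
)
cuisines_alt = cuisines_alt.explode('cuisine_style').reset_index(drop=True)
print(cuisines_alt.cuisine_style.nunique())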
- Top 10 most common cuisines
import matplotlib
matplotlib.rc('xtick', labelsize=20)
matplotlib.rc('ytick', labelsize=20)

df_cuisine_style = cuisines_list.cuisine_style.value_counts().sort_values(ascending=False)[:10]
count_ths = np.arange(0, 3.3e4, 5e3)
count = np.arange(0, 35, 5)

fig = plt.figure(figsize=(20, 9))
ax = plt.subplot()

plt.bar(x=df_cuisine_style.index, height=df_cuisine_style, color="#4ECDC4");
plt.yticks(count_ths, count, fontsize=20);
plt.ylabel('Total Places (Thousands)', fontsize=20);
ax.spines['right'].set_visible(False);
ax.spines['top'].set_visible(False);
ax.spines['left'].set_visible(False);
plt.title('Top 10 most common cuisines', fontsize=20);
ax.tick_params(direction='out', length=0, width=0, colors='black');
plt.xticks(rotation='vertical', fontsize=20);
plt.grid(color='black', linestyle='--', linewidth=0.5)
- Top 10 most reviewed cuisines
df_cuisine_style = cuisines_list.groupby('cuisine_style').reviews_number.sum().sort_values(ascending=False)[:10]
count_ths = np.arange(0, 9.3e6, 5e5)
count = np.arange(0, 9.3, 0.5)

fig = plt.figure(figsize=(20, 9))
ax = plt.subplot()

plt.bar(x=df_cuisine_style.index, height=df_cuisine_style, color="#4ECDC4");
plt.yticks(count_ths, count);
plt.ylabel('Total Reviews (Million)', fontsize=20);
ax.spines['right'].set_visible(False);
ax.spines['top'].set_visible(False);
ax.spines['left'].set_visible(False);
plt.title('Top 10 most reviewed cuisines', fontsize=20);
ax.tick_params(direction='out', length=0, width=0, colors='black');
plt.xticks(rotation='vertical', fontsize=20);
- Cuisine styles per city
import matplotlib
matplotlib.rc('xtick', labelsize=20)
matplotlib.rc('ytick', labelsize=20)

city_styles = cuisines_list.groupby(['city', 'cuisine_style']).rating.count().reset_index()
city_variety = city_styles.groupby('city').cuisine_style.count().sort_values(ascending=False)

fig = plt.figure(figsize=(20, 13))
ax = plt.subplot()

ax.barh(city_variety.index[::-1], city_variety.values[::-1], align='center', color="#4ECDC4");
ax.spines['right'].set_visible(False);
ax.spines['top'].set_visible(False);
ax.spines['left'].set_visible(False);
plt.title('Cuisine styles per city', fontsize=20);

Spatial Analytics
- Southern Europe Favourite Cuisine Styles
from matplotlib import cm
euro = np.array(Image.open('europeblue.png'))

grn = ['#006400', '#007200', '#007200', '#008000', '#008000',
       '#38b000', '#38b000', '#70e000', '#9ef01a', '#9ef01a', ]
orng = ['#FF5400', '#FF6000', '#FF6D00', '#FF6D00', '#FF8500',
        '#FF9100', '#FF9E00', '#FFAA00', '#FFB600', '#FFB600', ]

plt.figure(figsize=(18, 10), facecolor=None);
ax = plt.subplot();
plt.title('Southern Europe Favourite Cuisine Styles')
plt.imshow(euro);

plt.text(
    300, 30, 'Most Reviewed Cuisine Styles',
    ha="center", va="center", rotation=0, size=15, color="white",
    bbox=dict(boxstyle="square,pad=0.3", fc='#004b23', ec="w", lw=1));

# reviews_south / rating_south hold the per-style review counts and mean ratings for the
# southern-European cities (their construction is not shown in the post)
top = 40
left = 0
stili = reviews_south.cuisine_style.to_list()
i = 0
for styl in stili:
    styl_score = styl + " " + str(reviews_south[reviews_south.cuisine_style == styl].revs.item())
    top = top + 40
    plt.text(
        300, top, styl_score, ha="center", va="center", rotation=0, size=15, color="white",
        bbox=dict(boxstyle="square,pad=0.3", fc=grn[i], ec="w", lw=1))
    i += 1

plt.text(
    520, 40, 'Best Rated Cuisine Styles',
    ha="center", va="center", rotation=0, size=15, color="white",
    bbox=dict(boxstyle="square,pad=0.3", fc='#ff4800', ec="w", lw=1));

top = 40
left = 520
rating_1000 = rating_south[rating_south.rat_count >= 3000]
rating_1000 = rating_1000.sort_values(by='rat_mean', ascending=False)[:10]
stili = rating_1000.cuisine_style.to_list()
i = 0
for styl in stili:
    styl_score = styl + " " + str(round(rating_1000[rating_1000.cuisine_style == styl].rat_mean.item(), 2))
    top = top + 40
    plt.text(
        left, top, styl_score, ha="center", va="center", rotation=0, size=15, color="white",
        bbox=dict(boxstyle="square,pad=0.3", fc=orng[i], ec='w', lw=1))
    i += 1

ax.spines['right'].set_visible(False);
ax.spines['top'].set_visible(False);
ax.spines['left'].set_visible(False);
ax.spines['bottom'].set_visible(False);
ax.yaxis.set_visible(False);
ax.xaxis.set_visible(False);
- Folium map of European cities
import folium
import base64
from folium import IFrame

# map_cities and top_options hold per-city coordinates/flag URLs and the top-rated cuisine
# options (their construction is not shown in the post)
def get_flag(city):
    flag = map_cities.get(city)
    flag_url = flag['flag']
    return flag_url

def get_top_options2(city):
    opt = top_options[top_options.index.get_level_values('city').isin([city])][:3]
    opt = opt.reset_index()
    opt = opt.sort_values(by=["rating", "reviews_number"], ascending=[False, False])
    top_3 = ""
    icon_class = ""
    for i in opt.iterrows():
        if (i[1]['cuisine_style'] == "Gluten Free Options") or (i[1]['cuisine_style'] == "Vegan Options"):
            icon_class = "fa-pagelines"
        elif i[1]['cuisine_style'] == "Vegetarian Friendly":
            icon_class = "fa-tree"
        else:
            icon_class = "fa-globe"
        top_3 += "" + i[1]['cuisine_style'] + " " + str(i[1]['rating']) + " (" + str(i[1]['reviews_number']) + " reviews)"
    return top_3

europe = folium.Map(
    location=[52.4214, 8.3750],
    tiles="OpenStreetMap",
    zoom_start=4
)

for k, v in map_cities.items():
    flag = get_flag(k)
    # the original HTML markup for the popup was not preserved in the post;
    # only the plain-text fields are reconstructed here
    html = ""
    html += "{} {} {}".format(k, flag, get_top_options2(k))
    iframe = folium.IFrame(html, width=(300)+20, height=(110)+20)
    popup = folium.Popup(iframe, max_width=1000)
    folium.Marker(location=[v['lat'], v['lon']],
                  popup=popup,
                  icon=folium.Icon(color='darkpurple', icon='fa-cutlery', prefix='fa')
                  ).add_to(europe)
europe
Word Clouds
words_cuisines = cuisines.copy()
words_cuisines[(words_cuisines.reviews.isna()) | (words_cuisines.reviews == '[[], []]')].info()

<class 'pandas.core.frame.DataFrame'> Index: 28622 entries, 1361 to 1666 Data columns (total 6 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 city 28622 non-null object 1 place 28622 non-null object 2 cuisine_style 14754 non-null object 3 rating 19036 non-null float64 4 reviews_number 28622 non-null int32 5 reviews 19041 non-null object dtypes: float64(1), int32(1), object(4) memory usage: 1.4+ MB

all_revs = []
nan = 0
for ind, rev in words_cuisines['reviews'].items():
    if type(rev) != float:
        # the string is in a format [[reviews separated by comma], [review dates separated by comma]]
        # in order to only keep the reviews, i split by '], ['
        g = rev.split('], [')
        # and remove the first two chars at the beginning
        g = g[0][2:].replace('\'', '')
    else:
        g = ""
        nan += 1
    all_revs.append(g)

words_cuisines['reviews_only'] = all_revs
words_cuisines = words_cuisines.drop('reviews', axis=1)

from wordcloud import WordCloud, STOPWORDS
stopwords = set(STOPWORDS)

def words_cloud(words_list):
    all_words = ''
    # looping through all incidents and joining them to one text, to extract most common words
    for ind, wrds in words_list.items():
        tokens = wrds.split()
        all_words += " ".join(tokens) + " "
    return all_words

def words_freq(items):
    all_words = []
    for i, item in items.items():
        item = item.lower()
        item = re.sub("[^a-zA-Z]", " ", item)
        # keep numbers
        item = re.sub(r"[^a-zA-Z0-9\s]+", " ", item)
        all_words += item.split()
    words = Counter(all_words)
    return words

cloudmask = np.array(Image.open('europe_black.png'))
cloudmask = np.array(euro)

words = words_freq(words_cuisines['reviews_only'])
wordcloud = WordCloud(width=1000, height=700,
                      stopwords=stopwords, mask=cloudmask,
                      collocations=False,
                      background_color='white',
                      min_font_size=6).generate_from_frequencies(words)

plt.figure(figsize=(20, 9), facecolor=None)
plt.imshow(wordcloud)
plt.axis("off")
plt.tight_layout(pad=0)
plt.show()
Summary
- Food has been viewed as one of the noteworthy features influencing customer satisfaction in restaurant evaluations.
- Data and analytics are valuable for increasing revenue and improving operations.
- We utilize data to tailor marketing, optimize pricing and menu offerings, and make informed operational changes.
- In this study, we used the TripAdvisor Restaurants Info for 31 Euro-Cities from Kaggle.
- Online restaurant reviews retrieved from TripAdvisor have been analyzed by using domain-specific aspect-based sentiment detection and NLP visualization.
- The influence of different restaurant features on the overall evaluation by visitors and locals has been determined, and the most critical factors have been identified by the frequency of their online discussion.
- As can clearly be seen, average ratings differ by city.
- There are also large differences in the share of restaurants with vegetarian/vegan options available.
- Local foods and meals provided by restaurants have been recognized as a critical part of customer satisfaction.
Explore More
- Eating out in Europe
- A Shamelessly Simple E-Restaurant Order
- Semantic Analysis and NLP Visualizations of Wine Reviews
- ML/AI Prediction of Wine Quality
- SARIMAX-TSA Forecasting, QC and Visualization of E-Commerce Food Delivery Sales
- Trending YouTube Video Data Science, NLP Predictions & Sentiment Analysis
Improved Multiple-Model ML/DL Credit Card Fraud Detection: F1=88% & ROC=91%
Photo by CardMapr.nl on Unsplash
- In 2023, global card industry losses to fraud are projected to reach $36.13 bln, or $0.0668 per $100 of volume.
- Enhancing credit card fraud detection is therefore a top priority for all banks and financial organisations.
- Thanks to AI-based techniques, credit card fraud detection (CCFD) is becoming easier and more accurate. AI models can recognise unusual credit card transactions and fraud. CCFD involves collecting and sorting raw data, which is then used to train the model to predict the probability of fraud.
- AI allows for creating algorithms that process large datasets with many variables and help uncover hidden correlations between user behavior and the likelihood of fraudulent actions. Another strength of AI systems compared to rule-based ones is faster data processing with less manual work.
- Inspired by the recent study, our Python CCFD AI project aims to create an improved binary classifier that can recognize fraudulent credit card transactions with a high degree of confidence.
- Specifically, we have conducted a comparative analysis of the available supervised Machine Learning (ML) and Deep Learning (DL) techniques for CCFD.
- In this post, we will discuss a multiple-model ML/DL approach that closely resembles ensemble learning. An ensemble learning method combines the predictions from multiple contributing models (a minimal sketch follows this list).
- In addition, we will compare multiple techniques to determine the best performing model in detecting fraudulent transactions in terms of scikit-learn metrics and scoring to quantify the quality of CCFD predictions.
- Conventionally, we use the open-source Kaggle dataset that contains transactions made with credit cards in September 2013 by European cardholders. These are anonymized credit card transactions labeled as fraudulent (1) or genuine (0). The dataset is highly unbalanced: the positive class (frauds) accounts for 0.172% of all transactions.
- Read more: Dal Pozzolo, A., Boracchi, G., Caelen, O., Alippi, C., and Bontempi, G. Credit card fraud detection: a realistic modeling and a novel learning strategy. IEEE Transactions on Neural Networks and Learning Systems, 29(8), 3784-3797, 2018.
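A minimal sketch of what such a combination could look like with scikit-learn's soft-voting wrapper. This is an illustration only, not the procedure used in this post: below, each model is trained and scored separately.

# Illustrative soft-voting ensemble (a sketch; assumes the train/test splits prepared later in this post)
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from xgboost import XGBClassifier

voting_clf = VotingClassifier(
    estimators=[
        ('xgb', XGBClassifier()),
        ('rf', RandomForestClassifier(n_estimators=100)),
        ('lr', LogisticRegression(max_iter=1000)),
    ],
    voting='soft',  # average predicted probabilities rather than taking a majority vote
)
# voting_clf.fit(X_train, y_train)
# voting_clf.predict(X_test)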
Clickable Table of Contents
- Data Preparation & Exploratory Analysis
- Keras ANN Model Training & Validation
- Multiple-Model Training & Validation
- Multiple-Model ML/DL F1-Score Comparison
- Best ML/DL Model Performance QC Analysis
- K-Means PCA Clusters & Variance Analysis
- Final XGBoost Classification Report
- Summary
- Explore More
Data Preparation & Exploratory Analysis
Let’s set the working directory
import os
os.chdir('YOURPATH')
os.getcwd()
and import the necessary packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline
sns.set_style("whitegrid")
Let's load the dataset from the csv file using Pandas
data = pd.read_csv("creditcard.csv")
data.head()

(5 rows × 31 columns)
Here, features V1, V2, …, V28 are the principal components obtained with PCA; the only features which have not been transformed with PCA are 'Time' and 'Amount'. Feature 'Time' contains the seconds elapsed between each transaction and the first transaction in the dataset. Feature 'Amount' is the transaction amount; this feature can be used for example-dependent cost-sensitive learning. Feature 'Class' is the response variable, and it takes value 1 in case of fraud and 0 otherwise.
Describing the Input Data:
data.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 284807 entries, 0 to 284806 Data columns (total 31 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Time 284807 non-null float64 1 V1 284807 non-null float64 2 V2 284807 non-null float64 3 V3 284807 non-null float64 4 V4 284807 non-null float64 5 V5 284807 non-null float64 6 V6 284807 non-null float64 7 V7 284807 non-null float64 8 V8 284807 non-null float64 9 V9 284807 non-null float64 10 V10 284807 non-null float64 11 V11 284807 non-null float64 12 V12 284807 non-null float64 13 V13 284807 non-null float64 14 V14 284807 non-null float64 15 V15 284807 non-null float64 16 V16 284807 non-null float64 17 V17 284807 non-null float64 18 V18 284807 non-null float64 19 V19 284807 non-null float64 20 V20 284807 non-null float64 21 V21 284807 non-null float64 22 V22 284807 non-null float64 23 V23 284807 non-null float64 24 V24 284807 non-null float64 25 V25 284807 non-null float64 26 V26 284807 non-null float64 27 V27 284807 non-null float64 28 V28 284807 non-null float64 29 Amount 284807 non-null float64 30 Class 284807 non-null int64 dtypes: float64(30), int64(1) memory usage: 67.4 MB
pd.set_option("display.float", "{:.2f}".format)
data.describe().T

Let us now check the missing values in the dataset
data.isnull().sum().sum()
0
data.columns
Index(['Time', 'V1', 'V2', 'V3', 'V4', 'V5', 'V6', 'V7', 'V8', 'V9', 'V10', 'V11', 'V12', 'V13', 'V14', 'V15', 'V16', 'V17', 'V18', 'V19', 'V20','V21', 'V22', 'V23', 'V24', 'V25', 'V26', 'V27', 'V28', 'Amount', 'Class'], dtype='object')
Let’s look at the Transaction Class Distribution
LABELS = ["Normal", "Fraud"]
count_classes = pd.value_counts(data['Class'], sort=True)
count_classes.plot(kind='bar', rot=0)
plt.title("Transaction Class Distribution")
plt.xticks(range(2), LABELS)
plt.xlabel("Class")
plt.ylabel("Frequency");

data.Class.value_counts()
Class 0 284315 1 492 Name: count, dtype: int64
data.Class.value_counts(normalize=True).mul(100).round(3).astype(str) + '%'
Class 0 99.827% 1 0.173% Name: proportion, dtype: object
It appears that most of the transactions are non-fraud: only 0.173% of all transactions are fraudulent. The data is highly unbalanced.
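One standard way to quantify (and later compensate for) this imbalance is scikit-learn's balanced class weights. A hedged sketch for reference; the workflow below derives its own weights w_p and w_n manually from the training split:

# Balanced class weights as a reference point (a sketch; not used verbatim in the training below)
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

classes = np.unique(data['Class'])
weights = compute_class_weight(class_weight='balanced', classes=classes, y=data['Class'])
print(dict(zip(classes, weights)))  # the minority (fraud) class receives a much larger weight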
Let’s separate fraud and normal data
fraud = data[data['Class']==1]
normal = data[data['Class']==0]

print(f"Shape of Fraudulent transactions: {fraud.shape}")
print(f"Shape of Non-Fraudulent transactions: {normal.shape}")

Shape of Fraudulent transactions: (492, 31)
Shape of Non-Fraudulent transactions: (284315, 31)
Let's compare the Amount statistics of fraud vs. normal transactions
pd.concat([fraud.Amount.describe(), normal.Amount.describe()], axis=1)
Let's compare the Time statistics of fraud vs. normal transactions
pd.concat([fraud.Time.describe(), normal.Time.describe()], axis=1)
Let's look at the feature heatmap to find any high correlations
plt.figure(figsize=(10,10))
sns.heatmap(data=data.corr(), cmap="seismic")
plt.show();

The correlation matrix graphically gives us an idea of how features correlate with each other and can help us see which features are most relevant for the prediction. It is clear that most of the features do not correlate with other features, but some do have either a positive or a negative correlation with each other. For example, V2 and V5 are highly negatively correlated with the Amount.
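To back that reading of the heatmap with numbers, one can rank the correlations against Amount directly (a quick sketch on the same data frame):

# Rank features by their signed correlation with the transaction Amount
amount_corr = data.corr()['Amount'].drop(['Amount', 'Class']).sort_values()
print(amount_corr.head(5))   # strongest negative correlations (V2 and V5 are expected here)
print(amount_corr.tail(5))   # strongest positive correlations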
Finally, let's divide the data into features and target variables. In addition, we split the dataset into training, validation, and test subsets: the first for training the model, the others for validating and testing the trained model's performance:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

scalar = StandardScaler()
X = data.drop('Class', axis=1)
y = data.Class

X_train_v, X_test, y_train_v, y_test = train_test_split(X, y,
                                                        test_size=0.3, random_state=42)
X_train, X_validate, y_train, y_validate = train_test_split(X_train_v, y_train_v,
                                                            test_size=0.2, random_state=42)

X_train = scalar.fit_transform(X_train)
X_validate = scalar.transform(X_validate)
X_test = scalar.transform(X_test)

w_p = y_train.value_counts()[0] / len(y_train)
w_n = y_train.value_counts()[1] / len(y_train)

print(f"Fraudulent transaction weight: {w_n}")
print(f"Non-Fraudulent transaction weight: {w_p}")

Fraudulent transaction weight: 0.0017994745785028623
Non-Fraudulent transaction weight: 0.9982005254214972

print(f"TRAINING: X_train: {X_train.shape}, y_train: {y_train.shape}\n{'_'*55}")
print(f"VALIDATION: X_validate: {X_validate.shape}, y_validate: {y_validate.shape}\n{'_'*50}")
print(f"TESTING: X_test: {X_test.shape}, y_test: {y_test.shape}")

TRAINING: X_train: (159491, 30), y_train: (159491,)
_______________________________________________________
VALIDATION: X_validate: (39873, 30), y_validate: (39873,)
__________________________________________________
TESTING: X_test: (85443, 30), y_test: (85443,)
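Given the roughly 0.17% positive rate, it may be worth stratifying the splits so every partition keeps the same fraud ratio. A hedged variant of the split above (not what was run here, so the reported shapes and class counts would change slightly):

# Stratified train/validation/test split (a sketch; the original run did not stratify)
X_train_v, X_test, y_train_v, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)
X_train, X_validate, y_train, y_validate = train_test_split(
    X_train_v, y_train_v, test_size=0.2, random_state=42, stratify=y_train_v)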
We will need the following function to print the model score
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report, f1_score

def print_score(label, prediction, train=True):
    if train:
        clf_report = pd.DataFrame(classification_report(label, prediction, output_dict=True))
        print("Train Result:\n================================================")
        print(f"Accuracy Score: {accuracy_score(label, prediction) * 100:.2f}%")
        print("_______________________________________________")
        print(f"Classification Report:\n{clf_report}")
        print("_______________________________________________")
        print(f"Confusion Matrix: \n {confusion_matrix(label, prediction)}\n")
    elif train == False:
        clf_report = pd.DataFrame(classification_report(label, prediction, output_dict=True))
        print("Test Result:\n================================================")
        print(f"Accuracy Score: {accuracy_score(label, prediction) * 100:.2f}%")
        print("_______________________________________________")
        print(f"Classification Report:\n{clf_report}")
        print("_______________________________________________")
        print(f"Confusion Matrix: \n {confusion_matrix(label, prediction)}\n")
Keras ANN Model Training & Validation
Let’s define the following Artificial Neural Network (ANN) model
from tensorflow import keras

model = keras.Sequential([
    keras.layers.Dense(256, activation='relu', input_shape=(X_train.shape[-1],)),
    keras.layers.BatchNormalization(),
    keras.layers.Dropout(0.3),
    keras.layers.Dense(256, activation='relu'),
    keras.layers.BatchNormalization(),
    keras.layers.Dropout(0.3),
    keras.layers.Dense(256, activation='relu'),
    keras.layers.BatchNormalization(),
    keras.layers.Dropout(0.3),
    keras.layers.Dense(1, activation='sigmoid'),
])

model.summary()
Model: "sequential" _________________________________________________________________ Layer (type) Output Shape Param # ================================================================= dense (Dense) (None, 256) 7936 batch_normalization (BatchN (None, 256) 1024 ormalization) dropout (Dropout) (None, 256) 0 dense_1 (Dense) (None, 256) 65792 batch_normalization_1 (Batc (None, 256) 1024 hNormalization) dropout_1 (Dropout) (None, 256) 0 dense_2 (Dense) (None, 256) 65792 batch_normalization_2 (Batc (None, 256) 1024 hNormalization) dropout_2 (Dropout) (None, 256) 0 dense_3 (Dense) (None, 1) 257 ================================================================= Total params: 142,849 Trainable params: 141,313 Non-trainable params: 1,536
Let’s compile and train this model
METRICS = [
    keras.metrics.FalseNegatives(name='fn'),
    keras.metrics.FalsePositives(name='fp'),
    keras.metrics.TrueNegatives(name='tn'),
    keras.metrics.TruePositives(name='tp'),
    keras.metrics.Precision(name='precision'),
    keras.metrics.Recall(name='recall')
]

model.compile(optimizer=keras.optimizers.Adam(1e-4), loss='binary_crossentropy', metrics=METRICS)
callbacks = [keras.callbacks.ModelCheckpoint('fraud_model_at_epoch_{epoch}.h5')]
class_weight = {0: w_p, 1: w_n}  # defined here but not passed to model.fit() below

r = model.fit(
    X_train, y_train,
    validation_data=(X_validate, y_validate),
    batch_size=2048,
    epochs=300,
    callbacks=callbacks,
)
Epoch 1/300 78/78 [==============================] - 4s 39ms/step - loss: 0.8119 - fn: 68.0000 - fp: 74850.0000 - tn: 84354.0000 - tp: 219.0000 - precision: 0.0029 - recall: 0.7631 - val_loss: 0.6675 - val_fn: 7.0000 - val_fp: 11144.0000 - val_tn: 28660.0000 - val_tp: 62.0000 - val_precision: 0.0055 - val_recall: 0.8986 .......................................... Epoch 300/300 78/78 [==============================] - 3s 36ms/step - loss: 8.1106e-04 - fn: 29.0000 - fp: 9.0000 - tn: 159195.0000 - tp: 258.0000 - precision: 0.9663 - recall: 0.8990 - val_loss: 0.0059 - val_fn: 14.0000 - val_fp: 8.0000 - val_tn: 39796.0000 - val_tp: 55.0000 - val_precision: 0.8730 - val_recall: 0.7971
Let’s plot the keras.metrics vs Epochs
plt.figure(figsize=(12, 16))

plt.subplot(4, 2, 1)
plt.plot(r.history['loss'], label='Loss')
plt.plot(r.history['val_loss'], label='val_Loss')
plt.title('Loss Function evolution during training')
plt.legend()

plt.subplot(4, 2, 2)
plt.plot(r.history['fn'], label='fn')
plt.plot(r.history['val_fn'], label='val_fn')
plt.title('False Negatives evolution during training')
plt.legend()

plt.subplot(4, 2, 3)
plt.plot(r.history['precision'], label='precision')
plt.plot(r.history['val_precision'], label='val_precision')
plt.title('Precision evolution during training')
plt.legend()

plt.subplot(4, 2, 4)
plt.plot(r.history['recall'], label='recall')
plt.plot(r.history['val_recall'], label='val_recall')
plt.title('Recall evolution during training')
plt.legend()
plt.savefig('lossval.png')

Given the above trained model, let's predict labels of the train/test data

y_train_pred = model.predict(X_train)
y_test_pred = model.predict(X_test)

print_score(y_train, y_train_pred.round(), train=True)
print_score(y_test, y_test_pred.round(), train=False)

scores_dict = {
    'ANNs': {
        'Train': f1_score(y_train, y_train_pred.round()),
        'Test': f1_score(y_test, y_test_pred.round()),
    },
}4985/4985 [==============================] - 4s 744us/step 2671/2671 [==============================] - 2s 783us/step Train Result: ================================================ Accuracy Score: 99.99% _______________________________________________ Classification Report: 0 1 accuracy macro avg weighted avg precision 1.00 1.00 1.00 1.00 1.00 recall 1.00 0.97 1.00 0.98 1.00 f1-score 1.00 0.98 1.00 0.99 1.00 support 159204.00 287.00 1.00 159491.00 159491.00 _______________________________________________ Confusion Matrix: [[159203 1] [ 9 278]] Test Result: ================================================ Accuracy Score: 99.96% _______________________________________________ Classification Report: 0 1 accuracy macro avg weighted avg precision 1.00 0.90 1.00 0.95 1.00 recall 1.00 0.82 1.00 0.91 1.00 f1-score 1.00 0.86 1.00 0.93 1.00 support 85307.00 136.00 1.00 85443.00 85443.00 _______________________________________________ Confusion Matrix: [[85295 12] [ 25 111]]
Multiple-Model Training & Validation
Let’s invoke several popular supervised ML binary classification algorithms.
XGBClassifier
from xgboost import XGBClassifier

xgb_clf = XGBClassifier()
xgb_clf.fit(X_train, y_train, eval_metric='aucpr')

y_train_pred = xgb_clf.predict(X_train)
y_test_pred = xgb_clf.predict(X_test)

print_score(y_train, y_train_pred, train=True)
print_score(y_test, y_test_pred, train=False)

scores_dict['XGBoost'] = {
    'Train': f1_score(y_train, y_train_pred),
    'Test': f1_score(y_test, y_test_pred),
}Train Result: ================================================ Accuracy Score: 100.00% _______________________________________________ Classification Report: 0 1 accuracy macro avg weighted avg precision 1.00 1.00 1.00 1.00 1.00 recall 1.00 1.00 1.00 1.00 1.00 f1-score 1.00 1.00 1.00 1.00 1.00 support 159204.00 287.00 1.00 159491.00 159491.00 _______________________________________________ Confusion Matrix: [[159204 0] [ 0 287]] Test Result: ================================================ Accuracy Score: 99.96% _______________________________________________ Classification Report: 0 1 accuracy macro avg weighted avg precision 1.00 0.95 1.00 0.97 1.00 recall 1.00 0.82 1.00 0.91 1.00 f1-score 1.00 0.88 1.00 0.94 1.00 support 85307.00 136.00 1.00 85443.00 85443.00 _______________________________________________ Confusion Matrix: [[85301 6] [ 25 111]]
RandomForestClassifier
from sklearn.ensemble import RandomForestClassifier

rf_clf = RandomForestClassifier(n_estimators=100, oob_score=False)
rf_clf.fit(X_train, y_train)

y_train_pred = rf_clf.predict(X_train)
y_test_pred = rf_clf.predict(X_test)

print_score(y_train, y_train_pred, train=True)
print_score(y_test, y_test_pred, train=False)

scores_dict['Random Forest'] = {
    'Train': f1_score(y_train, y_train_pred),
    'Test': f1_score(y_test, y_test_pred),
}Train Result: ================================================ Accuracy Score: 100.00% _______________________________________________ Classification Report: 0 1 accuracy macro avg weighted avg precision 1.00 1.00 1.00 1.00 1.00 recall 1.00 1.00 1.00 1.00 1.00 f1-score 1.00 1.00 1.00 1.00 1.00 support 159204.00 287.00 1.00 159491.00 159491.00 _______________________________________________ Confusion Matrix: [[159204 0] [ 0 287]] Test Result: ================================================ Accuracy Score: 99.96% _______________________________________________ Classification Report: 0 1 accuracy macro avg weighted avg precision 1.00 0.90 1.00 0.95 1.00 recall 1.00 0.81 1.00 0.90 1.00 f1-score 1.00 0.85 1.00 0.93 1.00 support 85307.00 136.00 1.00 85443.00 85443.00 _______________________________________________ Confusion Matrix: [[85295 12] [ 26 110]]
CatBoostClassifier
from catboost import CatBoostClassifier

cb_clf = CatBoostClassifier()
cb_clf.fit(X_train, y_train)

Learning rate set to 0.089847
0: learn: 0.3914646 total: 169ms remaining: 2m 49s
.....
999: learn: 0.0001216 total: 16.6s remaining: 0us

y_train_pred = cb_clf.predict(X_train)
y_test_pred = cb_clf.predict(X_test)

print_score(y_train, y_train_pred, train=True)
print_score(y_test, y_test_pred, train=False)

scores_dict['CatBoost'] = {
    'Train': f1_score(y_train, y_train_pred),
    'Test': f1_score(y_test, y_test_pred),
}Train Result: ================================================ Accuracy Score: 100.00% _______________________________________________ Classification Report: 0 1 accuracy macro avg weighted avg precision 1.00 1.00 1.00 1.00 1.00 recall 1.00 1.00 1.00 1.00 1.00 f1-score 1.00 1.00 1.00 1.00 1.00 support 159204.00 287.00 1.00 159491.00 159491.00 _______________________________________________ Confusion Matrix: [[159204 0] [ 1 286]] Test Result: ================================================ Accuracy Score: 99.96% _______________________________________________ Classification Report: 0 1 accuracy macro avg weighted avg precision 1.00 0.93 1.00 0.97 1.00 recall 1.00 0.82 1.00 0.91 1.00 f1-score 1.00 0.87 1.00 0.94 1.00 support 85307.00 136.00 1.00 85443.00 85443.00 _______________________________________________ Confusion Matrix: [[85299 8] [ 25 111]]
DecisionTreeClassifier
from sklearn.tree import DecisionTreeClassifier

lgbm_clf = DecisionTreeClassifier(random_state=0)
lgbm_clf.fit(X_train, y_train)

y_train_pred = lgbm_clf.predict(X_train)
y_test_pred = lgbm_clf.predict(X_test)

print_score(y_train, y_train_pred, train=True)
print_score(y_test, y_test_pred, train=False)

scores_dict['DTC'] = {
    'Train': f1_score(y_train, y_train_pred),
    'Test': f1_score(y_test, y_test_pred),
}Train Result: ================================================ Accuracy Score: 100.00% _______________________________________________ Classification Report: 0 1 accuracy macro avg weighted avg precision 1.00 1.00 1.00 1.00 1.00 recall 1.00 1.00 1.00 1.00 1.00 f1-score 1.00 1.00 1.00 1.00 1.00 support 159204.00 287.00 1.00 159491.00 159491.00 _______________________________________________ Confusion Matrix: [[159204 0] [ 0 287]] Test Result: ================================================ Accuracy Score: 99.90% _______________________________________________ Classification Report: 0 1 accuracy macro avg weighted avg precision 1.00 0.66 1.00 0.83 1.00 recall 1.00 0.74 1.00 0.87 1.00 f1-score 1.00 0.69 1.00 0.85 1.00 support 85307.00 136.00 1.00 85443.00 85443.00 _______________________________________________ Confusion Matrix: [[85255 52] [ 36 100]]
AdaBoostClassifier
from sklearn.ensemble import AdaBoostClassifier

lgbm_clf = AdaBoostClassifier()
lgbm_clf.fit(X_train, y_train)

y_train_pred = lgbm_clf.predict(X_train)
y_test_pred = lgbm_clf.predict(X_test)

print_score(y_train, y_train_pred, train=True)
print_score(y_test, y_test_pred, train=False)

scores_dict['Ada'] = {
    'Train': f1_score(y_train, y_train_pred),
    'Test': f1_score(y_test, y_test_pred),
}Train Result: ================================================ Accuracy Score: 99.92% _______________________________________________ Classification Report: 0 1 accuracy macro avg weighted avg precision 1.00 0.80 1.00 0.90 1.00 recall 1.00 0.73 1.00 0.87 1.00 f1-score 1.00 0.76 1.00 0.88 1.00 support 159204.00 287.00 1.00 159491.00 159491.00 _______________________________________________ Confusion Matrix: [[159150 54] [ 77 210]] Test Result: ================================================ Accuracy Score: 99.93% _______________________________________________ Classification Report: 0 1 accuracy macro avg weighted avg precision 1.00 0.80 1.00 0.90 1.00 recall 1.00 0.76 1.00 0.88 1.00 f1-score 1.00 0.78 1.00 0.89 1.00 support 85307.00 136.00 1.00 85443.00 85443.00 _______________________________________________ Confusion Matrix: [[85281 26] [ 32 104]]
GaussianNB
from sklearn.naive_bayes import GaussianNB

lgbm_clf = GaussianNB()
lgbm_clf.fit(X_train, y_train)

y_train_pred = lgbm_clf.predict(X_train)
y_test_pred = lgbm_clf.predict(X_test)

print_score(y_train, y_train_pred, train=True)
print_score(y_test, y_test_pred, train=False)

scores_dict['GNB'] = {
    'Train': f1_score(y_train, y_train_pred),
    'Test': f1_score(y_test, y_test_pred),
}Train Result: ================================================ Accuracy Score: 97.77% _______________________________________________ Classification Report: 0 1 accuracy macro avg weighted avg precision 1.00 0.06 0.98 0.53 1.00 recall 0.98 0.82 0.98 0.90 0.98 f1-score 0.99 0.12 0.98 0.55 0.99 support 159204.00 287.00 0.98 159491.00 159491.00 _______________________________________________ Confusion Matrix: [[155706 3498] [ 52 235]] Test Result: ================================================ Accuracy Score: 97.78% _______________________________________________ Classification Report: 0 1 accuracy macro avg weighted avg precision 1.00 0.06 0.98 0.53 1.00 recall 0.98 0.85 0.98 0.92 0.98 f1-score 0.99 0.11 0.98 0.55 0.99 support 85307.00 136.00 0.98 85443.00 85443.00 _______________________________________________ Confusion Matrix: [[83428 1879] [ 20 116]]
SVC
from sklearn.svm import SVC

lgbm_clf = SVC()
lgbm_clf.fit(X_train, y_train)

y_train_pred = lgbm_clf.predict(X_train)
y_test_pred = lgbm_clf.predict(X_test)

print_score(y_train, y_train_pred, train=True)
print_score(y_test, y_test_pred, train=False)

scores_dict['SVC'] = {
    'Train': f1_score(y_train, y_train_pred),
    'Test': f1_score(y_test, y_test_pred),
}Train Result: ================================================ Accuracy Score: 99.96% _______________________________________________ Classification Report: 0 1 accuracy macro avg weighted avg precision 1.00 0.99 1.00 0.99 1.00 recall 1.00 0.81 1.00 0.91 1.00 f1-score 1.00 0.89 1.00 0.95 1.00 support 159204.00 287.00 1.00 159491.00 159491.00 _______________________________________________ Confusion Matrix: [[159201 3] [ 54 233]] Test Result: ================================================ Accuracy Score: 99.94% _______________________________________________ Classification Report: 0 1 accuracy macro avg weighted avg precision 1.00 0.92 1.00 0.96 1.00 recall 1.00 0.65 1.00 0.83 1.00 f1-score 1.00 0.76 1.00 0.88 1.00 support 85307.00 136.00 1.00 85443.00 85443.00 _______________________________________________ Confusion Matrix: [[85299 8] [ 47 89]]
KNeighborsClassifier
from sklearn.neighbors import KNeighborsClassifier

lgbm_clf = KNeighborsClassifier()
lgbm_clf.fit(X_train, y_train)

y_train_pred = lgbm_clf.predict(X_train)
y_test_pred = lgbm_clf.predict(X_test)

print_score(y_train, y_train_pred, train=True)
print_score(y_test, y_test_pred, train=False)

scores_dict['KNN'] = {
    'Train': f1_score(y_train, y_train_pred),
    'Test': f1_score(y_test, y_test_pred),
}Train Result: ================================================ Accuracy Score: 99.95% _______________________________________________ Classification Report: 0 1 accuracy macro avg weighted avg precision 1.00 0.96 1.00 0.98 1.00 recall 1.00 0.78 1.00 0.89 1.00 f1-score 1.00 0.86 1.00 0.93 1.00 support 159204.00 287.00 1.00 159491.00 159491.00 _______________________________________________ Confusion Matrix: [[159194 10] [ 62 225]] Test Result: ================================================ Accuracy Score: 99.94% _______________________________________________ Classification Report: 0 1 accuracy macro avg weighted avg precision 1.00 0.87 1.00 0.93 1.00 recall 1.00 0.77 1.00 0.89 1.00 f1-score 1.00 0.82 1.00 0.91 1.00 support 85307.00 136.00 1.00 85443.00 85443.00 _______________________________________________ Confusion Matrix: [[85291 16] [ 31 105]]
LogisticRegression
from sklearn.linear_model import LogisticRegression

lgbm_clf = LogisticRegression()
lgbm_clf.fit(X_train, y_train)

y_train_pred = lgbm_clf.predict(X_train)
y_test_pred = lgbm_clf.predict(X_test)

print_score(y_train, y_train_pred, train=True)
print_score(y_test, y_test_pred, train=False)

scores_dict['LR'] = {
    'Train': f1_score(y_train, y_train_pred),
    'Test': f1_score(y_test, y_test_pred),
}Train Result: ================================================ Accuracy Score: 99.92% _______________________________________________ Classification Report: 0 1 accuracy macro avg weighted avg precision 1.00 0.89 1.00 0.94 1.00 recall 1.00 0.62 1.00 0.81 1.00 f1-score 1.00 0.73 1.00 0.87 1.00 support 159204.00 287.00 1.00 159491.00 159491.00 _______________________________________________ Confusion Matrix: [[159182 22] [ 108 179]] Test Result: ================================================ Accuracy Score: 99.93% _______________________________________________ Classification Report: 0 1 accuracy macro avg weighted avg precision 1.00 0.87 1.00 0.93 1.00 recall 1.00 0.64 1.00 0.82 1.00 f1-score 1.00 0.74 1.00 0.87 1.00 support 85307.00 136.00 1.00 85443.00 85443.00 _______________________________________________ Confusion Matrix: [[85294 13] [ 49 87]]
SGDClassifier
from sklearn.linear_model import SGDClassifier

lgbm_clf = SGDClassifier()
lgbm_clf.fit(X_train, y_train)

y_train_pred = lgbm_clf.predict(X_train)
y_test_pred = lgbm_clf.predict(X_test)

print_score(y_train, y_train_pred, train=True)
print_score(y_test, y_test_pred, train=False)

scores_dict['SGD'] = {
    'Train': f1_score(y_train, y_train_pred),
    'Test': f1_score(y_test, y_test_pred),
}Train Result: ================================================ Accuracy Score: 99.91% _______________________________________________ Classification Report: 0 1 accuracy macro avg weighted avg precision 1.00 0.88 1.00 0.94 1.00 recall 1.00 0.58 1.00 0.79 1.00 f1-score 1.00 0.70 1.00 0.85 1.00 support 159204.00 287.00 1.00 159491.00 159491.00 _______________________________________________ Confusion Matrix: [[159182 22] [ 121 166]] Test Result: ================================================ Accuracy Score: 99.92% _______________________________________________ Classification Report: 0 1 accuracy macro avg weighted avg precision 1.00 0.85 1.00 0.92 1.00 recall 1.00 0.58 1.00 0.79 1.00 f1-score 1.00 0.69 1.00 0.84 1.00 support 85307.00 136.00 1.00 85443.00 85443.00 _______________________________________________ Confusion Matrix: [[85293 14] [ 57 79]]
GradientBoostingClassifier
from sklearn.ensemble import GradientBoostingClassifier

lgbm_clf = GradientBoostingClassifier()
lgbm_clf.fit(X_train, y_train)

y_train_pred = lgbm_clf.predict(X_train)
y_test_pred = lgbm_clf.predict(X_test)

print_score(y_train, y_train_pred, train=True)
print_score(y_test, y_test_pred, train=False)

scores_dict['GBC'] = {
    'Train': f1_score(y_train, y_train_pred),
    'Test': f1_score(y_test, y_test_pred),
}Train Result: ================================================ Accuracy Score: 99.85% _______________________________________________ Classification Report: 0 1 accuracy macro avg weighted avg precision 1.00 0.98 1.00 0.99 1.00 recall 1.00 0.19 1.00 0.59 1.00 f1-score 1.00 0.32 1.00 0.66 1.00 support 159204.00 287.00 1.00 159491.00 159491.00 _______________________________________________ Confusion Matrix: [[159203 1] [ 233 54]] Test Result: ================================================ Accuracy Score: 99.85% _______________________________________________ Classification Report: 0 1 accuracy macro avg weighted avg precision 1.00 0.73 1.00 0.86 1.00 recall 1.00 0.14 1.00 0.57 1.00 f1-score 1.00 0.23 1.00 0.62 1.00 support 85307.00 136.00 1.00 85443.00 85443.00 _______________________________________________ Confusion Matrix: [[85300 7] [ 117 19]]
MLPClassifier
from sklearn.neural_network import MLPClassifier

lgbm_clf = MLPClassifier()
lgbm_clf.fit(X_train, y_train)

y_train_pred = lgbm_clf.predict(X_train)
y_test_pred = lgbm_clf.predict(X_test)

print_score(y_train, y_train_pred, train=True)
print_score(y_test, y_test_pred, train=False)

scores_dict['MLP'] = {
    'Train': f1_score(y_train, y_train_pred),
    'Test': f1_score(y_test, y_test_pred),
}Train Result: ================================================ Accuracy Score: 99.98% _______________________________________________ Classification Report: 0 1 accuracy macro avg weighted avg precision 1.00 1.00 1.00 1.00 1.00 recall 1.00 0.87 1.00 0.93 1.00 f1-score 1.00 0.93 1.00 0.96 1.00 support 159204.00 287.00 1.00 159491.00 159491.00 _______________________________________________ Confusion Matrix: [[159203 1] [ 38 249]] Test Result: ================================================ Accuracy Score: 99.95% _______________________________________________ Classification Report: 0 1 accuracy macro avg weighted avg precision 1.00 0.92 1.00 0.96 1.00 recall 1.00 0.76 1.00 0.88 1.00 f1-score 1.00 0.83 1.00 0.92 1.00 support 85307.00 136.00 1.00 85443.00 85443.00 _______________________________________________ Confusion Matrix: [[85298 9] [ 33 103]]
QuadraticDiscriminantAnalysis
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

lgbm_clf = QuadraticDiscriminantAnalysis()
lgbm_clf.fit(X_train, y_train)

y_train_pred = lgbm_clf.predict(X_train)
y_test_pred = lgbm_clf.predict(X_test)

print_score(y_train, y_train_pred, train=True)
print_score(y_test, y_test_pred, train=False)

scores_dict['QDA'] = {
    'Train': f1_score(y_train, y_train_pred),
    'Test': f1_score(y_test, y_test_pred),
}Train Result: ================================================ Accuracy Score: 97.55% _______________________________________________ Classification Report: 0 1 accuracy macro avg weighted avg precision 1.00 0.06 0.98 0.53 1.00 recall 0.98 0.87 0.98 0.92 0.98 f1-score 0.99 0.11 0.98 0.55 0.99 support 159204.00 287.00 0.98 159491.00 159491.00 _______________________________________________ Confusion Matrix: [[155334 3870] [ 38 249]] Test Result: ================================================ Accuracy Score: 97.55% _______________________________________________ Classification Report: 0 1 accuracy macro avg weighted avg precision 1.00 0.06 0.98 0.53 1.00 recall 0.98 0.90 0.98 0.94 0.98 f1-score 0.99 0.11 0.98 0.55 0.99 support 85307.00 136.00 0.98 85443.00 85443.00 _______________________________________________ Confusion Matrix: [[83229 2078] [ 13 123]]
Multiple-Model ML/DL F1-Score Comparison
scores_df = pd.DataFrame(scores_dict)
scores_df.plot(kind='barh', figsize=(15, 8), fontsize=16)
Best ML/DL Model Performance QC Analysis
Let’s compare ML/DL model results and performance using SciKit-Plot visualization
import scikitplot as skplt
import sklearn
from sklearn.model_selection import train_test_split
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

import matplotlib.pyplot as plt
import sys
import warnings
warnings.filterwarnings("ignore")

print("Scikit Plot Version : ", skplt.__version__)
print("Scikit Learn Version : ", sklearn.__version__)
print("Python Version : ", sys.version)

%matplotlib inline
Scikit Plot Version : 0.3.7 Scikit Learn Version : 1.2.2 Python Version : 3.9.16 (main, Jan 11 2023, 16:16:36) [MSC v.1916 64 bit (AMD64)]
XGBoost Classification Learning Curve
skplt.estimators.plot_learning_curve(xgb_clf, X_train, y_train,
                                     cv=7, shuffle=True, scoring="accuracy",
                                     n_jobs=-1, figsize=(6,4), title_fontsize="large", text_fontsize="large",
                                     title="XGBoost Classification Learning Curve");

Feature Importance: Random Forest vs XGBoost Classifier

col = data.columns
col = col.drop('Class')
col

Index(['Time', 'V1', 'V2', 'V3', 'V4', 'V5', 'V6', 'V7', 'V8', 'V9', 'V10', 'V11', 'V12', 'V13', 'V14', 'V15', 'V16', 'V17', 'V18', 'V19', 'V20', 'V21', 'V22', 'V23', 'V24', 'V25', 'V26', 'V27', 'V28', 'Amount'], dtype='object')
fig = plt.figure(figsize=(11,6))
ax1 = fig.add_subplot(121)
skplt.estimators.plot_feature_importances(rf_clf, feature_names=col,
                                          title="Random Forest Classifier Feature Importance",
                                          x_tick_rotation=90, order="ascending",
                                          ax=ax1);

ax2 = fig.add_subplot(122)
skplt.estimators.plot_feature_importances(xgb_clf, feature_names=col,
                                          title="XGBoost Classifier Feature Importance",
                                          x_tick_rotation=90,
                                          ax=ax2);
plt.tight_layout()
Calibration Plots
lr_probas = CatBoostClassifier().fit(X_train, y_train).predict_proba(X_test)
rf_probas = RandomForestClassifier().fit(X_train, y_train).predict_proba(X_test)
gb_probas = XGBClassifier().fit(X_train, y_train).predict_proba(X_test)
et_scores = MLPClassifier().fit(X_train, y_train).predict_proba(X_test)

probas_list = [lr_probas, rf_probas, gb_probas, et_scores]
clf_names = ['CatBoost', 'Random Forest', 'XGBoost', 'MLP']

skplt.metrics.plot_calibration_curve(y_test,
                                     probas_list,
                                     clf_names, n_bins=15,
                                     figsize=(12,6)
                                     );

Normalized Confusion Matrix
y_train_pred = xgb_clf.predict(X_train)
y_test_pred = xgb_clf.predict(X_test)
target_names = ['Normal', 'Fraud']

from sklearn.metrics import confusion_matrix
import seaborn as sns

cm = confusion_matrix(y_test, y_test_pred)
cmn = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
fig, ax = plt.subplots(figsize=(6,6))
sns.heatmap(cmn, annot=True, fmt='.2f', xticklabels=target_names, yticklabels=target_names)
plt.ylabel('Actual')
plt.xlabel('Predicted')
plt.title('XGBoost Normalized Confusion Matrix')
plt.show(block=True)

XGBoost ROC Curve
Y_test_probs = xgb_clf.predict_proba(X_test)

skplt.metrics.plot_roc_curve(y_test, Y_test_probs,
                             title="XGBoost ROC Curve", figsize=(12,6));
Precision-Recall Curve
skplt.metrics.plot_precision_recall_curve(y_test, Y_test_probs,
                                          title="XGBoost Precision-Recall Curve", figsize=(12,6));
XGBoost KS Statistic Plot
Y_probas = xgb_clf.predict_proba(X_test)
skplt.metrics.plot_ks_statistic(y_test, Y_probas, figsize=(10,6));
XGBoost Cumulative Gains Curve
skplt.metrics.plot_cumulative_gain(y_test, Y_probas, figsize=(10,6));
XGBoost Lift Curve
skplt.metrics.plot_lift_curve(y_test, Y_probas, figsize=(10,6));
K-Means PCA Clusters & Variance Analysis
Elbow Plot
skplt.cluster.plot_elbow_curve(KMeans(random_state=1),
                               X_train,
                               cluster_ranges=range(2, 20),
                               figsize=(8,6));
Silhouette Analysis
kmeans = KMeans(n_clusters=10, random_state=1)
kmeans.fit(X_train, y_train)
cluster_labels = kmeans.predict(X_test)

skplt.metrics.plot_silhouette(X_test, cluster_labels,
                              figsize=(8,6));
PCA Component Explained Variances
pca = PCA(random_state=1)
pca.fit(X_train)

skplt.decomposition.plot_pca_component_variance(pca, figsize=(8,6));
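As a quick numeric companion to the explained-variance plot, one common heuristic is to count how many principal components are needed to reach a target share of the variance; a minimal sketch using the fitted pca object (the 95% threshold is an arbitrary choice):
import numpy as np

cum_var = np.cumsum(pca.explained_variance_ratio_)
n_components_95 = int(np.argmax(cum_var >= 0.95)) + 1  # first index where the cumulative share crosses 95%
print(f"{n_components_95} components explain {cum_var[n_components_95 - 1]:.1%} of the variance")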
PCA 2-D Projection
skplt.decomposition.plot_pca_2d_projection(pca, X_train, y_train,
                                           figsize=(10,10),
                                           cmap="tab10");
Final XGBoost Classification Report
from yellowbrick.classifier import ClassificationReport

viz = ClassificationReport(xgb_clf,
                           classes=target_names,
                           support=True,
                           fig=plt.figure(figsize=(8,6)))
viz.fit(X_train, y_train)
viz.score(X_test, y_test)
viz.show();
Summary of sklearn.metrics for XGBoost
from sklearn.metrics import class_likelihood_ratios
class_likelihood_ratios(y_test,y_test_pred)
(11604.261029411764, 0.18383645940293095)
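For reference, the two numbers returned by class_likelihood_ratios (LR+ and LR-) can be reproduced from the confusion matrix; a short sketch, assuming sklearn's usual binary layout [[TN, FP], [FN, TP]] (the exact values depend on which fitted model produced y_test_pred):
from sklearn.metrics import confusion_matrix

tn, fp, fn, tp = confusion_matrix(y_test, y_test_pred).ravel()
sensitivity = tp / (tp + fn)                 # recall of the positive (fraud) class
specificity = tn / (tn + fp)
lr_plus = sensitivity / (1 - specificity)    # positive likelihood ratio (LR+)
lr_minus = (1 - sensitivity) / specificity   # negative likelihood ratio (LR-)
print(lr_plus, lr_minus)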
from sklearn.metrics import balanced_accuracy_score
balanced_accuracy_score(y_test, y_test_pred)
0.9080530681917697
from sklearn.metrics import cohen_kappa_score
cohen_kappa_score(y_test, y_test_pred)
0.8772897054030634
from sklearn.metrics import hinge_loss
hinge_loss(y_test, y_test_pred)
0.9987711105649381
from sklearn.metrics import matthews_corrcoef
matthews_corrcoef(y_test, y_test_pred)
0.8797814646276692
from sklearn.metrics import roc_auc_score
roc_auc_score(y_test, y_test_pred)
0.9080530681917698
from sklearn.metrics import top_k_accuracy_score
top_k_accuracy_score(y_test, y_test_pred)
1.0
from sklearn.metrics import hamming_loss
hamming_loss(y_test, y_test_pred)
0.0003628149760659153
from sklearn.metrics import jaccard_score
jaccard_score(y_test, y_test_pred)
0.7816901408450704
from sklearn.metrics import log_loss
log_loss(y_test, y_test_pred)
0.013077177241700906
from sklearn.metrics import zero_one_loss
zero_one_loss(y_test, y_test_pred)
0.0003628149760659394
from sklearn.metrics import precision_recall_fscore_support
precision_recall_fscore_support(y_test, y_test_pred)
(array([0.99970701, 0.94871795]), array([0.99992967, 0.81617647]), array([0.99981832, 0.87747036]), array([85307, 136], dtype=int64))
from sklearn.metrics import f1_score
f1_score(y_test, y_test_pred)
0.8774703557312252
from sklearn.metrics import precision_score
precision_score(y_test, y_test_pred)
0.9487179487179487
from sklearn.metrics import recall_score
recall_score(y_test, y_test_pred)
0.8161764705882353
from sklearn.metrics import fbeta_score
fbeta_score(y_test, y_test_pred, beta=0.5)
0.9188741721854304
Discrimination Threshold
from yellowbrick.classifier import DiscriminationThreshold

viz = DiscriminationThreshold(xgb_clf,
                              classes=target_names,
                              cv=0.2,
                              fig=plt.figure(figsize=(9,6)))
viz.fit(X_train, y_train)
viz.score(X_test, y_test)
viz.show();
Summary
- This project presents a comparison of some established supervised ML and ANN algorithms to differentiate between genuine and fraudulent transactions.
- Multiple ML/DL models are built and fit on the training set, followed by comprehensive cross-validation and QC performance tests.
- We found that the XGBoost model performs best, with FP ≈ 18% and F1- and ROC-AUC scores of about 88% and 91%, respectively.
- Overfitting challenge: the model shows low bias but high variance, a symptom of an overly complex model. This can be mitigated by fitting several candidate models and using validation or cross-validation to compare their predictive performance on held-out data (a minimal cross-validation sketch follows this list).
- In principle, the obtained model can be applied in anti-fraud monitoring systems, or a similar CCFD model development process can be performed in related business areas to detect fraud.
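A minimal sketch of the cross-validation check mentioned in the summary: stratified folds keep the rare fraud class represented in every split (the choice of F1 as the scoring metric and of 5 folds are assumptions):
from sklearn.model_selection import StratifiedKFold, cross_val_score

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_f1 = cross_val_score(xgb_clf, X_train, y_train, cv=cv, scoring="f1", n_jobs=-1)
print(f"CV F1: {cv_f1.mean():.3f} +/- {cv_f1.std():.3f}")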
Explore More
-
Datapane Stock Screener App from Scratch
Photo by Carl Heyerdahl on Unsplash
- Referring to the recent review of best stock screeners, let’s focus on the Datapane stock screener API.
- This API is built around the concept of Blocks, which are Python objects that represent an individual unit that can be processed, composed, and viewed.
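As a tiny illustration of the Blocks idea (a sketch, not part of the app built below; the block types are the same ones used later in this post), a few blocks can be composed into a single view and then saved or served with the helpers shown further down:
import pandas as pd
import datapane as dp

df_demo = pd.DataFrame({"ticker": ["MSFT", "GOOG"], "close": [320.5, 125.3]})  # toy data

view = dp.View(
    dp.Text("# Demo view"),                               # a markdown text block
    dp.Group(                                             # lays blocks out in columns
        dp.BigNumber(heading="Tickers", value=len(df_demo)),
        dp.DataTable(df_demo),                            # an interactive table block
        columns=2,
    ),
)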
Let’s install Datapane
!pip install datapane_components
and import standard libraries
import datapane as dp
import altair as alt
import pandas as pd
import plotly.express as px
import yfinance as yf

from datetime import datetime
import threading
from time import sleep

Let's set the stock ticker
ticker = 'MSFT'
and download the stock Adj Close price in USD
df = pd.DataFrame(yf.download(ticker, period="1d", interval="5m")["Adj Close"]).reset_index()
df.columns = ["time", "adjusted_close"]
[*********************100%***********************]  1 of 1 completed
df.tail()
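Before building the full app, a quick sanity check of the intraday series can be done with plain pandas (a small sketch using the df just downloaded):
last_price = df["adjusted_close"].iloc[-1]
day_return = df["adjusted_close"].iloc[-1] / df["adjusted_close"].iloc[0] - 1
print(f"{ticker}: last={last_price:.2f}, intraday return={day_return:.2%}, bars={len(df)}")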
"""
Stock Portfolio Analysis
A Datapane app that analyses stock portfolio data.
"""
from datetime import date, timedelta

import pandas as pd
import numpy as np
import plotly.figure_factory as ff
import plotly.graph_objs as go
import plotly.express as px
import plotly.io as pio
import datapane as dp
import datapane_components as dc
import yfinance as yf

pio.templates.default = "ggplot2"
dp.enable_logging()
def process(ticker1: str, ticker2: str = "GOOG") -> dp.View:
    print(f"Downloading datasets for {ticker1} vs {ticker2}")
    start = date.today() - timedelta(365)
    end = date.today() + timedelta(2)

    # get the data
    df1 = yf.download(ticker1, start=start, end=end, interval="1d").reset_index()
    df2 = yf.download(ticker2, start=start, end=end, interval="1d").reset_index()
    df1["Date"] = pd.to_datetime(df1["Date"], format="%d/%m/%Y")
    df2["Date"] = pd.to_datetime(df2["Date"], format="%d/%m/%Y")
    orig_df1 = df1.copy(deep=True)

    # Build all the visualizations
    print("Creating plots...")
    trace0 = go.Scatter(x=df1.Date, y=df1.Close, name=ticker1)
    fig0 = go.Figure([trace0])
    fig0.update_layout(title={"text": f"{ticker1} Stock Price", "x": 0.5, "xanchor": "center"})

    df1["10-day MA"] = df1["Close"].rolling(window=10).mean()
    df1["20-day MA"] = df1["Close"].rolling(window=20).mean()
    df1["50-day MA"] = df1["Close"].rolling(window=50).mean()
    trace0 = go.Scatter(x=df1.Date, y=df1.Close, name=ticker1)
    trace1 = go.Scatter(x=df1.Date, y=df1["10-day MA"], name="10-day MA")
    trace2 = go.Scatter(x=df1.Date, y=df1["20-day MA"], name="20-day MA")
    fig1 = go.Figure([trace0, trace1, trace2])
    fig1.update_layout(title={"text": f"{ticker1} Stock Price (Rolling Average)", "x": 0.5, "xanchor": "center"})

    fig2 = go.Figure(go.Candlestick(x=df1.Date, open=df1.Open, high=df1.High, low=df1.Low, close=df1.Close))
    fig2.update_layout(title={"text": f"{ticker1} Stock Price (Candle Stick)", "x": 0.5, "xanchor": "center"})

    df1["Daily return (%)"] = round(df1["Close"].pct_change() * 100, 2)
    fig3 = px.bar(df1, x="Date", y="Daily return (%)")
    fig3.update_layout(title={"text": f"{ticker1} Stock Daily Return", "x": 0.5, "xanchor": "center"})

    trace0 = go.Scatter(x=df1.Date, y=df1.Close, name=ticker1, line=dict(color="lime"))
    trace1 = go.Scatter(x=df2.Date, y=df2.Close, name=ticker2, line=dict(color="red"))
    fig4 = go.Figure([trace0, trace1])
    fig4.update_layout(
        dict(
            title=dict({"text": f"{ticker1} vs {ticker2} Stock Price", "x": 0.5, "xanchor": "center"}),
            xaxis=dict(
                rangeselector=dict(
                    buttons=list(
                        [
                            dict(count=1, label="1m", step="month", stepmode="backward"),
                            dict(count=3, label="3m", step="month", stepmode="backward"),
                            dict(count=6, label="6m", step="month", stepmode="backward"),
                            dict(count=12, label="12m", step="month", stepmode="backward"),
                        ]
                    )
                ),
                rangeslider=dict(visible=True),
                type="date",
            ),
        )
    )

    trace0 = go.Scatterpolar(
        r=[df1["Close"].mean(), df1["Open"].min(), df1["Low"].min(), df1["High"].max()],
        theta=["Close", "Open", "Low", "High"],
        line=dict(color="lime"),
        name=ticker1,
        fill="toself",
    )
    trace1 = go.Scatterpolar(
        r=[df2["Close"].mean(), df2["Open"].min(), df2["Low"].min(), df2["High"].max()],
        theta=["Close", "Open", "Low", "High"],
        line=dict(color="red"),
        name=ticker2,
        fill="toself",
    )
    fig5 = go.Figure([trace0, trace1])
    fig5.update_layout(
        go.Layout(
            polar=dict(radialaxis=dict(visible=True)),
            title=dict({"text": f"{ticker1} vs {ticker2} Stock Price (Radar Chart)", "x": 0.5, "xanchor": "center"}),
        )
    )

    df1 = df1[["Open", "Close", "Volume"]]
    df1["index"] = np.arange(len(df1))
    fig7 = go.Figure(
        ff.create_scatterplotmatrix(
            df1, diag="box", index="index", size=3, height=600, width=1150, colormap="RdBu",
            title={"text": f"{ticker1} Stock Price (Scatterplot Matrix)", "x": 0.5, "xanchor": "center"},
        )
    )

    # Building the output Datapane View
    # Now that we have a series of plots, we can construct our custom View.
    # In addition to the visualizations, this View includes Datapane's `BigNumber` component to
    # display today's stock prices, and our `DataTable` component to allow our viewers to filter,
    # explore, and download the data themselves.
    # We can build a more powerful report by using Datapane's layout components,
    # e.g. using `Group` to place the `BigNumber` blocks in two columns, and a `Select` block to add multiple tabs.
    ticker1_today = df1.iloc[-1]
    ticker2_today = df2.iloc[-1]

    return dp.View(
        dp.Group(
            f"""## {ticker1} analysis (against {ticker2})""",
            dp.Group(
                dp.BigNumber(
                    heading=f"{ticker1} Day Performance",
                    value="${:,.2f}".format(ticker1_today.Close),
                    prev_value="${:,.2f}".format(ticker1_today.Open),
                ),
                dp.BigNumber(
                    heading=f"{ticker2} Day Performance",
                    value="${:,.2f}".format(ticker2_today.Close),
                    prev_value="${:,.2f}".format(ticker2_today.Open),
                ),
                columns=2,
            ),
            dp.Group(fig0, fig1, fig2, fig3, fig4, fig5, columns=2),
            dp.Plot(fig7),
            # datasets
            *dc.section(
                """
# Datasets
_These datasets are pulled live_.
                """
            ),
            dp.Select(dp.DataTable(orig_df1, label=ticker1), dp.DataTable(df2, label=ticker2)),
            label=f"{ticker1} vs {ticker2}",
        ),
    )
controls = dp.Controls(
    ticker1=dp.TextBox(label="Ticker", initial="MSFT"),
    ticker2=dp.TextBox(label="(Optional) Comparison Ticker", initial="GOOG", allow_empty=True),
)

v = dp.View(
    dp.Compute(
        function=process, controls=controls, label="Choose Ticker", cache=True, target="results", swap=dp.Swap.APPEND
    ),
    dp.Select(name="results"),
)

dp.serve_app(v)
Summary
This is a quick guide for value investors who want to use Datapane to build their own value-investing stock screener app in about 10 minutes.
Explore More
An Interactive GPT Index and DeepLake Interface – 1. Amazon Financial Statements
Working with FRED API in Python: U.S. Recession Forecast & Beyond
Advanced Integrated Data Visualization (AIDV) in Python – 1. Stock Technical Indicators
The Donchian Channel vs Buy-and-Hold Breakout Trading Systems – $MO Use-Case
A Comparative Analysis of The 3 Best U.S. Growth Stocks in Q1’23 – 1. WMT
Python Technical Analysis for BioTech – Get Buy Alerts on ABBV in 2023
Basic Stock Price Analysis in Python
-
Unsupervised ML Clustering, Customer Segmentation, Cohort, Market Basket, Bank Churn, CRM, ABC & RFM Analysis – A Comprehensive Guide in Python
- The goal of this post is to describe the most popular, easy-to-use, and effective data-driven techniques that help organisations streamline their operations, build effective relationships with individual people (customers, service users, colleagues, or suppliers), increase sales, improve customer service, and increase profitability.
- Customer Segmentation (CS) is a core concept in this technology. This is the process of tagging and grouping customers based on shared characteristics. This process also makes it easy to tailor and personalize your marketing, service, and sales efforts to the needs of specific groups. The result is a potential boost to customer loyalty and conversions.
- This article represents a comprehensive step-by-step CS guide in Python that helps you manage customer relationships across the entire customer lifecycle, at every marketing, sales, e-commerce, and customer service interaction.
- Our guide consists of hands-on case examples and best industry practices with open-source data and Python source codes: mall customers, online retail, marketing campaigns, supply chain, bank churn, groceries market, and e-commerce.
Table of Clickable Contents
- Motivation
- Methods
- Open-Source Datasets
- Mall Customer Segmentation
- Online Retail K-Means Clustering
- Online Retail Data Analytics
- RFM Customer Segmentation
- RFM TreeMap: Online Retail Dataset II
- CRM Analytics: CLTV
- Cohort Analysis in Online Retail
- Customer Segmentation, RFM, K-Means and Cohort Analysis in Online Retail
- Customer Clustering PCA, K-Means and Agglomerative for Marketing Campaigns
- Supply Chain RFM & ABC Analysis
- Bank Churn ML Prediction
- Groceries Market Basket & RFM Analysis
- Largest E-Commerce Showcase in Pakistan
- Conclusions
- Explore More
Motivation
- Uniform communication strategies that are not customer centric have failed to exploit millions of dollars in opportunities at the individual and segment level.
- The more comprehensive CS-based modeling provides executives and marketers with the ability to identify high-profit customers and micro-segments that will drive increases in traffic, sales, profit and retention.
Methods
- We will use the following techniques: unsupervised ML clustering, customer segmentation, cohort, market basket, bank churn, CRM data analytics, ABC & RFM analysis.
- ABC analysis is a tool that allows companies to divide their inventory or customers into groups based on their respective values to the company. The goal of this type of analysis is to ensure that businesses are optimizing their time and resources when serving customers.
- What is RFM (recency, frequency, monetary) analysis? RFM analysis is a marketing technique used to quantitatively rank and group customers based on the recency, frequency and monetary total of their recent transactions, in order to identify the best customers and run targeted marketing campaigns. A minimal pandas sketch of these three metrics follows this list.
- Cohort analysis is a type of behavioral analytics in which you take a group of users, and analyze their usage patterns based on their shared traits to better track and understand their actions. A cohort is simply a group of people with shared characteristics.
- Market Basket Analysis is one of the key techniques used by large retailers to uncover associations between items. It works by looking for combinations of items that occur together frequently in transactions. To put it another way, it allows retailers to identify relationships between the items that people buy.
- Customer Churn prediction means knowing which customers are likely to leave or unsubscribe from your service. Customers have different behaviors and preferences, and reasons for cancelling their subscriptions. Therefore, it is important to actively communicate with each of them to keep them on your customer list. You need to know which marketing activities are most effective for individual customers and when they are most effective.
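To make the RFM definition above concrete, here is a toy sketch on a tiny, made-up transactions table (the column names and values are illustrative assumptions, not one of the datasets described below):
import pandas as pd

tx = pd.DataFrame({
    "customer_id": [1, 1, 2, 3, 3, 3],
    "date": pd.to_datetime(["2023-01-02", "2023-02-10", "2023-02-01",
                            "2023-01-15", "2023-02-20", "2023-03-01"]),
    "amount": [50.0, 20.0, 300.0, 10.0, 15.0, 12.0],
})
snapshot = tx["date"].max() + pd.Timedelta(days=1)  # reference date for recency

rfm = tx.groupby("customer_id").agg(
    recency=("date", lambda d: (snapshot - d.max()).days),  # days since last purchase
    frequency=("date", "count"),                            # number of transactions
    monetary=("amount", "sum"),                             # total spend
)
print(rfm)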
Open-Source Datasets
This file contains the basic information (ID, age, gender, income, and spending score) about the customers.
Online retail is a transnational data set which contains all the transactions occurring between 01/12/2010 and 09/12/2011 for a UK-based and registered non-store online retail. The company mainly sells unique all-occasion gifts. Many customers of the company are wholesalers.
We will be using the online retail transnational dataset to build RFM clustering and choose the best set of customers which the company should target.
Online Retail II dataset contains all the transactions occurring for a UK-based and registered, non-store online retail. This is a real online retail transaction data set of two years between 01/12/2009 and 09/12/2011.
The dataset was used in the Customer Personality Analysis (CPA). CPA is a detailed analysis of a company's ideal customers. It helps a business to better understand its customers and makes it easier to modify products according to the specific needs, behaviours and concerns of different types of customers.
Supply Chain for Big Data Analysis (DataCo)
Areas of registered activities: provisioning, production, sales, and commercial distribution.
Data types:
Structured data: DataCoSupplyChainDataset.csv
Unstructured data: tokenized_access_logs.csv (clickstream)
Product categories: clothing, sports, and electronic supplies.
A companion file, DescriptionDataCoSupplyChain.csv, describes each variable of DataCoSupplyChainDataset.csv.
- Groceries Dataset Groceries_dataset.csv
3 columns: ID of customer, Date of purchase, and Description of product purchased.
Association Rules are widely used to analyze retail basket or transaction data and are intended to identify strong rules discovered in transaction data using measures of interestingness, based on the concept of strong rules.
Half a million transaction records from e-commerce sales in Pakistan.
Attributes
People
- ID: Customer’s unique identifier
- Year_Birth: Customer’s birth year
- Education: Customer’s education level
- Marital_Status: Customer’s marital status
- Income: Customer’s yearly household income
- Kidhome: Number of children in customer’s household
- Teenhome: Number of teenagers in customer’s household
- Dt_Customer: Date of customer’s enrollment with the company
- Recency: Number of days since customer’s last purchase
- Complain: 1 if the customer complained in the last 2 years, 0 otherwise
Products
- MntWines: Amount spent on wine in last 2 years
- MntFruits: Amount spent on fruits in last 2 years
- MntMeatProducts: Amount spent on meat in last 2 years
- MntFishProducts: Amount spent on fish in last 2 years
- MntSweetProducts: Amount spent on sweets in last 2 years
- MntGoldProds: Amount spent on gold in last 2 years
Promotion
- NumDealsPurchases: Number of purchases made with a discount
- AcceptedCmp1: 1 if customer accepted the offer in the 1st campaign, 0 otherwise
- AcceptedCmp2: 1 if customer accepted the offer in the 2nd campaign, 0 otherwise
- AcceptedCmp3: 1 if customer accepted the offer in the 3rd campaign, 0 otherwise
- AcceptedCmp4: 1 if customer accepted the offer in the 4th campaign, 0 otherwise
- AcceptedCmp5: 1 if customer accepted the offer in the 5th campaign, 0 otherwise
- Response: 1 if customer accepted the offer in the last campaign, 0 otherwise
Place
- NumWebPurchases: Number of purchases made through the company's website
- NumCatalogPurchases: Number of purchases made using a catalogue
- NumStorePurchases: Number of purchases made directly in stores
- NumWebVisitsMonth: Number of visits to the company's website in the last month
Scope: Bank Churn Modelling, Bank Customers Churn, Churn Prediction of Bank Customers
Mall Customer Segmentation
Let’s set the working directory YOURPATH
import os
os.chdir('YOURPATH')
os.getcwd()
Importing libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

import plotly.express as px
import plotly.graph_objs as go
import missingno as msno

from sklearn.cluster import KMeans
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.cluster import AgglomerativeClustering
from sklearn.cluster import DBSCAN
from sklearn import metrics
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

import warnings
warnings.filterwarnings('ignore')

plt.style.use('fivethirtyeight')
%matplotlib inline

Let's read the input dataset
df = pd.read_csv('Mall_Customers.csv')
and check the content
df.head()
Let's check the descriptive statistics
df.describe().T
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 5 columns):
 #   Column                  Non-Null Count  Dtype
---  ------                  --------------  -----
 0   CustomerID              200 non-null    int64
 1   Gender                  200 non-null    object
 2   Age                     200 non-null    int64
 3   Annual Income (k$)      200 non-null    int64
 4   Spending Score (1-100)  200 non-null    int64
dtypes: int64(4), object(1)
memory usage: 7.9+ KB
Let’s drop the column
df.drop('CustomerID', axis = 1, inplace = True)
and plot the following distribution plots
plt.figure(figsize = (20, 8))
plotnumber = 1

for col in ['Age', 'Annual Income (k$)', 'Spending Score (1-100)']:
    if plotnumber <= 3:
        ax = plt.subplot(1, 3, plotnumber)
        sns.distplot(df[col])
    plotnumber += 1

plt.tight_layout()
plt.show()
Let's look at the Gender pie-plot
values = df['Gender'].value_counts()
labels = ['Male', 'Female']

fig, ax = plt.subplots(figsize = (4, 4), dpi = 100)
explode = (0, 0.06)

patches, texts, autotexts = ax.pie(values, labels = labels, autopct = '%1.2f%%', shadow = True,
                                   startangle = 90, explode = explode)
plt.setp(texts, color = 'red')
plt.setp(autotexts, size = 12, color = 'white')
autotexts[1].set_color('black')
plt.show()
Let's compare the corresponding violin plots
plt.figure(figsize = (20, 8))
plotnumber = 1

for col in ['Age', 'Annual Income (k$)', 'Spending Score (1-100)']:
    if plotnumber <= 3:
        ax = plt.subplot(1, 3, plotnumber)
        sns.violinplot(x = col, y = 'Gender', data = df)
    plotnumber += 1

plt.tight_layout()
plt.show()
Let's compare different age groups
age_18_25 = df.Age[(df.Age >= 18) & (df.Age <= 25)]
age_26_35 = df.Age[(df.Age >= 26) & (df.Age <= 35)]
age_36_45 = df.Age[(df.Age >= 36) & (df.Age <= 45)]
age_46_55 = df.Age[(df.Age >= 46) & (df.Age <= 55)]
age_55above = df.Age[df.Age >= 55]

x_age = ['18-25', '26-35', '36-45', '46-55', '55+']
y_age = [len(age_18_25.values), len(age_26_35.values), len(age_36_45.values), len(age_46_55.values),
         len(age_55above.values)]

px.bar(data_frame = df, x = x_age, y = y_age, color = x_age,
       title = 'Number of customers per age group')
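The same age binning can be written more compactly with pd.cut; an approximately equivalent sketch (the bin edges are chosen to mirror the manual groups above):
age_bins = pd.cut(df['Age'], bins=[17, 25, 35, 45, 55, df['Age'].max()],
                  labels=['18-25', '26-35', '36-45', '46-55', '55+'])
age_counts = age_bins.value_counts().sort_index()
px.bar(x=age_counts.index.astype(str), y=age_counts.values, color=age_counts.index.astype(str),
       title='Number of customers per age group (pd.cut)')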
Let's plot the Relation between Annual Income and Spending Score
fig = px.scatter(data_frame = df, x = 'Annual Income (k$)', y = 'Spending Score (1-100)', color = "Age",
                 title = 'Relation between Annual Income and Spending Score')
fig.update_traces(marker=dict(size=18,
                              line=dict(width=1,
                                        color='DarkSlateGrey')),
                  selector=dict(mode='markers'))
Let's look at the Number of customers per Spending Score group
ss_1_20 = df['Spending Score (1-100)'][(df['Spending Score (1-100)'] >= 1) &
                                       (df['Spending Score (1-100)'] <= 20)]
ss_21_40 = df['Spending Score (1-100)'][(df['Spending Score (1-100)'] >= 21) &
                                        (df['Spending Score (1-100)'] <= 40)]
ss_41_60 = df['Spending Score (1-100)'][(df['Spending Score (1-100)'] >= 41) &
                                        (df['Spending Score (1-100)'] <= 60)]
ss_61_80 = df['Spending Score (1-100)'][(df['Spending Score (1-100)'] >= 61) &
                                        (df['Spending Score (1-100)'] <= 80)]
ss_81_100 = df['Spending Score (1-100)'][(df['Spending Score (1-100)'] >= 81) &
                                         (df['Spending Score (1-100)'] <= 100)]

x_ss = ['1-20', '21-40', '41-60', '61-80', '81-100']
y_ss = [len(ss_1_20.values), len(ss_21_40.values), len(ss_41_60.values), len(ss_61_80.values),
        len(ss_81_100.values)]

px.bar(data_frame = df, x = x_ss, y = y_ss, color = x_ss,
       title = 'Number of customers per Spending Score group')
Let's check the Number of customers per Annual Income group
ai_0_30 = df['Annual Income (k$)'][(df['Annual Income (k$)'] >= 0) & (df['Annual Income (k$)'] <= 30)]
ai_31_60 = df['Annual Income (k$)'][(df['Annual Income (k$)'] >= 31) & (df['Annual Income (k$)'] <= 60)]
ai_61_90 = df['Annual Income (k$)'][(df['Annual Income (k$)'] >= 61) & (df['Annual Income (k$)'] <= 90)]
ai_91_120 = df['Annual Income (k$)'][(df['Annual Income (k$)'] >= 91) & (df['Annual Income (k$)'] <= 120)]
ai_121_150 = df['Annual Income (k$)'][(df['Annual Income (k$)'] >= 121) & (df['Annual Income (k$)'] <= 150)]

x_ai = ['$ 0-30,000', '$ 30,001-60,000', '$ 60,001-90,000', '$ 90,001-120,000', '$ 120,000-150,000']
y_ai = [len(ai_0_30.values), len(ai_31_60.values), len(ai_61_90.values), len(ai_91_120.values),
        len(ai_121_150.values)]

px.bar(data_frame = df, x = x_ai, y = y_ai, color = x_ai,
       title = 'Number of customers per Annual Income group')
Let's perform K-means clustering using the Age and Spending Score (1-100) columns
X1 = df.loc[:, ['Age', 'Spending Score (1-100)']].values
wcss = []
for k in range(1, 11):
    kmeans = KMeans(n_clusters = k, init = 'k-means++')
    kmeans.fit(X1)
    wcss.append(kmeans.inertia_)

plt.figure(figsize = (12, 7))
plt.plot(range(1, 11), wcss, linewidth = 2, marker = '8')
plt.title('Elbow Plot\n', fontsize = 20)
plt.xlabel('K')
plt.ylabel('WCSS')
plt.show()
Let's set n_clusters = 4 and run
kmeans = KMeans(n_clusters = 4)
labels = kmeans.fit_predict(X1)

Let's plot customer clusters in the domain Age vs Spending Score (1-100)
plt.figure(figsize = (14, 8))
plt.scatter(X1[:, 0], X1[:, 1], c = kmeans.labels_, s = 200)
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], color = 'red', s = 350)
plt.title('Clusters of Customers\n', fontsize = 20)
plt.xlabel('Age')
plt.ylabel('Spending Score (1-100)')
plt.show()
Let's perform K-means clustering using the Age and Annual Income (k$) columns
X2 = df.loc[:, ['Age', 'Annual Income (k$)']].values
wcss = []
for k in range(1, 11):
    kmeans = KMeans(n_clusters = k, init = 'k-means++')
    kmeans.fit(X2)
    wcss.append(kmeans.inertia_)

plt.figure(figsize = (12, 7))
plt.plot(range(1, 11), wcss, linewidth = 2, marker = '8')
plt.title('Elbow Plot\n', fontsize = 20)
plt.xlabel('K')
plt.ylabel('WCSS')
plt.show()
Let's set n_clusters = 5 and run
kmeans = KMeans(n_clusters = 5)
labels = kmeans.fit_predict(X2)

Let's plot Customer Clusters in the domain Annual Income (k$) vs Spending Score (1-100)
Let's perform K-means clustering using the Age, Annual Income (k$) and Spending Score (1-100) columns
X4 = df.iloc[:, 1:]

wcss = []
for k in range(1, 11):
    kmeans = KMeans(n_clusters = k, init = 'k-means++')
    kmeans.fit(X4)
    wcss.append(kmeans.inertia_)

plt.figure(figsize = (12, 7))
plt.plot(range(1, 11), wcss, linewidth = 2, marker = '8')
plt.title('Elbow Plot\n', fontsize = 20)
plt.xlabel('K')
plt.ylabel('WCSS')
plt.show()
Let's set n_clusters = 6 and run
kmeans = KMeans(n_clusters = 6)
clusters = kmeans.fit_predict(X4)
X4['label'] = clusters

Let's look at the corresponding 3D plot
fig = px.scatter_3d(X4, x="Annual Income (k$)", y="Spending Score (1-100)", z="Age",
                    color = 'label', size = 'label')
fig.show()
Let's plot the Hierarchical Clustering dendrogram
X3 = df.loc[:, ['Annual Income (k$)', 'Spending Score (1-100)']].values  # feature pair implied by the axis labels below

plt.figure(figsize = (22, 8))
dendo = dendrogram(linkage(X3, method = 'ward'))
plt.title('Dendrogram', fontsize = 15)
plt.show()
Let's run Agglomerative Clustering with n_clusters = 5
agc = AgglomerativeClustering(n_clusters = 5, affinity = 'euclidean', linkage = 'ward')
labels = agc.fit_predict(X3)

plt.figure(figsize = (12, 8))
smb = 200

plt.scatter(X3[labels == 0,0], X3[labels == 0,1], label = 'Cluster 1', s = smb)
plt.scatter(X3[labels == 1,0], X3[labels == 1,1], label = 'Cluster 2', s = smb)
plt.scatter(X3[labels == 2,0], X3[labels == 2,1], label = 'Cluster 3', s = smb)
plt.scatter(X3[labels == 3,0], X3[labels == 3,1], label = 'Cluster 4', s = smb)
plt.scatter(X3[labels == 4,0], X3[labels == 4,1], label = 'Cluster 5', s = smb)

plt.legend(loc = 'best')
plt.title('Clusters of Customers\n', fontsize = 20)
plt.xlabel('Annual Income (k$)')
plt.ylabel('Spending Score (1-100)')
plt.show()
Let's compare these results with the DBSCAN clustering algorithm
centers = [[1, 1], [-1, -1], [1, -1]]
X, labels_true = make_blobs(n_samples=750, centers=centers, cluster_std=0.4,
                            random_state=0)  # generate sample blobs
X = StandardScaler().fit_transform(X)
db = DBSCAN(eps=0.3, min_samples=10).fit(X)
Creating a boolean array of the same size as db.labels_, with the indices of the core samples set to True:
core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
core_samples_mask[db.core_sample_indices_] = True
labels = db.labels_
Similar to the model.fit() method, db.labels_ gives the cluster label of each point; the label -1 is treated as noise by the DBSCAN algorithm.
n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
n_noise_ = list(labels).count(-1)  # counting the noise points

print('Estimated number of clusters: %d' % n_clusters_)
print('Estimated number of noise points: %d' % n_noise_)

"""Homogeneity metric of a cluster labeling given a ground truth.
A clustering result satisfies homogeneity if all of its clusters
contain only data points which are members of a single class."""
print("Homogeneity: %0.3f" % metrics.homogeneity_score(labels_true, labels))
Estimated number of clusters: 3
Estimated number of noise points: 18
Homogeneity: 0.953
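Homogeneity only tells half the story; completeness and the V-measure (their harmonic mean) are natural companions, and both are available in the metrics module already imported above:
print("Completeness: %0.3f" % metrics.completeness_score(labels_true, labels))
print("V-measure   : %0.3f" % metrics.v_measure_score(labels_true, labels))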
Let’s plot the result representing 3 clusters + noise
plt.figure(figsize = (10, 8))
unique_labels = set(labels)  # identifying all the unique labels/clusters
colors = [plt.cm.Spectral(each)
          # creating the list of colours, generating the colourmap
          for each in np.linspace(0, 1, len(unique_labels))]

for k, col in zip(unique_labels, colors):
    if k == -1:
        # Black used for noise.
        col = [0, 0, 0, 1]
    class_member_mask = (labels == k)  # assigning class members for each class
    xy = X[class_member_mask & core_samples_mask]  # creating the list of core points for each class
    plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=tuple(col), markeredgecolor='k', markersize=14)
    xy = X[class_member_mask & ~core_samples_mask]  # creating the list of non-core/noise points
    plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=tuple(col), markeredgecolor='k', markersize=14)

plt.title('Clustering using DBSCAN\n', fontsize = 15)
plt.show()
Here black symbols designate noise.
Read more here.
Online Retail K-Means Clustering
Let’s continue using the dataset Mall_Customers.csv
from sklearn.preprocessing import StandardScaler
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

import os
import warnings
warnings.filterwarnings('ignore')

df = pd.read_csv('Mall_Customers.csv')
Let's rename the columns
df.rename(index=str, columns={'Annual Income (k$)': 'Income',
                              'Spending Score (1-100)': 'Score'}, inplace=True)
df.head()
Let's examine the input data using sns.pairplot
X = df.drop(['CustomerID', 'Gender'], axis=1)
sns.pairplot(df.drop('CustomerID', axis=1), hue='Gender', aspect=1.5)
plt.show()
Let's look at the K-Means Elbow plot
from sklearn.cluster import KMeans

clusters = []
for i in range(1, 11):
    km = KMeans(n_clusters=i).fit(X)
    clusters.append(km.inertia_)

fig, ax = plt.subplots(figsize=(12, 8))
sns.lineplot(x=list(range(1, 11)), y=clusters, ax=ax, linewidth = 4)
ax.set_title('Searching for Elbow')
ax.set_xlabel('Clusters')
ax.set_ylabel('Inertia')

ax.annotate('Possible Elbow Point', xy=(3, 140000), xytext=(3, 50000), xycoords='data',
            arrowprops=dict(arrowstyle='->', connectionstyle='arc3', color='red', lw=2))
ax.annotate('Possible Elbow Point', xy=(5, 80000), xytext=(5, 150000), xycoords='data',
            arrowprops=dict(arrowstyle='->', connectionstyle='arc3', color='red', lw=2))

plt.show()
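As a cross-check on the elbow heuristic, the average silhouette score can be computed for each candidate number of clusters; a small sketch on the same feature matrix X (the range of k is an arbitrary choice):
from sklearn.metrics import silhouette_score

for k in range(2, 11):
    labels_k = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    print(f"k={k}: silhouette={silhouette_score(X, labels_k):.3f}")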
Let's plot K-means with 3 clusters Income vs Score
km3 = KMeans(n_clusters=3).fit(X)
X['Labels'] = km3.labels_

plt.figure(figsize=(12, 8))
sns.relplot(x="Income", y="Score", hue="Labels", size="Income",
            sizes=(40, 400), alpha=.5, palette="muted",
            height=6, data=X)
plt.title('KMeans with 3 Clusters')
plt.show()
Let's plot K-means with 3 clusters Income-Age vs Score
km3 = KMeans(n_clusters=3).fit(X)
X['Labels'] = km3.labels_

plt.figure(figsize=(12, 8))
sns.relplot(x="Income", y="Score", hue="Labels", size="Age",
            sizes=(40, 400), alpha=.5, palette="muted",
            height=6, data=X)
plt.title('KMeans with 3 Clusters')
plt.show()
Let's plot K-means with 5 clusters Income vs Score
km3 = KMeans(n_clusters=5).fit(X)
X['Labels'] = km3.labels_

plt.figure(figsize=(12, 8))
sns.relplot(x="Income", y="Score", hue="Labels", size="Income",
            sizes=(40, 400), alpha=.5, palette="muted",
            height=6, data=X)
plt.title('KMeans with 5 Clusters')
plt.show()
Let's plot K-means with 5 clusters Income-Age vs Score
km3 = KMeans(n_clusters=5).fit(X)
X['Labels'] = km3.labels_

plt.figure(figsize=(12, 8))
sns.relplot(x="Income", y="Score", hue="Labels", size="Age",
            sizes=(40, 400), alpha=.5, palette="muted",
            height=6, data=X)
plt.title('KMeans with 5 Clusters')
plt.show()
Let's invoke sns.swarmplot to compare Labels According to Annual Income and Scoring History
fig = plt.figure(figsize=(20,8))
sns.set(font_scale=2)

ax = fig.add_subplot(121)
sns.swarmplot(x='Labels', y='Income', data=X, ax=ax, hue="Labels", legend=False)
ax.set_title('Labels According to Annual Income')

ax = fig.add_subplot(122)
sns.swarmplot(x='Labels', y='Score', data=X, ax=ax)
ax.set_title('Labels According to Scoring History')

plt.show()
Hierarchical Clustering:
from sklearn.cluster import AgglomerativeClustering

sns.set(font_scale=1)
agglom = AgglomerativeClustering(n_clusters=5, linkage='average').fit(X)
X['Labels'] = agglom.labels_

plt.figure(figsize=(12, 8))
sns.relplot(x="Income", y="Score", hue="Labels", size="Age",
            sizes=(40, 400), alpha=.5, palette=sns.color_palette('hls', 5),
            height=6, data=X)
plt.title('Agglomerative with 5 Clusters')
plt.show()
Let's compute the spatial distance matrix
from scipy.cluster import hierarchy
from scipy.spatial import distance_matrix

dist = distance_matrix(X, X)
Z = hierarchy.linkage(dist, 'complete')
and plot the dendrogram with linkage='complete'
plt.figure(figsize=(18, 50))
dendro = hierarchy.dendrogram(Z, leaf_rotation=0, leaf_font_size=12, orientation='right')
Let's plot the dendrogram with linkage='average'
Z = hierarchy.linkage(dist, 'average')
plt.figure(figsize=(18, 50))
dendro = hierarchy.dendrogram(Z, leaf_rotation=0, leaf_font_size=12, orientation='right')
Density Based Clustering (DBSCAN):
from sklearn.cluster import DBSCAN

db = DBSCAN(eps=11, min_samples=6).fit(X)
X['Labels'] = db.labels_

plt.figure(figsize=(12, 8))
sns.relplot(x="Income", y="Score", hue="Labels", size="Age",
            sizes=(40, 400), alpha=.5, palette=sns.color_palette('hls', 5),
            height=6, data=X)
plt.title('DBSCAN with epsilon 11, min samples 6')
plt.show()
Read more here.
Online Retail Data Analytics
Let’s look at the OnlineRetail.csv dataset
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import missingno as msno
df = pd.read_csv("OnlineRetail.csv", delimiter=',', encoding = "ISO-8859-1")
Let's check the data structure
df.head()
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 541909 entries, 0 to 541908
Data columns (total 8 columns):
 #   Column       Non-Null Count   Dtype
---  ------       --------------   -----
 0   InvoiceNo    541909 non-null  object
 1   StockCode    541909 non-null  object
 2   Description  540455 non-null  object
 3   Quantity     541909 non-null  int64
 4   InvoiceDate  541909 non-null  object
 5   UnitPrice    541909 non-null  float64
 6   CustomerID   406829 non-null  float64
 7   Country      541909 non-null  object
dtypes: float64(2), int64(1), object(5)
memory usage: 33.1+ MB
df.describe()
Data Cleaning: Checking for Null Values
msno.bar(df)
df.count()
InvoiceNo      541909
StockCode      541909
Description    540455
Quantity       541909
InvoiceDate    541909
UnitPrice      541909
CustomerID     406829
Country        541909
dtype: int64
df[df['CustomerID'].isnull()].count()
InvoiceNo      135080
StockCode      135080
Description    133626
Quantity       135080
InvoiceDate    135080
UnitPrice      135080
CustomerID          0
Country        135080
dtype: int64
Let’s check the percentage
100 - ((541909-135000)/541909 * 100)
24.911931708091203
So, approximately 25% of the data is missing.
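The same figure can be read off directly from the column itself (a one-line sketch):
print(f"Missing CustomerID: {df['CustomerID'].isna().mean() * 100:.2f}%")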
We will proceed with dropping the missing rows now.
df.dropna(inplace=True)
msno.bar(df)
Let’s perform further data editing
df['InvoiceDate'] = pd.to_datetime(df['InvoiceDate'], format='%d-%m-%Y %H:%M')
df['Total Amount Spent'] = df['Quantity'] * df['UnitPrice']

total_amount = df['Total Amount Spent'].groupby(df['CustomerID']).sum()
total_amount = pd.DataFrame(total_amount).reset_index()

transactions = df['InvoiceNo'].groupby(df['CustomerID']).count()
transaction = pd.DataFrame(transactions).reset_index()

final = df['InvoiceDate'].max()
df['Last_transact'] = final - df['InvoiceDate']
LT = df.groupby(df['CustomerID']).min()['Last_transact']
LT = pd.DataFrame(LT).reset_index()

df_new = pd.merge(total_amount, transaction, how='inner', on='CustomerID')
df_new = pd.merge(df_new, LT, how='inner', on='CustomerID')
df_new['Last_transact'] = df_new['Last_transact'].dt.days
Let’s run the K-Means Clustering Model
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=2)
kmeans.fit(df_new[['Total Amount Spent', 'InvoiceNo', 'Last_transact']])
pred = kmeans.predict(df_new[['Total Amount Spent', 'InvoiceNo', 'Last_transact']])

# wrap the ndarray in a named Series so that join() works
df_new = df_new.join(pd.Series(pred, name='cluster'), lsuffix="_left", rsuffix="_right")

error_rate = []
for clusters in range(1,16):
    kmeans = KMeans(n_clusters = clusters)
    kmeans.fit(df_new)
    kmeans.predict(df_new)
    error_rate.append(kmeans.inertia_)

error_rate = pd.DataFrame({'Cluster':range(1,16), 'Error':error_rate})
Let's plot the Error Rate and Clusters
plt.figure(figsize=(12,8))
p = sns.barplot(x='Cluster', y= 'Error', data= error_rate, palette='coolwarm_r')
sns.despine(left=True)
p.set_title('Error Rate and Clusters')
Let's drop 1 column
del df[‘InvoiceDate’]
df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 392692 entries, 0 to 541908
Data columns (total 9 columns):
 #   Column        Non-Null Count   Dtype
---  ------        --------------   -----
 0   InvoiceNo     392692 non-null  object
 1   StockCode     392692 non-null  object
 2   Description   392692 non-null  object
 3   Quantity      392692 non-null  int64
 4   UnitPrice     392692 non-null  float64
 5   CustomerID    392692 non-null  float64
 6   Country       392692 non-null  object
 7   cancellation  392692 non-null  int64
 8   TotalPrice    392692 non-null  float64
dtypes: float64(3), int64(2), object(4)
memory usage: 30.0+ MB
Let’s group the data
country_wise = df.groupby(‘Country’).sum()
and read the full country list
country_codes = pd.read_csv('wikipedia-iso-country-codes.csv', names=['Country', 'two', 'three', 'numeric', 'ISO'])
country_codes.head()
Let’s merge df and country_codes
country_wise = pd.merge(country_codes, country_wise, on='Country')
Let’s invoke plotly for plotting purposes
from plotly import __version__
import cufflinks as cf
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
init_notebook_mode(connected=True)
cf.go_offline()
import plotly.graph_objs as go

Let's plot European Countries According to Revenue
data = dict(type='choropleth', colorscale='GnBu', locations = country_wise['three'], locationmode = 'ISO-3', z = country_wise['Total Amount Spent'], text = country_wise['Country'], colorbar={'title':'Revenue'}, marker = dict(line=dict(width=0)))
layout = dict(title = 'European Countries According to Revenue!', geo = dict(scope='europe', showlakes=False, projection = {'type': 'winkel tripel'}))
Choromaps2 = go.Figure(data=[data], layout=layout)
iplot(Choromaps2)
Let's plot All Countries According to Revenue
data = dict(type='choropleth', colorscale='rainbow', locations = country_wise['three'], locationmode = 'ISO-3', z = country_wise['Total Amount Spent'], text = country_wise['Country'], colorbar={'title':'Revenue'}, marker = dict(line=dict(width=0)))
layout = dict(title = 'All Countries According to Revenue!', geo = dict(scope='world', showlakes=False, projection = {'type': 'winkel tripel'}))
Choromaps2 = go.Figure(data=[data], layout=layout)
iplot(Choromaps2)
Let's create the list of countries based upon the percentage of Total Price
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
df = pd.read_csv("OnlineRetail.csv", encoding= 'unicode_escape')
df = df.drop_duplicates()
df['cancellation'] = df.InvoiceNo.str.extract('([C])').fillna(0).replace({'C':1})
df.cancellation.value_counts()
cancellation
0    527390
1      9251
Name: count, dtype: int64
df[df.cancellation == 1]['CustomerID'].nunique() / df.CustomerID.nunique() * 100
36.34492223238792
df = df[df.CustomerID.notnull()]
df = df[(df.Quantity > 0) & (df.UnitPrice > 0)]
df = df.drop('cancellation', axis = 1)
df["TotalPrice"] = df.UnitPrice * df.Quantity
df2 = pd.DataFrame(df.groupby('Country').TotalPrice.sum().apply(lambda x: round(x, 2))).sort_values('TotalPrice', ascending = False)
df2['perc_of_TotalPrice'] = round(df2.TotalPrice / df2.TotalPrice.sum() * 100, 2)
df2
Read more here.
RFM Customer Segmentation
Let’s set the working directory YOURPATH
import os
os.chdir('YOURPATH')
os.getcwd()
and import the key libraries
!pip install termcolor
import numpy as np  # linear algebra
import pandas as pd  # data processing, CSV file I/O (e.g. pd.read_csv)

import ipywidgets
from ipywidgets import interact

import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
import matplotlib.ticker as mticker
import squarify as sq

import scipy.stats as stats
import statsmodels.api as sm
import statsmodels.formula.api as smf
import missingno as msno

import datetime as dt
from datetime import datetime

from sklearn.preprocessing import scale, StandardScaler, MinMaxScaler, RobustScaler
from sklearn.preprocessing import PolynomialFeatures
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import PowerTransformer
from sklearn.preprocessing import LabelEncoder
from sklearn.compose import make_column_transformer
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

import plotly.express as px
import cufflinks as cf
import plotly.offline
cf.go_offline()
cf.set_config_file(offline=False, world_readable=True)

import colorama
from colorama import Fore, Style  # makes strings colored
from termcolor import colored
from termcolor import cprint

from wordcloud import WordCloud

import warnings
warnings.filterwarnings("ignore")
warnings.warn("this will not show")

plt.rcParams["figure.figsize"] = (10,6)
pd.set_option('max_colwidth', 200)
pd.set_option('display.max_rows', 1000)
pd.set_option('display.max_columns', 200)
pd.set_option('display.float_format', lambda x: '%.3f' % x)
Let's define the following functions
##########################
def missing_values(df):
    missing_number = df.isnull().sum().sort_values(ascending=False)
    missing_percent = (df.isnull().sum()/df.isnull().count()).sort_values(ascending=False)
    missing_values = pd.concat([missing_number, missing_percent], axis=1, keys=['Missing_Number', 'Missing_Percent'])
    return missing_values[missing_values['Missing_Number']>0]

def first_looking(df):
    print(colored("Shape:", attrs=['bold']), df.shape, '\n',
          colored('*'*100, 'red', attrs=['bold']),
          colored("\nInfo:\n", attrs=['bold']), sep='')
    print(df.info(), '\n',
          colored('*'*100, 'red', attrs=['bold']), sep='')
    print(colored("Number of Uniques:\n", attrs=['bold']), df.nunique(), '\n',
          colored('*'*100, 'red', attrs=['bold']), sep='')
    print(colored("Missing Values:\n", attrs=['bold']), missing_values(df), '\n',
          colored('*'*100, 'red', attrs=['bold']), sep='')
    print(colored("All Columns:", attrs=['bold']), list(df.columns), '\n',
          colored('*'*100, 'red', attrs=['bold']), sep='')

    df.columns = df.columns.str.lower().str.replace('&', '_').str.replace(' ', '_')

    print(colored("Columns after rename:", attrs=['bold']), list(df.columns), '\n',
          colored('*'*100, 'red', attrs=['bold']), sep='')
    print(colored("Columns after rename:", attrs=['bold']), list(df.columns), '\n',
          colored('*'*100, 'red', attrs=['bold']), sep='')
    print(colored("Descriptive Statistics \n", attrs=['bold']), df.describe().round(2), '\n',
          colored('*'*100, 'red', attrs=['bold']), sep='')  # gives a statistical breakdown of the numeric columns
    print(colored("Descriptive Statistics (Categorical Columns) \n", attrs=['bold']), df.describe(include=object).T, '\n',
          colored('*'*100, 'red', attrs=['bold']), sep='')  # gives a statistical breakdown of the categorical columns

def multicolinearity_control(df):
    feature = []
    collinear = []
    for col in df.corr().columns:
        for i in df.corr().index:
            if (abs(df.corr()[col][i]) > .9 and abs(df.corr()[col][i]) < 1):
                feature.append(col)
                collinear.append(i)
                print(colored(f"Multicolinearity alert in between: {col} - {i}",
                              "red", attrs=['bold']), df.shape, '\n',
                      colored('*'*100, 'red', attrs=['bold']), sep='')

def duplicate_values(df):
    print(colored("Duplicate check...", attrs=['bold']), sep='')
    print("There are", df.duplicated(subset=None, keep='first').sum(), "duplicated observations in the dataset.")

def drop_null(df, limit):
    print('Shape:', df.shape)
    for i in df.isnull().sum().index:
        if (df.isnull().sum()[i]/df.shape[0]*100) > limit:
            print(df.isnull().sum()[i], 'percent of', i, 'null and were dropped')
            df.drop(i, axis=1, inplace=True)
            print('new shape:', df.shape)
    print('New shape after missing value control:', df.shape)

def first_look(col):
    print("column name    : ", col)
    print("--------------------------------")
    print("Per_of_Nulls   : ", "%", round(df[col].isnull().sum()/df.shape[0]*100, 2))
    print("Num_of_Nulls   : ", df[col].isnull().sum())
    print("Num_of_Uniques : ", df[col].nunique())
    print("Duplicates     : ", df.duplicated(subset=None, keep='first').sum())
    print(df[col].value_counts(dropna = False))

def fill_most(df, group_col, col_name):
    '''Fills the missing values with the most existing value (mode) in the relevant column according to single-stage grouping'''
    for group in list(df[group_col].unique()):
        cond = df[group_col] == group
        mode = list(df[cond][col_name].mode())
        if mode != []:
            df.loc[cond, col_name] = df.loc[cond, col_name].fillna(df[cond][col_name].mode()[0])
        else:
            df.loc[cond, col_name] = df.loc[cond, col_name].fillna(df[col_name].mode()[0])
    print("Number of NaN : ", df[col_name].isnull().sum())
    print("------------------")
    print(df[col_name].value_counts(dropna=False))
##########################
Let’s read and copy the input Online Retail dataset
df0 = pd.read_csv('Online Retail.csv', index_col=0)
df = df0.copy()
Let's check the dataset content
first_looking(df)
duplicate_values(df)Shape:(541909, 8) **************************************************************************************************** Info: <class 'pandas.core.frame.DataFrame'> Index: 541909 entries, 0 to 541908 Data columns (total 8 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 InvoiceNo 541909 non-null object 1 StockCode 541909 non-null object 2 Description 540455 non-null object 3 Quantity 541909 non-null float64 4 InvoiceDate 541909 non-null object 5 UnitPrice 541909 non-null float64 6 CustomerID 406829 non-null float64 7 Country 541909 non-null object dtypes: float64(3), object(5) memory usage: 37.2+ MB None **************************************************************************************************** Number of Uniques: InvoiceNo 25900 StockCode 4070 Description 4223 Quantity 722 InvoiceDate 23260 UnitPrice 1630 CustomerID 4372 Country 38 dtype: int64 **************************************************************************************************** Missing Values: Missing_Number Missing_Percent CustomerID 135080 0.249 Description 1454 0.003 **************************************************************************************************** All Columns:['InvoiceNo', 'StockCode', 'Description', 'Quantity', 'InvoiceDate', 'UnitPrice', 'CustomerID', 'Country'] **************************************************************************************************** Columns after rename:['invoiceno', 'stockcode', 'description', 'quantity', 'invoicedate', 'unitprice', 'customerid', 'country'] **************************************************************************************************** Columns after rename:['invoiceno', 'stockcode', 'description', 'quantity', 'invoicedate', 'unitprice', 'customerid', 'country'] **************************************************************************************************** Descriptive Statistics quantity unitprice customerid count 541909.000 541909.000 406829.000 mean 9.550 4.610 15287.690 std 218.080 96.760 1713.600 min -80995.000 -11062.060 12346.000 25% 1.000 1.250 13953.000 50% 3.000 2.080 15152.000 75% 10.000 4.130 16791.000 max 80995.000 38970.000 18287.000 **************************************************************************************************** Descriptive Statistics (Categorical Columns) count unique top freq invoiceno 541909 25900 573585.0 1114 stockcode 541909 4070 85123A 2313 description 540455 4223 WHITE HANGING HEART T-LIGHT HOLDER 2369 invoicedate 541909 23260 2011-10-31 14:41:00 1114 country 541909 38 United Kingdom 495478 **************************************************************************************************** Duplicate check... There are 5268 duplicated observations in the dataset.
Let’s calculate the total price
df['total_price'] = df['quantity'] * df['unitprice']
print("There are", df.shape[0], "observations and", df.shape[1], "columns in the dataset")
There are 541909 observations and 9 columns in the dataset
Let's plot Invoice Counts Per Country
fig = px.histogram(df, x = df.groupby('country')['invoiceno'].nunique().index,
                   y = df.groupby('country')['invoiceno'].nunique().values,
                   title = 'Invoice Counts Per Country',
                   labels = dict(x = "Countries", y = "Invoice"))
fig.show();
Let's group the data
df.groupby(['invoiceno', 'customerid', 'country'])['total_price'].sum().sort_values().head()
invoiceno  customerid  country
C581484    16446.000   United Kingdom   -168469.600
C541433    12346.000   United Kingdom    -77183.600
C556445    15098.000   United Kingdom    -38970.000
C550456    15749.000   United Kingdom    -22998.400
C570556    16029.000   United Kingdom    -11816.640
Name: total_price, dtype: float64
Let’s plot Customer Counts by Country
fig = px.histogram(df, x = df.groupby('country')['customerid'].nunique().index,
                   y = df.groupby('country')['customerid'].nunique().values,
                   title = 'Customer Counts by Country',
                   labels = dict(x = "Countries", y = "Customer"))
fig.show()
Let's check missing values
df.isnull().melt(value_name="missing")
Clean the Data from the Noise and Missing Values:
missing_values(df)
Drop observations with missing values:
df = df.dropna(subset=["customerid"])
Get rid of cancellations and negative values:
df = df[(df['unitprice'] > 0) & (df['quantity'] > 0)]
Check for duplicates and get rid of them:
print("There are", df.duplicated(subset=None, keep='first').sum(), "duplicated observations in the dataset.")
print(df.duplicated(subset=None, keep='first').sum(), "Duplicated observations are dropped!")
df.drop_duplicates(keep='first', inplace=True)
There are 5192 duplicated observations in the dataset.
5192 Duplicated observations are dropped!
df.columns
Index(['invoiceno', 'stockcode', 'description', 'quantity', 'invoicedate', 'unitprice', 'customerid', 'country', 'total_price'], dtype='object')
Explore the Orders:
cprint("Unique number of invoice per customer", 'red')
df.groupby("customerid")["invoiceno"].nunique().sort_values(ascending = False)
Unique number of invoice per customer
customerid
12748.000    209
14911.000    201
17841.000    124
13089.000     97
14606.000     93
             ...
15314.000      1
15313.000      1
15308.000      1
15307.000      1
15300.000      1
Name: invoiceno, Length: 4338, dtype: int64
Let’s plot the country-based wordcloud
from wordcloud import WordCloud

country_text = df["country"].str.split(" ").str.join("_")
all_countries = " ".join(country_text)

wc = WordCloud(background_color="red",
               max_words=250,
               max_font_size=256,
               random_state=42,
               width=800, height=400)
wc.generate(all_countries)
plt.figure(figsize = (12, 10))
plt.imshow(wc)
plt.axis('off')
plt.show()
Let's focus on the UK online retail market
df_uk = df[df["country"]=="United Kingdom"]
first_looking(df_uk)
duplicate_values(df_uk)Shape:(349203, 9) **************************************************************************************************** Info: <class 'pandas.core.frame.DataFrame'> Index: 349203 entries, 0 to 541893 Data columns (total 9 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 invoiceno 349203 non-null object 1 stockcode 349203 non-null object 2 description 349203 non-null object 3 quantity 349203 non-null float64 4 invoicedate 349203 non-null object 5 unitprice 349203 non-null float64 6 customerid 349203 non-null float64 7 country 349203 non-null object 8 total_price 349203 non-null float64 dtypes: float64(4), object(5) memory usage: 26.6+ MB None **************************************************************************************************** Number of Uniques: invoiceno 16646 stockcode 3645 description 3844 quantity 293 invoicedate 15612 unitprice 402 customerid 3920 country 1 total_price 2792 dtype: int64 **************************************************************************************************** Missing Values: Empty DataFrame Columns: [Missing_Number, Missing_Percent] Index: [] **************************************************************************************************** All Columns:['invoiceno', 'stockcode', 'description', 'quantity', 'invoicedate', 'unitprice', 'customerid', 'country', 'total_price'] **************************************************************************************************** Columns after rename:['invoiceno', 'stockcode', 'description', 'quantity', 'invoicedate', 'unitprice', 'customerid', 'country', 'total_price'] **************************************************************************************************** Columns after rename:['invoiceno', 'stockcode', 'description', 'quantity', 'invoicedate', 'unitprice', 'customerid', 'country', 'total_price'] **************************************************************************************************** Descriptive Statistics quantity unitprice customerid total_price count 349203.000 349203.000 349203.000 349203.000 mean 12.150 2.970 15548.380 20.860 std 190.630 17.990 1594.380 328.420 min 1.000 0.000 12346.000 0.000 25% 2.000 1.250 14191.000 4.200 50% 4.000 1.950 15518.000 10.200 75% 12.000 3.750 16931.000 17.850 max 80995.000 8142.750 18287.000 168469.600 **************************************************************************************************** Descriptive Statistics (Categorical Columns) count unique top freq invoiceno 349203 16646 576339.0 542 stockcode 349203 3645 85123A 1936 description 349203 3844 WHITE HANGING HEART T-LIGHT HOLDER 1929 invoicedate 349203 15612 2011-11-14 15:27:00 542 country 349203 1 United Kingdom 349203 **************************************************************************************************** Duplicate check... There are 0 duplicated observations in the dataset.
Let’s prepare the date for the RFM analysis
import datetime as dt
from datetime import datetime

df_uk["ref_date"] = pd.to_datetime(df_uk['invoicedate']).dt.date.max()
df_uk['ref_date'] = pd.to_datetime(df_uk['ref_date'])
df_uk["date"] = pd.to_datetime(df_uk["invoicedate"]).dt.date
df_uk['date'] = pd.to_datetime(df_uk['date'])
df_uk['last_purchase_date'] = df_uk.groupby('customerid')['invoicedate'].transform(max)
df_uk['last_purchase_date'] = pd.to_datetime(df_uk['last_purchase_date']).dt.date
df_uk['last_purchase_date'] = pd.to_datetime(df_uk['last_purchase_date'])

df_uk.groupby('customerid')[['last_purchase_date']].max().sample(5)

df_uk["customer_recency"] = df_uk["ref_date"] - df_uk["last_purchase_date"]
df_uk['recency_value'] = pd.to_numeric(df_uk['customer_recency'].dt.days.astype('int64'))

customer_recency = pd.DataFrame(df_uk.groupby('customerid')['recency_value'].min())
customer_recency.rename(columns={'recency_value':'recency'}, inplace=True)
customer_recency.reset_index(inplace=True)

df_uk.drop('last_purchase_date', axis=1, inplace=True)
Let’s plot Customer Regency Distribution
fig = px.histogram(df_uk, x = ‘recency_value’, title = ‘Customer Regency Distribution’)
fig.show()Let’s compute Frequency – Number of purchases
df_uk1 = df_uk.copy()
customer_frequency = pd.DataFrame(df_uk.groupby(‘customerid’)[‘invoiceno’].nunique())
customer_frequency.rename(columns={‘invoiceno’:’frequency’}, inplace=True)
customer_frequency.reset_index(inplace=True)
df_uk['customer_frequency'] = df_uk.groupby('customerid')['invoiceno'].transform('count')
Let’s plot Customer Frequency Distribution
fig = px.histogram(df_uk, x = ‘customer_frequency’, title = ‘Customer Frequency Distribution’)
fig.show()
Let's calculate Monetary
customer_monetary = pd.DataFrame(df_uk.groupby('customerid')['total_price'].sum())
customer_monetary.rename(columns={'total_price':'monetary'}, inplace=True)
customer_monetary.reset_index(inplace=True)
df_uk['customer_monetary'] = df_uk.groupby('customerid')['total_price'].transform('sum')
Let’s plot Customer Monetary Distribution
fig = px.histogram(df_uk, x = ‘customer_monetary’, title = ‘Customer Monetary Distribution’)
fig.show()
Let's create the RFM Table
customer_rfm = pd.merge(pd.merge(customer_recency, customer_frequency, on='customerid'), customer_monetary, on='customerid')
customer_rfm.head()
customer_rfm.set_index('customerid', inplace=True)
quantiles = customer_rfm.quantile(q=[0.25, 0.50, 0.75])
quantiles
Let's introduce several functions creating the RFM Segmentation Table
def rec_score(x):
if x < 17.000:
return 4
elif 17.000 <= x < 50.000:
return 3
elif 50.000 <= x < 142.000:
return 2
else:
return 1
def freq_score(x):
if x > 5.000:
return 4
elif 5.000 >= x > 2.000:
return 3
elif 2.000 >= x > 1:
return 2
else:
return 1
def mon_score(x):
if x > 1571.285:
return 4
elif 1571.285 >= x > 644.975:
return 3
elif 644.975 >= x > 298.185:
return 2
else:
return 1
Let's calculate the RFM score
customer_rfm[‘recency_score’] = customer_rfm[‘recency’].apply(rec_score)
customer_rfm[‘frequency_score’] = customer_rfm[‘frequency’].apply(freq_score)
customer_rfm['monetary_score'] = customer_rfm['monetary'].apply(mon_score)
customer_rfm.info()
<class 'pandas.core.frame.DataFrame'> Index: 3920 entries, 12346.0 to 18287.0 Data columns (total 6 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 recency 3920 non-null int64 1 frequency 3920 non-null int64 2 monetary 3920 non-null float64 3 recency_score 3920 non-null int64 4 frequency_score 3920 non-null int64 5 monetary_score 3920 non-null int64 dtypes: float64(1), int64(5) memory usage: 214.4 KB
Let’s calculate the total RFM score
customer_rfm[‘RFM_score’] = (customer_rfm[‘recency_score’].astype(str) + customer_rfm[‘frequency_score’].astype(str) +
customer_rfm[‘monetary_score’].astype(str))
customer_rfm.sample(5)
Let's plot the Customer_RFM Distribution
fig = px.histogram(customer_rfm, x = customer_rfm[‘RFM_score’].value_counts().index,
y = customer_rfm[‘RFM_score’].value_counts().values,
title = ‘Customer_RFM Distribution’,
labels = dict(x = “RFM_score”, y =”counts”))
fig.show()
Let's calculate the RFM_level
customer_rfm['RFM_level'] = customer_rfm['recency_score'] + customer_rfm['frequency_score'] + customer_rfm['monetary_score']
customer_rfm.sample(5)
print('Min value for RFM_level : ', customer_rfm['RFM_level'].min())
print('Max value for RFM_level : ', customer_rfm['RFM_level'].max())
Min value for RFM_level : 3 Max value for RFM_level : 12
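To make the two derived columns concrete, here is a tiny illustration with made-up scores (not taken from the dataset): RFM_score concatenates the three digits into a categorical label, while RFM_level sums them into an ordinal level.
r, f, m = 4, 3, 1
rfm_score = str(r) + str(f) + str(m)   # '431' -> categorical label, as in the RFM_score column
rfm_level = r + f + m                  # 8 -> ordinal level, as in the RFM_level column
print(rfm_score, rfm_level)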
Let’s introduce the RFM segment function
def segments(df_rfm):
if df_rfm[‘RFM_level’] == 12 :
return ‘champion’
elif (df_rfm[‘RFM_level’] == 11) or (df_rfm[‘RFM_level’] == 10 ):
return ‘loyal_customer’
elif (df_rfm[‘RFM_level’] == 9) or (df_rfm[‘RFM_level’] == 8 ):
return ‘promising’
elif (df_rfm[‘RFM_level’] == 7) or (df_rfm[‘RFM_level’] == 6 ):
return ‘need_attention’
elif (df_rfm[‘RFM_level’] == 5) or (df_rfm[‘RFM_level’] == 4 ):
return ‘hibernating’
else:
return 'almost_lost'
Let's call this function to compute customer_segment
customer_rfm['customer_segment'] = customer_rfm.apply(segments, axis=1)
customer_rfm.sample(5)
Let's plot the customer_segment distribution
fig = px.histogram(customer_rfm, x = customer_rfm[‘customer_segment’].value_counts().index,
y = customer_rfm[‘customer_segment’].value_counts().values,
title = ‘Customer_Segment Distribution’,
labels = dict(x = “customer_segment”, y =”counts”))
fig.show()
Let's calculate the RFM_level Mean Values
avg_rfm_segment = customer_rfm.groupby('customer_segment').RFM_level.mean().sort_values(ascending=False)
avg_rfm_segment
customer_segment champion 12.000 loyal_customer 10.482 promising 8.530 need_attention 6.479 hibernating 4.499 almost_lost 3.000 Name: RFM_level, dtype: float64
Let’s plot the Customer_Segment RFM_level Mean Values
fig = px.histogram(customer_rfm, x = customer_rfm.groupby(‘customer_segment’).RFM_level.mean().sort_values(ascending=False).index,
y = customer_rfm.groupby(‘customer_segment’).RFM_level.mean().sort_values(ascending=False).values,
title = ‘Customer_Segment RFM_level Mean Values’,
labels = dict(x = “customer_segment”, y =”RFM_level Mean Values”))
fig.show()
Let's calculate the size of RFM_Segment
size_rfm_segment = customer_rfm['customer_segment'].value_counts()
size_rfm_segment
hibernating 918 promising 783 need_attention 771 loyal_customer 650 champion 402 almost_lost 396 Name: count, dtype: int64
Let’s plot Size RFM_Segment
fig = px.histogram(customer_rfm, x = customer_rfm[‘customer_segment’].value_counts().index,
y = customer_rfm[‘customer_segment’].value_counts().values,
title = ‘Size RFM_Segment’,
labels = dict(x = “customer_segment”, y =”Size RFM_Segment”))
fig.show()
Final output table:
customer_segment = pd.DataFrame(pd.concat([avg_rfm_segment, size_rfm_segment], axis=1))
customer_segment.rename(columns={‘RFM_level’: ‘avg_RFM_level’, ‘customer_segment’: ‘segment_size’}, inplace=True)
customer_segment
Let's plot the wordcloud of RFM Segments
from wordcloud import WordCloud
segment_text = customer_rfm[“customer_segment”].str.split(” “).str.join(“_”)
all_segments = " ".join(segment_text)
wc = WordCloud(background_color="orange",
max_words=250,
max_font_size=256,
random_state=42,
width=800, height=400)
wc.generate(all_segments)
plt.figure(figsize = (16, 15))
plt.imshow(wc)
plt.title(“RFM Segments”,fontsize=18,fontweight=”bold”)
plt.axis(‘off’)
plt.show()
Let's check the structure and content of customer_rfm
first_looking(customer_rfm)
duplicate_values(customer_rfm)Shape:(3920, 9) **************************************************************************************************** Info: <class 'pandas.core.frame.DataFrame'> Index: 3920 entries, 12346.0 to 18287.0 Data columns (total 9 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 recency 3920 non-null int64 1 frequency 3920 non-null int64 2 monetary 3920 non-null float64 3 recency_score 3920 non-null int64 4 frequency_score 3920 non-null int64 5 monetary_score 3920 non-null int64 6 RFM_score 3920 non-null object 7 RFM_level 3920 non-null int64 8 customer_segment 3920 non-null object dtypes: float64(1), int64(6), object(2) memory usage: 306.2+ KB None **************************************************************************************************** Number of Uniques: recency 302 frequency 57 monetary 3849 recency_score 4 frequency_score 4 monetary_score 4 RFM_score 61 RFM_level 10 customer_segment 6 dtype: int64 **************************************************************************************************** Missing Values: Empty DataFrame Columns: [Missing_Number, Missing_Percent] Index: [] **************************************************************************************************** All Columns:['recency', 'frequency', 'monetary', 'recency_score', 'frequency_score', 'monetary_score', 'RFM_score', 'RFM_level', 'customer_segment'] **************************************************************************************************** Columns after rename:['recency', 'frequency', 'monetary', 'recency_score', 'frequency_score', 'monetary_score', 'rfm_score', 'rfm_level', 'customer_segment'] **************************************************************************************************** Columns after rename:['recency', 'frequency', 'monetary', 'recency_score', 'frequency_score', 'monetary_score', 'rfm_score', 'rfm_level', 'customer_segment'] **************************************************************************************************** Descriptive Statistics recency frequency monetary recency_score frequency_score count 3920.000 3920.000 3920.000 3920.000 3920.000 \ mean 91.740 4.250 1858.420 2.480 2.320 std 99.530 7.200 7478.630 1.110 1.150 min 0.000 1.000 3.750 1.000 1.000 25% 17.000 1.000 298.180 1.000 1.000 50% 50.000 2.000 644.970 2.000 2.000 75% 142.000 5.000 1571.280 3.000 3.000 max 373.000 209.000 259657.300 4.000 4.000 monetary_score rfm_level count 3920.000 3920.000 mean 2.500 7.300 std 1.120 2.880 min 1.000 3.000 25% 1.750 5.000 50% 2.500 7.000 75% 3.250 10.000 max 4.000 12.000 **************************************************************************************************** Descriptive Statistics (Categorical Columns) count unique top freq rfm_score 3920 61 444 402 customer_segment 3920 6 hibernating 918 **************************************************************************************************** Duplicate check... There are 0 duplicated observations in the dataset.
customer_rfm.groupby(‘rfm_level’).agg({‘recency’: [‘mean’,’min’,’max’,’count’],
‘frequency’: [‘mean’,’min’,’max’,’count’],
'monetary': ['mean','min','max','count'] }).round(1)
Let's check RFM correlations
plt.figure(figsize = (20,20))
sns.pairplot(customer_rfm[['recency', 'frequency', 'monetary', 'customer_segment']], hue='customer_segment');
Let's compute the RFM correlation matrix
matrix = np.triu(customer_rfm[[‘recency’,’frequency’,’monetary’]].corr())
fig, ax = plt.subplots(figsize=(14,10))
sns.heatmap(customer_rfm[['recency','frequency','monetary']].corr(), annot=True, fmt='.2f', vmin=-1, vmax=1, center=0, cmap='coolwarm', mask=matrix, ax=ax);
Read more here.
RFM TreeMap: Online Retail Dataset II
Let’s import libraries and read the Online Retail II dataset Year 2010-2011
import datetime as dt
import pandas as pd
pd.set_option(‘display.max_columns’, None)
pd.set_option(‘display.max_rows’, None)
pd.set_option('display.float_format', lambda x: '%.2f' % x)
df_ = pd.read_excel("online_retail_II.xlsx", sheet_name="Year 2010-2011")
df = df_.copy()
Let's check the overall data structure
def check_df(dataframe):
print(“################ Shape ####################”)
print(dataframe.shape)
print(“############### Columns ###################”)
print(dataframe.columns)
print(“############### Types #####################”)
print(dataframe.dtypes)
print(“############### Head ######################”)
print(dataframe.head())
print(“############### Tail ######################”)
print(dataframe.tail())
print(“############### Describe ###################”)
print(dataframe.describe().T)
check_df(df)
################ Shape #################### (541910, 8) ############### Columns ################### Index(['Invoice', 'StockCode', 'Description', 'Quantity', 'InvoiceDate', 'Price', 'Customer ID', 'Country'], dtype='object') ############### Types ##################### Invoice object StockCode object Description object Quantity int64 InvoiceDate datetime64[ns] Price float64 Customer ID float64 Country object dtype: object ############### Head ###################### Invoice StockCode Description Quantity 0 536365 85123A WHITE HANGING HEART T-LIGHT HOLDER 6 \ 1 536365 71053 WHITE METAL LANTERN 6 2 536365 84406B CREAM CUPID HEARTS COAT HANGER 8 3 536365 84029G KNITTED UNION FLAG HOT WATER BOTTLE 6 4 536365 84029E RED WOOLLY HOTTIE WHITE HEART. 6 InvoiceDate Price Customer ID Country 0 2010-12-01 08:26:00 2.55 17850.00 United Kingdom 1 2010-12-01 08:26:00 3.39 17850.00 United Kingdom 2 2010-12-01 08:26:00 2.75 17850.00 United Kingdom 3 2010-12-01 08:26:00 3.39 17850.00 United Kingdom 4 2010-12-01 08:26:00 3.39 17850.00 United Kingdom ############### Tail ###################### Invoice StockCode Description Quantity 541905 581587 22899 CHILDREN'S APRON DOLLY GIRL 6 \ 541906 581587 23254 CHILDRENS CUTLERY DOLLY GIRL 4 541907 581587 23255 CHILDRENS CUTLERY CIRCUS PARADE 4 541908 581587 22138 BAKING SET 9 PIECE RETROSPOT 3 541909 581587 POST POSTAGE 1 InvoiceDate Price Customer ID Country 541905 2011-12-09 12:50:00 2.10 12680.00 France 541906 2011-12-09 12:50:00 4.15 12680.00 France 541907 2011-12-09 12:50:00 4.15 12680.00 France 541908 2011-12-09 12:50:00 4.95 12680.00 France 541909 2011-12-09 12:50:00 18.00 12680.00 France ############### Describe ################### count mean min Quantity 541910.00 9.55 -80995.00 \ InvoiceDate 541910 2011-07-04 13:35:22.342307584 2010-12-01 08:26:00 Price 541910.00 4.61 -11062.06 Customer ID 406830.00 15287.68 12346.00 25% 50% 75% Quantity 1.00 3.00 10.00 \ InvoiceDate 2011-03-28 11:34:00 2011-07-19 17:17:00 2011-10-19 11:27:00 Price 1.25 2.08 4.13 Customer ID 13953.00 15152.00 16791.00 max std Quantity 80995.00 218.08 InvoiceDate 2011-12-09 12:50:00 NaN Price 38970.00 96.76 Customer ID 18287.00 1713.60
####### DATA PREPARATION
df.isnull().any()
df.isnull().sum()
df.dropna(inplace=True)
df["StockCode"].nunique()
df.groupby(“StockCode”)[“Quantity”].sum()
df.groupby(“StockCode”).agg({“Quantity”:”sum”}).sort_values(by=”Quantity”, ascending=False).head()
df = df[~df[“Invoice”].str.contains(“C”, na=False)]
df[“TotalPrice”] = df[“Price”] * df[“Quantity”]
df[“InvoiceDate”].max()
today_date = dt.datetime(2011, 12, 11)
rfm = df.groupby("Customer ID").agg({"InvoiceDate": lambda InvoiceDate: (today_date - InvoiceDate.max()).days,
                                     "Invoice": lambda Invoice: Invoice.nunique(),
                                     "TotalPrice": lambda TotalPrice: TotalPrice.sum()})
rfm.columns = ["recency", "frequency", "monetary"]
rfm = rfm[rfm["monetary"] > 0]
####### RFM CALCULATION
rfm[“recency_score”] = pd.qcut(rfm[‘recency’], 5, labels=[5, 4, 3, 2, 1])
rfm[“frequency_score”] = pd.qcut(rfm[‘frequency’].rank(method=”first”), 5, labels=[1, 2, 3, 4, 5])
rfm[“monetary_score”] = pd.qcut(rfm[‘monetary’], 5, labels=[1, 2, 3, 4, 5])
rfm[“RFM_SCORE”] = (rfm[‘recency_score’].astype(str) + rfm[‘frequency_score’].astype(str))
rfm.head()
rfm.describe().T
rfm[rfm[“RFM_SCORE”] == “55”].head() # champions
rfm[rfm[“RFM_SCORE”] == “11”].head() # hibernating
#######CUSTOMER SEGMENTATION
seg_map = {
r'[1-2][1-2]’: ‘hibernating’,
r'[1-2][3-4]’: ‘at_Risk’,
r'[1-2]5′: ‘cant_loose’,
r’3[1-2]’: ‘about_to_sleep’,
r’33’: ‘need_attention’,
r'[3-4][4-5]’: ‘loyal_customers’,
r’41’: ‘promising’,
r’51’: ‘new_customers’,
r'[4-5][2-3]’: ‘potential_loyalists’,
r’5[4-5]’: ‘champions’
}
rfm['segment'] = rfm['RFM_SCORE'].replace(seg_map, regex=True)
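As a quick, hedged illustration of how this regex-based replace works, the sketch below applies the seg_map defined above to a handful of made-up two-digit RF scores (toy values, not from the dataset):
import pandas as pd
toy_scores = pd.Series(["55", "11", "32", "41", "25"])
print(toy_scores.replace(seg_map, regex=True))
# expected mapping: 55 -> champions, 11 -> hibernating, 32 -> about_to_sleep, 41 -> promising, 25 -> cant_loose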
rfm[[“segment”, “recency”, “frequency”, “monetary”]].groupby(“segment”).agg([“mean”, “count”])
rfm[rfm[“segment”] == “need_attention”].head()
rfm[rfm[“segment”] == “new_customers”].index
seg_list = [“at_Risk”, “hibernating”, “cant_loose”, “loyal_customers”]
for i in seg_list:
print(F” {i.upper()} “.center(50, “*”))
print(rfm[rfm["segment"]==i].describe().T)
rfm[rfm["segment"] == "loyal_customers"].index
loyalCus_df = pd.DataFrame()
loyalCus_df[“loyal_customersId”] = rfm[rfm[“segment”] == “loyal_customers”].index
loyalCus_df.head()
loyalCus_df.to_csv("loyal_customers.csv")
******************** AT_RISK ********************* count mean std min 25% 50% 75% max recency 593.00 153.79 68.62 73.00 96.00 139.00 195.00 374.00 frequency 593.00 2.88 0.95 2.00 2.00 3.00 3.00 6.00 monetary 593.00 1084.54 2562.07 52.00 412.78 678.25 1200.62 44534.30 ****************** HIBERNATING ******************* count mean std min 25% 50% 75% max recency 1071.00 217.61 92.01 73.00 135.00 219.00 289.50 374.00 frequency 1071.00 1.10 0.30 1.00 1.00 1.00 1.00 2.00 monetary 1071.00 488.64 2419.68 3.75 155.11 296.25 457.93 77183.60 ******************* CANT_LOOSE ******************* count mean std min 25% 50% 75% max recency 63.00 132.97 65.25 73.00 89.00 108.00 161.00 373.00 frequency 63.00 8.38 4.29 6.00 6.00 7.00 9.00 34.00 monetary 63.00 2796.16 2090.49 70.02 1137.51 2225.97 3532.24 10254.18 **************** LOYAL_CUSTOMERS ***************** count mean std min 25% 50% 75% max recency 819.00 33.61 15.58 15.00 20.00 30.00 44.00 72.00 frequency 819.00 6.48 4.55 3.00 4.00 5.00 8.00 63.00 monetary 819.00 2864.25 6007.06 36.56 991.80 1740.48 3052.91 124914.53
#####CREATE RFM
def create_rfm(dataframe):
dataframe["TotalPrice"] = dataframe["Quantity"] * dataframe["Price"] dataframe.dropna(inplace=True) dataframe = dataframe[~dataframe["Invoice"].str.contains("C", na=False)] today_date = dt.datetime(2011, 12, 11) rfm = dataframe.groupby('Customer ID').agg({'InvoiceDate': lambda date: (today_date - date.max()).days, 'Invoice': lambda num: num.nunique(), "TotalPrice": lambda price: price.sum()}) rfm.columns = ['recency', 'frequency', "monetary"] rfm = rfm[(rfm['monetary'] > 0)] rfm["recency_score"] = pd.qcut(rfm['recency'], 5, labels=[5, 4, 3, 2, 1]) rfm["frequency_score"] = pd.qcut(rfm["frequency"].rank(method="first"), 5, labels=[1, 2, 3, 4, 5]) rfm["monetary_score"] = pd.qcut(rfm['monetary'], 5, labels=[1, 2, 3, 4, 5]) rfm["RFM_SCORE"] = (rfm['recency_score'].astype(str) + rfm['frequency_score'].astype(str))
rfm.head()
rfm['score'] = rfm['recency'] * rfm['frequency'] * rfm['monetary']
Statistical information on the segments
rfm[["segment", "recency", "frequency", "monetary"]].groupby("segment").agg(["mean", "count"])
How many segments are there?
segments_counts = rfm['segment'].value_counts().sort_values(ascending=True)
Let's plot our RFM segments
fig, ax = plt.subplots()
bars = ax.barh(range(len(segments_counts)),
segments_counts,
color=’lightcoral’)
ax.set_frame_on(False)
ax.tick_params(left=False,
bottom=False,
labelbottom=False)
ax.set_yticks(range(len(segments_counts)))
ax.set_yticklabels(segments_counts.index)
for i, bar in enumerate(bars):
value = bar.get_width()
if segments_counts.index[i] in [‘Can\’t loose’]:
bar.set_color(‘firebrick’)
ax.text(value,
bar.get_y() + bar.get_height()/2,
‘{:,} ({:}%)’.format(int(value),
int(value*100/segments_counts.sum())),
va=’center’,
ha=’left’
)
plt.show(block=True)
Let’s plot the RFM segmentation treemap
import matplotlib.pyplot as plt
import squarify
sizes=[24, 18, 14, 13,11,8,4,3,2,1]
label=['hibernating', 'loyal_customers', 'champions', 'at_risk', 'potential_loyalists', 'about_to_sleep', 'need_attention', 'promising', 'CL', 'NC']
colors = ['#91DCEA', '#64CDCC', '#5FBB68',
‘#F9D23C’, ‘#F9A729’, ‘#FD6F30′,’grey’,’red’,’blue’,’cyan’]
squarify.plot(sizes=sizes, label=label, color=colors, alpha=0.6 )
plt.axis("off")
plt.show()
where CL = cant_loose and NC = new_customers.
Read more here.
CRM Analytics: CLTV
Let’s look at the CRM Analytics – CLTV (Customer Lifetime Value).
Importing libraries:
import datetime as dt
import pandas as pd
import warnings
warnings.filterwarnings(‘ignore’)
import matplotlib.pyplot as plt
and set the display settings
pd.set_option("display.max_columns", None)
pd.set_option("display.float_format", lambda x: "%.3f" % x)
####### DATA PREPARATION
Reading the Online Retail II dataset and creating a copy of it
df_ = pd.read_excel(“online_retail.xlsx”)
df = df_.copy()
df.head()
df.shape
(525461, 8)
Number of unique products
df["Description"].nunique()
4681
value_counts() of each product
df["Description"].value_counts().head()
WHITE HANGING HEART T-LIGHT HOLDER 3549 REGENCY CAKESTAND 3 TIER 2212 STRAWBERRY CERAMIC TRINKET BOX 1843 PACK OF 72 RETRO SPOT CAKE CASES 1466 ASSORTED COLOUR BIRD ORNAMENT 1457 Name: Description, dtype: int64
What is the most ordered product
df.groupby("Description").agg({"Quantity": "sum"}).head()
Missing value check
df.isnull().sum()
Customers with a missing Customer ID must be removed from the dataset
df.dropna(inplace=True)
Adding the total price per product to the dataset as a new variable
df["TotalPrice"] = df["Quantity"] * df["Price"]
Let's remove invoices starting with "C" (returned products)
df = df[~df[“Invoice”].str.contains(“C”, na=False)]
and check the descriptive statistics of the dataset
df.describe().T
######### CALCULATION OF RFM METRICS
Last date in the InvoiceDate
df["InvoiceDate"].max()
Timestamp('2010-12-09 20:01:00')
Let's set 2 days after the last date
analysis_date = dt.datetime(2010, 12, 11)
Let's introduce the following RFM columns
rfm = df.groupby(“Customer ID”).agg({“InvoiceDate”: lambda invoice_date: (analysis_date – invoice_date.max()).days,
“Invoice”: lambda invoice: invoice.nunique(),
“TotalPrice”: lambda total_price: total_price.sum()})
rfm.columns = [“recency”, “frequency”, “monetary”]
rfm.head()
while excluding monetary <= 0
rfm = rfm[rfm[“monetary”] > 0]
and checking descriptive statistics
rfm.describe().T
RECENCY SCORE
rfm["recency_score"] = pd.qcut(rfm["recency"], 5, labels=[5, 4, 3, 2, 1])
FREQUENCY SCORE
rfm["frequency_score"] = pd.qcut(rfm["frequency"].rank(method="first"), 5, labels=[1, 2, 3, 4, 5])
MONETARY SCORE
rfm["monetary_score"] = pd.qcut(rfm["monetary"], 5, labels=[1, 2, 3, 4, 5])
RF SCORE
rfm["RF_SCORE"] = (rfm["recency_score"].astype(str) + rfm["frequency_score"].astype(str))
rfm.head()
Let’s convert RF_SCORE to segment with regex=True
seg_map = {
r'[1-2][1-2]’: ‘hibernating’,
r'[1-2][3-4]’: ‘at_Risk’,
r'[1-2]5′: ‘cant_loose’,
r’3[1-2]’: ‘about_to_sleep’,
r’33’: ‘need_attention’,
r'[3-4][4-5]’: ‘loyal_customers’,
r’41’: ‘promising’,
r’51’: ‘new_customers’,
r'[4-5][2-3]’: ‘potential_loyalists’,
r’5[4-5]’: ‘champions’
}
rfm['segment'] = rfm['RF_SCORE'].replace(seg_map, regex=True)
rfm.head()
Let’s compute mean and count of our RFM columns sorted by segment in the ascending order
rfm[[“segment”, “recency”, “frequency”, “monetary”]].groupby(“segment”).agg([“mean”, “count”])
segments_counts = rfm[‘segment’].value_counts().sort_values(ascending=True)
Let’s plot our RFM segments
fig, ax = plt.subplots()
bars = ax.barh(range(len(segments_counts)),
segments_counts,
color=’lightcoral’)
ax.set_frame_on(False)
ax.tick_params(left=False,
bottom=False,
labelbottom=False)
ax.set_yticks(range(len(segments_counts)))
ax.set_yticklabels(segments_counts.index)
for i, bar in enumerate(bars):
value = bar.get_width()
if segments_counts.index[i] in [‘Can\’t loose’]:
bar.set_color(‘firebrick’)
ax.text(value,
bar.get_y() + bar.get_height()/2,
‘{:,} ({:}%)’.format(int(value),
int(value*100/segments_counts.sum())),
va=’center’,
ha=’left’
)
plt.show(block=True)
########## CLTV CALCULATION
Let’s prepare our input data
cltv_c = df.groupby(‘Customer ID’).agg({“Invoice”: lambda x: x.nunique(),
“Quantity”: lambda x: x.sum(),
"TotalPrice": lambda x: x.sum()})
cltv_c.columns = ["total_transaction", "total_unit", "total_price"]
Let’s calculate the following metrics:
profit margin:
cltv_c["profit_margin"] = cltv_c["total_price"] * 0.10
Churn Rate:
repeat_rate = cltv_c[cltv_c[“total_transaction”] > 1].shape[0] / cltv_c.shape[0]
churn_rate = 1 – repeat_rate
Purchase Frequency:
cltv_c["purchase_frequency"] = cltv_c["total_transaction"] / cltv_c.shape[0]
Average Order Value:
cltv_c["average_order_value"] = cltv_c["total_price"] / cltv_c["total_transaction"]
Customer Value:
cltv_c["customer_value"] = cltv_c["average_order_value"] * cltv_c["purchase_frequency"]
CLTV (Customer Lifetime Value):
cltv_c["cltv"] = (cltv_c["customer_value"] / churn_rate) * cltv_c["profit_margin"]
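As a sanity check on the formula chain above, here is a worked example with made-up numbers for a single hypothetical customer (none of these values come from the dataset):
total_price = 1000.0        # hypothetical total spend of one customer
total_transaction = 5       # hypothetical number of invoices for that customer
n_customers = 4000          # hypothetical number of customers in the dataset
churn_rate = 0.35           # assumed 1 - repeat_rate
profit_margin = total_price * 0.10                          # 100.0
purchase_frequency = total_transaction / n_customers        # 0.00125
average_order_value = total_price / total_transaction       # 200.0
customer_value = average_order_value * purchase_frequency   # 0.25
cltv = (customer_value / churn_rate) * profit_margin        # ~71.4
print(round(cltv, 1))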
Sorting our dataset by cltv in descending order:
cltv_c.sort_values(by="cltv", ascending=False).head()
Eliminating outliers with the interquartile range:
def outlier_threshold(dataframe, variable):
    quartile1 = dataframe[variable].quantile(0.01)
    quartile3 = dataframe[variable].quantile(0.99)
    interquantile_range = quartile3 - quartile1
    up_limit = quartile3 + 1.5 * interquantile_range
    low_limit = quartile1 - 1.5 * interquantile_range
    return low_limit, up_limit
def replace_with_thresholds(dataframe, variable):
    low_limit, up_limit = outlier_threshold(dataframe, variable)
    # dataframe.loc[(dataframe[variable] < low_limit), variable] = low_limit
    dataframe.loc[(dataframe[variable] > up_limit), variable] = up_limit
replace_with_thresholds(df, "Quantity")
replace_with_thresholds(df, "Price")
Define the analysis date
df["InvoiceDate"].max()
analysis_date = dt.datetime(2011, 12, 11)
############ Preparation of Lifetime Data Structure
cltv_df = df.groupby(“Customer ID”).agg({“InvoiceDate”: [lambda InvoiceDate: (InvoiceDate.max() – InvoiceDate.min()).days / 7,
lambda InvoiceDate: (analysis_date – InvoiceDate.min()).days / 7],
“Invoice”: lambda Invoice: Invoice.nunique(),
"TotalPrice": lambda TotalPrice: TotalPrice.sum()})
cltv_df.columns = cltv_df.columns.droplevel(0)
cltv_df.columns = [“recency”, “T”, “frequency”, “monetary”]
cltv_df[“monetary”] = cltv_df[“monetary”] / cltv_df[“frequency”]
cltv_df = cltv_df[cltv_df[“frequency”] > 1]
cltv_df.head()
cltv_df.describe().T
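To make the week-based recency and T definitions concrete, here is a minimal worked example with made-up dates for one hypothetical customer:
import datetime as dt
first_purchase = dt.datetime(2011, 1, 10)   # hypothetical first purchase
last_purchase = dt.datetime(2011, 6, 10)    # hypothetical last purchase
analysis_date = dt.datetime(2011, 12, 11)
recency_weeks = (last_purchase - first_purchase).days / 7   # ~21.6 weeks between first and last purchase
T_weeks = (analysis_date - first_purchase).days / 7         # ~47.9 weeks of observed customer "age"
print(round(recency_weeks, 1), round(T_weeks, 1))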
#######CLTV MODELLING
Let’s install and import the following libraries
!pip install Lifetimes
from lifetimes import BetaGeoFitter
from lifetimes import GammaGammaFitter
from lifetimes.plotting import plot_period_transactions
import datetime as dt
import pandas as pd
import warnings
warnings.filterwarnings(‘ignore’)
import matplotlib.pyplot as plt
pd.set_option(‘display.max_columns’, None)
pd.set_option(‘display.float_format’, lambda x: ‘%.5f’ % x)
pd.set_option('display.width', 500)
######BG/NBD Model
bgf = BetaGeoFitter(penalizer_coef=0.001)
bgf.fit(cltv_df[“frequency”],
cltv_df[“recency”],
cltv_df[“T”])<lifetimes.BetaGeoFitter: fitted with 2893 subjects, a: 1.93, alpha: 9.47, b: 6.27, r: 2.22>
Let’s look at the first 10 customers with the highest purchase:
bgf.conditional_expected_number_of_purchases_up_to_time(1, # number of weeks
cltv_df[“frequency”],
cltv_df[“recency”],
cltv_df["T"]).sort_values(ascending=False).head(10)
Customer ID 15989.00000 0.00756 16720.00000 0.00722 14119.00000 0.00720 16204.00000 0.00715 17591.00000 0.00714 15169.00000 0.00700 17193.00000 0.00698 17251.00000 0.00696 17411.00000 0.00690 17530.00000 0.00680 dtype: float64
We can do the same thing with predict:
bgf.predict(1, #number of weeks
cltv_df[“frequency”],
cltv_df[“recency”],
cltv_df["T"]).sort_values(ascending=False).head(10)
Let's apply predict for 1 month:
bgf.predict(4, # 4 weeks = 1 month
cltv_df[“frequency”],
cltv_df[“recency”],
cltv_df["T"]).sort_values(ascending=False).head(10)
Customer ID 15989.00000 0.02986 16720.00000 0.02849 14119.00000 0.02841 16204.00000 0.02825 17591.00000 0.02816 15169.00000 0.02763 17193.00000 0.02756 17251.00000 0.02748 17411.00000 0.02726 17530.00000 0.02682 dtype: float64
Let's compare expected purchases for several time intervals:
Expected purchases for 1 month
cltv_df[“expected_purchase_1_month”] = bgf.predict(4,
cltv_df[“frequency”],
cltv_df[“recency”],
cltv_df["T"])
Expected purchases for 3 months
cltv_df[“expected_purchase_3_month”] = bgf.predict(4 * 3,
cltv_df[“frequency”],
cltv_df[“recency”],
cltv_df["T"])
Expected purchases for 1 year
cltv_df[“expected_purchase_1_year”] = bgf.predict(4 * 12,
cltv_df[“frequency”],
cltv_df[“recency”],
cltv_df[“T”])
cltv_df.head()
Evaluation of predicted results:
plot_period_transactions(bgf)
plt.show(block=True)
Let's plot the FR matrix
from lifetimes.plotting import plot_frequency_recency_matrix
import matplotlib.pyplot as plt
fig = plt.figure(figsize=(12,8))
plot_frequency_recency_matrix(bgf)
plt.show(block=True)
############ Gamma-Gamma Submodel
ggf = GammaGammaFitter(penalizer_coef=0.01)
ggf.fit(cltv_df[“frequency”], cltv_df[“monetary”])
cltv_df[“expected_average_profit”] = ggf.conditional_expected_average_profit(cltv_df[“frequency”],
cltv_df["monetary"])
cltv_df.sort_values("expected_average_profit", ascending=False).head(10)
######### Calculation of CLTV with BG/NBD and GG Model
cltv = ggf.customer_lifetime_value(bgf,
cltv_df[“frequency”],
cltv_df[“recency”],
cltv_df[“T”],
cltv_df[“monetary”],
time=3, # 3 month
freq=”W”, # frequency of T
discount_rate=0.01)
cltv = cltv.reset_index()
cltv_final = cltv_df.merge(cltv, on=”Customer ID”, how=”left”)
cltv_final.sort_values(by=”clv”, ascending=False).head(10)
########## Segmentation according to CLTV
cltv_final[“segment”] = pd.qcut(cltv_final[“clv”], 4, labels=[“D”, “C”, “B”, “A”])
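A quick toy illustration of how pd.qcut assigns the quartile labels used here (hypothetical values, equal-sized bins):
import pandas as pd
vals = pd.Series([10, 20, 30, 40, 50, 60, 70, 80])
print(pd.qcut(vals, 4, labels=["D", "C", "B", "A"]).tolist())
# ['D', 'D', 'C', 'C', 'B', 'B', 'A', 'A']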
cltv_final.groupby(“segment”).agg({“count”, “mean”, “sum”})
_df_plt = cltv_final[[“expected_average_profit”, “segment”]]
plot_df = _df_plt.groupby(“segment”).agg(“mean”)
plot_df.plot(kind=”bar”, color=”tomato”,figsize=(15, 8))
plt.ylabel("mean")
plt.show(block=True)
Read more here.
Cohort Analysis in Online Retail
Let’s dive into Cohort Analysis using the Online Retail Dataset.
Let’s install and import standard libraries
!pip install sklearn-pandas
import pandas as pd
import numpy as np
import datetime as dt
import seaborn as sns
from sklearn import preprocessing
import sklearn
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import make_classification
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn_pandas import DataFrameMapper
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import Normalizer
from matplotlib import pyplot as plt
Reading the dataset
df = pd.read_excel("online_retail.xlsx")
#############DATA PREPARATION
df.shape
(525461, 8)
print(‘Duplicate entries: {}’.format(df.duplicated().sum()))
print(‘{}% rows are duplicate.’.format(round((df.duplicated().sum()/df.shape[0])*100),2))
df.drop_duplicates(inplace=True)
Duplicate entries: 6865 1% rows are duplicate.
pd.DataFrame([{‘products’: len(df[‘StockCode’].value_counts()),
‘transactions’: len(df[‘Invoice’].value_counts()),
‘customers’: len(df[‘Customer ID’].value_counts()),
}], columns=['products', 'transactions', 'customers'], index=['quantity'])
products transactions customers
quantity 4632 28816 4383
temp = df.groupby([‘Country’],as_index=False).agg({‘Invoice’:’nunique’}).rename(columns = {‘Invoice’:’Orders’})
total = temp[‘Orders’].sum(axis=0)
temp['%Orders'] = round((temp['Orders']/total)*100, 4)
temp.sort_values(by=['%Orders'], ascending=False, inplace=True)
temp.reset_index(drop=True, inplace=True)
Let's plot %Orders vs Country
plt.figure(figsize=(13,6))
splot=sns.barplot(x=”Country”,y=”%Orders”,data=temp[:10])
for p in splot.patches:
splot.annotate(format(p.get_height(), ‘.1f’),
(p.get_x() + p.get_width() / 2., p.get_height()),
ha = ‘center’, va = ‘center’,
xytext = (0, 9),
textcoords = ‘offset points’)
plt.xlabel(“Country”, size=14)
plt.ylabel("%Orders", size=14)
invoices = df['Invoice']
x = invoices.str.contains('C', regex=True)
x.fillna(0, inplace=True)
x = x.astype(int)
x.value_counts()
0 508414 1 10182 Name: Invoice, dtype: int64
df[‘order_canceled’] = x
#df.head()
n1 = df['order_canceled'].value_counts()[1]
n2 = df.shape[0]
print('Number of orders canceled: {}/{} ({:.2f}%) '.format(n1, n2, n1/n2*100))
Number of orders canceled: 10182/518596 (1.96%)
df = df.loc[df[‘order_canceled’] == 0,:]
df.reset_index(drop=True,inplace=True)
df.loc[df[‘Quantity’] < 0,:]
df = df[df[‘Customer ID’].notna()]
df.reset_index(drop=True,inplace=True)
#############NL COHORT INDEX ANALYSIS
df_uk = df[df.Country == ‘Netherlands’]
df_uk.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 2729 entries, 4346 to 387139 Data columns (total 9 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Invoice 2729 non-null object 1 StockCode 2729 non-null object 2 Description 2729 non-null object 3 Quantity 2729 non-null int64 4 InvoiceDate 2729 non-null datetime64[ns] 5 Price 2729 non-null float64 6 Customer ID 2729 non-null float64 7 Country 2729 non-null object 8 order_canceled 2729 non-null int32 dtypes: datetime64[ns](1), float64(2), int32(1), int64(1), object(4) memory usage: 202.5+ KB
pd.DataFrame([{‘products’: len(df[‘StockCode’].value_counts()),
‘transactions’: len(df[‘Invoice’].value_counts()),
‘customers’: len(df[‘Customer ID’].value_counts()),
}], columns=['products', 'transactions', 'customers'], index=['Quantity'])
products transactions customers
Quantity 4017 19215 4314
cohort_data = df_uk[[‘Invoice’,’StockCode’,’Description’,’Quantity’,’InvoiceDate’,’Price’,’Customer ID’,’Country’]]
Checking for nulls in the data
cohort_data.isnull().sum()
Invoice 0 StockCode 0 Description 0 Quantity 0 InvoiceDate 0 Price 0 Customer ID 0 Country 0 dtype: int64
all_dates = (pd.to_datetime(cohort_data[‘InvoiceDate’])).apply(lambda x:x.date())
(all_dates.max() – all_dates.min()).days
364
Start and end dates:
print(‘Start date: {}’.format(all_dates.min()))
print('End date: {}'.format(all_dates.max()))
Start date: 2009-12-02 End date: 2010-12-01
cohort_data.head()
Invoice StockCode Description Quantity InvoiceDate Price Customer ID Country
def get_month(x):
    return dt.datetime(x.year, x.month, 1)
cohort_data['InvoiceMonth'] = cohort_data['InvoiceDate'].apply(get_month)
grouping = cohort_data.groupby(‘Customer ID’)[‘InvoiceMonth’]
cohort_data[‘CohortMonth’] = grouping.transform(‘min’)
cohort_data.head()
Invoice StockCode Description Quantity InvoiceDate Price Customer ID Country InvoiceMonth CohortMonth
def get_date_int(df, column):
year = df[column].dt.year
month = df[column].dt.month
day = df[column].dt.day
return year, month, day
invoice_year, invoice_month, _ = get_date_int(cohort_data, 'InvoiceMonth')
cohort_year, cohort_month, _ = get_date_int(cohort_data, ‘CohortMonth’)
years_diff = invoice_year – cohort_year
months_diff = invoice_month – cohort_month
cohort_data[‘CohortIndex’] = years_diff * 12 + months_diff
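For example, a purchase made in March 2010 by a customer whose cohort month is December 2009 gets CohortIndex 3; a minimal check with made-up dates:
invoice_year, invoice_month = 2010, 3
cohort_year, cohort_month = 2009, 12
print((invoice_year - cohort_year) * 12 + (invoice_month - cohort_month))   # 3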
#cohort_data.head()
grouping = cohort_data.groupby(['CohortMonth', 'CohortIndex'])
cohort_data = grouping[‘Quantity’].mean()
cohort_data = cohort_data.reset_index()
average_quantity = cohort_data.pivot(index=’CohortMonth’,columns=’CohortIndex’,values=’Quantity’)
average_quantity.round(1)
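Note that the heatmap below uses a retention table that is not constructed in the snippet above (cohort_data has just been overwritten by the mean-quantity pivot). As a hedged sketch of how such a table is usually derived from the same CohortMonth/CohortIndex fields, here is a toy example; the frame, its values, and the variable names are hypothetical:
import pandas as pd
toy = pd.DataFrame({
    'Customer ID': [1, 1, 2, 2, 3, 3, 3],
    'CohortMonth': pd.to_datetime(['2009-12-01'] * 4 + ['2010-01-01'] * 3),
    'CohortIndex': [0, 1, 0, 2, 0, 1, 2],
})
# Unique customers per cohort and month offset
cohort_counts = (toy.groupby(['CohortMonth', 'CohortIndex'])['Customer ID']
                    .nunique()
                    .reset_index()
                    .pivot(index='CohortMonth', columns='CohortIndex', values='Customer ID'))
# Divide every column by the cohort's size in month 0 to get retention rates
cohort_sizes = cohort_counts.iloc[:, 0]
retention = cohort_counts.divide(cohort_sizes, axis=0)
print(retention)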
#average_quantity
Let's plot the retention rates heatmap
plt.figure(figsize=(10, 8))
plt.title(‘Retention rates’)
sns.heatmap(data = retention,annot = True,fmt = ‘.0%’,vmin = 0.0,vmax = 0.5,cmap = ‘BuGn’)
plt.show()
Let's compute TotalPrice
grdf=grouping.apply(pd.DataFrame)
grdf[‘TotalPrice’]=grdf[‘Quantity’]*grdf[‘Price’]
grdf[grdf[‘TotalPrice’]>139.2].sort_values(‘TotalPrice’,ascending=False)
#############RFM SCORE ANALYSIS
from dateutil.relativedelta import relativedelta
start_date = all_dates.max()-relativedelta(months=12,days=-1)
print(‘Start date: {}’.format(start_date))
print('End date: {}'.format(all_dates.max()))
Start date: 2009-12-02 End date: 2010-12-01
data_rfm = grdf[grdf[‘InvoiceDate’] >= pd.to_datetime(start_date)]
data_rfm.reset_index(drop=True,inplace=True)
data_rfm.head()
Invoice StockCode Description Quantity InvoiceDate Price Customer ID Country InvoiceMonth CohortMonth CohortIndex TotalPrice
snapshot_date = max(data_rfm.InvoiceDate) + dt.timedelta(days=1)
print('Snapshot date: {}'.format(snapshot_date.date()))
Snapshot date: 2010-12-02
Aggregate data on a customer level
data = data_rfm.groupby(['Customer ID'], as_index=False).agg({'InvoiceDate': lambda x: (snapshot_date - x.max()).days,
‘Invoice’: ‘count’,
‘TotalPrice’: ‘sum’}).rename(columns = {‘InvoiceDate’: ‘Recency’,
‘Invoice’: ‘Frequency’,
'TotalPrice': 'MonetaryValue'})
r_labels = range(4, 0, -1)
r_quartiles = pd.qcut(data[‘Recency’], 4, labels = r_labels)
data = data.assign(R = r_quartiles.values)
f_labels = range(1,5)
m_labels = range(1,5)
f_quartiles = pd.qcut(data[‘Frequency’], 4, labels = f_labels)
m_quartiles = pd.qcut(data[‘MonetaryValue’], 4, labels = m_labels)
data = data.assign(F = f_quartiles.values)
data = data.assign(M = m_quartiles.values)
#data.head()
def join_rfm(x):
    return str(x['R']) + str(x['F']) + str(x['M'])
data['RFM_Segment'] = data.apply(join_rfm, axis=1)
data[‘RFM_Score’] = data[[‘R’,’F’,’M’]].sum(axis=1)
data.head()
data.groupby('RFM_Score').agg({'Recency': 'mean',
‘Frequency’: ‘mean’,
'MonetaryValue': ['mean', 'count'] }).round(1)
Let's create the segments
def create_segment(df):
if df[‘RFM_Score’] >= 9:
return ‘Top’
elif (df[‘RFM_Score’] >= 5) and (df[‘RFM_Score’] < 9):
return ‘Middle’
else:
return 'Low'
data['General_Segment'] = data.apply(create_segment, axis=1)
data.groupby(‘General_Segment’).agg({‘Recency’: ‘mean’,
‘Frequency’: ‘mean’,
'MonetaryValue': ['mean', 'count']}).round(1)
Plotting the distributions of the Recency, Frequency and MonetaryValue variables
plt.figure(figsize=(12,10))
plt.subplot(3, 1, 1); sns.distplot(data[‘Recency’])
plt.subplot(3, 1, 2); sns.distplot(data[‘Frequency’])
plt.subplot(3, 1, 3); sns.distplot(data[‘MonetaryValue’])
Checking for constant mean and variance.
data[['Recency','Frequency','MonetaryValue']].describe()
data[data['MonetaryValue'] == 0]
raw_data = data[[‘Recency’,’Frequency’,’MonetaryValue’]]
#############RFM SCORE TRANSFORMATIONS
Let’s unskew the data
data_log = np.log(raw_data)
Initialize a standard scaler and fit it
scaler = StandardScaler()
scaler.fit(data_log)
Scale and center the data
data_normalized = scaler.transform(data_log)
Create a pandas DataFrame from the scaled array
data_norm = pd.DataFrame(data=data_normalized, index=raw_data.index, columns=raw_data.columns)
Let's plot the transformed RFM distributions
plt.figure(figsize=(12,10))
plt.subplot(3, 1, 1); sns.distplot(data_norm[‘Recency’])
plt.subplot(3, 1, 2); sns.distplot(data_norm[‘Frequency’])
plt.subplot(3, 1, 3); sns.distplot(data_norm[‘MonetaryValue’])
plt.show()
#############K-MEANS CLUSTERING
from sklearn.cluster import KMeans
sse = {}
Fit KMeans and calculate SSE for each k
for k in range(1, 21):
    # Initialize KMeans with k clusters
    kmeans = KMeans(n_clusters=k, random_state=1)
    # Fit KMeans on the normalized dataset
    kmeans.fit(data_norm)
    # Assign the sum of squared distances to the k element of the dictionary
    sse[k] = kmeans.inertia_
Let’s look at the Elbow plot
plt.figure(figsize=(12,8))
plt.title(‘The Elbow Method’)
plt.xlabel(‘k’);
plt.ylabel(‘Sum of squared errors’)
sns.pointplot(x=list(sse.keys()), y=list(sse.values()))
plt.show()
Let's apply KMeans with n_clusters=3
kmeans = KMeans(n_clusters=3, random_state=1)
Compute k-means clustering on pre-processed data
kmeans.fit(data_norm)
Extract cluster labels from the labels_ attribute
cluster_labels = kmeans.labels_
Create a cluster label column in the original DataFrame
data_norm_k3 = data_norm.assign(Cluster = cluster_labels)
data_k3 = raw_data.assign(Cluster = cluster_labels)
Calculate average RFM values and size for each cluster
summary_k3 = data_k3.groupby(['Cluster']).agg({'Recency': 'mean',
                                               'Frequency': 'mean',
                                               'MonetaryValue': ['mean', 'count'],}).round(0)
summary_k3
Let’s apply Kmeans with n_clusters=4
kmeans = KMeans(n_clusters=4, random_state=1)
Compute k-means clustering on pre-processed data
kmeans.fit(data_norm)
Extract cluster labels from the labels_ attribute
cluster_labels = kmeans.labels_
Create a cluster label column in the original DataFrame
data_norm_k4 = data_norm.assign(Cluster = cluster_labels)
data_k4 = raw_data.assign(Cluster = cluster_labels)
Calculate average RFM values and size for each cluster
summary_k4 = data_k4.groupby(['Cluster']).agg({'Recency': 'mean',
                                               'Frequency': 'mean',
                                               'MonetaryValue': ['mean', 'count'],}).round(0)
summary_k4
data_norm_k4.index = data[‘Customer ID’].astype(int)
Let’s invoke melt to massage the data as follows
data_melt = pd.melt(data_norm_k4.reset_index(),
id_vars=[‘Customer ID’, ‘Cluster’],
value_vars=[‘Recency’, ‘Frequency’, ‘MonetaryValue’],
var_name=’Attribute’,
value_name='Value')
Let's look at the RFM snake plot for 4 clusters
plt.title(‘Snake plot of standardized variables’)
sns.lineplot(x="Attribute", y="Value", hue='Cluster', data=data_melt)
######RFM Data Wrangling:
data_k4.index = data[‘Customer ID’].astype(int)
raw_data.index = data[‘Customer ID’].astype(int)
cluster_avg = data_k4.groupby([‘Cluster’]).mean()
population_avg = raw_data.head().mean()
population_avg
Recency 88.400 Frequency 65.800 MonetaryValue 1224.666 dtype: float64
relative_imp = cluster_avg / population_avg - 1
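A quick numeric check of the relative-importance formula with made-up means: a cluster whose mean recency is 30 against a population mean of 60 scores -0.5, i.e. 50% below average.
cluster_mean_recency, population_mean_recency = 30.0, 60.0
print(cluster_mean_recency / population_mean_recency - 1)   # -0.5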
Let’s plot Relative importance of attributes as a sns heatmap
plt.figure(figsize=(8, 4))
plt.title(‘Relative importance of attributes’)
sns.heatmap(data=relative_imp, annot=True, fmt=’.2f’, cmap=’RdYlGn’)
plt.show()
Read more here.
Customer Segmentation, RFM, K-Means and Cohort Analysis in Online Retail
Let’s set the working directory YOURPATH
import os
os.chdir(‘YOURPATH’)
os.getcwd()
and import standard libraries with optional settings
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import scipy.stats as stats
%matplotlib inline
import statsmodels.api as sm
import statsmodels.formula.api as smf
import missingno as msno
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import scale
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import PolynomialFeatures
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import PowerTransformer
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import RobustScaler
import cufflinks as cf
import plotly.express as px
import plotly.offline
cf.go_offline()
cf.set_config_file(offline=False, world_readable=True)
import warnings
warnings.filterwarnings("ignore")
warnings.warn("this will not show")
plt.rcParams["figure.figsize"] = (10,6)
pd.set_option(‘max_colwidth’,200)
pd.set_option(‘display.max_rows’, 1000)
pd.set_option(‘display.max_columns’, 200)
pd.set_option('display.float_format', lambda x: '%.3f' % x)
import colorama
from colorama import Fore, Style
from termcolor import colored
import ipywidgets
from ipywidgets import interact
sns.set_style("whitegrid")
Let’s read the input dataset
df=pd.read_csv(‘Online Retail.csv’,index_col=0)
and define some useful functions
#######################
def missing_values(df):
missing_number = df.isnull().sum().sort_values(ascending=False)
missing_percent = (df.isnull().sum()/df.isnull().count()).sort_values(ascending=False)
missing_values = pd.concat([missing_number, missing_percent], axis=1, keys=[‘Missing_Number’, ‘Missing_Percent’])
return missing_values[missing_values['Missing_Number']>0]
def first_looking(df):
    print(colored("Shape:", attrs=['bold']), df.shape, '\n',
          colored('-'*79, 'red', attrs=['bold']), colored("\nInfo:\n", attrs=['bold']), sep='')
    print(df.info(), '\n', colored('-'*79, 'red', attrs=['bold']), sep='')
    print(colored("Number of Uniques:\n", attrs=['bold']), df.nunique(), '\n',
          colored('-'*79, 'red', attrs=['bold']), sep='')
    print(colored("Missing Values:\n", attrs=['bold']), missing_values(df), '\n',
          colored('-'*79, 'red', attrs=['bold']), sep='')
    print(colored("All Columns:", attrs=['bold']), list(df.columns), '\n',
          colored('-'*79, 'red', attrs=['bold']), sep='')
    df.columns = df.columns.str.lower().str.replace('&', '_').str.replace(' ', '_')
    print(colored("Columns after rename:", attrs=['bold']), list(df.columns), '\n',
          colored('-'*79, 'red', attrs=['bold']), sep='')
def multicolinearity_control(df):
feature =[]
collinear=[]
for col in df.corr().columns:
for i in df.corr().index:
if (abs(df.corr()[col][i])> .9 and abs(df.corr()[col][i]) < 1):
feature.append(col)
collinear.append(i)
print(colored(f”Multicolinearity alert in between:{col} – {i}”,
“red”, attrs=[‘bold’]), df.shape,’\n’,
                  colored('-'*79, 'red', attrs=['bold']), sep='')
def duplicate_values(df):
    print(colored("Duplicate check...", attrs=['bold']), sep='')
    duplicate_values = df.duplicated(subset=None, keep='first').sum()
    if duplicate_values > 0:
        df.drop_duplicates(keep='first', inplace=True)
        print(duplicate_values, colored(" Duplicates were dropped!"), '\n',
              colored('-'*79, 'red', attrs=['bold']), sep='')
    else:
        print(colored("There are no duplicates"), '\n',
              colored('-'*79, 'red', attrs=['bold']), sep='')
def drop_columns(df, drop_columns):
if drop_columns !=[]:
df.drop(drop_columns, axis=1, inplace=True)
print(drop_columns, ‘were dropped’)
else:
print(colored(‘We will now check the missing values and if necessary will drop realted columns!’, attrs=[‘bold’]),’\n’,
colored('-'*79, 'red', attrs=['bold']), sep='')
def drop_null(df, limit):
print(‘Shape:’, df.shape)
for i in df.isnull().sum().index:
if (df.isnull().sum()[i]/df.shape[0]*100)>limit:
print(df.isnull().sum()[i], ‘percent of’, i ,’null and were dropped’)
df.drop(i, axis=1, inplace=True)
print(‘new shape:’, df.shape)
print('New shape after missing value control:', df.shape)
#######################
Let’s check the structure and content of our input data
first_looking(df)
duplicate_values(df)
drop_columns(df,[])
drop_null(df, 90)
df.head()
df.tail()
df.sample(5)
df.describe().T
df.describe(include=object).TShape:(541909, 8) ------------------------------------------------------------------------------- Info: <class 'pandas.core.frame.DataFrame'> Index: 541909 entries, 0 to 541908 Data columns (total 8 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 InvoiceNo 541909 non-null object 1 StockCode 541909 non-null object 2 Description 540455 non-null object 3 Quantity 541909 non-null float64 4 InvoiceDate 541909 non-null object 5 UnitPrice 541909 non-null float64 6 CustomerID 406829 non-null float64 7 Country 541909 non-null object dtypes: float64(3), object(5) memory usage: 37.2+ MB None ------------------------------------------------------------------------------- Number of Uniques: InvoiceNo 25900 StockCode 4070 Description 4223 Quantity 722 InvoiceDate 23260 UnitPrice 1630 CustomerID 4372 Country 38 dtype: int64 ------------------------------------------------------------------------------- Missing Values: Missing_Number Missing_Percent CustomerID 135080 0.249 Description 1454 0.003 ------------------------------------------------------------------------------- All Columns:['InvoiceNo', 'StockCode', 'Description', 'Quantity', 'InvoiceDate', 'UnitPrice', 'CustomerID', 'Country'] ------------------------------------------------------------------------------- Columns after rename:['invoiceno', 'stockcode', 'description', 'quantity', 'invoicedate', 'unitprice', 'customerid', 'country'] ------------------------------------------------------------------------------- Duplicate check... 5268 Duplicates were dropped! ------------------------------------------------------------------------------- We will now check the missing values and if necessary will drop realted columns! ------------------------------------------------------------------------------- Shape: (536641, 8) New shape after missing value control: (536641, 8)
######## DATA WRANGLING
df_model = df.copy()
df.isnull().sum()
invoiceno 0 stockcode 0 description 1454 quantity 0 invoicedate 0 unitprice 0 customerid 135037 country 0 dtype: int64
We will drop NaN values of customerid
df = df.dropna(subset=[‘customerid’])
df.shape
(401604, 8)
df[‘cancelled’] = df[‘invoiceno’].str.contains(‘C’)
Let’s drop cancelled
df = df.drop(['cancelled'], axis=1)
Clean the Data from the Noise and Missing Values
df = df[(df.quantity > 0) & (df.unitprice > 0)]
df.shape
(392692, 8)
Let’s look at the treemap Number of Customers Per Country Without UK
fig = px.treemap(df.groupby(‘country’)[[“customerid”]].nunique().sort_values(by=”customerid”, ascending=False).iloc[1:],
path=[df.groupby(‘country’)[[“customerid”]].nunique().sort_values(by=”customerid”, ascending=False).iloc[1:].index],
values=’customerid’,
width=1000,
height=600)
fig.update_layout(title_text=’Number of Customers Per Country Without UK’,
title_x = 0.5, title_font = dict(size=20)
)
fig.update_layout(margin = dict(t=50, l=25, r=25, b=25))
fig.show()
#######TOTAL PRICE aka REVENUE
The calculation of total price
df['total_price'] = df['quantity'] * df['unitprice']
Let's save this dataframe for future analysis
df_total_price = pd.DataFrame(df.groupby('country')["total_price"].sum().sort_values(ascending=False))
Let's plot the treemap Total Prices By Countries Without UK
fig = px.treemap(df_total_price.iloc[1:],
path=[df_total_price.iloc[1:].index],
values=’total_price’,
width=1000,
height=600)
fig.update_layout(title_text=’Total Prices By Countries Without UK’,
title_x = 0.5, title_font = dict(size=20)
)
fig.update_layout(margin = dict(t=50, l=25, r=25, b=25))
fig.show()
#########UK RFM ANALYSIS
df_uk = df[df[“country”] == ‘United Kingdom’]
Top 15 most owned products
df_uk[“stockcode”].value_counts().head(15).plot(kind=”bar”, width=0.5, color=’pink’, edgecolor=’purple’, figsize=(16,9))
plt.xticks(rotation=45);
last_invoice = max(df_uk['invoicedate'])
last_invoice
'2011-12-09 12:49:00'
first_invoice = min(df_uk['invoicedate'])
first_invoice
'2010-12-01 08:26:00'
df_uk[‘invoicedate’] = pd.to_datetime(df_uk[‘invoicedate’])
df_uk[“date”] = df_uk[‘invoicedate’].dt.date
df_uk[“last_purchased_date”] = df_uk.groupby(“customerid”).date.transform(max)
last_invoice = pd.to_datetime(last_invoice).date()
df_uk.groupby('customerid')['last_purchased_date'].apply(lambda x: last_invoice - x)
# Recency
df_recency = df_uk.groupby(by=’customerid’,
as_index=False)[‘date’].max()
df_recency.columns = [‘customerid’, ‘last_purchased_date’]
recent_date = df_recency[‘last_purchased_date’].max()
df_recency[‘recency’] = df_recency[‘last_purchased_date’].apply(
lambda x: (recent_date – x).days)
#df_recency.head()
Let's plot recency
plt.figure(figsize=(16,9))
sns.distplot(df_recency['recency'], bins=35);
# Frequency: Number of purchases
df_uk[‘frequency’] = df_uk.groupby(‘customerid’)[“invoiceno”].transform(‘count’)
plt.figure(figsize=(16,9))
sns.distplot(df_uk['frequency'], bins=50);
########### Monetary: Total amount of money spent
df_uk[“monetary”] = df_uk.groupby(‘customerid’)[“total_price”].transform(‘sum’)
plt.figure(figsize=(16,9))
sns.distplot(df_uk['monetary'], bins=50);
############ UK RFM Table
df_uk.columns
Index(['invoiceno', 'stockcode', 'description', 'quantity', 'invoicedate', 'unitprice', 'customerid', 'country', 'total_price', 'date', 'last_purchased_date', 'frequency', 'monetary'], dtype='object')
df_recency.columns
Index(['customerid', 'last_purchased_date', 'recency'], dtype='object')
df_uk[‘recency’] = df_recency[‘recency’]
df_rfm_table = df_uk[[‘customerid’, ‘recency’, ‘frequency’, ‘monetary’]]
Customer Segmentation with RFM Scores
df_rfm_table = df_rfm_table.set_index('customerid')
df_rfm_table.reset_index().duplicated().sum()
df_rfm_table.drop_duplicates(inplace=True)
df_rfm_table.shape
(6050, 3)
########### Recency Scoring
df_rfm_table[“recency”].quantile(q = [.25,.5,.75])
0.250 21.000 0.500 58.000 0.750 163.000 Name: recency, dtype: float64
def recency_scoring(data):
if data[“recency”] <= 17.000:
return 4
elif data[“recency”] <= 50.000:
return 3
elif data[“recency”] <= 142.000:
return 2
else:
return 1
df_rfm_table['recency_quantile'] = df_rfm_table.apply(recency_scoring, axis=1)
#df_rfm_table.head()
########### Frequency Scoring
df_rfm_table[“frequency”].quantile(q = [.25,.5,.75])
0.250 23.000 0.500 65.000 0.750 147.000 Name: frequency, dtype: float64
def frequency_scoring(data):
if data.frequency <= 17.000:
return 1
elif data.frequency <= 40.000:
return 2
elif data.frequency <= 98.000:
return 3
else:
return 4
df_rfm_table['frequency_quantile'] = df_rfm_table.apply(frequency_scoring, axis=1)
#df_rfm_table.head()
########### Monetary Scoring
df_rfm_table[“monetary”].quantile(q = [.25,.5,.75])
def monetary_scoring(data):
if data.monetary <= 298.185:
return 1
elif data.monetary <= 644.975:
return 2
elif data.monetary <= 1571.285:
return 3
else:
return 4
df_rfm_table['monetary_quantile'] = df_rfm_table.apply(monetary_scoring, axis=1)
#df_rfm_table.head()
############### RFM Scoring
def rfm_scoring(data):
return str(int(data['recency_quantile'])) + str(int(data['frequency_quantile'])) + str(int(data['monetary_quantile']))
df_rfm_table['rfm_score'] = df_rfm_table.apply(rfm_scoring, axis=1)
#df_rfm_table.head()
df_rfm_table['rfm_level'] = df_rfm_table['recency_quantile'] + df_rfm_table['frequency_quantile'] + df_rfm_table['monetary_quantile']
#df_rfm_table.head()
def segments1(data):
if data[‘rfm_level’] >= 10 :
return ‘Gold’
elif (data[‘rfm_level’] >= 6) and (data[‘rfm_level’] < 10 ):
return ‘Silver’
else:
return 'Bronze'
df_rfm_table['segments1'] = df_rfm_table.apply(segments1, axis=1)
#df_rfm_table.head()
segments2 = {
‘Customer Segment’:
[‘Champions’,
‘Loyal Customers’,
‘Potential Loyalist’,
‘Recent Customers’,
‘Customers Needing Attention’,
‘Still Got Hope’,
‘Need to Get Them Back’,
‘Lost’, ‘Give it a Try’],\
‘RFM’:
[‘(3|4)-(3|4)-(3|4)’,
‘(2|3|4)-(3|4)-(1|2|3|4)’,
‘(3|4)-(2|3)-(1|2|3|4)’,
‘(4)-(1)-(1|2|3|4)’,
‘(2|3)-(2|3)-(2|3)’,
‘(2|3)-(1|2)-(1|2|3|4)’,
‘(1|2)-(3|4)-(2|3|4)’,
‘(1|2)-(1|2)-(1|2)’,
‘(1|2)-(1|2|3)-(1|2|3|4)’]
}
pd.DataFrame(segments2)
def categorizer(rfm):
    if (rfm[0] in ['3', '4']) & (rfm[1] in ['3', '4']) & (rfm[2] in ['3', '4']):
        rfm = 'Champions'
    elif (rfm[0] in ['2', '3', '4']) & (rfm[1] in ['3', '4']) & (rfm[2] in ['1', '2', '3', '4']):
        rfm = 'Loyal Customers'
    elif (rfm[0] in ['3', '4']) & (rfm[1] in ['2', '3']) & (rfm[2] in ['1', '2', '3', '4']):
        rfm = 'Potential Loyalist'
    elif (rfm[0] in ['4']) & (rfm[1] in ['1']) & (rfm[2] in ['1', '2', '3', '4']):
        rfm = 'Recent Customers'
    elif (rfm[0] in ['2', '3']) & (rfm[1] in ['2', '3']) & (rfm[2] in ['2', '3']):
        rfm = 'Customers Needing Attention'
    elif (rfm[0] in ['2', '3']) & (rfm[1] in ['1', '2']) & (rfm[2] in ['1', '2', '3', '4']):
        rfm = 'Still Got Hope'
    elif (rfm[0] in ['1', '2']) & (rfm[1] in ['3', '4']) & (rfm[2] in ['2', '3', '4']):
        rfm = 'Need to Get Them Back'
    elif (rfm[0] in ['1', '2']) & (rfm[1] in ['1', '2']) & (rfm[2] in ['1', '2']):
        rfm = 'Lost'
    elif (rfm[0] in ['1', '2']) & (rfm[1] in ['1', '2', '3']) & (rfm[2] in ['1', '2', '3', '4']):
        rfm = 'Give it a Try'
    return rfm
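A quick sanity check of the mapping above on a few hypothetical score strings (reusing the categorizer just defined):
for code in ["434", "214", "111"]:
    print(code, "->", categorizer(code))
# expected: 434 -> Champions, 214 -> Still Got Hope, 111 -> Lost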
df_rfm_table[‘segments2’] = df_rfm_table[“rfm_score”].apply(categorizer)
#df_rfm_table.head()
df_rfm_table["segments2"].value_counts(dropna=False)
segments2 Need to Get Them Back 2343 Lost 1666 Champions 660 Loyal Customers 602 Give it a Try 483 Potential Loyalist 115 Still Got Hope 93 Recent Customers 46 Customers Needing Attention 40 141 2 Name: count, dtype: int64
df_rfm_table[“segments1”].info()
<class 'pandas.core.series.Series'> Index: 6050 entries, 17850.0 to 14569.0 Series name: segments1 Non-Null Count Dtype -------------- ----- 6050 non-null object dtypes: object(1) memory usage: 94.5+ KB
Plot RFM Segments
df_plot1 = pd.DataFrame(df_rfm_table[“segments1”].value_counts(dropna=False).sort_values(ascending=False)).reset_index().rename(columns={‘index’:’Segments’, ‘segments1′:’Customers’})
df_plot1
import matplotlib.pyplot as plt
import squarify
############################# segment1
sizes=[3200, 1944, 906]
label=[“Silver”, “Bronze”, “Gold”]
color=[‘yellow’,’blue’,’orange’]
squarify.plot(sizes=sizes, label=label, color=color,alpha=0.8 , text_kwargs={‘fontsize’:16})
plt.axis(‘off’)
plt.show()
################ segment2
import matplotlib.pyplot as plt
import seaborn as sns
data = [2343, 1666, 660, 602, 483, 115]
labels = ['Need to Get Them Back', 'Lost', 'Champions', 'Loyal Customers', 'Give it a Try', 'Potential Loyalist']
colors = sns.color_palette('bright')[0:6]
plt.pie(data, labels = labels, colors = colors, autopct=’%.0f%%’)
plt.show()
Read more here.
Customer Clustering PCA, K-Means and Agglomerative for Marketing Campaigns
Let’s set the working directory
import os
os.chdir(‘YOURPATH’)
os.getcwd()
and read the dataset
import pandas as pd
import numpy as np
df = pd.read_csv('marketing_campaign.csv', sep="\t")
Get the shape of the dataframe
print("Shape of the dataframe:", df.shape)
Shape of the dataframe: (2240, 29)
Get the information about the dataframe
print(“\nInformation about the dataframe:”)
print(df.info())Information about the dataframe: <class 'pandas.core.frame.DataFrame'> RangeIndex: 2240 entries, 0 to 2239 Data columns (total 29 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 ID 2240 non-null int64 1 Year_Birth 2240 non-null int64 2 Education 2240 non-null object 3 Marital_Status 2240 non-null object 4 Income 2216 non-null float64 5 Kidhome 2240 non-null int64 6 Teenhome 2240 non-null int64 7 Dt_Customer 2240 non-null object 8 Recency 2240 non-null int64 9 MntWines 2240 non-null int64 10 MntFruits 2240 non-null int64 11 MntMeatProducts 2240 non-null int64 12 MntFishProducts 2240 non-null int64 13 MntSweetProducts 2240 non-null int64 14 MntGoldProds 2240 non-null int64 15 NumDealsPurchases 2240 non-null int64 16 NumWebPurchases 2240 non-null int64 17 NumCatalogPurchases 2240 non-null int64 18 NumStorePurchases 2240 non-null int64 19 NumWebVisitsMonth 2240 non-null int64 20 AcceptedCmp3 2240 non-null int64 21 AcceptedCmp4 2240 non-null int64 22 AcceptedCmp5 2240 non-null int64 23 AcceptedCmp1 2240 non-null int64 24 AcceptedCmp2 2240 non-null int64 25 Complain 2240 non-null int64 26 Z_CostContact 2240 non-null int64 27 Z_Revenue 2240 non-null int64 28 Response 2240 non-null int64 dtypes: float64(1), int64(25), object(3) memory usage: 507.6+ KB None
Get the summary statistics of the dataframe
print(“\nSummary statistics of the dataframe:”)
df.describe().T
############# DATA PREPARATION
Check for missing data
print(“Missing data in the dataframe:”)
print(df.isnull().sum())Missing data in the dataframe: ID 0 Year_Birth 0 Education 0 Marital_Status 0 Income 24 Kidhome 0 Teenhome 0 Dt_Customer 0 Recency 0 MntWines 0 MntFruits 0 MntMeatProducts 0 MntFishProducts 0 MntSweetProducts 0 MntGoldProds 0 NumDealsPurchases 0 NumWebPurchases 0 NumCatalogPurchases 0 NumStorePurchases 0 NumWebVisitsMonth 0 AcceptedCmp3 0 AcceptedCmp4 0 AcceptedCmp5 0 AcceptedCmp1 0 AcceptedCmp2 0 Complain 0 Z_CostContact 0 Z_Revenue 0 Response 0 dtype: int64
Let's drop rows with missing values and check for duplicates
df = df.dropna()
df.duplicated().sum()
0
########### Outliers
Calculate the IQR for the Income column
Q1 = df['Income'].quantile(0.25)
Q3 = df['Income'].quantile(0.75)
IQR = Q3 - Q1
Identify the outliers in the Income column
outliers = df[(df['Income'] < (Q1 - 1.5 * IQR)) | (df['Income'] > (Q3 + 1.5 * IQR))]
Print the number of outliers
print("Number of outliers in the Income column:", len(outliers))
Number of outliers in the Income column: 8
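Before removing them, note that the same IQR rule can be wrapped in a small reusable helper, which keeps the filtering logic in one place. This is only a sketch; the helper name remove_outliers_iqr does not appear in the original post:
# Hypothetical helper: drop rows that fall outside the IQR fences of a column
def remove_outliers_iqr(frame, column, k=1.5):
    q1 = frame[column].quantile(0.25)
    q3 = frame[column].quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    # keep only rows whose value lies inside the fences
    return frame[(frame[column] >= lower) & (frame[column] <= upper)]
# Example usage (equivalent to the inline filtering below):
# df = remove_outliers_iqr(df, 'Income')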
Remove the outliers in the Income column
df = df[~((df['Income'] < (Q1 - 1.5 * IQR)) | (df['Income'] > (Q3 + 1.5 * IQR)))]
Print the updated shape of the dataframe
print("Updated shape of the dataframe:", df.shape)
############## Feature engineering
print("Unique values in Education column:", df['Education'].unique())
print("Unique values in Marital_Status column:", df['Marital_Status'].unique())
Unique values in Education column: ['Graduation' 'PhD' 'Master' 'Basic' '2n Cycle']
Unique values in Marital_Status column: ['Single' 'Together' 'Married' 'Divorced' 'Widow' 'Alone' 'Absurd' 'YOLO']
Let’s define the following new columns
def education_level(education):
    if education in ['Graduation', 'PhD', 'Master']:
        return 'High'
    elif education in ['Basic']:
        return 'Middle'
    else:
        return 'Low'

df['Education_Level'] = df['Education'].apply(education_level)

def living_status(marital_status):
    if marital_status in ['Alone', 'Absurd', 'YOLO']:
        return 'Living Alone'
    else:
        return 'Living with Others'

df['Living_Status'] = df['Marital_Status'].apply(living_status)
df['Age'] = 2022 - df['Year_Birth']
df['Total_Campaigns_Accepted'] = df[['AcceptedCmp1', 'AcceptedCmp2', 'AcceptedCmp3', 'AcceptedCmp4', 'AcceptedCmp5']].sum(axis=1)
df['Average_Spend'] = (df[['MntWines', 'MntFruits', 'MntMeatProducts', 'MntFishProducts', 'MntSweetProducts', 'MntGoldProds']].sum(axis=1)) / df['NumDealsPurchases']
df['Spent'] = df['MntWines'] + df['MntFruits'] + df['MntMeatProducts'] + df['MntFishProducts'] + df['MntSweetProducts'] + df['MntGoldProds']
df['Is_Parent'] = (df['Kidhome'] + df['Teenhome'] > 0).astype(int)
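Note that Average_Spend above divides by NumDealsPurchases, which can be zero and then yields inf values (the post replaces them with NaN further down). A hedged alternative is to guard the division up front; this is a sketch, not the author's code:
# Sketch: treat zero deal counts as missing before dividing, so no inf values are created
spend_cols = ['MntWines', 'MntFruits', 'MntMeatProducts', 'MntFishProducts', 'MntSweetProducts', 'MntGoldProds']
deals = df['NumDealsPurchases'].replace(0, np.nan)
df['Average_Spend'] = df[spend_cols].sum(axis=1) / deals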
Create the new feature for total spending in the last 2 years
df['total_spending'] = df['MntWines'] + df['MntFruits'] + df['MntMeatProducts'] + df['MntFishProducts'] + df['MntSweetProducts'] + df['MntGoldProds']
Create new feature for average monthly visits to the company's website
df['avg_web_visits'] = df['NumWebVisitsMonth'] / 12
Create new feature for the ratio of online purchases to total purchases
df['online_purchase_ratio'] = df['NumWebPurchases'] / (df['NumWebPurchases'] + df['NumCatalogPurchases'] + df['NumStorePurchases'])
Let's drop a few columns
to_drop = ['Dt_Customer', 'Z_CostContact', 'Z_Revenue', 'Year_Birth', 'ID']
df = df.drop(to_drop, axis=1)
df.dtypes
Education object Marital_Status object Income float64 Kidhome int64 Teenhome int64 Recency int64 MntWines int64 MntFruits int64 MntMeatProducts int64 MntFishProducts int64 MntSweetProducts int64 MntGoldProds int64 NumDealsPurchases int64 NumWebPurchases int64 NumCatalogPurchases int64 NumStorePurchases int64 NumWebVisitsMonth int64 AcceptedCmp3 int64 AcceptedCmp4 int64 AcceptedCmp5 int64 AcceptedCmp1 int64 AcceptedCmp2 int64 Complain int64 Response int64 Education_Level object Living_Status object Age int64 Total_Campaigns_Accepted int64 Average_Spend float64 Spent int64 Is_Parent int32 total_spending int64 avg_web_visits float64 online_purchase_ratio float64 dtype: object
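At this point the prepared dataframe could also be handed to one of the automated EDA libraries compared at the start of this post; a quick sketch, assuming sweetviz is installed (the report file name is arbitrary):
import sweetviz as sv
campaign_report = sv.analyze(df)
campaign_report.show_html('marketing_campaign_report.html')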
########### PLOTLY EXPLORATORY DATA ANALYSIS (EDA)
import plotly.express as px
fig0 = px.histogram(df, x="Income", nbins=50)
fig0.show()
fig1 = px.histogram(df, x="Age", nbins=30, color='Age', title="Distribution of Age")
fig1.show()
fig2 = px.histogram(df, x='Marital_Status', nbins=5, title="Marital Status Distribution")
fig2.show()
fig3 = px.histogram(df, x='Education_Level', nbins=5, title="Education Level Distribution")
fig3.show()
import plotly.express as px
df_plot = df.groupby(['Marital_Status'])['Average_Spend'].mean().reset_index()
fig4 = px.bar(df_plot, x='Marital_Status', y='Average_Spend', color='Marital_Status')
fig4.show()
Marital Status vs Average Spend Distribution
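If these interactive charts need to be shared with stakeholders, each Plotly figure can also be written to a standalone HTML file; a minimal sketch using fig4 from above (the output file name is arbitrary):
# Save an interactive copy of the bar chart
fig4.write_html('marital_status_vs_average_spend.html')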
Plot the distribution of number of children in household
fig6 = px.histogram(df, x=’Kidhome’)
fig6.show()
Total Campaigns Accepted Distribution
fig8 = px.histogram(df, x=’Total_Campaigns_Accepted’, nbins=20, title=”Total Campaigns Accepted Distribution”)
fig8.show()
Average Spend per Purchase Distribution
fig9 = px.histogram(df, x=’Average_Spend’, nbins=20, title=”Average Spend per Purchase Distribution”)
fig9.show()
Spending distribution by marital status, education, and is_parent
fig10 = px.histogram(df, x=’total_spending’, color=’Marital_Status’, nbins=50,
title=’Spending Distribution by Marital Status’)
fig11 = px.histogram(df, x=’total_spending’, color=’Education_Level’, nbins=50,
title=’Spending Distribution by Education Level’)
fig12 = px.histogram(df, x=’total_spending’, color=’Is_Parent’, nbins=50,
title='Spending Distribution by Is_Parent')
fig10.show()
fig11.show()
fig12.show()
Plot the Distribution of Online Purchase Ratio
fig13 = px.histogram(df, x=’online_purchase_ratio’)
fig13.show()
Plot the Distribution of Number of Web Visits per Month
fig14 = px.histogram(df, x=’NumWebVisitsMonth’)
fig14.show()
Plot the Distribution of Number of Web Purchases
fig15 = px.histogram(df, x=’NumWebPurchases’)
fig15.show()
Plot the Distribution of Number of Catalog Purchases
fig16 = px.histogram(df, x=’NumCatalogPurchases’)
fig16.show()
Plot the Distribution of Number of Store Purchases
fig17 = px.histogram(df, x=’NumStorePurchases’)
fig17.show()
Box plot of NumWebPurchases vs NumStorePurchases
fig18 = px.box(df, x=”NumWebPurchases”, y=”NumStorePurchases”)
fig18.show()
Scatter plot of "MntWines" vs "MntSweetProducts" with a third variable represented by size or color
fig22 = px.scatter(df, x=”MntWines”, y=”MntSweetProducts”, size=”NumWebVisitsMonth”, color=”Income”, size_max=50)
fig22.show()
########### CLUSTERING ALGORITHMS
import plotly.express as px
import plotly.io as pio
from sklearn.preprocessing import StandardScaler, MinMaxScaler
######## Data Preparation
df = pd.read_csv(‘marketing_campaign.csv’, sep=”\t”)
df = df.dropna()
df.duplicated().sum()
0
Calculate the IQR for the Income column
Q1 = df['Income'].quantile(0.25)
Q3 = df['Income'].quantile(0.75)
IQR = Q3 - Q1
Identify the outliers in the Income column
outliers = df[(df['Income'] < (Q1 - 1.5 * IQR)) | (df['Income'] > (Q3 + 1.5 * IQR))]
Remove the outliers in the Income column
df = df[~((df['Income'] < (Q1 - 1.5 * IQR)) | (df['Income'] > (Q3 + 1.5 * IQR)))]
df['Age'] = 2022 - df['Year_Birth']
df['Total_Campaigns_Accepted'] = df[['AcceptedCmp1', 'AcceptedCmp2', 'AcceptedCmp3', 'AcceptedCmp4', 'AcceptedCmp5']].sum(axis=1)
df['Average_Spend'] = (df[['MntWines', 'MntFruits', 'MntMeatProducts', 'MntFishProducts', 'MntSweetProducts', 'MntGoldProds']].sum(axis=1)) / df['NumDealsPurchases']
df['Spent'] = df['MntWines'] + df['MntFruits'] + df['MntMeatProducts'] + df['MntFishProducts'] + df['MntSweetProducts'] + df['MntGoldProds']
df['Is_Parent'] = (df['Kidhome'] + df['Teenhome'] > 0).astype(int)
df['total_spending'] = df['MntWines'] + df['MntFruits'] + df['MntMeatProducts'] + df['MntFishProducts'] + df['MntSweetProducts']
df['avg_web_visits'] = df['NumWebVisitsMonth'] / 12
df['online_purchase_ratio'] = df['NumWebPurchases'] / (df['NumWebPurchases'] + df['NumCatalogPurchases'] + df['NumStorePurchases'])
to_drop = [‘Dt_Customer’, ‘Z_CostContact’, ‘Z_Revenue’, ‘Year_Birth’, ‘ID’]
df = df.drop(to_drop, axis=1)
One-hot encode the categorical variables
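The listing above mentions one-hot encoding but does not show the code; without it, the PCA step below would fail on the remaining object columns. A minimal sketch of what that step could look like, assuming Education and Marital_Status are the only object columns left after the drops:
# Sketch: one-hot encode the remaining categorical columns so every feature is numeric
cat_cols = df.select_dtypes(include='object').columns.tolist()
df = pd.get_dummies(df, columns=cat_cols, drop_first=True)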
Select the numerical columns to scale
num_cols = ['Income', 'Kidhome', 'Teenhome', 'Recency', 'MntWines', 'MntFruits',
            'MntMeatProducts', 'MntFishProducts', 'MntSweetProducts', 'MntGoldProds',
            'NumDealsPurchases', 'NumWebPurchases', 'NumCatalogPurchases',
            'NumStorePurchases', 'NumWebVisitsMonth', 'AcceptedCmp3', 'AcceptedCmp4',
            'AcceptedCmp5', 'AcceptedCmp1', 'AcceptedCmp2', 'Complain', 'Response',
            'total_spending', 'avg_web_visits', 'online_purchase_ratio', 'Age',
            'Total_Campaigns_Accepted', 'Is_Parent']
Initialize the StandardScaler
scaler = StandardScaler()
Fit the scaler to the numerical columns
df[num_cols] = scaler.fit_transform(df[num_cols])
df.isnull().sum().sum()
0
al = df['online_purchase_ratio'].mean()
print(al)
4.1834490782976914e-17
df['online_purchase_ratio'] = df['online_purchase_ratio'].fillna(0.33)
df.replace([np.inf, -np.inf], np.nan, inplace=True)
al = df['Average_Spend'].mean()
print(al)
412.04631023016225
df['Average_Spend'] = df['Average_Spend'].fillna(al)
######## K-means Clusters
from sklearn.decomposition import PCA
Initialize the PCA model
pca = PCA(n_components=8)
Fit and transform the data
df_pca = pca.fit_transform(df)
Determining the optimal number of clusters using Silhouette Score
import numpy as np
from sklearn.metrics import silhouette_score
from sklearn.cluster import KMeans

list_k = list(range(2, 10))
silhouette_scores = []
for k in list_k:
    km = KMeans(n_clusters=k)
    preds = km.fit_predict(df_pca)
    silhouette_scores.append(silhouette_score(df_pca, preds))

best_k = list_k[np.argmax(silhouette_scores)]
best_k
2
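Plotting the silhouette scores makes it easier to judge how close the candidate values of k are to each other; a short sketch that reuses list_k and silhouette_scores from the loop above:
import matplotlib.pyplot as plt
# Visual check of the silhouette scores computed above
plt.plot(list_k, silhouette_scores, marker='o')
plt.xlabel('Number of clusters k')
plt.ylabel('Silhouette score')
plt.title('Silhouette score vs number of clusters')
plt.show()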
Although the silhouette score above favors k = 2, let's fit the KMeans model with the number of clusters set to 4 to obtain a more granular segmentation
import matplotlib.pyplot as plt
from sklearn.metrics import silhouette_score
from sklearn.metrics import calinski_harabasz_score, davies_bouldin_score

kmeans = KMeans(n_clusters=4)
kmeans.fit(df_pca)
predictions = kmeans.predict(df_pca)
silhouette_score_value = silhouette_score(df_pca, predictions)
calinski_harabasz_score_value = calinski_harabasz_score(df_pca, predictions)
davies_bouldin_score_value = davies_bouldin_score(df_pca, predictions)
plt.scatter(df_pca[:, 0], df_pca[:, 1], c=predictions, cmap=’viridis’)
plt.xlabel(‘Feature 1’)
plt.ylabel(‘Feature 2’)
plt.title(‘KMeans Clustering Results in 2D\nSilhouette Score: {0:.3f}’.format(silhouette_score_value))
plt.show()
fig = plt.figure()
ax = fig.add_subplot(111, projection=’3d’)
ax.scatter(df_pca[:, 0], df_pca[:, 1], df_pca[:, 2], c=predictions, cmap=’viridis’)
ax.set_xlabel(‘Feature 1’)
ax.set_ylabel(‘Feature 2’)
ax.set_zlabel(‘Feature 3’)
ax.set_title(‘KMeans Clustering Results in 3D\nSilhouette Score: {0:.3f}’.format(silhouette_score_value))
plt.show()
####### Agglomerative Clustering
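A dendrogram is a useful companion to agglomerative clustering because it shows at which distances clusters merge; the sketch below is not part of the original walkthrough and simply reuses the PCA-reduced data:
from scipy.cluster.hierarchy import linkage, dendrogram
# Ward linkage on the PCA-reduced features, truncated so the plot stays readable
Z = linkage(df_pca, method='ward')
plt.figure(figsize=(10, 5))
dendrogram(Z, truncate_mode='lastp', p=20)
plt.title('Agglomerative clustering dendrogram (truncated)')
plt.xlabel('Merged clusters')
plt.ylabel('Distance')
plt.show()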
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
from sklearn.cluster import AgglomerativeClustering
from scipy.spatial.distance import pdist, squareform
from sklearn.metrics import davies_bouldin_score

Generate sample data
X = df_pca
Compute the pairwise distances between samples
dist_matrix = squareform(pdist(X))
Fit the Agglomerative Clustering model
agg_cluster = AgglomerativeClustering(n_clusters=4)
agg_cluster.fit(X)
Plotting
plt.scatter(X[:, 0], X[:, 1], c=agg_cluster.labels_, cmap='viridis')
plt.show()
fig = plt.figure()
ax = fig.add_subplot(111, projection=’3d’)
ax.scatter(X[:, 0], X[:, 1], X[:, 2], c=agg_cluster.labels_, cmap=’viridis’)
plt.show()
davies_bouldin_index = davies_bouldin_score(X, agg_cluster.labels_)
print("Davies-Bouldin Index:", davies_bouldin_index)
silhouette_score_value = silhouette_score(X, agg_cluster.labels_)
calinski_harabasz_score_value = calinski_harabasz_score(X, agg_cluster.labels_)
print(“Silhouette Score:”, silhouette_score_value)
print("Calinski-Harabasz Index:", calinski_harabasz_score_value)
Read more here.
Supply Chain RFM & ABC Analysis
- Let's apply RFM and ABC analysis to the DataCo Supply Chain Dataset. Recency, Frequency, and Monetary (RFM) analysis and ABC analysis highlight the most important factors in a business and suggest how to optimize future growth.
- RFM analysis supports the demand side of an organization by segmenting customers and giving the marketing and sales departments deeper insight into the customers who buy products at the store.
- ABC analysis supports the supply side of an organization by segmenting products according to the revenue they generate, which helps manage inventory more efficiently.
Let’s set the working directory
import os
os.chdir(‘YOURPATH’)
os.getcwd()
and import standard libraries
import pandas as pd
import dateutil.parser
import numpy as np
import datetime
from statsmodels.stats.outliers_influence import variance_inflation_factor
import seaborn as sns;
import matplotlib.pyplot as plt
from plotly import graph_objects as go
import plotly.express as px
Let's read the input data
dFppsc= pd.read_csv(“DataCoSupplyChainDataset.csv”, encoding_errors=”ignore”)
dFppsc.info()<class 'pandas.core.frame.DataFrame'> RangeIndex: 180519 entries, 0 to 180518 Data columns (total 53 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Type 180519 non-null object 1 Days for shipping (real) 180519 non-null int64 2 Days for shipment (scheduled) 180519 non-null int64 3 Benefit per order 180519 non-null float64 4 Sales per customer 180519 non-null float64 5 Delivery Status 180519 non-null object 6 Late_delivery_risk 180519 non-null int64 7 Category Id 180519 non-null int64 8 Category Name 180519 non-null object 9 Customer City 180519 non-null object 10 Customer Country 180519 non-null object 11 Customer Email 180519 non-null object 12 Customer Fname 180519 non-null object 13 Customer Id 180519 non-null int64 14 Customer Lname 180511 non-null object 15 Customer Password 180519 non-null object 16 Customer Segment 180519 non-null object 17 Customer State 180519 non-null object 18 Customer Street 180519 non-null object 19 Customer Zipcode 180516 non-null float64 20 Department Id 180519 non-null int64 21 Department Name 180519 non-null object 22 Latitude 180519 non-null float64 23 Longitude 180519 non-null float64 24 Market 180519 non-null object 25 Order City 180519 non-null object 26 Order Country 180519 non-null object 27 Order Customer Id 180519 non-null int64 28 order date (DateOrders) 180519 non-null object 29 Order Id 180519 non-null int64 30 Order Item Cardprod Id 180519 non-null int64 31 Order Item Discount 180519 non-null float64 32 Order Item Discount Rate 180519 non-null float64 33 Order Item Id 180519 non-null int64 34 Order Item Product Price 180519 non-null float64 35 Order Item Profit Ratio 180519 non-null float64 36 Order Item Quantity 180519 non-null int64 37 Sales 180519 non-null float64 38 Order Item Total 180519 non-null float64 39 Order Profit Per Order 180519 non-null float64 40 Order Region 180519 non-null object 41 Order State 180519 non-null object 42 Order Status 180519 non-null object 43 Order Zipcode 24840 non-null float64 44 Product Card Id 180519 non-null int64 45 Product Category Id 180519 non-null int64 46 Product Description 0 non-null float64 47 Product Image 180519 non-null object 48 Product Name 180519 non-null object 49 Product Price 180519 non-null float64 50 Product Status 180519 non-null int64 51 shipping date (DateOrders) 180519 non-null object 52 Shipping Mode 180519 non-null object dtypes: float64(15), int64(14), object(24) memory usage: 73.0+ MB
############ Data Cleaning for Future Regression Models
dFppsc.drop("Product Description", axis=1, inplace=True)
dFppsc[“order date (DateOrders)”]=pd.to_datetime(dFppsc[“order date (DateOrders)”])
dFppsc[“shipping date (DateOrders)”]=pd.to_datetime(dFppsc[“shipping date (DateOrders)”])
dFppsc = dFppsc.sort_values(by="order date (DateOrders)")
dFppsc.drop(["Benefit per order", "Sales per customer", "Order Item Cardprod Id", "Order Item Product Price", "Product Category Id", "Order Customer Id"], axis=1, inplace=True)
######### Variance Inflation Factor (VIF)
vif=pd.DataFrame()
vif[“columns”]=[‘Order Item Discount’,
‘Order Item Discount Rate’, ‘Order Item Profit Ratio’,
‘Order Item Quantity’, ‘Sales’, ‘Order Item Total’,
‘Order Profit Per Order’,’Product Price’]
vif[“vif value”] = [variance_inflation_factor(dFppsc[[‘Order Item Discount’,
‘Order Item Discount Rate’, ‘Order Item Profit Ratio’,
‘Order Item Quantity’, ‘Sales’, ‘Order Item Total’,
‘Order Profit Per Order’,’Product Price’]].values, i) for i in range(len(vif[“columns”]))]
vif.T
df1 = dFppsc.drop(["Order Item Total", "Product Price", "Order Item Discount Rate", "Order Profit Per Order"], axis=1)
vif=pd.DataFrame()
vif[“columns”]=[‘Sales’,
‘Order Item Quantity’,’Order Item Discount’,’Order Item Profit Ratio’]
vif[“data”] = [variance_inflation_factor(df1[[‘Sales’,
‘Order Item Quantity’,’Order Item Discount’,’Order Item Profit Ratio’]].values, i) for i in range(len(vif[“columns”]))]
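For reference, the variance inflation factor of feature i is VIF_i = 1 / (1 - R_i^2), where R_i^2 is the R-squared obtained by regressing feature i on all the other features in the set; values above roughly 5-10 are commonly read as strong multicollinearity, which is why the highly collinear total and price columns are dropped before the table is recomputed.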
dFppsc["Delivery Status"].unique()
array(['Advance shipping', 'Late delivery', 'Shipping on time', 'Shipping canceled'], dtype=object)
dFppsc.groupby(“Delivery Status”)[“Order Id”].count()
Delivery Status Advance shipping 41592 Late delivery 98977 Shipping canceled 7754 Shipping on time 32196 Name: Order Id, dtype: int64
dFppsc[dFppsc[“Delivery Status”]==”Shipping canceled”].groupby(“Order Status”).agg(tc=(“Order Id”,”count”))
Order Status         tc
CANCELED           3692
SUSPECTED_FRAUD    4062
dF_clean = dFppsc[dFppsc["Delivery Status"] != "Shipping canceled"].copy()
############# RFM Analysis
dF_frequency = dF_clean.groupby(["Customer Id"]).agg(Total_count=("Order Id", "nunique"))
dF_clean["Year"] = dF_clean["order date (DateOrders)"].dt.year
dF_frequency_year = dF_clean.groupby(["Customer Id", "Year"], as_index=False)["Order Id"].nunique()
dF_frequency_year = dF_frequency_year.pivot_table(index="Customer Id", columns="Year", values="Order Id", fill_value=0)
pd.set_option(“display.max_columns”,None)
dF_clean[dF_clean["Customer Id"] == 12436]
dt1 = datetime.datetime.strptime('02-10-2017 12:46:00', '%d-%m-%Y %H:%M:%S')
dt2 = dt1 - datetime.timedelta(days=5)
df_exploring = dF_clean[dF_clean["order date (DateOrders)"] >= dt2]
df_exploring.to_excel("afterdt1.xlsx")
df_exploring[df_exploring["order date (DateOrders)"] >= dt1]
7964 rows × 47 columns
dF_clean[“Customer Full Name”]=dF_clean[“Customer Fname”]+” “+dF_clean[“Customer Lname”]
datetime_val = datetime.datetime.strptime('02-10-2017 12:46:00', '%d-%m-%Y %H:%M:%S')
Customer_Names_After_Dt = dF_clean[dF_clean["order date (DateOrders)"] >= datetime_val][["Customer Full Name", "Customer City"]].drop_duplicates()
Customer_Name_Common_Before_After_dt = pd.merge(dF_clean[(dF_clean["order date (DateOrders)"] <= datetime_val)], Customer_Names_After_Dt, how="inner", on=["Customer Full Name", "Customer City"])
Records = pd.merge(Customer_Name_Common_Before_After_dt[["Customer Full Name", "Customer City"]].drop_duplicates(), dF_clean, how="inner", on=["Customer Full Name", "Customer City"])
dF_clean = dF_clean[dF_clean["order date (DateOrders)"] < datetime_val].copy()
####### FREQUENCY
dF_frequency=dF_clean[[“Customer Id”,”Order Id”]].groupby(“Customer Id”,as_index=False).nunique()
dF_frequency.columns=[“Customer Id”,”Frequency”]
sns.histplot(dF_frequency.Frequency, bins=15, kde=False)
Frequency Histogram:
dF_frequency[“F”],Intervals_Frequency=pd.qcut(dF_frequency[“Frequency”],q=3,labels=[1,2,3],retbins=True)
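Because retbins=True also returns the bin edges, it is easy to check which order counts map to F scores of 1, 2 and 3; a one-line check:
# Inspect the frequency cut points behind the F score
print(Intervals_Frequency)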
########### RECENCY
dF_recency=dF_clean[[“Customer Id”,”order date (DateOrders)”]].groupby(“Customer Id”,as_index=False).max()
dF_recency.rename(columns ={“order date (DateOrders)”:”last_purchase_date”},inplace=True)
max_date=dF_recency[“last_purchase_date”].max()
dF_recency[“recency”]=max_date-dF_recency[“last_purchase_date”]
dF_recency["recency"] = dF_recency["recency"].dt.days
sns.displot(dF_recency.recency, bins=8, kde=False)
Recency Histogram
dF_recency[‘R’],Intervals_Recency=pd.qcut(dF_recency[“recency”],q=3,labels=[3,2,1],retbins=True)
######MONETARY
dF_monetory=dF_clean[[“Customer Id”,”Order Item Total”]].groupby(“Customer Id”,as_index=False).sum()
dF_monetory.columns=[“Customer Id”,”Sum_of_Sales”]
dF_monetory["M"], Intervals_Monetory = pd.qcut(dF_monetory["Sum_of_Sales"], q=3, labels=[1, 2, 3], retbins=True)
sns.displot(dF_monetory.Sum_of_Sales, kde=True)
Monetary histogram
####RFM SCORE
dF_rfm=pd.merge(dF_recency[[“Customer Id”,”R”]],dF_monetory[[“Customer Id”,”M”]],on=”Customer Id”,how=”inner”)
dF_rfm=pd.merge(dF_rfm,dF_frequency[[“Customer Id”,”F”]],on=”Customer Id”,how=”inner”)
dF_rfm["RFM"] = (dF_rfm["R"]).astype(str) + (dF_rfm["F"]).astype(str) + (dF_rfm["M"]).astype(str)
######SEGMENT STRATEGY
Let’s read the input data
pd.set_option(‘display.max_colwidth’, None)
Segment_Strategy=pd.read_excel(“Segments_Strategy.xlsx”)
Segment_Strategy
#####Customer Segment Data preparation
def Customer_Segment(data):
    if data["R"] == 1 and data["F"] in [1, 2, 3] and (data["M"] == 3):
        return "Lost Customers - Big Spenders"
    elif data["R"] == 1 and data["F"] in [1, 2] and data["M"] in [1, 2]:
        return "Lost Customers - Bargain"
    elif data["R"] in [1, 2] and data["F"] == 3 and data["M"] in [1, 2]:
        return "Lost/Almost Lost Customers - Loyal"
    elif (data["R"] == 3) and (data["F"] == 3) and data["M"] in [1, 2]:
        return "Loyal Customers"
    elif (data["R"] == 3) and data["F"] in [3, 2] and data["M"] == 3:
        return "Big Spenders"
    elif (data["R"] == 3) and (data["F"] == 1) and data["M"] in [1, 2, 3]:
        return "New Customers"
    elif (data["R"] == 3) and (data["F"] == 2) and data["M"] in [1, 2]:
        return "Bargain Customers"
    elif (data["R"] == 2) and data["F"] == 2 and data["M"] in [1, 2]:
        return "Occasional Customers - Bargain"
    elif (data["R"] == 2) and data["F"] in [2, 3] and data["M"] == 3:
        return "Occasional Customers - Big Spenders"
    elif (data["R"] == 2) and data["F"] == 1 and data["M"] in [1, 2, 3]:
        return "Unsatisfied Customers"
    else:
        return "No Segment"

dF_rfm["R"] = dF_rfm["R"].astype("category")
dF_rfm["F"] = dF_rfm["F"].astype("category")
dF_rfm["M"] = dF_rfm["M"].astype("category")
dF_rfm["Segment"] = dF_rfm.apply(Customer_Segment, axis=1)
Segment_count = (dF_rfm.groupby("Segment", as_index=False).agg(Total_Count=("Customer Id", "count"))).sort_values(by="Total_Count", ascending=False)
Let’s plot Number of Customer in Each Segment
fig2=go.Figure()
fig2.add_trace(go.Bar(x=Segment_count.Segment,
                      y=Segment_count.Total_Count,
                      hovertemplate="%{label}<br>Number of Customers: %{value}",
                      texttemplate="%{value}"))
fig2.update_layout(title="Number of Customers in Each Segment",
                   xaxis=dict(title="Customer Segment"),
                   yaxis=dict(title="Number Of Customers"), width=800)
Let's plot Revenue Generated by Each Segment
dF_clean=pd.merge(dF_clean,dF_rfm[[“Customer Id”,”RFM”,”Segment”]],on=”Customer Id”,how=”left”)
dF_segment_revenue=dF_clean.groupby(“Segment”,as_index=False).agg(Total_Revenue=(“Order Item Total”,”sum”)).sort_values(by=”Total_Revenue”,ascending=False)
fig3=go.Figure()
fig3.add_trace(go.Bar(x=dF_segment_revenue.Segment,
                      y=dF_segment_revenue.Total_Revenue,
                      hovertemplate="%{label}<br>Revenue: %{value}",
                      texttemplate="%{value}"))
fig3.update_layout(title="Revenue Generated by Each Segment", xaxis=dict(title="Customer Segment"),
                   yaxis=dict(title="Revenue"))
######TEXT WRAPS & SEGMENT TREE MAPS
def wrap(a):
    str1 = a.split(" ")
    final_str = ""
    for c in str1:
        final_str = final_str + c + '<br>'
    return final_str

Df1 = dF_clean.groupby(["Segment", "Department Name", "Product Name"],
                       as_index=False
                       ).agg(Total_Revenue=("Order Item Total", "sum"), Total_Quantity=("Order Item Quantity", "sum"))
Df1[“Product Name”]=Df1[“Product Name”].apply(wrap) #Applying text wrap because our product names are too long
segment=[“Big Spenders”,”Bargain Customers”,”Loyal Customers”,”New Customers”]
def type1(data):
    if "Big Spenders" in data["Segment"]:
        return "Big spenders"
    elif "Bargain" in data["Segment"]:
        return "Bargain"
    elif "Loyal" in data["Segment"]:
        return "Loyal"
    elif "New" in data["Segment"]:
        return "New Customers"
    else:
        return "Unsatisfied"

Df1["Customer Type"] = Df1.apply(type1, axis=1)
data_tm=list(Df1.groupby(“Customer Type”))+[(‘All’,Df1)]
traces=[]
buttons=[]
visible=[]
for i, d in enumerate(data_tm):
    visible = [False] * len(data_tm)
    visible[i] = True
    title = d[0]
    traces.append((px.treemap(d[1],
                              path=[px.Constant("All"), "Customer Type", "Segment", "Department Name", "Product Name"],
                              values="Total_Revenue").update_traces(visible=True if i == 0 else False)).data[0])
    buttons.append(dict(label=title,
                        method="update",
                        args=[{"visible": visible},
                              {"title": f"{title}"}]))

updatemenus = [{"active": 0, "buttons": buttons}]
fig11=go.Figure(data=traces,layout=dict(updatemenus=updatemenus))
fig11.update_layout(title=data_tm[0][0],title_x=0.6)
fig11.update_coloraxes(colorbar_title_text="Total<br>Quantity")
fig11.show()
#####ABC ANALYSIS
Total_Products=dF_clean[“Product Name”].nunique()
print("Total Number of products: " + f"{Total_Products}")
Total Number of products: 100
Revenue_ABC=dF_clean.groupby([“Department Name”,”Product Name”]).agg(Total_Revenue=(“Order Item Total”,”sum”)).sort_values(by=”Total_Revenue”,ascending=False).reset_index()
Revenue_ABC[“cum_sum”]=Revenue_ABC[“Total_Revenue”].cumsum()
Revenue_ABC[“cum_per”]=Revenue_ABC[“cum_sum”]/Revenue_ABC[“Total_Revenue”].sum()*100
Revenue_ABC[“per”]=Revenue_ABC[“cum_per”]-Revenue_ABC[“cum_per”].shift(1)
Revenue_ABC.loc[0, "per"] = Revenue_ABC["cum_per"][0]

def ABC(data):
    if data["cum_per"] <= 75:
        return "A"
    elif data["cum_per"] > 75 and data["cum_per"] <= 95:
        return "B"
    elif data["cum_per"] > 95:
        return "C"

Revenue_ABC["ABC_Revenue"] = Revenue_ABC.apply(ABC, axis=1)
Bar_graph_Abc=Revenue_ABC[[“ABC_Revenue”,”Product Name”,”Total_Revenue”]].groupby(“ABC_Revenue”).agg(Revenue=(“Total_Revenue”,”sum”),count=(“Product Name”,”count”))
##########Bar_graph_Abc
fig2 = go.Figure(go.Bar(x=Bar_graph_Abc.index,
                        y=Bar_graph_Abc["Revenue"],
                        hovertemplate="%{label}<br>Revenue: %{value}",
                        texttemplate="Revenue<br>%{value}",
                        marker_color=["orange", "lightgreen", "red"],
                        showlegend=False))
fig2.add_trace(
go.Scatter(
x=Bar_graph_Abc.index,
y=Bar_graph_Abc[“count”],
name=”Number Of Products”,
mode=’lines’,
line = dict(color=’blue’, width=3),
yaxis=”y2″,
marker_line_width = 0
    ))
fig2.update_layout(
title=”Revenue Generated By Products in Different ABC Segments”,
xaxis=dict(title=”Segment” ),
yaxis=dict(title=”Revenue”,showgrid=False),
yaxis2=dict(title=”Number Of Products”, anchor=”x”, overlaying=”y”,side=”right”,dtick=10),
legend = dict(x = 1.05, y = 1))
fig2.show()
######A Segment Products in Top 6 Cities
Top6Cities=dF_clean[[“Order City”,”Order Item Total”]].groupby(“Order City”).sum().sort_values(by=”Order Item Total”,ascending=False).head(6)
data=dF_clean[dF_clean[“Order City”].isin(Top6Cities.index.tolist())].groupby([“Order City”,”Product Name”],as_index=False).agg(Total_Revenue=(“Order Item Total”,”sum”)).sort_values(by=[“Order City”,”Total_Revenue”],ascending=False)
data[“sum”]=data.groupby(“Order City”)[“Total_Revenue”].transform(‘sum’)
data[“cum_per”]=data.groupby(“Order City”)[“Total_Revenue”].cumsum()/data[“sum”]*100
data[“segment”]=data.apply(ABC,axis=1)
data[“Product Name”]=data[“Product Name”].apply(wrap)
data.sort_values(by=["sum", "Total_Revenue"], inplace=True, ascending=False)
fig = px.bar(data[data["segment"] == "A"], x='Product Name', y='Total_Revenue',
facet_col=’Order City’,facet_col_wrap=2,facet_row_spacing=0.12,
height=1500,title=”A Segment Products in Top 6 Cities”)
fig.for_each_annotation(lambda a: a.update(text=a.text.split(“=”)[-1]))
fig.update_xaxes(showticklabels=True,tickangle=0)
fig.update_layout(width=950)
fig.show()
Read more here.
Bank Churn ML Prediction
Let’s set the working directory
import os
os.chdir(‘YOURPATH’)
os.getcwd()
Basic Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
Processing libraries
from sklearn.preprocessing import MinMaxScaler
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif, chi2
Model libraries
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn import metrics
Let's read the input dataset
df = pd.read_csv(‘BankChurners.csv’)
df = df.drop(‘Naive_Bayes_Classifier_Attrition_Flag_Card_Category_Contacts_Count_12_mon_Dependent_count_Education_Level_Months_Inactive_12_mon_2’, axis = 1)
df = df.drop('Naive_Bayes_Classifier_Attrition_Flag_Card_Category_Contacts_Count_12_mon_Dependent_count_Education_Level_Months_Inactive_12_mon_1', axis = 1)
Let's check the statistics
df.describe().T
Number of churn and non-churn
counts = df.Attrition_Flag.value_counts()
perc_churn = (counts[1] / (counts[0] + counts[1])) * 100
No. of duplicates
duplicates = len(df[df.duplicated()])
No. of missing values
missing_values = df.isnull().sum().sum()
Data types in dataset
types = df.dtypes.value_counts()
print(“Churn Rate = %.1f %%”%(perc_churn))
print(‘Number of Duplicate Entries: %d’%(duplicates))
print(‘Number of Missing Values: %d’%(missing_values))
print(‘Number of Features: %d’%(df.shape[1]))
print(‘Number of Customers: %d’%(df.shape[0]))
print(‘Data Types and Frequency in Dataset:’)
print(types)
Churn Rate = 16.1 %
Number of Duplicate Entries: 0
Number of Missing Values: 0
Number of Features: 21
Number of Customers: 10127
Data Types and Frequency in Dataset:
int64      10
object      6
float64     5
Name: count, dtype: int64
####### Data Pre-processing
Make gender and outcome numerical
df[‘Gender’] = df[‘Gender’].map({‘M’: 1, ‘F’: 0})
df['Attrition_Flag'] = df['Attrition_Flag'].map({'Attrited Customer': 1, 'Existing Customer': 0})
drop client id
df = df.drop('CLIENTNUM', axis = 1)
catcols = df.select_dtypes(exclude = ['int64', 'float64']).columns
intcols = df.select_dtypes(include = ['int64']).columns
floatcols = df.select_dtypes(include = ['float64']).columns
One-hot encoding on categorical columns
df = pd.get_dummies(df, columns = catcols)
Minmax scaling numeric features
for col in df[floatcols]:
    df[col] = MinMaxScaler().fit_transform(df[[col]])
for col in df[intcols]:
    df[col] = MinMaxScaler().fit_transform(df[[col]])
print('New Number of Features: %d'%(df.shape[1]))
New Number of Features: 37
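Since MinMaxScaler scales each column independently, the two loops above can be collapsed into a single fit_transform over all numeric columns; an equivalent sketch (as an alternative to the loops, not in addition to them):
# Equivalent one-shot scaling of the numeric columns
num_cols = list(floatcols) + list(intcols)
df[num_cols] = MinMaxScaler().fit_transform(df[num_cols])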
Split into X and y
X = df.drop(‘Attrition_Flag’, axis = 1)
y = df['Attrition_Flag']
##################Correlation between numerical features
heat = df.corr()
plt.figure(figsize=[16,8])
plt.title(“Correlation between numerical features”, size = 25, pad = 20, color = ‘#8cabb6’)
sns.heatmap(heat,cmap = sns.diverging_palette(20, 220, n = 200), annot=False)
plt.show()
#######Feature Engineering
print(“Correlation Coefficient of all the Features”)
corr = df.corr()
corr.sort_values([“Attrition_Flag”], ascending = False, inplace = True)
correlations = corr.Attrition_Flag
a = correlations[correlations > 0.1]
b = correlations[correlations < -0.1]
top_corr_features = a._append(b)
top_corr_features
Correlation Coefficient of all the Features
Attrition_Flag              1.000000
Contacts_Count_12_mon       0.204491
Months_Inactive_12_mon      0.152449
Total_Amt_Chng_Q4_Q1       -0.131063
Total_Relationship_Count   -0.150005
Total_Trans_Amt            -0.168598
Avg_Utilization_Ratio      -0.178410
Total_Revolving_Bal        -0.263053
Total_Ct_Chng_Q4_Q1        -0.290054
Total_Trans_Ct             -0.371403
Name: Attrition_Flag, dtype: float64
Let’s define the following functions
def plot_importances(model, model_name, features_to_plot, feature_names):
    # fit model and performances
    model.fit(X, y)
    importances = model.feature_importances_
    # sort and rank importances
    indices = np.argsort(importances)
    best_features = np.array(feature_names)[indices][-features_to_plot:]
    values = importances[indices][-features_to_plot:]
    # plot a graph
    y_ticks = np.arange(0, features_to_plot)
    fig, ax = plt.subplots()
    ax.barh(y_ticks, values, color = '#b2c4cc')
    ax.set_yticklabels(best_features)
    ax.set_yticks(y_ticks)
    ax.set_title("%s Feature Importances"%(model_name))
    fig.tight_layout()
    plt.show()
def best_features(model, features_to_plot, feature_names):
    # get list of best features
    model.fit(X, y)
    importances = model.feature_importances_
    indices = np.argsort(importances)
    best_features = np.array(feature_names)[indices][-features_to_plot:]
    return best_features
Let’s plot feature importance weights for the following 3 models
feature_names = list(X.columns)
model1 = RandomForestClassifier(random_state = 1234)
plot_importances(model1, 'Random Forest', 10, feature_names)
model2 = GradientBoostingClassifier(n_estimators = 100, learning_rate = 1.0, max_depth = 1, random_state = 0)
plot_importances(model2, 'Gradient Boosting', 10, feature_names)
model3 = AdaBoostClassifier(n_estimators = 100, learning_rate = 1.0, random_state = 0)
plot_importances(model3, 'AdaBoost', 10, feature_names)
Looking at the F-value between label/feature for classification tasks
f_selector = SelectKBest(f_classif, k = 10)
f_selector.fit_transform(X, y)
f_selector_best = f_selector.get_feature_names_out()
print(f_selector_best)
['Gender' 'Total_Relationship_Count' 'Months_Inactive_12_mon' 'Contacts_Count_12_mon' 'Total_Revolving_Bal' 'Total_Amt_Chng_Q4_Q1' 'Total_Trans_Amt' 'Total_Trans_Ct' 'Total_Ct_Chng_Q4_Q1' 'Avg_Utilization_Ratio']
forest_best = list(best_features(model1, 10, feature_names))
XG_best = list(best_features(model2, 10, feature_names))
ada_best = list(best_features(model3, 10, feature_names))
top_corr_features = list(top_corr_features.index[1:])
f_selector_best = list(f_selector_best)
best_features_overall = forest_best + XG_best + ada_best + top_corr_features + f_selector_best
# create a dictionary with the number of times features appear
from collections import Counter
count_best_features = dict(Counter(best_features_overall))
# list of the features without any repetitions
features_no_repeats = list(dict.fromkeys(best_features_overall))
display(count_best_features)
{'Customer_Age': 3, 'Avg_Open_To_Buy': 3, 'Credit_Limit': 2, 'Total_Amt_Chng_Q4_Q1': 5, 'Avg_Utilization_Ratio': 3, 'Total_Relationship_Count': 4, 'Total_Revolving_Bal': 5, 'Total_Ct_Chng_Q4_Q1': 5, 'Total_Trans_Ct': 5, 'Total_Trans_Amt': 5, 'Months_Inactive_12_mon': 4, 'Contacts_Count_12_mon': 4, 'Gender': 1}
# get list of features with high counts in the dictionary
def get_features(threshold):
    # remove features below a certain number of appearances
    chosen_features = []
    for i in features_no_repeats:
        if count_best_features[i] > threshold:
            chosen_features.append(i)
    return chosen_features

chosen_features = get_features(2)
chosen_features.remove(‘Avg_Open_To_Buy’)
chosen_features.remove(‘Avg_Utilization_Ratio’)
chosen_features
['Customer_Age', 'Total_Amt_Chng_Q4_Q1', 'Total_Relationship_Count', 'Total_Revolving_Bal', 'Total_Ct_Chng_Q4_Q1', 'Total_Trans_Ct', 'Total_Trans_Amt', 'Months_Inactive_12_mon', 'Contacts_Count_12_mon']
######### ML Model Evaluation
def eval_model(model, model_name, X, y, threshold):
    # make X the chosen subset
    chosen_features = get_features(threshold)
    X = X[chosen_features]
    train_x, test_x, train_y, test_y = train_test_split(X, y, test_size = 0.25, random_state = 42)
    # fit model
    model.fit(train_x, train_y)
    model.score(test_x, test_y)
    pred_test = model.predict(test_x)
    # get metrics
    f1 = metrics.f1_score(test_y, pred_test)
    test_acc = metrics.accuracy_score(test_y, pred_test)
    con = metrics.confusion_matrix(test_y, pred_test)
    print(con, '%s model with %s threshold: %.4f F1-score and %.4f accuracy'%(model_name, threshold, f1, test_acc))
Let’s use the full dataset to perform train/test data splitting with test_size = 0.25
train_x, test_x, train_y, test_y = train_test_split(X, y, test_size = 0.25, random_state = 42)
####Model1 Fitting
model1.fit(train_x,train_y)
model1.score(test_x, test_y)
pred_test = model1.predict(test_x)
f1 = metrics.f1_score(test_y, pred_test)
test_acc = metrics.accuracy_score(test_y, pred_test)
con = metrics.confusion_matrix(test_y, pred_test)
print(con, f1, test_acc)
[[2094 19] [ 96 323]] 0.8488830486202367 0.9545813586097947
Let’s look at the reduced dataset and repeat model1 fitting
chosen_features = get_features(2)
chosen_features.remove(‘Avg_Open_To_Buy’)
chosen_features.remove(‘Avg_Utilization_Ratio’)
Xnew = X[chosen_features]
train_x, test_x, train_y, test_y = train_test_split(Xnew, y, test_size = 0.25, random_state = 42)
model1.fit(train_x,train_y)
model1.score(test_x, test_y)
pred_test = model1.predict(test_x)
f1 = metrics.f1_score(test_y, pred_test)
test_acc = metrics.accuracy_score(test_y, pred_test)
con = metrics.confusion_matrix(test_y, pred_test)
print(con, f1, test_acc)
[[2086 27] [ 60 359]] 0.8919254658385094 0.9656398104265402
######More ML Classifiers
from sklearn.ensemble import BaggingClassifier
from sklearn.neighbors import KNeighborsClassifier
model4 = BaggingClassifier(KNeighborsClassifier(n_neighbors = 7), max_samples = 0.8, max_features = 0.8)
eval_model(model4, 'KNN', X, y, 2)
[[2074 39] [ 176 243]] KNN model with 2 threshold: 0.6933 F1-score and 0.9151 accuracy
from sklearn.linear_model import LogisticRegression
model5 = LogisticRegression(random_state=1)
eval_model(model5, 'Logistic', X, y, 2)
[[2059 54] [ 217 202]] Logistic model with 2 threshold: 0.5985 F1-score and 0.8930 accuracy
# lists of possible parameters
n = [400, 450, 500, 550, 600, 650, 700, 750, 800, 850, 900, 950, 1000]
depth = [8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20]
rand = [600, 650, 700, 750, 800, 850, 900, 950, 1000, 1050, 1100, 1150, 1200, 1250]
forest = RandomForestClassifier(n_estimators = 100, max_depth = 15, random_state = 750)
eval_model(forest, 'forest', X, y, 2)
[[2090 23] [ 64 355]] forest model with 2 threshold: 0.8908 F1-score and 0.9656 accuracy
####Further Evaluation of Random Forest
def eval_forest(model, model_name, X, y, threshold, n, depth, rand):
    # create subset from feature selection
    chosen_features = get_features(threshold)
    chosen_features.remove('Avg_Open_To_Buy')
    chosen_features.remove('Avg_Utilization_Ratio')
    X = X[chosen_features]
    train_x, test_x, train_y, test_y = train_test_split(X, y, test_size = 0.25, random_state = 42)
    model.fit(train_x, train_y)
    model.score(test_x, test_y)
    pred_test = model.predict(test_x)
    f1 = metrics.f1_score(test_y, pred_test)
    test_acc = metrics.accuracy_score(test_y, pred_test)
    con = metrics.confusion_matrix(test_y, pred_test)
    print('Model: %s Threshold: %s F1-Score %.4f Accuracy: %.4f n_estimators: %s depth: %s rand: %s'%(model_name, threshold, f1, test_acc, n, depth, rand))
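The parameter lists n, depth and rand defined earlier suggest a manual grid search, but the loop itself is not shown in the original; the sketch below is one way it could be wired up (it is expensive, so the lists may be worth subsampling):
# Hypothetical grid search over the candidate parameter lists (not in the original post)
for n_est in n:
    for d in depth:
        for r in rand:
            candidate = RandomForestClassifier(n_estimators=n_est, max_depth=d, random_state=r)
            eval_forest(candidate, 'forest', X, y, 2, n_est, d, r)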
Let’s define the model
forest = RandomForestClassifier(n_estimators = 850, max_depth = 19, random_state = 1200)
and train/test this model using the feature subset selected above (appearance threshold of 2, with the two redundant features removed)
chosen_features = get_features(2)
chosen_features.remove(‘Avg_Open_To_Buy’)
chosen_features.remove(‘Avg_Utilization_Ratio’)
X_new = X[chosen_features]
train_x, test_x, train_y, test_y = train_test_split(X_new, y, test_size = 0.25, random_state = 42)
forest.fit(train_x,train_y)
forest.score(test_x, test_y)
pred_test = forest.predict(test_x)
f1 = metrics.f1_score(test_y, pred_test)
test_acc = metrics.accuracy_score(test_y, pred_test)
con = metrics.confusion_matrix(test_y, pred_test)
precision = metrics.precision_score(test_y, pred_test)
recall = metrics.recall_score(test_y, pred_test)
roc = metrics.roc_auc_score(test_y, pred_test)
print(‘Accuracy Score’, test_acc)
print(‘Precision’, precision)
print(‘Recall’, recall)
print(‘F1-Score’, f1)
print(‘ROC Score’, roc)
print(con)
Accuracy Score 0.9664296998420221
Precision 0.9304123711340206
Recall 0.8615751789976134
F1-Score 0.8946716232961587
ROC Score 0.9243985691485936
[[2086 27] [ 58 361]]
#####CROSS-VALIDATION
from sklearn.model_selection import cross_validate
cv_results = cross_validate(forest, X_new, y, scoring = (‘f1’, ‘accuracy’, ‘roc_auc’), cv = 8)
sorted(cv_results.keys())
['fit_time', 'score_time', 'test_accuracy', 'test_f1', 'test_roc_auc']
cv_results[‘test_roc_auc’]
array([0.91056078, 0.96866615, 0.98541168, 0.9918462 , 0.99539807, 0.99796213, 0.8989536 , 0.90523967])
cv_results[‘test_accuracy’]
array([0.88230648, 0.94233807, 0.96366509, 0.96366509, 0.96840442, 0.97630332, 0.92969984, 0.92885375])
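Summarizing the folds as a mean and standard deviation gives a single, easier-to-quote number; a short sketch using cv_results from above:
# Aggregate the 8-fold cross-validation results
print('ROC AUC : %.3f +/- %.3f' % (cv_results['test_roc_auc'].mean(), cv_results['test_roc_auc'].std()))
print('Accuracy: %.3f +/- %.3f' % (cv_results['test_accuracy'].mean(), cv_results['test_accuracy'].std()))
print('F1      : %.3f +/- %.3f' % (cv_results['test_f1'].mean(), cv_results['test_f1'].std()))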
#####ML PERFORMANCE ANALYSIS USING Scikit Plot
Import standard libraries
import scikitplot as skplt
import sklearn
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor, GradientBoostingClassifier, ExtraTreesClassifier
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
import sys
import warnings
warnings.filterwarnings("ignore")

print("Scikit Plot Version : ", skplt.__version__)
print("Scikit Learn Version : ", sklearn.__version__)
print("Python Version : ", sys.version)
%matplotlib inline
Scikit Plot Version : 0.3.7 Scikit Learn Version : 1.2.2 Python Version : 3.9.16 (main, Jan 11 2023, 16:16:36) [MSC v.1916 64 bit (AMD64)]
#####RFC Learning Curve
skplt.estimators.plot_learning_curve(forest, train_x, train_y,
cv=7, shuffle=True, scoring=”accuracy”,
n_jobs=-1, figsize=(6,4), title_fontsize=”large”, text_fontsize=”large”,
title="RFC Learning Curve");
######KNN Classification Learning Curve
skplt.estimators.plot_learning_curve(model4, train_x, train_y,
cv=7, shuffle=True, scoring=”accuracy”,
n_jobs=-1, figsize=(6,4), title_fontsize=”large”, text_fontsize=”large”,
title="KNN Classification Learning Curve");
####ETC Learning Curve
skplt.estimators.plot_learning_curve(ExtraTreesClassifier(), train_x, train_y,
cv=7, shuffle=True, scoring=”accuracy”,
n_jobs=-1, figsize=(6,4), title_fontsize=”large”, text_fontsize=”large”,
title="ETC Learning Curve");
#####GBC Learning Curve
skplt.estimators.plot_learning_curve(GradientBoostingClassifier(), train_x, train_y,
cv=7, shuffle=True, scoring=”accuracy”,
n_jobs=-1, figsize=(6,4), title_fontsize=”large”, text_fontsize=”large”,
title="GBC Learning Curve");
#######DTC Learning Curve
from sklearn.tree import DecisionTreeClassifier
skplt.estimators.plot_learning_curve(DecisionTreeClassifier(), train_x, train_y,
cv=7, shuffle=True, scoring=”accuracy”,
n_jobs=-1, figsize=(6,4), title_fontsize=”large”, text_fontsize=”large”,
title="DTC Learning Curve");
######ABC Learning Curve
from sklearn.ensemble import AdaBoostClassifier
skplt.estimators.plot_learning_curve(AdaBoostClassifier(), train_x, train_y,
cv=7, shuffle=True, scoring=”accuracy”,
n_jobs=-1, figsize=(6,4), title_fontsize=”large”, text_fontsize=”large”,
title="ABC Learning Curve");
########MLPC Learning Curve
from sklearn.neural_network import MLPClassifier
skplt.estimators.plot_learning_curve(MLPClassifier(), train_x, train_y,
cv=7, shuffle=True, scoring=”accuracy”,
n_jobs=-1, figsize=(6,4), title_fontsize=”large”, text_fontsize=”large”,
title="MLPC Learning Curve");
#######GNB Learning Curve
from sklearn.naive_bayes import GaussianNB
skplt.estimators.plot_learning_curve(GaussianNB(), train_x, train_y,
cv=7, shuffle=True, scoring=”accuracy”,
n_jobs=-1, figsize=(6,4), title_fontsize=”large”, text_fontsize=”large”,
title="GNB Learning Curve");
######LR Learning Curve
skplt.estimators.plot_learning_curve(LogisticRegression(), train_x, train_y,
cv=7, shuffle=True, scoring=”accuracy”,
n_jobs=-1, figsize=(6,4), title_fontsize=”large”, text_fontsize=”large”,
title="LR Learning Curve");
#######QDA Learning Curve
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
skplt.estimators.plot_learning_curve(QuadraticDiscriminantAnalysis(), train_x, train_y,
cv=7, shuffle=True, scoring=”accuracy”,
n_jobs=-1, figsize=(6,4), title_fontsize=”large”, text_fontsize=”large”,
title="QDA Learning Curve");
########SVC Learning Curve
from sklearn.svm import SVC
skplt.estimators.plot_learning_curve(SVC(), train_x, train_y,
cv=7, shuffle=True, scoring=”accuracy”,
n_jobs=-1, figsize=(6,4), title_fontsize=”large”, text_fontsize=”large”,
title="SVC Learning Curve");
######CALIBRATION PLOTS
lr_probas = QuadraticDiscriminantAnalysis().fit(train_x, train_y).predict_proba(test_x)
rf_probas = RandomForestClassifier().fit(train_x, train_y).predict_proba(test_x)
gb_probas = GradientBoostingClassifier().fit(train_x, train_y).predict_proba(test_x)
et_scores = AdaBoostClassifier().fit(train_x, train_y).predict_proba(test_x)

probas_list = [lr_probas, rf_probas, gb_probas, et_scores]
clf_names = ['QDA', 'RFC', 'GBC', 'ABC']
skplt.metrics.plot_calibration_curve(test_y,
probas_list,
clf_names, n_bins=15,
figsize=(12,6)
);
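Scikit-plot can also draw ROC and precision-recall curves from the same predicted probabilities; a short sketch using the random forest probabilities computed above:
# ROC and precision-recall curves for the random forest probabilities
skplt.metrics.plot_roc(test_y, rf_probas, figsize=(10, 6))
skplt.metrics.plot_precision_recall(test_y, rf_probas, figsize=(10, 6))
plt.show()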