https://hiscidatmarketradetips.blogspot.com/2022/03/content-marketing-guide-for-absolute.html
  • A Comparison of Automated EDA Tools in Python: Pandas-Profiling vs SweetViz

    A Comparison of Automated EDA Tools in Python: Pandas-Profiling vs SweetViz
    • Exploratory Data Analysis (EDA) is a crucial step in any data science project. 
    • Its primary goals include discovering patterns, identifying anomalies, and finding relationships between variables.
    • EDA can be broken down into different types of descriptive statistical analyses, such as univariate, bivariate, and multivariate data analytics.  
    • Automation can be used to conduct EDA, allowing for faster and more efficient analysis of data.
    • Data scientists can utilize automated EDA tools to accelerate the time-consuming data preparation phase, minimize ETL errors, and gain a more complete from A to Z understanding of input data.
    • This post will compare some popular Python libraries to automate EDA. The goal is to examine every aspect of the EDA procedure, including data cleaning, data visualization, and statistical analysis. Specifically, we will compare the types and quality of visualizations each tool supports.

    Table of Contents

    1. Pandas-Profiling
    2. SweetViz
    3. Other Libraries
    4. Summary
    5. Explore More
    6. References

    Pandas-Profiling

    Let’s set the working directory

    import os
    os.chdir(‘YOURPATH’)
    os. getcwd()

    import the key libraries and load the input data

    import pandas as pd
    from pandas_profiling import ProfileReport

    df = pd.read_csv(“https://people.sc.fsu.edu/~jburkardt/data/csv/airtravel.csv&#8221😉

    Let’s generate the HTML report

    profile = ProfileReport(df, title=”Pandas Profiling Report”)
    profile.to_file(“report.html”)

    ProfileReport Log

    Let’s look at the report

    Overview
    Variables

    Interactions:

    Interactions

    Correlations: Heatmap

    Correlations: Heatmap

    Correlations: Table

    Missing values count
    Missing values matrix
    Sample

    SweetViz

    Let’s import the key libraries

    import sweetviz as sv
    import pandas as pd

    and proceed with the same dataset as above, viz.

    df = pd.read_csv(“https://people.sc.fsu.edu/~jburkardt/data/csv/airtravel.csv&#8221😉

    Let’s create and display an analysis report for our data

    report = sv.analyze(df)

    Done! Use ‘show’ commands to display/save

    report.show_html()

    Report SWEETVIZ_REPORT.html was generated! NOTEBOOK/COLAB USERS: the web browser MAY not pop up, regardless, the report IS saved in your notebook/colab files.
    

    Let’s look at the report

    SweetViz report
    Associations

    Other Libraries

    Other automation libraries that can assist in EDA are as follows (cf. References):

    Dataprep, D-tale, Pandas GUI, Dabl, Bamboolib, AutoViz, Dora, Visidata, Scattertext, QuickDA, ExploriPy, Rath, and Lux.

    Summary

    • Automated EDA libraries can perform data analysis, wrangling, editing, cleaning, visualization, and ETL transformations in a few lines of Python code.
    • Pandas-Profiling creates an interactive HTML report that displays various summary statistics and visualizations of a given Pandas DataFrame.
    • SweetViz also creates an HTML report with visualizations that provide insights into the data, including distributions of features, missing values, and correlations between features (Associations).
    • The visualizations can be customized and fine-tuned as needed to best suit your EDA needs.
    • These tools are especially powerful for marketers, salespeople and BI analysts looking to perform data analysis and present their data to stakeholders. 

    Explore More

    References


    One-Time
    Monthly
    Yearly

    Make a one-time donation

    Make a monthly donation

    Make a yearly donation

    Choose an amount

    $5.00
    $15.00
    $100.00
    $5.00
    $15.00
    $100.00
    $5.00
    $15.00
    $100.00

    Or enter a custom amount

    $

    Your contribution is appreciated.

    Your contribution is appreciated.

    Your contribution is appreciated.

    DonateDonate monthlyDonate yearly
  • Exploratory Data Analysis (EDA) and NLP Visualization of Restaurant Guest Reviews on Tripadvisor – Eating Out in Europe

    Exploratory Data Analysis (EDA) and NLP Visualization of Restaurant Guest Reviews on Tripadvisor – Eating Out in Europe

    Photo by Pablo Merchán Montes on Unsplash

    Tripadvisor, Inc. is an American internet-based travel company headquartered in Needham, MA. Its branded sites and forums operate as online travel guides offering free user-generated reviews of travel-related content, tools for price comparisons, and online travel booking services. As of 2022, Tripadvisor’s total number of user reviews and ratings reached approximately one billion, covering listings for restaurants, hotels, vacation rentals, experiences, and attractions.

    Following the recent Kaggle studies, let’s examine the restaurant guest reviews on Tripadvisor (TA) in terms of their experiences in eating out in Europe.

    In this post, we will do sentiment classification of TA restaurant reviews in Europe and develop recommendation system via restaurant description,summary and user comments. 

    Goal: we will be exploring the restaurant reviews and the rating base on customer experience. 

    How Important Are Online Reviews?

    In an ultra-competitive market like Europe, a restaurant’s online reviews can mean make-or-break. Restaurant owners are fully aware of the importance of customer reviews and doing everything possible to boost their online image.

    We will answer the following relevant questions:

    • Are there differences in average ratings and number of ratings between cities?
    • Are there more vegetarian-friendly cities and if so, are they locally concentrated?
    • Is local cuisine rated better than foreign cusine and if so, is there a difference between cities?

    Disclaimer:

    Review websites such as TA are built on credibility. Biased or inflated reviews can have a detrimental effect on user trust. 

    Table of Contents

    1. About Dataset
    2. Data Preparation
    3. Basic Bar Plots
    4. Spatial Analytics
    5. Word Clouds
    6. Summary
    7. Explore More

    About Dataset

    This Kaggle dataset has been obtained by scraping TA ratings and reviews for restaurants across 31 European cities.

    The dataset contain restaurants information for 31 cities in Europe: Amsterdam (NL), Athens (GR) , Barcelona (ES) , Berlin (DE), Bratislava (SK), Bruxelles (BE), Budapest (HU), Copenhagen (DK), Dublin (IE), Edinburgh (UK), Geneva (CH), Helsinki (FI), Hamburg (DE), Krakow (PL), Lisbon (PT), Ljubljana (SI), London (UK), Luxembourg (LU), Madrid (ES), Lyon (FR), Milan (IT), Munich (DE), Oporto (PT), Oslo (NO), Paris (FR), Prague (CZ), Rome (IT), Stockholm (SE), Vienna (AT), Warsaw (PL), Zurich (CH).

     It is structured as follow:

    • Name: name of the restaurant
      • City: city location of the restaurant
      • Cuisine Style: cuisine style(s) of the restaurant, in a Python list object (94 046 non-null)
      • Ranking: rank of the restaurant among the total number of restaurants in the city as a float object (115 645 non-null)
      • Rating: rate of the restaurant on a scale from 1 to 5, as a float object (115 658 non-null)
      • Price Range: price range of the restaurant among 3 categories , as a categorical type (77 555 non-null)
      • Number of Reviews: number of reviews that customers have let to the restaurant, as a float object (108 020 non-null)
      • Reviews: 2 reviews that are displayed on the restaurants scrolling page of the city, as a list of list object where the first list contains the 2 reviews, and the second le dates when these reviews were written (115 673 non-null)
      • URL_TA: part of the URL of the detailed restaurant page that comes after ‘www.tripadvisor.com’ as a string object (124 995 non-null)
      • ID_TA: identification of the restaurant in the TA database constructed a one letter and a number (124 995 non-null).

    Data Preparation

    Let’s set the working directory, import the key libraries, load the input dataset and examine the data structure

    import os
    os.chdir(‘YOURPATH’)
    os. getcwd()

    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt
    import seaborn as sb

    import re
    from collections import Counter
    from PIL import Image

    %matplotlib inline

    df_ta_reviews = pd.read_csv(‘TA_restaurants_curated.csv’, encoding=’utf8′, index_col=0)

    df_ta_reviews.head()

    Input data table

    df_ta_reviews.shape

    (125527, 10)
    

    df_ta_reviews.info()

    <class 'pandas.core.frame.DataFrame'>
    Index: 125527 entries, 0 to 1666
    Data columns (total 10 columns):
     #   Column             Non-Null Count   Dtype  
    ---  ------             --------------   -----  
     0   Name               125527 non-null  object 
     1   City               125527 non-null  object 
     2   Cuisine Style      94176 non-null   object 
     3   Ranking            115876 non-null  float64
     4   Rating             115897 non-null  float64
     5   Price Range        77672 non-null   object 
     6   Number of Reviews  108183 non-null  float64
     7   Reviews            115911 non-null  object 
     8   URL_TA             125527 non-null  object 
     9   ID_TA              125527 non-null  object 
    dtypes: float64(3), object(7)
    memory usage: 10.5+ MB

    Let’s perform the following data wrangling:

    • Fill NaN with zero

    df_ta_reviews[‘Ranking’] = df_ta_reviews[‘Ranking’].astype(‘category’)
    df_ta_reviews[‘Number of Reviews’] = df_ta_reviews[‘Number of Reviews’].fillna(0)
    df_ta_reviews[‘Number of Reviews’] = df_ta_reviews[‘Number of Reviews’].round(0).astype(‘int’)

    • Remove duplicates

    print(df_ta_reviews[df_ta_reviews.ID_TA.duplicated() == True].ID_TA.count())
    df_ta_reviews = df_ta_reviews.drop_duplicates(‘ID_TA’, keep=’first’)
    df_ta_reviews.info()

    201
    <class 'pandas.core.frame.DataFrame'>
    Index: 125326 entries, 0 to 1666
    Data columns (total 10 columns):
     #   Column             Non-Null Count   Dtype   
    ---  ------             --------------   -----   
     0   Name               125326 non-null  object  
     1   City               125326 non-null  object  
     2   Cuisine Style      94047 non-null   object  
     3   Ranking            115710 non-null  category
     4   Rating             115734 non-null  float64 
     5   Price Range        77574 non-null   object  
     6   Number of Reviews  125326 non-null  int32   
     7   Reviews            115745 non-null  object  
     8   URL_TA             125326 non-null  object  
     9   ID_TA              125326 non-null  object  
    dtypes: category(1), float64(1), int32(1), object(7)
    memory usage: 10.0+ MB
    • Rename columns

    df_ta_reviews.rename(columns={‘Name’: ‘name’,
    ‘City’: ‘city’,
    ‘Ranking’: ‘ranking’,
    ‘Rating’: ‘rating’,
    ‘Reviews’: ‘reviews’,
    ‘Cuisine Style’:’cuisine_style’,
    ‘Price Range’:’price_range’,
    ‘Number of Reviews’:’reviews_number’}, inplace=True)

    • Fill negative ratings with zero

    print(df_ta_reviews[df_ta_reviews.rating == -1.0].city.count())
    df_ta_reviews.rating.replace(-1, 0, inplace=True)

    41
    • Create the data copy

    ta_reviews = df_ta_reviews.copy()

    • Count the unique restaurants and cities

    rint(“Single Restaurants: {}”.format(ta_reviews.shape[0]))
    print(“Cities: {}”.format(ta_reviews.city.nunique()))

    Single Restaurants: 125326
    Cities: 31

    Basic Bar Plots

    • Number of restaurants in each city

    import matplotlib
    matplotlib.rc(‘xtick’, labelsize=20)
    matplotlib.rc(‘ytick’, labelsize=20)
    fig = plt.figure(figsize=(20, 9))
    ax = plt.subplot();

    df_restaurants_in_cities = ta_reviews.groupby(‘city’).name.count().sort_values(ascending = False)

    plt.bar(x = df_restaurants_in_cities.index, height=df_restaurants_in_cities, color=”#4ECDC4″);
    plt.xticks(rotation=’vertical’);
    plt.ylabel(‘Number of Restaurants’,fontsize=20);
    ax.spines[‘right’].set_visible(False);
    ax.spines[‘top’].set_visible(False);
    ax.spines[‘left’].set_visible(False);
    plt.title(‘Number of restaurants in each city’,fontsize=20);
    ax.tick_params(direction=’out’, length=0, width=0, colors=’black’);
    plt.grid(color = ‘black’, linestyle = ‘–‘, linewidth = 0.5)

    Number of restaurants in each city
    • Number of reviews for each city

    import matplotlib
    matplotlib.rc(‘xtick’, labelsize=20)
    matplotlib.rc(‘ytick’, labelsize=20)
    df_reviews_count = ta_reviews.groupby(‘city’).reviews_number.sum().sort_values(ascending=False)
    count_millions = np.arange(0, 2.14e6, 20e4)
    count = np.arange(0, 2.6, 0.25)

    fig = plt.figure(figsize=(20, 9))
    ax = plt.subplot();

    plt.bar(x = df_reviews_count.index, height=df_reviews_count, color=”#4ECDC4″);
    plt.xticks(rotation=’vertical’);
    plt.yticks(count_millions, count);
    plt.ylabel(‘Total Number of Reviews (Millions)’,fontsize=20);
    ax.spines[‘right’].set_visible(False);
    ax.spines[‘top’].set_visible(False);
    ax.spines[‘left’].set_visible(False);
    plt.title(‘Number of reviews for each city’,fontsize=20);
    ax.tick_params(direction=’out’, length=0, width=0, colors=’black’);
    plt.grid(color = ‘black’, linestyle = ‘–‘, linewidth = 0.5)

    Number of reviews for each city
    • Average Number of reviews for each city

    import matplotlib
    matplotlib.rc(‘xtick’, labelsize=20)
    matplotlib.rc(‘ytick’, labelsize=20)
    tot_reviews_city = pd.DataFrame(ta_reviews.groupby(‘city’).reviews_number.sum())
    tot_places_city = pd.DataFrame(ta_reviews.groupby(‘city’).name.count())

    reviews_per_city = pd.merge(tot_reviews_city, tot_places_city, how=’outer’, on=’city’)
    reviews_per_city.rename(columns={‘name’:’number_of_places’}, inplace=True)
    reviews_per_city[‘avg_reviews’] = round(reviews_per_city.reviews_number / reviews_per_city.number_of_places, 2)
    reviews_per_city.sort_values(by=’avg_reviews’, ascending=False, inplace=True)

    fig = plt.figure(figsize=(20, 9))
    ax = plt.subplot();

    plt.bar(x = reviews_per_city.index, height=reviews_per_city.avg_reviews, color=”#4ECDC4″);
    plt.xticks(rotation=’vertical’,fontsize=20);
    ax.spines[‘right’].set_visible(False);
    ax.spines[‘top’].set_visible(False);
    ax.spines[‘left’].set_visible(False);
    plt.title(‘Average Number of reviews for each city’,fontsize=20);
    ax.tick_params(direction=’out’, length=0, width=0, colors=’black’);
    plt.grid(color = ‘black’, linestyle = ‘–‘, linewidth = 0.5)

    Average Number of reviews for each city
    • Let’s create the new data frame

    cuisines = ta_reviews.loc[:, [‘city’, ‘name’, ‘cuisine_style’, ‘rating’, ‘reviews_number’, ‘reviews’]]

    cuisines.cuisine_style = cuisines.cuisine_style.str.replace(‘[‘, ”)
    cuisines.cuisine_style = cuisines.cuisine_style.str.replace(‘]’, ”)

    cuisines.columns = [‘city’, ‘place’, ‘cuisine_style’, ‘rating’, ‘reviews_number’, ‘reviews’]

    all_cuisines = cuisines[cuisines.cuisine_style.isna() == False]

    dic = []

    for i in all_cuisines.iterrows():

    for j in range(0, len(i[1].cuisine_style.split(', '))):
        dic.append({
            'city': i[1].city,
            'place': i[1].place,
            'cuisine_style': i[1].cuisine_style.split(', ')[j].replace('\'', ''),
            'rating': i[1].rating,
            'reviews_number': i[1].reviews_number
        })
    

    cuisines_list = pd.DataFrame(data=dic)

    cuisines_list.cuisine_style.nunique()

    127
    • Top 10 most common cuisines

    import matplotlib
    matplotlib.rc(‘xtick’, labelsize=20)
    matplotlib.rc(‘ytick’, labelsize=20)

    df_cuisine_style = cuisines_list.cuisine_style.value_counts().sort_values(ascending = False)[:10]

    count_ths = np.arange(0, 3.3e4, 5e3)
    count = np.arange(0, 35, 5)

    fig = plt.figure(figsize=(20, 9))
    ax = plt.subplot();

    plt.bar(x = df_cuisine_style.index, height=df_cuisine_style, color=”#4ECDC4″);

    plt.yticks(count_ths, count,fontsize=20);
    plt.ylabel(‘Total Places (Thousands)’,fontsize=20);
    ax.spines[‘right’].set_visible(False);
    ax.spines[‘top’].set_visible(False);
    ax.spines[‘left’].set_visible(False);
    plt.title(‘Top 10 most common cuisines’,fontsize=20);
    ax.tick_params(direction=’out’, length=0, width=0, colors=’black’);
    plt.xticks(rotation=’vertical’,fontsize=20);
    plt.grid(color = ‘black’, linestyle = ‘–‘, linewidth = 0.5)

    Top 10 most common cuisines
    • Top 10 most reviewed cuisines

    df_cuisine_style = cuisines_list.groupby(‘cuisine_style’).reviews_number.sum().sort_values(ascending=False)[:10]

    count_ths = np.arange(0, 9.3e6, 5e5)
    count = np.arange(0, 9.3, 0.5)

    fig = plt.figure(figsize=(20, 9))
    ax = plt.subplot();

    plt.bar(x = df_cuisine_style.index, height=df_cuisine_style, color=”#4ECDC4″);

    plt.yticks(count_ths, count);
    plt.ylabel(‘Total Reviews (Million)’,fontsize=20);
    ax.spines[‘right’].set_visible(False);
    ax.spines[‘top’].set_visible(False);
    ax.spines[‘left’].set_visible(False);
    plt.title(‘Top 10 most reviewed cuisines’,fontsize=20);
    ax.tick_params(direction=’out’, length=0, width=0, colors=’black’);

    plt.xticks(rotation=’vertical’,fontsize=20);

    Top 10 most reviewed cuisines
    • Cuisine styles per city

    import matplotlib
    matplotlib.rc(‘xtick’, labelsize=20)
    matplotlib.rc(‘ytick’, labelsize=20)

    city_styles = cuisines_list.groupby([‘city’, ‘cuisine_style’]).rating.count().reset_index()
    city_variety = city_styles.groupby(‘city’).cuisine_style.count().sort_values(ascending = False)

    fig = plt.figure(figsize=(20, 13))
    ax = plt.subplot();

    ax.barh(city_variety.index[::-1],city_variety.values[::-1], align=’center’, color=”#4ECDC4″);
    ax.spines[‘right’].set_visible(False);
    ax.spines[‘top’].set_visible(False);
    ax.spines[‘left’].set_visible(False);
    plt.title(‘Cuisine styles per city’,fontsize=20);

    Cuisine styles per city

    Spatial Analytics

    • Southern Europe Favourite Cuisine Styles

    from matplotlib import cm

    euro = np.array(Image.open(‘europeblue.png’))
    grn = [‘#006400’, ‘#007200’, ‘#007200’, ‘#008000’, ‘#008000’,
    ‘#38b000’, ‘#38b000’, ‘#70e000’, ‘#9ef01a’, ‘#9ef01a’, ]

    orng = [‘#FF5400’, ‘#FF6000’, ‘#FF6D00’, ‘#FF6D00’, ‘#FF8500’,
    ‘#FF9100’, ‘#FF9E00’, ‘#FFAA00’, ‘#FFB600’, ‘#FFB600’, ]

    plt.figure(figsize = (18, 10), facecolor = None) ;
    ax = plt.subplot();
    plt.title(‘Southern Europe Favourite Cuisine Styles’)
    plt.imshow(euro);

    plt.text(
    300, 30, ‘Most Reviewed Cuisine Styles’,
    ha=”center”, va=”center”, rotation=0, size=15, color = “white”,
    bbox=dict(boxstyle=”square,pad=0.3″, fc=’#004b23′, ec=”w”, lw=1));

    top = 40
    left = 0
    stili = reviews_south.cuisine_style.to_list()
    i = 0
    for styl in stili:
    styl_score = styl + ” ” + str(reviews_south[reviews_south.cuisine_style == styl].revs.item())
    top = top + 40
    plt.text(
    300, top, styl_score, ha=”center”, va=”center”, rotation=0, size=15, color = “white”,
    bbox=dict(boxstyle=”square,pad=0.3″, fc=grn[i], ec=”w”, lw=1))
    i += 1

    plt.text(
    520, 40, ‘Best Rated Cuisine Styles’,
    ha=”center”, va=”center”, rotation=0, size=15, color = “white”,
    bbox=dict(boxstyle=”square,pad=0.3″, fc=’#ff4800′, ec=”w”, lw=1));

    top = 40
    left = 520
    rating_1000 = rating_south[rating_south.rat_count >= 3000]
    rating_1000 = rating_1000.sort_values(by=’rat_mean’, ascending = False)[:10]

    stili = rating_1000.cuisine_style.to_list()
    i = 0
    for styl in stili:
    styl_score = styl + ” ” + str(round(rating_1000[rating_1000.cuisine_style == styl].rat_mean.item(), 2))
    top = top + 40
    plt.text(
    left, top, styl_score, ha=”center”, va=”center”, rotation=0, size=15, color = “white”,
    bbox=dict(boxstyle=”square,pad=0.3″, fc=orng[i], ec=’w’, lw=1))
    i += 1

    ax.spines[‘right’].set_visible(False);
    ax.spines[‘top’].set_visible(False);
    ax.spines[‘left’].set_visible(False);
    ax.spines[‘bottom’].set_visible(False);
    ax.yaxis.set_visible(False);
    ax.xaxis.set_visible(False);

    Southern Europe Favourite Cuisine Styles
    • Folium map of European cities

    import folium
    import base64
    from folium import IFrame

    def get_flag(city):
    flag = map_cities.get(city)
    flag_url = flag[‘flag’]
    return(flag_url)

    def get_top_options2(city):
    opt = top_options[top_options.index.get_level_values(‘city’).isin([city])][:3]
    opt = opt.reset_index()
    opt = opt.sort_values(by=[“rating”, “reviews_number”], ascending=[False, False])
    top_3 = “”
    icon_class = “”
    for i in opt.iterrows():
    if (i[1][‘cuisine_style’] == “Gluten Free Options”) or (i[1][‘cuisine_style’] == “Vegan Options”):
    icon_class = “fa-pagelines”
    elif i[1][‘cuisine_style’] == “Vegetarian Friendly”:
    icon_class = “fa-tree”
    else:
    icon_class = “fa-globe”
    top_3 += “

     ” + i[1][‘cuisine_style’] + ” ” + str(i[1][‘rating’]) + ” (” + str(i[1][‘reviews_number’]) + ” reviews)”
    return(top_3)
    europe = folium.Map(
    location=[52.4214, 8.3750],
    tiles = “OpenStreetMap”,
    zoom_start=4
    )

    for k, v in map_cities.items():
    flag = get_flag(k)
    html = “” html += “

    {}

    {}

    “.format(k, flag, get_top_options2(k)) html += “”
    iframe = folium.IFrame(html, width=(300)+20, height=(110)+20)
    popup = folium.Popup(iframe, max_width=1000)

    folium.Marker(location =[v['lat'], v['lon']], 
                popup=popup,
                icon = folium.Icon(color='darkpurple', icon='fa-cutlery', prefix='fa')
    ).add_to(europe)    
    

    europe

    Folium map of European cities

    Word Clouds

    words_cuisines = cuisines.copy()
    words_cuisines[(words_cuisines.reviews.isna()) | (words_cuisines.reviews == ‘[[], []]’)].info()

    <class 'pandas.core.frame.DataFrame'>
    Index: 28622 entries, 1361 to 1666
    Data columns (total 6 columns):
     #   Column          Non-Null Count  Dtype  
    ---  ------          --------------  -----  
     0   city            28622 non-null  object 
     1   place           28622 non-null  object 
     2   cuisine_style   14754 non-null  object 
     3   rating          19036 non-null  float64
     4   reviews_number  28622 non-null  int32  
     5   reviews         19041 non-null  object 
    dtypes: float64(1), int32(1), object(4)
    memory usage: 1.4+ MB

    all_revs = []
    nan = 0
    for ind, rev in words_cuisines[‘reviews’].items():
    if type(rev) != float:
    # the string is in a format [[reviews separated by comma], [review dates separated by comma]]
    # in order to only keep the reviews, i split by ‘], [‘
    g = rev.split(‘], [‘)
    # and remove the first two chars at the beginning
    g = g[0][2:].replace(‘\”, “”)
    else:
    g = “”
    nan +=1
    all_revs.append(g)

    words_cuisines[‘reviews_only’] = all_revs
    words_cuisines = words_cuisines.drop(‘reviews’, axis = 1)

    from wordcloud import WordCloud, STOPWORDS

    stopwords = set(STOPWORDS)

    def words_cloud(words_list):
    all_words = ”
    #looping through all incidents and joining them to one text, to extract most common words
    for ind, wrds in words_list.iteritems():
    tokens = wrds.split()
    all_words += ” “.join(tokens) + ” “
    return all_words

    def words_freq(items):
    all_words = []
    for i, item in items.items():
    item = item.lower()
    item = re.sub(“[^a-zA-Z]”, ” “, item)
    #keep numbers
    item = re.sub(“[^a-zA-Z0-9\s]+”, ” “, item)
    all_words += item.split()

    words = Counter(all_words)   
    return words
    

    cloudmask = np.array(Image.open(‘europe_black.png’))
    cloudmask = np.array(euro)

    words = words_freq(words_cuisines[‘reviews_only’])
    wordcloud = WordCloud(width = 1000, height = 700,
    stopwords = stopwords, mask = cloudmask,
    collocations = False,
    background_color =’white’,
    min_font_size = 6).generate_from_frequencies(words)

    plt.figure(figsize = (20, 9), facecolor = None)
    plt.imshow(wordcloud)
    plt.axis(“off”)
    plt.tight_layout(pad = 0)

    plt.show()

    Wordcloud of reviews

    Summary

    • Food has been viewed as one of the noteworthy features influencing customer satisfaction in restaurant evaluations.
    • Data and analytics are valuable for increasing revenue and improving operations.
    • We utilize data to tailor marketing, optimize pricing and menu offerings, and make informed operational changes.
    • In this study, we used the TripAdvisor Restaurants Info for 31 Euro-Cities from Kaggle.
    • Online restaurant reviews retrieved from TripAdvisor have been analyzed by using domain-specific aspect-based sentiment detection and NLP visualization.
    • The influence of different restaurant features on the overall evaluation of visitors and locals has been determined and the most critical factors are identified by the frequency of their online discussion.
    • As it can clearly be seen, there is a difference in average ratings by city.
    • There are also great differences in average number of restaurants with vegetarian/vegan option available.
    • Local foods and meals provided by restaurants have been recognized a critical part of customer satisfaction.

    Explore More


    One-Time
    Monthly
    Yearly

    Make a one-time donation

    Make a monthly donation

    Make a yearly donation

    Choose an amount

    $5.00
    $15.00
    $100.00
    $5.00
    $15.00
    $100.00
    $5.00
    $15.00
    $100.00

    Or enter a custom amount

    $

    Your contribution is appreciated.

    Your contribution is appreciated.

    Your contribution is appreciated.

    DonateDonate monthlyDonate yearly
  • Improved Multiple-Model ML/DL Credit Card Fraud Detection: F1=88% & ROC=91%

    Improved Multiple-Model ML/DL Credit Card Fraud Detection: F1=88% & ROC=91%

    Photo by CardMapr.nl on Unsplash

    • In 2023, global card industry losses to fraud are projected to reach $36.13 bln, or $0.0668 per $100 of volume.
    • Enhancing credit card fraud detection is therefore a top priority for all banks and financial organisations.
    • Thanks to AI-based techniques, credit card fraud detection (CCFD) is becoming easier and more accurate. AI models can recognise unusual credit card transactions and fraud. CCFD involves collecting and sorting raw data, which is then used to train the model to predict the probability of fraud.
    • AI allows for creating algorithms that process large datasets with many variables and help find these hidden correlations between user behavior and the likelihood of fraudulent actions. Another strength of AI systems compared to rule-based ones is faster data processing and less manual work.
    • Inspired by the recent study, our Python CCFD AI project aims to create an improved binary classifier that can recognize fraudulent credit card transactions with a high degree of confidence.
    • Specifically, we have conducted a comparative analysis of the available supervised Machine Learning (ML) and Deep Learning (DL) techniques for CCFD.
    • In this post, we will discuss a multiple-model ML/DL approach that closely resembles ensemble learning. An ensemble learning method involves combining the predictions from multiple contributing models.
    • In addition, we will compare multiple techniques to determine the best performing model in detecting fraudulent transactions in terms of scikit-learn metrics and scoring to quantify the quality of CCFD predictions.
    • Conventionally, we use the open-source Kaggle dataset that contains transactions made by credit cards in September 2013 by European cardholders. These are anonymized credit card transactions labeled as fraudulent (1) or genuine (0). The dataset is highly unbalanced, the positive class (frauds) account for 0.172% of all transactions.
    • Read more: Dal Pozzolo, Andrea; Boracchi, Giacomo; Caelen, Olivier; Alippi, Cesare; Bontempi, Gianluca. Credit card fraud detection: a realistic modeling and a novel learning strategy, IEEE transactions on neural networks and learning systems,29,8,3784-3797,2018,IEEE

    Clickable Table of Contents

    1. Data Preparation & Exploratory Analysis
    2. Keras ANN Model Training & Validation
    3. Multiple-Model Training & Validation
    4. Multiple-Model ML/DL F1-Score Comparison
    5. Best ML/DL Model Performance QC Analysis
    6. K-Means PCA Clusters & Variance Analysis
    7. Final XGBoost Classification Report
    8. Summary
    9. Explore More

    Data Preparation & Exploratory Analysis

    Let’s set the working directory

    import os
    os.chdir(‘YOURPATH’)
    os. getcwd()

    and import the necessary packages

    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt
    import seaborn as sns

    %matplotlib inline
    sns.set_style(“whitegrid”)

    Let’s load the dataset from the csv file using Pandas

    data = pd.read_csv(“creditcard.csv”)
    data.head()

    Input data table

    (5 rows × 31 columns)

    Here, features V1, V2, … V28 are the principal components obtained with PCA, the only features which have not been transformed with PCA are ‘Time’ and ‘Amount’. Feature ‘Time’ contains the seconds elapsed between each transaction and the first transaction in the dataset. The feature ‘Amount’ is the transaction Amount, this feature can be used for example-dependant cost-sensitive learning. Feature ‘Class’ is the response variable and it takes value 1 in case of fraud and 0 otherwise.

     Describing the Input Data:

    data.info()

    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 284807 entries, 0 to 284806
    Data columns (total 31 columns):
     #   Column  Non-Null Count   Dtype  
    ---  ------  --------------   -----  
     0   Time    284807 non-null  float64
     1   V1      284807 non-null  float64
     2   V2      284807 non-null  float64
     3   V3      284807 non-null  float64
     4   V4      284807 non-null  float64
     5   V5      284807 non-null  float64
     6   V6      284807 non-null  float64
     7   V7      284807 non-null  float64
     8   V8      284807 non-null  float64
     9   V9      284807 non-null  float64
     10  V10     284807 non-null  float64
     11  V11     284807 non-null  float64
     12  V12     284807 non-null  float64
     13  V13     284807 non-null  float64
     14  V14     284807 non-null  float64
     15  V15     284807 non-null  float64
     16  V16     284807 non-null  float64
     17  V17     284807 non-null  float64
     18  V18     284807 non-null  float64
     19  V19     284807 non-null  float64
     20  V20     284807 non-null  float64
     21  V21     284807 non-null  float64
     22  V22     284807 non-null  float64
     23  V23     284807 non-null  float64
     24  V24     284807 non-null  float64
     25  V25     284807 non-null  float64
     26  V26     284807 non-null  float64
     27  V27     284807 non-null  float64
     28  V28     284807 non-null  float64
     29  Amount  284807 non-null  float64
     30  Class   284807 non-null  int64  
    dtypes: float64(30), int64(1)
    memory usage: 67.4 MB

    pd.set_option(“display.float”, “{:.2f}”.format)
    data.describe().T

    Descriptive statistics of input data

    Let us now check the missing values in the dataset

    data.isnull().sum().sum()

    0

    data.columns

    Index(['Time', 'V1', 'V2', 'V3', 'V4', 'V5', 'V6', 'V7', 'V8', 'V9', 'V10',
           'V11', 'V12', 'V13', 'V14', 'V15', 'V16', 'V17', 'V18', 'V19', 'V20','V21', 'V22', 'V23', 'V24', 'V25', 'V26', 'V27', 'V28', 'Amount',
           'Class'], dtype='object')

    Let’s look at the Transaction Class Distribution

    LABELS = [“Normal”, “Fraud”]

    count_classes = pd.value_counts(data[‘Class’], sort = True)
    count_classes.plot(kind = ‘bar’, rot=0)
    plt.title(“Transaction Class Distribution”)
    plt.xticks(range(2), LABELS)
    plt.xlabel(“Class”)
    plt.ylabel(“Frequency”);

    Transaction class distribution

    data.Class.value_counts()

    Class
    0    284315
    1       492
    Name: count, dtype: int64
    

    data.Class.value_counts(normalize=True).mul(100).round(3).astype(str) + ‘%’

    Class
    0    99.827%
    1     0.173%
    Name: proportion, dtype: object

    It appears that most of the transactions are non-fraud. Only 0.173% fraudulent transaction out all the transactions. The data is highly unbalanced.

    Let’s separate fraud and normal data

    fraud = data[data[‘Class’]==1]
    normal = data[data[‘Class’]==0]

    print(f”Shape of Fraudulent transactions: {fraud.shape}”)
    print(f”Shape of Non-Fraudulent transactions: {normal.shape}”)

    Shape of Fraudulent transactions: (492, 31)
    Shape of Non-Fraudulent transactions: (284315, 31)

    Let’s compare Amount fraud vs normal data

    pd.concat([fraud.Amount.describe(), normal.Amount.describe()], axis=1)

    Descriptive statistics Amount fraud vs normal data

    Let’s compare Time fraud vs normal data

    pd.concat([fraud.Time.describe(), normal.Time.describe()], axis=1)

    Descriptive statistics Time fraud vs normal data
    Let's look at the feature heatmap to find any high correlations

    plt.figure(figsize=(10,10))
    sns.heatmap(data=data.corr(), cmap=”seismic”)
    plt.show();

     Feature correlation matrix heatmap

    The correlation matrix graphically gives us an idea of how features correlate with each other and can help us predict what are the features that are most relevant for the prediction. It is clear that most of the features do not correlate to other features but there are some features that either has a positive or a negative correlation with each other. For example, V2 and V5 are highly negatively correlated with the Amount

    Finally, let’s divide the data into features and target variables. In addition, we will be dividing the dataset into two main groups: one for training the model and the other for testing our trained model’s performance:

    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import StandardScaler

    scalar = StandardScaler()

    X = data.drop(‘Class’, axis=1)
    y = data.Class

    X_train_v, X_test, y_train_v, y_test = train_test_split(X, y,
    test_size=0.3, random_state=42)
    X_train, X_validate, y_train, y_validate = train_test_split(X_train_v, y_train_v,
    test_size=0.2, random_state=42)

    X_train = scalar.fit_transform(X_train)
    X_validate = scalar.transform(X_validate)
    X_test = scalar.transform(X_test)

    w_p = y_train.value_counts()[0] / len(y_train)
    w_n = y_train.value_counts()[1] / len(y_train)

    print(f”Fraudulant transaction weight: {w_n}”)
    print(f”Non-Fraudulant transaction weight: {w_p}”)

    Fraudulant transaction weight: 0.0017994745785028623
    Non-Fraudulant transaction weight: 0.9982005254214972

    print(f”TRAINING: X_train: {X_train.shape}, y_train: {y_train.shape}\n{‘’55}”) print(f”VALIDATION: X_validate: {X_validate.shape}, y_validate: {y_validate.shape}\n{”50}”)
    print(f”TESTING: X_test: {X_test.shape}, y_test: {y_test.shape}”)

    TRAINING: X_train: (159491, 30), y_train: (159491,)
    _______________________________________________________
    VALIDATION: X_validate: (39873, 30), y_validate: (39873,)
    __________________________________________________
    TESTING: X_test: (85443, 30), y_test: (85443,)

    We will need the following function to print the model score

    from sklearn.metrics import accuracy_score, confusion_matrix, classification_report, f1_score

    def print_score(label, prediction, train=True):
    if train:
    clf_report = pd.DataFrame(classification_report(label, prediction, output_dict=True))
    print(“Train Result:\n================================================”)
    print(f”Accuracy Score: {accuracy_score(label, prediction) * 100:.2f}%”)
    print(“___________________________________“)
    print(f”Classification Report:\n{clf_report}”)
    print(“___________________________________“)
    print(f”Confusion Matrix: \n {confusion_matrix(y_train, prediction)}\n”)

    elif train==False:
        clf_report = pd.DataFrame(classification_report(label, prediction, output_dict=True))
        print("Test Result:\n================================================")        
        print(f"Accuracy Score: {accuracy_score(label, prediction) * 100:.2f}%")
        print("_______________________________________________")
        print(f"Classification Report:\n{clf_report}")
        print("_______________________________________________")
        print(f"Confusion Matrix: \n {confusion_matrix(label, prediction)}\n") 
    

    Keras ANN Model Training & Validation

    Let’s define the following Artificial Neural Network (ANN) model

    from tensorflow import keras

    model = keras.Sequential([
    keras.layers.Dense(256, activation=’relu’, input_shape=(X_train.shape[-1],)),
    keras.layers.BatchNormalization(),
    keras.layers.Dropout(0.3),
    keras.layers.Dense(256, activation=’relu’),
    keras.layers.BatchNormalization(),
    keras.layers.Dropout(0.3),
    keras.layers.Dense(256, activation=’relu’),
    keras.layers.BatchNormalization(),
    keras.layers.Dropout(0.3),
    keras.layers.Dense(1, activation=’sigmoid’),
    ])

    model.summary()

    Model: "sequential"
    _________________________________________________________________
     Layer (type)                Output Shape              Param #   
    =================================================================
     dense (Dense)               (None, 256)               7936      
                                                                     
     batch_normalization (BatchN  (None, 256)              1024      
     ormalization)                                                   
                                                                     
     dropout (Dropout)           (None, 256)               0         
                                                                     
     dense_1 (Dense)             (None, 256)               65792     
                                                                     
     batch_normalization_1 (Batc  (None, 256)              1024      
     hNormalization)                                                 
                                                                     
     dropout_1 (Dropout)         (None, 256)               0         
                                                                     
     dense_2 (Dense)             (None, 256)               65792     
                                                                     
     batch_normalization_2 (Batc  (None, 256)              1024      
     hNormalization)                                                 
                                                                     
     dropout_2 (Dropout)         (None, 256)               0         
                                                                     
     dense_3 (Dense)             (None, 1)                 257       
                                                                     
    =================================================================
    Total params: 142,849
    Trainable params: 141,313
    Non-trainable params: 1,536

    Let’s compile and train this model

    METRICS = [

    keras.metrics.FalseNegatives(name='fn'),
    keras.metrics.FalsePositives(name='fp'),
    keras.metrics.TrueNegatives(name='tn'),
    keras.metrics.TruePositives(name='tp'),
    keras.metrics.Precision(name='precision'),
    keras.metrics.Recall(name='recall')
    

    ]

    model.compile(optimizer=keras.optimizers.Adam(1e-4), loss=’binary_crossentropy’, metrics=METRICS)

    callbacks = [keras.callbacks.ModelCheckpoint(‘fraud_model_at_epoch_{epoch}.h5’)]
    class_weight = {0:w_p, 1:w_n}

    r = model.fit(
    X_train, y_train,
    validation_data=(X_validate, y_validate),
    batch_size=2048,
    epochs=300,

    callbacks=callbacks,
    

    )

    Epoch 1/300
    78/78 [==============================] - 4s 39ms/step - loss: 0.8119 - fn: 68.0000 - fp: 74850.0000 - tn: 84354.0000 - tp: 219.0000 - precision: 0.0029 - recall: 0.7631 - val_loss: 0.6675 - val_fn: 7.0000 - val_fp: 11144.0000 - val_tn: 28660.0000 - val_tp: 62.0000 - val_precision: 0.0055 - val_recall: 0.8986
    ..........................................
    Epoch 300/300
    78/78 [==============================] - 3s 36ms/step - loss: 8.1106e-04 - fn: 29.0000 - fp: 9.0000 - tn: 159195.0000 - tp: 258.0000 - precision: 0.9663 - recall: 0.8990 - val_loss: 0.0059 - val_fn: 14.0000 - val_fp: 8.0000 - val_tn: 39796.0000 - val_tp: 55.0000 - val_precision: 0.8730 - val_recall: 0.7971

    Let’s plot the keras.metrics vs Epochs

    plt.figure(figsize=(12, 16))

    plt.subplot(4, 2, 1)
    plt.plot(r.history[‘loss’], label=’Loss’)
    plt.plot(r.history[‘val_loss’], label=’val_Loss’)
    plt.title(‘Loss Function evolution during training’)
    plt.legend()

    plt.subplot(4, 2, 2)
    plt.plot(r.history[‘fn’], label=’fn’)
    plt.plot(r.history[‘val_fn’], label=’val_fn’)
    plt.title(‘Accuracy evolution during training’)
    plt.legend()

    plt.subplot(4, 2, 3)
    plt.plot(r.history[‘precision’], label=’precision’)
    plt.plot(r.history[‘val_precision’], label=’val_precision’)
    plt.title(‘Precision evolution during training’)
    plt.legend()

    plt.subplot(4, 2, 4)
    plt.plot(r.history[‘recall’], label=’recall’)
    plt.plot(r.history[‘val_recall’], label=’val_recall’)
    plt.title(‘Recall evolution during training’)
    plt.legend()
    plt.savefig(‘lossval.png’)

    Keras metrics: loss, accuracy, precision, and recall.

    Given the above trained model, let’s predict labels of the train/test data

    y_train_pred = model.predict(X_train)
    y_test_pred = model.predict(X_test)

    print_score(y_train, y_train_pred.round(), train=True)
    print_score(y_test, y_test_pred.round(), train=False)

    scores_dict = {
    ‘ANNs’: {
    ‘Train’: f1_score(y_train, y_train_pred.round()),
    ‘Test’: f1_score(y_test, y_test_pred.round()),
    },
    }

    4985/4985 [==============================] - 4s 744us/step
    2671/2671 [==============================] - 2s 783us/step
    Train Result:
    ================================================
    Accuracy Score: 99.99%
    _______________________________________________
    Classification Report:
                      0      1  accuracy  macro avg  weighted avg
    precision      1.00   1.00      1.00       1.00          1.00
    recall         1.00   0.97      1.00       0.98          1.00
    f1-score       1.00   0.98      1.00       0.99          1.00
    support   159204.00 287.00      1.00  159491.00     159491.00
    _______________________________________________
    Confusion Matrix: 
     [[159203      1]
     [     9    278]]
    
    Test Result:
    ================================================
    Accuracy Score: 99.96%
    _______________________________________________
    Classification Report:
                     0      1  accuracy  macro avg  weighted avg
    precision     1.00   0.90      1.00       0.95          1.00
    recall        1.00   0.82      1.00       0.91          1.00
    f1-score      1.00   0.86      1.00       0.93          1.00
    support   85307.00 136.00      1.00   85443.00      85443.00
    _______________________________________________
    Confusion Matrix: 
     [[85295    12]
     [   25   111]]
    

    Multiple-Model Training & Validation

    Let’s invoke several popular supervised ML binary classification algorithms.

    XGBClassifier

    from xgboost import XGBClassifier

    xgb_clf = XGBClassifier()
    xgb_clf.fit(X_train, y_train, eval_metric=’aucpr’)

    y_train_pred = xgb_clf.predict(X_train)
    y_test_pred = xgb_clf.predict(X_test)

    print_score(y_train, y_train_pred, train=True)
    print_score(y_test, y_test_pred, train=False)

    scores_dict[‘XGBoost’] = {
    ‘Train’: f1_score(y_train,y_train_pred),
    ‘Test’: f1_score(y_test, y_test_pred),
    }

    Train Result:
    ================================================
    Accuracy Score: 100.00%
    _______________________________________________
    Classification Report:
                      0      1  accuracy  macro avg  weighted avg
    precision      1.00   1.00      1.00       1.00          1.00
    recall         1.00   1.00      1.00       1.00          1.00
    f1-score       1.00   1.00      1.00       1.00          1.00
    support   159204.00 287.00      1.00  159491.00     159491.00
    _______________________________________________
    Confusion Matrix: 
     [[159204      0]
     [     0    287]]
    
    Test Result:
    ================================================
    Accuracy Score: 99.96%
    _______________________________________________
    Classification Report:
                     0      1  accuracy  macro avg  weighted avg
    precision     1.00   0.95      1.00       0.97          1.00
    recall        1.00   0.82      1.00       0.91          1.00
    f1-score      1.00   0.88      1.00       0.94          1.00
    support   85307.00 136.00      1.00   85443.00      85443.00
    _______________________________________________
    Confusion Matrix: 
     [[85301     6]
     [   25   111]]
    

    RandomForestClassifier

    from sklearn.ensemble import RandomForestClassifier

    rf_clf = RandomForestClassifier(n_estimators=100, oob_score=False)
    rf_clf.fit(X_train, y_train)

    y_train_pred = rf_clf.predict(X_train)
    y_test_pred = rf_clf.predict(X_test)

    print_score(y_train, y_train_pred, train=True)
    print_score(y_test, y_test_pred, train=False)

    scores_dict[‘Random Forest’] = {
    ‘Train’: f1_score(y_train,y_train_pred),
    ‘Test’: f1_score(y_test, y_test_pred),
    }

    Train Result:
    ================================================
    Accuracy Score: 100.00%
    _______________________________________________
    Classification Report:
                      0      1  accuracy  macro avg  weighted avg
    precision      1.00   1.00      1.00       1.00          1.00
    recall         1.00   1.00      1.00       1.00          1.00
    f1-score       1.00   1.00      1.00       1.00          1.00
    support   159204.00 287.00      1.00  159491.00     159491.00
    _______________________________________________
    Confusion Matrix: 
     [[159204      0]
     [     0    287]]
    
    Test Result:
    ================================================
    Accuracy Score: 99.96%
    _______________________________________________
    Classification Report:
                     0      1  accuracy  macro avg  weighted avg
    precision     1.00   0.90      1.00       0.95          1.00
    recall        1.00   0.81      1.00       0.90          1.00
    f1-score      1.00   0.85      1.00       0.93          1.00
    support   85307.00 136.00      1.00   85443.00      85443.00
    _______________________________________________
    Confusion Matrix: 
     [[85295    12]
     [   26   110]]
    

    CatBoostClassifier

    from catboost import CatBoostClassifier

    cb_clf = CatBoostClassifier()
    cb_clf.fit(X_train, y_train)

    Learning rate set to 0.089847
    0:	learn: 0.3914646	total: 169ms	remaining: 2m 49s
    .....
    999:	learn: 0.0001216	total: 16.6s	remaining: 0us

    y_train_pred = cb_clf.predict(X_train)
    y_test_pred = cb_clf.predict(X_test)

    print_score(y_train, y_train_pred, train=True)
    print_score(y_test, y_test_pred, train=False)

    scores_dict[‘CatBoost’] = {
    ‘Train’: f1_score(y_train,y_train_pred),
    ‘Test’: f1_score(y_test, y_test_pred),
    }

    Train Result:
    ================================================
    Accuracy Score: 100.00%
    _______________________________________________
    Classification Report:
                      0      1  accuracy  macro avg  weighted avg
    precision      1.00   1.00      1.00       1.00          1.00
    recall         1.00   1.00      1.00       1.00          1.00
    f1-score       1.00   1.00      1.00       1.00          1.00
    support   159204.00 287.00      1.00  159491.00     159491.00
    _______________________________________________
    Confusion Matrix: 
     [[159204      0]
     [     1    286]]
    
    Test Result:
    ================================================
    Accuracy Score: 99.96%
    _______________________________________________
    Classification Report:
                     0      1  accuracy  macro avg  weighted avg
    precision     1.00   0.93      1.00       0.97          1.00
    recall        1.00   0.82      1.00       0.91          1.00
    f1-score      1.00   0.87      1.00       0.94          1.00
    support   85307.00 136.00      1.00   85443.00      85443.00
    _______________________________________________
    Confusion Matrix: 
     [[85299     8]
     [   25   111]]

    DecisionTreeClassifier

    from sklearn.tree import DecisionTreeClassifier

    lgbm_clf = DecisionTreeClassifier(random_state=0)

    lgbm_clf.fit(X_train, y_train)

    y_train_pred = lgbm_clf.predict(X_train)
    y_test_pred = lgbm_clf.predict(X_test)

    print_score(y_train, y_train_pred, train=True)
    print_score(y_test, y_test_pred, train=False)

    scores_dict[‘DTC’] = {
    ‘Train’: f1_score(y_train,y_train_pred),
    ‘Test’: f1_score(y_test, y_test_pred),
    }

    Train Result:
    ================================================
    Accuracy Score: 100.00%
    _______________________________________________
    Classification Report:
                      0      1  accuracy  macro avg  weighted avg
    precision      1.00   1.00      1.00       1.00          1.00
    recall         1.00   1.00      1.00       1.00          1.00
    f1-score       1.00   1.00      1.00       1.00          1.00
    support   159204.00 287.00      1.00  159491.00     159491.00
    _______________________________________________
    Confusion Matrix: 
     [[159204      0]
     [     0    287]]
    
    Test Result:
    ================================================
    Accuracy Score: 99.90%
    _______________________________________________
    Classification Report:
                     0      1  accuracy  macro avg  weighted avg
    precision     1.00   0.66      1.00       0.83          1.00
    recall        1.00   0.74      1.00       0.87          1.00
    f1-score      1.00   0.69      1.00       0.85          1.00
    support   85307.00 136.00      1.00   85443.00      85443.00
    _______________________________________________
    Confusion Matrix: 
     [[85255    52]
     [   36   100]]

    AdaBoostClassifier

    from sklearn.ensemble import AdaBoostClassifier
    lgbm_clf=AdaBoostClassifier()
    lgbm_clf.fit(X_train, y_train)

    y_train_pred = lgbm_clf.predict(X_train)
    y_test_pred = lgbm_clf.predict(X_test)

    print_score(y_train, y_train_pred, train=True)
    print_score(y_test, y_test_pred, train=False)

    scores_dict[‘Ada’] = {
    ‘Train’: f1_score(y_train,y_train_pred),
    ‘Test’: f1_score(y_test, y_test_pred),
    }

    Train Result:
    ================================================
    Accuracy Score: 99.92%
    _______________________________________________
    Classification Report:
                      0      1  accuracy  macro avg  weighted avg
    precision      1.00   0.80      1.00       0.90          1.00
    recall         1.00   0.73      1.00       0.87          1.00
    f1-score       1.00   0.76      1.00       0.88          1.00
    support   159204.00 287.00      1.00  159491.00     159491.00
    _______________________________________________
    Confusion Matrix: 
     [[159150     54]
     [    77    210]]
    
    Test Result:
    ================================================
    Accuracy Score: 99.93%
    _______________________________________________
    Classification Report:
                     0      1  accuracy  macro avg  weighted avg
    precision     1.00   0.80      1.00       0.90          1.00
    recall        1.00   0.76      1.00       0.88          1.00
    f1-score      1.00   0.78      1.00       0.89          1.00
    support   85307.00 136.00      1.00   85443.00      85443.00
    _______________________________________________
    Confusion Matrix: 
     [[85281    26]
     [   32   104]]
    

    GaussianNB

    from sklearn.naive_bayes import GaussianNB
    lgbm_clf=GaussianNB()
    lgbm_clf.fit(X_train, y_train)

    y_train_pred = lgbm_clf.predict(X_train)
    y_test_pred = lgbm_clf.predict(X_test)

    print_score(y_train, y_train_pred, train=True)
    print_score(y_test, y_test_pred, train=False)

    scores_dict[‘GNB’] = {
    ‘Train’: f1_score(y_train,y_train_pred),
    ‘Test’: f1_score(y_test, y_test_pred),
    }

    Train Result:
    ================================================
    Accuracy Score: 97.77%
    _______________________________________________
    Classification Report:
                      0      1  accuracy  macro avg  weighted avg
    precision      1.00   0.06      0.98       0.53          1.00
    recall         0.98   0.82      0.98       0.90          0.98
    f1-score       0.99   0.12      0.98       0.55          0.99
    support   159204.00 287.00      0.98  159491.00     159491.00
    _______________________________________________
    Confusion Matrix: 
     [[155706   3498]
     [    52    235]]
    
    Test Result:
    ================================================
    Accuracy Score: 97.78%
    _______________________________________________
    Classification Report:
                     0      1  accuracy  macro avg  weighted avg
    precision     1.00   0.06      0.98       0.53          1.00
    recall        0.98   0.85      0.98       0.92          0.98
    f1-score      0.99   0.11      0.98       0.55          0.99
    support   85307.00 136.00      0.98   85443.00      85443.00
    _______________________________________________
    Confusion Matrix: 
     [[83428  1879]
     [   20   116]]
    

    SVC

    from sklearn.svm import SVC
    lgbm_clf=SVC()
    lgbm_clf.fit(X_train, y_train)

    y_train_pred = lgbm_clf.predict(X_train)
    y_test_pred = lgbm_clf.predict(X_test)

    print_score(y_train, y_train_pred, train=True)
    print_score(y_test, y_test_pred, train=False)

    scores_dict[‘SVC’] = {
    ‘Train’: f1_score(y_train,y_train_pred),
    ‘Test’: f1_score(y_test, y_test_pred),
    }

    Train Result:
    ================================================
    Accuracy Score: 99.96%
    _______________________________________________
    Classification Report:
                      0      1  accuracy  macro avg  weighted avg
    precision      1.00   0.99      1.00       0.99          1.00
    recall         1.00   0.81      1.00       0.91          1.00
    f1-score       1.00   0.89      1.00       0.95          1.00
    support   159204.00 287.00      1.00  159491.00     159491.00
    _______________________________________________
    Confusion Matrix: 
     [[159201      3]
     [    54    233]]
    
    Test Result:
    ================================================
    Accuracy Score: 99.94%
    _______________________________________________
    Classification Report:
                     0      1  accuracy  macro avg  weighted avg
    precision     1.00   0.92      1.00       0.96          1.00
    recall        1.00   0.65      1.00       0.83          1.00
    f1-score      1.00   0.76      1.00       0.88          1.00
    support   85307.00 136.00      1.00   85443.00      85443.00
    _______________________________________________
    Confusion Matrix: 
     [[85299     8]
     [   47    89]]
    

    KNeighborsClassifier

    from sklearn.neighbors import KNeighborsClassifier
    lgbm_clf=KNeighborsClassifier()
    lgbm_clf.fit(X_train, y_train)

    y_train_pred = lgbm_clf.predict(X_train)
    y_test_pred = lgbm_clf.predict(X_test)

    print_score(y_train, y_train_pred, train=True)
    print_score(y_test, y_test_pred, train=False)

    scores_dict[‘KNN’] = {
    ‘Train’: f1_score(y_train,y_train_pred),
    ‘Test’: f1_score(y_test, y_test_pred),
    }

    Train Result:
    ================================================
    Accuracy Score: 99.95%
    _______________________________________________
    Classification Report:
                      0      1  accuracy  macro avg  weighted avg
    precision      1.00   0.96      1.00       0.98          1.00
    recall         1.00   0.78      1.00       0.89          1.00
    f1-score       1.00   0.86      1.00       0.93          1.00
    support   159204.00 287.00      1.00  159491.00     159491.00
    _______________________________________________
    Confusion Matrix: 
     [[159194     10]
     [    62    225]]
    
    Test Result:
    ================================================
    Accuracy Score: 99.94%
    _______________________________________________
    Classification Report:
                     0      1  accuracy  macro avg  weighted avg
    precision     1.00   0.87      1.00       0.93          1.00
    recall        1.00   0.77      1.00       0.89          1.00
    f1-score      1.00   0.82      1.00       0.91          1.00
    support   85307.00 136.00      1.00   85443.00      85443.00
    _______________________________________________
    Confusion Matrix: 
     [[85291    16]
     [   31   105]]
    

    LogisticRegression

    from sklearn.linear_model import LogisticRegression
    lgbm_clf=LogisticRegression()
    lgbm_clf.fit(X_train, y_train)

    y_train_pred = lgbm_clf.predict(X_train)
    y_test_pred = lgbm_clf.predict(X_test)

    print_score(y_train, y_train_pred, train=True)
    print_score(y_test, y_test_pred, train=False)

    scores_dict[‘LR’] = {
    ‘Train’: f1_score(y_train,y_train_pred),
    ‘Test’: f1_score(y_test, y_test_pred),
    }

    Train Result:
    ================================================
    Accuracy Score: 99.92%
    _______________________________________________
    Classification Report:
                      0      1  accuracy  macro avg  weighted avg
    precision      1.00   0.89      1.00       0.94          1.00
    recall         1.00   0.62      1.00       0.81          1.00
    f1-score       1.00   0.73      1.00       0.87          1.00
    support   159204.00 287.00      1.00  159491.00     159491.00
    _______________________________________________
    Confusion Matrix: 
     [[159182     22]
     [   108    179]]
    
    Test Result:
    ================================================
    Accuracy Score: 99.93%
    _______________________________________________
    Classification Report:
                     0      1  accuracy  macro avg  weighted avg
    precision     1.00   0.87      1.00       0.93          1.00
    recall        1.00   0.64      1.00       0.82          1.00
    f1-score      1.00   0.74      1.00       0.87          1.00
    support   85307.00 136.00      1.00   85443.00      85443.00
    _______________________________________________
    Confusion Matrix: 
     [[85294    13]
     [   49    87]]
    

    SGDClassifier

    from sklearn.linear_model import SGDClassifier
    lgbm_clf=SGDClassifier()
    lgbm_clf.fit(X_train, y_train)

    y_train_pred = lgbm_clf.predict(X_train)
    y_test_pred = lgbm_clf.predict(X_test)

    print_score(y_train, y_train_pred, train=True)
    print_score(y_test, y_test_pred, train=False)

    scores_dict[‘SGD’] = {
    ‘Train’: f1_score(y_train,y_train_pred),
    ‘Test’: f1_score(y_test, y_test_pred),
    }

    Train Result:
    ================================================
    Accuracy Score: 99.91%
    _______________________________________________
    Classification Report:
                      0      1  accuracy  macro avg  weighted avg
    precision      1.00   0.88      1.00       0.94          1.00
    recall         1.00   0.58      1.00       0.79          1.00
    f1-score       1.00   0.70      1.00       0.85          1.00
    support   159204.00 287.00      1.00  159491.00     159491.00
    _______________________________________________
    Confusion Matrix: 
     [[159182     22]
     [   121    166]]
    
    Test Result:
    ================================================
    Accuracy Score: 99.92%
    _______________________________________________
    Classification Report:
                     0      1  accuracy  macro avg  weighted avg
    precision     1.00   0.85      1.00       0.92          1.00
    recall        1.00   0.58      1.00       0.79          1.00
    f1-score      1.00   0.69      1.00       0.84          1.00
    support   85307.00 136.00      1.00   85443.00      85443.00
    _______________________________________________
    Confusion Matrix: 
     [[85293    14]
     [   57    79]]
    

    GradientBoostingClassifier

    from sklearn.ensemble import GradientBoostingClassifier
    lgbm_clf=GradientBoostingClassifier()
    lgbm_clf.fit(X_train, y_train)

    y_train_pred = lgbm_clf.predict(X_train)
    y_test_pred = lgbm_clf.predict(X_test)

    print_score(y_train, y_train_pred, train=True)
    print_score(y_test, y_test_pred, train=False)

    scores_dict[‘GBC’] = {
    ‘Train’: f1_score(y_train,y_train_pred),
    ‘Test’: f1_score(y_test, y_test_pred),
    }

    Train Result:
    ================================================
    Accuracy Score: 99.85%
    _______________________________________________
    Classification Report:
                      0      1  accuracy  macro avg  weighted avg
    precision      1.00   0.98      1.00       0.99          1.00
    recall         1.00   0.19      1.00       0.59          1.00
    f1-score       1.00   0.32      1.00       0.66          1.00
    support   159204.00 287.00      1.00  159491.00     159491.00
    _______________________________________________
    Confusion Matrix: 
     [[159203      1]
     [   233     54]]
    
    Test Result:
    ================================================
    Accuracy Score: 99.85%
    _______________________________________________
    Classification Report:
                     0      1  accuracy  macro avg  weighted avg
    precision     1.00   0.73      1.00       0.86          1.00
    recall        1.00   0.14      1.00       0.57          1.00
    f1-score      1.00   0.23      1.00       0.62          1.00
    support   85307.00 136.00      1.00   85443.00      85443.00
    _______________________________________________
    Confusion Matrix: 
     [[85300     7]
     [  117    19]]
    

    MLPClassifier

    from sklearn.neural_network import MLPClassifier
    lgbm_clf=MLPClassifier()
    lgbm_clf.fit(X_train, y_train)

    y_train_pred = lgbm_clf.predict(X_train)
    y_test_pred = lgbm_clf.predict(X_test)

    print_score(y_train, y_train_pred, train=True)
    print_score(y_test, y_test_pred, train=False)

    scores_dict[‘MLP’] = {
    ‘Train’: f1_score(y_train,y_train_pred),
    ‘Test’: f1_score(y_test, y_test_pred),
    }

    Train Result:
    ================================================
    Accuracy Score: 99.98%
    _______________________________________________
    Classification Report:
                      0      1  accuracy  macro avg  weighted avg
    precision      1.00   1.00      1.00       1.00          1.00
    recall         1.00   0.87      1.00       0.93          1.00
    f1-score       1.00   0.93      1.00       0.96          1.00
    support   159204.00 287.00      1.00  159491.00     159491.00
    _______________________________________________
    Confusion Matrix: 
     [[159203      1]
     [    38    249]]
    
    Test Result:
    ================================================
    Accuracy Score: 99.95%
    _______________________________________________
    Classification Report:
                     0      1  accuracy  macro avg  weighted avg
    precision     1.00   0.92      1.00       0.96          1.00
    recall        1.00   0.76      1.00       0.88          1.00
    f1-score      1.00   0.83      1.00       0.92          1.00
    support   85307.00 136.00      1.00   85443.00      85443.00
    _______________________________________________
    Confusion Matrix: 
     [[85298     9]
     [   33   103]]
    

    QuadraticDiscriminantAnalysis

    from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
    lgbm_clf=QuadraticDiscriminantAnalysis()
    lgbm_clf.fit(X_train, y_train)

    y_train_pred = lgbm_clf.predict(X_train)
    y_test_pred = lgbm_clf.predict(X_test)

    print_score(y_train, y_train_pred, train=True)
    print_score(y_test, y_test_pred, train=False)

    scores_dict[‘QDA’] = {
    ‘Train’: f1_score(y_train,y_train_pred),
    ‘Test’: f1_score(y_test, y_test_pred),
    }

    Train Result:
    ================================================
    Accuracy Score: 97.55%
    _______________________________________________
    Classification Report:
                      0      1  accuracy  macro avg  weighted avg
    precision      1.00   0.06      0.98       0.53          1.00
    recall         0.98   0.87      0.98       0.92          0.98
    f1-score       0.99   0.11      0.98       0.55          0.99
    support   159204.00 287.00      0.98  159491.00     159491.00
    _______________________________________________
    Confusion Matrix: 
     [[155334   3870]
     [    38    249]]
    
    Test Result:
    ================================================
    Accuracy Score: 97.55%
    _______________________________________________
    Classification Report:
                     0      1  accuracy  macro avg  weighted avg
    precision     1.00   0.06      0.98       0.53          1.00
    recall        0.98   0.90      0.98       0.94          0.98
    f1-score      0.99   0.11      0.98       0.55          0.99
    support   85307.00 136.00      0.98   85443.00      85443.00
    _______________________________________________
    Confusion Matrix: 
     [[83229  2078]
     [   13   123]]
    

    Multiple-Model ML/DL F1-Score Comparison

    scores_df = pd.DataFrame(scores_dict)

    scores_df.plot(kind=’barh’, figsize=(15, 8),fontsize=16)

    F1-score comparison joint bar plot

    Best ML/DL Model Performance QC Analysis

    Let’s compare ML/DL model results and performance using SciKit-Plot visualization

    import scikitplot as skplt

    import sklearn
    from sklearn.model_selection import train_test_split
    from sklearn.cluster import KMeans
    from sklearn.decomposition import PCA

    import matplotlib.pyplot as plt

    import sys
    import warnings
    warnings.filterwarnings(“ignore”)

    print(“Scikit Plot Version : “, skplt.version)
    print(“Scikit Learn Version : “, sklearn.version)
    print(“Python Version : “, sys.version)

    %matplotlib inline

    Scikit Plot Version :  0.3.7
    Scikit Learn Version :  1.2.2
    Python Version :  3.9.16 (main, Jan 11 2023, 16:16:36) [MSC v.1916 64 bit (AMD64)]

    XGBoost Classification Learning Curve

    skplt.estimators.plot_learning_curve(xgb_clf, X_train, y_train,
    cv=7, shuffle=True, scoring=”accuracy”,
    n_jobs=-1, figsize=(6,4), title_fontsize=”large”, text_fontsize=”large”,
    title=”XGBoost Classification Learning Curve”);

    XGBoost Classification Learning Curve

    Feature Importance: Random Forest vs XGBoost Classifier

    col=data.columns
    col.drop(‘Class’)

    Index(['Time', 'V1', 'V2', 'V3', 'V4', 'V5', 'V6', 'V7', 'V8', 'V9', 'V10',
           'V11', 'V12', 'V13', 'V14', 'V15', 'V16', 'V17', 'V18', 'V19', 'V20', 'V21', 'V22', 'V23', 'V24', 'V25', 'V26', 'V27', 'V28', 'Amount'],
          dtype='object')

    fig = plt.figure(figsize=(11,6))

    ax1 = fig.add_subplot(121)
    skplt.estimators.plot_feature_importances(rf_clf, feature_names=col,
    title=”Random Forest Regressor Feature Importance”,
    x_tick_rotation=90, order=”ascending”,
    ax=ax1);

    ax2 = fig.add_subplot(122)
    skplt.estimators.plot_feature_importances(xgb_clf, feature_names=col,
    title=”XGBoost Classifier Feature Importance”,
    x_tick_rotation=90,
    ax=ax2);

    plt.tight_layout()

    Feature Importance: Random Forest vs XGBoost Classifier

    Calibration Plots

    lr_probas = CatBoostClassifier().fit(X_train, y_train).predict_proba(X_test)
    rf_probas = RandomForestClassifier().fit(X_train, y_train).predict_proba(X_test)
    gb_probas = XGBClassifier().fit(X_train, y_train).predict_proba(X_test)
    et_scores = MLPClassifier().fit(X_train, y_train).predict_proba(X_test)

    probas_list = [lr_probas, rf_probas, gb_probas, et_scores]
    clf_names = [‘CatBoost’, ‘Random Forest’, ‘XGBoost’, ‘MLP’]

    skplt.metrics.plot_calibration_curve(y_test,
    probas_list,
    clf_names, n_bins=15,
    figsize=(12,6)
    );

    Calibration plots clf_names = ['CatBoost', 'Random Forest', 'XGBoost', 'MLP']

    Normalized Confusion Matrix

    y_train_pred = xgb_clf.predict(X_train)
    y_test_pred = xgb_clf.predict(X_test)
    target_names=[‘Normal’,’Fraud’]

    from sklearn.metrics import confusion_matrix
    import seaborn as sns

    cm = confusion_matrix(y_test, y_test_pred)

    cmn = cm.astype(‘float’) / cm.sum(axis=1)[:, np.newaxis]
    fig, ax = plt.subplots(figsize=(6,6))
    sns.heatmap(cmn, annot=True, fmt=’.2f’, xticklabels=target_names, yticklabels=target_names)
    plt.ylabel(‘Actual’)
    plt.xlabel(‘Predicted’)
    plt.title(‘XGBoost Normalized Confusion Matrix’)
    plt.show(block=True)

    XGBoost Normalized Confusion Matrix

    XGBoost ROC Curve

    Y_test_probs = xgb_clf.predict_proba(X_test)

    skplt.metrics.plot_roc_curve(y_test, Y_test_probs,
    title=”XGBoost ROC Curve”, figsize=(12,6));

    XGBoost ROC Curve

    Precision-Recall Curve

    skplt.metrics.plot_precision_recall_curve(y_test, Y_test_probs,
    title=”XGBoost Precision-Recall Curve”, figsize=(12,6));

    XGBoost Precision-Recall Curve

    XGBoost KS Statistic Plot

    Y_probas = xgb_clf.predict_proba(X_test)

    skplt.metrics.plot_ks_statistic(y_test, Y_probas, figsize=(10,6));

    XGBoost KS Statistic Plot

    XGBoost Cumulative Gains Curve

    skplt.metrics.plot_cumulative_gain(y_test, Y_probas, figsize=(10,6));

    XGBoost Cumulative Gains Curve

    XGBoost Lift Curve

    skplt.metrics.plot_lift_curve(y_test, Y_probas, figsize=(10,6));

    XGBoost Lift Curve

    K-Means PCA Clusters & Variance Analysis

    Elbow Plot

    skplt.cluster.plot_elbow_curve(KMeans(random_state=1),
    X_train,
    cluster_ranges=range(2, 20),
    figsize=(8,6));

    Elbow plot

    Silhouette Analysis

    kmeans = KMeans(n_clusters=10, random_state=1)
    kmeans.fit(X_train, y_train)
    cluster_labels = kmeans.predict(X_test)

    skplt.metrics.plot_silhouette(X_test, cluster_labels,
    figsize=(8,6));

    Silhouette Analysis

    PCA Component Explained Variances

    pca = PCA(random_state=1)
    pca.fit(X_train)

    skplt.decomposition.plot_pca_component_variance(pca, figsize=(8,6));

    PCA Component Explained Variances

    PCA 2-D Projection

    skplt.decomposition.plot_pca_2d_projection(pca, X_train, y_train,
    figsize=(10,10),
    cmap=”tab10″);

    PCA 2-D Projection

    Final XGBoost Classification Report

    from yellowbrick.classifier import ClassificationReport

    viz = ClassificationReport(xgb_clf,
    classes=target_names,
    support=True,
    fig=plt.figure(figsize=(8,6)))

    viz.fit(X_train, y_train)

    viz.score(X_test, y_test)

    viz.show();

    XGBoost Classification Report

    Summary of sklearn.metrics for XGBoost

    from sklearn.metrics import class_likelihood_ratios

    class_likelihood_ratios(y_test,y_test_pred)

    (11604.261029411764, 0.18383645940293095)
    

    from sklearn.metrics import balanced_accuracy_score
    balanced_accuracy_score(y_test,y_test_pred)

    0.9080530681917697

    from sklearn.metrics import cohen_kappa_score
    cohen_kappa_score(y_test,y_test_pred)

    0.8772897054030634

    from sklearn.metrics import hinge_loss
    hinge_loss(y_test,y_test_pred)

    0.9987711105649381

    from sklearn.metrics import matthews_corrcoef
    matthews_corrcoef(y_test,y_test_pred)

    0.8797814646276692

    from sklearn.metrics import roc_auc_score
    roc_auc_score(y_test,y_test_pred)

    0.9080530681917698

    from sklearn.metrics import top_k_accuracy_score
    top_k_accuracy_score(y_test,y_test_pred)

    1.0

    from sklearn.metrics import hamming_loss
    hamming_loss(y_test,y_test_pred)

    0.0003628149760659153

    from sklearn.metrics import jaccard_score
    jaccard_score(y_test,y_test_pred)

    0.7816901408450704

    from sklearn.metrics import log_loss
    log_loss(y_test,y_test_pred)

    0.013077177241700906

    from sklearn.metrics import zero_one_loss
    zero_one_loss(y_test,y_test_pred)

    0.0003628149760659394

    from sklearn.metrics import precision_recall_fscore_support
    precision_recall_fscore_support(y_test,y_test_pred)

    (array([0.99970701, 0.94871795]),
     array([0.99992967, 0.81617647]),
     array([0.99981832, 0.87747036]),
     array([85307,   136], dtype=int64))

    from sklearn.metrics import f1_score
    f1_score(y_test,y_test_pred)

    0.8774703557312252

    from sklearn.metrics import precision_score
    precision_score(y_test,y_test_pred)

    0.9487179487179487

    from sklearn.metrics import recall_score
    recall_score(y_test,y_test_pred)

    0.8161764705882353

    from sklearn.metrics import fbeta_score
    fbeta_score(y_test,y_test_pred,beta=0.5)

    0.9188741721854304

    Discrimination Threshold

    from yellowbrick.classifier import DiscriminationThreshold

    viz = DiscriminationThreshold(xgb_clf,
    classes=target_names,
    cv=0.2,
    fig=plt.figure(figsize=(9,6)))

    viz.fit(X_train, y_train)

    viz.score(X_test, y_test)

    viz.show();

    Discrimination Threshold Plot for XGBoost Classifier

    Summary

    • This project presents a comparison of some established supervised ML and ANN algorithms to differentiate between genuine and fraudulent transactions.
    • Multiple ML/DL models are built and fit on the training set, followed by comprehensive cross-validation and QC performance tests.
    • We found that the XGBoost model performs the best in terms of FP = 18%, f1- and ROC-scores of 88% and 91%, respectively.
    • Overfitting challenge: the model shows low bias but high variance. This is a result of an excessively complicated model, and it can be prevented by fitting multiple models and using validation or cross-validation to compare their predictive accuracies on test data.
    • In principle, the obtained model can be applied in anti-fraud monitoring systems, or a similar CCFD model development process can be performed in related business areas to detect fraud.

    Explore More


    One-Time
    Monthly
    Yearly

    Make a one-time donation

    Make a monthly donation

    Make a yearly donation

    Choose an amount

    $5.00
    $15.00
    $100.00
    $5.00
    $15.00
    $100.00
    $5.00
    $15.00
    $100.00

    Or enter a custom amount

    $

    Your contribution is appreciated.

    Your contribution is appreciated.

    Your contribution is appreciated.

    DonateDonate monthlyDonate yearly
  • Datapane Stock Screener App from Scratch

    Datapane Stock Screener App from Scratch

    Photo by Carl Heyerdahl on Unsplash

    • This API is built around the concept of Blocks, which are Python objects that represent an individual unit that can be processed, composed, and viewed. 

    Let’s install Datapane

    !pip install datapane_components

    and import standard libraries

    import datapane as dp
    import altair as alt
    import pandas as pd
    import plotly.express as px
    import yfinance as yf

    from datetime import datetime
    import threading
    from time import sleep

    Let’s set the stock ticker

    ticker=’MSFT’

    and download the stock Adj Close price in USD

    df = pd.DataFrame(yf.download(ticker, period=”1d”, interval=”5m”)[“Adj Close”]).reset_index()
    df.columns = [“time”, “adjusted_close”]

    [*********************100%***********************]  1 of 1 completed

    df.tail()

    MSFT Adj Close price in USD

    “””

    Stock Portfolio Analysis

    A Datapane app that analyses stock portfolio data.

    “””
    from datetime import date, timedelta

    import pandas as pd
    import numpy as np
    import plotly.figure_factory as ff
    import plotly.graph_objs as go
    import plotly.express as px
    import plotly.io as pio
    import datapane as dp
    import datapane_components as dc
    import yfinance as yf

    pio.templates.default = “ggplot2”

    dp.enable_logging()

    def process(ticker1: str, ticker2: str = “GOOG”) -> dp.View:
    print(f”Downloading datasets for {ticker1} vs {ticker2}”)
    start = date.today() – timedelta(365)
    end = date.today() + timedelta(2)
    # get the data
    df1 = yf.download(ticker1, start=start, end=end, interval=”1d”).reset_index() # [“Adj Close”]).reset_index()
    df2 = yf.download(ticker2, start=start, end=end, interval=”1d”).reset_index() # [“Adj Close”]).reset_index()
    df1[“Date”] = pd.to_datetime(df1[“Date”], format=”%d/%m/%Y”)
    df2[“Date”] = pd.to_datetime(df2[“Date”], format=”%d/%m/%Y”)
    orig_df1 = df1.copy(deep=True)

    # Build all the visualizations
    print("Creating plots...")
    trace0 = go.Scatter(x=df1.Date, y=df1.Close, name=ticker1)
    fig0 = go.Figure([trace0])
    fig0.update_layout(title={"text": f"{ticker1} Stock Price", "x": 0.5, "xanchor": "center"})
    
    df1["10-day MA"] = df1["Close"].rolling(window=10).mean()
    df1["20-day MA"] = df1["Close"].rolling(window=20).mean()
    df1["50-day MA"] = df1["Close"].rolling(window=50).mean()
    
    trace0 = go.Scatter(x=df1.Date, y=df1.Close, name=ticker1)
    trace1 = go.Scatter(x=df1.Date, y=df1["10-day MA"], name="10-day MA")
    trace2 = go.Scatter(x=df1.Date, y=df1["20-day MA"], name="20-day MA")
    fig1 = go.Figure([trace0, trace1, trace2])
    fig1.update_layout(title={"text": f"{ticker1} Stock Price (Rolling Average)", "x": 0.5, "xanchor": "center"})
    
    fig2 = go.Figure(go.Candlestick(x=df1.Date, open=df1.Open, high=df1.High, low=df1.Low, close=df1.Close))
    fig2.update_layout(title={"text": f"{ticker1} Stock Price (Candle Stick)", "x": 0.5, "xanchor": "center"})
    
    df1["Daily return (%)"] = round(df1["Close"].pct_change() * 100, 2)
    fig3 = px.bar(df1, x="Date", y="Daily return (%)")
    fig3.update_layout(title={"text": f"{ticker1} Stock Daily Return", "x": 0.5, "xanchor": "center"})
    
    trace0 = go.Scatter(x=df1.Date, y=df1.Close, name=ticker1, line=dict(color="lime"))
    trace1 = go.Scatter(x=df2.Date, y=df2.Close, name=ticker2, line=dict(color="red"))
    
    fig4 = go.Figure([trace0, trace1])
    fig4.update_layout(
        dict(
            title=dict({"text": f"{ticker1} vs {ticker2} Stock Price", "x": 0.5, "xanchor": "center"}),
            xaxis=dict(
                rangeselector=dict(
                    buttons=list(
                        [
                            dict(count=1, label="1m", step="month", stepmode="backward"),
                            dict(count=3, label="3m", step="month", stepmode="backward"),
                            dict(count=6, label="6m", step="month", stepmode="backward"),
                            dict(count=12, label="12m", step="month", stepmode="backward"),
                        ]
                    )
                ),
                rangeslider=dict(visible=True),
                type="date",
            ),
        )
    )
    
    trace0 = go.Scatterpolar(
        r=[df1["Close"].mean(), df1["Open"].min(), df1["Low"].min(), df1["High"].max()],
        theta=["Close", "Open", "Low", "High"],
        line=dict(color="lime"),
        name=ticker1,
        fill="toself",
    )
    
    trace1 = go.Scatterpolar(
        r=[df2["Close"].mean(), df2["Open"].min(), df2["Low"].min(), df2["High"].max()],
        theta=["Close", "Open", "Low", "High"],
        line=dict(color="red"),
        name="ZM",
        fill="toself",
    )
    
    fig5 = go.Figure([trace0, trace1])
    fig5.update_layout(
        go.Layout(
            polar=dict(radialaxis=dict(visible=True)),
            title=dict({"text": f"{ticker1} vs {ticker2} Stock Price (Radar Chart)", "x": 0.5, "xanchor": "center"}),
        )
    )
    
    df1 = df1[["Open", "Close", "Volume"]]
    df1["index"] = np.arange(len(df1))
    
    fig7 = go.Figure(
        ff.create_scatterplotmatrix(
            df1,
            diag="box",
            index="index",
            size=3,
            height=600,
            width=1150,
            colormap="RdBu",
            title={"text": f"{ticker1} Stock Price (Scatterplot Matrix)", "x": 0.5, "xanchor": "center"},
        )
    )
    
    # Building the output Datapane View
    # Now that we have a series of plots, we can construct our custom View.
    # In addition to the visualizations, this View includes Datapane's `BigNumber` component to
    # display today's stock prices, and our `DataTable` component to allow our viewers to filter,
    # explore, and download the data themselves.
    # We can build a more powerful report by using Datapane's layout components.
    # e.g. using `Group` to place the `BigNumber` blocks in two columns, and `Select` block to add multiple tabs.
    
    ticker1_today = df1.iloc[-1]
    ticker2_today = df2.iloc[-1]
    
    return dp.View(
        dp.Group(
            f"""## {ticker1} analysis (against {ticker2})""",
            dp.Group(
                dp.BigNumber(
                    heading=f"{ticker1} Day Performance",
                    value="${:,.2f}".format(ticker1_today.Close),
                    prev_value="${:,.2f}".format(ticker1_today.Open),
                ),
                dp.BigNumber(
                    heading=f"{ticker2} Day Performance",
                    value="${:,.2f}".format(ticker2_today.Close),
                    prev_value="${:,.2f}".format(ticker2_today.Open),
                ),
                columns=2,
            ),
            dp.Group(fig0, fig1, fig2, fig3, fig4, fig5, columns=2),
            dp.Plot(fig7),
            # datasets
            *dc.section(
                """
            # Datasets
            _These datasets are pulled live_.
            """
            ),
            dp.Select(dp.DataTable(orig_df1, label=ticker1), dp.DataTable(df2, label=ticker2)),
            label=f"{ticker1} vs {ticker2}",
        ),
    )
    

    controls = dp.Controls(
    ticker1=dp.TextBox(label=”Ticker”, initial=”MSFT”),
    ticker2=dp.TextBox(label=”(Optional) Comparison Ticker”, initial=”GOOG”, allow_empty=True),
    )

    v = dp.View(
    dp.Compute(
    function=process, controls=controls, label=”Choose Ticker”, cache=True, target=”results”, swap=dp.Swap.APPEND
    ),
    dp.Select(name=”results”),
    )
    dp.serve_app(v)

    Datapane dashboard: Choose Ticker
    Datapane dashboard: MSFT stock price, MA, candle sticks, and daily return
    Datapane dashboard: MSFT vs SPY stock price and radar chart
    Datapane dashboard: MSFT scatterplot matrix
    Datapane dashboard: MSFT and SPY datasets with SQL Query and Export options.

    Summary

    This is A Quick Guide for Value Investors Who Can Use Datapane to Build Their Own Value Investing Stock Screener API in 10 Minutes.

    Explore More

    An Interactive GPT Index and DeepLake Interface – 1. Amazon Financial Statements

    Working with FRED API in Python: U.S. Recession Forecast & Beyond

    Advanced Integrated Data Visualization (AIDV) in Python – 1. Stock Technical Indicators

    The Donchian Channel vs Buy-and-Hold Breakout Trading Systems – $MO Use-Case

    A Comparative Analysis of The 3 Best U.S. Growth Stocks in Q1’23 – 1. WMT

    Python Technical Analysis for BioTech – Get Buy Alerts on ABBV in 2023

    Basic Stock Price Analysis in Python


    One-Time
    Monthly
    Yearly

    Make a one-time donation

    Make a monthly donation

    Make a yearly donation

    Choose an amount

    $5.00
    $15.00
    $100.00
    $5.00
    $15.00
    $100.00
    $5.00
    $15.00
    $100.00

    Or enter a custom amount

    $

    Your contribution is appreciated.

    Your contribution is appreciated.

    Your contribution is appreciated.

    DonateDonate monthlyDonate yearly
  • Unsupervised ML Clustering, Customer Segmentation, Cohort, Market Basket, Bank Churn, CRM, ABC & RFM Analysis – A Comprehensive Guide in Python

    Unsupervised ML Clustering, Customer Segmentation, Cohort, Market Basket, Bank Churn, CRM, ABC & RFM Analysis – A Comprehensive Guide in Python
    • The goal of this post is to describe the most popular, easy-to-use, and effective data-driven techniques that help organisations streamline their operations, build effective relationships with with individual people (customers, service users, colleagues, or suppliers), increase sales, improve customer service, and increase profitability.
    • Customer Segmentation (CS) is a core concept in this technology. This is the process of tagging and grouping customers based on shared characteristics. This process also makes it easy to tailor and personalize your marketing, service, and sales efforts to the needs of specific groups. The result is a potential boost to customer loyalty and conversions.
    • This article represents a comprehensive step-by-step CS guide in Python that helps you manage customer relationships across the entire customer lifecycle, at every marketingsalese-commerce, and customer service interaction.
    • Our guide consists of hands-on case examples and best industry practices with open-source data and Python source codes: mall customers, online retail, marketing campaigns, supply chain, bank churn, groceries market, and e-commerce.

    Table of Clickable Contents

    1. Motivation
    2. Methods
    3. Open-Source Datasets
    4. Mall Customer Segmentation
    5. Online Retail K-Means Clustering
    6. Online Retail Data Analytics
    7. RFM Customer Segmentation
    8. RFM TreeMap: Online Retail Dataset II
      1. ####### DATA PREPARATION
    9. CRM Analytics: CLTV
    10. Cohort Analysis in Online Retail
    11. Customer Segmentation, RFM, K-Means and Cohort Analysis in Online Retail
      1. ############################# segment1
    12. Customer Clustering PCA, K-Means and Agglomerative for Marketing Campaigns
    13. Supply Chain RFM & ABC Analysis
    14. Bank Churn ML Prediction
    15. Groceries Market Basket & RFM Analysis
    16. Largest E-Commerce Showcase in Pakistan
    17. Conclusions
    18. Explore More

    Motivation

    • Uniform communication strategies that are not customer centric have failed to exploit millions of dollars in opportunities at the individual and segment level. 
    • The more comprehensive  CS-based modeling provides executives and marketers with the ability  to identify high-profit customers and micro-segments that will drive increases in traffic, sales, profit and retention.

    Methods

    • We will use the following techniques: unsupervised ML clustering, customer segmentation, cohort, market basket, bank churn, CRM data analytics, ABC & RFM analysis.
    • ABC analysis is a tool that allows companies to divide their inventory or customers into groups based on their respective values to the company. The goal of this type of analysis is to ensure that businesses are optimizing their time and resources when serving customers.
    • What is RFM (recency, frequency, monetary) analysis? RFM analysis is a marketing technique used to quantitatively rank and group customers based on the recency, frequency and monetary total of their recent transactions to identify the best customers and perform targeted marketing campaigns.
    • Cohort analysis is a type of behavioral analytics in which you take a group of users, and analyze their usage patterns based on their shared traits to better track and understand their actions. A cohort is simply a group of people with shared characteristics.
    • Market Basket Analysis is one of the key techniques used by large retailers to uncover associations between items. It works by looking for combinations of items that occur together frequently in transactions. To put it another way, it allows retailers to identify relationships between the items that people buy.
    • Customer Churn prediction means knowing which customers are likely to leave or unsubscribe from your service. Customers have different behaviors and preferences, and reasons for cancelling their subscriptions. Therefore, it is important to actively communicate with each of them to keep them on your customer list. You need to know which marketing activities are most effective for individual customers and when they are most effective.

    Open-Source Datasets

    This file contains the basic information (ID, age, gender, income, and spending score) about the customers.

    Online retail is a transnational data set which contains all the transactions occurring between 01/12/2010 and 09/12/2011 for a UK-based and registered non-store online retail. The company mainly sells unique all-occasion gifts. Many customers of the company are wholesalers.

    We will be using the online retail transnational dataset to build RFM clustering and choose the best set of customers which the company should target.

    Online Retail II dataset contains all the transactions occurring for a UK-based and registered, non-store online retail. This is a real online retail transaction data set of two years between 01/12/2009 and 09/12/2011.

    The dataset was used in the Customer Personality Analysis (CPA). CPA is a detailed analysis of a company’s ideal customers. It helps a business to better understand its customers and makes it easier for them to modify products according to the specific needs, behaviours and concerns of different types of customers.

    SUPPLY CHAIN FOR BIG DATA ANALYSIS

    Areas of important registered activities : Provisioning , Production , Sales , Commercial Distribution.

    Type Data :
    Structured Data : DataCoSupplyChainDataset.csv
    Unstructured Data : tokenized_access_logs.csv (Clickstream)

    Types of Products : Clothing , Sports , and Electronic Supplies

    Additionally it is attached in another file called DescriptionDataCoSupplyChain.csv, the description of each of the variables of the DataCoSupplyChainDatasetc.csv.

    3 columns: ID of customer, Date of purchase, and Description of product purchased.

    Market Basket Analysis is one of the key techniques used by large retailers to uncover associations between items. It works by looking for combinations of items that occur together frequently in transactions. 

    Association Rules are widely used to analyze retail basket or transaction data and are intended to identify strong rules discovered in transaction data using measures of interestingness, based on the concept of strong rules.

    Half a million transaction record for e-commerce sales in Pakistan.

    Attributes

    People

    • ID: Customer’s unique identifier
    • Year_Birth: Customer’s birth year
    • Education: Customer’s education level
    • Marital_Status: Customer’s marital status
    • Income: Customer’s yearly household income
    • Kidhome: Number of children in customer’s household
    • Teenhome: Number of teenagers in customer’s household
    • Dt_Customer: Date of customer’s enrollment with the company
    • Recency: Number of days since customer’s last purchase
    • Complain: 1 if the customer complained in the last 2 years, 0 otherwise

    Products

    • MntWines: Amount spent on wine in last 2 years
    • MntFruits: Amount spent on fruits in last 2 years
    • MntMeatProducts: Amount spent on meat in last 2 years
    • MntFishProducts: Amount spent on fish in last 2 years
    • MntSweetProducts: Amount spent on sweets in last 2 years
    • MntGoldProds: Amount spent on gold in last 2 years

    Promotion

    • NumDealsPurchases: Number of purchases made with a discount
    • AcceptedCmp1: 1 if customer accepted the offer in the 1st campaign, 0 otherwise
    • AcceptedCmp2: 1 if customer accepted the offer in the 2nd campaign, 0 otherwise
    • AcceptedCmp3: 1 if customer accepted the offer in the 3rd campaign, 0 otherwise
    • AcceptedCmp4: 1 if customer accepted the offer in the 4th campaign, 0 otherwise
    • AcceptedCmp5: 1 if customer accepted the offer in the 5th campaign, 0 otherwise
    • Response: 1 if customer accepted the offer in the last campaign, 0 otherwise

    Place

    • NumWebPurchases: Number of purchases made through the company’s website
    • NumCatalogPurchases: Number of purchases made using a catalogue
    • NumStorePurchases: Number of purchases made directly in stores
    • NumWebVisitsMonth: Number of visits to company’s website in the last month

    Scope:  Bank Churn ModellingBank Customers Churn Churn Prediction of bank customers

    Mall Customer Segmentation

    Let’s set the working directory YOURPATH

    import os
    os.chdir(‘YOURPATH’)
    os. getcwd()

    Importing libraries

    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt
    import seaborn as sns

    import plotly.express as px
    import plotly.graph_objs as go
    import missingno as msno

    from sklearn.cluster import KMeans
    from scipy.cluster.hierarchy import dendrogram, linkage
    from sklearn.cluster import AgglomerativeClustering
    from sklearn.cluster import DBSCAN
    from sklearn import metrics
    from sklearn.datasets import make_blobs
    from sklearn.preprocessing import StandardScaler

    import warnings
    warnings.filterwarnings(‘ignore’)

    plt.style.use(‘fivethirtyeight’)
    %matplotlib inline

    Let’s read the input dataset

    df = pd.read_csv(‘Mall_Customers.csv’)

    and check the content
    df.head()

    Mall_Customers input table

    Let’s check the descriptive statistics

    df.describe().T

    Mall data descriptive statistics

    df.info()

    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 200 entries, 0 to 199
    Data columns (total 5 columns):
     #   Column                  Non-Null Count  Dtype 
    ---  ------                  --------------  ----- 
     0   CustomerID              200 non-null    int64 
     1   Gender                  200 non-null    object
     2   Age                     200 non-null    int64 
     3   Annual Income (k$)      200 non-null    int64 
     4   Spending Score (1-100)  200 non-null    int64 
    dtypes: int64(4), object(1)
    memory usage: 7.9+ KB

    Let’s drop the column

    df.drop(‘CustomerID’, axis = 1, inplace = True)

    and plot the following distribution plots

    plt.figure(figsize = (20, 8))
    plotnumber = 1

    for col in [‘Age’, ‘Annual Income (k$)’, ‘Spending Score (1-100)’]:
    if plotnumber <= 3:
    ax = plt.subplot(1, 3, plotnumber)
    sns.distplot(df[col])

    plotnumber += 1
    

    plt.tight_layout()
    plt.show()

    Distributions: age, annual income, and spending score.

    Let’s look at the Gender pie-plot

    values = df[‘Gender’].value_counts()
    labels = [‘Male’, ‘Female’]

    fig, ax = plt.subplots(figsize = (4, 4), dpi = 100)
    explode = (0, 0.06)

    patches, texts, autotexts = ax.pie(values, labels = labels, autopct = ‘%1.2f%%’, shadow = True,
    startangle = 90, explode = explode)

    plt.setp(texts, color = ‘red’)
    plt.setp(autotexts, size = 12, color = ‘white’)
    autotexts[1].set_color(‘black’)
    plt.show()

    Gender pie-plot

    Let’s compare the corresponding violin plots

    plt.figure(figsize = (20, 8))
    plotnumber = 1

    for col in [‘Age’, ‘Annual Income (k$)’, ‘Spending Score (1-100)’]:
    if plotnumber <= 3:
    ax = plt.subplot(1, 3, plotnumber)
    sns.violinplot(x = col, y = ‘Gender’, data = df)

    plotnumber += 1
    

    plt.tight_layout()
    plt.show()

    Violin plots: gender vs age, annual income, and spending score.

    Let’s compare different age groups

    age_18_25 = df.Age[(df.Age >= 18) & (df.Age <= 25)] age_26_35 = df.Age[(df.Age >= 26) & (df.Age <= 35)] age_36_45 = df.Age[(df.Age >= 36) & (df.Age <= 45)] age_46_55 = df.Age[(df.Age >= 46) & (df.Age <= 55)] age_55above = df.Age[df.Age >= 55]

    x_age = [’18-25′, ’26-35′, ’36-45′, ’46-55′, ’55+’]
    y_age = [len(age_18_25.values), len(age_26_35.values), len(age_36_45.values), len(age_46_55.values),
    len(age_55above.values)]

    px.bar(data_frame = df, x = x_age, y = y_age, color = x_age,
    title = ‘Number of customers per age group’)

    Number of customers per age group

    Let’s plot the Relation between Annual Income and Spending Score

    fig=px.scatter(data_frame = df, x = ‘Annual Income (k$)’, y = ‘Spending Score (1-100)’, color=”Age”,
    title = ‘Relation between Annual Income and Spending Score’)
    fig.update_traces(marker=dict(size=18,
    line=dict(width=1,
    color=’DarkSlateGrey’)),
    selector=dict(mode=’markers’))

    Relation between Annual Income and Spending Score

    Let’s look at the Number of customers per Spending Score group

    ss_1_20 = df[‘Spending Score (1-100)’][(df[‘Spending Score (1-100)’] >= 1) &
    (df[‘Spending Score (1-100)’] <= 20)]

    ss_21_40 = df[‘Spending Score (1-100)’][(df[‘Spending Score (1-100)’] >= 21) &
    (df[‘Spending Score (1-100)’] <= 40)]

    ss_41_60 = df[‘Spending Score (1-100)’][(df[‘Spending Score (1-100)’] >= 41) &
    (df[‘Spending Score (1-100)’] <= 60)]

    ss_61_80 = df[‘Spending Score (1-100)’][(df[‘Spending Score (1-100)’] >= 61) &
    (df[‘Spending Score (1-100)’] <= 80)]

    ss_81_100 = df[‘Spending Score (1-100)’][(df[‘Spending Score (1-100)’] >= 81) &
    (df[‘Spending Score (1-100)’] <= 100)]

    x_ss = [‘1-20′, ’21-40′, ’41-60′, ’61-80′, ’81-100’]
    y_ss = [len(ss_1_20.values), len(ss_21_40.values), len(ss_41_60.values), len(ss_61_80.values),
    len(ss_81_100.values)]

    px.bar(data_frame = df, x = x_ss, y = y_ss, color = x_ss,
    title = ‘Number of customers per Spending Score group’)

    Number of customers per Spending Score group

    Let’s check the Number of customers per Annual Income group

    ai_0_30 = df[‘Annual Income (k$)’][(df[‘Annual Income (k$)’] >= 0) & (df[‘Annual Income (k$)’] <= 30)] ai_31_60 = df[‘Annual Income (k$)’][(df[‘Annual Income (k$)’] >= 31)&(df[‘Annual Income (k$)’] <= 60)] ai_61_90 = df[‘Annual Income (k$)’][(df[‘Annual Income (k$)’] >= 61)&(df[‘Annual Income (k$)’] <= 90)] ai_91_120 = df[‘Annual Income (k$)’][(df[‘Annual Income (k$)’]>= 91)&(df[‘Annual Income (k$)’]<=120)] ai_121_150 = df[‘Annual Income (k$)’][(df[‘Annual Income (k$)’]>=121)&(df[‘Annual Income (k$)’]<=150)]

    x_ai = [‘$ 0-30,000’, ‘$ 30,001-60,000’, ‘$ 60,001-90,000’, ‘$ 90,001-120,000’, ‘$ 120,000-150,000’]
    y_ai = [len(ai_0_30.values) , len(ai_31_60.values) , len(ai_61_90.values) , len(ai_91_120.values),
    len(ai_121_150.values)]

    px.bar(data_frame = df, x = x_ai, y = y_ai, color = x_ai,
    title = ‘Number of customers per Annual Income group’)

    Number of customers per Annual Income group

    Let’s perform K-means clustering using Age and Spending Score columns

    X1 = df.loc[:, [‘Age’, ‘Spending Score (1-100)’]].values

    wcss= []
    for k in range(1, 11):
    kmeans = KMeans(n_clusters = k, init = ‘k-means++’)
    kmeans.fit(X1)
    wcss.append(kmeans.inertia_)

    plt.figure(figsize = (12, 7))

    plt.plot(range(1, 11), wcss, linewidth = 2, marker = ‘8’)
    plt.title(‘Elbow Plot\n’, fontsize = 20)
    plt.xlabel(‘K’)
    plt.ylabel(‘WCSS’)
    plt.show()

    K-means clustering using Age and Spending Score columns: Elbow plot

    Let’s set n_clusters = 4 and run

    kmeans = KMeans(n_clusters = 4)
    labels = kmeans.fit_predict(X1)

    Let’s plot customer clusters in the domain Age vs Spending Score (1-100)

    plt.figure(figsize = (14, 8))

    plt.scatter(X1[:, 0], X1[:, 1], c = kmeans.labels_, s = 200)
    plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], color = ‘red’, s = 350)
    plt.title(‘Clusters of Customers\n’, fontsize = 20)
    plt.xlabel(‘Age’)
    plt.ylabel(‘Spending Score (1-100)’)
    plt.show()

    Customer clusters in the domain Age vs Spending Score (1-100)

    Let’s perform K-means clustering using Age and Annual Income columns

    X2 = df.loc[:, [‘Age’, ‘Annual Income (k$)’]].values

    wcss= []
    for k in range(1, 11):
    kmeans = KMeans(n_clusters = k, init = ‘k-means++’)
    kmeans.fit(X2)
    wcss.append(kmeans.inertia_)

    plt.figure(figsize = (12, 7))

    plt.plot(range(1, 11), wcss, linewidth = 2, marker = ‘8’)
    plt.title(‘Elbow Plot\n’, fontsize = 20)
    plt.xlabel(‘K’)
    plt.ylabel(‘WCSS’)
    plt.show()

    K-means clustering using Age and Annual Income columns: Elbow Plot

    Let’s set n_clusters = 5 and run

    kmeans = KMeans(n_clusters = 5)
    labels = kmeans.fit_predict(X2)

    Let’s plot Customer Clusters in the domain Annual Income (k$) vs Spending Score (1-100)

    Customer Clusters in the domain Annual Income (k$) vs Spending Score (1-100)

    Let’s perform K-means clustering using Age, Annual Score and Spending Score columns
    X4 = df.iloc[:, 1:]

    wcss= []
    for k in range(1, 11):
    kmeans = KMeans(n_clusters = k, init = ‘k-means++’)
    kmeans.fit(X4)
    wcss.append(kmeans.inertia_)

    plt.figure(figsize = (12, 7))

    plt.plot(range(1, 11), wcss, linewidth = 2, marker = ‘8’)
    plt.title(‘Elbow Plot\n’, fontsize = 20)
    plt.xlabel(‘K’)
    plt.ylabel(‘WCSS’)
    plt.show()

    K-means clustering using Age, Annual Score and Spending Score columns: Elbow plot

    Let’s set n_clusters = 6 and run

    kmeans = KMeans(n_clusters = 6)
    clusters = kmeans.fit_predict(X4)
    X4[‘label’] = clusters

    Let’s look at the corresponding 3D plot

    fig = px.scatter_3d(X4, x=”Annual Income (k$)”, y=”Spending Score (1-100)”, z=”Age”,
    color = ‘label’, size = ‘label’)
    fig.show()

    Customer clusters 3D plot:  x="Annual Income (k$)", y="Spending Score (1-100)", z="Age"

    Let’s plot the Hierarchical Clustering dendrogram
    plt.figure(figsize = (22, 8))

    dendo = dendrogram(linkage(X3, method = ‘ward’))
    plt.title(‘Dendrogram’, fontsize = 15)
    plt.show()

    Hierarchical Clustering dendrogram

    Let’s run Agglomerative Clustering with n_clusters = 5

    agc = AgglomerativeClustering(n_clusters = 5, affinity = ‘euclidean’, linkage = ‘ward’)
    labels = agc.fit_predict(X3)

    plt.figure(figsize = (12, 8))
    smb=200

    plt.scatter(X3[labels == 0,0], X3[labels == 0,1], label = ‘Cluster 1’, s = smb)
    plt.scatter(X3[labels == 1,0], X3[labels == 1,1], label = ‘Cluster 2’, s = smb)
    plt.scatter(X3[labels == 2,0], X3[labels == 2,1], label = ‘Cluster 3’, s = smb)
    plt.scatter(X3[labels == 3,0], X3[labels == 3,1], label = ‘Cluster 4’, s = smb)
    plt.scatter(X3[labels == 4,0], X3[labels == 4,1], label = ‘Cluster 5’, s = smb)

    plt.legend(loc = ‘best’)
    plt.title(‘Clusters of Customers\n ‘, fontsize = 20)
    plt.xlabel(‘Annual Income (k$)’)
    plt.ylabel(‘Spending Score (1-100)’)
    plt.show()

    Clusters of Customers:
Annual Income (k$) vs
Spending Score (1-100)

    Let’s compare these results with the DBSCAN clustering algorithm

    centers = [[1, 1], [-1, -1], [1, -1]]
    X, labels_true = make_blobs(n_samples=750, centers=centers, cluster_std=0.4,
    random_state=0) # generate sample blobs

    X = StandardScaler().fit_transform(X)

    db = DBSCAN(eps=0.3, min_samples=10).fit(X)

    Creating an array of true and false as the same size as db.labels
    core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
    core_samples_mask[db.core_sample_indices_] = True

    (setting the indices of the core regions to True)
    labels = db.labels_

    (similar to the model.fit() method, it gives the labels of the clustered data).

    The label -1 is considered as noise by the DBSCAN algorithm
    n_clusters_ = len(set(labels)) – (1 if -1 in labels else 0)
    n_noise_ = list(labels).count(-1) # calculating the number of clusters

    print(‘Estimated number of clusters: %d’ % n_clusters_)
    print(‘Estimated number of noise points: %d’ % n_noise_)

    “””Homogeneity metric of a cluster labeling given a ground truth.

    A clustering result satisfies homogeneity if all of its clusters
    contain only data points which are members of a single class.”””

    print(“Homogeneity: %0.3f” % metrics.homogeneity_score(labels_true, labels))

    Estimated number of clusters: 3
    Estimated number of noise points: 18
    Homogeneity: 0.953

    Let’s plot the result representing 3 clusters + noise

    plt.figure(figsize = (10, 8))

    unique_labels = set(labels) # identifying all the unique labels/clusters
    colors = [plt.cm.Spectral(each)
    # creating the list of colours, generating the colourmap
    for each in np.linspace(0, 1, len(unique_labels))]

    for k, col in zip(unique_labels, colors):
    if k == -1:
    # Black used for noise.
    col = [0, 0, 0, 1]
    class_member_mask = (labels == k) # assigning class members for each class
    xy = X[class_member_mask & core_samples_mask] # creating the list of points for each class
    plt.plot(xy[:, 0], xy[:, 1], ‘o’, markerfacecolor=tuple(col),markeredgecolor=’k’, markersize=14)
    xy = X[class_member_mask & ~core_samples_mask] # creating the list of noise points
    plt.plot(xy[:, 0], xy[:, 1], ‘o’, markerfacecolor=tuple(col), markeredgecolor=’k’, markersize=14)

    plt.title(‘Clustering using DBSCAN\n’, fontsize = 15)
    plt.show()

    Clustering using DBSCAN

    Here black symbols designate noise.

    Read more here.

    Online Retail K-Means Clustering

    Let’s continue using the dataset Mall_Customers.csv

    from sklearn.preprocessing import StandardScaler

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    import seaborn as sns
    %matplotlib inline

    import os
    import warnings

    warnings.filterwarnings(‘ignore’)

    df = pd.read_csv(‘Mall_Customers.csv’)

    Let’s rename the columns

    df.rename(index=str, columns={‘Annual Income (k$)’: ‘Income’,
    ‘Spending Score (1-100)’: ‘Score’}, inplace=True)
    df.head()

    The input data table

    Let’s examine the input data using sns.pairplot
    X = df.drop([‘CustomerID’, ‘Gender’], axis=1)
    sns.pairplot(df.drop(‘CustomerID’, axis=1), hue=’Gender’, aspect=1.5)
    plt.show()

    Input data pairplot

    Let’s look at the K-Means Elbow plot

    from sklearn.cluster import KMeans

    clusters = []

    for i in range(1, 11):
    km = KMeans(n_clusters=i).fit(X)
    clusters.append(km.inertia_)

    fig, ax = plt.subplots(figsize=(12, 8))
    sns.lineplot(x=list(range(1, 11)), y=clusters, ax=ax,linewidth = 4)
    ax.set_title(‘Searching for Elbow’)
    ax.set_xlabel(‘Clusters’)
    ax.set_ylabel(‘Inertia’)

    ax.annotate(‘Possible Elbow Point’, xy=(3, 140000), xytext=(3, 50000), xycoords=’data’,
    arrowprops=dict(arrowstyle=’->’, connectionstyle=’arc3′, color=’red’, lw=2))

    ax.annotate(‘Possible Elbow Point’, xy=(5, 80000), xytext=(5, 150000), xycoords=’data’,
    arrowprops=dict(arrowstyle=’->’, connectionstyle=’arc3′, color=’red’, lw=2))

    plt.show()

    Searching for Elbow

    Let’s plot K-means with 3 clusters Income vs Score

    km3 = KMeans(n_clusters=3).fit(X)

    X[‘Labels’] = km3.labels_
    plt.figure(figsize=(12, 8))

    sns.relplot(x=”Income”, y=”Score”, hue=”Labels”, size=”Income”,
    sizes=(40, 400), alpha=.5, palette=”muted”,
    height=6, data=X)
    plt.title(‘KMeans with 3 Clusters’)
    plt.show()

    K-means with 3 clusters Income vs Score

    Let’s plot K-means with 3 clusters Income-Age vs Score

    km3 = KMeans(n_clusters=3).fit(X)

    X[‘Labels’] = km3.labels_
    plt.figure(figsize=(12, 8))

    sns.relplot(x=”Income”, y=”Score”, hue=”Labels”, size=”Age”,
    sizes=(40, 400), alpha=.5, palette=”muted”,
    height=6, data=X)
    plt.title(‘KMeans with 3 Clusters’)
    plt.show()

    K-means with 3 clusters Income-Age vs Score

    Let’s plot K-means with 5 clusters Income vs Score

    km3 = KMeans(n_clusters=5).fit(X)

    X[‘Labels’] = km3.labels_
    plt.figure(figsize=(12, 8))
    sns.relplot(x=”Income”, y=”Score”, hue=”Labels”, size=”Income”,
    sizes=(40, 400), alpha=.5, palette=”muted”,
    height=6, data=X)
    plt.title(‘KMeans with 5 Clusters’)
    plt.show()

    K-means with 5 clusters Income vs Score

    Let’s plot K-means with 5 clusters Income-Age vs Score

    km3 = KMeans(n_clusters=5).fit(X)

    X[‘Labels’] = km3.labels_
    plt.figure(figsize=(12, 8))
    sns.relplot(x=”Income”, y=”Score”, hue=”Labels”, size=”Age”,
    sizes=(40, 400), alpha=.5, palette=”muted”,
    height=6, data=X)
    plt.title(‘KMeans with 5 Clusters’)
    plt.show()

    K-means with 5 clusters Income-Age vs Score

    Let’s invoke sns.swarmplot to compare Labels According to Annual Income and Scoring History

    fig = plt.figure(figsize=(20,8))
    sns.set(font_scale=2)
    ax = fig.add_subplot(121)
    sns.swarmplot(x=’Labels’, y=’Income’, data=X, ax=ax,hue=”Labels”, legend=False)
    ax.set_title(‘Labels According to Annual Income’)

    ax = fig.add_subplot(122)
    sns.swarmplot(x=’Labels’, y=’Score’, data=X, ax=ax)
    ax.set_title(‘Labels According to Scoring History’)

    plt.show()

    sns.swarmplot: Labels according to Annual Income and Scoring History

    Hierarchical Clustering:
    from sklearn.cluster import AgglomerativeClustering
    sns.set(font_scale=1)
    agglom = AgglomerativeClustering(n_clusters=5, linkage=’average’).fit(X)

    X[‘Labels’] = agglom.labels_
    plt.figure(figsize=(12, 8))

    sns.relplot(x=”Income”, y=”Score”, hue=”Labels”, size=”Age”,
    sizes=(40, 400), alpha=.5, palette=sns.color_palette(‘hls’, 5),
    height=6, data=X)

    plt.title(‘Agglomerative with 5 Clusters’)
    plt.show()

    Agglomerative Clustering with 5 clusters

    Let’s compute the spatial distance matrix

    from scipy.cluster import hierarchy
    from scipy.spatial import distance_matrix

    dist = distance_matrix(X, X)
    Z = hierarchy.linkage(dist, ‘complete’)

    and plot the dendrogram with dist=’complete’

    plt.figure(figsize=(18, 50))
    dendro = hierarchy.dendrogram(Z, leaf_rotation=0, leaf_font_size=12, orientation=’right’)

    The dendrogram with dist='complete'

    Let’s plot the dendrogram with dist=’average’

    Z = hierarchy.linkage(dist, ‘average’)
    plt.figure(figsize=(18, 50))
    dendro = hierarchy.dendrogram(Z, leaf_rotation=0, leaf_font_size =12, orientation = ‘right’)

    The dendrogram with dist='average'

    Density Based Clustering (DBSCAN):
    from sklearn.cluster import DBSCAN

    db = DBSCAN(eps=11, min_samples=6).fit(X)

    X[‘Labels’] = db.labels_
    plt.figure(figsize=(12, 8))

    sns.relplot(x=”Income”, y=”Score”, hue=”Labels”, size=”Age”,
    sizes=(40, 400), alpha=.5, palette=sns.color_palette(‘hls’, 5),
    height=6, data=X)
    plt.title(‘DBSCAN with epsilon 11, min samples 6’)
    plt.show()

    DBSCAN with epsilon 11, min samples 6

    Read more here.

    Online Retail Data Analytics

    Let’s look at the OnlineRetail.csv dataset

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    import seaborn as sns
    import missingno as msno
    df = pd.read_csv(“OnlineRetail.csv”, delimiter=’,’, encoding = “ISO-8859-1”)

    Let’s check the data structure

    df.head()

    Online Retail input data table

    df.info()

    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 541909 entries, 0 to 541908
    Data columns (total 8 columns):
     #   Column       Non-Null Count   Dtype  
    ---  ------       --------------   -----  
     0   InvoiceNo    541909 non-null  object 
     1   StockCode    541909 non-null  object 
     2   Description  540455 non-null  object 
     3   Quantity     541909 non-null  int64  
     4   InvoiceDate  541909 non-null  object 
     5   UnitPrice    541909 non-null  float64
     6   CustomerID   406829 non-null  float64
     7   Country      541909 non-null  object 
    dtypes: float64(2), int64(1), object(5)
    memory usage: 33.1+ MB

    df.describe()

    Online Retail descriptive statistics

    Data Cleaning: Checking for Null Values
    msno.bar(df)

    Checking for Null Values before data cleaning

    df.count()

    InvoiceNo      541909
    StockCode      541909
    Description    540455
    Quantity       541909
    InvoiceDate    541909
    UnitPrice      541909
    CustomerID     406829
    Country        541909
    dtype: int64

    df[df[‘CustomerID’].isnull()].count()

    InvoiceNo      135080
    StockCode      135080
    Description    133626
    Quantity       135080
    InvoiceDate    135080
    UnitPrice      135080
    CustomerID          0
    Country        135080
    dtype: int64

    Let’s check the percentage

    100 – ((541909-135000)/541909 * 100)

    24.911931708091203

    So, approximately 25% of the data is missing.

    We will proceed with dropping the missing rows now.
    df.dropna(inplace=True)

    msno.bar(df)

    Checking for Null Values after data cleaning

    Let’s perform further data editing

    df[‘InvoiceDate’] = pd.to_datetime(df[‘InvoiceDate’], format=’%d-%m-%Y %H:%M’)

    df[‘Total Amount Spent’]= df[‘Quantity’] * df[‘UnitPrice’]

    total_amount = df[‘Total Amount Spent’].groupby(df[‘CustomerID’]).sum()
    total_amount = pd.DataFrame(total_amount).reset_index()

    transactions = df[‘InvoiceNo’].groupby(df[‘CustomerID’]).count()
    transaction = pd.DataFrame(transactions).reset_index()

    final = df[‘InvoiceDate’].max()
    df[‘Last_transact’] = final – df[‘InvoiceDate’]
    LT = df.groupby(df[‘CustomerID’]).min()[‘Last_transact’]
    LT = pd.DataFrame(LT).reset_index()

    df_new = pd.merge(total_amount, transaction, how=’inner’, on=’CustomerID’)
    df_new = pd.merge(df_new, LT, how=’inner’, on=’CustomerID’)

    df_new[‘Last_transact’] = df_new[‘Last_transact’].dt.days

    Let’s run the K-Means Clustering Model
    from sklearn.cluster import KMeans
    kmeans= KMeans(n_clusters=2)

    kmeans.fit(df_new[[‘Total Amount Spent’, ‘InvoiceNo’, ‘Last_transact’]])
    pred = kmeans.predict(df_new[[‘Total Amount Spent’, ‘InvoiceNo’, ‘Last_transact’]])

    df_new=df_new.join(pred, lsuffix=”_left”, rsuffix=”_right”)

    error_rate = []
    for clusters in range(1,16):
    kmeans = KMeans(n_clusters = clusters)
    kmeans.fit(df_new)
    kmeans.predict(df_new)
    error_rate.append(kmeans.inertia_)

    error_rate = pd.DataFrame({‘Cluster’:range(1,16) , ‘Error’:error_rate})

    Let’s plot the Error Rate and Clusters

    plt.figure(figsize=(12,8))
    p = sns.barplot(x=’Cluster’, y= ‘Error’, data= error_rate, palette=’coolwarm_r’)
    sns.despine(left=True)
    p.set_title(‘Error Rate and Clusters’)

    Error Rate and Clusters

    Let’s drop 1 column

    del df[‘InvoiceDate’]

    df.info()

    <class 'pandas.core.frame.DataFrame'>
    Index: 392692 entries, 0 to 541908
    Data columns (total 9 columns):
     #   Column        Non-Null Count   Dtype  
    ---  ------        --------------   -----  
     0   InvoiceNo     392692 non-null  object 
     1   StockCode     392692 non-null  object 
     2   Description   392692 non-null  object 
     3   Quantity      392692 non-null  int64  
     4   UnitPrice     392692 non-null  float64
     5   CustomerID    392692 non-null  float64
     6   Country       392692 non-null  object 
     7   cancellation  392692 non-null  int64  
     8   TotalPrice    392692 non-null  float64
    dtypes: float64(3), int64(2), object(4)
    memory usage: 30.0+ MB

    Let’s group the data

    country_wise = df.groupby(‘Country’).sum()

    and read the full country list

    country_codes = pd.read_csv(‘wikipedia-iso-country-codes.csv’, names=[‘Country’, ‘two’, ‘three’, ‘numeric’, ‘ISO’])

    country_codes.head()

    Full country list

    Let’s merge df and country_codes

    country_wise = pd.merge(country_codes,country_wise, on=’Country’)

    Let’s invoke plotly for plotting purposes

    from plotly import version
    import cufflinks as cf
    from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
    init_notebook_mode(connected=True)
    cf.go_offline()
    import plotly.graph_objs as go

    Let’s plot European Countries According to Revenue

    data = dict(type=’choropleth’,colorscale=’GnBu’, locations = country_wise[‘three’], locationmode = ‘ISO-3’, z= country_wise[‘Total Amount Spent’], text = country_wise[‘Country’], colorbar={‘title’:’Revenue’}, marker = dict(line=dict(width=0)))
    layout = dict(title = ‘European Countries According to Revenue!’, geo = dict(scope=’europe’,showlakes=False, projection = {‘type’: ‘winkel tripel’}))
    Choromaps2 = go.Figure(data=[data], layout=layout)
    iplot(Choromaps2)

    European Countries According to Revenue

    Let’s plot All Countries According to Revenue

    data = dict(type=’choropleth’,colorscale=’rainbow’, locations = country_wise[‘three’], locationmode = ‘ISO-3’, z= country_wise[‘Total Amount Spent’], text = country_wise[‘Country’], colorbar={‘title’:’Revenue’}, marker = dict(line=dict(width=0)))
    layout = dict(title = ‘All Countries According to Revenue!’, geo = dict(scope=’world’,showlakes=False, projection = {‘type’: ‘winkel tripel’}))
    Choromaps2 = go.Figure(data=[data], layout=layout)
    iplot(Choromaps2)

    All Countries According to Revenue

    Let’s create the list of countries based upon the percentage of Total Price

    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt
    import seaborn as sns
    df = pd.read_csv(“OnlineRetail.csv”, encoding= ‘unicode_escape’)

    df = df.drop_duplicates()

    df[‘cancellation’] = df.InvoiceNo.str.extract(‘([C])’).fillna(0).replace({‘C’:1})
    df.cancellation.value_counts()

    cancellation
    0    527390
    1      9251
    Name: count, dtype: int64

    df[df.cancellation == 1][‘CustomerID’].nunique() / df.CustomerID.nunique() * 100

    36.34492223238792

    df = df[df.CustomerID.notnull()]
    df = df[(df.Quantity > 0) & (df.UnitPrice > 0)]

    df = df.drop(‘cancellation’, axis = 1)

    df[“TotalPrice”] = df.UnitPrice * df.Quantity

    df2 = pd.DataFrame(df.groupby(‘Country’).TotalPrice.sum().apply(lambda x: round(x, 2))).sort_values(‘TotalPrice’, ascending = False)

    df2[‘perc_of_TotalPrice’] = round(df2.TotalPrice / df2.TotalPrice.sum() * 100, 2)
    df2

    The list of countries based upon the percentage of Total Price

    Read more here.

    RFM Customer Segmentation

    Let’s set the working directory YOURPATH

    import os
    os.chdir(‘YOURPATH’)
    os. getcwd()

    and import the key libraries

    !pip install termcolor

    import numpy as np # linear algebra
    import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
    import ipywidgets
    from ipywidgets import interact

    import numpy as np
    import pandas as pd

    import seaborn as sns
    import matplotlib.pyplot as plt
    %matplotlib inline
    import matplotlib.ticker as mticker
    import squarify as sq

    import scipy.stats as stats
    import statsmodels.api as sm
    import statsmodels.formula.api as smf
    import missingno as msno

    import datetime as dt
    from datetime import datetime

    from sklearn.preprocessing import scale, StandardScaler, MinMaxScaler, RobustScaler
    from sklearn.preprocessing import PolynomialFeatures
    from sklearn.preprocessing import OneHotEncoder
    from sklearn.preprocessing import PowerTransformer
    from sklearn.preprocessing import LabelEncoder
    from sklearn.compose import make_column_transformer
    from sklearn.cluster import KMeans
    from sklearn.decomposition import PCA

    import plotly.express as px
    import cufflinks as cf
    import plotly.offline
    cf.go_offline()
    cf.set_config_file(offline=False, world_readable=True)

    import colorama
    from colorama import Fore, Style # maakes strings colored
    from termcolor import colored
    from termcolor import cprint

    from wordcloud import WordCloud

    import warnings
    warnings.filterwarnings(“ignore”)
    warnings.warn(“this will not show”)

    plt.rcParams[“figure.figsize”] = (10,6)
    pd.set_option(‘max_colwidth’,200)
    pd.set_option(‘display.max_rows’, 1000)
    pd.set_option(‘display.max_columns’, 200)
    pd.set_option(‘display.float_format’, lambda x: ‘%.3f’ % x)

    Let’s define the following functions

    ##########################

    def missing_values(df):
    missing_number = df.isnull().sum().sort_values(ascending=False)
    missing_percent = (df.isnull().sum()/df.isnull().count()).sort_values(ascending=False)
    missing_values = pd.concat([missing_number, missing_percent], axis=1, keys=[‘Missing_Number’, ‘Missing_Percent’])
    return missing_values[missing_values[‘Missing_Number’]>0]

    def first_looking(df):
    print(colored(“Shape:”, attrs=[‘bold’]), df.shape,’\n’,
    colored(‘100, ‘red’, attrs=[‘bold’]),
    colored(“\nInfo:\n”, attrs=[‘bold’]), sep=”)
    print(df.info(), ‘\n’,
    colored(‘100, ‘red’, attrs=[‘bold’]), sep=”)
    print(colored(“Number of Uniques:\n”, attrs=[‘bold’]), df.nunique(),’\n’,
    colored(‘100, ‘red’, attrs=[‘bold’]), sep=”)
    print(colored(“Missing Values:\n”, attrs=[‘bold’]), missing_values(df),’\n’,
    colored(‘100, ‘red’, attrs=[‘bold’]), sep=”)
    print(colored(“All Columns:”, attrs=[‘bold’]), list(df.columns),’\n’,
    colored(‘100, ‘red’, attrs=[‘bold’]), sep=”)

    df.columns= df.columns.str.lower().str.replace('&', '_').str.replace(' ', '_')
    print(colored("Columns after rename:", attrs=['bold']), list(df.columns),'\n',
          colored('*'*100, 'red', attrs=['bold']), sep='')  
    print(colored("Columns after rename:", attrs=['bold']), list(df.columns),'\n',
          colored('*'*100, 'red', attrs=['bold']), sep='')
    print(colored("Descriptive Statistics \n", attrs=['bold']), df.describe().round(2),'\n',
          colored('*'*100, 'red', attrs=['bold']), sep='') # Gives a statstical breakdown of the data.
    print(colored("Descriptive Statistics (Categorical Columns) \n", attrs=['bold']), df.describe(include=object).T,'\n',
          colored('*'*100, 'red', attrs=['bold']), sep='') # Gives a statstical breakdown of the data.
    

    def multicolinearity_control(df):
    feature =[]
    collinear=[]
    for col in df.corr().columns:
    for i in df.corr().index:
    if (abs(df.corr()[col][i])> .9 and abs(df.corr()[col][i]) < 1):
    feature.append(col)
    collinear.append(i)
    print(colored(f”Multicolinearity alert in between:{col} – {i}”,
    “red”, attrs=[‘bold’]), df.shape,’\n’,
    colored(‘100, ‘red’, attrs=[‘bold’]), sep=”)

    def duplicate_values(df):
    print(colored(“Duplicate check…”, attrs=[‘bold’]), sep=”)
    print(“There are”, df.duplicated(subset=None, keep=’first’).sum(), “duplicated observations in the dataset.”)

    def drop_null(df, limit):
    print(‘Shape:’, df.shape)
    for i in df.isnull().sum().index:
    if (df.isnull().sum()[i]/df.shape[0]*100)>limit:
    print(df.isnull().sum()[i], ‘percent of’, i ,’null and were dropped’)
    df.drop(i, axis=1, inplace=True)
    print(‘new shape:’, df.shape)
    print(‘New shape after missing value control:’, df.shape)

    def first_look(col):
    print(“column name : “, col)
    print(“——————————–“)
    print(“Per_of_Nulls : “, “%”, round(df[col].isnull().sum()/df.shape[0]*100, 2))
    print(“Num_of_Nulls : “, df[col].isnull().sum())
    print(“Num_of_Uniques : “, df[col].nunique())
    print(“Duplicates : “, df.duplicated(subset=None, keep=’first’).sum())
    print(df[col].value_counts(dropna = False))

    def fill_most(df, group_col, col_name):
    ”’Fills the missing values with the most existing value (mode) in the relevant column according to single-stage grouping”’
    for group in list(df[group_col].unique()):
    cond = df[group_col]==group
    mode = list(df[cond][col_name].mode())
    if mode != []:
    df.loc[cond, col_name] = df.loc[cond, col_name].fillna(df[cond][col_name].mode()[0])
    else:
    df.loc[cond, col_name] = df.loc[cond, col_name].fillna(df[col_name].mode()[0])
    print(“Number of NaN : “,df[col_name].isnull().sum())
    print(“——————“)
    print(df[col_name].value_counts(dropna=False))

    ##########################

    Let’s read and copy the input Online Retail dataset

    df0=pd.read_csv(‘Online Retail.csv’,index_col=0)
    df = df0.copy()

    Let’s check the dataset content

    first_looking(df)
    duplicate_values(df)

    Shape:(541909, 8)
    ****************************************************************************************************
    Info:
    
    <class 'pandas.core.frame.DataFrame'>
    Index: 541909 entries, 0 to 541908
    Data columns (total 8 columns):
     #   Column       Non-Null Count   Dtype  
    ---  ------       --------------   -----  
     0   InvoiceNo    541909 non-null  object 
     1   StockCode    541909 non-null  object 
     2   Description  540455 non-null  object 
     3   Quantity     541909 non-null  float64
     4   InvoiceDate  541909 non-null  object 
     5   UnitPrice    541909 non-null  float64
     6   CustomerID   406829 non-null  float64
     7   Country      541909 non-null  object 
    dtypes: float64(3), object(5)
    memory usage: 37.2+ MB
    None
    ****************************************************************************************************
    Number of Uniques:
    InvoiceNo      25900
    StockCode       4070
    Description     4223
    Quantity         722
    InvoiceDate    23260
    UnitPrice       1630
    CustomerID      4372
    Country           38
    dtype: int64
    ****************************************************************************************************
    Missing Values:
                 Missing_Number  Missing_Percent
    CustomerID           135080            0.249
    Description            1454            0.003
    ****************************************************************************************************
    All Columns:['InvoiceNo', 'StockCode', 'Description', 'Quantity', 'InvoiceDate', 'UnitPrice', 'CustomerID', 'Country']
    ****************************************************************************************************
    Columns after rename:['invoiceno', 'stockcode', 'description', 'quantity', 'invoicedate', 'unitprice', 'customerid', 'country']
    ****************************************************************************************************
    Columns after rename:['invoiceno', 'stockcode', 'description', 'quantity', 'invoicedate', 'unitprice', 'customerid', 'country']
    ****************************************************************************************************
    Descriptive Statistics 
            quantity  unitprice  customerid
    count 541909.000 541909.000  406829.000
    mean       9.550      4.610   15287.690
    std      218.080     96.760    1713.600
    min   -80995.000 -11062.060   12346.000
    25%        1.000      1.250   13953.000
    50%        3.000      2.080   15152.000
    75%       10.000      4.130   16791.000
    max    80995.000  38970.000   18287.000
    ****************************************************************************************************
    Descriptive Statistics (Categorical Columns) 
                  count unique                                 top    freq
    invoiceno    541909  25900                            573585.0    1114
    stockcode    541909   4070                              85123A    2313
    description  540455   4223  WHITE HANGING HEART T-LIGHT HOLDER    2369
    invoicedate  541909  23260                 2011-10-31 14:41:00    1114
    country      541909     38                      United Kingdom  495478
    ****************************************************************************************************
    Duplicate check...
    There are 5268 duplicated observations in the dataset.

    Let’s calculate the total price

    df[‘total_price’] = df[‘quantity’] * df[‘unitprice’]
    print(“There is”, df.shape[0], “observations and”, df.shape[1], “columns in the dataset”)

    There is 541909 observations and 9 columns in the dataset

    Let’s plot Invoice Counts Per Country

    fig = px.histogram(df, x = df.groupby(‘country’)[‘invoiceno’].nunique().index,
    y = df.groupby(‘country’)[‘invoiceno’].nunique().values,
    title = ‘Invoice Counts Per Country’,
    labels = dict(x = “Countries”, y =”Invoice”))
    fig.show();

    Invoice Counts Per Country

    Let’s group the data

    df.groupby([‘invoiceno’,’customerid’, ‘country’])[‘total_price’].sum().sort_values().head()

    invoiceno  customerid  country       
    C581484    16446.000   United Kingdom   -168469.600
    C541433    12346.000   United Kingdom    -77183.600
    C556445    15098.000   United Kingdom    -38970.000
    C550456    15749.000   United Kingdom    -22998.400
    C570556    16029.000   United Kingdom    -11816.640
    Name: total_price, dtype: float64

    Let’s plot Customer Counts by Country

    fig = px.histogram(df, x = df.groupby(‘country’)[‘customerid’].nunique().index,
    y = df.groupby(‘country’)[‘customerid’].nunique().values,
    title = ‘Customer Counts by Country’,
    labels = dict(x = “Countries”, y =”Customer”))
    fig.show()

    Customer Counts by Country

    Let’s check missing values

    df.isnull().melt(value_name=”missing”)

    Clean the Data from the Noise and Missing Values:
    missing_values(df)

    Missing numbers

    Drop observations with missing values:
    df = df.dropna(subset=[“customerid”])

    Get rid of cancellations and negative values:
    df = df[(df[‘unitprice’] > 0) & (df[‘quantity’] > 0)]

    Check for duplicates and get rid of them:
    print(“There are”, df.duplicated(subset=None, keep=’first’).sum(), “duplicated observations in the dataset.”)
    print(df.duplicated(subset=None, keep=’first’).sum(), “Duplicated observations are dropped!”)
    df.drop_duplicates(keep=’first’, inplace=True)

    print(“There are”, df.duplicated(subset=None, keep=’first’).sum(), “duplicated observations in the dataset.”)
    print(df.duplicated(subset=None, keep=’first’).sum(), “Duplicated observations are dropped!”)
    df.drop_duplicates(keep=’first’, inplace=True)

    There are 5192 duplicated observations in the dataset.
    5192 Duplicated observations are dropped!

    df.columns

    Index(['invoiceno', 'stockcode', 'description', 'quantity', 'invoicedate',
           'unitprice', 'customerid', 'country', 'total_price'],
          dtype='object')

    Explore the Orders:
    cprint(“Unique number of invoice per customer”,’red’)
    df.groupby(“customerid”)[“invoiceno”].nunique().sort_values(ascending = False)

    Unique number of invoice per customer
    customerid
    12748.000    209
    14911.000    201
    17841.000    124
    13089.000     97
    14606.000     93
                ... 
    15314.000      1
    15313.000      1
    15308.000      1
    15307.000      1
    15300.000      1
    Name: invoiceno, Length: 4338, dtype: int64

    Let’s plot the country-based wordcloud

    from wordcloud import WordCloud

    country_text = df[“country”].str.split(” “).str.join(“_”)
    all_countries = ” “.join(country_text)

    wc = WordCloud(background_color=”red”,
    max_words=250,
    max_font_size=256,
    random_state=42,
    width=800, height=400)
    wc.generate(all_countries)
    plt.figure(figsize = (12, 10))
    plt.imshow(wc)
    plt.axis(‘off’)
    plt.show()

    Country-based wordcloud

    Let’s focus on the UK online retail market

    df_uk = df[df[“country”]==”United Kingdom”]

    first_looking(df_uk)
    duplicate_values(df_uk)

    Shape:(349203, 9)
    ****************************************************************************************************
    Info:
    
    <class 'pandas.core.frame.DataFrame'>
    Index: 349203 entries, 0 to 541893
    Data columns (total 9 columns):
     #   Column       Non-Null Count   Dtype  
    ---  ------       --------------   -----  
     0   invoiceno    349203 non-null  object 
     1   stockcode    349203 non-null  object 
     2   description  349203 non-null  object 
     3   quantity     349203 non-null  float64
     4   invoicedate  349203 non-null  object 
     5   unitprice    349203 non-null  float64
     6   customerid   349203 non-null  float64
     7   country      349203 non-null  object 
     8   total_price  349203 non-null  float64
    dtypes: float64(4), object(5)
    memory usage: 26.6+ MB
    None
    ****************************************************************************************************
    Number of Uniques:
    invoiceno      16646
    stockcode       3645
    description     3844
    quantity         293
    invoicedate    15612
    unitprice        402
    customerid      3920
    country            1
    total_price     2792
    dtype: int64
    ****************************************************************************************************
    Missing Values:
    Empty DataFrame
    Columns: [Missing_Number, Missing_Percent]
    Index: []
    ****************************************************************************************************
    All Columns:['invoiceno', 'stockcode', 'description', 'quantity', 'invoicedate', 'unitprice', 'customerid', 'country', 'total_price']
    ****************************************************************************************************
    Columns after rename:['invoiceno', 'stockcode', 'description', 'quantity', 'invoicedate', 'unitprice', 'customerid', 'country', 'total_price']
    ****************************************************************************************************
    Columns after rename:['invoiceno', 'stockcode', 'description', 'quantity', 'invoicedate', 'unitprice', 'customerid', 'country', 'total_price']
    ****************************************************************************************************
    Descriptive Statistics 
            quantity  unitprice  customerid  total_price
    count 349203.000 349203.000  349203.000   349203.000
    mean      12.150      2.970   15548.380       20.860
    std      190.630     17.990    1594.380      328.420
    min        1.000      0.000   12346.000        0.000
    25%        2.000      1.250   14191.000        4.200
    50%        4.000      1.950   15518.000       10.200
    75%       12.000      3.750   16931.000       17.850
    max    80995.000   8142.750   18287.000   168469.600
    ****************************************************************************************************
    Descriptive Statistics (Categorical Columns) 
                  count unique                                 top    freq
    invoiceno    349203  16646                            576339.0     542
    stockcode    349203   3645                              85123A    1936
    description  349203   3844  WHITE HANGING HEART T-LIGHT HOLDER    1929
    invoicedate  349203  15612                 2011-11-14 15:27:00     542
    country      349203      1                      United Kingdom  349203
    ****************************************************************************************************
    Duplicate check...
    There are 0 duplicated observations in the dataset.

    Let’s prepare the date for the RFM analysis

    import datetime as dt
    from datetime import datetime
    import datetime as dt

    df_uk[“ref_date”] = pd.to_datetime(df_uk[‘invoicedate’]).dt.date.max()

    df_uk[‘ref_date’] = pd.to_datetime(df_uk[‘ref_date’])

    df_uk[“date”] = pd.to_datetime(df_uk[“invoicedate”]).dt.date

    df_uk[‘date’] = pd.to_datetime(df_uk[‘date’])

    df_uk[‘last_purchase_date’] = df_uk.groupby(‘customerid’)[‘invoicedate’].transform(max)
    df_uk[‘last_purchase_date’] = pd.to_datetime(df_uk[‘last_purchase_date’]).dt.date
    df_uk[‘last_purchase_date’] = pd.to_datetime(df_uk[‘last_purchase_date’])

    df_uk.groupby(‘customerid’)[[‘last_purchase_date’]].max().sample(5)

    Last purchase date

    df_uk[“customer_recency”] = df_uk[“ref_date”] – df_uk[“last_purchase_date”]

    df_uk[‘recency_value’] = pd.to_numeric(df_uk[‘customer_recency’].dt.days.astype(‘int64’))

    customer_recency = pd.DataFrame(df_uk.groupby(‘customerid’)[‘recency_value’].min())
    customer_recency.rename(columns={‘recency_value’:’recency’}, inplace=True)
    customer_recency.reset_index(inplace=True)

    df_uk.drop(‘last_purchase_date’, axis=1, inplace=True)

    Let’s plot Customer Regency Distribution

    fig = px.histogram(df_uk, x = ‘recency_value’, title = ‘Customer Regency Distribution’)
    fig.show()

    Customer Regency Distribution

    Let’s compute Frequency – Number of purchases

    df_uk1 = df_uk.copy()
    customer_frequency = pd.DataFrame(df_uk.groupby(‘customerid’)[‘invoiceno’].nunique())
    customer_frequency.rename(columns={‘invoiceno’:’frequency’}, inplace=True)
    customer_frequency.reset_index(inplace=True)

    df_uk[‘customer_frequency’] = df_uk.groupby(‘customerid’)[‘invoiceno’].transform(‘count’)

    Let’s plot Customer Frequency Distribution

    fig = px.histogram(df_uk, x = ‘customer_frequency’, title = ‘Customer Frequency Distribution’)
    fig.show()

    Customer Frequency Distribution

    Let’s calculate Monetary

    customer_monetary = pd.DataFrame(df_uk.groupby(‘customerid’)[‘total_price’].sum())
    customer_monetary.rename(columns={‘total_price’:’monetary’}, inplace=True)
    customer_monetary.reset_index(inplace=True)

    df_uk[‘customer_monetary’] = df_uk.groupby(‘customerid’)[‘total_price’].transform(‘sum’)

    Let’s plot Customer Monetary Distribution

    fig = px.histogram(df_uk, x = ‘customer_monetary’, title = ‘Customer Monetary Distribution’)
    fig.show()

    Customer Monetary Distribution

    Let’s create the RFM Table
    customer_rfm = pd.merge(pd.merge(customer_recency, customer_frequency, on=’customerid’), customer_monetary, on=’customerid’)
    customer_rfm.head()

    Recency, frequency and monetary columns

    customer_rfm.set_index(‘customerid’, inplace = True)

    quantiles = customer_rfm.quantile(q = [0.25,0.50,0.75])
    quantiles

    3 quantiles of recency, frequency and monetary columns

    Let’s introduce several functions creating the RFM Segmentation Table

    def rec_score(x):
    if x < 17.000:
    return 4
    elif 17.000 <= x < 50.000:
    return 3
    elif 50.000 <= x < 142.000:
    return 2
    else:
    return 1

    def freq_score(x):
    if x > 5.000:
    return 4
    elif 5.000 >= x > 2.000:
    return 3
    elif 2.000 >= x > 1:
    return 2
    else:
    return 1

    def mon_score(x):
    if x > 1571.285:
    return 4
    elif 1571.285 >= x > 644.975:
    return 3
    elif 644.975 >= x > 298.185:
    return 2
    else:
    return 1

    Let’s calculate the RFM score

    customer_rfm[‘recency_score’] = customer_rfm[‘recency’].apply(rec_score)
    customer_rfm[‘frequency_score’] = customer_rfm[‘frequency’].apply(freq_score)
    customer_rfm[‘monetary_score’] = customer_rfm[‘monetary’].apply(mon_score)

    customer_rfm.info()

    <class 'pandas.core.frame.DataFrame'>
    Index: 3920 entries, 12346.0 to 18287.0
    Data columns (total 6 columns):
     #   Column           Non-Null Count  Dtype  
    ---  ------           --------------  -----  
     0   recency          3920 non-null   int64  
     1   frequency        3920 non-null   int64  
     2   monetary         3920 non-null   float64
     3   recency_score    3920 non-null   int64  
     4   frequency_score  3920 non-null   int64  
     5   monetary_score   3920 non-null   int64  
    dtypes: float64(1), int64(5)
    memory usage: 214.4 KB

    Let’s calculate the total RFM score

    customer_rfm[‘RFM_score’] = (customer_rfm[‘recency_score’].astype(str) + customer_rfm[‘frequency_score’].astype(str) +
    customer_rfm[‘monetary_score’].astype(str))
    customer_rfm.sample(5)

    RFM score table

    Let’s plot Customer_RFM Distribution

    fig = px.histogram(customer_rfm, x = customer_rfm[‘RFM_score’].value_counts().index,
    y = customer_rfm[‘RFM_score’].value_counts().values,
    title = ‘Customer_RFM Distribution’,
    labels = dict(x = “RFM_score”, y =”counts”))
    fig.show()

    Customer_RFM Distribution

    Let’s calculate the RFM_level

    customer_rfm[‘RFM_level’] = customer_rfm[‘recency_score’] + customer_rfm[‘frequency_score’] + customer_rfm[‘monetary_score’]
    customer_rfm.sample(5)

    RFM_level

    print(‘Min value for RFM_level : ‘, customer_rfm[‘RFM_level’].min())
    print(‘Max value for RFM_level : ‘, customer_rfm[‘RFM_level’].max())

    Min value for RFM_level :  3
    Max value for RFM_level :  12

    Let’s introduce the RFM segment function

    def segments(df_rfm):
    if df_rfm[‘RFM_level’] == 12 :
    return ‘champion’
    elif (df_rfm[‘RFM_level’] == 11) or (df_rfm[‘RFM_level’] == 10 ):
    return ‘loyal_customer’
    elif (df_rfm[‘RFM_level’] == 9) or (df_rfm[‘RFM_level’] == 8 ):
    return ‘promising’
    elif (df_rfm[‘RFM_level’] == 7) or (df_rfm[‘RFM_level’] == 6 ):
    return ‘need_attention’
    elif (df_rfm[‘RFM_level’] == 5) or (df_rfm[‘RFM_level’] == 4 ):
    return ‘hibernating’
    else:
    return ‘almost_lost’

    Let’s call this function to compute customer_segment

    customer_rfm[‘customer_segment’] = customer_rfm.apply(segments,axis=1)
    customer_rfm.sample(5)

    customer_segment

    Let’s plot the customer_segment distribution

    fig = px.histogram(customer_rfm, x = customer_rfm[‘customer_segment’].value_counts().index,
    y = customer_rfm[‘customer_segment’].value_counts().values,
    title = ‘Customer_Segment Distribution’,
    labels = dict(x = “customer_segment”, y =”counts”))
    fig.show()

    customer_segment distribution

    Let’s calculate the RFM_level Mean Values

    avg_rfm_segment = customer_rfm.groupby(‘customer_segment’).RFM_level.mean().sort_values(ascending=False)
    avg_rfm_segment

    customer_segment
    champion         12.000
    loyal_customer   10.482
    promising         8.530
    need_attention    6.479
    hibernating       4.499
    almost_lost       3.000
    Name: RFM_level, dtype: float64

    Let’s plot the Customer_Segment RFM_level Mean Values

    fig = px.histogram(customer_rfm, x = customer_rfm.groupby(‘customer_segment’).RFM_level.mean().sort_values(ascending=False).index,
    y = customer_rfm.groupby(‘customer_segment’).RFM_level.mean().sort_values(ascending=False).values,
    title = ‘Customer_Segment RFM_level Mean Values’,
    labels = dict(x = “customer_segment”, y =”RFM_level Mean Values”))
    fig.show()

    Customer_Segment RFM_level Mean Values

    Let’s calculate the size of RFM_Segment

    size_rfm_segment = customer_rfm[‘customer_segment’].value_counts()
    size_rfm_segment

    hibernating       918
    promising         783
    need_attention    771
    loyal_customer    650
    champion          402
    almost_lost       396
    Name: count, dtype: int64

    Let’s plot Size RFM_Segment

    fig = px.histogram(customer_rfm, x = customer_rfm[‘customer_segment’].value_counts().index,
    y = customer_rfm[‘customer_segment’].value_counts().values,
    title = ‘Size RFM_Segment’,
    labels = dict(x = “customer_segment”, y =”Size RFM_Segment”))
    fig.show()

    Size RFM_Segment

    Final output table:

    customer_segment = pd.DataFrame(pd.concat([avg_rfm_segment, size_rfm_segment], axis=1))
    customer_segment.rename(columns={‘RFM_level’: ‘avg_RFM_level’, ‘customer_segment’: ‘segment_size’}, inplace=True)
    customer_segment

    avg_RFM_level count

    Let’s plot the wordcloud of RFM Segments

    from wordcloud import WordCloud

    segment_text = customer_rfm[“customer_segment”].str.split(” “).str.join(“_”)
    all_segments = ” “.join(segment_text)

    wc = WordCloud(background_color=”orange”,
    max_words=250,
    max_font_size=256,
    random_state=42,
    width=800, height=400)
    wc.generate(all_segments)
    plt.figure(figsize = (16, 15))
    plt.imshow(wc)
    plt.title(“RFM Segments”,fontsize=18,fontweight=”bold”)
    plt.axis(‘off’)
    plt.show()

    Wordcloud of RFM Segments

    Let’s check the structure and content of customer_rfm

    first_looking(customer_rfm)
    duplicate_values(customer_rfm)

    Shape:(3920, 9)
    ****************************************************************************************************
    Info:
    
    <class 'pandas.core.frame.DataFrame'>
    Index: 3920 entries, 12346.0 to 18287.0
    Data columns (total 9 columns):
     #   Column            Non-Null Count  Dtype  
    ---  ------            --------------  -----  
     0   recency           3920 non-null   int64  
     1   frequency         3920 non-null   int64  
     2   monetary          3920 non-null   float64
     3   recency_score     3920 non-null   int64  
     4   frequency_score   3920 non-null   int64  
     5   monetary_score    3920 non-null   int64  
     6   RFM_score         3920 non-null   object 
     7   RFM_level         3920 non-null   int64  
     8   customer_segment  3920 non-null   object 
    dtypes: float64(1), int64(6), object(2)
    memory usage: 306.2+ KB
    None
    ****************************************************************************************************
    Number of Uniques:
    recency              302
    frequency             57
    monetary            3849
    recency_score          4
    frequency_score        4
    monetary_score         4
    RFM_score             61
    RFM_level             10
    customer_segment       6
    dtype: int64
    ****************************************************************************************************
    Missing Values:
    Empty DataFrame
    Columns: [Missing_Number, Missing_Percent]
    Index: []
    ****************************************************************************************************
    All Columns:['recency', 'frequency', 'monetary', 'recency_score', 'frequency_score', 'monetary_score', 'RFM_score', 'RFM_level', 'customer_segment']
    ****************************************************************************************************
    Columns after rename:['recency', 'frequency', 'monetary', 'recency_score', 'frequency_score', 'monetary_score', 'rfm_score', 'rfm_level', 'customer_segment']
    ****************************************************************************************************
    Columns after rename:['recency', 'frequency', 'monetary', 'recency_score', 'frequency_score', 'monetary_score', 'rfm_score', 'rfm_level', 'customer_segment']
    ****************************************************************************************************
    Descriptive Statistics 
           recency  frequency   monetary  recency_score  frequency_score   
    count 3920.000   3920.000   3920.000       3920.000         3920.000  \
    mean    91.740      4.250   1858.420          2.480            2.320   
    std     99.530      7.200   7478.630          1.110            1.150   
    min      0.000      1.000      3.750          1.000            1.000   
    25%     17.000      1.000    298.180          1.000            1.000   
    50%     50.000      2.000    644.970          2.000            2.000   
    75%    142.000      5.000   1571.280          3.000            3.000   
    max    373.000    209.000 259657.300          4.000            4.000   
    
           monetary_score  rfm_level  
    count        3920.000   3920.000  
    mean            2.500      7.300  
    std             1.120      2.880  
    min             1.000      3.000  
    25%             1.750      5.000  
    50%             2.500      7.000  
    75%             3.250     10.000  
    max             4.000     12.000  
    ****************************************************************************************************
    Descriptive Statistics (Categorical Columns) 
                     count unique          top freq
    rfm_score         3920     61          444  402
    customer_segment  3920      6  hibernating  918
    ****************************************************************************************************
    Duplicate check...
    There are 0 duplicated observations in the dataset.

    customer_rfm.groupby(‘rfm_level’).agg({‘recency’: [‘mean’,’min’,’max’,’count’],
    ‘frequency’: [‘mean’,’min’,’max’,’count’],
    ‘monetary’: [‘mean’,’min’,’max’,’count’] }).round(1)

    RFM summary table

    Let’s check RFM correlations

    plt.figure(figsize = (20,20))
    sns.pairplot(customer_rfm[[‘recency’, ‘frequency’, ‘monetary’,’customer_segment’]], hue = ‘customer_segment’);

    RFM pairplots

    Let’s compute the RFM correlation matrix

    matrix = np.triu(customer_rfm[[‘recency’,’frequency’,’monetary’]].corr())
    fig, ax = plt.subplots(figsize=(14,10))
    sns.heatmap (customer_rfm[[‘recency’,’frequency’,’monetary’]].corr(), annot=True, fmt= ‘.2f’, vmin=-1, vmax=1, center=0, cmap=’coolwarm’,mask=matrix, ax=ax);

    RFM correlation matrix

    Read more here.

    RFM TreeMap: Online Retail Dataset II

    Let’s import libraries and read the Online Retail II dataset Year 2010-2011

    import datetime as dt
    import pandas as pd
    pd.set_option(‘display.max_columns’, None)
    pd.set_option(‘display.max_rows’, None)
    pd.set_option(‘display.float_format’, lambda x: ‘%.2f’ % x)

    df_ = pd.read_excel(“online_retail_II.xlsx”,sheet_name=”Year 2010-2011″ )
    df = df_.copy()

    Let’s check the overall data structure

    def check_df(dataframe):
    print(“################ Shape ####################”)
    print(dataframe.shape)
    print(“############### Columns ###################”)
    print(dataframe.columns)
    print(“############### Types #####################”)
    print(dataframe.dtypes)
    print(“############### Head ######################”)
    print(dataframe.head())
    print(“############### Tail ######################”)
    print(dataframe.tail())
    print(“############### Describe ###################”)
    print(dataframe.describe().T)

    check_df(df)

    ################ Shape ####################
    (541910, 8)
    ############### Columns ###################
    Index(['Invoice', 'StockCode', 'Description', 'Quantity', 'InvoiceDate',
           'Price', 'Customer ID', 'Country'],
          dtype='object')
    ############### Types #####################
    Invoice                object
    StockCode              object
    Description            object
    Quantity                int64
    InvoiceDate    datetime64[ns]
    Price                 float64
    Customer ID           float64
    Country                object
    dtype: object
    ############### Head ######################
      Invoice StockCode                          Description  Quantity   
    0  536365    85123A   WHITE HANGING HEART T-LIGHT HOLDER         6  \
    1  536365     71053                  WHITE METAL LANTERN         6   
    2  536365    84406B       CREAM CUPID HEARTS COAT HANGER         8   
    3  536365    84029G  KNITTED UNION FLAG HOT WATER BOTTLE         6   
    4  536365    84029E       RED WOOLLY HOTTIE WHITE HEART.         6   
    
              InvoiceDate  Price  Customer ID         Country  
    0 2010-12-01 08:26:00   2.55     17850.00  United Kingdom  
    1 2010-12-01 08:26:00   3.39     17850.00  United Kingdom  
    2 2010-12-01 08:26:00   2.75     17850.00  United Kingdom  
    3 2010-12-01 08:26:00   3.39     17850.00  United Kingdom  
    4 2010-12-01 08:26:00   3.39     17850.00  United Kingdom  
    ############### Tail ######################
           Invoice StockCode                      Description  Quantity   
    541905  581587     22899     CHILDREN'S APRON DOLLY GIRL          6  \
    541906  581587     23254    CHILDRENS CUTLERY DOLLY GIRL          4   
    541907  581587     23255  CHILDRENS CUTLERY CIRCUS PARADE         4   
    541908  581587     22138    BAKING SET 9 PIECE RETROSPOT          3   
    541909  581587      POST                          POSTAGE         1   
    
                   InvoiceDate  Price  Customer ID Country  
    541905 2011-12-09 12:50:00   2.10     12680.00  France  
    541906 2011-12-09 12:50:00   4.15     12680.00  France  
    541907 2011-12-09 12:50:00   4.15     12680.00  France  
    541908 2011-12-09 12:50:00   4.95     12680.00  France  
    541909 2011-12-09 12:50:00  18.00     12680.00  France  
    ############### Describe ###################
                    count                           mean                  min   
    Quantity    541910.00                           9.55            -80995.00  \
    InvoiceDate    541910  2011-07-04 13:35:22.342307584  2010-12-01 08:26:00   
    Price       541910.00                           4.61            -11062.06   
    Customer ID 406830.00                       15287.68             12346.00   
    
                                 25%                  50%                  75%   
    Quantity                    1.00                 3.00                10.00  \
    InvoiceDate  2011-03-28 11:34:00  2011-07-19 17:17:00  2011-10-19 11:27:00   
    Price                       1.25                 2.08                 4.13   
    Customer ID             13953.00             15152.00             16791.00   
    
                                 max     std  
    Quantity                80995.00  218.08  
    InvoiceDate  2011-12-09 12:50:00     NaN  
    Price                   38970.00   96.76  
    Customer ID             18287.00 1713.60  
    ####### DATA PREPARATION

    df.isnull().any()
    df.isnull().sum()
    df.dropna(inplace=True)

    df[“StockCode”].nunique()

    df.groupby(“StockCode”)[“Quantity”].sum()

    df.groupby(“StockCode”).agg({“Quantity”:”sum”}).sort_values(by=”Quantity”, ascending=False).head()

    df = df[~df[“Invoice”].str.contains(“C”, na=False)]

    df[“TotalPrice”] = df[“Price”] * df[“Quantity”]

    df[“InvoiceDate”].max()
    today_date = dt.datetime(2011, 12, 11)

    rfm = df.groupby(“Customer ID”).agg({“InvoiceDate”: lambda InvıiceDate: (today_date- InvıiceDate.max()).days,
    “Invoice”: lambda Invoice: Invoice.nunique(),
    “TotalPrice”: lambda TotalPrice: TotalPrice.sum()})

    rfm.columns = [“recency”,”frequency”,”monetary”]
    rfm = rfm[rfm[“monetary”] > 0]

    ####### RFM CALCULATION

    rfm[“recency_score”] = pd.qcut(rfm[‘recency’], 5, labels=[5, 4, 3, 2, 1])

    rfm[“frequency_score”] = pd.qcut(rfm[‘frequency’].rank(method=”first”), 5, labels=[1, 2, 3, 4, 5])

    rfm[“monetary_score”] = pd.qcut(rfm[‘monetary’], 5, labels=[1, 2, 3, 4, 5])

    rfm[“RFM_SCORE”] = (rfm[‘recency_score’].astype(str) + rfm[‘frequency_score’].astype(str))
    rfm.head()

    rfm.describe().T

    rfm[rfm[“RFM_SCORE”] == “55”].head() # champions

    rfm[rfm[“RFM_SCORE”] == “11”].head() # hibernating

    RFM score table

    #######CUSTOMER SEGMENTATION

    seg_map = {
    r'[1-2][1-2]’: ‘hibernating’,
    r'[1-2][3-4]’: ‘at_Risk’,
    r'[1-2]5′: ‘cant_loose’,
    r’3[1-2]’: ‘about_to_sleep’,
    r’33’: ‘need_attention’,
    r'[3-4][4-5]’: ‘loyal_customers’,
    r’41’: ‘promising’,
    r’51’: ‘new_customers’,
    r'[4-5][2-3]’: ‘potential_loyalists’,
    r’5[4-5]’: ‘champions’
    }

    rfm[‘segment’] = rfm[‘RFM_SCORE’].replace(seg_map, regex=True)

    rfm[[“segment”, “recency”, “frequency”, “monetary”]].groupby(“segment”).agg([“mean”, “count”])

    rfm[rfm[“segment”] == “need_attention”].head()

    rfm[rfm[“segment”] == “new_customers”].index

    seg_list = [“at_Risk”, “hibernating”, “cant_loose”, “loyal_customers”]
    for i in seg_list:
    print(F” {i.upper()} “.center(50, “*”))
    print(rfm[rfm[“segment”]==i].describe().T)

    rfm[rfm[“segment”] == “loyal_customers”].index
    loyalCus_df = pd.DataFrame()
    loyalCus_df[“loyal_customersId”] = rfm[rfm[“segment”] == “loyal_customers”].index
    loyalCus_df.head()

    loyalCus_df.to_csv(“loyal_customers.csv”)

    ******************** AT_RISK *********************
               count    mean     std   min    25%    50%     75%      max
    recency   593.00  153.79   68.62 73.00  96.00 139.00  195.00   374.00
    frequency 593.00    2.88    0.95  2.00   2.00   3.00    3.00     6.00
    monetary  593.00 1084.54 2562.07 52.00 412.78 678.25 1200.62 44534.30
    ****************** HIBERNATING *******************
                count   mean     std   min    25%    50%    75%      max
    recency   1071.00 217.61   92.01 73.00 135.00 219.00 289.50   374.00
    frequency 1071.00   1.10    0.30  1.00   1.00   1.00   1.00     2.00
    monetary  1071.00 488.64 2419.68  3.75 155.11 296.25 457.93 77183.60
    ******************* CANT_LOOSE *******************
               count    mean     std   min     25%     50%     75%      max
    recency    63.00  132.97   65.25 73.00   89.00  108.00  161.00   373.00
    frequency  63.00    8.38    4.29  6.00    6.00    7.00    9.00    34.00
    monetary   63.00 2796.16 2090.49 70.02 1137.51 2225.97 3532.24 10254.18
    **************** LOYAL_CUSTOMERS *****************
               count    mean     std   min    25%     50%     75%       max
    recency   819.00   33.61   15.58 15.00  20.00   30.00   44.00     72.00
    frequency 819.00    6.48    4.55  3.00   4.00    5.00    8.00     63.00
    monetary  819.00 2864.25 6007.06 36.56 991.80 1740.48 3052.91 124914.53

    #####CREATE RFM

    def create_rfm(dataframe):

    dataframe["TotalPrice"] = dataframe["Quantity"] * dataframe["Price"]
    dataframe.dropna(inplace=True)
    dataframe = dataframe[~dataframe["Invoice"].str.contains("C", na=False)]
    
    today_date = dt.datetime(2011, 12, 11)
    rfm = dataframe.groupby('Customer ID').agg({'InvoiceDate': lambda date: (today_date - date.max()).days,
                                                'Invoice': lambda num: num.nunique(),
                                                "TotalPrice": lambda price: price.sum()})
    rfm.columns = ['recency', 'frequency', "monetary"]
    rfm = rfm[(rfm['monetary'] > 0)]
    
    
    rfm["recency_score"] = pd.qcut(rfm['recency'], 5, labels=[5, 4, 3, 2, 1])
    rfm["frequency_score"] = pd.qcut(rfm["frequency"].rank(method="first"), 5, labels=[1, 2, 3, 4, 5])
    rfm["monetary_score"] = pd.qcut(rfm['monetary'], 5, labels=[1, 2, 3, 4, 5])
    
    
    rfm["RFM_SCORE"] = (rfm['recency_score'].astype(str) +
                        rfm['frequency_score'].astype(str))
    

    rfm.head()

    RFM segment

    rfm[‘score’]=rfm[‘recency’]rfm[‘frequency’]rfm[‘monetary’]

    Statistical informations of segments
    rfm[[“segment”, “recency”, “frequency”, “monetary”]].groupby(“segment”).agg([“mean”, “count”])

    How many segments are there
    segments_counts = rfm[‘segment’].value_counts().sort_values(ascending=True)

    Let’s plot our RFM segments

    fig, ax = plt.subplots()

    bars = ax.barh(range(len(segments_counts)),
    segments_counts,
    color=’lightcoral’)
    ax.set_frame_on(False)
    ax.tick_params(left=False,
    bottom=False,
    labelbottom=False)
    ax.set_yticks(range(len(segments_counts)))
    ax.set_yticklabels(segments_counts.index)

    for i, bar in enumerate(bars):
    value = bar.get_width()
    if segments_counts.index[i] in [‘Can\’t loose’]:
    bar.set_color(‘firebrick’)
    ax.text(value,
    bar.get_y() + bar.get_height()/2,
    ‘{:,} ({:}%)’.format(int(value),
    int(value*100/segments_counts.sum())),
    va=’center’,
    ha=’left’
    )

    plt.show(block=True)

    RFM segments bar plot

    Let’s plot the RFM segmentation treemap

    import matplotlib.pyplot as plt
    import squarify
    sizes=[24, 18, 14, 13,11,8,4,3,2,1]
    label=[“hibernating”, “loyal_customers”, ‘champions’, ‘at_risk’,’potential_loyalists’,’about_to_sleep’,’need_attention’,’promising’,’CL’,’NC’]

    colors = [‘#91DCEA’, ‘#64CDCC’, ‘#5FBB68’,
    ‘#F9D23C’, ‘#F9A729’, ‘#FD6F30′,’grey’,’red’,’blue’,’cyan’]
    squarify.plot(sizes=sizes, label=label, color=colors, alpha=0.6 )
    plt.show()
    plt.axis(“off”)

    RFM segmentation treemap

    where CL = cant_loose and NC = new_cutomers.

    Read more here.

    CRM Analytics: CLTV

    Let’s look at the CRM Analytics – CLTV (Customer Lifetime Value).

    Importing libraries:
    import datetime as dt
    import pandas as pd
    import warnings
    warnings.filterwarnings(‘ignore’)
    import matplotlib.pyplot as plt

    and set display setting
    pd.set_option(“display.max_columns”, None)
    pd.set_option(“display.float_format”, lambda x: “%.3f” % x)

    ####### DATA PREPARATION

    Reading the Online Retail II dataset and creating a copy of it
    df_ = pd.read_excel(“online_retail.xlsx”)
    df = df_.copy()
    df.head()

    Online Retail dataset II

    df.shape

    (525461, 8)

    Number of unique products
    df[“Description”].nunique()

    4681

    value_counts() of each product
    df[“Description”].value_counts().head()

    WHITE HANGING HEART T-LIGHT HOLDER    3549
    REGENCY CAKESTAND 3 TIER              2212
    STRAWBERRY CERAMIC TRINKET BOX        1843
    PACK OF 72 RETRO SPOT CAKE CASES      1466
    ASSORTED COLOUR BIRD ORNAMENT         1457
    Name: Description, dtype: int64

    What is the most ordered product
    df.groupby(“Description”).agg({“Quantity”: “sum”}).head()

    Missing value check
    df.isnull().sum

    Customers with missing Customer ID must be removed from the dataset
    df.dropna(inplace =True)

    Adding the total price on the basis of products to the data set as a variable
    df[“TotalPrice”] = df[“Quantity”] * df[“Price”]

    Let’s remove attribute values started with “C” (returned products)

    df = df[~df[“Invoice”].str.contains(“C”, na=False)]

    with the descriptive statistics of the dataset
    df.describe().T

    Descriptive statistics of the dataset

    ######### CALCULATION OF RFM METRICS

    Last date in the InvoiceDate
    df[“InvoiceDate”].max()

    Timestamp('2010-12-09 20:01:00')

    Let’s set 2 days after the last date
    analysis_date = dt.datetime(2010, 12, 11)

    Let’s introduce the following RFM columns

    rfm = df.groupby(“Customer ID”).agg({“InvoiceDate”: lambda invoice_date: (analysis_date – invoice_date.max()).days,
    “Invoice”: lambda invoice: invoice.nunique(),
    “TotalPrice”: lambda total_price: total_price.sum()})
    rfm.columns = [“recency”, “frequency”, “monetary”]
    rfm.head()

    while excluding monetary<=0

    rfm = rfm[rfm[“monetary”] > 0]

    and checking descriptive statistics

    rfm.describe().T

    RFM descriptive statistics

    RECENCY SCORE
    rfm[“recency_score”] = pd.qcut(rfm[“recency”], 5, labels=[5, 4, 3, 2, 1])

    FREQUENCY SCORE
    rfm[“frequency_score”] = pd.qcut(rfm[“frequency”].rank(method=”first”), 5, labels=[1, 2, 3, 4, 5])

    MONETARY SCORE
    rfm[“monetary_score”] = pd.qcut(rfm[“monetary”], 5, labels=[1, 2, 3, 4, 5])

    RF SCORE
    rfm[“RF_SCORE”] = (rfm[“recency_score”].astype(str) + rfm[“frequency_score”].astype(str))

    rfm.head()

    RFM and  RF_score columns

    Let’s convert RF_SCORE to segment with regex=True

    seg_map = {
    r'[1-2][1-2]’: ‘hibernating’,
    r'[1-2][3-4]’: ‘at_Risk’,
    r'[1-2]5′: ‘cant_loose’,
    r’3[1-2]’: ‘about_to_sleep’,
    r’33’: ‘need_attention’,
    r'[3-4][4-5]’: ‘loyal_customers’,
    r’41’: ‘promising’,
    r’51’: ‘new_customers’,
    r'[4-5][2-3]’: ‘potential_loyalists’,
    r’5[4-5]’: ‘champions’
    }

    rfm[‘segment’] = rfm[‘RF_SCORE’].replace(seg_map, regex=True)

    rfm.head()

    RFM, RF_scoe and segment columns

    Let’s compute mean and count of our RFM columns sorted by segment in the ascending order

    rfm[[“segment”, “recency”, “frequency”, “monetary”]].groupby(“segment”).agg([“mean”, “count”])

    segments_counts = rfm[‘segment’].value_counts().sort_values(ascending=True)

    Let’s plot our RFM segments

    fig, ax = plt.subplots()

    bars = ax.barh(range(len(segments_counts)),
    segments_counts,
    color=’lightcoral’)
    ax.set_frame_on(False)
    ax.tick_params(left=False,
    bottom=False,
    labelbottom=False)
    ax.set_yticks(range(len(segments_counts)))
    ax.set_yticklabels(segments_counts.index)

    for i, bar in enumerate(bars):
    value = bar.get_width()
    if segments_counts.index[i] in [‘Can\’t loose’]:
    bar.set_color(‘firebrick’)
    ax.text(value,
    bar.get_y() + bar.get_height()/2,
    ‘{:,} ({:}%)’.format(int(value),
    int(value*100/segments_counts.sum())),
    va=’center’,
    ha=’left’
    )

    plt.show(block=True)

    Barplot of RFM segments

    ########## CLTV CALCULATION

    Let’s prepare our input data

    cltv_c = df.groupby(‘Customer ID’).agg({“Invoice”: lambda x: x.nunique(),
    “Quantity”: lambda x: x.sum(),
    “TotalPrice”: lambda x: x.sum()})

    cltv_c.columns = [“total_transaction”, “total_unit”, “total_price”]

    Let’s calculate the following metrics:

    profit margin:
    cltv_c[“profit_margin”] = cltv_c[“total_price”] * 0.10

    Churn Rate:

    repeat_rate = cltv_c[cltv_c[“total_transaction”] > 1].shape[0] / cltv_c.shape[0]

    churn_rate = 1 – repeat_rate

    Purchase Frequency:
    cltv_c[“purchase_frequency”] = cltv_c[“total_transaction”] / cltv_c.shape[0]

    Average Order Value:
    cltv_c[“average_order_value”] = cltv_c[“total_price”] / cltv_c[“total_transaction”]

    Customer Value:
    cltv_c[“customer_value”] = cltv_c[“average_order_value”] * cltv_c[“purchase_frequency”]

    CLTV (Customer Lifetime Value):
    cltv_c[“cltv”] = (cltv_c[“customer_value”] / churn_rate) * cltv_c[“profit_margin”]

    Sorting our dataset by cltv in the descending order
    cltv_c.sort_values(by = “cltv”, ascending=False).head()

    Input data with cltv

    Eliminating outliers with interquantile_range
    def outlier_threshold(dataframe, variable):
    quartile1 = dataframe[variable].quantile(0.01)
    quartile3 = dataframe[variable].quantile(0.99)
    interquantile_range = quartile3 – quartile1
    up_limit = quartile3 + 1.5 * interquantile_range
    low_limit = quartile1 – 1.5 * interquantile_range
    return low_limit, up_limit

    def replace_with_thresholds(dataframe, variable):
    low_limit, up_limit = outlier_threshold(dataframe, variable)
    # dataframe.loc[(dataframe[variable] < low_limit), variable] = low_limit dataframe.loc[(dataframe[variable] > up_limit), variable] = up_limit

    replace_with_thresholds(df, “Quantity”)
    replace_with_thresholds(df, “Price”)

    Define the analysis date

    df[“InvoiceDate”].max()
    analysis_date = dt.datetime(2011, 12, 11)

    ############ Preparation of Lifetime Data Structure

    cltv_df = df.groupby(“Customer ID”).agg({“InvoiceDate”: [lambda InvoiceDate: (InvoiceDate.max() – InvoiceDate.min()).days / 7,
    lambda InvoiceDate: (analysis_date – InvoiceDate.min()).days / 7],
    “Invoice”: lambda Invoice: Invoice.nunique(),
    “TotalPrice”: lambda TotalPrice: TotalPrice.sum()})

    cltv_df.columns = cltv_df.columns.droplevel(0)

    cltv_df.columns = [“recency”, “T”, “frequency”, “monetary”]

    cltv_df[“monetary”] = cltv_df[“monetary”] / cltv_df[“frequency”]

    cltv_df = cltv_df[cltv_df[“frequency”] > 1]

    cltv_df.head()

    Data columns recency, T, frequency, and monetary.

    cltv_df.describe().T

    Descriptive statistics of data columns recency, T, frequency, and monetary.

    #######CLTV MODELLING

    Let’s install and import the following libraries

    !pip install Lifetimes

    from lifetimes import BetaGeoFitter
    from lifetimes import BetaGeoFitter
    from lifetimes import GammaGammaFitter
    from lifetimes.plotting import plot_period_transactions

    import datetime as dt
    import pandas as pd
    import warnings
    warnings.filterwarnings(‘ignore’)
    import matplotlib.pyplot as plt
    pd.set_option(‘display.max_columns’, None)
    pd.set_option(‘display.float_format’, lambda x: ‘%.5f’ % x)
    pd.set_option(‘display.width’, 500)

    ######BG/NBD Model

    bgf = BetaGeoFitter(penalizer_coef=0.001)

    bgf.fit(cltv_df[“frequency”],
    cltv_df[“recency”],
    cltv_df[“T”])

    <lifetimes.BetaGeoFitter: fitted with 2893 subjects, a: 1.93, alpha: 9.47, b: 6.27, r: 2.22>

    Let’s look at the first 10 customers with the highest purchase:
    bgf.conditional_expected_number_of_purchases_up_to_time(1, # number of weeks
    cltv_df[“frequency”],
    cltv_df[“recency”],
    cltv_df[“T”]).sort_values(ascending=False).head(10)

    Customer ID
    15989.00000   0.00756
    16720.00000   0.00722
    14119.00000   0.00720
    16204.00000   0.00715
    17591.00000   0.00714
    15169.00000   0.00700
    17193.00000   0.00698
    17251.00000   0.00696
    17411.00000   0.00690
    17530.00000   0.00680
    dtype: float64

    We can do same thing with predict:
    bgf.predict(1, #number of weeks
    cltv_df[“frequency”],
    cltv_df[“recency”],
    cltv_df[“T”]).sort_values(ascending=False).head(10)

    Let’s apply predict for 1 month:
    bgf.predict(4, # 4 weeks = 1 month
    cltv_df[“frequency”],
    cltv_df[“recency”],
    cltv_df[“T”]).sort_values(ascending=False).head(10)

    Customer ID
    15989.00000   0.02986
    16720.00000   0.02849
    14119.00000   0.02841
    16204.00000   0.02825
    17591.00000   0.02816
    15169.00000   0.02763
    17193.00000   0.02756
    17251.00000   0.02748
    17411.00000   0.02726
    17530.00000   0.02682
    dtype: float64

    Let’s compare expected purchases for several time intervals:

    Expected purchases for 1 months
    cltv_df[“expected_purchase_1_month”] = bgf.predict(4,
    cltv_df[“frequency”],
    cltv_df[“recency”],
    cltv_df[“T”])

    Expected purchases for 3 months
    cltv_df[“expected_purchase_3_month”] = bgf.predict(4 * 3,
    cltv_df[“frequency”],
    cltv_df[“recency”],
    cltv_df[“T”])

    Expected purchases for 1 year
    cltv_df[“expected_purchase_1_year”] = bgf.predict(4 * 12,
    cltv_df[“frequency”],
    cltv_df[“recency”],
    cltv_df[“T”])
    cltv_df.head()

    Expected purchases for 1, 3, and 12 months

    Evaluation of predicted results:
    plot_period_transactions(bgf)
    plt.show(block=True)

    Frequency of repeat transactions: actual vs model.

    Let’s plot the FR matrix

    from lifetimes.plotting import plot_frequency_recency_matrix
    import matplotlib.pyplot as plt
    fig = plt.figure(figsize=(12,8))
    plot_frequency_recency_matrix(bgf)
    plt.show(block=True)

    Expected number of future purchases for 1 unit of time, by frequency and recency of a customer

    ############ Gamma-Gamma Submodel

    ggf = GammaGammaFitter(penalizer_coef=0.01)

    ggf.fit(cltv_df[“frequency”], cltv_df[“monetary”])

    cltv_df[“expected_average_profit”] = ggf.conditional_expected_average_profit(cltv_df[“frequency”],
    cltv_df[“monetary”])

    cltv_df.sort_values(“expected_average_profit”, ascending=False).head(10)

    ######### Calculation of CLTV with BG/NBD and GG Model

    cltv = ggf.customer_lifetime_value(bgf,
    cltv_df[“frequency”],
    cltv_df[“recency”],
    cltv_df[“T”],
    cltv_df[“monetary”],
    time=3, # 3 month
    freq=”W”, # frequency of T
    discount_rate=0.01)

    cltv = cltv.reset_index()

    cltv_final = cltv_df.merge(cltv, on=”Customer ID”, how=”left”)

    cltv_final.sort_values(by=”clv”, ascending=False).head(10)

    ########## Segmentation according to CLTV

    cltv_final[“segment”] = pd.qcut(cltv_final[“clv”], 4, labels=[“D”, “C”, “B”, “A”])

    cltv_final.groupby(“segment”).agg({“count”, “mean”, “sum”})

    _df_plt = cltv_final[[“expected_average_profit”, “segment”]]
    plot_df = _df_plt.groupby(“segment”).agg(“mean”)
    plot_df.plot(kind=”bar”, color=”tomato”,figsize=(15, 8))
    plt.ylabel(“mean”)

    plt.show(block=True)

    Expected average profit for CLTV segments A-D

    Read more here.

    Cohort Analysis in Online Retail

    Let’s dive into Cohort Analysis using the Online Retail Dataset.

    Let’s install and import standard libraries

    !pip install sklearn-pandas

    import pandas as pd
    import numpy as np
    import datetime as dt
    import seaborn as sns

    from sklearn import preprocessing
    import sklearn
    from sklearn.svm import SVC
    from sklearn.preprocessing import StandardScaler
    from sklearn.datasets import make_classification
    from sklearn.preprocessing import LabelEncoder
    from sklearn.preprocessing import OneHotEncoder
    from sklearn_pandas import DataFrameMapper
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import Pipeline
    from sklearn.compose import ColumnTransformer
    from sklearn.preprocessing import Normalizer
    from matplotlib import pyplot as plt

    Reading the dataset
    df = pd.read_excel(“online_retail.xlsx”)

    #############DATA PREPARATION

    df.shape

    (525461, 8)
    

    print(‘Duplicate entries: {}’.format(df.duplicated().sum()))
    print(‘{}% rows are duplicate.’.format(round((df.duplicated().sum()/df.shape[0])*100),2))
    df.drop_duplicates(inplace = True)

    Duplicate entries: 6865
    1% rows are duplicate.
    

    pd.DataFrame([{‘products’: len(df[‘StockCode’].value_counts()),
    ‘transactions’: len(df[‘Invoice’].value_counts()),
    ‘customers’: len(df[‘Customer ID’].value_counts()),
    }], columns = [‘products’, ‘transactions’, ‘customers’], index = [‘quantity’])

    products transactions customers

    quantity 4632 28816 4383

    temp = df.groupby([‘Country’],as_index=False).agg({‘Invoice’:’nunique’}).rename(columns = {‘Invoice’:’Orders’})
    total = temp[‘Orders’].sum(axis=0)
    temp[‘%Orders’] = round((temp[‘Orders’]/total)*100,4)

    temp.sort_values(by=[‘%Orders’],ascending=False,inplace=True)
    temp.reset_index(drop=True,inplace=True)

    Let’s plot %Orders vs Country

    plt.figure(figsize=(13,6))
    splot=sns.barplot(x=”Country”,y=”%Orders”,data=temp[:10])
    for p in splot.patches:
    splot.annotate(format(p.get_height(), ‘.1f’),
    (p.get_x() + p.get_width() / 2., p.get_height()),
    ha = ‘center’, va = ‘center’,
    xytext = (0, 9),
    textcoords = ‘offset points’)
    plt.xlabel(“Country”, size=14)
    plt.ylabel(“%Orders”, size=14)

    %Orders vs Country

    invoices = df[‘Invoice’]

    x = invoices.str.contains(‘C’, regex=True)
    x.fillna(0, inplace=True)

    x = x.astype(int)

    x.value_counts()

    0    508414
    1     10182
    Name: Invoice, dtype: int64

    df[‘order_canceled’] = x
    #df.head()

    n1 = df[‘order_canceled’].value_counts()[1]
    n2 = df.shape[0]
    print(‘Number of orders canceled: {}/{} ({:.2f}%) ‘.format(n1, n2, n1/n2*100))

    Number of orders canceled: 10182/518596 (1.96%) 

    df = df.loc[df[‘order_canceled’] == 0,:]

    df.reset_index(drop=True,inplace=True)

    df.reset_index(drop=True,inplace=True)

    df.loc[df[‘Quantity’] < 0,:]

    df = df[df[‘Customer ID’].notna()]

    df.reset_index(drop=True,inplace=True)

    #############NL COHORT INDEX ANALYSIS

    df_uk = df[df.Country == ‘Netherlands’]

    df_uk.info()

    <class 'pandas.core.frame.DataFrame'>
    Int64Index: 2729 entries, 4346 to 387139
    Data columns (total 9 columns):
     #   Column          Non-Null Count  Dtype         
    ---  ------          --------------  -----         
     0   Invoice         2729 non-null   object        
     1   StockCode       2729 non-null   object        
     2   Description     2729 non-null   object        
     3   Quantity        2729 non-null   int64         
     4   InvoiceDate     2729 non-null   datetime64[ns]
     5   Price           2729 non-null   float64       
     6   Customer ID     2729 non-null   float64       
     7   Country         2729 non-null   object        
     8   order_canceled  2729 non-null   int32         
    dtypes: datetime64[ns](1), float64(2), int32(1), int64(1), object(4)
    memory usage: 202.5+ KB

    pd.DataFrame([{‘products’: len(df[‘StockCode’].value_counts()),
    ‘transactions’: len(df[‘Invoice’].value_counts()),
    ‘customers’: len(df[‘Customer ID’].value_counts()),
    }], columns = [‘products’, ‘transactions’, ‘customers’], index = [‘Quantity’])

    products transactions customers

    Quantity 4017 19215 4314

    cohort_data = df_uk[[‘Invoice’,’StockCode’,’Description’,’Quantity’,’InvoiceDate’,’Price’,’Customer ID’,’Country’]]

    Checking for nulls in the data
    cohort_data.isnull().sum()

    Invoice        0
    StockCode      0
    Description    0
    Quantity       0
    InvoiceDate    0
    Price          0
    Customer ID    0
    Country        0
    dtype: int64

    all_dates = (pd.to_datetime(cohort_data[‘InvoiceDate’])).apply(lambda x:x.date())

    (all_dates.max() – all_dates.min()).days

    364

    Start and end dates:
    print(‘Start date: {}’.format(all_dates.min()))
    print(‘End date: {}’.format(all_dates.max()))

    Start date: 2009-12-02
    End date: 2010-12-01

    cohort_data.head()

    InvoiceStockCodeDescriptionQuantityInvoiceDatePriceCustomer IDCountry

    def get_month(x):
    return dt.datetime(x.year, x.month, 1)

    cohort_data[‘InvoiceMonth’] = cohort_data[‘InvoiceDate’].apply(get_month)
    grouping = cohort_data.groupby(‘Customer ID’)[‘InvoiceMonth’]
    cohort_data[‘CohortMonth’] = grouping.transform(‘min’)
    cohort_data.head()

    InvoiceStockCodeDescriptionQuantityInvoiceDatePriceCustomer IDCountryInvoiceMonthCohortMonth

    def get_date_int(df, column):
    year = df[column].dt.year
    month = df[column].dt.month
    day = df[column].dt.day
    return year, month, day

    invoice_year, invoice_month, _ = get_date_int(cohort_data, ‘InvoiceMonth’)
    cohort_year, cohort_month, _ = get_date_int(cohort_data, ‘CohortMonth’)
    years_diff = invoice_year – cohort_year
    months_diff = invoice_month – cohort_month
    cohort_data[‘CohortIndex’] = years_diff * 12 + months_diff
    #cohort_data.head()

    grouping = cohort_data.groupby([‘CohortMonth’, ‘CohortIndex’])

    cohort_data = grouping[‘Quantity’].mean()
    cohort_data = cohort_data.reset_index()
    average_quantity = cohort_data.pivot(index=’CohortMonth’,columns=’CohortIndex’,values=’Quantity’)
    average_quantity.round(1)
    #average_quantity

    Let’s plot retention rates heatmap

    plt.figure(figsize=(10, 8))
    plt.title(‘Retention rates’)
    sns.heatmap(data = retention,annot = True,fmt = ‘.0%’,vmin = 0.0,vmax = 0.5,cmap = ‘BuGn’)
    plt.show()

    Retention rates vs Cohort Month/Index heatmap

    Let’s compute TotalPrice

    grdf=grouping.apply(pd.DataFrame)

    grdf[‘TotalPrice’]=grdf[‘Quantity’]*grdf[‘Price’]

    grdf[grdf[‘TotalPrice’]>139.2].sort_values(‘TotalPrice’,ascending=False)

    #############RFM SCORE ANALYSIS

    from dateutil.relativedelta import relativedelta
    start_date = all_dates.max()-relativedelta(months=12,days=-1)
    print(‘Start date: {}’.format(start_date))
    print(‘End date: {}’.format(all_dates.max()))

    Start date: 2009-12-02
    End date: 2010-12-01

    data_rfm = grdf[grdf[‘InvoiceDate’] >= pd.to_datetime(start_date)]
    data_rfm.reset_index(drop=True,inplace=True)
    data_rfm.head()

    InvoiceStockCodeDescriptionQuantityInvoiceDatePriceCustomer IDCountryInvoiceMonthCohortMonthCohortIndexTotalPrice

    snapshot_date = max(data_rfm.InvoiceDate) + dt.timedelta(days=1)
    print(‘Snapshot date: {}’.format(snapshot_date.date()))

    Snapshot date: 2010-12-02

    Aggregate data on a customer level
    data = data_rfm.groupby([‘Customer ID’],as_index=False).agg({‘InvoiceDate’: lambda x: (snapshot_date – x.max()).days,
    ‘Invoice’: ‘count’,
    ‘TotalPrice’: ‘sum’}).rename(columns = {‘InvoiceDate’: ‘Recency’,
    ‘Invoice’: ‘Frequency’,
    ‘TotalPrice’: ‘MonetaryValue’})

    r_labels = range(4, 0, -1)
    r_quartiles = pd.qcut(data[‘Recency’], 4, labels = r_labels)
    data = data.assign(R = r_quartiles.values)

    f_labels = range(1,5)
    m_labels = range(1,5)
    f_quartiles = pd.qcut(data[‘Frequency’], 4, labels = f_labels)
    m_quartiles = pd.qcut(data[‘MonetaryValue’], 4, labels = m_labels)
    data = data.assign(F = f_quartiles.values)
    data = data.assign(M = m_quartiles.values)
    #data.head()

    def join_rfm(x):
    return str(x[‘R’]) + str(x[‘F’]) + str(x[‘M’])

    data[‘RFM_Segment’] = data.apply(join_rfm, axis=1)
    data[‘RFM_Score’] = data[[‘R’,’F’,’M’]].sum(axis=1)
    data.head()

    RFM score vs segment

    data.groupby(‘RFM_Score’).agg({‘Recency’: ‘mean’,
    ‘Frequency’: ‘mean’,
    ‘MonetaryValue’: [‘mean’, ‘count’] }).round(1)

    Let’s create segment

    def create_segment(df):
    if df[‘RFM_Score’] >= 9:
    return ‘Top’
    elif (df[‘RFM_Score’] >= 5) and (df[‘RFM_Score’] < 9):
    return ‘Middle’
    else:
    return ‘Low’

    data[‘General_Segment’] = data.apply(create_segment, axis=1)
    data.groupby(‘General_Segment’).agg({‘Recency’: ‘mean’,
    ‘Frequency’: ‘mean’,
    ‘MonetaryValue’: [‘mean’, ‘count’]}).round(1)

    Plotting the distributions of Recency, Frequency and Monetary Value variables

    plt.figure(figsize=(12,10))

    plt.subplot(3, 1, 1); sns.distplot(data[‘Recency’])

    plt.subplot(3, 1, 2); sns.distplot(data[‘Frequency’])

    plt.subplot(3, 1, 3); sns.distplot(data[‘MonetaryValue’])

    RFM distributions

    Checking for constant mean and variance.
    data[[‘Recency’,’Frequency’,’MonetaryValue’]].describe()

    data[data[‘MonetaryValue’] == 0]

    raw_data = data[[‘Recency’,’Frequency’,’MonetaryValue’]]

    #############RFM SCORE TRANSFORMATIONS

    Let’s unskew the data
    data_log = np.log(raw_data)

    Initialize a standard scaler and fit it
    scaler = StandardScaler()
    scaler.fit(data_log)

    Scale and center the data
    data_normalized = scaler.transform(data_log)

    Create a pandas DataFrame
    data_norm = pd.DataFrame(data=data_log, index=raw_data.index, columns=raw_data.columns)

    Let’s plot the transformed RFM distributions

    plt.figure(figsize=(12,10))

    plt.subplot(3, 1, 1); sns.distplot(data_norm[‘Recency’])

    plt.subplot(3, 1, 2); sns.distplot(data_norm[‘Frequency’])

    plt.subplot(3, 1, 3); sns.distplot(data_norm[‘MonetaryValue’])

    plt.show()

    The transformed RFM distributions

    #############K-MEANS CLUSTERING

    from sklearn.cluster import KMeans

    sse = {}

    Fit KMeans and calculate SSE for each k
    for k in range(1, 21):

    # Initialize KMeans with k clusters
    kmeans = KMeans(n_clusters=k, random_state=1)
    
    # Fit KMeans on the normalized dataset
    kmeans.fit(data_norm)
    
    # Assign sum of squared distances to k element of dictionary
    sse[k] = kmeans.inertia_
    

    Let’s look at the Elbow plot

    plt.figure(figsize=(12,8))

    plt.title(‘The Elbow Method’)
    plt.xlabel(‘k’);
    plt.ylabel(‘Sum of squared errors’)
    sns.pointplot(x=list(sse.keys()), y=list(sse.values()))
    plt.show()

    The Elbow plot

    Let’s apply Kmeans with n_clusters=3

    kmeans = KMeans(n_clusters=3, random_state=1)

    Compute k-means clustering on pre-processed data
    kmeans.fit(data_norm)

    Extract cluster labels from labels_ attribute
    cluster_labels = kmeans.labels_

    Create a cluster label column in the original DataFrame
    data_norm_k3 = data_norm.assign(Cluster = cluster_labels)
    data_k3 = raw_data.assign(Cluster = cluster_labels)

    Calculate average RFM values and size for each cluster
    summary_k3 = data_k3.groupby([‘Cluster’]).agg({‘Recency’: ‘mean’,
    ‘Frequency’: ‘mean’,
    ‘MonetaryValue’: [‘mean’, ‘count’],}).round(0)

    summary_k3

    RFM mean count for 3 clusters

    Let’s apply Kmeans with n_clusters=4

    kmeans = KMeans(n_clusters=4, random_state=1)

    Compute k-means clustering on pre-processed data
    kmeans.fit(data_norm)

    Extract cluster labels from labels_ attribute
    cluster_labels = kmeans.labels_

    Create a cluster label column in the original DataFrame
    data_norm_k4 = data_norm.assign(Cluster = cluster_labels)
    data_k4 = raw_data.assign(Cluster = cluster_labels)

    Calculate average RFM values and size for each cluster
    summary_k4 = data_k4.groupby([‘Cluster’]).agg({‘Recency’: ‘mean’,
    ‘Frequency’: ‘mean’,
    ‘MonetaryValue’: [‘mean’, ‘count’],}).round(0)

    summary_k4

    RFM mean count for 4 clusters

    data_norm_k4.index = data[‘Customer ID’].astype(int)

    Let’s invoke melt to massage the data as follows
    data_melt = pd.melt(data_norm_k4.reset_index(),
    id_vars=[‘Customer ID’, ‘Cluster’],
    value_vars=[‘Recency’, ‘Frequency’, ‘MonetaryValue’],
    var_name=’Attribute’,
    value_name=’Value’)

    Let’s look at the RFM snake plot for 4 clusters

    plt.title(‘Snake plot of standardized variables’)
    sns.lineplot(x=”Attribute”, y=”Value”, hue=’Cluster’, data=data_melt)

    RFm snake plot for 4 clusters

    ######RFM Data Wrangling:

    data_k4.index = data[‘Customer ID’].astype(int)

    raw_data.index = data[‘Customer ID’].astype(int)

    cluster_avg = data_k4.groupby([‘Cluster’]).mean()
    population_avg = raw_data.head().mean()

    population_avg

    Recency            88.400
    Frequency          65.800
    MonetaryValue    1224.666
    dtype: float64

    relative_imp = cluster_avg / population_avg – 1

    Let’s plot Relative importance of attributes as a sns heatmap

    plt.figure(figsize=(8, 4))
    plt.title(‘Relative importance of attributes’)
    sns.heatmap(data=relative_imp, annot=True, fmt=’.2f’, cmap=’RdYlGn’)
    plt.show()

    Relative importance of attributes

    Read more here.

    Customer Segmentation, RFM, K-Means and Cohort Analysis in Online Retail

    Let’s set the working directory YOURPATH

    import os
    os.chdir(‘YOURPATH’)
    os. getcwd()

    and import standard libraries with optional settings

    import numpy as np
    import pandas as pd
    import seaborn as sns
    import matplotlib.pyplot as plt
    import scipy.stats as stats
    %matplotlib inline
    import statsmodels.api as sm
    import statsmodels.formula.api as smf
    import missingno as msno

    from sklearn.compose import make_column_transformer

    from sklearn.preprocessing import scale
    from sklearn.preprocessing import StandardScaler
    from sklearn.preprocessing import PolynomialFeatures
    from sklearn.preprocessing import OneHotEncoder
    from sklearn.preprocessing import PowerTransformer
    from sklearn.preprocessing import MinMaxScaler
    from sklearn.preprocessing import RobustScaler

    import cufflinks as cf
    import plotly.offline
    cf.go_offline()
    cf.set_config_file(offline=False, world_readable=True)

    import warnings
    warnings.filterwarnings(“ignore”)
    warnings.warn(“this will not show”)

    plt.rcParams[“figure.figsize”] = (10,6)
    pd.set_option(‘max_colwidth’,200)
    pd.set_option(‘display.max_rows’, 1000)
    pd.set_option(‘display.max_columns’, 200)
    pd.set_option(‘display.float_format’, lambda x: ‘%.3f’ % x)

    import colorama
    from colorama import Fore, Style
    from termcolor import colored

    import ipywidgets
    from ipywidgets import interact

    sns.set_style(“whitegrid”)

    Let’s read the input dataset

    df=pd.read_csv(‘Online Retail.csv’,index_col=0)

    and define some useful functions

    #######################

    def missing_values(df):
    missing_number = df.isnull().sum().sort_values(ascending=False)
    missing_percent = (df.isnull().sum()/df.isnull().count()).sort_values(ascending=False)
    missing_values = pd.concat([missing_number, missing_percent], axis=1, keys=[‘Missing_Number’, ‘Missing_Percent’])
    return missing_values[missing_values[‘Missing_Number’]>0]

    def first_looking(df):
    print(colored(“Shape:”, attrs=[‘bold’]), df.shape,’\n’,
    colored(‘-‘79, ‘red’, attrs=[‘bold’]), colored(“\nInfo:\n”, attrs=[‘bold’]), sep=”) print(df.info(), ‘\n’, colored(‘-‘79, ‘red’, attrs=[‘bold’]), sep=”)
    print(colored(“Number of Uniques:\n”, attrs=[‘bold’]), df.nunique(),’\n’,
    colored(‘-‘79, ‘red’, attrs=[‘bold’]), sep=”) print(colored(“Missing Values:\n”, attrs=[‘bold’]), missing_values(df),’\n’, colored(‘-‘79, ‘red’, attrs=[‘bold’]), sep=”)
    print(colored(“All Columns:”, attrs=[‘bold’]), list(df.columns),’\n’,
    colored(‘-‘*79, ‘red’, attrs=[‘bold’]), sep=”)

    df.columns= df.columns.str.lower().str.replace('&', '_').str.replace(' ', '_')
    
    print(colored("Columns after rename:", attrs=['bold']), list(df.columns),'\n',
              colored('-'*79, 'red', attrs=['bold']), sep='')
    

    def multicolinearity_control(df):
    feature =[]
    collinear=[]
    for col in df.corr().columns:
    for i in df.corr().index:
    if (abs(df.corr()[col][i])> .9 and abs(df.corr()[col][i]) < 1):
    feature.append(col)
    collinear.append(i)
    print(colored(f”Multicolinearity alert in between:{col} – {i}”,
    “red”, attrs=[‘bold’]), df.shape,’\n’,
    colored(‘-‘*79, ‘red’, attrs=[‘bold’]), sep=”)

    def duplicate_values(df):
    print(colored(“Duplicate check…”, attrs=[‘bold’]), sep=”)
    duplicate_values = df.duplicated(subset=None, keep=’first’).sum()
    if duplicate_values > 0:
    df.drop_duplicates(keep=’first’, inplace=True)
    print(duplicate_values, colored(” Duplicates were dropped!”),’\n’,
    colored(‘-‘79, ‘red’, attrs=[‘bold’]), sep=”) else: print(colored(“There are no duplicates”),’\n’, colored(‘-‘79, ‘red’, attrs=[‘bold’]), sep=”)

    def drop_columns(df, drop_columns):
    if drop_columns !=[]:
    df.drop(drop_columns, axis=1, inplace=True)
    print(drop_columns, ‘were dropped’)
    else:
    print(colored(‘We will now check the missing values and if necessary will drop realted columns!’, attrs=[‘bold’]),’\n’,
    colored(‘-‘*79, ‘red’, attrs=[‘bold’]), sep=”)

    def drop_null(df, limit):
    print(‘Shape:’, df.shape)
    for i in df.isnull().sum().index:
    if (df.isnull().sum()[i]/df.shape[0]*100)>limit:
    print(df.isnull().sum()[i], ‘percent of’, i ,’null and were dropped’)
    df.drop(i, axis=1, inplace=True)
    print(‘new shape:’, df.shape)
    print(‘New shape after missing value control:’, df.shape)

    #######################

    Let’s check the structure and content of our input data

    first_looking(df)
    duplicate_values(df)
    drop_columns(df,[])
    drop_null(df, 90)
    df.head()
    df.tail()
    df.sample(5)
    df.describe().T
    df.describe(include=object).T

    Shape:(541909, 8)
    -------------------------------------------------------------------------------
    Info:
    
    <class 'pandas.core.frame.DataFrame'>
    Index: 541909 entries, 0 to 541908
    Data columns (total 8 columns):
     #   Column       Non-Null Count   Dtype  
    ---  ------       --------------   -----  
     0   InvoiceNo    541909 non-null  object 
     1   StockCode    541909 non-null  object 
     2   Description  540455 non-null  object 
     3   Quantity     541909 non-null  float64
     4   InvoiceDate  541909 non-null  object 
     5   UnitPrice    541909 non-null  float64
     6   CustomerID   406829 non-null  float64
     7   Country      541909 non-null  object 
    dtypes: float64(3), object(5)
    memory usage: 37.2+ MB
    None
    -------------------------------------------------------------------------------
    Number of Uniques:
    InvoiceNo      25900
    StockCode       4070
    Description     4223
    Quantity         722
    InvoiceDate    23260
    UnitPrice       1630
    CustomerID      4372
    Country           38
    dtype: int64
    -------------------------------------------------------------------------------
    Missing Values:
                 Missing_Number  Missing_Percent
    CustomerID           135080            0.249
    Description            1454            0.003
    -------------------------------------------------------------------------------
    All Columns:['InvoiceNo', 'StockCode', 'Description', 'Quantity', 'InvoiceDate', 'UnitPrice', 'CustomerID', 'Country']
    -------------------------------------------------------------------------------
    Columns after rename:['invoiceno', 'stockcode', 'description', 'quantity', 'invoicedate', 'unitprice', 'customerid', 'country']
    -------------------------------------------------------------------------------
    Duplicate check...
    5268 Duplicates were dropped!
    -------------------------------------------------------------------------------
    We will now check the missing values and if necessary will drop realted columns!
    -------------------------------------------------------------------------------
    Shape: (536641, 8)
    New shape after missing value control: (536641, 8)
    Input data statistics

    ######## DATA WRANGLING

    df_model = df.copy()

    df.isnull().sum()

    invoiceno           0
    stockcode           0
    description      1454
    quantity            0
    invoicedate         0
    unitprice           0
    customerid     135037
    country             0
    dtype: int64

    We will drop NaN values of customerid
    df = df.dropna(subset=[‘customerid’])
    df.shape

    (401604, 8)

    df[‘cancelled’] = df[‘invoiceno’].str.contains(‘C’)

    Let’s drop cancelled
    df = df.drop([‘cancelled’],axis=1)

    Clean the Data from the Noise and Missing Values
    df = df[(df.quantity > 0) & (df.unitprice > 0)]
    df.shape

    (392692, 8)

    Let’s look at the treemap Number of Customers Per Country Without UK

    fig = px.treemap(df.groupby(‘country’)[[“customerid”]].nunique().sort_values(by=”customerid”, ascending=False).iloc[1:],
    path=[df.groupby(‘country’)[[“customerid”]].nunique().sort_values(by=”customerid”, ascending=False).iloc[1:].index],
    values=’customerid’,
    width=1000,
    height=600)
    fig.update_layout(title_text=’Number of Customers Per Country Without UK’,
    title_x = 0.5, title_font = dict(size=20)
    )
    fig.update_layout(margin = dict(t=50, l=25, r=25, b=25))
    fig.show()

    Number of Customers Per Country Without UK

    #######TOTAL PRICE aka REVENUE

    The calculation of total price
    df[‘total_price’] = df[‘quantity’] * df[‘unitprice’]

    Let’s save this dataframe for future analysis
    df_total_price = pd.DataFrame(df.groupby(‘country’)[“total_price”].sum().sort_values(ascending=False))

    Let’s plot the treemap Total Prices By Countries Without UK

    fig = px.treemap(df_total_price.iloc[1:],
    path=[df_total_price.iloc[1:].index],
    values=’total_price’,
    width=1000,
    height=600)
    fig.update_layout(title_text=’Total Prices By Countries Without UK’,
    title_x = 0.5, title_font = dict(size=20)
    )
    fig.update_layout(margin = dict(t=50, l=25, r=25, b=25))
    fig.show()

    Total Prices By Countries Without UK

    #########UK RFM ANALYSIS

    df_uk = df[df[“country”] == ‘United Kingdom’]

    Top 15 most owned products
    df_uk[“stockcode”].value_counts().head(15).plot(kind=”bar”, width=0.5, color=’pink’, edgecolor=’purple’, figsize=(16,9))
    plt.xticks(rotation=45);

    Top 15 most owned products in UK

    last_invoice = max(df_uk[‘invoicedate’])
    last_invoice

    '2011-12-09 12:49:00'

    first_invoice = min(df_uk[‘invoicedate’])
    first_invoice

    '2010-12-01 08:26:00'

    df_uk[‘invoicedate’] = pd.to_datetime(df_uk[‘invoicedate’])

    df_uk[“date”] = df_uk[‘invoicedate’].dt.date

    df_uk[“last_purchased_date”] = df_uk.groupby(“customerid”).date.transform(max)

    last_invoice = pd.to_datetime(last_invoice).date()

    df_uk.groupby(‘customerid’)[‘last_purchased_date’].apply(lambda x: last_invoice – x)

    # Recency

    df_recency = df_uk.groupby(by=’customerid’,
    as_index=False)[‘date’].max()
    df_recency.columns = [‘customerid’, ‘last_purchased_date’]
    recent_date = df_recency[‘last_purchased_date’].max()
    df_recency[‘recency’] = df_recency[‘last_purchased_date’].apply(
    lambda x: (recent_date – x).days)
    #df_recency.head()

    Let’s plot recency

    plt.figure(figsize=(16,9))
    sns.distplot(df_recency[‘recency’], bins=35);

    UK Recency

    # Frequency: Number of purchases

    df_uk[‘frequency’] = df_uk.groupby(‘customerid’)[“invoiceno”].transform(‘count’)

    plt.figure(figsize=(16,9))
    sns.distplot(df_uk[‘frequency’], bins=50);

    UK Frequency

    ########### Monetary: Total amount of money spent

    df_uk[“monetary”] = df_uk.groupby(‘customerid’)[“total_price”].transform(‘sum’)

    plt.figure(figsize=(16,9))
    sns.distplot(df_uk[‘monetary’], bins=50);

    UK Monetary

    ############ UK RFM Table

    df_uk.columns

    Index(['invoiceno', 'stockcode', 'description', 'quantity', 'invoicedate',
           'unitprice', 'customerid', 'country', 'total_price', 'date',
           'last_purchased_date', 'frequency', 'monetary'],
          dtype='object')

    df_recency.columns

    Index(['customerid', 'last_purchased_date', 'recency'], dtype='object')

    df_uk[‘recency’] = df_recency[‘recency’]

    df_rfm_table = df_uk[[‘customerid’, ‘recency’, ‘frequency’, ‘monetary’]]

    Customer Segmentation with RFM Scores
    df_rfm_table = df_rfm_table.set_index(‘customerid’)

    df_rfm_table.reset_index().duplicated().sum()

    df_rfm_table.drop_duplicates(inplace=True)
    df_rfm_table.shape

    (6050, 3)

    ########### Recency Scoring

    df_rfm_table[“recency”].quantile(q = [.25,.5,.75])

    0.250    21.000
    0.500    58.000
    0.750   163.000
    Name: recency, dtype: float64

    def recency_scoring(data):
    if data[“recency”] <= 17.000:
    return 4
    elif data[“recency”] <= 50.000:
    return 3
    elif data[“recency”] <= 142.000:
    return 2
    else:
    return 1

    df_rfm_table[‘recency_quantile’] = df_rfm_table.apply(recency_scoring, axis =1)
    #df_rfm_table.head()

    ########### Frequency Scoring

    df_rfm_table[“frequency”].quantile(q = [.25,.5,.75])

    0.250    23.000
    0.500    65.000
    0.750   147.000
    Name: frequency, dtype: float64

    def frequency_scoring(data):
    if data.frequency <= 17.000:
    return 1
    elif data.frequency <= 40.000:
    return 2
    elif data.frequency <= 98.000:
    return 3
    else:
    return 4

    df_rfm_table[‘frequency_quantile’] = df_rfm_table.apply(frequency_scoring, axis =1)
    #df_rfm_table.head()

    ########### Monetary Scoring

    df_rfm_table[“monetary”].quantile(q = [.25,.5,.75])

    def monetary_scoring(data):
    if data.monetary <= 298.185:
    return 1
    elif data.monetary <= 644.975:
    return 2
    elif data.monetary <= 1571.285:
    return 3
    else:
    return 4

    df_rfm_table[‘monetary_quantile’] = df_rfm_table.apply(monetary_scoring, axis =1)
    #df_rfm_table.head()

    ############### RFM Scoring

    def rfm_scoring(data):
    return str(int(data[‘recency_quantile’])) + str(int(data[‘frequency_quantile’])) + str(int(data[‘monetary_quantile’]))

    df_rfm_table[‘rfm_score’] = df_rfm_table.apply(rfm_scoring, axis=1)
    #df_rfm_table.head()

    df_rfm_table[‘rfm_level’] = df_rfm_table[‘recency_quantile’] + df_rfm_table[‘frequency_quantile’] + df_rfm_table[‘monetary_quantile’]
    #df_rfm_table.head()

    def segments1(data):
    if data[‘rfm_level’] >= 10 :
    return ‘Gold’
    elif (data[‘rfm_level’] >= 6) and (data[‘rfm_level’] < 10 ):
    return ‘Silver’
    else:
    return ‘Bronze’

    df_rfm_table [‘segments1’] = df_rfm_table.apply(segments1,axis=1)
    #df_rfm_table.head()

    segments2 = {
    ‘Customer Segment’:
    [‘Champions’,
    ‘Loyal Customers’,
    ‘Potential Loyalist’,
    ‘Recent Customers’,
    ‘Customers Needing Attention’,
    ‘Still Got Hope’,
    ‘Need to Get Them Back’,
    ‘Lost’, ‘Give it a Try’],\
    ‘RFM’:
    [‘(3|4)-(3|4)-(3|4)’,
    ‘(2|3|4)-(3|4)-(1|2|3|4)’,
    ‘(3|4)-(2|3)-(1|2|3|4)’,
    ‘(4)-(1)-(1|2|3|4)’,
    ‘(2|3)-(2|3)-(2|3)’,
    ‘(2|3)-(1|2)-(1|2|3|4)’,
    ‘(1|2)-(3|4)-(2|3|4)’,
    ‘(1|2)-(1|2)-(1|2)’,
    ‘(1|2)-(1|2|3)-(1|2|3|4)’]
    }
    pd.DataFrame(segments2)

    def categorizer(rfm):
    if (rfm[0] in [‘3’, ‘4’]) & (rfm[1] in [‘3’, ‘4’]) & (rfm[2] in [‘3’, ‘4’]):
    rfm = ‘Champions’

    elif (rfm[0] in ['2', '3', '4']) & (rfm[1] in ['3', '4']) & (rfm[2] in ['1', '2', '3', '4']):
        rfm = 'Loyal Customers'
    
    elif (rfm[0] in ['3', '4']) & (rfm[1] in ['2', '3']) & (rfm[2] in ['1', '2', '3', '4']):
        rfm = 'Potential Loyalist'
    
    elif (rfm[0] in ['4']) & (rfm[1] in ['1']) & (rfm[2] in ['1', '2', '3', '4']):
        rfm = 'Recent Customers'
    
    elif (rfm[0] in ['2', '3']) & (rfm[1] in ['2', '3']) & (rfm[2] in ['2', '3']):
        rfm = 'Customers Needing Attention'
    
    elif (rfm[0] in ['2', '3']) & (rfm[1] in ['1', '2']) & (rfm[2] in ['1', '2', '3', '4']):
        rfm = 'Still Got Hope'
    
    elif (rfm[0] in ['1', '2']) & (rfm[1] in ['3', '4']) & (rfm[2] in ['2', '3', '4']):
        rfm = 'Need to Get Them Back'
    
    elif (rfm[0] in ['1', '2']) & (rfm[1] in ['1', '2']) & (rfm[2] in ['1', '2']):
        rfm = 'Lost'
    
    elif (rfm[0] in ['1', '2']) & (rfm[1] in ['1', '2', '3']) & (rfm[2] in ['1', '2', '3', '4']):
        rfm = 'Give it a Try'
    
    return rfm 
    

    df_rfm_table[‘segments2’] = df_rfm_table[“rfm_score”].apply(categorizer)
    #df_rfm_table.head()

    df_rfm_table[“segments2”].value_counts(dropna=False)

    segments2
    Need to Get Them Back          2343
    Lost                           1666
    Champions                       660
    Loyal Customers                 602
    Give it a Try                   483
    Potential Loyalist              115
    Still Got Hope                   93
    Recent Customers                 46
    Customers Needing Attention      40
    141                               2
    Name: count, dtype: int64

    df_rfm_table[“segments1”].info()

    <class 'pandas.core.series.Series'>
    Index: 6050 entries, 17850.0 to 14569.0
    Series name: segments1
    Non-Null Count  Dtype 
    --------------  ----- 
    6050 non-null   object
    dtypes: object(1)
    memory usage: 94.5+ KB

    Plot RFM Segments
    df_plot1 = pd.DataFrame(df_rfm_table[“segments1”].value_counts(dropna=False).sort_values(ascending=False)).reset_index().rename(columns={‘index’:’Segments’, ‘segments1′:’Customers’})
    df_plot1

    import matplotlib.pyplot as plt
    import squarify

    ############################# segment1

    sizes=[3200, 1944, 906]
    label=[“Silver”, “Bronze”, “Gold”]
    color=[‘yellow’,’blue’,’orange’]
    squarify.plot(sizes=sizes, label=label, color=color,alpha=0.8 , text_kwargs={‘fontsize’:16})
    plt.axis(‘off’)
    plt.show()

    UK RFM segment1

    ################ segment2

    import matplotlib.pyplot as plt
    import seaborn as sns

    data = [2343, 1666, 660, 602, 483, 115]
    labels = [‘Need to Get Them Back’, ‘Lost’,’Champions’,’Loyal Customers’,’Give it a Try’,’Potential Loyalist’]

    colors = sns.color_palette(‘bright’)[0:6]

    plt.pie(data, labels = labels, colors = colors, autopct=’%.0f%%’)
    plt.show()

    UK RFM segment2

    Read more here.

    Customer Clustering PCA, K-Means and Agglomerative for Marketing Campaigns

    Let’s set the working directory

    import os
    os.chdir(‘YOURPATH’)
    os. getcwd()

    and read the dataset

    import pandas as pd
    import numpy as np
    df = pd.read_csv(‘marketing_campaign.csv’, sep=”\t”)

    Get the shape of the dataframe
    print(“Shape of the dataframe:”, df.shape)

    Shape of the dataframe: (2240, 29)

    Get the information about the dataframe
    print(“\nInformation about the dataframe:”)
    print(df.info())

    Information about the dataframe:
    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 2240 entries, 0 to 2239
    Data columns (total 29 columns):
     #   Column               Non-Null Count  Dtype  
    ---  ------               --------------  -----  
     0   ID                   2240 non-null   int64  
     1   Year_Birth           2240 non-null   int64  
     2   Education            2240 non-null   object 
     3   Marital_Status       2240 non-null   object 
     4   Income               2216 non-null   float64
     5   Kidhome              2240 non-null   int64  
     6   Teenhome             2240 non-null   int64  
     7   Dt_Customer          2240 non-null   object 
     8   Recency              2240 non-null   int64  
     9   MntWines             2240 non-null   int64  
     10  MntFruits            2240 non-null   int64  
     11  MntMeatProducts      2240 non-null   int64  
     12  MntFishProducts      2240 non-null   int64  
     13  MntSweetProducts     2240 non-null   int64  
     14  MntGoldProds         2240 non-null   int64  
     15  NumDealsPurchases    2240 non-null   int64  
     16  NumWebPurchases      2240 non-null   int64  
     17  NumCatalogPurchases  2240 non-null   int64  
     18  NumStorePurchases    2240 non-null   int64  
     19  NumWebVisitsMonth    2240 non-null   int64  
     20  AcceptedCmp3         2240 non-null   int64  
     21  AcceptedCmp4         2240 non-null   int64  
     22  AcceptedCmp5         2240 non-null   int64  
     23  AcceptedCmp1         2240 non-null   int64  
     24  AcceptedCmp2         2240 non-null   int64  
     25  Complain             2240 non-null   int64  
     26  Z_CostContact        2240 non-null   int64  
     27  Z_Revenue            2240 non-null   int64  
     28  Response             2240 non-null   int64  
    dtypes: float64(1), int64(25), object(3)
    memory usage: 507.6+ KB
    None

    Get the summary statistics of the dataframe
    print(“\nSummary statistics of the dataframe:”)
    df.describe().T

    Summary statistics of the dataframe

    ############# DATA PREPARATION

    Check for missing data
    print(“Missing data in the dataframe:”)
    print(df.isnull().sum())

    Missing data in the dataframe:
    ID                      0
    Year_Birth              0
    Education               0
    Marital_Status          0
    Income                 24
    Kidhome                 0
    Teenhome                0
    Dt_Customer             0
    Recency                 0
    MntWines                0
    MntFruits               0
    MntMeatProducts         0
    MntFishProducts         0
    MntSweetProducts        0
    MntGoldProds            0
    NumDealsPurchases       0
    NumWebPurchases         0
    NumCatalogPurchases     0
    NumStorePurchases       0
    NumWebVisitsMonth       0
    AcceptedCmp3            0
    AcceptedCmp4            0
    AcceptedCmp5            0
    AcceptedCmp1            0
    AcceptedCmp2            0
    Complain                0
    Z_CostContact           0
    Z_Revenue               0
    Response                0
    dtype: int64

    Let’s drop duplicates

    df = df.dropna()
    df.duplicated().sum()

    0

    ########### Outliers

    Calculate the IQR for the Income column
    Q1 = df[‘Income’].quantile(0.25)
    Q3 = df[‘Income’].quantile(0.75)
    IQR = Q3 – Q1

    Identify the outliers in the Income column
    outliers = df[(df[‘Income’] < (Q1 – 1.5 * IQR)) | (df[‘Income’] > (Q3 + 1.5 * IQR))]

    Print the number of outliers
    print(“Number of outliers in the Income column:”, len(outliers))

    Number of outliers in the Income column: 8

    Remove the outliers in the Income column
    df = df[~((df[‘Income’] < (Q1 – 1.5 * IQR)) | (df[‘Income’] > (Q3 + 1.5 * IQR)))]

    Print the updated shape of the dataframe
    print(“Updated shape of the dataframe:”, df.shape)

    ############## Feature engineering
    print(“Unique values in Education column:”, df[‘Education’].unique())
    print(“Unique values in Marital_Status column:”, df[‘Marital_Status’].unique())

    Unique values in Education column: ['Graduation' 'PhD' 'Master' 'Basic' '2n Cycle']
    Unique values in Marital_Status column: ['Single' 'Together' 'Married' 'Divorced' 'Widow' 'Alone' 'Absurd' 'YOLO']

    Let’s define the following new columns

    def education_level(education):
    if education in [‘Graduation’, ‘PhD’, ‘Master’]:
    return ‘High’
    elif education in [‘Basic’]:
    return ‘Middle’
    else:
    return ‘Low’

    df[‘Education_Level’] = df[‘Education’].apply(education_level)

    def living_status(marital_status):
    if marital_status in [‘Alone’, ‘Absurd’, ‘YOLO’]:
    return ‘Living Alone’
    else:
    return ‘Living with Others’

    df[‘Living_Status’] = df[‘Marital_Status’].apply(living_status)

    df[‘Age’] = 2022 – df[‘Year_Birth’]

    df[‘Total_Campaigns_Accepted’] = df[[‘AcceptedCmp1’, ‘AcceptedCmp2’, ‘AcceptedCmp3’, ‘AcceptedCmp4’, ‘AcceptedCmp5’]].sum(axis=1)

    df[‘Average_Spend’] = (df[[‘MntWines’, ‘MntFruits’, ‘MntMeatProducts’, ‘MntFishProducts’, ‘MntSweetProducts’, ‘MntGoldProds’]].sum(axis=1)) / df[‘NumDealsPurchases’]

    df[‘Spent’] = df[‘MntWines’]+df[“MntWines”] +df[‘MntFruits’]+ df[‘MntMeatProducts’] +df[‘MntFishProducts’]+df[‘MntSweetProducts’]+ df[‘MntGoldProds’]

    df[‘Is_Parent’] = (df[‘Kidhome’] + df[‘Teenhome’] > 0).astype(int)

    Create the new feature for total spending in the last 2 years
    df[‘total_spending’] = df[‘MntWines’] + df[‘MntFruits’] + df[‘MntMeatProducts’] + df[‘MntFishProducts’] + df[‘MntSweetProducts’] + df[‘MntGoldProds’]

    Create new feature for average monthly visits to the company’s website
    df[‘avg_web_visits’] = df[‘NumWebVisitsMonth’] / 12

    Create new feature for the ratio of online purchases to total purchases
    df[‘online_purchase_ratio’] = df[‘NumWebPurchases’] / (df[‘NumWebPurchases’] + df[‘NumCatalogPurchases’] + df[‘NumStorePurchases’])

    Let’s drop a few columns

    to_drop = [‘Dt_Customer’, ‘Z_CostContact’, ‘Z_Revenue’, ‘Year_Birth’, ‘ID’]
    df = df.drop(to_drop, axis=1)

    df.dtypes

    Education                    object
    Marital_Status               object
    Income                      float64
    Kidhome                       int64
    Teenhome                      int64
    Recency                       int64
    MntWines                      int64
    MntFruits                     int64
    MntMeatProducts               int64
    MntFishProducts               int64
    MntSweetProducts              int64
    MntGoldProds                  int64
    NumDealsPurchases             int64
    NumWebPurchases               int64
    NumCatalogPurchases           int64
    NumStorePurchases             int64
    NumWebVisitsMonth             int64
    AcceptedCmp3                  int64
    AcceptedCmp4                  int64
    AcceptedCmp5                  int64
    AcceptedCmp1                  int64
    AcceptedCmp2                  int64
    Complain                      int64
    Response                      int64
    Education_Level              object
    Living_Status                object
    Age                           int64
    Total_Campaigns_Accepted      int64
    Average_Spend               float64
    Spent                         int64
    Is_Parent                     int32
    total_spending                int64
    avg_web_visits              float64
    online_purchase_ratio       float64
    dtype: object

    ########### PLOTLY EXPLORATORY DATA ANALYSIS (EDA)

    import plotly.express as px

    fig0 = px.histogram(df, x=”Income”, nbins=50)
    fig0.show()

    Income histogram

    fig1 = px.histogram(df, x=”Age”, nbins=30, color=’Age’, title=”Distribution of Age”)
    fig1.show()

    Age histogram

    fig2 = px.histogram(df, x=’Marital_Status’, nbins=5, title=”Marital Status Distribution”)
    fig2.show()

    Marital Status Distribution

    fig3 = px.histogram(df, x=’Education_Level’, nbins=5, title=”Education Level Distribution”)
    fig3.show()

    Education Level Distribution

    import plotly.express as px

    df_plot = df.groupby([‘Marital_Status’])[‘Average_Spend’].mean().reset_index()

    fig4 = px.bar(df_plot, x=’Marital_Status’, y=’Average_Spend’, color=’Marital_Status’)

    fig4.show()

    Marital Status vs Average Spend Distribution

    Marital Status vs Average Spend Distribution

    Plot the distribution of number of children in household
    fig6 = px.histogram(df, x=’Kidhome’)
    fig6.show()

     Distribution of number of children in household

    Total Campaigns Accepted Distribution
    fig8 = px.histogram(df, x=’Total_Campaigns_Accepted’, nbins=20, title=”Total Campaigns Accepted Distribution”)
    fig8.show()

    Total Campaigns Accepted Distribution

    Average Spend per Purchase Distribution
    fig9 = px.histogram(df, x=’Average_Spend’, nbins=20, title=”Average Spend per Purchase Distribution”)
    fig9.show()

    Average Spend per Purchase Distribution

    Spending distribution by marital status, education, and is_parent

    fig10 = px.histogram(df, x=’total_spending’, color=’Marital_Status’, nbins=50,
    title=’Spending Distribution by Marital Status’)
    fig11 = px.histogram(df, x=’total_spending’, color=’Education_Level’, nbins=50,
    title=’Spending Distribution by Education Level’)
    fig12 = px.histogram(df, x=’total_spending’, color=’Is_Parent’, nbins=50,
    title=’Spending Distribution by Is_Parent’)

    fig10.show()
    fig11.show()
    fig12.show()

    Spending distribution by marital status
    Spending distribution by education level
    Spending distribution is_parent

    Plot the Distribution of Online Purchase Ratio
    fig13 = px.histogram(df, x=’online_purchase_ratio’)
    fig13.show()

    Online purchase ratio

    Plot the Distribution of Number of Web Visits per Month
    fig14 = px.histogram(df, x=’NumWebVisitsMonth’)
    fig14.show()

    Number of Web Visits per Month

    Plot the Distribution of Number of Web Purchases
    fig15 = px.histogram(df, x=’NumWebPurchases’)
    fig15.show()

    Number of Web Purchases

    Plot the Distribution of Number of Catalog Purchases
    fig16 = px.histogram(df, x=’NumCatalogPurchases’)
    fig16.show()

    Number of Catalog Purchases

    Plot the Distribution of Number of Store Purchases
    fig17 = px.histogram(df, x=’NumStorePurchases’)
    fig17.show()

    Number of Store Purchases

    Box plot of NumWebPurchases vs NumStorePurchases
    fig18 = px.box(df, x=”NumWebPurchases”, y=”NumStorePurchases”)
    fig18.show()

    Box plot of NumWebPurchases vs NumStorePurchases

    Scatter plot of “MntWines” vs “MntSweetProducts” with a third variable represented by size or color
    fig22 = px.scatter(df, x=”MntWines”, y=”MntSweetProducts”, size=”NumWebVisitsMonth”, color=”Income”, size_max=50)
    fig22.show()

    fig22 = px.scatter(df, x="MntWines", y="MntSweetProducts", size="NumWebVisitsMonth", color="Income", size_max=50)

    ########### CLUSTERING ALGORITHMS

    import plotly.express as px
    import plotly.io as pio

    from sklearn.preprocessing import StandardScaler, MinMaxScaler

    ######## Data Preparation

    df = pd.read_csv(‘marketing_campaign.csv’, sep=”\t”)

    df = df.dropna()

    df.duplicated().sum()

    0

    Calculate the IQR for the Income column
    Q1 = df[‘Income’].quantile(0.25)
    Q3 = df[‘Income’].quantile(0.75)
    IQR = Q3 – Q1

    Identify the outliers in the Income column
    outliers = df[(df[‘Income’] < (Q1 – 1.5 * IQR)) | (df[‘Income’] > (Q3 + 1.5 * IQR))]

    Remove the outliers in the Income column
    df = df[~((df[‘Income’] < (Q1 – 1.5 * IQR)) | (df[‘Income’] > (Q3 + 1.5 * IQR)))]

    df[‘Age’] = 2022 – df[‘Year_Birth’]
    df[‘Total_Campaigns_Accepted’] = df[[‘AcceptedCmp1’, ‘AcceptedCmp2’, ‘AcceptedCmp3’, ‘AcceptedCmp4’, ‘AcceptedCmp5’]].sum(axis=1)

    df[‘Average_Spend’] = (df[[‘MntWines’, ‘MntFruits’, ‘MntMeatProducts’, ‘MntFishProducts’, ‘MntSweetProducts’, ‘MntGoldProds’]].sum(axis=1)) / df[‘NumDealsPurchases’]

    df[‘Spent’] = df[‘MntWines’]+df[“MntWines”] +df[‘MntFruits’]+ df[‘MntMeatProducts’] +df[‘MntFishProducts’]+df[‘MntSweetProducts’]+ df[‘MntGoldProds’]

    df[‘Is_Parent’] = (df[‘Kidhome’] + df[‘Teenhome’] > 0).astype(int)

    df[‘total_spending’] = df[‘MntWines’] + df[‘MntFruits’] + df[‘MntMeatProducts’] + df[‘MntFishProducts’] + df[‘MntSweetProducts’]

    df[‘avg_web_visits’] = df[‘NumWebVisitsMonth’] / 12

    df[‘online_purchase_ratio’] = df[‘NumWebPurchases’] / (df[‘NumWebPurchases’] + df[‘NumCatalogPurchases’] + df[‘NumStorePurchases’])

    to_drop = [‘Dt_Customer’, ‘Z_CostContact’, ‘Z_Revenue’, ‘Year_Birth’, ‘ID’]
    df = df.drop(to_drop, axis=1)

    One-hot encode the categorical variables

    Select the numerical columns to scale
    num_cols = [‘Income’, ‘Kidhome’, ‘Teenhome’, ‘Recency’, ‘MntWines’, ‘MntFruits’,
    ‘MntMeatProducts’, ‘MntFishProducts’, ‘MntSweetProducts’, ‘MntGoldProds’,
    ‘NumDealsPurchases’, ‘NumWebPurchases’, ‘NumCatalogPurchases’,
    ‘NumStorePurchases’, ‘NumWebVisitsMonth’, ‘AcceptedCmp3’, ‘AcceptedCmp4’,
    ‘AcceptedCmp5’, ‘AcceptedCmp1’, ‘AcceptedCmp2’, ‘Complain’, ‘Response’,
    ‘total_spending’, ‘avg_web_visits’, ‘online_purchase_ratio’, ‘Age’,
    ‘Total_Campaigns_Accepted’, ‘Is_Parent’]

    Initialize the StandardScaler
    scaler = StandardScaler()

    Fit the scaler to the numerical columns
    df[num_cols] = scaler.fit_transform(df[num_cols])

    df.isnull().sum().sum()

    0

    al=df[‘online_purchase_ratio’].mean()
    print(al)

    4.1834490782976914e-17
    

    df[‘online_purchase_ratio’] = df[‘online_purchase_ratio’].fillna(0.33)

    df.replace([np.inf, -np.inf], np.nan, inplace=True)

    al=df[‘Average_Spend’].mean()
    print(al)

    412.04631023016225

    df[‘Average_Spend’] = df[‘Average_Spend’].fillna(al)

    ######## K-means Clusters

    from sklearn.decomposition import PCA

    Initialize the PCA model
    pca = PCA(n_components=8)

    Fit and transform the data
    df_pca = pca.fit_transform(df)

    Determining the optimal number of clusters using Silhouette Score
    import numpy as np
    from sklearn.metrics import silhouette_score
    from sklearn.cluster import KMeans

    list_k = list(range(2, 10))
    silhouette_scores = []
    for k in list_k:
    km = KMeans(n_clusters=k)
    preds = km.fit_predict(df_pca)
    silhouette_scores.append(silhouette_score(df_pca, preds))

    best_k = list_k[np.argmax(silhouette_scores)]

    best_k

    2

    Let’s fit the KMeans model with the number of clusters set to 4

    import matplotlib.pyplot as plt
    from sklearn.metrics import silhouette_score
    from sklearn.metrics import calinski_harabasz_score,davies_bouldin_score

    kmeans = KMeans(n_clusters=4)
    kmeans.fit(df_pca)

    predictions = kmeans.predict(df_pca)

    silhouette_score_value = silhouette_score(df_pca, predictions)

    calinski_harabasz_score_value = calinski_harabasz_score(df_pca, predictions)

    davies_bouldin_score_value = davies_bouldin_score(df_pca, predictions)

    plt.scatter(df_pca[:, 0], df_pca[:, 1], c=predictions, cmap=’viridis’)
    plt.xlabel(‘Feature 1’)
    plt.ylabel(‘Feature 2’)
    plt.title(‘KMeans Clustering Results in 2D\nSilhouette Score: {0:.3f}’.format(silhouette_score_value))
    plt.show()

    fig = plt.figure()
    ax = fig.add_subplot(111, projection=’3d’)
    ax.scatter(df_pca[:, 0], df_pca[:, 1], df_pca[:, 2], c=predictions, cmap=’viridis’)
    ax.set_xlabel(‘Feature 1’)
    ax.set_ylabel(‘Feature 2’)
    ax.set_zlabel(‘Feature 3’)
    ax.set_title(‘KMeans Clustering Results in 3D\nSilhouette Score: {0:.3f}’.format(silhouette_score_value))
    plt.show()

    KMeans clustering result in 2D
    KMeans clustering result in 3D

    ####### Agglomerative Clustering

    import numpy as np
    import matplotlib.pyplot as plt
    from mpl_toolkits.mplot3d import Axes3D
    from sklearn.cluster import AgglomerativeClustering
    from scipy.spatial.distance import pdist, squareform
    from sklearn.metrics import davies_bouldin_score

    Generate sample data
    X = df_pca

    Compute the pairwise distances between samples
    dist_matrix = squareform(pdist(X))

    Fit the Agglomerative Clustering model
    agg_cluster = AgglomerativeClustering(n_clusters=4)
    agg_cluster.fit(X)

    Plotting

    plt.scatter(X[:, 0], X[:, 1], c=agg_cluster.labels_, cmap=’viridis’)
    plt.show()

    fig = plt.figure()
    ax = fig.add_subplot(111, projection=’3d’)
    ax.scatter(X[:, 0], X[:, 1], X[:, 2], c=agg_cluster.labels_, cmap=’viridis’)
    plt.show()

    davies_bouldin_index = davies_bouldin_score(X, agg_cluster.labels_)
    print(“Davies-Bouldin Index:”, davies_bouldin_index)

    silhouette_score_value = silhouette_score(X, agg_cluster.labels_)

    calinski_harabasz_score_value = calinski_harabasz_score(X, agg_cluster.labels_)

    print(“Silhouette Score:”, silhouette_score_value)
    print(“Calinski-Harabasz Index:”, calinski_harabasz_score_value)

    Agglomerative Clustering 2D
    Agglomerative Clustering 3D

    Read more here.

    Supply Chain RFM & ABC Analysis

    • Let’s apply the RFM/ABC analysis to the Data Co Supply Chain Dataset. Recency, Frequency, And Monetary Analysis(RFM) and ABC are techniques would give us an idea about important factors in our business and how we can optimize the future business growth.
    • The Recency-Frequency-Monetary analysis helps demand side sector of an organization by segmenting customers and providing more insights to marketing department and sales department about the customers who buy products at the store.
    • The ABC analysis helps the supply side sector of an organization by segmenting products based on revenue they generate. It helps manage inventory more efficiently.

    Let’s set the working directory

    import os
    os.chdir(‘YOURPATH’)
    os. getcwd()

    and import standard libraries

    import pandas as pd
    import dateutil.parser
    import numpy as np
    import datetime
    from statsmodels.stats.outliers_influence import variance_inflation_factor
    import seaborn as sns;
    import matplotlib.pyplot as plt
    from plotly import graph_objects as go
    import plotly.express as px

    Let’s read the input data

    dFppsc= pd.read_csv(“DataCoSupplyChainDataset.csv”, encoding_errors=”ignore”)
    dFppsc.info()

    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 180519 entries, 0 to 180518
    Data columns (total 53 columns):
     #   Column                         Non-Null Count   Dtype  
    ---  ------                         --------------   -----  
     0   Type                           180519 non-null  object 
     1   Days for shipping (real)       180519 non-null  int64  
     2   Days for shipment (scheduled)  180519 non-null  int64  
     3   Benefit per order              180519 non-null  float64
     4   Sales per customer             180519 non-null  float64
     5   Delivery Status                180519 non-null  object 
     6   Late_delivery_risk             180519 non-null  int64  
     7   Category Id                    180519 non-null  int64  
     8   Category Name                  180519 non-null  object 
     9   Customer City                  180519 non-null  object 
     10  Customer Country               180519 non-null  object 
     11  Customer Email                 180519 non-null  object 
     12  Customer Fname                 180519 non-null  object 
     13  Customer Id                    180519 non-null  int64  
     14  Customer Lname                 180511 non-null  object 
     15  Customer Password              180519 non-null  object 
     16  Customer Segment               180519 non-null  object 
     17  Customer State                 180519 non-null  object 
     18  Customer Street                180519 non-null  object 
     19  Customer Zipcode               180516 non-null  float64
     20  Department Id                  180519 non-null  int64  
     21  Department Name                180519 non-null  object 
     22  Latitude                       180519 non-null  float64
     23  Longitude                      180519 non-null  float64
     24  Market                         180519 non-null  object 
     25  Order City                     180519 non-null  object 
     26  Order Country                  180519 non-null  object 
     27  Order Customer Id              180519 non-null  int64  
     28  order date (DateOrders)        180519 non-null  object 
     29  Order Id                       180519 non-null  int64  
     30  Order Item Cardprod Id         180519 non-null  int64  
     31  Order Item Discount            180519 non-null  float64
     32  Order Item Discount Rate       180519 non-null  float64
     33  Order Item Id                  180519 non-null  int64  
     34  Order Item Product Price       180519 non-null  float64
     35  Order Item Profit Ratio        180519 non-null  float64
     36  Order Item Quantity            180519 non-null  int64  
     37  Sales                          180519 non-null  float64
     38  Order Item Total               180519 non-null  float64
     39  Order Profit Per Order         180519 non-null  float64
     40  Order Region                   180519 non-null  object 
     41  Order State                    180519 non-null  object 
     42  Order Status                   180519 non-null  object 
     43  Order Zipcode                  24840 non-null   float64
     44  Product Card Id                180519 non-null  int64  
     45  Product Category Id            180519 non-null  int64  
     46  Product Description            0 non-null       float64
     47  Product Image                  180519 non-null  object 
     48  Product Name                   180519 non-null  object 
     49  Product Price                  180519 non-null  float64
     50  Product Status                 180519 non-null  int64  
     51  shipping date (DateOrders)     180519 non-null  object 
     52  Shipping Mode                  180519 non-null  object 
    dtypes: float64(15), int64(14), object(24)
    memory usage: 73.0+ MB

    ############ Data Cleaning for Future Regression Models
    dFppsc.drop(“Product Description”,axis=1,inplace=True)

    dFppsc.drop(“Product Description”,axis=1,inplace=True)

    dFppsc[“order date (DateOrders)”]=pd.to_datetime(dFppsc[“order date (DateOrders)”])
    dFppsc[“shipping date (DateOrders)”]=pd.to_datetime(dFppsc[“shipping date (DateOrders)”])
    dFppsc=dFppsc.sort_values(by=”order date (DateOrders)”)

    dFppsc.drop([“Benefit per order”,”Sales per customer”,”Order Item Cardprod Id”,”Order Item Product Price”,”Product Category Id”,”Order Customer Id”],axis=1,inplace=True)

    ######### Variance Inflation Factor (VIF)
    vif=pd.DataFrame()
    vif[“columns”]=[‘Order Item Discount’,
    ‘Order Item Discount Rate’, ‘Order Item Profit Ratio’,
    ‘Order Item Quantity’, ‘Sales’, ‘Order Item Total’,
    ‘Order Profit Per Order’,’Product Price’]
    vif[“vif value”] = [variance_inflation_factor(dFppsc[[‘Order Item Discount’,
    ‘Order Item Discount Rate’, ‘Order Item Profit Ratio’,
    ‘Order Item Quantity’, ‘Sales’, ‘Order Item Total’,
    ‘Order Profit Per Order’,’Product Price’]].values, i) for i in range(len(vif[“columns”]))]
    vif.T

    Variance Inflation Factor (VIF)

    df1=dFppsc.drop([“Order Item Total”,”Product Price”,”Order Item Discount Rate”,”Order Profit Per Order”],axis=1)

    vif=pd.DataFrame()
    vif[“columns”]=[‘Sales’,
    ‘Order Item Quantity’,’Order Item Discount’,’Order Item Profit Ratio’]
    vif[“data”] = [variance_inflation_factor(df1[[‘Sales’,
    ‘Order Item Quantity’,’Order Item Discount’,’Order Item Profit Ratio’]].values, i) for i in range(len(vif[“columns”]))]
    dFppsc[“Delivery Status”].unique()

    array(['Advance shipping', 'Late delivery', 'Shipping on time',
           'Shipping canceled'], dtype=object)

    dFppsc.groupby(“Delivery Status”)[“Order Id”].count()

    Delivery Status
    Advance shipping     41592
    Late delivery        98977
    Shipping canceled     7754
    Shipping on time     32196
    Name: Order Id, dtype: int64

    dFppsc[dFppsc[“Delivery Status”]==”Shipping canceled”].groupby(“Order Status”).agg(tc=(“Order Id”,”count”))


    Order Statustc
    CANCELED3692
    SUSPECTED_FRAUD4062

    dF_clean=dFppsc[dFppsc[“Delivery Status”]!=”Shipping canceled”].copy()

    ############# RFM Analysis
    dF_frequency=dF_clean.groupby([“Customer Id”]).agg(Total_count=(“Order Id”,”nunique”))

    dF_clean[“Year”]=dF_clean[“order date (DateOrders)”].dt.year
    dF_frequency_year=dF_clean.groupby([“Customer Id”,”Year”],as_index=False)[“Order Id”].nunique()

    dF_frequency_year=dF_frequency_year.pivot_table(index=”Customer Id”,columns=”Year”,values=”Order Id”,fill_value=0)

    pd.set_option(“display.max_columns”,None)
    dF_clean[dF_clean[“Customer Id”]==12436]

    dt1=datetime.datetime.strptime(’02-10-2017 12:46:00′, ‘%d-%m-%Y %H:%M:%S’)
    dt2=dt1-datetime.timedelta(days=5)

    df_exploring=dF_clean[dF_clean[“order date (DateOrders)”]>=dt2]
    df_exploring.to_excel(“afterdt1.xlsx”)
    df_exploring[df_exploring[“order date (DateOrders)”]>=dt1]

    7964 rows × 47 columns

    dF_clean[“Customer Full Name”]=dF_clean[“Customer Fname”]+” “+dF_clean[“Customer Lname”]
    datetime_val = datetime.datetime.strptime(’02-10-2017 12:46:00′, ‘%d-%m-%Y %H:%M:%S’)
    Customer_Names_After_Dt=dF_clean[dF_clean[“order date (DateOrders)”]>=datetime_val][[“Customer Full Name”,”Customer City”]].drop_duplicates()

    Customer_Name_Common_Before_After_dt=pd.merge(dF_clean[(dF_clean[“order date (DateOrders)”]<=datetime_val)],Customer_Names_After_Dt,how=”inner”,on=[“Customer Full Name”,”Customer City”])
    Records=pd.merge(Customer_Name_Common_Before_After_dt[[“Customer Full Name”,”Customer City”]].drop_duplicates(),dF_clean,how=”inner”,on=[“Customer Full Name”,”Customer City”])

    dF_clean=dF_clean[dF_clean[“order date (DateOrders)”]<datetime_val].copy()

    ####### FREQUENCY

    dF_frequency=dF_clean[[“Customer Id”,”Order Id”]].groupby(“Customer Id”,as_index=False).nunique()
    dF_frequency.columns=[“Customer Id”,”Frequency”]
    sns.histplot(dF_frequency.Frequency,bins=15,kde=False)

    Frequency Histogram:

    Frequency histogram

    dF_frequency[“F”],Intervals_Frequency=pd.qcut(dF_frequency[“Frequency”],q=3,labels=[1,2,3],retbins=True)

    ########### RECENCY

    dF_recency=dF_clean[[“Customer Id”,”order date (DateOrders)”]].groupby(“Customer Id”,as_index=False).max()
    dF_recency.rename(columns ={“order date (DateOrders)”:”last_purchase_date”},inplace=True)
    max_date=dF_recency[“last_purchase_date”].max()
    dF_recency[“recency”]=max_date-dF_recency[“last_purchase_date”]
    dF_recency[“recency”]=dF_recency[“recency”].dt.days

    sns.displot(dF_recency.recency,bins=8,kde=False)

    Recency Histogram

    Recency histogram

    dF_recency[‘R’],Intervals_Recency=pd.qcut(dF_recency[“recency”],q=3,labels=[3,2,1],retbins=True)

    ######MONETARY

    dF_monetory=dF_clean[[“Customer Id”,”Order Item Total”]].groupby(“Customer Id”,as_index=False).sum()
    dF_monetory.columns=[“Customer Id”,”Sum_of_Sales”]
    dF_monetory[“M”],Intervals_Monetory=pd.qcut(dF_monetory[“Sum_of_Sales”],q=3,labels=[1,2,3],retbins=True)

    sns.displot(dF_monetory.Sum_of_Sales,kde=True)

    Monetary histogram

    Monetary histogram

    ####RFM SCORE

    dF_rfm=pd.merge(dF_recency[[“Customer Id”,”R”]],dF_monetory[[“Customer Id”,”M”]],on=”Customer Id”,how=”inner”)
    dF_rfm=pd.merge(dF_rfm,dF_frequency[[“Customer Id”,”F”]],on=”Customer Id”,how=”inner”)
    dF_rfm[“RFM”]=(dF_rfm[“R”]).astype(str)+(dF_rfm[“F”]).astype(str)+(dF_rfm[“M”]).astype(str)

    ######SEGMENT STRATEGY

    Let’s read the input data

    pd.set_option(‘display.max_colwidth’, None)
    Segment_Strategy=pd.read_excel(“Segments_Strategy.xlsx”)
    Segment_Strategy

    Segment Strategy table

    #####Customer Segment Data preparation

    def Customer_Segment(data):
    if data[“R”]==1 and data[“F”] in [1,2,3] and (data[“M”]==3):
    return “Lost Customers – Big Spenders”
    elif data[“R”]== 1 and data[“F”] in [1,2] and data[“M”] in [1,2]:
    return “Lost Customers – Bargain”
    elif data[“R”] in [1,2] and data[“F”]==3 and data[“M”] in [1,2]:
    return “Lost/Almost Lost Customers – Loyal”
    elif (data[“R”]==3) and (data[“F”]==3) and data[“M”] in [1,2]:
    return “Loyal Customers”
    elif (data[“R”]==3) and data[“F”] in [3,2] and data[“M”]==3:
    return “Big Spenders”
    elif (data[“R”]==3) and (data[“F”]==1) and data[“M”] in [1,2,3]:
    return “New Customers”
    elif (data[“R”]==3) and (data[“F”]==2) and data[“M”] in [1,2]:
    return “Bargain Customers”
    elif (data[“R”]==2) and data[“F”]==2 and data[“M”] in [1,2]:
    return “Occasional Customers-Bargain”
    elif (data[“R”]==2) and data[“F”] in [2,3] and data[“M”]==3:
    return “Occasional Customers- Big Spenders”
    elif (data[“R”]==2) and data[“F”]==1 and data [“M”] in [1,2,3]:
    return “Unsatisfied Customers”
    else:
    return “No Segment”

    dF_rfm[“R”]=dF_rfm[“R”].astype(“category”)
    dF_rfm[“F”]=dF_rfm[“F”].astype(“category”)
    dF_rfm[“M”]=dF_rfm[“M”].astype(“category”)
    dF_rfm[“Segment”]=dF_rfm.apply(Customer_Segment,axis=1)

    Segment_count=(dF_rfm.groupby(“Segment”,as_index=False).agg(Total_Count=(“Customer Id”,”count”))).sort_values(by=”Total_Count”,ascending=False)

    Let’s plot Number of Customer in Each Segment

    fig2=go.Figure()
    fig2.add_trace(go.Bar(x=Segment_count.Segment,
    y=Segment_count.Total_Count,
    hovertemplate =”%{label}
    Number of Customers:%{value}”,
    texttemplate = “%{value}”))
    fig2.update_layout(title=”Number of Customers in Each Segment”,
    xaxis=dict(title=”Customer Segment”),
    yaxis=dict(title=”Number Of Customers”),width=800)

    Number of Customers vs Customer Segment

    Let’s plot Revenue Generated by Each Segment

    dF_clean=pd.merge(dF_clean,dF_rfm[[“Customer Id”,”RFM”,”Segment”]],on=”Customer Id”,how=”left”)
    dF_segment_revenue=dF_clean.groupby(“Segment”,as_index=False).agg(Total_Revenue=(“Order Item Total”,”sum”)).sort_values(by=”Total_Revenue”,ascending=False)
    fig3=go.Figure()
    fig3.add_trace(go.Bar(x=dF_segment_revenue.Segment,
    y=dF_segment_revenue.Total_Revenue,
    hovertemplate =”%{label}
    Revenue:%{value}”,
    texttemplate = “%{value}”))
    fig3.update_layout(title=”Revenue Generated by Each Segment”,xaxis=dict(title=”Customer Segment”),
    yaxis=dict(title=”Revenue”))

    Revenue Generated by Each Segment

    ######TEXT WRAPS & SEGMENT TREE MAPS

    def wrap(a):
    str1=a.split(” “)
    final_str=””
    for c in str1:
    final_str=final_str+c+’

    return final_str

    Df1=dF_clean.groupby([“Segment”,”Department Name”,”Product Name”],
    as_index=False
    ).agg(Total_Revenue=(“Order Item Total”,”sum”),Total_Quantity=(“Order Item Quantity”,”sum”))
    Df1[“Product Name”]=Df1[“Product Name”].apply(wrap) #Applying text wrap because our product names are too long
    segment=[“Big Spenders”,”Bargain Customers”,”Loyal Customers”,”New Customers”]
    def type1(data):
    if “Big Spenders” in data[“Segment”]:
    return “Big spenders”
    elif “Bargain” in data[“Segment”]:
    return “Bargain”
    elif “Loyal” in data[“Segment”]:
    return “Loyal”
    elif “New” in data[“Segment”]:
    return “New Customers”
    else:
    return “Unsatisfied”

    Df1[“Customer Type”]=Df1.apply(type1,axis=1)

    data_tm=list(Df1.groupby(“Customer Type”))+[(‘All’,Df1)]

    traces=[]
    buttons=[]
    visible=[]
    for i,d in enumerate(data_tm):
    visible=[False]*len(data_tm)
    visible[i]=True
    title=d[0]
    traces.append((px.treemap(d[1],
    path=[px.Constant(“All”),”Customer Type”,”Segment”,”Department Name”,”Product Name”],
    values=”Total_Revenue”).update_traces(visible= True if i==0 else False)).data[0])
    buttons.append(dict(label=title,
    method=”update”,
    args=[{
    “visible”:visible},
    {“title”:f”{title}”}
    ]))

    updatemenus=[{“active”:0,”buttons”:buttons}]
    fig11=go.Figure(data=traces,layout=dict(updatemenus=updatemenus))
    fig11.update_layout(title=data_tm[0][0],title_x=0.6)
    fig11.update_coloraxes(colorbar_title_text=”Total
    Quantity”)
    fig11.show()

    Bargain tree map
    Big spenders tree map
    Loyal tree map
    New Customers tree map
    Unsatisfied tree map
    All tree map

    #####ABC ANALYSIS

    Total_Products=dF_clean[“Product Name”].nunique()
    print(“Total Number of products: “+f”{Total_Products}”)

    Total Number of products: 100

    Revenue_ABC=dF_clean.groupby([“Department Name”,”Product Name”]).agg(Total_Revenue=(“Order Item Total”,”sum”)).sort_values(by=”Total_Revenue”,ascending=False).reset_index()
    Revenue_ABC[“cum_sum”]=Revenue_ABC[“Total_Revenue”].cumsum()
    Revenue_ABC[“cum_per”]=Revenue_ABC[“cum_sum”]/Revenue_ABC[“Total_Revenue”].sum()*100
    Revenue_ABC[“per”]=Revenue_ABC[“cum_per”]-Revenue_ABC[“cum_per”].shift(1)
    Revenue_ABC.loc[0,”per”]=Revenue_ABC[“cum_per”][0]

    def ABC(data):
    if data[“cum_per”]<=75: return “A” elif data[“cum_per”]>75 and data[“cum_per”]<=95: return “B” elif data[“cum_per”]>95:
    return “C”

    Revenue_ABC[“ABC_Revenue”]=Revenue_ABC.apply(ABC,axis=1)

    Bar_graph_Abc=Revenue_ABC[[“ABC_Revenue”,”Product Name”,”Total_Revenue”]].groupby(“ABC_Revenue”).agg(Revenue=(“Total_Revenue”,”sum”),count=(“Product Name”,”count”))

    ##########Bar_graph_Abc
    fig2=go.Figure(go.Bar(x=Bar_graph_Abc.index,
    y=Bar_graph_Abc[“Revenue”],
    hovertemplate =”%{label}
    Revenue:%{value}”,
    texttemplate = “Revenue
    %{value}”,
    marker_color=[“orange”,”lightgreen”,”red”],
    showlegend=False))
    fig2.add_trace(
    go.Scatter(
    x=Bar_graph_Abc.index,
    y=Bar_graph_Abc[“count”],
    name=”Number Of Products”,
    mode=’lines’,
    line = dict(color=’blue’, width=3),
    yaxis=”y2″,
    marker_line_width = 0
    ))

    fig2.update_layout(
    title=”Revenue Generated By Products in Different ABC Segments”,
    xaxis=dict(title=”Segment” ),
    yaxis=dict(title=”Revenue”,showgrid=False),
    yaxis2=dict(title=”Number Of Products”, anchor=”x”, overlaying=”y”,side=”right”,dtick=10),
    legend = dict(x = 1.05, y = 1))
    fig2.show()

    Revenue Generated By Products in Different ABC Segments

    ######A Segment Products in Top 6 Cities

    Top6Cities=dF_clean[[“Order City”,”Order Item Total”]].groupby(“Order City”).sum().sort_values(by=”Order Item Total”,ascending=False).head(6)

    data=dF_clean[dF_clean[“Order City”].isin(Top6Cities.index.tolist())].groupby([“Order City”,”Product Name”],as_index=False).agg(Total_Revenue=(“Order Item Total”,”sum”)).sort_values(by=[“Order City”,”Total_Revenue”],ascending=False)
    data[“sum”]=data.groupby(“Order City”)[“Total_Revenue”].transform(‘sum’)
    data[“cum_per”]=data.groupby(“Order City”)[“Total_Revenue”].cumsum()/data[“sum”]*100
    data[“segment”]=data.apply(ABC,axis=1)
    data[“Product Name”]=data[“Product Name”].apply(wrap)
    data.sort_values(by=[“sum”,”Total_Revenue”],inplace=True,ascending=False)

    fig = px.bar(data[data[“segment”]==”A”], x=’Product Name’, y=’Total_Revenue’,
    facet_col=’Order City’,facet_col_wrap=2,facet_row_spacing=0.12,
    height=1500,title=”A Segment Products in Top 6 Cities”)
    fig.for_each_annotation(lambda a: a.update(text=a.text.split(“=”)[-1]))
    fig.update_xaxes(showticklabels=True,tickangle=0)
    fig.update_layout(width=950)
    fig.show()

    A Segment Products in Top 6 Cities
    A Segment Products in Top 6 Cities
    A Segment Products in Top 6 Cities

    Read more here.

    Bank Churn ML Prediction

    Let’s set the working directory

    import os
    os.chdir(‘YOURPATH’)
    os. getcwd()

    Basic Libraries
    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    import seaborn as sns

    Processing libraries
    from sklearn.preprocessing import MinMaxScaler
    from sklearn.decomposition import PCA
    from sklearn.feature_selection import SelectKBest, f_classif, chi2

    Model libraries
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.ensemble import AdaBoostClassifier
    from sklearn.ensemble import GradientBoostingClassifier

    from sklearn.model_selection import train_test_split
    from sklearn import metrics

    Let’s read the input dataset

    df = pd.read_csv(‘BankChurners.csv’)

    df = df.drop(‘Naive_Bayes_Classifier_Attrition_Flag_Card_Category_Contacts_Count_12_mon_Dependent_count_Education_Level_Months_Inactive_12_mon_2’, axis = 1)
    df = df.drop(‘Naive_Bayes_Classifier_Attrition_Flag_Card_Category_Contacts_Count_12_mon_Dependent_count_Education_Level_Months_Inactive_12_mon_1’, axis = 1

    Let’s check the statistics

    df.describe().T

    Descriptive statistics

    Number of churn and non-churn
    counts = df.Attrition_Flag.value_counts()
    perc_churn = (counts[1] / (counts[0] + counts[1])) * 100

    No. of duplicates
    duplicates = len(df[df.duplicated()])

    No. of missing values
    missing_values = df.isnull().sum().sum()

    Data types in dataset
    types = df.dtypes.value_counts()
    print(“Churn Rate = %.1f %%”%(perc_churn))
    print(‘Number of Duplicate Entries: %d’%(duplicates))
    print(‘Number of Missing Values: %d’%(missing_values))
    print(‘Number of Features: %d’%(df.shape[1]))
    print(‘Number of Customers: %d’%(df.shape[0]))
    print(‘Data Types and Frequency in Dataset:’)
    print(types)

    Churn Rate = 16.1 %
    Number of Duplicate Entries: 0
    Number of Missing Values: 0
    Number of Features: 21
    Number of Customers: 10127
    Data Types and Frequency in Dataset:
    int64      10
    object      6
    float64     5
    Name: count, dtype: int64

    ####### Data Pre-processing

    Make gender and outcome numerical

    df[‘Gender’] = df[‘Gender’].map({‘M’: 1, ‘F’: 0})
    df[‘Attrition_Flag’] = df[‘Attrition_Flag’].map({‘Attrited Customer’: 1, ‘Existing Customer’: 0})

    drop client id
    df = df.drop(‘CLIENTNUM’, axis = 1)

    catcols = df.select_dtypes(exclude = [‘int64′,’float64’]).columns
    intcols = df.select_dtypes(include = [‘int64’]).columns
    floatcols = df.select_dtypes(include = [‘Float64’]).columns

    One-hot encoding on categorical columns
    df = pd.get_dummies(df, columns = catcols)

    Minmax scaling numeric features
    for col in df[floatcols]:
    df[col] = MinMaxScaler().fit_transform(df[[col]])

    for col in df[intcols]:
    df[col] = MinMaxScaler().fit_transform(df[[col]])

    print(‘New Number of Features: %d’%(df.shape[1]))

    New Number of Features: 37

    Split into X and y
    X = df.drop(‘Attrition_Flag’, axis = 1)
    y = df[‘Attrition_Flag’]

    ##################Correlation between numerical features

    heat = df.corr()
    plt.figure(figsize=[16,8])
    plt.title(“Correlation between numerical features”, size = 25, pad = 20, color = ‘#8cabb6’)
    sns.heatmap(heat,cmap = sns.diverging_palette(20, 220, n = 200), annot=False)
    plt.show()

    Correlation between numerical features.

    #######Feature Engineering

    print(“Correlation Coefficient of all the Features”)
    corr = df.corr()
    corr.sort_values([“Attrition_Flag”], ascending = False, inplace = True)
    correlations = corr.Attrition_Flag
    a = correlations[correlations > 0.1]
    b = correlations[correlations < -0.1]
    top_corr_features = a._append(b)
    top_corr_features

    Correlation Coefficient of all the Features
    

    Out[12]:

    Attrition_Flag              1.000000
    Contacts_Count_12_mon       0.204491
    Months_Inactive_12_mon      0.152449
    Total_Amt_Chng_Q4_Q1       -0.131063
    Total_Relationship_Count   -0.150005
    Total_Trans_Amt            -0.168598
    Avg_Utilization_Ratio      -0.178410
    Total_Revolving_Bal        -0.263053
    Total_Ct_Chng_Q4_Q1        -0.290054
    Total_Trans_Ct             -0.371403
    Name: Attrition_Flag, dtype: float64

    Let’s define the following functions

    def plot_importances(model, model_name, features_to_plot, feature_names):
    #fit model and performances
    model.fit(X,y)
    importances = model.feature_importances_

    # sort and rank importances
    indices = np.argsort(importances)
    best_features = np.array(feature_names)[indices][-features_to_plot:]
    values = importances[indices][-features_to_plot:]
    
    # plot a graph
    y_ticks = np.arange(0, features_to_plot)
    fig, ax = plt.subplots()
    ax.barh(y_ticks, values, color = '#b2c4cc')
    ax.set_yticklabels(best_features)
    ax.set_yticks(y_ticks)
    ax.set_title("%s Feature Importances"%(model_name))
    fig.tight_layout()
    plt.show()
    

    def best_features(model, features_to_plot, feature_names):
    # get list of best features
    model.fit(X,y)
    importances = model.feature_importances_

    indices = np.argsort(importances)
    best_features = np.array(feature_names)[indices][-features_to_plot:]
    return best_features
    

    Let’s plot feature importance weights for the following 3 models

    feature_names = list(X.columns)

    model1 = RandomForestClassifier(random_state = 1234)
    plot_importances(model1, ‘Random Forest’, 10, feature_names)

    model2 = GradientBoostingClassifier(n_estimators = 100, learning_rate = 1.0, max_depth = 1, random_state = 0)
    plot_importances(model2, ‘XGBoost’, 10, feature_names)

    model3 = AdaBoostClassifier(n_estimators = 100, learning_rate = 1.0, random_state = 0)
    plot_importances(model3, ‘AdaBoost’, 10, feature_names)

    Random Forest Feature Importances
    XGBoost Feature Importances
    AdaBoost Feature Importances

    Looking at the F-value between label/feature for classification tasks

    f_selector = SelectKBest(f_classif, k = 10)
    f_selector.fit_transform(X, y)
    f_selector_best = f_selector.get_feature_names_out()
    print(f_selector_best)

    ['Gender' 'Total_Relationship_Count' 'Months_Inactive_12_mon'
     'Contacts_Count_12_mon' 'Total_Revolving_Bal' 'Total_Amt_Chng_Q4_Q1'
     'Total_Trans_Amt' 'Total_Trans_Ct' 'Total_Ct_Chng_Q4_Q1'
     'Avg_Utilization_Ratio']

    forest_best = list(best_features(model1, 10, feature_names))
    XG_best = list(best_features(model2, 10, feature_names))
    ada_best = list(best_features(model3, 10, feature_names))
    top_corr_features = list(top_corr_features.index[1:])
    f_selector_best = list(f_selector_best)

    best_features_overall = forest_best + XG_best + ada_best + top_corr_features + f_selector_best

    # create a dictionary with the number of times features appear
    from collections import Counter
    count_best_features = dict(Counter(best_features_overall))

    # list of the features without any repeatitions
    features_no_repeats = list(dict.fromkeys(best_features_overall))

    display(count_best_features)

    {'Customer_Age': 3,
     'Avg_Open_To_Buy': 3,
     'Credit_Limit': 2,
     'Total_Amt_Chng_Q4_Q1': 5,
     'Avg_Utilization_Ratio': 3,
     'Total_Relationship_Count': 4,
     'Total_Revolving_Bal': 5,
     'Total_Ct_Chng_Q4_Q1': 5,
     'Total_Trans_Ct': 5,
     'Total_Trans_Amt': 5,
     'Months_Inactive_12_mon': 4,
     'Contacts_Count_12_mon': 4,
     'Gender': 1}

    # get list of features with high counts in the dictionary
    def get_features(threshold):
    # remove features below a certain number of appearances
    chosen_features = []
    for i in features_no_repeats:
    if count_best_features[i] > threshold:
    chosen_features.append(i)
    return chosen_features

    chosen_features = get_features(2)
    chosen_features.remove(‘Avg_Open_To_Buy’)
    chosen_features.remove(‘Avg_Utilization_Ratio’)
    chosen_features

    ['Customer_Age',
     'Total_Amt_Chng_Q4_Q1',
     'Total_Relationship_Count',
     'Total_Revolving_Bal',
     'Total_Ct_Chng_Q4_Q1',
     'Total_Trans_Ct',
     'Total_Trans_Amt',
     'Months_Inactive_12_mon',
     'Contacts_Count_12_mon']

    ######### ML Model Evaluation

    def eval_model(model, model_name, X, y, threshold):
    # make X the chosen subset
    chosen_features = get_features(threshold)
    X = X[chosen_features]

    train_x, test_x, train_y, test_y = train_test_split(X, y, test_size = 0.25, random_state = 42)
    
    # fit model
    model.fit(train_x,train_y)
    model.score(test_x, test_y)
    pred_test = model.predict(test_x)
    
    # get metrics
    f1 = metrics.f1_score(test_y, pred_test)
    test_acc = metrics.accuracy_score(test_y, pred_test)
    con = metrics.confusion_matrix(test_y, pred_test)
    
    print(con,'%s model with %s threshold: %.4f F1-score and %.4f accuracy'%(model_name, threshold, f1, test_acc))
    

    Let’s use the full dataset to perform train/test data splitting with test_size = 0.25

    train_x, test_x, train_y, test_y = train_test_split(X, y, test_size = 0.25, random_state = 42)

    ####Model1 Fitting

    model1.fit(train_x,train_y)
    model1.score(test_x, test_y)
    pred_test = model1.predict(test_x)

    f1 = metrics.f1_score(test_y, pred_test)
    test_acc = metrics.accuracy_score(test_y, pred_test)
    con = metrics.confusion_matrix(test_y, pred_test)

    print(con,f1,test_acc)

    [[2094   19]
     [  96  323]] 0.8488830486202367 0.9545813586097947

    Let’s look at the reduced dataset and repeat model1 fitting

    chosen_features = get_features(2)
    chosen_features.remove(‘Avg_Open_To_Buy’)
    chosen_features.remove(‘Avg_Utilization_Ratio’)
    Xnew = X[chosen_features]

    train_x, test_x, train_y, test_y = train_test_split(Xnew, y, test_size = 0.25, random_state = 42)

    model1.fit(train_x,train_y)
    model1.score(test_x, test_y)
    pred_test = model1.predict(test_x)

    f1 = metrics.f1_score(test_y, pred_test)
    test_acc = metrics.accuracy_score(test_y, pred_test)
    con = metrics.confusion_matrix(test_y, pred_test)

    print(con,f1,test_acc)

    [[2086   27]
     [  60  359]] 0.8919254658385094 0.9656398104265402

    ######More ML Classifiers

    from sklearn.ensemble import BaggingClassifier
    from sklearn.neighbors import KNeighborsClassifier
    model4 = BaggingClassifier(KNeighborsClassifier(n_neighbors = 7), max_samples = 0.8, max_features = 0.8)

    eval_model(model4, ‘KNN’, X, y, 2)

    [[2074   39]
     [ 176  243]] KNN model with 2 threshold: 0.6933 F1-score and 0.9151 accuracy

    from sklearn.linear_model import LogisticRegression

    model5 = LogisticRegression(random_state=1)
    eval_model(model5, ‘Logistic’, X, y, 2)

    [[2059   54]
     [ 217  202]] Logistic model with 2 threshold: 0.5985 F1-score and 0.8930 accuracy

    # lists of possible parameters
    n = [400, 450, 500, 550, 600, 650, 700, 750, 800, 850, 900, 950, 1000]
    depth = [8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20]
    rand = [600, 650, 700, 750, 800, 850, 900, 950, 1000, 1050, 1100, 1150, 1200, 1250]

    forest = RandomForestClassifier(n_estimators = 100, max_depth = 15, random_state = 750)
    eval_model(forest, ‘forest’, X, y, 2)

    [[2090   23]
     [  64  355]] forest model with 2 threshold: 0.8908 F1-score and 0.9656 accuracy

    ####Further Evaluation of Random Forest

    def eval_forest(model, model_name, X, y, threshold, n, depth, rand):
    # create subset from feature selection
    chosen_features = get_features(threshold)
    chosen_features.remove(‘Avg_Open_To_Buy’)
    chosen_features.remove(‘Avg_Utilization_Ratio’)
    X = X[chosen_features]

    train_x, test_x, train_y, test_y = train_test_split(X, y, test_size = 0.25, random_state = 42)
    
    model.fit(train_x,train_y)
    model.score(test_x, test_y)
    pred_test = model.predict(test_x)
    
    f1 = metrics.f1_score(test_y, pred_test)
    test_acc = metrics.accuracy_score(test_y, pred_test)
    con = metrics.confusion_matrix(test_y, pred_test)
    
    print('Model: %s Threshold: %s F1-Score %.4f Accuracy: %.4f n_estimators: %s depth: %s rand: %s'%(model_name, threshold, f1, test_acc,n,depth,rand))
    

    Let’s define the model

    forest = RandomForestClassifier(n_estimators = 850, max_depth = 19, random_state = 1200)

    and train/test this model using 2 selected features

    chosen_features = get_features(2)
    chosen_features.remove(‘Avg_Open_To_Buy’)
    chosen_features.remove(‘Avg_Utilization_Ratio’)
    X_new = X[chosen_features]

    train_x, test_x, train_y, test_y = train_test_split(X_new, y, test_size = 0.25, random_state = 42)

    forest.fit(train_x,train_y)
    forest.score(test_x, test_y)
    pred_test = forest.predict(test_x)

    f1 = metrics.f1_score(test_y, pred_test)
    test_acc = metrics.accuracy_score(test_y, pred_test)
    con = metrics.confusion_matrix(test_y, pred_test)
    precision = metrics.precision_score(test_y, pred_test)
    recall = metrics.recall_score(test_y, pred_test)
    roc = metrics.roc_auc_score(test_y, pred_test)
    print(‘Accuracy Score’, test_acc)
    print(‘Precision’, precision)
    print(‘Recall’, recall)
    print(‘F1-Score’, f1)
    print(‘ROC Score’, roc)
    print(con)

    Accuracy Score 0.9664296998420221
    Precision 0.9304123711340206
    Recall 0.8615751789976134
    F1-Score 0.8946716232961587
    ROC Score 0.9243985691485936
    [[2086   27]
     [  58  361]]

    #####CROSS-VALIDATION

    from sklearn.model_selection import cross_validate

    cv_results = cross_validate(forest, X_new, y, scoring = (‘f1’, ‘accuracy’, ‘roc_auc’), cv = 8)
    sorted(cv_results.keys())

    ['fit_time', 'score_time', 'test_accuracy', 'test_f1', 'test_roc_auc']

    cv_results[‘test_roc_auc’]

    array([0.91056078, 0.96866615, 0.98541168, 0.9918462 , 0.99539807,
           0.99796213, 0.8989536 , 0.90523967])

    cv_results[‘test_accuracy’]

    array([0.88230648, 0.94233807, 0.96366509, 0.96366509, 0.96840442,
           0.97630332, 0.92969984, 0.92885375])

    #####ML PERFORMANCE ANALYSIS USING Scikit Plot

    Import standard libraries

    import scikitplot as skplt

    import sklearn

    from sklearn.model_selection import train_test_split

    from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor, GradientBoostingClassifier, ExtraTreesClassifier
    from sklearn.linear_model import LinearRegression, LogisticRegression
    from sklearn.cluster import KMeans
    from sklearn.decomposition import PCA

    import matplotlib.pyplot as plt

    import sys
    import warnings
    warnings.filterwarnings(“ignore”)

    print(“Scikit Plot Version : “, skplt.version)
    print(“Scikit Learn Version : “, sklearn.version)
    print(“Python Version : “, sys.version)

    %matplotlib inline

    Scikit Plot Version :  0.3.7
    Scikit Learn Version :  1.2.2
    Python Version :  3.9.16 (main, Jan 11 2023, 16:16:36) [MSC v.1916 64 bit (AMD64)]

    #####RFC Learning Curve

    skplt.estimators.plot_learning_curve(forest, train_x, train_y,
    cv=7, shuffle=True, scoring=”accuracy”,
    n_jobs=-1, figsize=(6,4), title_fontsize=”large”, text_fontsize=”large”,
    title=”RFC Learning Curve”);

    RFC Learning Curve

    ######KNN Classification Learning Curve

    skplt.estimators.plot_learning_curve(model4, train_x, train_y,
    cv=7, shuffle=True, scoring=”accuracy”,
    n_jobs=-1, figsize=(6,4), title_fontsize=”large”, text_fontsize=”large”,
    title=”KNN Classification Learning Curve”);

    KNN Classification Learning Curve

    ####ETC Learning Curve

    skplt.estimators.plot_learning_curve(ExtraTreesClassifier(), train_x, train_y,
    cv=7, shuffle=True, scoring=”accuracy”,
    n_jobs=-1, figsize=(6,4), title_fontsize=”large”, text_fontsize=”large”,
    title=”ETC Learning Curve”);

    ETC Learning Curve

    #####GBC Learning Curve

    skplt.estimators.plot_learning_curve(GradientBoostingClassifier(), train_x, train_y,
    cv=7, shuffle=True, scoring=”accuracy”,
    n_jobs=-1, figsize=(6,4), title_fontsize=”large”, text_fontsize=”large”,
    title=”GBC Learning Curve”);

    GBC Learning Curve

    #######DTC Learning Curve

    from sklearn.tree import DecisionTreeClassifier
    skplt.estimators.plot_learning_curve(DecisionTreeClassifier(), train_x, train_y,
    cv=7, shuffle=True, scoring=”accuracy”,
    n_jobs=-1, figsize=(6,4), title_fontsize=”large”, text_fontsize=”large”,
    title=”DTC Learning Curve”);

    DTC Learning Curve

    ######ABC Learning Curve

    from sklearn.ensemble import AdaBoostClassifier
    skplt.estimators.plot_learning_curve(AdaBoostClassifier(), train_x, train_y,
    cv=7, shuffle=True, scoring=”accuracy”,
    n_jobs=-1, figsize=(6,4), title_fontsize=”large”, text_fontsize=”large”,
    title=”ABC Learning Curve”);

    ABC Learning Curve

    ########MLPC Learning Curve

    from sklearn.neural_network import MLPClassifier
    skplt.estimators.plot_learning_curve(MLPClassifier(), train_x, train_y,
    cv=7, shuffle=True, scoring=”accuracy”,
    n_jobs=-1, figsize=(6,4), title_fontsize=”large”, text_fontsize=”large”,
    title=”MLPC Learning Curve”);

    MLPC Learning Curve

    #######GNB Learning Curve

    from sklearn.naive_bayes import GaussianNB
    skplt.estimators.plot_learning_curve(GaussianNB(), train_x, train_y,
    cv=7, shuffle=True, scoring=”accuracy”,
    n_jobs=-1, figsize=(6,4), title_fontsize=”large”, text_fontsize=”large”,
    title=”GNB Learning Curve”);

    GNB Learning Curve

    ######LR Learning Curve

    skplt.estimators.plot_learning_curve(LogisticRegression(), train_x, train_y,
    cv=7, shuffle=True, scoring=”accuracy”,
    n_jobs=-1, figsize=(6,4), title_fontsize=”large”, text_fontsize=”large”,
    title=”LR Learning Curve”);

    LR Learning Curve

    #######QDA Learning Curve

    from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
    skplt.estimators.plot_learning_curve(QuadraticDiscriminantAnalysis(), train_x, train_y,
    cv=7, shuffle=True, scoring=”accuracy”,
    n_jobs=-1, figsize=(6,4), title_fontsize=”large”, text_fontsize=”large”,
    title=”QDA Learning Curve”);

    QDA Learning Curve

    ########SVC Learning Curve

    from sklearn.svm import SVC
    skplt.estimators.plot_learning_curve(SVC(), train_x, train_y,
    cv=7, shuffle=True, scoring=”accuracy”,
    n_jobs=-1, figsize=(6,4), title_fontsize=”large”, text_fontsize=”large”,
    title=”SVC Learning Curve”);

    SVC Learning Curve

    ######CALIBRATION PLOTS

    lr_probas = QuadraticDiscriminantAnalysis().fit(train_x, train_y).predict_proba(test_x)
    rf_probas = RandomForestClassifier().fit(train_x, train_y).predict_proba(test_x)
    gb_probas = GradientBoostingClassifier().fit(train_x, train_y).predict_proba(test_x)
    et_scores = AdaBoostClassifier().fit(train_x, train_y).predict_proba(test_x)

    probas_list = [lr_probas, rf_probas, gb_probas, et_scores]
    clf_names = [‘QDA’, ‘RFC’, ‘GBC’, ‘ABC’]

    skplt.metrics.plot_calibration_curve(test_y,
    probas_list,
    clf_names, n_bins=15,
    figsize=(12,6)
    );