Data-Driven ML Credit Card Fraud Detection

Data Analytics RCA applied to Creadit Card Fraud Detection [1]

Contents:

  1. Introduction
    1. Business Case
    2. Global fraud average costs:
    3. Old School Approach
    4. Machine Learning Approach
    5. The ultimate goal of ML solutions is to minimize fraud losses while keeping customers safe by maximizing their CLV.
  2. The Algorithm
  3. E2E Pipeline
  4. Prerequisites
    1. Adjusting the display to fit rows and columns effectively
  5. The Input Dataset
    1. Checking the number of columns and rows in the dataset
  6. Data Preparation
  7. Exploratory Data Analysis (EDA)
    1. Univariate Analysis
    2. Bivariate Analysis
    3. Multivariate Analysis
  8. Feature Engineering
  9. Data Resampling and Base Model Testing
    1. Random Under-Sampling;
    2. Random Over-Sampling;
    3. SMOTE – Synthetic Minority Oversampling Technique;
    4. ADASYN – Adaptive Synthetic Sampling Method;
    5. SMOTETomek – Over-sampling followed by under-sampling.
    6. Unddersampling leads to the loss of data so we will not use that.
    7. We will proceed with Random Over-Sampling, SMOTE – Synthetic Minority Oversampling Technique,
    8. ADASYN – Adaptive Synthetic Sampling Method and see which technique works better.
  10. Hyper-Parameter Optimization (HPO)
    1. * Logistic Regression SMOTE model
    2. * Decision Tree SMOTE model
    3. * XGBoost ADASYN model
    4. * Random Forest ADASYN model.
  11. Performance Evaluation
    1. Recall for the Final Model is 94%
  12. Cost Benefit Analysis
    1. Cost of providing customer executive support per fraudulent transaction detected by the model is $1.5.
    2. Total cost of providing customer support per month for fraudulent transactions detected by the model is TF*$1.5
    3. Cost incurred due to fraudulent transactions left undetected by the model is
  13. Conclusions
    1. and Recall 94%. The deployment of this model results in saving 110320$. The most important features to monitor are the amount and late-night transaction hours.
  14. References

Introduction

Business Case

Payment fraud is already a billion dollar business, and it’s growing. When you look at the stats behind global online payment fraud, it’s no surprise that almost three quarters of businesses say it’s a major concern. According to Juniper Research, online sellers will lose $130 billion to online payment fraud between 2018 and 2023.

As eCommerce sales rise, payment fraud continues to plague customers and merchants.

Since 40% of customers blame the merchant when they experience fraud, it’s time to stop fraud from cutting into your bottom line.

Global fraud average costs:

  • Online payment fraud costs global businesses 1.8% of revenue.
  • For every $1 of fraud from chargebacks, ecommerce businesses lose an extra $2.94

The extra costs of fraud for businesses include chargeback fees, merchandise distribution, fraud investigation, legal prosecution and software security.

It’s not only about the financial cost – fraud also impacts brand and customer loyalty. Because consumers aren’t aware of how fraud works, they often blame the online seller and are less likely to buy from their site again.

Old School Approach

Traditionally businesses relied on rules alone to block fraudulent payments. Today, rules are still an important part of the anti-fraud toolkit but in the past, using them on their own also caused some issues:

  • False positives (FP) – using lots of rules tends to result in a high number of FP – meaning you’re likely to block a lot of genuine customers. For example, high-value orders and orders from high-risk locations are more likely to be fraudulent. 
  • Fixed outcomes – the thresholds for fraudy behaviour can change over time – if your prices change, the average order value can go up, meaning that orders over $500 become the norm, and so rules can become invalid.
  • Inefficient and hard to scale – using a rules-only approach means that your library must keep expanding as fraud evolves. This makes the system slower and puts a heavy maintenance burden on your fraud analyst team, demanding increasing numbers of manual reviews

Bottom Line: Rules, machine learning (ML) and Artificial Intelligence (AI) are complementary tools for fraud detection.

Machine Learning Approach

The ML/AI approach overcomes the above drawbacks of traditional rule-based fraud detection techniques as follows [1-5]:

  • Speed – Machine learning is like having several teams of analysts running hundreds of thousands of queries and comparing the outcomes to find the best result – this is all done in real-time and only takes milliseconds.
  • Scalability – Every online business wants to increase its transaction volume. With a rules only system, increasing amounts of payment and customer data puts more pressure on the rules library to expand. But with machine learning it’s the opposite – the more data the better.
  • Efficiency & cost – Machine learning does all the dirty work of data analysis in a fraction of the time it would take for even 100 fraud analysts. Unlike humans, machines can perform repetitive, tedious tasks 24/7 and only need to escalate decisions to a human when specific insight is needed. Recall that the cost of machine learning is just the cost of the servers running.
  • Accuracy – Machine learning models are able to learn from patterns of normal behavior. They are very fast to adapt to changes in that normal behaviour and can quickly identify patterns of rapidly varying fraud transactions. The model can identify suspicious customers even when there hasn’t been a chargeback yet. 

Multiple Supervised and Semi-Supervised machine learning techniques are used for fraud detection [3-5], but we aim is to overcome three main challenges with card frauds related dataset i.e., strong class imbalance, the inclusion of labelled and unlabelled samples, and to increase the ability to process a large number of transactions.

The ultimate goal of ML solutions is to minimize fraud losses while keeping customers safe by maximizing their CLV.

The Algorithm

The supervised deep learning binary classification algorithm [3-4] consists of the following steps:

  • Input Data Management
  • Feature Engineering
  • Model Training, Testing and Validation
  • Model Performance Analysis
  • Cost and Risk Score Estimates
  • Final Risk Threshold Adjustment

At the point of the transaction, the ML model gives each customer a risk score. The higher the score, the higher the probability of fraud.

You can choose what level of risk is right for your business, and set thresholds for what proportion of transactions you want to allow, block and manually review or challenge.

E2E Pipeline

The proposed end-to-end Python sequence [3] detects fraudulent transactions based on the historical transactional data of customers with a pool of merchants:

  • Setup the Jupyter Notebook Python open-source ML workspace within the Anaconda IDE, including all necessary libraries
  • Read and examine the input csv dataset
  • Data preparation, transformation, editing, train/test split and clean-up
  • Exploratory Data Analysis (EDA) and Visualization
  • Feature selection, ranking, scaling and correlations
  • Handling data imbalance by increasing minority class (Random Over-Sampling, SMOTE and ADASYN)
  • Model Training, Testing and Validation (4 base models + 3 sampling techniques x 4 ML models = 16 models)
  • Hyper-Parameter Optimization using grid search and randomly selected hyperparameter combinations (GridSearchCV and RandomizedSearchCV)
  • ML performance evaluation using precision-recall, f1-score, and confusion matrix classification reports, feature importance, ROC and other metrics
  • Risk score estimates are based on user-defined thresholds for what proportion of transactions you want to allow, block and manually review or challenge; these form the important triggers for fraud identification and using these triggers, techniques can be built further to mitigate the frauds in real time
  • Cost Benefit Analysis using the best ML model (cost incurred per month before/after the model is built and deployed).

Prerequisites

We install the Anaconda IDE to perform Python data science and machine learning on a single machine. In doing so, we start working with thousands of open-source packages and libraries. We import and install the following libraries:

import scikitplot as skplt

import sklearn
from sklearn.model_selection import train_test_split

from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor, GradientBoostingClassifier, ExtraTreesClassifier
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

from sklearn.metrics import precision_score, recall_score, confusion_matrix, classification_report, accuracy_score, f1_score

from imblearn.over_sampling import RandomOverSampler

from collections import Counter

import sys
import warnings
warnings.filterwarnings(“ignore”)

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import time
%matplotlib inline
warnings.filterwarnings('ignore')
sns.set_style("darkgrid")

print(“Scikit Plot Version : “, skplt.version)
print(“Scikit Learn Version : “, sklearn.version)
print(“Python Version : “, sys.version)

Scikit Plot Version :  0.3.7
Scikit Learn Version :  1.0.2
Python Version :  3.9.12 (main, Apr  4 2022, 05:22:27) [MSC v.1916 64 bit (AMD64)]
Adjusting the display to fit rows and columns effectively

start_time = time.time()
pd.set_option(‘display.max_rows’, 500)
pd.set_option(‘display.max_columns’, 500)
pd.set_option(‘display.width’, 1000)

from datetime import datetime, date
import math
from math import radians, sin, cos, acos, atan2
from sklearn.model_selection import train_test_split,GridSearchCV,RandomizedSearchCV
from sklearn.ensemble import GradientBoostingClassifier,RandomForestClassifier
from sklearn.preprocessing import MinMaxScaler

!pip install mpu –user

!pip install imbalanced-learn

The Input Dataset

We use the Kaggle Credit Card Transactions Fraud Detection Dataset [2]. This is a simulated credit card transaction dataset containing legitimate and fraud transactions from the duration 1st Jan 2019 – 31st Dec 2020. It covers credit cards of 1000 customers doing transactions with a pool of 800 merchants. This was generated using Sparkov Data Generation | Github tool created by Brandon Harris. The files were combined and converted into a standard csv format. The dataset is highly unbalanced, the positive class (frauds) account for 0.172% of all transactions.

Let’s read and understand the train fraud data
data_train = pd.read_csv(r’YourPath\fraudTrain.csv’)
data_train.head()

Input data table

Let’s read and understand the test fraud data
data_test = pd.read_csv(r’YourPath\fraudTest.csv’)
data_test.head()

Input data table
Checking the number of columns and rows in the dataset

print(data_train.shape)
print(data_test.shape)

(1296675, 23)
(555719, 23)

Checking for duplicates and deleteing the duplicate records.

data_train.drop_duplicates()
data_test.drop_duplicates()

print(data_train.shape)
print(data_test.shape)

(1296675, 23)
(555719, 23)

Checking the fields and their datatypes

data_train.info(verbose=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1296675 entries, 0 to 1296674
Data columns (total 23 columns):
 #   Column                 Non-Null Count    Dtype  
---  ------                 --------------    -----  
 0   Unnamed: 0             1296675 non-null  int64  
 1   trans_date_trans_time  1296675 non-null  object 
 2   cc_num                 1296675 non-null  int64  
 3   merchant               1296675 non-null  object 
 4   category               1296675 non-null  object 
 5   amt                    1296675 non-null  float64
 6   first                  1296675 non-null  object 
 7   last                   1296675 non-null  object 
 8   gender                 1296675 non-null  object 
 9   street                 1296675 non-null  object 
 10  city                   1296675 non-null  object 
 11  state                  1296675 non-null  object 
 12  zip                    1296675 non-null  int64  
 13  lat                    1296675 non-null  float64
 14  long                   1296675 non-null  float64
 15  city_pop               1296675 non-null  int64  
 16  job                    1296675 non-null  object 
 17  dob                    1296675 non-null  object 
 18  trans_num              1296675 non-null  object 
 19  unix_time              1296675 non-null  int64  
 20  merch_lat              1296675 non-null  float64
 21  merch_long             1296675 non-null  float64
 22  is_fraud               1296675 non-null  int64  
dtypes: float64(5), int64(6), object(12)
memory usage: 227.5+ MB

data_test.info(verbose=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 555719 entries, 0 to 555718
Data columns (total 23 columns):
 #   Column                 Non-Null Count   Dtype  
---  ------                 --------------   -----  
 0   Unnamed: 0             555719 non-null  int64  
 1   trans_date_trans_time  555719 non-null  object 
 2   cc_num                 555719 non-null  int64  
 3   merchant               555719 non-null  object 
 4   category               555719 non-null  object 
 5   amt                    555719 non-null  float64
 6   first                  555719 non-null  object 
 7   last                   555719 non-null  object 
 8   gender                 555719 non-null  object 
 9   street                 555719 non-null  object 
 10  city                   555719 non-null  object 
 11  state                  555719 non-null  object 
 12  zip                    555719 non-null  int64  
 13  lat                    555719 non-null  float64
 14  long                   555719 non-null  float64
 15  city_pop               555719 non-null  int64  
 16  job                    555719 non-null  object 
 17  dob                    555719 non-null  object 
 18  trans_num              555719 non-null  object 
 19  unix_time              555719 non-null  int64  
 20  merch_lat              555719 non-null  float64
 21  merch_long             555719 non-null  float64
 22  is_fraud               555719 non-null  int64  
dtypes: float64(5), int64(6), object(12)
memory usage: 97.5+ MB

We see that there are no null values in both train and test data.

Check statistical information about numerical fields.

data_train.describe()

Descriptive statistics of input data

data_test.describe()

Descriptive statistics of input data

Data Preparation

The variable is_fraud represents 0 for non-fraudulent and 1 for fraudulent transactions. This is our TARGET variable. Let’s check the class imbalance of target variable is_fraud in train and test sets.

data_train[‘is_fraud’].value_counts(normalize=True)

0    0.994211
1    0.005789
Name: is_fraud, dtype: float64

data_test[‘is_fraud’].value_counts(normalize=True)

0    0.99614
1    0.00386
Name: is_fraud, dtype: float64

Forming a consolidated train+test dataset

data = pd.concat([data_train, data_test])

data.head()

A consolidated train+test dataset

Let’s check statistical information abut numerical fields.

data.describe()

Descriptive statistics of the consolidated train+test dataset

Let’s confirm that the concatenation is done properly

data.shape

(1852394, 23)

data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1852394 entries, 0 to 555718
Data columns (total 23 columns):
 #   Column                 Dtype  
---  ------                 -----  
 0   Unnamed: 0             int64  
 1   trans_date_trans_time  object 
 2   cc_num                 int64  
 3   merchant               object 
 4   category               object 
 5   amt                    float64
 6   first                  object 
 7   last                   object 
 8   gender                 object 
 9   street                 object 
 10  city                   object 
 11  state                  object 
 12  zip                    int64  
 13  lat                    float64
 14  long                   float64
 15  city_pop               int64  
 16  job                    object 
 17  dob                    object 
 18  trans_num              object 
 19  unix_time              int64  
 20  merch_lat              float64
 21  merch_long             float64
 22  is_fraud               int64  
dtypes: float64(5), int64(6), object(12)
memory usage: 339.2+ MB

Let’s check null or missing values and sort them in ascending order

data.isnull().sum().sort_values()

Unnamed: 0               0
merch_lat                0
unix_time                0
trans_num                0
dob                      0
job                      0
city_pop                 0
long                     0
lat                      0
zip                      0
merch_long               0
state                    0
street                   0
gender                   0
last                     0
first                    0
amt                      0
category                 0
merchant                 0
cc_num                   0
trans_date_trans_time    0
city                     0
is_fraud                 0
dtype: int64

data.head()

Input data after checking null or missing values and sorting them in ascending order

data[‘Unnamed: 0’].value_counts()

0          2
370474     2
370488     2
370487     2
370486     2
          ..
802705     1
802706     1
802707     1
802708     1
1296674    1
Name: Unnamed: 0, Length: 1296675, dtype: int64

Dropping unwanted columns

cols_to_delete = [‘Unnamed: 0’, ‘cc_num’, ‘street’, ‘zip’, ‘trans_num’, ‘unix_time’ ]
data.drop(cols_to_delete, axis = 1, inplace = True)

data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1852394 entries, 0 to 555718
Data columns (total 17 columns):
 #   Column                 Dtype  
---  ------                 -----  
 0   trans_date_trans_time  object 
 1   merchant               object 
 2   category               object 
 3   amt                    float64
 4   first                  object 
 5   last                   object 
 6   gender                 object 
 7   city                   object 
 8   state                  object 
 9   lat                    float64
 10  long                   float64
 11  city_pop               int64  
 12  job                    object 
 13  dob                    object 
 14  merch_lat              float64
 15  merch_long             float64
 16  is_fraud               int64  
dtypes: float64(5), int64(2), object(10)
memory usage: 254.4+ MB

Let’s create a column customer name with columns first and last

data[‘Customer_name’] = data[‘first’]+” “+data[‘last’]
data.drop([‘first’,’last’], axis=1, inplace=True)

data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1852394 entries, 0 to 555718
Data columns (total 16 columns):
 #   Column                 Dtype  
---  ------                 -----  
 0   trans_date_trans_time  object 
 1   merchant               object 
 2   category               object 
 3   amt                    float64
 4   gender                 object 
 5   city                   object 
 6   state                  object 
 7   lat                    float64
 8   long                   float64
 9   city_pop               int64  
 10  job                    object 
 11  dob                    object 
 12  merch_lat              float64
 13  merch_long             float64
 14  is_fraud               int64  
 15  Customer_name          object 
dtypes: float64(5), int64(2), object(9)
memory usage: 240.3+ MB

Let’s look at city_pop variable

print(“Min population : “, data[‘city_pop’].min())
print(“Max population : “, data[‘city_pop’].max())

Min population :  23
Max population :  2906700

Let's Create a categorical column Population_group by binning the variable city_pop

data["Population_group"] = pd.cut(data["city_pop"], bins=list(range(0,3000001,500000)), labels = ["<5lac","5-10lac","10-15lac","15-20","20-25lac","25-30lac"])
data["Population_group"].value_counts()

<5lac       1758657
5-10lac       46877
10-15lac      21224
15-20         16105
25-30lac       8794
20-25lac        737
Name: Population_group, dtype: int64

Let’s create a column age from dob variable

data[‘dob’] = pd.to_datetime(data[‘dob’])

def calculate_age(born):
today = date.today()
return today.year – born.year – ((today.month, today.day) < (born.month, born.day))

by calling the function calculate_age
data[‘age’] = data[“dob”].apply(calculate_age)

Let’s create a column age_group from the column age
data[“age_group”] = pd.cut(data[“age”], bins=[0,25,40,60,80,9999], labels = [“<25″,”25-40″,”40-60″,”60-80″,”80+”])

Calculating distance between the customer and merchant location using customer location co-ordinates(lat and long) and merchant location co-ordinates(merch_lat and merch_long)

R = 6373.0 # radius of the Earth
data[‘lat’] = data[‘lat’].astype(‘float’)
data[‘long’] = data[‘long’].astype(‘float’)
data[‘merch_lat’] = data[‘merch_lat’].astype(‘float’)
data[‘merch_long’] = data[‘merch_long’].astype(‘float’)#coordinates

data[‘lat’] = np.radians(data[‘lat’])
data[‘long’] = np.radians(data[‘long’])
data[‘merch_lat’] = np.radians(data[‘merch_lat’])
data[‘merch_long’] = np.radians(data[‘merch_long’])

data[‘dlon’] = data[‘merch_long’] – data[‘long’] #change in coordinates
data[‘dlat’] = data[‘merch_lat’] – data[‘lat’]

a = np.sin(data[‘dlat’] / 2)**2 + np.cos(data[‘lat’]) * np.cos(data[‘merch_lat’]) * np.sin(data[‘dlon’] / 2)**2 #Haversine formula

c = 22np.arctan2(np.sqrt(a), np.sqrt(1 – a))
data[‘distance’] = R * c

data[‘distance’].head()

0    157.244484
1     60.443320
2    216.480102
3    191.406530
4    155.162181
Name: distance, dtype: float64

data[‘distance’].min()

0.0445230036617093

data[‘distance’].max()

304.3298522066305

data[‘distance’].describe()

count    1.852394e+06
mean     1.522712e+02
std      5.825222e+01
min      4.452300e-02
25%      1.106749e+02
50%      1.564819e+02
75%      1.970808e+02
max      3.043299e+02
Name: distance, dtype: float64

Let’s create a column dist_range_km from the column distance
data[“dist_range_km”] = pd.cut(data[“distance”], bins=[0,25,50,100,150,200,250,300,9999], labels = [“<25″,”25-50″,”50-100″,”100-150″,”150-200″,”200-250″,”250-300″,”300+”])
data.head()

Edited input data table after creating a column dist_range_km from the column distance

data.drop([‘dlat’, ‘dlon’], axis=1, inplace=True)
data.drop([‘dob’,’city_pop’], axis=1, inplace=True)
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1852394 entries, 0 to 555718
Data columns (total 19 columns):
 #   Column                 Dtype   
---  ------                 -----   
 0   trans_date_trans_time  object  
 1   merchant               object  
 2   category               object  
 3   amt                    float64 
 4   gender                 object  
 5   city                   object  
 6   state                  object  
 7   lat                    float64 
 8   long                   float64 
 9   job                    object  
 10  merch_lat              float64 
 11  merch_long             float64 
 12  is_fraud               int64   
 13  Customer_name          object  
 14  Population_group       category
 15  age                    int64   
 16  age_group              category
 17  distance               float64 
 18  dist_range_km          category
dtypes: category(3), float64(6), int64(2), object(8)
memory usage: 245.6+ MB

Let’s create the transaction date and time column

data[‘trans_date_trans_time’] = pd.to_datetime(data[‘trans_date_trans_time’])

Extract year and month from trans_date_trans_time column

data[‘year’] = pd.DatetimeIndex(data[‘trans_date_trans_time’]).year
data[‘month’] = pd.DatetimeIndex(data[‘trans_date_trans_time’]).month

Extract day of the week and transaction hour from trans_date_trans_time column

data[‘day_of_week’] = data[‘trans_date_trans_time’].dt.day_name()
data[‘transaction_hour’] = data[‘trans_date_trans_time’].dt.hour
data.head()

Input data table after creating the transaction date and time column

data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1852394 entries, 0 to 555718
Data columns (total 23 columns):
 #   Column                 Dtype         
---  ------                 -----         
 0   trans_date_trans_time  datetime64[ns]
 1   merchant               object        
 2   category               object        
 3   amt                    float64       
 4   gender                 object        
 5   city                   object        
 6   state                  object        
 7   lat                    float64       
 8   long                   float64       
 9   job                    object        
 10  merch_lat              float64       
 11  merch_long             float64       
 12  is_fraud               int64         
 13  Customer_name          object        
 14  Population_group       category      
 15  age                    int64         
 16  age_group              category      
 17  distance               float64       
 18  dist_range_km          category      
 19  year                   int64         
 20  month                  int64         
 21  day_of_week            object        
 22  transaction_hour       int64         
dtypes: category(3), datetime64[ns](1), float64(6), int64(5), object(8)
memory usage: 302.1+ MB

Exploratory Data Analysis (EDA)

Univariate Analysis

Let’s begin with the target variable is_fraud = 0 (no fraud), 1 (fraud)

plt.figure(figsize= (10,6))
fig = data[“is_fraud”].value_counts(normalize = True).plot.pie(autopct=’%1.2f%%’)
plt.title(“Pie-chart showing imbalance in is_fraud variable”, fontdict={‘fontsize’: 20, ‘fontweight’ : 5, ‘color’ : ‘Green’})
fig.legend(title=”is_fraud”,
loc=”center left”,
bbox_to_anchor=(1, 0, 0.5, 1))
plt.show()

Pie-chart showing imbalance in is_fraud variable

As we can see, the input dataset is highly imbalanced with 0.52% of transactions being fraudulent and 99.48% transactions being non-fraudulent.

Let’s plot the bar chart of the category variable

plt.figure(figsize= (8,4))
data[“category”].value_counts(normalize = True).plot.bar()
plt.title(“Bar chart analysing category variable\n”, fontdict={‘fontsize’: 20, ‘fontweight’ : 5, ‘color’ : ‘Green’})

plt.show()

Bar chart analysing category variable

It appears that maximum transactions in the dataset belong to merchants falling in gas_transport category and least being travel category.

data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1852394 entries, 0 to 555718
Data columns (total 23 columns):
 #   Column                 Dtype         
---  ------                 -----         
 0   trans_date_trans_time  datetime64[ns]
 1   merchant               object        
 2   category               object        
 3   amt                    float64       
 4   gender                 object        
 5   city                   object        
 6   state                  object        
 7   lat                    float64       
 8   long                   float64       
 9   job                    object        
 10  merch_lat              float64       
 11  merch_long             float64       
 12  is_fraud               int64         
 13  Customer_name          object        
 14  Population_group       category      
 15  age                    int64         
 16  age_group              category      
 17  distance               float64       
 18  dist_range_km          category      
 19  year                   int64         
 20  month                  int64         
 21  day_of_week            object        
 22  transaction_hour       int64         
dtypes: category(3), datetime64[ns](1), float64(6), int64(5), object(8)
memory usage: 302.1+ MB

Let’s look at the gender variable

plt.figure(figsize= (8,4))
data[“gender”].value_counts(normalize = True).plot.bar()
plt.title(“Bar chart analysing gender variable\n”, fontdict={‘fontsize’: 20, ‘fontweight’ : 5, ‘color’ : ‘Green’})

plt.show()

Bar chart analysing gender variable

It is clear that maximum transactions in the dataset are performed by females.

Let’s look at the state variable

plt.figure(figsize= (12,8))
data[“state”].value_counts(normalize = True).plot.bar()
plt.title(“Bar chart analysing state variable\n”, fontdict={‘fontsize’: 20, ‘fontweight’ : 5, ‘color’ : ‘Green’})

plt.show()

Bar chart analysing state variable

We can see that maximum transactions in the dataset take place in Texas and least being in Delaware state.

Let’s plot the population_group variable

plt.figure(figsize= (10,6))
data[“Population_group”].value_counts(normalize = True).plot.bar()
plt.title(“Bar chart analysing Population_group variable\n”, fontdict={‘fontsize’: 20, ‘fontweight’ : 5, ‘color’ : ‘Green’})

plt.show()

Bar chart analysing Population_group variable

It turns out that maximum transactions take place in areas where population is less than 5 lakh (aka lac).

Let’s look at the age_group variable

plt.figure(figsize= (10,6))
data[“age_group”].value_counts(normalize = True).plot.bar()
plt.title(“Bar chart analysing age_group variable\n”, fontdict={‘fontsize’: 20, ‘fontweight’ : 5, ‘color’ : ‘Green’})

plt.show()

Bar chart analysing age_group variable

This plot shows that maximum transactions are performed by people within the age group 40-60 years.

Let’s look at the year variable

plt.figure(figsize= (10,6))
data[“year”].value_counts(normalize = True).plot.bar()
plt.title(“Bar chart analysing year variable\n”, fontdict={‘fontsize’: 20, ‘fontweight’ : 5, ‘color’ : ‘Green’})

plt.show()

Bar chart analysing year variable

It is interesting that equal number of transactions are performed in the years 2019 and 2020.

Let’s look at the month variable

plt.figure(figsize= (10,6))
data[“month”].value_counts(normalize = True).plot.bar()
plt.title(“Bar chart analysing month variable\n”, fontdict={‘fontsize’: 20, ‘fontweight’ : 5, ‘color’ : ‘Green’})

plt.show()

Bar chart analysing month variable

We can see that maximum transactions are performed in December.

data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1852394 entries, 0 to 555718
Data columns (total 23 columns):
 #   Column                 Dtype         
---  ------                 -----         
 0   trans_date_trans_time  datetime64[ns]
 1   merchant               object        
 2   category               object        
 3   amt                    float64       
 4   gender                 object        
 5   city                   object        
 6   state                  object        
 7   lat                    float64       
 8   long                   float64       
 9   job                    object        
 10  merch_lat              float64       
 11  merch_long             float64       
 12  is_fraud               int64         
 13  Customer_name          object        
 14  Population_group       category      
 15  age                    int64         
 16  age_group              category      
 17  distance               float64       
 18  dist_range_km          category      
 19  year                   int64         
 20  month                  int64         
 21  day_of_week            object        
 22  transaction_hour       int64         
dtypes: category(3), datetime64[ns](1), float64(6), int64(5), object(8)
memory usage: 302.1+ MB

Let’s plot the day_of_week variable

plt.figure(figsize= (10,6))
data[“day_of_week”].value_counts(normalize = True).plot.bar()
plt.title(“Bar chart analysing day_of_week variable\n”, fontdict={‘fontsize’: 20, ‘fontweight’ : 5, ‘color’ : ‘Green’})

plt.show()

Bar chart analysing day_of_week variable

We can see that maximum transactions take place on Monday and Sunday.

Let’s plot the transaction_hour variable

plt.figure(figsize= (10,6))
data[“transaction_hour”].value_counts(normalize = True).plot.bar()
plt.title(“Bar chart analysing transaction_hour variable\n”, fontdict={‘fontsize’: 20, ‘fontweight’ : 5, ‘color’ : ‘Green’})

plt.show()

Bar chart analysing transaction_hour variable

It appears that most transactions take place after 12 noon.

Let’s portray the dist_range_km variable

plt.figure(figsize= (10,6))
data[“dist_range_km”].value_counts(normalize = True).plot.bar()
plt.title(“Bar chart analysing dist_range_km variable\n”, fontdict={‘fontsize’: 20, ‘fontweight’ : 5, ‘color’ : ‘Green’})

plt.show()

Bar chart analysing dist_range_km variable

it is insightful to see that maximum transactions are performed within the distance range 152-200 kms from the customer location.

Bivariate Analysis

Let’s begin with amt vs age

plt.figure(figsize= [10,6])
plt.scatter(data[“age”], data[“amt”], alpha = 0.5)
plt.title(“Scatter plot analysing amt vs age\n”, fontdict={‘fontsize’: 20, ‘fontweight’ : 5, ‘color’ : ‘Green’})
plt.xlabel(“age”, fontdict={‘fontsize’: 12, ‘fontweight’ : 5, ‘color’ : ‘Black’})
plt.ylabel(“amt”, fontdict={‘fontsize’: 12, ‘fontweight’ : 5, ‘color’ : ‘Black’} )
plt.show()

Scatter plot analysing amt vs age

We can say that people of all age groups usually perform transactions less than 5000 $.

Let’s examine spatial distributions of customer/merchant locations responsible for fraud transactions

fraud_data = data[data[‘is_fraud’]==1]

plt.figure(figsize=(15,6))
plt.subplot(1,2,1)
plt.scatter(fraud_data[“lat”], fraud_data[“long”], alpha = 0.5)
plt.title(“Plot analysing distribution of customer location for frauds\n”, fontdict={‘fontsize’: 15, ‘fontweight’ : 5, ‘color’ : ‘Green’})
plt.ylabel(“Longitude”, fontdict={‘fontsize’: 12, ‘fontweight’ : 5, ‘color’ : ‘Black’})
plt.xlabel(“Latitude”, fontdict={‘fontsize’: 12, ‘fontweight’ : 5, ‘color’ : ‘Black’})

plt.subplot(1,2,2)
plt.scatter(fraud_data[“merch_lat”], fraud_data[“merch_long”], alpha = 0.5)
plt.title(“Plot analysing distribution of merchant location for frauds\n”, fontdict={‘fontsize’: 15, ‘fontweight’ : 5, ‘color’ : ‘Green’})
plt.ylabel(“Longitude”, fontdict={‘fontsize’: 12, ‘fontweight’ : 5, ‘color’ : ‘Black’})
plt.xlabel(“Latitude”, fontdict={‘fontsize’: 12, ‘fontweight’ : 5, ‘color’ : ‘Black’})

plt.show()

Plot analysing distribution of merchant location for frauds

We can see the complete overlap between customer and merchant locations responsible for fraud transactions.

Now let’s look at is_fraud vs amt plot

plt.figure(figsize= (8,4))
data.groupby(“is_fraud”)[“amt”].mean().plot.bar()
plt.title(“Plot analysing amt w.r.t. is_fraud variable\n”, fontdict={‘fontsize’: 20, ‘fontweight’ : 5, ‘color’ : ‘Green’})

plt.show()

Plot analysing amt w.r.t. is_fraud variable

It is clear that the average amount of fraudulent transactions is slightly greater than 500 $.

Let’s plot is_fraud vs age

plt.figure(figsize= (8,4))
data.groupby(“is_fraud”)[“age”].mean().plot.bar()
plt.title(“Plot analysing age w.r.t. is_fraud variable\n”, fontdict={‘fontsize’: 20, ‘fontweight’ : 5, ‘color’ : ‘Green’})

plt.show()

Plot analysing age w.r.t. is_fraud variable

Generally, people within the 40-50 years age group have done more fraudulent transactions than genuine ones.

Lest’s look at is_fraud vs distance

plt.figure(figsize= (8,4))
data.groupby(“is_fraud”)[“distance”].mean().plot.bar()
plt.title(“Plot analysing distance w.r.t. is_fraud variable\n”, fontdict={‘fontsize’: 20, ‘fontweight’ : 5, ‘color’ : ‘Green’})

plt.show()

Plot analysing distance w.r.t. is_fraud variable

Generally speaking, transactions performed within the distance range 140-160 kms from the customer locations have equal chances of being fraudelent and non-fraudulent.

data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1852394 entries, 0 to 555718
Data columns (total 23 columns):
 #   Column                 Dtype         
---  ------                 -----         
 0   trans_date_trans_time  datetime64[ns]
 1   merchant               object        
 2   category               object        
 3   amt                    float64       
 4   gender                 object        
 5   city                   object        
 6   state                  object        
 7   lat                    float64       
 8   long                   float64       
 9   job                    object        
 10  merch_lat              float64       
 11  merch_long             float64       
 12  is_fraud               int64         
 13  Customer_name          object        
 14  Population_group       category      
 15  age                    int64         
 16  age_group              category      
 17  distance               float64       
 18  dist_range_km          category      
 19  year                   int64         
 20  month                  int64         
 21  day_of_week            object        
 22  transaction_hour       int64         
dtypes: category(3), datetime64[ns](1), float64(6), int64(5), object(8)
memory usage: 302.1+ MB

Let’s switch to the categorical-categorical variable analysis.

Let’s begin with category vs is_fraud

plt.figure(figsize= (10,6))
data.groupby(“category”)[“is_fraud”].mean().plot.bar()
plt.show()

Category bar plot

We can see that maximim fraudulent transactions are performed by merchants with category shopping_net.

Let’s plot gender vs is_fraud

plt.figure(figsize= (8,6))
data.groupby(“gender”)[“is_fraud”].mean().plot.bar()
plt.show()

Gender bar plot

Even though women have performed maximum transactions, men have performed more fraudulent transactions than women.

Let’s check state vs is_fraud

plt.figure(figsize= (15,6))
data.groupby(“state”)[“is_fraud”].mean().plot.bar()
plt.show()

State bar plot

Surprisingly, most of the transactions performed in Deleware (DE) are fraudulent, even though the number of actual transaction performed in Delaware(DE) are the least amongst all the states.

Let’s plot population_group vs is_fraud

plt.figure(figsize= (10,6))
data.groupby(“Population_group”)[“is_fraud”].mean().plot.bar()
plt.show()

Population_group bar plot

We can see that maximum fraudulent transactions are performed in the areas with population range 20-25 lakhs.

Let’s lokk at age_group vs is_fraud

plt.figure(figsize= (10,6))
data.groupby(“age_group”)[“is_fraud”].mean().plot.bar()
plt.show()

age_group bar plot

Whatever least transactions performed by people with 80+ years of age are fraudulent. This can indicate identity theft.

Let’s plot year vs is_fraud

plt.figure(figsize= (8,4))
data.groupby(“year”)[“is_fraud”].mean().plot.bar()
plt.show()

Year bar plot

It appears that more fraudulent transactions are performed in 2019 compared to 2020.

Let’s check month vs is_fraud

plt.figure(figsize= (15,6))
data.groupby(“month”)[“is_fraud”].mean().plot.bar()
plt.show()

Month bar plot

It is clear that maximum fraudulent transactions are performed in February.

Let’s look at day_of_week vs is_fraud

plt.figure(figsize= (10,6))
data.groupby(“day_of_week”)[“is_fraud”].mean().plot.bar()
plt.show()

day_of_week bar plot

The above plot shows that maximum fraudulent transactions take place on Thursday and Friday.

Let’s check transaction_hour vs is_fraud

plt.figure(figsize= (10,6))
data.groupby(“transaction_hour”)[“is_fraud”].mean().plot.bar()
plt.show()

transaction_hour bar plot

It is clear that maximum fraudulent transactions take place late at night.

Let’s look at dist_range_km vs is_fraud

plt.figure(figsize= (10,6))
data.groupby(“dist_range_km”)[“is_fraud”].mean().plot.bar()
plt.show()

dist_range_km bar plot

It appears that maximum fraudulent transactions are performed within the distance range 100-150 kms from the customer location.

Multivariate Analysis

Let’s plot pivot tables is_fraud vs 2 attributes taken from the list below

data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1852394 entries, 0 to 555718
Data columns (total 23 columns):
 #   Column                 Dtype         
---  ------                 -----         
 0   trans_date_trans_time  datetime64[ns]
 1   merchant               object        
 2   category               object        
 3   amt                    float64       
 4   gender                 object        
 5   city                   object        
 6   state                  object        
 7   lat                    float64       
 8   long                   float64       
 9   job                    object        
 10  merch_lat              float64       
 11  merch_long             float64       
 12  is_fraud               int64         
 13  Customer_name          object        
 14  Population_group       category      
 15  age                    int64         
 16  age_group              category      
 17  distance               float64       
 18  dist_range_km          category      
 19  year                   int64         
 20  month                  int64         
 21  day_of_week            object        
 22  transaction_hour       int64         
dtypes: category(3), datetime64[ns](1), float64(6), int64(5), object(8)
memory usage: 302.1+ MB

Let’s begin with category vs gender vs is_fraud
pivot1 = pd.pivot_table(data = data, index = “gender”, columns = “category”, values = “is_fraud”)
pivot1

Input data to pivot analysis

Let’s plot it

plt.figure(figsize= (16,4))
sns.heatmap(pivot1, cmap = “Greens”, annot = True)
plt.show()

category vs gender vs is_fraud pivot table

It turns out that most fraudulent transactions are performed by males within the category shopping_net.

Let’s look at state vs gender vs is_fraud

pivot2 = pd.pivot_table(data = data, index = “state”, columns = “gender”, values = “is_fraud”)
pivot2

state vs gender vs is_fraud input to pivot table

The corresponding plot is given by

plt.figure(figsize= (10,15))
sns.heatmap(pivot2, cmap = “Greens”, annot = True)
plt.show()

state vs gender vs is_fraud pivot table

It is clear that 100% fraudulent transactions in the states DE and NV are performed by females.

Let’s consider age_group vs gender vs is_fraud
pivot3 = pd.pivot_table(data = data, index = “age_group”, columns = “gender”, values = “is_fraud”)
pivot3

Input to age_group vs gender vs is_fraud pivot table

Let’s plot this table

age_group vs gender vs is_fraud pivot table

Transactions carried out by 80+ years old males may be fraudulent.

Let’s look at Population_group vs dist_range_km vs is_fraud
pivot4 = pd.pivot_table(data = data, index = “dist_range_km”, columns = “Population_group”, values = “is_fraud”)
pivot4

Input to population_group vs dist_range_km vs is_fraud pivot table

Let’s plot it

population_group vs dist_range_km vs is_fraud pivot table

We can see that fraudulent transactions are carried out within the distance range 250-300 kms from the customer locations in areas within the population range 20-25 lac.

Let’s check year vs month vs is_fraud
pivot5 = pd.pivot_table(data = data, index = “month”, columns = “year”, values = “is_fraud”)
pivot5

Input to year vs month vs is_fraud pivot table

Let’s plot it

plt.figure(figsize= (10,6))
sns.heatmap(pivot5, cmap = “Greens”, annot = True)
plt.show()

year vs month vs is_fraud pivot table

We can see that most fraudulent transactions are conducted in Jan and Feb 2019.

Let’s look at age_group vs dist_range_km vs is_fraud
pivot6 = pd.pivot_table(data = data, index = “age_group”, columns = “dist_range_km”, values = “is_fraud”)
pivot6

Input to age_group vs dist_range_km vs is_fraud pivot table

and plot it

plt.figure(figsize= (10,6))
sns.heatmap(pivot6, cmap = “Greens”, annot = True)
plt.show()

age_group vs dist_range_km vs is_fraud pivot table

Transactions performed by 80+ years old customers within the distance range 200-250 kms from the customer locations may be fraudulent.

Let’s look at transaction_hour vs day_of_week vs is_fraud
pivot7 = pd.pivot_table(data = data, index = “transaction_hour”, columns = “day_of_week”, values = “is_fraud”)
pivot7

Input to transaction_hour vs day_of_week vs is_fraud pivot table

Let’s plot it

plt.figure(figsize= (10,6))
sns.heatmap(pivot7, cmap = “Greens”, annot = True)
plt.show()

transaction_hour vs day_of_week vs is_fraud pivot table

We can see that late-night midweek transactions may be fraudulent.

Let’s look at transaction_hour vs gender vs is_fraud
pivot8 = pd.pivot_table(data = data, index = “transaction_hour”, columns = “gender”, values = “is_fraud”)
pivot8

Input to transaction_hour vs gender vs is_fraud pivot table

and plot it

plt.figure(figsize= (10,6))
sns.heatmap(pivot8, cmap = “Greens”, annot = True)
plt.show()

transaction_hour vs gender vs is_fraud pivot table

Late-night transactions performed by male customers may be fraudulent.

Let’s look at transaction_hour vs dist_range_km vs is_fraud
pivot9 = pd.pivot_table(data = data, index = “transaction_hour”, columns = “dist_range_km”, values = “is_fraud”)
pivot9

Input to transaction_hour vs dist_range_km vs is_fraud pivot table

Let’s plot it

plt.figure(figsize= (10,6))
sns.heatmap(pivot9, cmap = “Greens”, annot = True)
plt.show()

transaction_hour vs dist_range_km vs is_fraud pivot table

Late-night transactions regardless of distance from the customer locations may be fraudulent.

Let’s check data skewness
data.describe()

Descriptive statistics of input data after editing

Let’s plot the following histograms of interest

cols = [‘amt’, ‘age’, ‘distance’]

plt.figure(figsize=[20,7])
for ind, col in enumerate(cols):
plt.subplot(2,2,ind+1)
data[col].value_counts(normalize=True).plot.hist()
plt.title(col)
plt.show()

Histograms of cols = ['amt', 'age', 'distance']

and the density plot

sns.distplot(data.amt)
plt.show()

Density plot of amt

We can see that the amt variable is skewed.

In the scikit-learn world, we can apply a power transform featurewise to make data more Gaussian-like. This is useful for modeling issues related to heteroscedasticity (non-constant variance), or other situations where normality is desired.

Let’s apply PowerTransformer to amt, age and plot the density curve

from sklearn.preprocessing import PowerTransformer

pt = PowerTransformer()

data[‘amt’] = pt.fit_transform(data[[‘amt’]])

sns.distplot(data.amt)
plt.show()

Density plot of amt column after PowerTransformer()

sns.distplot(data.age)
plt.show()

Density plot of age column after PowerTransformer()

We can see the both attributes result in multi-mode skewed non-normal density distributions.

Feature Engineering

Let’s count fraud-related merchants

data.merchant.value_counts()

fraud_Kilback LLC                        6262
fraud_Cormier LLC                        5246
fraud_Schumm PLC                         5195
fraud_Kuhn LLC                           5031
fraud_Boyer PLC                          4999
                                         ... 
fraud_Douglas, DuBuque and McKenzie      1101
fraud_Treutel-King                       1098
fraud_Satterfield-Lowe                   1095
fraud_Hahn, Douglas and Schowalter       1091
fraud_Ritchie, Bradtke and Stiedemann    1090
Name: merchant, Length: 693, dtype: int64

Let’s count fraud-related jobs

data.job.value_counts()

Film/video editor                                              13898
Exhibition designer                                            13167
Surveyor, land/geomatics                                       12436
Naval architect                                                12434
Materials engineer                                             11711
Designer, ceramics/pottery                                     11688
Environmental consultant                                       10974
Financial adviser                                              10963
Systems developer                                              10962
IT trainer                                                     10943
Copywriter, advertising                                        10241
Scientist, audiological                                        10234
Chartered public finance accountant                            10211
Chief Executive Officer                                        10199
Podiatrist                                                      9525
Comptroller                                                     9515
Magazine features editor                                        9506
Agricultural consultant                                         9500
Paramedic                                                       9494
Sub                                                             9488
Audiological scientist                                          8801
Historic buildings inspector/conservation officer               8787
Building surveyor                                               8786
Librarian, public                                               8773
Musician                                                        8772
Scientist, research (maths)                                     8768
Barrister                                                       8767
etc.

len(data.job.value_counts())

497

Let’s count transaction_hour

23    95902
22    95370
16    94289
18    94052
21    93738
17    93514
13    93492
15    93439
19    93433
12    93294
14    93089
20    93081
1     61330
3     60968
2     60796
0     60655
8     60498
6     60406
10    60320
7     60301
9     60231
11    60170
5     60088
4     59938
Name: transaction_hour, dtype: int64

Let’s drop unwanted columns

data.drop([‘trans_date_trans_time’, “lat”, “long”, “merch_lat”, “merch_long”, “Customer_name”, “year”], axis=1, inplace=True)

data.drop([“merchant”, “city”, “job”], axis=1, inplace=True)

data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1852394 entries, 0 to 555718
Data columns (total 13 columns):
 #   Column            Dtype   
---  ------            -----   
 0   category          object  
 1   amt               float64 
 2   gender            object  
 3   state             object  
 4   is_fraud          int64   
 5   Population_group  category
 6   age               int64   
 7   age_group         category
 8   distance          float64 
 9   dist_range_km     category
 10  month             int64   
 11  day_of_week       object  
 12  transaction_hour  int64   
dtypes: category(3), float64(2), int64(4), object(4)
memory usage: 160.8+ MB

Let’s perform train-test data split

train,test = train_test_split(data,test_size=0.3,random_state=42, stratify=data.is_fraud)

print(f”train data shape:{train.shape}”)
print(f”Test data shape:{test.shape}”)

while checking the train/test data shape

train data shape:(1296675, 13)
Test data shape:(555719, 13)

Let’ look at the normalized is_fraud value counts

train.is_fraud.value_counts(normalize=True)

0    0.99479
1    0.00521
Name: is_fraud, dtype: float64

test.is_fraud.value_counts(normalize=True)

0    0.994791
1    0.005209
Name: is_fraud, dtype: float64

Let’s proceed with the train/test data segregation

y_train = train.pop(“is_fraud”)
X_train = train

y_test = test.pop(“is_fraud”)
X_test = test

X_train.head()

train data

Creating dummy variables

X_train[‘transaction_hour’]= X_train[‘transaction_hour’].astype(str)
X_train[‘month’]= X_train[‘month’].astype(str)
X_test[‘transaction_hour’]= X_test[‘transaction_hour’].astype(str)

X_test[‘month’]= X_test[‘month’].astype(str)
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1852394 entries, 0 to 555718
Data columns (total 13 columns):
 #   Column            Dtype   
---  ------            -----   
 0   category          object  
 1   amt               float64 
 2   gender            object  
 3   state             object  
 4   is_fraud          int64   
 5   Population_group  category
 6   age               int64   
 7   age_group         category
 8   distance          float64 
 9   dist_range_km     category
 10  month             int64   
 11  day_of_week       object  
 12  transaction_hour  int64   
dtypes: category(3), float64(2), int64(4), object(4)
memory usage: 160.8+ MB

cat_cols = [“category”, “state”, “month”, “day_of_week”, “transaction_hour”, ‘gender’, ‘Population_group’,’age_group’, ‘dist_range_km’]
dummy = pd.get_dummies(X_train[cat_cols], drop_first=True)

Adding the results to the master dataframe
X_train = pd.concat([X_train, dummy], axis=1)
X_train.drop(cat_cols, axis=1, inplace=True)
X_train.head()

Train data table after editing

and drop columns age and distance

X_train.drop([‘age’,’distance’], axis=1, inplace=True)
X_train.head()

Train data table after dropping columns age and distance

Let’s scale the numerical variables of train data

scaler = MinMaxScaler()

scale_var = [“amt”]

X_train[scale_var] = scaler.fit_transform(X_train[scale_var]) # Scaling of train set
X_train.describe() # Check if scaling is proper

Descriptive statistics of train data after MinMaxScaler()

Dummy variables creation for X_test:

dummy1 = pd.get_dummies(X_test[cat_cols], drop_first=True)

Adding the results to the master dataframe
X_test = pd.concat([X_test, dummy1], axis=1)

while removing columns age and distance

X_test.drop(cat_cols, axis=1, inplace=True)
X_test.drop([‘age’,’distance’], axis=1, inplace=True)
X_test.head()

Test data after removing columns age and distance

X_test[scale_var] = scaler.transform(X_test[scale_var]) #applying scaler transform

Let’s check train data heatmap for correlation

plt.figure(figsize=(20,20))
sns.heatmap(X_train.corr())
plt.show()

Heatmap train data correlations
Heatmap train data correlations

Let’s begin the feature selection process by running random forest

rf = RandomForestClassifier(n_estimators = 25).fit(X_train, y_train)

feats = X_train.columns

for feature in zip(feats, rf.feature_importances_):
print(feature)

('amt', 0.44884653633977767)
('category_food_dining', 0.002372802581305735)
('category_gas_transport', 0.024489745855491785)
('category_grocery_net', 0.002141319843316232)
('category_grocery_pos', 0.046934143613396484)
('category_health_fitness', 0.0019287492434254044)
('category_home', 0.0031504986728694293)
('category_kids_pets', 0.0024307865687283802)
('category_misc_net', 0.01036166861243191)
('category_misc_pos', 0.010701233251879962)
('category_personal_care', 0.0027006945386455128)
('category_shopping_net', 0.009838187929644734)
('category_shopping_pos', 0.008626217675562224)
('category_travel', 0.008089480453702001)
('state_AL', 0.0028451917910117964)
('state_AR', 0.0025863113237657065)
('state_AZ', 0.0007275615509002449)
('state_CA', 0.003985682357137098)
('state_CO', 0.001432823417517122)
('state_CT', 0.0009176589468720602)
('state_DC', 0.00036086393446330936)
('state_DE', 0.0004957902299982193)
('state_FL', 0.0034067591286467817)
('state_GA', 0.00191173441492063)
('state_HI', 0.00032649560874954046)
('state_IA', 0.002522684148010048)
('state_ID', 0.0005130031762308228)
('state_IL', 0.0031306355654785725)
('state_IN', 0.001718640762194997)
('state_KS', 0.0021523696300904493)
('state_KY', 0.0018749886942408573)
('state_LA', 0.0013180421160900225)
('state_MA', 0.0011769883432360799)
('state_MD', 0.0018891607272615305)
('state_ME', 0.0014708303814694452)
('state_MI', 0.002714156063496067)
('state_MN', 0.0029512062020218364)
('state_MO', 0.0029553796181859817)
('state_MS', 0.0017194170527638957)
('state_MT', 0.0010971013578513307)
('state_NC', 0.0021431590494776667)
('state_ND', 0.0012864115515251978)
('state_NE', 0.0025524290660134535)
('state_NH', 0.0009430691355806164)
('state_NJ', 0.0017198456851276757)
('state_NM', 0.0017526919926552512)
('state_NV', 0.0006523200851623981)
('state_NY', 0.0050968284355943924)
('state_OH', 0.003109290933850257)
('state_OK', 0.002587653106086186)
('state_OR', 0.0025336746622146985)
('state_PA', 0.004186466351887547)
('state_RI', 0.00020523285496891944)
('state_SC', 0.00275344758560034)
('state_SD', 0.001051836957874794)
('state_TN', 0.001925047648587341)
('state_TX', 0.00471032610213278)
('state_UT', 0.0011421486162993275)
('state_VA', 0.0022731719234685287)
('state_VT', 0.0013083357103399159)
('state_WA', 0.0016483046455105601)
('state_WI', 0.0021518745062484045)
('state_WV', 0.0019238414922494978)
('state_WY', 0.0018998829319713622)
('month_10', 0.005402371161650254)
('month_11', 0.004914005979080988)
('month_12', 0.005258891376983022)
('month_2', 0.004705495331029598)
('month_3', 0.005734176419160778)
('month_4', 0.004743138624643532)
('month_5', 0.0051636266445033935)
('month_6', 0.005446093194120895)
('month_7', 0.0046133826098524134)
('month_8', 0.006074684297321303)
('month_9', 0.0054900987255024755)
('day_of_week_Monday', 0.008555357268766122)
('day_of_week_Saturday', 0.008159321126532816)
('day_of_week_Sunday', 0.00815564259496755)
('day_of_week_Thursday', 0.007204152924990988)
('day_of_week_Tuesday', 0.00764217782147093)
('day_of_week_Wednesday', 0.0063330896189024)
('transaction_hour_1', 0.004890282089502399)
('transaction_hour_10', 0.0016521138538497192)
('transaction_hour_11', 0.0017717968946582864)
('transaction_hour_12', 0.001695621350767716)
('transaction_hour_13', 0.0016548576008602997)
('transaction_hour_14', 0.0021788577694004784)
('transaction_hour_15', 0.0015478959923511215)
('transaction_hour_16', 0.0022681461261766264)
('transaction_hour_17', 0.0016246963882984827)
('transaction_hour_18', 0.0017728139510987259)
('transaction_hour_19', 0.0016305237953596094)
('transaction_hour_2', 0.0034283266378819675)
('transaction_hour_20', 0.0016024313734906263)
('transaction_hour_21', 0.0016571170671093698)
('transaction_hour_22', 0.024470521644854196)
('transaction_hour_23', 0.023721758601067)
('transaction_hour_3', 0.0034593466446824765)
('transaction_hour_4', 0.0020262122602349194)
('transaction_hour_5', 0.0015626607483672133)
('transaction_hour_6', 0.0015595754359422134)
('transaction_hour_7', 0.0016255074240301667)
('transaction_hour_8', 0.0016137106185135032)
('transaction_hour_9', 0.0016202468711129857)
('gender_M', 0.01666026401486715)
('Population_group_5-10lac', 0.002367474653793451)
('Population_group_10-15lac', 0.0010827374314643264)
('Population_group_15-20', 0.0011892716291091788)
('Population_group_20-25lac', 0.0002028015406884771)
('Population_group_25-30lac', 0.0009669303189092789)
('age_group_25-40', 0.014185096524970091)
('age_group_40-60', 0.012718165264989536)
('age_group_60-80', 0.022788615963389207)
('age_group_80+', 0.008995927397751997)
('dist_range_km_25-50', 0.0031651607722741663)
('dist_range_km_50-100', 0.007565177041413791)
('dist_range_km_100-150', 0.009485275311664227)
('dist_range_km_150-200', 0.010142872717989861)
('dist_range_km_200-250', 0.008731533690317395)
('dist_range_km_250-300', 0.002751136114109341)
('dist_range_km_300+', 0.0)

imp_df = pd.DataFrame({
“Varname”: X_train.columns,
“Imp”: rf.feature_importances_
})
imp_df.sort_values(by=”Imp”, ascending=False)

Feature importance table

Let’s group features of interest into a single list

cols_for_model = [‘amt’, ‘category_grocery_pos’, ‘transaction_hour_22’, ‘transaction_hour_23’, ‘category_gas_transport’,
‘age_group_60-80’, ‘gender_M’, ‘age_group_25-40’, ‘age_group_40-60’, ‘category_misc_net’, ‘dist_range_km_150-200’,
‘category_misc_pos’, ‘category_shopping_net’, ‘dist_range_km_100-150’, ‘day_of_week_Sunday’, ‘dist_range_km_200-250’,
‘category_shopping_pos’, ‘age_group_80+’, ‘day_of_week_Saturday’]

and create corresponding test and train subsets

X_train = X_train[cols_for_model]
X_test = X_test[cols_for_model]
X_train.columns

Index(['amt', 'category_grocery_pos', 'transaction_hour_22', 'transaction_hour_23', 'category_gas_transport', 'age_group_60-80', 'gender_M', 'age_group_25-40', 'age_group_40-60', 'category_misc_net', 'dist_range_km_150-200', 'category_misc_pos', 'category_shopping_net', 'dist_range_km_100-150', 'day_of_week_Sunday', 'dist_range_km_200-250', 'category_shopping_pos', 'age_group_80+', 'day_of_week_Saturday'], dtype='object')

Data Resampling and Base Model Testing

Let’s check the training and testing data shape
print(f”train data shape:{X_train.shape}”)
print(f”Test data shape:{X_test.shape}”)

train data shape:(1296675, 19)
Test data shape:(555719, 19)

Let’s check normalized value counts for both train and test data:

print(y_train.value_counts())
y_train.value_counts(normalize = True).reset_index()

0    1289919
1       6756
Name: is_fraud, dtype: int64
Percentage of fraud vs no fraud in the training dataset

print(y_test.value_counts())
y_test.value_counts(normalize = True).reset_index()

0    552824
1      2895
Name: is_fraud, dtype: int64
Percentage of fraud vs no fraud in the test dataset

Let’s look at Logistic regression – Base model

lreg = LogisticRegression()
lreg.fit(X_train, y_train)

LogisticRegression()

y_pred = lreg.predict(X_test)

print (‘Accuracy: ‘, accuracy_score(y_test, y_pred))
print (‘F1 score: ‘, f1_score(y_test, y_pred))
print (‘Recall: ‘, recall_score(y_test, y_pred))
print (‘Precision: ‘, precision_score(y_test, y_pred))
print (‘\n clasification report:\n’, classification_report(y_test,y_pred))
print (‘\n confussion matrix:\n’,confusion_matrix(y_test, y_pred))

Accuracy:  0.9959907795126673
F1 score:  0.42577319587628865
Recall:  0.2853195164075993
Precision:  0.8385786802030457

 clasification report:
               precision    recall  f1-score   support

           0       1.00      1.00      1.00    552824
           1       0.84      0.29      0.43      2895

    accuracy                           1.00    555719
   macro avg       0.92      0.64      0.71    555719
weighted avg       1.00      1.00      1.00    555719


 confussion matrix:
 [[552665    159]
 [  2069    826]]

We can handle imbalanced classes by balancing the classes by increasing minority or decreasing majority. We can do that by using the following techniques:

Random Under-Sampling;
Random Over-Sampling;
SMOTE – Synthetic Minority Oversampling Technique;
ADASYN – Adaptive Synthetic Sampling Method;
SMOTETomek – Over-sampling followed by under-sampling.
Unddersampling leads to the loss of data so we will not use that.
We will proceed with Random Over-Sampling, SMOTE – Synthetic Minority Oversampling Technique,
ADASYN – Adaptive Synthetic Sampling Method and see which technique works better.

We begin with RandomOverSampler

over_sample = RandomOverSampler(sampling_strategy = 1)
X_resampled_os, y_resampled_os = over_sample.fit_resample(X_train, y_train)
len(X_resampled_os)

2579838

print(sorted(Counter(y_resampled_os).items()))

[(0, 1289919), (1, 1289919)]

Let’s apply LogisticRegression

lreg_os = LogisticRegression()
lreg_os.fit(X_resampled_os, y_resampled_os)

y_pred_os = lreg_os.predict(X_test)

print (‘Accuracy: ‘, accuracy_score(y_test, y_pred_os))
print (‘F1 score: ‘, f1_score(y_test, y_pred_os))
print (‘Recall: ‘, recall_score(y_test, y_pred_os))
print (‘Precision: ‘, precision_score(y_test, y_pred_os))
print (‘\n clasification report:\n’, classification_report(y_test,y_pred_os))
print (‘\n confussion matrix:\n’,confusion_matrix(y_test, y_pred_os))

Accuracy:  0.7639598430141852
F1 score:  0.035443261368315784
Recall:  0.8324697754749568
Precision:  0.01810709482557834

 clasification report:
               precision    recall  f1-score   support

           0       1.00      0.76      0.87    552824
           1       0.02      0.83      0.04      2895

    accuracy                           0.76    555719
   macro avg       0.51      0.80      0.45    555719
weighted avg       0.99      0.76      0.86    555719


 confussion matrix:
 [[422137 130687]
 [   485   2410]]

Let’s apply RandomOverSampler

from imblearn.over_sampling import RandomOverSampler
over_sample = RandomOverSampler(sampling_strategy = 1)
X_resampled_os, y_resampled_os = over_sample.fit_resample(X_train, y_train)
len(X_resampled_os)

2579838

from collections import Counter
print(sorted(Counter(y_resampled_os).items()))

[(0, 1289919), (1, 1289919)]

Let’s run LogisticRegression

lreg_os = LogisticRegression()
lreg_os.fit(X_resampled_os, y_resampled_os)

y_pred_os = lreg_os.predict(X_test)

and print the classification summary

print (‘Accuracy: ‘, accuracy_score(y_test, y_pred_os))
print (‘F1 score: ‘, f1_score(y_test, y_pred_os))
print (‘Recall: ‘, recall_score(y_test, y_pred_os))
print (‘Precision: ‘, precision_score(y_test, y_pred_os))
print (‘\n clasification report:\n’, classification_report(y_test,y_pred_os))
print (‘\n confussion matrix:\n’,confusion_matrix(y_test, y_pred_os))

Accuracy:  0.7639598430141852
F1 score:  0.035443261368315784
Recall:  0.8324697754749568
Precision:  0.01810709482557834

 clasification report:
               precision    recall  f1-score   support

           0       1.00      0.76      0.87    552824
           1       0.02      0.83      0.04      2895

    accuracy                           0.76    555719
   macro avg       0.51      0.80      0.45    555719
weighted avg       0.99      0.76      0.86    555719


 confussion matrix:
 [[422137 130687]
 [   485   2410]]

Let’s apply the SMOTE resampling

from imblearn.over_sampling import SMOTE
smt = SMOTE(random_state=45, k_neighbors=5)
X_resampled_smt, y_resampled_smt = smt.fit_resample(X_train, y_train)
len(X_resampled_smt)

2579838

print(sorted(Counter(y_resampled_smt).items()))

[(0, 1289919), (1, 1289919)]

Let’s apply LogisticRegression

lreg_smt = LogisticRegression()
lreg_smt.fit(X_resampled_smt, y_resampled_smt)

y_pred_smt = lreg_smt.predict(X_test)

and print the classification summary

print (‘Accuracy: ‘, accuracy_score(y_test, y_pred_smt))
print (‘F1 score: ‘, f1_score(y_test, y_pred_smt))
print (‘Recall: ‘, recall_score(y_test, y_pred_smt))
print (‘Precision: ‘, precision_score(y_test, y_pred_smt))
print (‘\n clasification report:\n’, classification_report(y_test,y_pred_smt))
print (‘\n confussion matrix:\n’,confusion_matrix(y_test, y_pred_smt))

Accuracy:  0.7686168729159881
F1 score:  0.03591404621590415
Recall:  0.8272884283246977
Precision:  0.018355444171092666

 clasification report:
               precision    recall  f1-score   support

           0       1.00      0.77      0.87    552824
           1       0.02      0.83      0.04      2895

    accuracy                           0.77    555719
   macro avg       0.51      0.80      0.45    555719
weighted avg       0.99      0.77      0.86    555719


 confussion matrix:
 [[424740 128084]
 [   500   2395]]

Let’s apply the ADASYN resampling

from imblearn.over_sampling import ADASYN
ada = ADASYN(random_state=45, n_neighbors=5)
X_resampled_ada, y_resampled_ada = ada.fit_resample(X_train, y_train)
len(X_resampled_ada)

2579654

print(sorted(Counter(y_resampled_ada).items()))

[(0, 1289919), (1, 1289735)]

Let’s apply LogisticRegression

lreg_ada = LogisticRegression()
lreg_ada.fit(X_resampled_ada, y_resampled_ada)

y_pred_ada = lreg_ada.predict(X_test)

and print the classification summary

print (‘Accuracy: ‘, accuracy_score(y_test, y_pred_ada))
print (‘F1 score: ‘, f1_score(y_test, y_pred_ada))
print (‘Recall: ‘, recall_score(y_test, y_pred_ada))
print (‘Precision: ‘, precision_score(y_test, y_pred_ada))
print (‘\n clasification report:\n’, classification_report(y_test,y_pred_ada))
print (‘\n confussion matrix:\n’,confusion_matrix(y_test, y_pred_ada))

Accuracy:  0.843919318936369
F1 score:  0.044736175508540844
Recall:  0.7015544041450777
Precision:  0.02310475063705861

 clasification report:
               precision    recall  f1-score   support

           0       1.00      0.84      0.92    552824
           1       0.02      0.70      0.04      2895

    accuracy                           0.84    555719
   macro avg       0.51      0.77      0.48    555719
weighted avg       0.99      0.84      0.91    555719


 confussion matrix:
 [[466951  85873]
 [   864   2031]]

Let’s apply DecisionTreeClassifier before resampling

from sklearn.tree import DecisionTreeClassifier
dtc = DecisionTreeClassifier(random_state=0)
dtc.fit(X_train, y_train)

y_pred_dtc = dtc.predict(X_test)

and print the classification report

print (‘Accuracy: ‘, accuracy_score(y_test, y_pred_dtc))
print (‘F1 score: ‘, f1_score(y_test, y_pred_dtc))
print (‘Recall: ‘, recall_score(y_test, y_pred_dtc))
print (‘Precision: ‘, precision_score(y_test, y_pred_dtc))
print (‘\n clasification report:\n’, classification_report(y_test,y_pred_dtc))
print (‘\n confussion matrix:\n’,confusion_matrix(y_test, y_pred_dtc))

Accuracy:  0.9965882037504566
F1 score:  0.6757865937072504
Recall:  0.6825561312607945
Precision:  0.6691500169319337

 clasification report:
               precision    recall  f1-score   support

           0       1.00      1.00      1.00    552824
           1       0.67      0.68      0.68      2895

    accuracy                           1.00    555719
   macro avg       0.83      0.84      0.84    555719
weighted avg       1.00      1.00      1.00    555719


 confussion matrix:
 [[551847    977]
 [   919   1976]]

Let’s apply DecisionTreeClassifier after Random Over-Sampling

from sklearn.tree import DecisionTreeClassifier
dtc_os = DecisionTreeClassifier(random_state=0)
dtc_os.fit(X_resampled_os, y_resampled_os)

y_pred_dtc_os = dtc_os.predict(X_test)

and print the classification report

print (‘Accuracy: ‘, accuracy_score(y_test, y_pred_dtc_os))
print (‘F1 score: ‘, f1_score(y_test, y_pred_dtc_os))
print (‘Recall: ‘, recall_score(y_test, y_pred_dtc_os))
print (‘Precision: ‘, precision_score(y_test, y_pred_dtc_os))
print (‘\n clasification report:\n’, classification_report(y_test,y_pred_dtc_os))
print (‘\n confussion matrix:\n’,confusion_matrix(y_test, y_pred_dtc_os))

Accuracy:  0.9964532434557754
F1 score:  0.6650807136788445
Recall:  0.6759930915371329
Precision:  0.6545150501672241

 clasification report:
               precision    recall  f1-score   support

           0       1.00      1.00      1.00    552824
           1       0.65      0.68      0.67      2895

    accuracy                           1.00    555719
   macro avg       0.83      0.84      0.83    555719
weighted avg       1.00      1.00      1.00    555719


 confussion matrix:
 [[551791   1033]
 [   938   1957]]

Let’s apply DecisionTreeClassifier after SMOTE resampling

from sklearn.tree import DecisionTreeClassifier
dtc_smt = DecisionTreeClassifier(random_state=0)
dtc_smt.fit(X_resampled_smt, y_resampled_smt)

y_pred_dtc_smt = dtc_smt.predict(X_test)

and print the classification report

print (‘Accuracy: ‘, accuracy_score(y_test, y_pred_dtc_smt))
print (‘F1 score: ‘, f1_score(y_test, y_pred_dtc_smt))
print (‘Recall: ‘, recall_score(y_test, y_pred_dtc_smt))
print (‘Precision: ‘, precision_score(y_test, y_pred_dtc_smt))
print (‘\n clasification report:\n’, classification_report(y_test,y_pred_dtc_smt))
print (‘\n confussion matrix:\n’,confusion_matrix(y_test, y_pred_dtc_smt))

Accuracy:  0.9538579749837598
F1 score:  0.16037982973149964
Recall:  0.8459412780656304
Precision:  0.08858744800144691

 clasification report:
               precision    recall  f1-score   support

           0       1.00      0.95      0.98    552824
           1       0.09      0.85      0.16      2895

    accuracy                           0.95    555719
   macro avg       0.54      0.90      0.57    555719
weighted avg       0.99      0.95      0.97    555719


 confussion matrix:
 [[527628  25196]
 [   446   2449]]

Let’s apply DecisionTreeClassifier after ADASYN resampling

from sklearn.tree import DecisionTreeClassifier
dtc_ada = DecisionTreeClassifier(random_state=0)
dtc_ada.fit(X_resampled_ada, y_resampled_ada)

y_pred_dtc_ada = dtc_ada.predict(X_test)

and print the classification report

print (‘Accuracy: ‘, accuracy_score(y_test, y_pred_dtc_ada))
print (‘F1 score: ‘, f1_score(y_test, y_pred_dtc_ada))
print (‘Recall: ‘, recall_score(y_test, y_pred_dtc_ada))
print (‘Precision: ‘, precision_score(y_test, y_pred_dtc_ada))
print (‘\n clasification report:\n’, classification_report(y_test,y_pred_dtc_ada))
print (‘\n confussion matrix:\n’,confusion_matrix(y_test, y_pred_dtc_ada))

Accuracy:  0.934169967195651
F1 score:  0.11964865840452413
Recall:  0.8587219343696028
Precision:  0.06430419037765132

 clasification report:
               precision    recall  f1-score   support

           0       1.00      0.93      0.97    552824
           1       0.06      0.86      0.12      2895

    accuracy                           0.93    555719
   macro avg       0.53      0.90      0.54    555719
weighted avg       0.99      0.93      0.96    555719


 confussion matrix:
 [[516650  36174]
 [   409   2486]]

Let’s apply RandomForestClassifier before resampling (base model)

from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier()
rf.fit(X_train, y_train)

y_pred_rf = rf.predict(X_test)

and print the classification report

print (‘Accuracy: ‘, accuracy_score(y_test, y_pred_rf))
print (‘F1 score: ‘, f1_score(y_test, y_pred_rf))
print (‘Recall: ‘, recall_score(y_test, y_pred_rf))
print (‘Precision: ‘, precision_score(y_test, y_pred_rf))
print (‘\n clasification report:\n’, classification_report(y_test,y_pred_rf))
print (‘\n confussion matrix:\n’,confusion_matrix(y_test, y_pred_rf))

Accuracy:  0.9971766306352671
F1 score:  0.7162235485621269
Recall:  0.6839378238341969
Precision:  0.7517084282460137

 clasification report:
               precision    recall  f1-score   support

           0       1.00      1.00      1.00    552824
           1       0.75      0.68      0.72      2895

    accuracy                           1.00    555719
   macro avg       0.88      0.84      0.86    555719
weighted avg       1.00      1.00      1.00    555719


 confussion matrix:
 [[552170    654]
 [   915   1980]]

Let’s apply RandomForestClassifier after Random Over-Sampling

rf_os = RandomForestClassifier()
rf_os.fit(X_resampled_os, y_resampled_os)

y_pred_rf_os = rf_os.predict(X_test)

and print the classification report

print (‘Accuracy: ‘, accuracy_score(y_test, y_pred_rf_os))
print (‘F1 score: ‘, f1_score(y_test, y_pred_rf_os))
print (‘Recall: ‘, recall_score(y_test, y_pred_rf_os))
print (‘Precision: ‘, precision_score(y_test, y_pred_rf_os))
print (‘\n clasification report:\n’, classification_report(y_test,y_pred_rf_os))
print (‘\n confussion matrix:\n’,confusion_matrix(y_test, y_pred_rf_os))

Accuracy:  0.9966781772802441
F1 score:  0.6860544217687076
Recall:  0.6967184801381693
Precision:  0.6757118927973199

 clasification report:
               precision    recall  f1-score   support

           0       1.00      1.00      1.00    552824
           1       0.68      0.70      0.69      2895

    accuracy                           1.00    555719
   macro avg       0.84      0.85      0.84    555719
weighted avg       1.00      1.00      1.00    555719


 confussion matrix:
 [[551856    968]
 [   878   2017]]

Let’s apply RandomForestClassifier after SMOTE resampling

rf_smt = RandomForestClassifier()
rf_smt.fit(X_resampled_smt, y_resampled_smt)

y_pred_rf_smt = rf_smt.predict(X_test)

and print the classification summary

print (‘Accuracy: ‘, accuracy_score(y_test, y_pred_rf_smt))
print (‘F1 score: ‘, f1_score(y_test, y_pred_rf_smt))
print (‘Recall: ‘, recall_score(y_test, y_pred_rf_smt))
print (‘Precision: ‘, precision_score(y_test, y_pred_rf_smt))
print (‘\n clasification report:\n’, classification_report(y_test,y_pred_rf_smt))
print (‘\n confussion matrix:\n’,confusion_matrix(y_test, y_pred_rf_smt))

Accuracy:  0.9540793098670372
F1 score:  0.15903773274015487
Recall:  0.8335060449050087
Precision:  0.08790528233151183

 clasification report:
               precision    recall  f1-score   support

           0       1.00      0.95      0.98    552824
           1       0.09      0.83      0.16      2895

    accuracy                           0.95    555719
   macro avg       0.54      0.89      0.57    555719
weighted avg       0.99      0.95      0.97    555719


 confussion matrix:
 [[527787  25037]
 [   482   2413]]

Let’s apply RandomForestClassifier after ADASYN resampling

rf_ada = RandomForestClassifier()
rf_ada.fit(X_resampled_ada, y_resampled_ada)

y_pred_rf_ada = rf_ada.predict(X_test)

and print the classification summary

print (‘Accuracy: ‘, accuracy_score(y_test, y_pred_rf_ada))
print (‘F1 score: ‘, f1_score(y_test, y_pred_rf_ada))
print (‘Recall: ‘, recall_score(y_test, y_pred_rf_ada))
print (‘Precision: ‘, precision_score(y_test, y_pred_rf_ada))
print (‘\n clasification report:\n’, classification_report(y_test,y_pred_rf_ada))
print (‘\n confussion matrix:\n’,confusion_matrix(y_test, y_pred_rf_ada))

Accuracy:  0.9337614873704156
F1 score:  0.11684261036468328
Recall:  0.8411053540587219
Precision:  0.06278200335181126

 clasification report:
               precision    recall  f1-score   support

           0       1.00      0.93      0.97    552824
           1       0.06      0.84      0.12      2895

    accuracy                           0.93    555719
   macro avg       0.53      0.89      0.54    555719
weighted avg       0.99      0.93      0.96    555719


 confussion matrix:
 [[516474  36350]
 [   460   2435]]

Let’s plot the confusion matrix

cf_matrix=confusion_matrix(y_test, y_pred_rf_ada)
sns.heatmap(cf_matrix, annot=True)

Confusion matrix RandomForestClassifier after ADASYN resampling

sns.heatmap(cf_matrix/np.sum(cf_matrix), annot=True,
fmt=’.2%’, cmap=’Blues’)

Normalized confusion matrix RandomForestClassifier after ADASYN resampling

group_names = [“True Neg”,”False Pos”,”False Neg”,”True Pos”]
group_counts = [“{0:0.0f}”.format(value) for value in
cf_matrix.flatten()]
group_percentages = [“{0:.2%}”.format(value) for value in
cf_matrix.flatten()/np.sum(cf_matrix)]
labels = [f”{v1}\n{v2}\n{v3}” for v1, v2, v3 in
zip(group_names,group_counts,group_percentages)]
labels = np.asarray(labels).reshape(2,2)
sns.heatmap(cf_matrix, annot=labels, fmt=”, cmap=’Blues’)

Normalized confusion matrix RandomForestClassifier after ADASYN resampling

Let’s look at XGBoost

from xgboost import XGBClassifier

Let’s apply XGBClassifier after Random Over-Sampling

xgb_os = XGBClassifier()
xgb_os.fit(X_resampled_os, y_resampled_os)

y_pred_xgb_os = xgb_os.predict(X_test)

and print the classification report

print (‘Accuracy: ‘, accuracy_score(y_test, y_pred_xgb_os))
print (‘F1 score: ‘, f1_score(y_test, y_pred_xgb_os))
print (‘Recall: ‘, recall_score(y_test, y_pred_xgb_os))
print (‘Precision: ‘, precision_score(y_test, y_pred_xgb_os))
print (‘\n clasification report:\n’, classification_report(y_test,y_pred_xgb_os))
print (‘\n confussion matrix:\n’,confusion_matrix(y_test, y_pred_xgb_os))

Accuracy:  0.9771449239633699
F1 score:  0.29301419426662956
Recall:  0.909153713298791
Precision:  0.17465162574651627

 clasification report:
               precision    recall  f1-score   support

           0       1.00      0.98      0.99    552824
           1       0.17      0.91      0.29      2895

    accuracy                           0.98    555719
   macro avg       0.59      0.94      0.64    555719
weighted avg       1.00      0.98      0.98    555719


 confussion matrix:
 [[540386  12438]
 [   263   2632]]

Let’s apply XGBClassifier after SMOTE resampling

xgb_smt = XGBClassifier()
xgb_smt.fit(X_resampled_smt, y_resampled_smt)

y_pred_xgb_smt = xgb_smt.predict(X_test)

and print the classification report

print (‘Accuracy: ‘, accuracy_score(y_test, y_pred_xgb_smt))
print (‘F1 score: ‘, f1_score(y_test, y_pred_xgb_smt))
print (‘Recall: ‘, recall_score(y_test, y_pred_xgb_smt))
print (‘Precision: ‘, precision_score(y_test, y_pred_xgb_smt))
print (‘\n clasification report:\n’, classification_report(y_test,y_pred_xgb_smt))
print (‘\n confussion matrix:\n’,confusion_matrix(y_test, y_pred_xgb_smt))

Accuracy:  0.9660457893287795
F1 score:  0.21954750382595029
Recall:  0.9167530224525043
Precision:  0.12470632459355324

 clasification report:
               precision    recall  f1-score   support

           0       1.00      0.97      0.98    552824
           1       0.12      0.92      0.22      2895

    accuracy                           0.97    555719
   macro avg       0.56      0.94      0.60    555719
weighted avg       0.99      0.97      0.98    555719


 confussion matrix:
 [[534196  18628]
 [   241   2654]]

Let’s apply XGBClassifier after ADASYN resampling

xgb_ada = XGBClassifier()
xgb_ada.fit(X_resampled_ada, y_resampled_ada)

y_pred_xgb_ada = xgb_ada.predict(X_test)

and print the classification summary

print (‘Accuracy: ‘, accuracy_score(y_test, y_pred_xgb_ada))
print (‘F1 score: ‘, f1_score(y_test, y_pred_xgb_ada))
print (‘Recall: ‘, recall_score(y_test, y_pred_xgb_ada))
print (‘Precision: ‘, precision_score(y_test, y_pred_xgb_ada))
print (‘\n clasification report:\n’, classification_report(y_test,y_pred_xgb_ada))
print (‘\n confussion matrix:\n’,confusion_matrix(y_test, y_pred_xgb_ada))

Accuracy:  0.9230924262082095
F1 score:  0.11335394062610211
Recall:  0.9436960276338515
Precision:  0.06029840204820341

 clasification report:
               precision    recall  f1-score   support

           0       1.00      0.92      0.96    552824
           1       0.06      0.94      0.11      2895

    accuracy                           0.92    555719
   macro avg       0.53      0.93      0.54    555719
weighted avg       0.99      0.92      0.96    555719


 confussion matrix:
 [[510248  42576]
 [   163   2732]]

Hyper-Parameter Optimization (HPO)

The objective of HPO is to resolve a trade-off between ACCURACY and RECALL for the following models :

* Logistic Regression SMOTE model
* Decision Tree SMOTE model
* XGBoost ADASYN model
* Random Forest ADASYN model.

Let’s begin with Logistic Regression SMOTE by applying the Recursive Feature Elimination (RFE)
from sklearn.feature_selection import RFECV
from sklearn.feature_selection import RFE
rfe = RFE(estimator=lreg_smt, n_features_to_select=10) # Selecting 10 features that are important
rfe.fit(X_resampled_smt, y_resampled_smt)

RFE(estimator=LogisticRegression(), n_features_to_select=10)

rfe.ranking_

array([ 1,  1,  1,  1,  1, 10,  4,  5,  9,  1,  7,  1,  1,  6,  1,  8,  3,
        1,  2])

X_resampled_ada.columns[rfe.support_]

Index(['amt', 'category_grocery_pos', 'transaction_hour_22', 'transaction_hour_23', 'category_gas_transport', 'category_misc_net', 'category_misc_pos', 'category_shopping_net', 'day_of_week_Sunday', 'age_group_80+'], dtype='object')

X_resampled_ada2 = X_resampled_ada.loc[:,rfe.support_]
X_resampled_ada2.shape

(2579654, 10)

Let’s look at top 10 important features
X_resampled_ada2.columns

Index(['amt', 'category_grocery_pos', 'transaction_hour_22', 'transaction_hour_23', 'category_gas_transport', 'category_misc_net', 'category_misc_pos', 'category_shopping_net', 'day_of_week_Sunday', 'age_group_80+'], dtype='object')

Let’s check the cross-validation score

from sklearn.model_selection import cross_val_score
cross_val_score(lreg_ada, X_resampled_ada2, y_resampled_ada, n_jobs=-1)

array([0.82018138, 0.81259897, 0.82518399, 0.80935435, 0.81977982])

Let’s look at cross validation for feature selection – Logistic Regression SMOTE
num_features = X_resampled_smt.shape
num_features[1]

19

Let’s run the estimator RFECV

rfecv = RFECV(estimator=lreg_smt, cv=5)
rfecv.fit(X_resampled_smt, y_resampled_smt)

RFECV(cv=5, estimator=LogisticRegression())

rfecv.grid_scores_

array([[0.80675546, 0.80701323, 0.80670119, 0.80656515, 0.80697603],
       [0.83262334, 0.83306523, 0.83312143, 0.83283815, 0.83287497],
       [0.85665972, 0.85629535, 0.85768691, 0.85693853, 0.85693853],
       [0.8118488 , 0.81180422, 0.81217827, 0.81202286, 0.81247056],
       [0.79551833, 0.79550282, 0.79591951, 0.79603928, 0.79595594],
       [0.79128551, 0.79193283, 0.79170801, 0.79163978, 0.79170955],
       [0.78841905, 0.78890358, 0.788859  , 0.78862408, 0.78878688],
       [0.78213378, 0.78211246, 0.7826997 , 0.78235817, 0.78264308],
       [0.79002574, 0.79047538, 0.79081455, 0.79066684, 0.79012417],
       [0.7892505 , 0.78962067, 0.78992302, 0.78961445, 0.78943808],
       [0.79136303, 0.79152583, 0.79215378, 0.79207391, 0.79155062],
       [0.7895509 , 0.79009745, 0.79040173, 0.79005053, 0.79006409],
       [0.79205687, 0.79224487, 0.79314996, 0.79285885, 0.79213593],
       [0.79193477, 0.79223905, 0.79279335, 0.79270961, 0.79194018],
       [0.7918398 , 0.79209176, 0.79280692, 0.79250223, 0.79215337],
       [0.79165762, 0.79209176, 0.79261698, 0.79230261, 0.79190723],
       [0.7918398 , 0.79234953, 0.79274878, 0.79250223, 0.79186847],
       [0.79195997, 0.79235534, 0.79298522, 0.79270574, 0.79211461],
       [0.79194834, 0.79235922, 0.79297747, 0.79269992, 0.79213399]])

Let’s plot the scores

plt.figure(figsize=[10, 5])
plt.plot(range(1, num_features[1]+1), rfecv.grid_scores_)
plt.show()

RFECV resampling scores

rfecv.n_features_

3

rfecv.support_

array([ True, False,  True,  True, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False])

Let’s apply HPO using Cross Validation for Decision Tree SMOTE

params = {
“max_depth”: [2,3,5,10,20],
“min_samples_leaf”: [5,10,20,50,100]
}
model_rcv_dt = RandomizedSearchCV(estimator=dtc_smt,
param_distributions=params,
verbose=1,
cv=5,
return_train_score=True,
n_jobs=-1,
n_iter=20)
model_rcv_dt.fit(X_resampled_smt, y_resampled_smt)

Fitting 5 folds for each of 20 candidates, totalling 100 fits
RandomizedSearchCV(cv=5, estimator=DecisionTreeClassifier(random_state=0),
                   n_iter=20, n_jobs=-1,
                   param_distributions={'max_depth': [2, 3, 5, 10, 20],
                                        'min_samples_leaf': [5, 10, 20, 50,
                                                             100]},
                   return_train_score=True, verbose=1)

Let’s check the score

model_rcv_dt.best_score_

0.9548638326370502

and choose the best model

dt_best = model_rcv_dt.best_estimator_
dt_best

DecisionTreeClassifier(max_depth=20, min_samples_leaf=20, random_state=0)

Let’s plot the ROC curve

from sklearn.metrics import plot_roc_curve
plot_roc_curve(dt_best, X_resampled_smt, y_resampled_smt)
plt.show()

DecisionTreeClassifier ROC curve

Let’s apply HPO using Cross Validation for the XGBOOST ADASYN model

from sklearn.model_selection import RandomizedSearchCV
hyper_params = {‘max_depth’: range(3, 10),
‘max_features’: range(3, 10),
‘min_samples_leaf’: range(20, 200, 50),
‘n_estimators’: range(10, 51, 10)}

while performing RandomizedSearchCV for this model

model_rcv_xgb = RandomizedSearchCV(estimator=xgb_ada,
param_distributions=hyper_params,
verbose=1,
cv=5,
return_train_score=True,
n_jobs=-1,
n_iter=20)
model_rcv_xgb.fit(X_resampled_ada, y_resampled_ada)

Fitting 5 folds for each of 20 candidates, totalling 100 fits
[12:05:20] WARNING: C:/Users/Administrator/workspace/xgboost-win64_release_1.6.0/src/learner.cc:627: 
Parameters: { "max_features", "min_samples_leaf" } might not be used.
RandomizedSearchCV(cv=5,
                   estimator=XGBClassifier(base_score=0.5, booster='gbtree',
                                           callbacks=None, colsample_bylevel=1,
                                           colsample_bynode=1,
                                           colsample_bytree=1,
                                           early_stopping_rounds=None,
                                           enable_categorical=False,
                                           eval_metric=None, gamma=0, gpu_id=-1,
                                           grow_policy='depthwise',
                                           importance_type=None,
                                           interaction_constraints='',
                                           learning_rate=0.300000012,
                                           max_bin=256,...
                                           max_leaves=0, min_child_weight=1,
                                           missing=nan,
                                           monotone_constraints='()',
                                           n_estimators=100, n_jobs=0,
                                           num_parallel_tree=1,
                                           predictor='auto', random_state=0,
                                           reg_alpha=0, reg_lambda=1, ...),
                   n_iter=20, n_jobs=-1,
                   param_distributions={'max_depth': range(3, 10),
                                        'max_features': range(3, 10),
                                        'min_samples_leaf': range(20, 200, 50),
                                        'n_estimators': range(10, 51, 10)},
                   return_train_score=True, verbose=1)

And the score is

model_rcv_xgb.best_score_

0.9173245717058146

Displaying best values for hyperparameters

model_rcv_xgb.best_estimator_

XGBClassifier(base_score=0.5, booster='gbtree', callbacks=None,
              colsample_bylevel=1, colsample_bynode=1, colsample_bytree=1,
              early_stopping_rounds=None, enable_categorical=False,
              eval_metric=None, gamma=0, gpu_id=-1, grow_policy='depthwise',
              importance_type=None, interaction_constraints='',
              learning_rate=0.300000012, max_bin=256, max_cat_to_onehot=4,
              max_delta_step=0, max_depth=9, max_features=3, max_leaves=0,
              min_child_weight=1, min_samples_leaf=120, missing=nan,
              monotone_constraints='()', n_estimators=50, n_jobs=0,
              num_parallel_tree=1, predictor='auto', random_state=0, ...)

Let’s select the best model

xgb_best = model_rcv_xgb.best_estimator_

and predict the is_fraud variable for test data

y_pred_rcv_xgb = model_rcv_xgb.predict(X_test)

Performance Evaluation

Let’s plot various performance metrics using skplt

import scikitplot as skplt

Let’s check learning curves of our training examples

skplt.estimators.plot_learning_curve(XGBClassifier(), X_test, y_pred_rcv_xgb,
cv=7, shuffle=True, scoring=”accuracy”,
n_jobs=-1, figsize=(6,4), title_fontsize=”large”, text_fontsize=”large”,
title=”XGBClassifier Learning Curve”);

XGBClassifier learning curve

skplt.estimators.plot_learning_curve(DecisionTreeClassifier(), X_test, y_pred_dtc_os,
cv=7, shuffle=True, scoring=”accuracy”,
n_jobs=-1, figsize=(6,4), title_fontsize=”large”, text_fontsize=”large”,
title=”DecisionTreeClassifier Learning Curve”);

DecisionTreeClassifier learning curve

skplt.estimators.plot_learning_curve(RandomForestClassifier(), X_test, y_pred_rf_smt,
cv=7, shuffle=True, scoring=”accuracy”,
n_jobs=-1, figsize=(6,4), title_fontsize=”large”, text_fontsize=”large”,
title=”RandomForestClassifier() Learning Curve”);

RandomForestClassifier learning curve

Let’s plot the ROC curve for our test data

Y_test_probs = model_rcv_xgb.predict_proba(X_test)

skplt.metrics.plot_roc_curve(y_pred_rcv_xgb, Y_test_probs,
title=”XGB ROC Curve”, figsize=(12,6));

XGB ROC curve

Y_test_probs = dtc_os.predict_proba(X_test)

skplt.metrics.plot_roc_curve(y_pred_dtc_os, Y_test_probs,
title=”DTC ROC Curve”, figsize=(12,6));

DTC ROC curve

Let’s compare calibration curves

lr_probas = LogisticRegression().fit(X_train, y_train).predict_proba(X_test)
rf_probas = RandomForestClassifier().fit(X_train, y_train).predict_proba(X_test)
gb_probas = DecisionTreeClassifier().fit(X_train, y_train).predict_proba(X_test)
et_scores = XGBClassifier().fit(X_train, y_train).predict_proba(X_test)

probas_list = [lr_probas, rf_probas, gb_probas, et_scores]
clf_names = [‘LogisticRegression’, ‘RandomForestClassifier’, ‘DecisionTreeClassifier’, ‘XGBClassifier’]

skplt.metrics.plot_calibration_curve(y_test,
probas_list,
clf_names, n_bins=15,
figsize=(12,6)
);

Calibration plots: LR, RFC, DTC, and XGB

Let’s look at KS Statistic plot of the best model – XGB Classifier

Y_test_probs = model_rcv_xgb.predict_proba(X_test)
skplt.metrics.plot_ks_statistic(y_test, Y_test_probs, figsize=(10,6));

KS Statistic plot of the best model - XGB Classifier

Let’s plot the Cumulative Gains Curve

Y_test_probs = model_rcv_xgb.predict_proba(X_test)
skplt.metrics.plot_cumulative_gain(y_test, Y_test_probs, figsize=(10,6));

Cumulative Gains Curve of XGB

Let’s look at the lift curve

skplt.metrics.plot_lift_curve(y_test, Y_test_probs, figsize=(10,6));

XGB lift curve

Let’s plot the cluster elbow curve

skplt.cluster.plot_elbow_curve(KMeans(random_state=1),
X_test,
cluster_ranges=range(2, 20),
figsize=(8,6));

Kmeans elbow plot

Let’s check PCA

pca = PCA(random_state=1)
pca.fit(X_test)

skplt.decomposition.plot_pca_component_variance(pca, figsize=(8,6));

PCA component explained variance

Let’s look at the PCA 2-D projection

skplt.decomposition.plot_pca_2d_projection(pca, X_test, y_test,
figsize=(10,10),
cmap=”tab10″);

PCA 2-D projection

Let’s apply the silhouette analysis

kmeans = KMeans(n_clusters=10, random_state=1)
kmeans.fit(X_train, y_train)
cluster_labels = kmeans.predict(X_test)

skplt.metrics.plot_silhouette(X_test, cluster_labels,
figsize=(8,6));

Kmeans silhouette analysis

Let’s look at our evaluation metrics to check the accuracy and recall values

print (‘Accuracy: ‘, accuracy_score(y_test, y_pred_rcv_xgb))
print (‘F1 score: ‘, f1_score(y_test, y_pred_rcv_xgb))
print (‘Recall: ‘, recall_score(y_test, y_pred_rcv_xgb))
print (‘Precision: ‘, precision_score(y_test, y_pred_rcv_xgb))
print (‘\n clasification report:\n’, classification_report(y_test,y_pred_rcv_xgb))
print (‘\n confussion matrix:\n’,confusion_matrix(y_test, y_pred_rcv_xgb))

Accuracy:  0.9270224699893291
F1 score:  0.11739102047922696
Recall:  0.9316062176165804
Precision:  0.06264226320434803

 clasification report:
               precision    recall  f1-score   support

           0       1.00      0.93      0.96    552824
           1       0.06      0.93      0.12      2895

    accuracy                           0.93    555719
   macro avg       0.53      0.93      0.54    555719
weighted avg       0.99      0.93      0.96    555719


 confussion matrix:
 [[512467  40357]
 [   198   2697]]

Let’s plot the XGB ROC curve

plot_roc_curve(xgb_best, X_resampled_ada, y_resampled_ada, drop_intermediate=False)
plt.show()

XGB ROC curve
AUC=0.98

So the accuracy score for the final model is 92.7% whereas the recall for the final model is 93.1%.

Let’s check the list of important features

importances = xgb_best.feature_importances_
weights = pd.Series(importances,
index=X_resampled_ada.columns.values)
weights.sort_values()[-10:].plot(kind = ‘barh’)

the list of important features with scores

Let’s apply HPO using Cross Validation for Random Forest ADASYN model
#from sklearn.model_selection import RandomizedSearchCV
hyper_params = {‘max_depth’: range(3, 10),
‘max_features’: range(3, 10),
‘min_samples_leaf’: range(20, 200, 50),
‘n_estimators’: range(10, 51, 10)}

Performing Randomizedsearch for Random Forest ADASYN model
model_rcv = RandomizedSearchCV(estimator=rf_ada,
param_distributions=hyper_params,
verbose=1,
cv=5,
return_train_score=True,
n_jobs=-1,
n_iter=20)
model_rcv.fit(X_resampled_ada, y_resampled_ada)

Fitting 5 folds for each of 20 candidates, totalling 100 fits

Out[248]:

RandomizedSearchCV(cv=5, estimator=RandomForestClassifier(), n_iter=20,
                   n_jobs=-1,
                   param_distributions={'max_depth': range(3, 10),
                                        'max_features': range(3, 10),
                                        'min_samples_leaf': range(20, 200, 50),
                                        'n_estimators': range(10, 51, 10)},
                   return_train_score=True, verbose=1)

Let’s check the score

model_rcv.best_score_

0.8869038273254523

Displaying best values for hyperparameters

RandomForestClassifier(max_depth=8, max_features=8, min_samples_leaf=170,
                       n_estimators=10)

Select the best model rf_best

rf_best = model_rcv.best_estimator_

Let’s predict the is_fraud variable for test data

y_pred_rcv = rf_best.predict(X_test)

Let’s check our evaluation metrics to compare accuracy and recall values

print (‘Accuracy: ‘, accuracy_score(y_test, y_pred_rcv))
print (‘F1 score: ‘, f1_score(y_test, y_pred_rcv))
print (‘Recall: ‘, recall_score(y_test, y_pred_rcv))
print (‘Precision: ‘, precision_score(y_test, y_pred_rcv))
print (‘\n clasification report:\n’, classification_report(y_test,y_pred_rcv))
print (‘\n confussion matrix:\n’,confusion_matrix(y_test, y_pred_rcv))

Accuracy:  0.9199883394305395
F1 score:  0.10908070850364672
Recall:  0.9402417962003454
Precision:  0.05789887903345883

 clasification report:
               precision    recall  f1-score   support

           0       1.00      0.92      0.96    552824
           1       0.06      0.94      0.11      2895

    accuracy                           0.92    555719
   macro avg       0.53      0.93      0.53    555719
weighted avg       0.99      0.92      0.95    555719


 confussion matrix:
 [[508533  44291]
 [   173   2722]]

Let’s plot the ROC curve

#from sklearn.metrics import plot_roc_curve
plot_roc_curve(rf_best, X_resampled_ada, y_resampled_ada, drop_intermediate=False)
plt.show()

RFC ROC curve with AUC=0.87

The accuracy score for the final model is 92% whereas recall for the final model is 94%. Let’s plot the list of important features

importances = rf_best.feature_importances_
weights = pd.Series(importances,
index=X_resampled_ada.columns.values)
weights.sort_values()[-10:].plot(kind = ‘barh’)

the list of RFC important features with scores

Let’s apply HPO using Cross Validation for the Random Forest ADASYN model
#from sklearn.model_selection import RandomizedSearchCV
hyper_params = {‘max_depth’: range(3, 10),
‘max_features’: range(3, 10),
‘min_samples_leaf’: range(20, 200, 50),
‘n_estimators’: range(10, 51, 10)}

Let’s apply RandomizedSearchCV to Random Forest ADASYN model
model_rcv = RandomizedSearchCV(estimator=rf_ada,
param_distributions=hyper_params,
verbose=1,
cv=5,
return_train_score=True,
n_jobs=-1,
n_iter=20)

Let’s fit the resampled data
model_rcv.fit(X_resampled_ada, y_resampled_ada)

Fitting 5 folds for each of 20 candidates, totalling 100 fits
RandomizedSearchCV(cv=5, estimator=RandomForestClassifier(), n_iter=20,
                   n_jobs=-1,
                   param_distributions={'max_depth': range(3, 10),
                                        'max_features': range(3, 10),
                                        'min_samples_leaf': range(20, 200, 50),
                                        'n_estimators': range(10, 51, 10)},
                   return_train_score=True, verbose=1)

Let’s check the score

model_rcv.best_score_

0.8869038273254523

Displaying best values for hyperparameters

RandomForestClassifier(max_depth=8, max_features=8, min_samples_leaf=170,
                       n_estimators=10)

Naming the best model rf_best

rf_best = model_rcv.best_estimator_

Let’s predict the is_fraud variable for test data

y_pred_rcv = rf_best.predict(X_test)

Let’s print our evaluation Metrics to see the accuracy and recall values
print (‘Accuracy: ‘, accuracy_score(y_test, y_pred_rcv))
print (‘F1 score: ‘, f1_score(y_test, y_pred_rcv))
print (‘Recall: ‘, recall_score(y_test, y_pred_rcv))
print (‘Precision: ‘, precision_score(y_test, y_pred_rcv))
print (‘\n clasification report:\n’, classification_report(y_test,y_pred_rcv))
print (‘\n confussion matrix:\n’,confusion_matrix(y_test, y_pred_rcv))

Accuracy:  0.9199883394305395
F1 score:  0.10908070850364672
Recall:  0.9402417962003454
Precision:  0.05789887903345883

 clasification report:
               precision    recall  f1-score   support

           0       1.00      0.92      0.96    552824
           1       0.06      0.94      0.11      2895

    accuracy                           0.92    555719
   macro avg       0.53      0.93      0.53    555719
weighted avg       0.99      0.92      0.95    555719


 confussion matrix:
 [[508533  44291]
 [   173   2722]]

Let’s plot the ROC curve

#from sklearn.metrics import plot_roc_curve
plot_roc_curve(rf_best, X_resampled_ada, y_resampled_ada, drop_intermediate=False)
plt.show()

RFC ROC curve after resampling

Accuracy Score for the Final Model is 92%

Recall for the Final Model is 94%

Let’s plot the list of important features

The plot of RFC important features

Cost Benefit Analysis

Let’s perform the Cost Benefit Analysis using the XGBoost ADASYN (xgb_best) model.

Forming a consolidated train+test dataset
data = pd.concat([data_train, data_test])

Let’s define the transaction date and time column

data[‘trans_date_trans_time’] = pd.to_datetime(data[‘trans_date_trans_time’])

Extract year from trans_date_trans_time column

data[‘year’] = pd.DatetimeIndex(data[‘trans_date_trans_time’]).year
data[‘month’] = pd.DatetimeIndex(data[‘trans_date_trans_time’]).month
data.head()

The Cost Benefit Analysis using the XGBoost ADASYN (xgb_best) model.

data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1852394 entries, 0 to 555718
Data columns (total 25 columns):
 #   Column                 Dtype         
---  ------                 -----         
 0   Unnamed: 0             int64         
 1   trans_date_trans_time  datetime64[ns]
 2   cc_num                 int64         
 3   merchant               object        
 4   category               object        
 5   amt                    float64       
 6   first                  object        
 7   last                   object        
 8   gender                 object        
 9   street                 object        
 10  city                   object        
 11  state                  object        
 12  zip                    int64         
 13  lat                    float64       
 14  long                   float64       
 15  city_pop               int64         
 16  job                    object        
 17  dob                    object        
 18  trans_num              object        
 19  unix_time              int64         
 20  merch_lat              float64       
 21  merch_long             float64       
 22  is_fraud               int64         
 23  year                   int64         
 24  month                  int64         
dtypes: datetime64[ns](1), float64(5), int64(8), object(11)
memory usage: 367.4+ MB

Let’s check the monthly is_fraud count

avg = data.groupby([‘year’, ‘month’]).is_fraud.count()
avg

year  month
2019  1         52525
      2         49866
      3         70939
      4         68078
      5         72532
      6         86064
      7         86596
      8         87359
      9         70652
      10        68758
      11        70421
      12       141060
2020  1         52202
      2         47791
      3         72850
      4         66892
      5         74343
      6         87805
      7         85848
      8         88759
      9         69533
      10        69348
      11        72635
      12       139538
Name: is_fraud, dtype: int64

Let’s estimate the average number of transactions per month

Avg_tran_per_month = avg.sum()/24
Avg_tran_per_month

77183.08333333333

Let’s check the monthly is_fraud=1 count

fraud_trans = data[data[‘is_fraud’]==1]
avg_fraud = fraud_trans.groupby([‘year’, ‘month’]).is_fraud.count()
avg_fraud

year  month
2019  1        506
      2        517
      3        494
      4        376
      5        408
      6        354
      7        331
      8        382
      9        418
      10       454
      11       388
      12       592
2020  1        343
      2        336
      3        444
      4        302
      5        527
      6        467
      7        321
      8        415
      9        340
      10       384
      11       294
      12       258
Name: is_fraud, dtype: int64

Let’s estimate the average number of fraudulent transactions per month

Avg_fraud_tran_per_month = avg_fraud.sum()/24
Avg_fraud_tran_per_month

402.125

let’s get the average amount amt per fraudulent transaction
fraud_trans.amt.mean()

402.125

and the average amount amt per fraudulent transaction
fraud_trans.amt.mean()

530.6614122888789

Let TF be the average number of transactions per month detected as fraudulent by the model

TF = (41715+2706)/24 # (True Positives + False Positives) as per xgb_best confusion matrix
TF

1850.875

Let FN be the average number of transactions per month that are fraudulent but not detected by the model

FN = 189 # False Negatives as per xgb_best confusion matrix

Cost incurred per month before the model was deployed is
Cost_Before = 402.125*530.66
print(Cost_Before)

Cost of providing customer executive support per fraudulent transaction detected by the model is $1.5.
Total cost of providing customer support per month for fraudulent transactions detected by the model is TF*$1.5

Cost_cust_supp = TF*1.5
print(Cost_cust_supp)

Cost incurred due to fraudulent transactions left undetected by the model is

Cost_fraud = FN*530.66
print(Cost_fraud)

213391.6525
2776.3125
100294.73999999999

the monthly cost after the model is built and deployed is

Cost_after_model = Cost_cust_supp + Cost_fraud
Cost_after_model

103071.05249999999

Final savings = Cost incurred before – Cost incurred after

Final_Savings = Cost_Before – Cost_after_model
Final_Savings

110320.6

Conclusions

The XGBoost ADASYN model yields the best performance score: Accuracy is 92%

and Recall 94%. The deployment of this model results in saving 110320$. The most important features to monitor are the amount and late-night transaction hours.

References

[1] Tutort Academy

[2] Kaggle: Credit Card Transactions Fraud Detection Dataset

[3] Machine Learning Case Study: Credit Card Fraud Detection

[4] Credit Card Fraud Detection: Capstone Project (BA)

[5] Procedia Computer Science 165 (2019) 631–641.

Check github link

Advertisement

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

%d bloggers like this: