
Contents:
- Introduction
- The Algorithm
- E2E Pipeline
- Prerequisites
- The Input Dataset
- Data Preparation
- Exploratory Data Analysis (EDA)
- Feature Engineering
- Data Resampling and Base Model Testing
- Random Under-Sampling;
- Random Over-Sampling;
- SMOTE – Synthetic Minority Oversampling Technique;
- ADASYN – Adaptive Synthetic Sampling Method;
- SMOTETomek – Over-sampling followed by under-sampling.
- Hyper-Parameter Optimization (HPO)
- Performance Evaluation
- Cost Benefit Analysis
- Conclusions
- References
Introduction
Business Case
Payment fraud is already a billion-dollar business, and it's growing. Looking at the statistics behind global online payment fraud, it's no surprise that almost three quarters of businesses say it's a major concern. According to Juniper Research, online sellers will lose $130 billion to online payment fraud between 2018 and 2023.
As eCommerce sales rise, payment fraud continues to plague customers and merchants.
Since 40% of customers blame the merchant when they experience fraud, it’s time to stop fraud from cutting into your bottom line.
Global fraud average costs:
- Online payment fraud costs global businesses 1.8% of revenue.
- For every $1 of fraud from chargebacks, ecommerce businesses lose an extra $2.94
The extra costs of fraud for businesses include chargeback fees, merchandise distribution, fraud investigation, legal prosecution and software security.
It’s not only about the financial cost – fraud also impacts brand and customer loyalty. Because consumers aren’t aware of how fraud works, they often blame the online seller and are less likely to buy from their site again.
Old School Approach
Traditionally, businesses relied on rules alone to block fraudulent payments. Rules are still an important part of the anti-fraud toolkit, but using them on their own causes several issues:
- False positives (FP) – using lots of rules tends to produce a high number of FP, meaning you are likely to block many genuine customers. For example, a rule that flags all high-value orders or all orders from high-risk locations will also catch plenty of legitimate ones.
- Fixed outcomes – the thresholds for fraudulent behaviour change over time. If your prices change, the average order value can rise so that orders over $500 become the norm, and the old rule becomes invalid.
- Inefficient and hard to scale – using a rules-only approach means that your library must keep expanding as fraud evolves. This makes the system slower and puts a heavy maintenance burden on your fraud analyst team, demanding increasing numbers of manual reviews.
Bottom Line: Rules, machine learning (ML) and Artificial Intelligence (AI) are complementary tools for fraud detection.
Machine Learning Approach
The ML/AI approach overcomes the above drawbacks of traditional rule-based fraud detection techniques as follows [1-5]:
- Speed – Machine learning is like having several teams of analysts running hundreds of thousands of queries and comparing the outcomes to find the best result – this is all done in real-time and only takes milliseconds.
- Scalability – Every online business wants to increase its transaction volume. With a rules only system, increasing amounts of payment and customer data puts more pressure on the rules library to expand. But with machine learning it’s the opposite – the more data the better.
- Efficiency & cost – Machine learning does all the dirty work of data analysis in a fraction of the time it would take for even 100 fraud analysts. Unlike humans, machines can perform repetitive, tedious tasks 24/7 and only need to escalate decisions to a human when specific insight is needed. Recall that the cost of machine learning is just the cost of the servers running.
- Accuracy – Machine learning models are able to learn from patterns of normal behavior. They are very fast to adapt to changes in that normal behaviour and can quickly identify patterns of rapidly varying fraud transactions. The model can identify suspicious customers even when there hasn’t been a chargeback yet.
Multiple supervised and semi-supervised machine learning techniques are used for fraud detection [3-5], but our aim is to overcome three main challenges of card-fraud datasets: strong class imbalance, the mix of labelled and unlabelled samples, and the need to process a large number of transactions.
The ultimate goal of ML solutions is to minimize fraud losses while keeping customers safe and maximizing their customer lifetime value (CLV).
The Algorithm
The supervised deep learning binary classification algorithm [3-4] consists of the following steps:
- Input Data Management
- Feature Engineering
- Model Training, Testing and Validation
- Model Performance Analysis
- Cost and Risk Score Estimates
- Final Risk Threshold Adjustment
At the point of the transaction, the ML model gives each customer a risk score. The higher the score, the higher the probability of fraud.
You can choose what level of risk is right for your business, and set thresholds for what proportion of transactions you want to allow, block and manually review or challenge.
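As a toy illustration only (the threshold values below are hypothetical and not taken from this article), routing a predicted fraud probability into allow/review/block decisions could look like this:
def route_transaction(risk_score, review_threshold=0.30, block_threshold=0.80):
    # Map a predicted fraud probability (e.g. model.predict_proba(X)[:, 1]) to an action.
    # The thresholds are illustrative and should be tuned to the business risk appetite.
    if risk_score >= block_threshold:
        return "block"
    if risk_score >= review_threshold:
        return "review"
    return "allow"
for score in (0.05, 0.45, 0.92):
    print(score, "->", route_transaction(score))  # allow, review, block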
E2E Pipeline
The proposed end-to-end Python sequence [3] detects fraudulent transactions based on the historical transactional data of customers with a pool of merchants:
- Setup the Jupyter Notebook Python open-source ML workspace within the Anaconda IDE, including all necessary libraries
- Read and examine the input csv dataset
- Data preparation, transformation, editing, train/test split and clean-up
- Exploratory Data Analysis (EDA) and Visualization
- Feature selection, ranking, scaling and correlations
- Handling data imbalance by increasing minority class (Random Over-Sampling, SMOTE and ADASYN)
- Model Training, Testing and Validation (4 base models + 3 sampling techniques x 4 ML models = 16 models)
- Hyper-Parameter Optimization using grid search and randomly selected hyperparameter combinations (GridSearchCV and RandomizedSearchCV); a brief sketch follows this list
- ML performance evaluation using precision-recall, f1-score, and confusion matrix classification reports, feature importance, ROC and other metrics
- Risk score estimates based on user-defined thresholds for what proportion of transactions to allow, block, or manually review/challenge; these thresholds act as triggers for fraud identification and can be built on further to mitigate fraud in real time
- Cost Benefit Analysis using the best ML model (cost incurred per month before/after the model is built and deployed).
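To make the HPO step referenced above concrete, here is a minimal, hypothetical sketch of RandomizedSearchCV applied to a random forest; the parameter ranges are illustrative and are not the tuned configuration used later in the article:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV
param_distributions = {
    "n_estimators": [50, 100, 200],
    "max_depth": [5, 10, None],
    "min_samples_leaf": [1, 5, 10],
}
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions=param_distributions,
    n_iter=10,
    scoring="f1",  # favour the minority (fraud) class over raw accuracy
    cv=3,
    random_state=42,
    n_jobs=-1,
)
# search.fit(X_train, y_train); print(search.best_params_)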
Prerequisites
We install the Anaconda IDE to perform Python data science and machine learning on a single machine. In doing so, we start working with thousands of open-source packages and libraries. We import and install the following libraries:
import scikitplot as skplt
import sklearn
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor, GradientBoostingClassifier, ExtraTreesClassifier
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import precision_score, recall_score, confusion_matrix, classification_report, accuracy_score, f1_score
from imblearn.over_sampling import RandomOverSampler
from collections import Counter
import sys
import warnings
warnings.filterwarnings("ignore")
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import time
%matplotlib inline
warnings.filterwarnings('ignore')
sns.set_style("darkgrid")
print("Scikit Plot Version : ", skplt.__version__)
print("Scikit Learn Version : ", sklearn.__version__)
print("Python Version : ", sys.version)
Scikit Plot Version :  0.3.7
Scikit Learn Version :  1.0.2
Python Version :  3.9.12 (main, Apr 4 2022, 05:22:27) [MSC v.1916 64 bit (AMD64)]
Adjusting the display to fit rows and columns effectively
start_time = time.time()
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)
from datetime import datetime, date
import math
from math import radians, sin, cos, acos, atan2
from sklearn.model_selection import train_test_split,GridSearchCV,RandomizedSearchCV
from sklearn.ensemble import GradientBoostingClassifier,RandomForestClassifier
from sklearn.preprocessing import MinMaxScaler
!pip install mpu --user
!pip install imbalanced-learn
The Input Dataset
We use the Kaggle Credit Card Transactions Fraud Detection Dataset [2]. This is a simulated credit card transaction dataset containing legitimate and fraudulent transactions covering the period 1 Jan 2019 – 31 Dec 2020. It covers the credit cards of 1,000 customers transacting with a pool of 800 merchants, and was generated with the Sparkov Data Generation tool (GitHub) created by Brandon Harris. The files were combined and converted into a standard csv format. The dataset is highly unbalanced: the positive class (fraud) accounts for roughly 0.5% of all transactions.
Let’s read and understand the train fraud data
data_train = pd.read_csv(r'YourPath\fraudTrain.csv')
data_train.head()

Let’s read and understand the test fraud data
data_test = pd.read_csv(r'YourPath\fraudTest.csv')
data_test.head()

Checking the number of columns and rows in the dataset
print(data_train.shape)
print(data_test.shape)
(1296675, 23) (555719, 23)
Checking for duplicates and deleting any duplicate records.
data_train = data_train.drop_duplicates()
data_test = data_test.drop_duplicates()
print(data_train.shape)
print(data_test.shape)
(1296675, 23) (555719, 23)
Checking the fields and their datatypes
data_train.info(verbose=True)
<class 'pandas.core.frame.DataFrame'> RangeIndex: 1296675 entries, 0 to 1296674 Data columns (total 23 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Unnamed: 0 1296675 non-null int64 1 trans_date_trans_time 1296675 non-null object 2 cc_num 1296675 non-null int64 3 merchant 1296675 non-null object 4 category 1296675 non-null object 5 amt 1296675 non-null float64 6 first 1296675 non-null object 7 last 1296675 non-null object 8 gender 1296675 non-null object 9 street 1296675 non-null object 10 city 1296675 non-null object 11 state 1296675 non-null object 12 zip 1296675 non-null int64 13 lat 1296675 non-null float64 14 long 1296675 non-null float64 15 city_pop 1296675 non-null int64 16 job 1296675 non-null object 17 dob 1296675 non-null object 18 trans_num 1296675 non-null object 19 unix_time 1296675 non-null int64 20 merch_lat 1296675 non-null float64 21 merch_long 1296675 non-null float64 22 is_fraud 1296675 non-null int64 dtypes: float64(5), int64(6), object(12) memory usage: 227.5+ MB
data_test.info(verbose=True)
<class 'pandas.core.frame.DataFrame'> RangeIndex: 555719 entries, 0 to 555718 Data columns (total 23 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Unnamed: 0 555719 non-null int64 1 trans_date_trans_time 555719 non-null object 2 cc_num 555719 non-null int64 3 merchant 555719 non-null object 4 category 555719 non-null object 5 amt 555719 non-null float64 6 first 555719 non-null object 7 last 555719 non-null object 8 gender 555719 non-null object 9 street 555719 non-null object 10 city 555719 non-null object 11 state 555719 non-null object 12 zip 555719 non-null int64 13 lat 555719 non-null float64 14 long 555719 non-null float64 15 city_pop 555719 non-null int64 16 job 555719 non-null object 17 dob 555719 non-null object 18 trans_num 555719 non-null object 19 unix_time 555719 non-null int64 20 merch_lat 555719 non-null float64 21 merch_long 555719 non-null float64 22 is_fraud 555719 non-null int64 dtypes: float64(5), int64(6), object(12) memory usage: 97.5+ MB
We see that there are no null values in both train and test data.
Check statistical information about numerical fields.
data_train.describe()

data_test.describe()

Data Preparation
The variable is_fraud represents 0 for non-fraudulent and 1 for fraudulent transactions. This is our TARGET variable. Let’s check the class imbalance of target variable is_fraud in train and test sets.
data_train['is_fraud'].value_counts(normalize=True)
0    0.994211
1    0.005789
Name: is_fraud, dtype: float64
data_test['is_fraud'].value_counts(normalize=True)
0    0.99614
1    0.00386
Name: is_fraud, dtype: float64
Forming a consolidated train+test dataset
data = pd.concat([data_train, data_test])
data.head()

Let's check statistical information about the numerical fields.
data.describe()

Let’s confirm that the concatenation is done properly
data.shape
(1852394, 23)
data.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 1852394 entries, 0 to 555718 Data columns (total 23 columns): # Column Dtype --- ------ ----- 0 Unnamed: 0 int64 1 trans_date_trans_time object 2 cc_num int64 3 merchant object 4 category object 5 amt float64 6 first object 7 last object 8 gender object 9 street object 10 city object 11 state object 12 zip int64 13 lat float64 14 long float64 15 city_pop int64 16 job object 17 dob object 18 trans_num object 19 unix_time int64 20 merch_lat float64 21 merch_long float64 22 is_fraud int64 dtypes: float64(5), int64(6), object(12) memory usage: 339.2+ MB
Let’s check null or missing values and sort them in ascending order
data.isnull().sum().sort_values()
Unnamed: 0 0 merch_lat 0 unix_time 0 trans_num 0 dob 0 job 0 city_pop 0 long 0 lat 0 zip 0 merch_long 0 state 0 street 0 gender 0 last 0 first 0 amt 0 category 0 merchant 0 cc_num 0 trans_date_trans_time 0 city 0 is_fraud 0 dtype: int64
data.head()

data['Unnamed: 0'].value_counts()
0 2 370474 2 370488 2 370487 2 370486 2 .. 802705 1 802706 1 802707 1 802708 1 1296674 1 Name: Unnamed: 0, Length: 1296675, dtype: int64
Dropping unwanted columns
cols_to_delete = ['Unnamed: 0', 'cc_num', 'street', 'zip', 'trans_num', 'unix_time']
data.drop(cols_to_delete, axis = 1, inplace = True)
data.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 1852394 entries, 0 to 555718 Data columns (total 17 columns): # Column Dtype --- ------ ----- 0 trans_date_trans_time object 1 merchant object 2 category object 3 amt float64 4 first object 5 last object 6 gender object 7 city object 8 state object 9 lat float64 10 long float64 11 city_pop int64 12 job object 13 dob object 14 merch_lat float64 15 merch_long float64 16 is_fraud int64 dtypes: float64(5), int64(2), object(10) memory usage: 254.4+ MB
Let's create a Customer_name column from the first and last columns
data['Customer_name'] = data['first'] + " " + data['last']
data.drop(['first','last'], axis=1, inplace=True)
data.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 1852394 entries, 0 to 555718 Data columns (total 16 columns): # Column Dtype --- ------ ----- 0 trans_date_trans_time object 1 merchant object 2 category object 3 amt float64 4 gender object 5 city object 6 state object 7 lat float64 8 long float64 9 city_pop int64 10 job object 11 dob object 12 merch_lat float64 13 merch_long float64 14 is_fraud int64 15 Customer_name object dtypes: float64(5), int64(2), object(9) memory usage: 240.3+ MB
Let’s look at city_pop variable
print("Min population : ", data['city_pop'].min())
print("Max population : ", data['city_pop'].max())
Min population :  23
Max population :  2906700
Let's create a categorical column Population_group by binning the variable city_pop
data["Population_group"] = pd.cut(data["city_pop"], bins=list(range(0,3000001,500000)), labels = ["<5lac","5-10lac","10-15lac","15-20","20-25lac","25-30lac"])
data["Population_group"].value_counts()
<5lac       1758657
5-10lac       46877
10-15lac      21224
15-20         16105
25-30lac       8794
20-25lac        737
Name: Population_group, dtype: int64
Let’s create a column age from dob variable
data['dob'] = pd.to_datetime(data['dob'])
def calculate_age(born):
    today = date.today()
    return today.year - born.year - ((today.month, today.day) < (born.month, born.day))
Let's compute the age by calling the function calculate_age
data['age'] = data["dob"].apply(calculate_age)
Let's create a column age_group from the column age
data["age_group"] = pd.cut(data["age"], bins=[0,25,40,60,80,9999], labels = ["<25","25-40","40-60","60-80","80+"])
Calculating the distance between the customer and merchant locations using the customer coordinates (lat, long) and the merchant coordinates (merch_lat, merch_long)
R = 6373.0  # radius of the Earth in km
data['lat'] = data['lat'].astype('float')
data['long'] = data['long'].astype('float')
data['merch_lat'] = data['merch_lat'].astype('float')
data['merch_long'] = data['merch_long'].astype('float')  # coordinates
data['lat'] = np.radians(data['lat'])
data['long'] = np.radians(data['long'])
data['merch_lat'] = np.radians(data['merch_lat'])
data['merch_long'] = np.radians(data['merch_long'])
data['dlon'] = data['merch_long'] - data['long']  # change in coordinates
data['dlat'] = data['merch_lat'] - data['lat']
a = np.sin(data['dlat'] / 2)**2 + np.cos(data['lat']) * np.cos(data['merch_lat']) * np.sin(data['dlon'] / 2)**2  # Haversine formula
c = 2 * np.arctan2(np.sqrt(a), np.sqrt(1 - a))
data['distance'] = R * c
data['distance'].head()
0    157.244484
1     60.443320
2    216.480102
3    191.406530
4    155.162181
Name: distance, dtype: float64
data['distance'].min()
0.0445230036617093
data['distance'].max()
304.3298522066305
data['distance'].describe()
count    1.852394e+06
mean     1.522712e+02
std      5.825222e+01
min      4.452300e-02
25%      1.106749e+02
50%      1.564819e+02
75%      1.970808e+02
max      3.043299e+02
Name: distance, dtype: float64
Let’s create a column dist_range_km from the column distance
data["dist_range_km"] = pd.cut(data["distance"], bins=[0,25,50,100,150,200,250,300,9999], labels = ["<25","25-50","50-100","100-150","150-200","200-250","250-300","300+"])
data.head()

data.drop(['dlat', 'dlon'], axis=1, inplace=True)
data.drop(['dob','city_pop'], axis=1, inplace=True)
data.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 1852394 entries, 0 to 555718 Data columns (total 19 columns): # Column Dtype --- ------ ----- 0 trans_date_trans_time object 1 merchant object 2 category object 3 amt float64 4 gender object 5 city object 6 state object 7 lat float64 8 long float64 9 job object 10 merch_lat float64 11 merch_long float64 12 is_fraud int64 13 Customer_name object 14 Population_group category 15 age int64 16 age_group category 17 distance float64 18 dist_range_km category dtypes: category(3), float64(6), int64(2), object(8) memory usage: 245.6+ MB
Let's convert the transaction date and time column to datetime
data['trans_date_trans_time'] = pd.to_datetime(data['trans_date_trans_time'])
Extract year and month from the trans_date_trans_time column
data['year'] = pd.DatetimeIndex(data['trans_date_trans_time']).year
data['month'] = pd.DatetimeIndex(data['trans_date_trans_time']).month
Extract day of the week and transaction hour from the trans_date_trans_time column
data['day_of_week'] = data['trans_date_trans_time'].dt.day_name()
data['transaction_hour'] = data['trans_date_trans_time'].dt.hour
data.head()

data.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 1852394 entries, 0 to 555718 Data columns (total 23 columns): # Column Dtype --- ------ ----- 0 trans_date_trans_time datetime64[ns] 1 merchant object 2 category object 3 amt float64 4 gender object 5 city object 6 state object 7 lat float64 8 long float64 9 job object 10 merch_lat float64 11 merch_long float64 12 is_fraud int64 13 Customer_name object 14 Population_group category 15 age int64 16 age_group category 17 distance float64 18 dist_range_km category 19 year int64 20 month int64 21 day_of_week object 22 transaction_hour int64 dtypes: category(3), datetime64[ns](1), float64(6), int64(5), object(8) memory usage: 302.1+ MB
Exploratory Data Analysis (EDA)
Univariate Analysis
Let’s begin with the target variable is_fraud = 0 (no fraud), 1 (fraud)
plt.figure(figsize= (10,6))
fig = data["is_fraud"].value_counts(normalize = True).plot.pie(autopct='%1.2f%%')
plt.title("Pie-chart showing imbalance in is_fraud variable", fontdict={'fontsize': 20, 'fontweight' : 5, 'color' : 'Green'})
fig.legend(title="is_fraud",
           loc="center left",
           bbox_to_anchor=(1, 0, 0.5, 1))
plt.show()

As we can see, the input dataset is highly imbalanced with 0.52% of transactions being fraudulent and 99.48% transactions being non-fraudulent.
Let’s plot the bar chart of the category variable
plt.figure(figsize= (8,4))
data["category"].value_counts(normalize = True).plot.bar()
plt.title("Bar chart analysing category variable\n", fontdict={'fontsize': 20, 'fontweight' : 5, 'color' : 'Green'})
plt.show()

The largest share of transactions falls in the gas_transport merchant category, and the smallest in the travel category.
data.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 1852394 entries, 0 to 555718 Data columns (total 23 columns): # Column Dtype --- ------ ----- 0 trans_date_trans_time datetime64[ns] 1 merchant object 2 category object 3 amt float64 4 gender object 5 city object 6 state object 7 lat float64 8 long float64 9 job object 10 merch_lat float64 11 merch_long float64 12 is_fraud int64 13 Customer_name object 14 Population_group category 15 age int64 16 age_group category 17 distance float64 18 dist_range_km category 19 year int64 20 month int64 21 day_of_week object 22 transaction_hour int64 dtypes: category(3), datetime64[ns](1), float64(6), int64(5), object(8) memory usage: 302.1+ MB
Let’s look at the gender variable
plt.figure(figsize= (8,4))
data["gender"].value_counts(normalize = True).plot.bar()
plt.title("Bar chart analysing gender variable\n", fontdict={'fontsize': 20, 'fontweight' : 5, 'color' : 'Green'})
plt.show()

Most transactions in the dataset are performed by female customers.
Let’s look at the state variable
plt.figure(figsize= (12,8))
data["state"].value_counts(normalize = True).plot.bar()
plt.title("Bar chart analysing state variable\n", fontdict={'fontsize': 20, 'fontweight' : 5, 'color' : 'Green'})
plt.show()

Most transactions take place in Texas (TX), and the fewest in Delaware (DE).
Let’s plot the population_group variable
plt.figure(figsize= (10,6))
data["Population_group"].value_counts(normalize = True).plot.bar()
plt.title("Bar chart analysing Population_group variable\n", fontdict={'fontsize': 20, 'fontweight' : 5, 'color' : 'Green'})
plt.show()

Most transactions take place in areas with a population below 5 lakh (500,000).
Let’s look at the age_group variable
plt.figure(figsize= (10,6))
data["age_group"].value_counts(normalize = True).plot.bar()
plt.title("Bar chart analysing age_group variable\n", fontdict={'fontsize': 20, 'fontweight' : 5, 'color' : 'Green'})
plt.show()

Most transactions are performed by people in the 40-60 year age group.
Let’s look at the year variable
plt.figure(figsize= (10,6))
data["year"].value_counts(normalize = True).plot.bar()
plt.title("Bar chart analysing year variable\n", fontdict={'fontsize': 20, 'fontweight' : 5, 'color' : 'Green'})
plt.show()

Interestingly, roughly equal numbers of transactions are performed in 2019 and 2020.
Let’s look at the month variable
plt.figure(figsize= (10,6))
data["month"].value_counts(normalize = True).plot.bar()
plt.title("Bar chart analysing month variable\n", fontdict={'fontsize': 20, 'fontweight' : 5, 'color' : 'Green'})
plt.show()

Most transactions are performed in December.
data.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 1852394 entries, 0 to 555718 Data columns (total 23 columns): # Column Dtype --- ------ ----- 0 trans_date_trans_time datetime64[ns] 1 merchant object 2 category object 3 amt float64 4 gender object 5 city object 6 state object 7 lat float64 8 long float64 9 job object 10 merch_lat float64 11 merch_long float64 12 is_fraud int64 13 Customer_name object 14 Population_group category 15 age int64 16 age_group category 17 distance float64 18 dist_range_km category 19 year int64 20 month int64 21 day_of_week object 22 transaction_hour int64 dtypes: category(3), datetime64[ns](1), float64(6), int64(5), object(8) memory usage: 302.1+ MB
Let’s plot the day_of_week variable
plt.figure(figsize= (10,6))
data["day_of_week"].value_counts(normalize = True).plot.bar()
plt.title("Bar chart analysing day_of_week variable\n", fontdict={'fontsize': 20, 'fontweight' : 5, 'color' : 'Green'})
plt.show()

Most transactions take place on Mondays and Sundays.
Let’s plot the transaction_hour variable
plt.figure(figsize= (10,6))
data["transaction_hour"].value_counts(normalize = True).plot.bar()
plt.title("Bar chart analysing transaction_hour variable\n", fontdict={'fontsize': 20, 'fontweight' : 5, 'color' : 'Green'})
plt.show()

It appears that most transactions take place after 12 noon.
Let’s portray the dist_range_km variable
plt.figure(figsize= (10,6))
data["dist_range_km"].value_counts(normalize = True).plot.bar()
plt.title("Bar chart analysing dist_range_km variable\n", fontdict={'fontsize': 20, 'fontweight' : 5, 'color' : 'Green'})
plt.show()

Interestingly, most transactions are performed within 150-200 km of the customer's location.
Bivariate Analysis
Let’s begin with amt vs age
plt.figure(figsize= [10,6])
plt.scatter(data["age"], data["amt"], alpha = 0.5)
plt.title("Scatter plot analysing amt vs age\n", fontdict={'fontsize': 20, 'fontweight' : 5, 'color' : 'Green'})
plt.xlabel("age", fontdict={'fontsize': 12, 'fontweight' : 5, 'color' : 'Black'})
plt.ylabel("amt", fontdict={'fontsize': 12, 'fontweight' : 5, 'color' : 'Black'})
plt.show()

People of all age groups mostly perform transactions under $5,000.
Let’s examine spatial distributions of customer/merchant locations responsible for fraud transactions
fraud_data = data[data['is_fraud']==1]
plt.figure(figsize=(15,6))
plt.subplot(1,2,1)
plt.scatter(fraud_data["lat"], fraud_data["long"], alpha = 0.5)
plt.title("Plot analysing distribution of customer location for frauds\n", fontdict={'fontsize': 15, 'fontweight' : 5, 'color' : 'Green'})
plt.ylabel("Longitude", fontdict={'fontsize': 12, 'fontweight' : 5, 'color' : 'Black'})
plt.xlabel("Latitude", fontdict={'fontsize': 12, 'fontweight' : 5, 'color' : 'Black'})
plt.subplot(1,2,2)
plt.scatter(fraud_data["merch_lat"], fraud_data["merch_long"], alpha = 0.5)
plt.title("Plot analysing distribution of merchant location for frauds\n", fontdict={'fontsize': 15, 'fontweight' : 5, 'color' : 'Green'})
plt.ylabel("Longitude", fontdict={'fontsize': 12, 'fontweight' : 5, 'color' : 'Black'})
plt.xlabel("Latitude", fontdict={'fontsize': 12, 'fontweight' : 5, 'color' : 'Black'})
plt.show()

We can see the complete overlap between customer and merchant locations responsible for fraud transactions.
Now let’s look at is_fraud vs amt plot
plt.figure(figsize= (8,4))
data.groupby("is_fraud")["amt"].mean().plot.bar()
plt.title("Plot analysing amt w.r.t. is_fraud variable\n", fontdict={'fontsize': 20, 'fontweight' : 5, 'color' : 'Green'})
plt.show()

It is clear that the average amount of a fraudulent transaction is slightly above $500.
Let’s plot is_fraud vs age
plt.figure(figsize= (8,4))
data.groupby("is_fraud")["age"].mean().plot.bar()
plt.title("Plot analysing age w.r.t. is_fraud variable\n", fontdict={'fontsize': 20, 'fontweight' : 5, 'color' : 'Green'})
plt.show()

On average, customers involved in fraudulent transactions are slightly older than those involved in genuine ones, with both averages in the 40-50 year range.
Let's look at is_fraud vs distance
plt.figure(figsize= (8,4))
data.groupby("is_fraud")["distance"].mean().plot.bar()
plt.title("Plot analysing distance w.r.t. is_fraud variable\n", fontdict={'fontsize': 20, 'fontweight' : 5, 'color' : 'Green'})
plt.show()

The average customer-to-merchant distance is roughly the same (about 140-160 km) for fraudulent and non-fraudulent transactions, so distance alone is not very discriminative here.
data.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 1852394 entries, 0 to 555718 Data columns (total 23 columns): # Column Dtype --- ------ ----- 0 trans_date_trans_time datetime64[ns] 1 merchant object 2 category object 3 amt float64 4 gender object 5 city object 6 state object 7 lat float64 8 long float64 9 job object 10 merch_lat float64 11 merch_long float64 12 is_fraud int64 13 Customer_name object 14 Population_group category 15 age int64 16 age_group category 17 distance float64 18 dist_range_km category 19 year int64 20 month int64 21 day_of_week object 22 transaction_hour int64 dtypes: category(3), datetime64[ns](1), float64(6), int64(5), object(8) memory usage: 302.1+ MB
Let’s switch to the categorical-categorical variable analysis.
Let’s begin with category vs is_fraud
plt.figure(figsize= (10,6))
data.groupby("category")["is_fraud"].mean().plot.bar()
plt.show()

The fraud rate is highest for the shopping_net merchant category.
Let’s plot gender vs is_fraud
plt.figure(figsize= (8,6))
data.groupby("gender")["is_fraud"].mean().plot.bar()
plt.show()

Even though women perform more transactions overall, the fraud rate is higher for men.
Let’s check state vs is_fraud
plt.figure(figsize= (15,6))
data.groupby("state")["is_fraud"].mean().plot.bar()
plt.show()

Surprisingly, Delaware (DE) has the highest fraud rate, even though it has the fewest transactions of all the states.
Let’s plot population_group vs is_fraud
plt.figure(figsize= (10,6))
data.groupby("Population_group")["is_fraud"].mean().plot.bar()
plt.show()

The fraud rate is highest in areas with a population of 20-25 lakh.
Let's look at age_group vs is_fraud
plt.figure(figsize= (10,6))
data.groupby("age_group")["is_fraud"].mean().plot.bar()
plt.show()

Although customers aged 80+ perform the fewest transactions, their transactions show a relatively high fraud rate, which may indicate identity theft.
Let’s plot year vs is_fraud
plt.figure(figsize= (8,4))
data.groupby("year")["is_fraud"].mean().plot.bar()
plt.show()

The fraud rate is higher in 2019 than in 2020.
Let’s check month vs is_fraud
plt.figure(figsize= (15,6))
data.groupby("month")["is_fraud"].mean().plot.bar()
plt.show()

The fraud rate is highest in February.
Let’s look at day_of_week vs is_fraud
plt.figure(figsize= (10,6))
data.groupby("day_of_week")["is_fraud"].mean().plot.bar()
plt.show()

The fraud rate is highest on Thursdays and Fridays.
Let’s check transaction_hour vs is_fraud
plt.figure(figsize= (10,6))
data.groupby("transaction_hour")["is_fraud"].mean().plot.bar()
plt.show()

The fraud rate is highest late at night.
Let’s look at dist_range_km vs is_fraud
plt.figure(figsize= (10,6))
data.groupby("dist_range_km")["is_fraud"].mean().plot.bar()
plt.show()

The fraud rate is highest for transactions made 100-150 km from the customer's location.
Multivariate Analysis
Let's build pivot tables of is_fraud against pairs of attributes taken from the list below
data.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 1852394 entries, 0 to 555718 Data columns (total 23 columns): # Column Dtype --- ------ ----- 0 trans_date_trans_time datetime64[ns] 1 merchant object 2 category object 3 amt float64 4 gender object 5 city object 6 state object 7 lat float64 8 long float64 9 job object 10 merch_lat float64 11 merch_long float64 12 is_fraud int64 13 Customer_name object 14 Population_group category 15 age int64 16 age_group category 17 distance float64 18 dist_range_km category 19 year int64 20 month int64 21 day_of_week object 22 transaction_hour int64 dtypes: category(3), datetime64[ns](1), float64(6), int64(5), object(8) memory usage: 302.1+ MB
Let’s begin with category vs gender vs is_fraud
pivot1 = pd.pivot_table(data = data, index = "gender", columns = "category", values = "is_fraud")
pivot1

Let's plot it
plt.figure(figsize= (16,4))
sns.heatmap(pivot1, cmap = "Greens", annot = True)
plt.show()

The fraud rate is highest for male customers in the shopping_net category.
Let’s look at state vs gender vs is_fraud
pivot2 = pd.pivot_table(data = data, index = "state", columns = "gender", values = "is_fraud")
pivot2

The corresponding plot is given by
plt.figure(figsize= (10,15))
sns.heatmap(pivot2, cmap = "Greens", annot = True)
plt.show()

It appears that all fraudulent transactions in DE and NV were performed by female customers.
Let’s consider age_group vs gender vs is_fraud
pivot3 = pd.pivot_table(data = data, index = "age_group", columns = "gender", values = "is_fraud")
pivot3

Let’s plot this table
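Following the same pattern as the earlier pivot heatmaps (a reconstruction, since the plotting code for pivot3 is not shown here):
plt.figure(figsize= (10,6))
sns.heatmap(pivot3, cmap = "Greens", annot = True)
plt.show()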

Transactions carried out by males aged 80+ show a relatively high fraud rate.
Let’s look at Population_group vs dist_range_km vs is_fraud
pivot4 = pd.pivot_table(data = data, index = "dist_range_km", columns = "Population_group", values = "is_fraud")
pivot4

Let’s plot it
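Again following the earlier pivot-heatmap pattern (a reconstruction for pivot4):
plt.figure(figsize= (10,6))
sns.heatmap(pivot4, cmap = "Greens", annot = True)
plt.show()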

The fraud rate is highest for transactions made 250-300 km from the customer's location in areas with a population of 20-25 lakh.
Let’s check year vs month vs is_fraud
pivot5 = pd.pivot_table(data = data, index = "month", columns = "year", values = "is_fraud")
pivot5

Let's plot it
plt.figure(figsize= (10,6))
sns.heatmap(pivot5, cmap = "Greens", annot = True)
plt.show()

The fraud rate is highest in January and February 2019.
Let’s look at age_group vs dist_range_km vs is_fraud
pivot6 = pd.pivot_table(data = data, index = "age_group", columns = "dist_range_km", values = "is_fraud")
pivot6

and plot it
plt.figure(figsize= (10,6))
sns.heatmap(pivot6, cmap = "Greens", annot = True)
plt.show()

Transactions by customers aged 80+ made 200-250 km from their location show a relatively high fraud rate.
Let’s look at transaction_hour vs day_of_week vs is_fraud
pivot7 = pd.pivot_table(data = data, index = "transaction_hour", columns = "day_of_week", values = "is_fraud")
pivot7

Let's plot it
plt.figure(figsize= (10,6))
sns.heatmap(pivot7, cmap = "Greens", annot = True)
plt.show()

Late-night midweek transactions show an elevated fraud rate.
Let’s look at transaction_hour vs gender vs is_fraud
pivot8 = pd.pivot_table(data = data, index = "transaction_hour", columns = "gender", values = "is_fraud")
pivot8

and plot it
plt.figure(figsize= (10,6))
sns.heatmap(pivot8, cmap = "Greens", annot = True)
plt.show()

Late-night transactions by male customers show an elevated fraud rate.
Let’s look at transaction_hour vs dist_range_km vs is_fraud
pivot9 = pd.pivot_table(data = data, index = "transaction_hour", columns = "dist_range_km", values = "is_fraud")
pivot9

Let's plot it
plt.figure(figsize= (10,6))
sns.heatmap(pivot9, cmap = "Greens", annot = True)
plt.show()

Late-night transactions show an elevated fraud rate regardless of the distance from the customer's location.
Let’s check data skewness
data.describe()

Let’s plot the following histograms of interest
cols = ['amt', 'age', 'distance']
plt.figure(figsize=[20,7])
for ind, col in enumerate(cols):
    plt.subplot(2,2,ind+1)
    data[col].value_counts(normalize=True).plot.hist()
    plt.title(col)
plt.show()
![Histograms of cols = ['amt', 'age', 'distance']](https://newdigitals603757545.files.wordpress.com/2022/05/fraudhistamtage.jpg?w=995)
and the density plot
sns.distplot(data.amt)
plt.show()

We can see that the amt variable is skewed.
In the scikit-learn world, we can apply a power transform featurewise to make data more Gaussian-like. This is useful for modeling issues related to heteroscedasticity (non-constant variance), or other situations where normality is desired.
Let's apply PowerTransformer to the amt variable and then plot the density curves of amt and age
from sklearn.preprocessing import PowerTransformer
pt = PowerTransformer()
data['amt'] = pt.fit_transform(data[['amt']])
sns.distplot(data.amt)
plt.show()

sns.distplot(data.age)
plt.show()

We can see that both attributes still have multi-modal, skewed, non-normal density distributions.
Feature Engineering
Let's count transactions per merchant (merchant names in this synthetic dataset all carry a fraud_ prefix)
data.merchant.value_counts()
fraud_Kilback LLC 6262 fraud_Cormier LLC 5246 fraud_Schumm PLC 5195 fraud_Kuhn LLC 5031 fraud_Boyer PLC 4999 ... fraud_Douglas, DuBuque and McKenzie 1101 fraud_Treutel-King 1098 fraud_Satterfield-Lowe 1095 fraud_Hahn, Douglas and Schowalter 1091 fraud_Ritchie, Bradtke and Stiedemann 1090 Name: merchant, Length: 693, dtype: int64
Let's count transactions per customer job
data.job.value_counts()
Film/video editor 13898 Exhibition designer 13167 Surveyor, land/geomatics 12436 Naval architect 12434 Materials engineer 11711 Designer, ceramics/pottery 11688 Environmental consultant 10974 Financial adviser 10963 Systems developer 10962 IT trainer 10943 Copywriter, advertising 10241 Scientist, audiological 10234 Chartered public finance accountant 10211 Chief Executive Officer 10199 Podiatrist 9525 Comptroller 9515 Magazine features editor 9506 Agricultural consultant 9500 Paramedic 9494 Sub 9488 Audiological scientist 8801 Historic buildings inspector/conservation officer 8787 Building surveyor 8786 Librarian, public 8773 Musician 8772 Scientist, research (maths) 8768 Barrister 8767 etc.
len(data.job.value_counts())
497
Let's count transactions per transaction_hour
data['transaction_hour'].value_counts()
23 95902 22 95370 16 94289 18 94052 21 93738 17 93514 13 93492 15 93439 19 93433 12 93294 14 93089 20 93081 1 61330 3 60968 2 60796 0 60655 8 60498 6 60406 10 60320 7 60301 9 60231 11 60170 5 60088 4 59938 Name: transaction_hour, dtype: int64
Let’s drop unwanted columns
data.drop(['trans_date_trans_time', 'lat', 'long', 'merch_lat', 'merch_long', 'Customer_name', 'year'], axis=1, inplace=True)
data.drop(['merchant', 'city', 'job'], axis=1, inplace=True)
data.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 1852394 entries, 0 to 555718 Data columns (total 13 columns): # Column Dtype --- ------ ----- 0 category object 1 amt float64 2 gender object 3 state object 4 is_fraud int64 5 Population_group category 6 age int64 7 age_group category 8 distance float64 9 dist_range_km category 10 month int64 11 day_of_week object 12 transaction_hour int64 dtypes: category(3), float64(2), int64(4), object(4) memory usage: 160.8+ MB
Let’s perform train-test data split
train,test = train_test_split(data,test_size=0.3,random_state=42, stratify=data.is_fraud)
print(f"train data shape:{train.shape}")
print(f"Test data shape:{test.shape}")
which gives the train/test data shapes
train data shape:(1296675, 13)
Test data shape:(555719, 13)
Let's look at the normalized is_fraud value counts
train.is_fraud.value_counts(normalize=True)
0    0.99479
1    0.00521
Name: is_fraud, dtype: float64
test.is_fraud.value_counts(normalize=True)
0    0.994791
1    0.005209
Name: is_fraud, dtype: float64
Let’s proceed with the train/test data segregation
y_train = train.pop("is_fraud")
X_train = train
y_test = test.pop("is_fraud")
X_test = test
X_train.head()

Creating dummy variables
X_train['transaction_hour'] = X_train['transaction_hour'].astype(str)
X_train['month'] = X_train['month'].astype(str)
X_test['transaction_hour'] = X_test['transaction_hour'].astype(str)
X_test['month'] = X_test['month'].astype(str)
data.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 1852394 entries, 0 to 555718 Data columns (total 13 columns): # Column Dtype --- ------ ----- 0 category object 1 amt float64 2 gender object 3 state object 4 is_fraud int64 5 Population_group category 6 age int64 7 age_group category 8 distance float64 9 dist_range_km category 10 month int64 11 day_of_week object 12 transaction_hour int64 dtypes: category(3), float64(2), int64(4), object(4) memory usage: 160.8+ MB
cat_cols = ["category", "state", "month", "day_of_week", "transaction_hour", "gender", "Population_group", "age_group", "dist_range_km"]
dummy = pd.get_dummies(X_train[cat_cols], drop_first=True)
Adding the results to the master dataframe
X_train = pd.concat([X_train, dummy], axis=1)
X_train.drop(cat_cols, axis=1, inplace=True)
X_train.head()

and drop columns age and distance
X_train.drop(['age','distance'], axis=1, inplace=True)
X_train.head()

Let’s scale the numerical variables of train data
scaler = MinMaxScaler()
scale_var = ["amt"]
X_train[scale_var] = scaler.fit_transform(X_train[scale_var]) # Scaling of train set
X_train.describe() # Check if scaling is proper

Dummy variables creation for X_test:
dummy1 = pd.get_dummies(X_test[cat_cols], drop_first=True)
Adding the results to the master dataframe
X_test = pd.concat([X_test, dummy1], axis=1)
and remove the columns age and distance
X_test.drop(cat_cols, axis=1, inplace=True)
X_test.drop(['age','distance'], axis=1, inplace=True)
X_test.head()

X_test[scale_var] = scaler.transform(X_test[scale_var]) #applying scaler transform
Let’s check train data heatmap for correlation
plt.figure(figsize=(20,20))
sns.heatmap(X_train.corr())
plt.show()


Let’s begin the feature selection process by running random forest
rf = RandomForestClassifier(n_estimators = 25).fit(X_train, y_train)
feats = X_train.columns
for feature in zip(feats, rf.feature_importances_):
    print(feature)
('amt', 0.44884653633977767) ('category_food_dining', 0.002372802581305735) ('category_gas_transport', 0.024489745855491785) ('category_grocery_net', 0.002141319843316232) ('category_grocery_pos', 0.046934143613396484) ('category_health_fitness', 0.0019287492434254044) ('category_home', 0.0031504986728694293) ('category_kids_pets', 0.0024307865687283802) ('category_misc_net', 0.01036166861243191) ('category_misc_pos', 0.010701233251879962) ('category_personal_care', 0.0027006945386455128) ('category_shopping_net', 0.009838187929644734) ('category_shopping_pos', 0.008626217675562224) ('category_travel', 0.008089480453702001) ('state_AL', 0.0028451917910117964) ('state_AR', 0.0025863113237657065) ('state_AZ', 0.0007275615509002449) ('state_CA', 0.003985682357137098) ('state_CO', 0.001432823417517122) ('state_CT', 0.0009176589468720602) ('state_DC', 0.00036086393446330936) ('state_DE', 0.0004957902299982193) ('state_FL', 0.0034067591286467817) ('state_GA', 0.00191173441492063) ('state_HI', 0.00032649560874954046) ('state_IA', 0.002522684148010048) ('state_ID', 0.0005130031762308228) ('state_IL', 0.0031306355654785725) ('state_IN', 0.001718640762194997) ('state_KS', 0.0021523696300904493) ('state_KY', 0.0018749886942408573) ('state_LA', 0.0013180421160900225) ('state_MA', 0.0011769883432360799) ('state_MD', 0.0018891607272615305) ('state_ME', 0.0014708303814694452) ('state_MI', 0.002714156063496067) ('state_MN', 0.0029512062020218364) ('state_MO', 0.0029553796181859817) ('state_MS', 0.0017194170527638957) ('state_MT', 0.0010971013578513307) ('state_NC', 0.0021431590494776667) ('state_ND', 0.0012864115515251978) ('state_NE', 0.0025524290660134535) ('state_NH', 0.0009430691355806164) ('state_NJ', 0.0017198456851276757) ('state_NM', 0.0017526919926552512) ('state_NV', 0.0006523200851623981) ('state_NY', 0.0050968284355943924) ('state_OH', 0.003109290933850257) ('state_OK', 0.002587653106086186) ('state_OR', 0.0025336746622146985) ('state_PA', 0.004186466351887547) ('state_RI', 0.00020523285496891944) ('state_SC', 0.00275344758560034) ('state_SD', 0.001051836957874794) ('state_TN', 0.001925047648587341) ('state_TX', 0.00471032610213278) ('state_UT', 0.0011421486162993275) ('state_VA', 0.0022731719234685287) ('state_VT', 0.0013083357103399159) ('state_WA', 0.0016483046455105601) ('state_WI', 0.0021518745062484045) ('state_WV', 0.0019238414922494978) ('state_WY', 0.0018998829319713622) ('month_10', 0.005402371161650254) ('month_11', 0.004914005979080988) ('month_12', 0.005258891376983022) ('month_2', 0.004705495331029598) ('month_3', 0.005734176419160778) ('month_4', 0.004743138624643532) ('month_5', 0.0051636266445033935) ('month_6', 0.005446093194120895) ('month_7', 0.0046133826098524134) ('month_8', 0.006074684297321303) ('month_9', 0.0054900987255024755) ('day_of_week_Monday', 0.008555357268766122) ('day_of_week_Saturday', 0.008159321126532816) ('day_of_week_Sunday', 0.00815564259496755) ('day_of_week_Thursday', 0.007204152924990988) ('day_of_week_Tuesday', 0.00764217782147093) ('day_of_week_Wednesday', 0.0063330896189024) ('transaction_hour_1', 0.004890282089502399) ('transaction_hour_10', 0.0016521138538497192) ('transaction_hour_11', 0.0017717968946582864) ('transaction_hour_12', 0.001695621350767716) ('transaction_hour_13', 0.0016548576008602997) ('transaction_hour_14', 0.0021788577694004784) ('transaction_hour_15', 0.0015478959923511215) ('transaction_hour_16', 0.0022681461261766264) ('transaction_hour_17', 0.0016246963882984827) ('transaction_hour_18', 0.0017728139510987259) 
('transaction_hour_19', 0.0016305237953596094) ('transaction_hour_2', 0.0034283266378819675) ('transaction_hour_20', 0.0016024313734906263) ('transaction_hour_21', 0.0016571170671093698) ('transaction_hour_22', 0.024470521644854196) ('transaction_hour_23', 0.023721758601067) ('transaction_hour_3', 0.0034593466446824765) ('transaction_hour_4', 0.0020262122602349194) ('transaction_hour_5', 0.0015626607483672133) ('transaction_hour_6', 0.0015595754359422134) ('transaction_hour_7', 0.0016255074240301667) ('transaction_hour_8', 0.0016137106185135032) ('transaction_hour_9', 0.0016202468711129857) ('gender_M', 0.01666026401486715) ('Population_group_5-10lac', 0.002367474653793451) ('Population_group_10-15lac', 0.0010827374314643264) ('Population_group_15-20', 0.0011892716291091788) ('Population_group_20-25lac', 0.0002028015406884771) ('Population_group_25-30lac', 0.0009669303189092789) ('age_group_25-40', 0.014185096524970091) ('age_group_40-60', 0.012718165264989536) ('age_group_60-80', 0.022788615963389207) ('age_group_80+', 0.008995927397751997) ('dist_range_km_25-50', 0.0031651607722741663) ('dist_range_km_50-100', 0.007565177041413791) ('dist_range_km_100-150', 0.009485275311664227) ('dist_range_km_150-200', 0.010142872717989861) ('dist_range_km_200-250', 0.008731533690317395) ('dist_range_km_250-300', 0.002751136114109341) ('dist_range_km_300+', 0.0)
imp_df = pd.DataFrame({
    "Varname": X_train.columns,
    "Imp": rf.feature_importances_
})
imp_df.sort_values(by="Imp", ascending=False)

Let’s group features of interest into a single list
cols_for_model = ['amt', 'category_grocery_pos', 'transaction_hour_22', 'transaction_hour_23', 'category_gas_transport',
                  'age_group_60-80', 'gender_M', 'age_group_25-40', 'age_group_40-60', 'category_misc_net', 'dist_range_km_150-200',
                  'category_misc_pos', 'category_shopping_net', 'dist_range_km_100-150', 'day_of_week_Sunday', 'dist_range_km_200-250',
                  'category_shopping_pos', 'age_group_80+', 'day_of_week_Saturday']
and create corresponding test and train subsets
X_train = X_train[cols_for_model]
X_test = X_test[cols_for_model]
X_train.columns
Index(['amt', 'category_grocery_pos', 'transaction_hour_22', 'transaction_hour_23', 'category_gas_transport', 'age_group_60-80', 'gender_M', 'age_group_25-40', 'age_group_40-60', 'category_misc_net', 'dist_range_km_150-200', 'category_misc_pos', 'category_shopping_net', 'dist_range_km_100-150', 'day_of_week_Sunday', 'dist_range_km_200-250', 'category_shopping_pos', 'age_group_80+', 'day_of_week_Saturday'], dtype='object')
Data Resampling and Base Model Testing
Let’s check the training and testing data shape
print(f"train data shape:{X_train.shape}")
print(f"Test data shape:{X_test.shape}")
train data shape:(1296675, 19)
Test data shape:(555719, 19)
Let’s check normalized value counts for both train and test data:
print(y_train.value_counts())
y_train.value_counts(normalize = True).reset_index()
0    1289919
1       6756
Name: is_fraud, dtype: int64

print(y_test.value_counts())
y_test.value_counts(normalize = True).reset_index()
0    552824
1      2895
Name: is_fraud, dtype: int64

Let’s look at Logistic regression – Base model
lreg = LogisticRegression()
lreg.fit(X_train, y_train)
LogisticRegression()
y_pred = lreg.predict(X_test)
print('Accuracy: ', accuracy_score(y_test, y_pred))
print('F1 score: ', f1_score(y_test, y_pred))
print('Recall: ', recall_score(y_test, y_pred))
print('Precision: ', precision_score(y_test, y_pred))
print('\n classification report:\n', classification_report(y_test, y_pred))
print('\n confusion matrix:\n', confusion_matrix(y_test, y_pred))
Accuracy: 0.9959907795126673 F1 score: 0.42577319587628865 Recall: 0.2853195164075993 Precision: 0.8385786802030457 clasification report: precision recall f1-score support 0 1.00 1.00 1.00 552824 1 0.84 0.29 0.43 2895 accuracy 1.00 555719 macro avg 0.92 0.64 0.71 555719 weighted avg 1.00 1.00 1.00 555719 confussion matrix: [[552665 159] [ 2069 826]]
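The same block of metric printouts is repeated for every model in the rest of this section. As a convenience, it could be wrapped once in a small helper and reused; this is only a sketch (the function name is ours, not part of the original notebook), using the metrics already imported above:
def print_classification_summary(y_true, y_pred):
    # Convenience wrapper around the metrics printed repeatedly below.
    print('Accuracy: ', accuracy_score(y_true, y_pred))
    print('F1 score: ', f1_score(y_true, y_pred))
    print('Recall: ', recall_score(y_true, y_pred))
    print('Precision: ', precision_score(y_true, y_pred))
    print('\n classification report:\n', classification_report(y_true, y_pred))
    print('\n confusion matrix:\n', confusion_matrix(y_true, y_pred))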
We can handle imbalanced classes by balancing them, i.e. increasing the minority class or decreasing the majority class, using the following techniques:
- Random Under-Sampling;
- Random Over-Sampling;
- SMOTE – Synthetic Minority Oversampling Technique;
- ADASYN – Adaptive Synthetic Sampling Method;
- SMOTETomek – over-sampling followed by under-sampling.
Under-sampling leads to the loss of data, so we will not use it. We will proceed with Random Over-Sampling, SMOTE and ADASYN and see which technique works best.
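SMOTETomek is listed above but not evaluated below; for completeness, a minimal usage sketch (not part of the original comparison) would be:
from imblearn.combine import SMOTETomek
smt_tomek = SMOTETomek(random_state=45)
# X_resampled_st, y_resampled_st = smt_tomek.fit_resample(X_train, y_train)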
We begin with RandomOverSampler
over_sample = RandomOverSampler(sampling_strategy = 1)
X_resampled_os, y_resampled_os = over_sample.fit_resample(X_train, y_train)
len(X_resampled_os)
2579838
print(sorted(Counter(y_resampled_os).items()))
[(0, 1289919), (1, 1289919)]
Let’s apply LogisticRegression
lreg_os = LogisticRegression()
lreg_os.fit(X_resampled_os, y_resampled_os)
y_pred_os = lreg_os.predict(X_test)
print('Accuracy: ', accuracy_score(y_test, y_pred_os))
print('F1 score: ', f1_score(y_test, y_pred_os))
print('Recall: ', recall_score(y_test, y_pred_os))
print('Precision: ', precision_score(y_test, y_pred_os))
print('\n classification report:\n', classification_report(y_test, y_pred_os))
print('\n confusion matrix:\n', confusion_matrix(y_test, y_pred_os))
Accuracy: 0.7639598430141852 F1 score: 0.035443261368315784 Recall: 0.8324697754749568 Precision: 0.01810709482557834 clasification report: precision recall f1-score support 0 1.00 0.76 0.87 552824 1 0.02 0.83 0.04 2895 accuracy 0.76 555719 macro avg 0.51 0.80 0.45 555719 weighted avg 0.99 0.76 0.86 555719 confussion matrix: [[422137 130687] [ 485 2410]]
Let’s apply the SMOTE resampling
from imblearn.over_sampling import SMOTE
smt = SMOTE(random_state=45, k_neighbors=5)
X_resampled_smt, y_resampled_smt = smt.fit_resample(X_train, y_train)
len(X_resampled_smt)
2579838
print(sorted(Counter(y_resampled_smt).items()))
[(0, 1289919), (1, 1289919)]
Let’s apply LogisticRegression
lreg_smt = LogisticRegression()
lreg_smt.fit(X_resampled_smt, y_resampled_smt)
y_pred_smt = lreg_smt.predict(X_test)
and print the classification summary
print('Accuracy: ', accuracy_score(y_test, y_pred_smt))
print('F1 score: ', f1_score(y_test, y_pred_smt))
print('Recall: ', recall_score(y_test, y_pred_smt))
print('Precision: ', precision_score(y_test, y_pred_smt))
print('\n classification report:\n', classification_report(y_test, y_pred_smt))
print('\n confusion matrix:\n', confusion_matrix(y_test, y_pred_smt))
Accuracy: 0.7686168729159881 F1 score: 0.03591404621590415 Recall: 0.8272884283246977 Precision: 0.018355444171092666 clasification report: precision recall f1-score support 0 1.00 0.77 0.87 552824 1 0.02 0.83 0.04 2895 accuracy 0.77 555719 macro avg 0.51 0.80 0.45 555719 weighted avg 0.99 0.77 0.86 555719 confussion matrix: [[424740 128084] [ 500 2395]]
Let’s apply the ADASYN resampling
from imblearn.over_sampling import ADASYN
ada = ADASYN(random_state=45, n_neighbors=5)
X_resampled_ada, y_resampled_ada = ada.fit_resample(X_train, y_train)
len(X_resampled_ada)
2579654
print(sorted(Counter(y_resampled_ada).items()))
[(0, 1289919), (1, 1289735)]
Let’s apply LogisticRegression
lreg_ada = LogisticRegression()
lreg_ada.fit(X_resampled_ada, y_resampled_ada)
y_pred_ada = lreg_ada.predict(X_test)
and print the classification summary
print('Accuracy: ', accuracy_score(y_test, y_pred_ada))
print('F1 score: ', f1_score(y_test, y_pred_ada))
print('Recall: ', recall_score(y_test, y_pred_ada))
print('Precision: ', precision_score(y_test, y_pred_ada))
print('\n classification report:\n', classification_report(y_test, y_pred_ada))
print('\n confusion matrix:\n', confusion_matrix(y_test, y_pred_ada))
Accuracy: 0.843919318936369 F1 score: 0.044736175508540844 Recall: 0.7015544041450777 Precision: 0.02310475063705861 clasification report: precision recall f1-score support 0 1.00 0.84 0.92 552824 1 0.02 0.70 0.04 2895 accuracy 0.84 555719 macro avg 0.51 0.77 0.48 555719 weighted avg 0.99 0.84 0.91 555719 confussion matrix: [[466951 85873] [ 864 2031]]
Let’s apply DecisionTreeClassifier before resampling
from sklearn.tree import DecisionTreeClassifier
dtc = DecisionTreeClassifier(random_state=0)
dtc.fit(X_train, y_train)
y_pred_dtc = dtc.predict(X_test)
and print the classification report
print('Accuracy: ', accuracy_score(y_test, y_pred_dtc))
print('F1 score: ', f1_score(y_test, y_pred_dtc))
print('Recall: ', recall_score(y_test, y_pred_dtc))
print('Precision: ', precision_score(y_test, y_pred_dtc))
print('\n classification report:\n', classification_report(y_test, y_pred_dtc))
print('\n confusion matrix:\n', confusion_matrix(y_test, y_pred_dtc))
Accuracy: 0.9965882037504566 F1 score: 0.6757865937072504 Recall: 0.6825561312607945 Precision: 0.6691500169319337 clasification report: precision recall f1-score support 0 1.00 1.00 1.00 552824 1 0.67 0.68 0.68 2895 accuracy 1.00 555719 macro avg 0.83 0.84 0.84 555719 weighted avg 1.00 1.00 1.00 555719 confussion matrix: [[551847 977] [ 919 1976]]
Let’s apply DecisionTreeClassifier after Random Over-Sampling
from sklearn.tree import DecisionTreeClassifier
dtc_os = DecisionTreeClassifier(random_state=0)
dtc_os.fit(X_resampled_os, y_resampled_os)
y_pred_dtc_os = dtc_os.predict(X_test)
and print the classification report
print('Accuracy: ', accuracy_score(y_test, y_pred_dtc_os))
print('F1 score: ', f1_score(y_test, y_pred_dtc_os))
print('Recall: ', recall_score(y_test, y_pred_dtc_os))
print('Precision: ', precision_score(y_test, y_pred_dtc_os))
print('\n classification report:\n', classification_report(y_test, y_pred_dtc_os))
print('\n confusion matrix:\n', confusion_matrix(y_test, y_pred_dtc_os))
Accuracy: 0.9964532434557754 F1 score: 0.6650807136788445 Recall: 0.6759930915371329 Precision: 0.6545150501672241 clasification report: precision recall f1-score support 0 1.00 1.00 1.00 552824 1 0.65 0.68 0.67 2895 accuracy 1.00 555719 macro avg 0.83 0.84 0.83 555719 weighted avg 1.00 1.00 1.00 555719 confussion matrix: [[551791 1033] [ 938 1957]]
Let’s apply DecisionTreeClassifier after SMOTE resampling
from sklearn.tree import DecisionTreeClassifier
dtc_smt = DecisionTreeClassifier(random_state=0)
dtc_smt.fit(X_resampled_smt, y_resampled_smt)
y_pred_dtc_smt = dtc_smt.predict(X_test)
and print the classification report
print('Accuracy: ', accuracy_score(y_test, y_pred_dtc_smt))
print('F1 score: ', f1_score(y_test, y_pred_dtc_smt))
print('Recall: ', recall_score(y_test, y_pred_dtc_smt))
print('Precision: ', precision_score(y_test, y_pred_dtc_smt))
print('\n classification report:\n', classification_report(y_test, y_pred_dtc_smt))
print('\n confusion matrix:\n', confusion_matrix(y_test, y_pred_dtc_smt))
Accuracy: 0.9538579749837598 F1 score: 0.16037982973149964 Recall: 0.8459412780656304 Precision: 0.08858744800144691 clasification report: precision recall f1-score support 0 1.00 0.95 0.98 552824 1 0.09 0.85 0.16 2895 accuracy 0.95 555719 macro avg 0.54 0.90 0.57 555719 weighted avg 0.99 0.95 0.97 555719 confussion matrix: [[527628 25196] [ 446 2449]]
Let’s apply DecisionTreeClassifier after ADASYN resampling
from sklearn.tree import DecisionTreeClassifier
dtc_ada = DecisionTreeClassifier(random_state=0)
dtc_ada.fit(X_resampled_ada, y_resampled_ada)
y_pred_dtc_ada = dtc_ada.predict(X_test)
and print the classification report
print('Accuracy: ', accuracy_score(y_test, y_pred_dtc_ada))
print('F1 score: ', f1_score(y_test, y_pred_dtc_ada))
print('Recall: ', recall_score(y_test, y_pred_dtc_ada))
print('Precision: ', precision_score(y_test, y_pred_dtc_ada))
print('\n classification report:\n', classification_report(y_test, y_pred_dtc_ada))
print('\n confusion matrix:\n', confusion_matrix(y_test, y_pred_dtc_ada))
Accuracy: 0.934169967195651
F1 score: 0.11964865840452413
Recall: 0.8587219343696028
Precision: 0.06430419037765132

classification report:
              precision    recall  f1-score   support
           0       1.00      0.93      0.97    552824
           1       0.06      0.86      0.12      2895
    accuracy                           0.93    555719
   macro avg       0.53      0.90      0.54    555719
weighted avg       0.99      0.93      0.96    555719

confusion matrix:
[[516650  36174]
 [   409   2486]]
Let’s apply RandomForestClassifier before resampling (base model)
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier()
rf.fit(X_train, y_train)
y_pred_rf = rf.predict(X_test)
and print the classification report
print('Accuracy: ', accuracy_score(y_test, y_pred_rf))
print('F1 score: ', f1_score(y_test, y_pred_rf))
print('Recall: ', recall_score(y_test, y_pred_rf))
print('Precision: ', precision_score(y_test, y_pred_rf))
print('\n classification report:\n', classification_report(y_test, y_pred_rf))
print('\n confusion matrix:\n', confusion_matrix(y_test, y_pred_rf))
Accuracy: 0.9971766306352671
F1 score: 0.7162235485621269
Recall: 0.6839378238341969
Precision: 0.7517084282460137

classification report:
              precision    recall  f1-score   support
           0       1.00      1.00      1.00    552824
           1       0.75      0.68      0.72      2895
    accuracy                           1.00    555719
   macro avg       0.88      0.84      0.86    555719
weighted avg       1.00      1.00      1.00    555719

confusion matrix:
[[552170    654]
 [   915   1980]]
Let’s apply RandomForestClassifier after Random Over-Sampling
rf_os = RandomForestClassifier()
rf_os.fit(X_resampled_os, y_resampled_os)
y_pred_rf_os = rf_os.predict(X_test)
and print the classification report
print('Accuracy: ', accuracy_score(y_test, y_pred_rf_os))
print('F1 score: ', f1_score(y_test, y_pred_rf_os))
print('Recall: ', recall_score(y_test, y_pred_rf_os))
print('Precision: ', precision_score(y_test, y_pred_rf_os))
print('\n classification report:\n', classification_report(y_test, y_pred_rf_os))
print('\n confusion matrix:\n', confusion_matrix(y_test, y_pred_rf_os))
Accuracy: 0.9966781772802441
F1 score: 0.6860544217687076
Recall: 0.6967184801381693
Precision: 0.6757118927973199

classification report:
              precision    recall  f1-score   support
           0       1.00      1.00      1.00    552824
           1       0.68      0.70      0.69      2895
    accuracy                           1.00    555719
   macro avg       0.84      0.85      0.84    555719
weighted avg       1.00      1.00      1.00    555719

confusion matrix:
[[551856    968]
 [   878   2017]]
Let’s apply RandomForestClassifier after SMOTE resampling
rf_smt = RandomForestClassifier()
rf_smt.fit(X_resampled_smt, y_resampled_smt)
y_pred_rf_smt = rf_smt.predict(X_test)
and print the classification summary
print('Accuracy: ', accuracy_score(y_test, y_pred_rf_smt))
print('F1 score: ', f1_score(y_test, y_pred_rf_smt))
print('Recall: ', recall_score(y_test, y_pred_rf_smt))
print('Precision: ', precision_score(y_test, y_pred_rf_smt))
print('\n classification report:\n', classification_report(y_test, y_pred_rf_smt))
print('\n confusion matrix:\n', confusion_matrix(y_test, y_pred_rf_smt))
Accuracy: 0.9540793098670372
F1 score: 0.15903773274015487
Recall: 0.8335060449050087
Precision: 0.08790528233151183

classification report:
              precision    recall  f1-score   support
           0       1.00      0.95      0.98    552824
           1       0.09      0.83      0.16      2895
    accuracy                           0.95    555719
   macro avg       0.54      0.89      0.57    555719
weighted avg       0.99      0.95      0.97    555719

confusion matrix:
[[527787  25037]
 [   482   2413]]
Let’s apply RandomForestClassifier after ADASYN resampling
rf_ada = RandomForestClassifier()
rf_ada.fit(X_resampled_ada, y_resampled_ada)
y_pred_rf_ada = rf_ada.predict(X_test)
and print the classification summary
print('Accuracy: ', accuracy_score(y_test, y_pred_rf_ada))
print('F1 score: ', f1_score(y_test, y_pred_rf_ada))
print('Recall: ', recall_score(y_test, y_pred_rf_ada))
print('Precision: ', precision_score(y_test, y_pred_rf_ada))
print('\n classification report:\n', classification_report(y_test, y_pred_rf_ada))
print('\n confusion matrix:\n', confusion_matrix(y_test, y_pred_rf_ada))
Accuracy: 0.9337614873704156
F1 score: 0.11684261036468328
Recall: 0.8411053540587219
Precision: 0.06278200335181126

classification report:
              precision    recall  f1-score   support
           0       1.00      0.93      0.97    552824
           1       0.06      0.84      0.12      2895
    accuracy                           0.93    555719
   macro avg       0.53      0.89      0.54    555719
weighted avg       0.99      0.93      0.96    555719

confusion matrix:
[[516474  36350]
 [   460   2435]]
Let’s plot the confusion matrix
cf_matrix=confusion_matrix(y_test, y_pred_rf_ada)
sns.heatmap(cf_matrix, annot=True)

sns.heatmap(cf_matrix/np.sum(cf_matrix), annot=True,
            fmt='.2%', cmap='Blues')

group_names = ["True Neg", "False Pos", "False Neg", "True Pos"]
group_counts = ["{0:0.0f}".format(value) for value in cf_matrix.flatten()]
group_percentages = ["{0:.2%}".format(value) for value in cf_matrix.flatten()/np.sum(cf_matrix)]
labels = [f"{v1}\n{v2}\n{v3}" for v1, v2, v3 in zip(group_names, group_counts, group_percentages)]
labels = np.asarray(labels).reshape(2, 2)
sns.heatmap(cf_matrix, annot=labels, fmt='', cmap='Blues')

Let’s look at XGBoost
from xgboost import XGBClassifier
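As a side note (not used in the original pipeline), XGBoost can also compensate for class imbalance natively through its scale_pos_weight parameter, which re-weights the minority class instead of resampling the data; a minimal sketch, assuming the X_train/y_train split from above:
# Ratio of legitimate to fraudulent transactions in the training set
ratio = (y_train == 0).sum() / (y_train == 1).sum()

# scale_pos_weight up-weights the fraud class during training,
# an alternative to over-sampling the minority class
xgb_weighted = XGBClassifier(scale_pos_weight=ratio)
xgb_weighted.fit(X_train, y_train)
y_pred_xgb_w = xgb_weighted.predict(X_test)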
Let’s apply XGBClassifier after Random Over-Sampling
xgb_os = XGBClassifier()
xgb_os.fit(X_resampled_os, y_resampled_os)
y_pred_xgb_os = xgb_os.predict(X_test)
and print the classification report
print('Accuracy: ', accuracy_score(y_test, y_pred_xgb_os))
print('F1 score: ', f1_score(y_test, y_pred_xgb_os))
print('Recall: ', recall_score(y_test, y_pred_xgb_os))
print('Precision: ', precision_score(y_test, y_pred_xgb_os))
print('\n classification report:\n', classification_report(y_test, y_pred_xgb_os))
print('\n confusion matrix:\n', confusion_matrix(y_test, y_pred_xgb_os))
Accuracy: 0.9771449239633699
F1 score: 0.29301419426662956
Recall: 0.909153713298791
Precision: 0.17465162574651627

classification report:
              precision    recall  f1-score   support
           0       1.00      0.98      0.99    552824
           1       0.17      0.91      0.29      2895
    accuracy                           0.98    555719
   macro avg       0.59      0.94      0.64    555719
weighted avg       1.00      0.98      0.98    555719

confusion matrix:
[[540386  12438]
 [   263   2632]]
Let’s apply XGBClassifier after SMOTE resampling
xgb_smt = XGBClassifier()
xgb_smt.fit(X_resampled_smt, y_resampled_smt)
y_pred_xgb_smt = xgb_smt.predict(X_test)
and print the classification report
print('Accuracy: ', accuracy_score(y_test, y_pred_xgb_smt))
print('F1 score: ', f1_score(y_test, y_pred_xgb_smt))
print('Recall: ', recall_score(y_test, y_pred_xgb_smt))
print('Precision: ', precision_score(y_test, y_pred_xgb_smt))
print('\n classification report:\n', classification_report(y_test, y_pred_xgb_smt))
print('\n confusion matrix:\n', confusion_matrix(y_test, y_pred_xgb_smt))
Accuracy: 0.9660457893287795
F1 score: 0.21954750382595029
Recall: 0.9167530224525043
Precision: 0.12470632459355324

classification report:
              precision    recall  f1-score   support
           0       1.00      0.97      0.98    552824
           1       0.12      0.92      0.22      2895
    accuracy                           0.97    555719
   macro avg       0.56      0.94      0.60    555719
weighted avg       0.99      0.97      0.98    555719

confusion matrix:
[[534196  18628]
 [   241   2654]]
Let’s apply XGBClassifier after ADASYN resampling
xgb_ada = XGBClassifier()
xgb_ada.fit(X_resampled_ada, y_resampled_ada)
y_pred_xgb_ada = xgb_ada.predict(X_test)
and print the classification summary
print('Accuracy: ', accuracy_score(y_test, y_pred_xgb_ada))
print('F1 score: ', f1_score(y_test, y_pred_xgb_ada))
print('Recall: ', recall_score(y_test, y_pred_xgb_ada))
print('Precision: ', precision_score(y_test, y_pred_xgb_ada))
print('\n classification report:\n', classification_report(y_test, y_pred_xgb_ada))
print('\n confusion matrix:\n', confusion_matrix(y_test, y_pred_xgb_ada))
Accuracy: 0.9230924262082095
F1 score: 0.11335394062610211
Recall: 0.9436960276338515
Precision: 0.06029840204820341

classification report:
              precision    recall  f1-score   support
           0       1.00      0.92      0.96    552824
           1       0.06      0.94      0.11      2895
    accuracy                           0.92    555719
   macro avg       0.53      0.93      0.54    555719
weighted avg       0.99      0.92      0.96    555719

confusion matrix:
[[510248  42576]
 [   163   2732]]
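Before moving to HPO, the base-model results above can be collected into a single comparison table; a sketch using the predictions already computed (variable names as defined above):
import pandas as pd

predictions = {
    'DecisionTree (base)': y_pred_dtc,
    'DecisionTree + ROS': y_pred_dtc_os,
    'DecisionTree + SMOTE': y_pred_dtc_smt,
    'DecisionTree + ADASYN': y_pred_dtc_ada,
    'RandomForest (base)': y_pred_rf,
    'RandomForest + ROS': y_pred_rf_os,
    'RandomForest + SMOTE': y_pred_rf_smt,
    'RandomForest + ADASYN': y_pred_rf_ada,
    'XGBoost + ROS': y_pred_xgb_os,
    'XGBoost + SMOTE': y_pred_xgb_smt,
    'XGBoost + ADASYN': y_pred_xgb_ada,
}

summary = pd.DataFrame({
    name: {'accuracy': accuracy_score(y_test, pred),
           'recall': recall_score(y_test, pred),
           'precision': precision_score(y_test, pred),
           'f1': f1_score(y_test, pred)}
    for name, pred in predictions.items()
}).T.sort_values('recall', ascending=False)

print(summary.round(3))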
Hyper-Parameter Optimization (HPO)
The objective of HPO is to resolve the trade-off between accuracy and recall for the following models:
* Logistic Regression SMOTE model
* Decision Tree SMOTE model
* XGBoost ADASYN model
* Random Forest ADASYN model.
Let’s begin with the Logistic Regression SMOTE model by applying Recursive Feature Elimination (RFE)
from sklearn.feature_selection import RFECV
from sklearn.feature_selection import RFE
rfe = RFE(estimator=lreg_smt, n_features_to_select=10) # Selecting 10 features that are important
rfe.fit(X_resampled_smt, y_resampled_smt)
RFE(estimator=LogisticRegression(), n_features_to_select=10)
rfe.ranking_
array([ 1, 1, 1, 1, 1, 10, 4, 5, 9, 1, 7, 1, 1, 6, 1, 8, 3, 1, 2])
X_resampled_ada.columns[rfe.support_]
Index(['amt', 'category_grocery_pos', 'transaction_hour_22', 'transaction_hour_23', 'category_gas_transport', 'category_misc_net', 'category_misc_pos', 'category_shopping_net', 'day_of_week_Sunday', 'age_group_80+'], dtype='object')
X_resampled_ada2 = X_resampled_ada.loc[:,rfe.support_]
X_resampled_ada2.shape
(2579654, 10)
Let’s look at top 10 important features
X_resampled_ada2.columns
Index(['amt', 'category_grocery_pos', 'transaction_hour_22', 'transaction_hour_23', 'category_gas_transport', 'category_misc_net', 'category_misc_pos', 'category_shopping_net', 'day_of_week_Sunday', 'age_group_80+'], dtype='object')
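To make the RFE output easier to read, the rankings can be paired with the feature names (rank 1 means the feature was kept); a small sketch, assuming the resampled X is a DataFrame as above:
rfe_ranking = pd.DataFrame({'feature': X_resampled_smt.columns,
                            'rank': rfe.ranking_}).sort_values('rank')
print(rfe_ranking.head(10))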
Let’s check the cross-validation score
from sklearn.model_selection import cross_val_score
cross_val_score(lreg_ada, X_resampled_ada2, y_resampled_ada, n_jobs=-1)
array([0.82018138, 0.81259897, 0.82518399, 0.80935435, 0.81977982])
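The five fold scores can be summarized as a mean and standard deviation (values taken from the array above):
cv_scores = np.array([0.82018138, 0.81259897, 0.82518399, 0.80935435, 0.81977982])
print('CV accuracy: %.3f +/- %.3f' % (cv_scores.mean(), cv_scores.std()))  # roughly 0.817 +/- 0.006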
Let’s look at cross validation for feature selection – Logistic Regression SMOTE
num_features = X_resampled_smt.shape
num_features[1]
19
Let’s run the estimator RFECV
rfecv = RFECV(estimator=lreg_smt, cv=5)
rfecv.fit(X_resampled_smt, y_resampled_smt)
RFECV(cv=5, estimator=LogisticRegression())
rfecv.grid_scores_
array([[0.80675546, 0.80701323, 0.80670119, 0.80656515, 0.80697603], [0.83262334, 0.83306523, 0.83312143, 0.83283815, 0.83287497], [0.85665972, 0.85629535, 0.85768691, 0.85693853, 0.85693853], [0.8118488 , 0.81180422, 0.81217827, 0.81202286, 0.81247056], [0.79551833, 0.79550282, 0.79591951, 0.79603928, 0.79595594], [0.79128551, 0.79193283, 0.79170801, 0.79163978, 0.79170955], [0.78841905, 0.78890358, 0.788859 , 0.78862408, 0.78878688], [0.78213378, 0.78211246, 0.7826997 , 0.78235817, 0.78264308], [0.79002574, 0.79047538, 0.79081455, 0.79066684, 0.79012417], [0.7892505 , 0.78962067, 0.78992302, 0.78961445, 0.78943808], [0.79136303, 0.79152583, 0.79215378, 0.79207391, 0.79155062], [0.7895509 , 0.79009745, 0.79040173, 0.79005053, 0.79006409], [0.79205687, 0.79224487, 0.79314996, 0.79285885, 0.79213593], [0.79193477, 0.79223905, 0.79279335, 0.79270961, 0.79194018], [0.7918398 , 0.79209176, 0.79280692, 0.79250223, 0.79215337], [0.79165762, 0.79209176, 0.79261698, 0.79230261, 0.79190723], [0.7918398 , 0.79234953, 0.79274878, 0.79250223, 0.79186847], [0.79195997, 0.79235534, 0.79298522, 0.79270574, 0.79211461], [0.79194834, 0.79235922, 0.79297747, 0.79269992, 0.79213399]])
Let’s plot the scores
plt.figure(figsize=[10, 5])
plt.plot(range(1, num_features[1]+1), rfecv.grid_scores_)
plt.show()
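Note that RFECV.grid_scores_ was deprecated in scikit-learn 1.0 and removed in 1.2; on newer versions the same plot can be built from cv_results_, for example (a sketch):
# cv_results_ is a dict; 'mean_test_score' holds one mean CV score per number of selected features
mean_scores = rfecv.cv_results_['mean_test_score']
plt.figure(figsize=[10, 5])
plt.plot(range(1, len(mean_scores) + 1), mean_scores)
plt.xlabel('Number of selected features')
plt.ylabel('Mean CV score')
plt.show()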

rfecv.n_features_
3
rfecv.support_
array([ True, False, True, True, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False])
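The boolean mask above (or rfecv.transform) can then be used to keep only the features RFECV retained; a sketch, assuming the resampled X and X_test are DataFrames:
X_resampled_smt_sel = X_resampled_smt.loc[:, rfecv.support_]
X_test_sel = X_test.loc[:, rfecv.support_]
print(X_resampled_smt_sel.columns.tolist())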
Let’s apply HPO using Cross Validation for Decision Tree SMOTE
params = {
“max_depth”: [2,3,5,10,20],
“min_samples_leaf”: [5,10,20,50,100]
}
model_rcv_dt = RandomizedSearchCV(estimator=dtc_smt,
param_distributions=params,
verbose=1,
cv=5,
return_train_score=True,
n_jobs=-1,
n_iter=20)
model_rcv_dt.fit(X_resampled_smt, y_resampled_smt)
Fitting 5 folds for each of 20 candidates, totalling 100 fits
RandomizedSearchCV(cv=5, estimator=DecisionTreeClassifier(random_state=0), n_iter=20, n_jobs=-1, param_distributions={'max_depth': [2, 3, 5, 10, 20], 'min_samples_leaf': [5, 10, 20, 50, 100]}, return_train_score=True, verbose=1)
Let’s check the score
model_rcv_dt.best_score_
0.9548638326370502
and choose the best model
dt_best = model_rcv_dt.best_estimator_
dt_best
DecisionTreeClassifier(max_depth=20, min_samples_leaf=20, random_state=0)
Let’s plot the ROC curve
from sklearn.metrics import plot_roc_curve
plot_roc_curve(dt_best, X_resampled_smt, y_resampled_smt)
plt.show()
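plot_roc_curve was deprecated in scikit-learn 1.0 and removed in 1.2; on newer installs the equivalent plot is produced with RocCurveDisplay:
from sklearn.metrics import RocCurveDisplay

# Equivalent of plot_roc_curve(dt_best, ...) on scikit-learn >= 1.2
RocCurveDisplay.from_estimator(dt_best, X_resampled_smt, y_resampled_smt)
plt.show()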

Let’s apply HPO using Cross Validation for the XGBOOST ADASYN model
from sklearn.model_selection import RandomizedSearchCV
hyper_params = {'max_depth': range(3, 10),
                'max_features': range(3, 10),
                'min_samples_leaf': range(20, 200, 50),
                'n_estimators': range(10, 51, 10)}
Now let’s run RandomizedSearchCV for this model
model_rcv_xgb = RandomizedSearchCV(estimator=xgb_ada,
param_distributions=hyper_params,
verbose=1,
cv=5,
return_train_score=True,
n_jobs=-1,
n_iter=20)
model_rcv_xgb.fit(X_resampled_ada, y_resampled_ada)
Fitting 5 folds for each of 20 candidates, totalling 100 fits
[12:05:20] WARNING: C:/Users/Administrator/workspace/xgboost-win64_release_1.6.0/src/learner.cc:627: Parameters: { "max_features", "min_samples_leaf" } might not be used.
RandomizedSearchCV(cv=5, estimator=XGBClassifier(base_score=0.5, booster='gbtree', callbacks=None, colsample_bylevel=1, colsample_bynode=1, colsample_bytree=1, early_stopping_rounds=None, enable_categorical=False, eval_metric=None, gamma=0, gpu_id=-1, grow_policy='depthwise', importance_type=None, interaction_constraints='', learning_rate=0.300000012, max_bin=256,... max_leaves=0, min_child_weight=1, missing=nan, monotone_constraints='()', n_estimators=100, n_jobs=0, num_parallel_tree=1, predictor='auto', random_state=0, reg_alpha=0, reg_lambda=1, ...), n_iter=20, n_jobs=-1, param_distributions={'max_depth': range(3, 10), 'max_features': range(3, 10), 'min_samples_leaf': range(20, 200, 50), 'n_estimators': range(10, 51, 10)}, return_train_score=True, verbose=1)
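The warning above appears because max_features and min_samples_leaf are scikit-learn tree parameters that XGBClassifier simply ignores; a search space built from XGBoost’s own hyper-parameters would look more like the sketch below (the ranges are illustrative, not taken from the original notebook):
xgb_params = {'max_depth': range(3, 10),
              'n_estimators': range(10, 51, 10),
              'learning_rate': [0.05, 0.1, 0.2, 0.3],
              'subsample': [0.6, 0.8, 1.0],
              'colsample_bytree': [0.6, 0.8, 1.0],
              'min_child_weight': [1, 5, 10]}

model_rcv_xgb2 = RandomizedSearchCV(estimator=XGBClassifier(),
                                    param_distributions=xgb_params,
                                    cv=5, n_iter=20, n_jobs=-1, verbose=1)
# model_rcv_xgb2.fit(X_resampled_ada, y_resampled_ada)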
And the score is
model_rcv_xgb.best_score_
0.9173245717058146
Displaying best values for hyperparameters
model_rcv_xgb.best_estimator_
XGBClassifier(base_score=0.5, booster='gbtree', callbacks=None, colsample_bylevel=1, colsample_bynode=1, colsample_bytree=1, early_stopping_rounds=None, enable_categorical=False, eval_metric=None, gamma=0, gpu_id=-1, grow_policy='depthwise', importance_type=None, interaction_constraints='', learning_rate=0.300000012, max_bin=256, max_cat_to_onehot=4, max_delta_step=0, max_depth=9, max_features=3, max_leaves=0, min_child_weight=1, min_samples_leaf=120, missing=nan, monotone_constraints='()', n_estimators=50, n_jobs=0, num_parallel_tree=1, predictor='auto', random_state=0, ...)
Let’s select the best model
xgb_best = model_rcv_xgb.best_estimator_
and predict the is_fraud variable for test data
y_pred_rcv_xgb = model_rcv_xgb.predict(X_test)
Performance Evaluation
Let’s plot various performance metrics using skplt
import scikitplot as skplt
Let’s check learning curves of our training examples
skplt.estimators.plot_learning_curve(XGBClassifier(), X_test, y_pred_rcv_xgb,
                                     cv=7, shuffle=True, scoring="accuracy",
                                     n_jobs=-1, figsize=(6,4), title_fontsize="large", text_fontsize="large",
                                     title="XGBClassifier Learning Curve");

skplt.estimators.plot_learning_curve(DecisionTreeClassifier(), X_test, y_pred_dtc_os,
                                     cv=7, shuffle=True, scoring="accuracy",
                                     n_jobs=-1, figsize=(6,4), title_fontsize="large", text_fontsize="large",
                                     title="DecisionTreeClassifier Learning Curve");

skplt.estimators.plot_learning_curve(RandomForestClassifier(), X_test, y_pred_rf_smt,
                                     cv=7, shuffle=True, scoring="accuracy",
                                     n_jobs=-1, figsize=(6,4), title_fontsize="large", text_fontsize="large",
                                     title="RandomForestClassifier Learning Curve");
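Note that the three learning curves above are fit against each model’s own predictions on the test set; to gauge generalisation against the ground truth instead, the same call can be made with the true labels, e.g.:
skplt.estimators.plot_learning_curve(XGBClassifier(), X_test, y_test,
                                     cv=7, shuffle=True, scoring="accuracy",
                                     n_jobs=-1, figsize=(6,4),
                                     title="XGBClassifier Learning Curve (true labels)");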

Let’s plot the ROC curve for our test data
Y_test_probs = model_rcv_xgb.predict_proba(X_test)
skplt.metrics.plot_roc_curve(y_test, Y_test_probs,
                             title="XGB ROC Curve", figsize=(12,6));

Y_test_probs = dtc_os.predict_proba(X_test)
skplt.metrics.plot_roc_curve(y_test, Y_test_probs,
                             title="DTC ROC Curve", figsize=(12,6));

Let’s compare calibration curves
lr_probas = LogisticRegression().fit(X_train, y_train).predict_proba(X_test)
rf_probas = RandomForestClassifier().fit(X_train, y_train).predict_proba(X_test)
gb_probas = DecisionTreeClassifier().fit(X_train, y_train).predict_proba(X_test)
et_scores = XGBClassifier().fit(X_train, y_train).predict_proba(X_test)
probas_list = [lr_probas, rf_probas, gb_probas, et_scores]
clf_names = ['LogisticRegression', 'RandomForestClassifier', 'DecisionTreeClassifier', 'XGBClassifier']
skplt.metrics.plot_calibration_curve(y_test,
probas_list,
clf_names, n_bins=15,
figsize=(12,6)
);

Let’s look at KS Statistic plot of the best model – XGB Classifier
Y_test_probs = model_rcv_xgb.predict_proba(X_test)
skplt.metrics.plot_ks_statistic(y_test, Y_test_probs, figsize=(10,6));

Let’s plot the Cumulative Gains Curve
Y_test_probs = model_rcv_xgb.predict_proba(X_test)
skplt.metrics.plot_cumulative_gain(y_test, Y_test_probs, figsize=(10,6));

Let’s look at the lift curve
skplt.metrics.plot_lift_curve(y_test, Y_test_probs, figsize=(10,6));

Let’s plot the cluster elbow curve
from sklearn.cluster import KMeans

skplt.cluster.plot_elbow_curve(KMeans(random_state=1),
                               X_test,
                               cluster_ranges=range(2, 20),
                               figsize=(8,6));

Let’s check PCA
from sklearn.decomposition import PCA

pca = PCA(random_state=1)
pca.fit(X_test)
skplt.decomposition.plot_pca_component_variance(pca, figsize=(8,6));

Let’s look at the PCA 2-D projection
skplt.decomposition.plot_pca_2d_projection(pca, X_test, y_test,
                                           figsize=(10,10),
                                           cmap='tab10');

Let’s apply the silhouette analysis
kmeans = KMeans(n_clusters=10, random_state=1)
kmeans.fit(X_train, y_train)
cluster_labels = kmeans.predict(X_test)
skplt.metrics.plot_silhouette(X_test, cluster_labels,
figsize=(8,6));

Let’s look at our evaluation metrics to check the accuracy and recall values
print('Accuracy: ', accuracy_score(y_test, y_pred_rcv_xgb))
print('F1 score: ', f1_score(y_test, y_pred_rcv_xgb))
print('Recall: ', recall_score(y_test, y_pred_rcv_xgb))
print('Precision: ', precision_score(y_test, y_pred_rcv_xgb))
print('\n classification report:\n', classification_report(y_test, y_pred_rcv_xgb))
print('\n confusion matrix:\n', confusion_matrix(y_test, y_pred_rcv_xgb))
Accuracy: 0.9270224699893291
F1 score: 0.11739102047922696
Recall: 0.9316062176165804
Precision: 0.06264226320434803

classification report:
              precision    recall  f1-score   support
           0       1.00      0.93      0.96    552824
           1       0.06      0.93      0.12      2895
    accuracy                           0.93    555719
   macro avg       0.53      0.93      0.54    555719
weighted avg       0.99      0.93      0.96    555719

confusion matrix:
[[512467  40357]
 [   198   2697]]
Let’s plot the XGB ROC curve
plot_roc_curve(xgb_best, X_resampled_ada, y_resampled_ada, drop_intermediate=False)
plt.show()

So the accuracy score for the tuned XGBoost model is 92.7%, while its recall is 93.2%.
Let’s check the list of important features
importances = xgb_best.feature_importances_
weights = pd.Series(importances,
index=X_resampled_ada.columns.values)
weights.sort_values()[-10:].plot(kind='barh')

Let’s apply HPO using Cross Validation for Random Forest ADASYN model
#from sklearn.model_selection import RandomizedSearchCV
hyper_params = {'max_depth': range(3, 10),
                'max_features': range(3, 10),
                'min_samples_leaf': range(20, 200, 50),
                'n_estimators': range(10, 51, 10)}
Let’s perform RandomizedSearchCV for the Random Forest ADASYN model
model_rcv = RandomizedSearchCV(estimator=rf_ada,
param_distributions=hyper_params,
verbose=1,
cv=5,
return_train_score=True,
n_jobs=-1,
n_iter=20)
model_rcv.fit(X_resampled_ada, y_resampled_ada)
Fitting 5 folds for each of 20 candidates, totalling 100 fits
RandomizedSearchCV(cv=5, estimator=RandomForestClassifier(), n_iter=20, n_jobs=-1, param_distributions={'max_depth': range(3, 10), 'max_features': range(3, 10), 'min_samples_leaf': range(20, 200, 50), 'n_estimators': range(10, 51, 10)}, return_train_score=True, verbose=1)
Let’s check the score
model_rcv.best_score_
0.8869038273254523
Displaying best values for hyperparameters
RandomForestClassifier(max_depth=8, max_features=8, min_samples_leaf=170, n_estimators=10)
Select the best model rf_best
rf_best = model_rcv.best_estimator_
Let’s predict the is_fraud variable for test data
y_pred_rcv = rf_best.predict(X_test)
Let’s check our evaluation metrics to compare accuracy and recall values
print('Accuracy: ', accuracy_score(y_test, y_pred_rcv))
print('F1 score: ', f1_score(y_test, y_pred_rcv))
print('Recall: ', recall_score(y_test, y_pred_rcv))
print('Precision: ', precision_score(y_test, y_pred_rcv))
print('\n classification report:\n', classification_report(y_test, y_pred_rcv))
print('\n confusion matrix:\n', confusion_matrix(y_test, y_pred_rcv))
Accuracy: 0.9199883394305395
F1 score: 0.10908070850364672
Recall: 0.9402417962003454
Precision: 0.05789887903345883

classification report:
              precision    recall  f1-score   support
           0       1.00      0.92      0.96    552824
           1       0.06      0.94      0.11      2895
    accuracy                           0.92    555719
   macro avg       0.53      0.93      0.53    555719
weighted avg       0.99      0.92      0.95    555719

confusion matrix:
[[508533  44291]
 [   173   2722]]
Let’s plot the ROC curve
#from sklearn.metrics import plot_roc_curve
plot_roc_curve(rf_best, X_resampled_ada, y_resampled_ada, drop_intermediate=False)
plt.show()

The accuracy score for this tuned Random Forest model is 92%, while its recall is 94%. Let’s plot the list of important features
importances = rf_best.feature_importances_
weights = pd.Series(importances,
index=X_resampled_ada.columns.values)
weights.sort_values()[-10:].plot(kind='barh')

Cost Benefit Analysis
Let’s perform the Cost Benefit Analysis using the XGBoost ADASYN (xgb_best) model.
Forming a consolidated train+test dataset
data = pd.concat([data_train, data_test])
Let’s define the transaction date and time column
data['trans_date_trans_time'] = pd.to_datetime(data['trans_date_trans_time'])
Extract the year and month from the trans_date_trans_time column
data['year'] = pd.DatetimeIndex(data['trans_date_trans_time']).year
data['month'] = pd.DatetimeIndex(data['trans_date_trans_time']).month
data.head()

data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1852394 entries, 0 to 555718
Data columns (total 25 columns):
 #   Column                 Dtype
---  ------                 -----
 0   Unnamed: 0             int64
 1   trans_date_trans_time  datetime64[ns]
 2   cc_num                 int64
 3   merchant               object
 4   category               object
 5   amt                    float64
 6   first                  object
 7   last                   object
 8   gender                 object
 9   street                 object
 10  city                   object
 11  state                  object
 12  zip                    int64
 13  lat                    float64
 14  long                   float64
 15  city_pop               int64
 16  job                    object
 17  dob                    object
 18  trans_num              object
 19  unix_time              int64
 20  merch_lat              float64
 21  merch_long             float64
 22  is_fraud               int64
 23  year                   int64
 24  month                  int64
dtypes: datetime64[ns](1), float64(5), int64(8), object(11)
memory usage: 367.4+ MB
Let’s check the monthly transaction count (counting the is_fraud column)
avg = data.groupby(['year', 'month']).is_fraud.count()
avg
year  month
2019  1        52525
      2        49866
      3        70939
      4        68078
      5        72532
      6        86064
      7        86596
      8        87359
      9        70652
      10       68758
      11       70421
      12      141060
2020  1        52202
      2        47791
      3        72850
      4        66892
      5        74343
      6        87805
      7        85848
      8        88759
      9        69533
      10       69348
      11       72635
      12      139538
Name: is_fraud, dtype: int64
Let’s estimate the average number of transactions per month
Avg_tran_per_month = avg.sum()/24
Avg_tran_per_month
77183.08333333333
Let’s check the monthly is_fraud=1 count
fraud_trans = data[data['is_fraud']==1]
avg_fraud = fraud_trans.groupby(['year', 'month']).is_fraud.count()
avg_fraud
year  month
2019  1      506
      2      517
      3      494
      4      376
      5      408
      6      354
      7      331
      8      382
      9      418
      10     454
      11     388
      12     592
2020  1      343
      2      336
      3      444
      4      302
      5      527
      6      467
      7      321
      8      415
      9      340
      10     384
      11     294
      12     258
Name: is_fraud, dtype: int64
Let’s estimate the average number of fraudulent transactions per month
Avg_fraud_tran_per_month = avg_fraud.sum()/24
Avg_fraud_tran_per_month
402.125
Let’s get the average amount amt per fraudulent transaction
fraud_trans.amt.mean()
530.6614122888789
Let TF be the average number of transactions per month detected as fraudulent by the model
TF = (41715+2706)/24 # (True Positives + False Positives) as per xgb_best confusion matrix
TF
1850.875
Let FN be the average number of transactions per month that are fraudulent but not detected by the model
FN = 189 # False Negatives as per xgb_best confusion matrix
Cost incurred per month before the model was deployed is
Cost_Before = 402.125*530.66
print(Cost_Before)
Cost of providing customer executive support per fraudulent transaction detected by the model is $1.5.
Total cost of providing customer support per month for fraudulent transactions detected by the model is TF*$1.5
Cost_cust_supp = TF*1.5
print(Cost_cust_supp)
Cost incurred due to fraudulent transactions left undetected by the model is
Cost_fraud = FN*530.66
print(Cost_fraud)
213391.6525
2776.3125
100294.73999999999
The monthly cost after the model is built and deployed is
Cost_after_model = Cost_cust_supp + Cost_fraud
Cost_after_model
103071.05249999999
Final savings = Cost incurred before – Cost incurred after
Final_Savings = Cost_Before - Cost_after_model
Final_Savings
110320.6
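In other words, the estimated savings are 213,391.65 - (2,776.31 + 100,294.74), which is roughly $110,320.60 per month.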
Conclusions
The tuned XGBoost ADASYN model yields the best overall performance, with about 92% accuracy and recall above 93%. Deploying this model results in estimated savings of about $110,320 per month. The most important features to monitor are the transaction amount and late-night transaction hours.
References
[2] Kaggle: Credit Card Transactions Fraud Detection Dataset
[3] Machine Learning Case Study: Credit Card Fraud Detection
[4] Credit Card Fraud Detection: Capstone Project (BA)
[5] Procedia Computer Science 165 (2019) 631–641.
See the GitHub link for the full source code.