Online Fraud Detection
Fighting financial crime with machine learning.
Introduction
Everyone is exposed to financial fraud. If you're selling or buying something online or providing financial services, you face fraud risks every day. Scams hit businesses especially hard, because you're not only losing money but also customers who may no longer trust you. So detecting and preventing fraud is essential.
Online Fraud
• E-commerce
• Social media
• Digital advertising
• Online banking
Steps
01 Data Exploration & Preprocessing
02 Feature Engineering
03 Selection & Training of the Model
04 Model Evaluation
05 Model Testing
01 Data Exploration & Preprocessing
We first explored our datasets to gain insights into their structure and characteristics.
Our transactions dataset contains 200,000 records with 55 features; the fraud transactions dataset contains 8,640 transactions with 8 features.
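As a minimal sketch of this step (the file names transactions.csv and fraud_transactions.csv are assumptions, not the actual dataset paths):

    import pandas as pd

    # Hypothetical file names; adjust to the real dataset locations
    transactions = pd.read_csv("transactions.csv")   # ~200,000 rows, 55 features
    fraud = pd.read_csv("fraud_transactions.csv")    # ~8,640 rows, 8 features

    # Structure: shape, column types, and per-column missing-value counts
    print(transactions.shape)
    print(transactions.dtypes)
    print(transactions.isna().sum().sort_values(ascending=False).head(10))

    # Summary statistics for numerical and categorical columns
    print(transactions.describe(include=["number"]))
    print(transactions.describe(include=["object"]))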
Visualization
We used various visualizations such as histograms, pie charts, and box plots to understand the distribution of and relationships between different features.
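For illustration, plots of this kind can be produced with matplotlib and seaborn; the column names transaction_amount and payment_type below are assumptions, and the transactions DataFrame comes from the exploration sketch above:

    import matplotlib.pyplot as plt
    import seaborn as sns

    # Histogram of a continuous feature (column name is assumed)
    transactions["transaction_amount"].hist(bins=50)
    plt.title("Distribution of transaction amount")
    plt.show()

    # Pie chart of a categorical feature
    transactions["payment_type"].value_counts().plot.pie(autopct="%.1f%%")
    plt.title("Distribution of payment type")
    plt.show()

    # Box plot of a continuous feature split by the fraud label
    sns.boxplot(data=transactions, x="Label", y="transaction_amount")
    plt.title("Transaction amount by label")
    plt.show()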
Distribution of Payment Type
Credit Card 90.5%, PayPal 9.47%, Direct Debit 0.0005%, Inicis Payment 0.001%
Distribution of Card Type
Visa is the most commonly
used card, followed by MasterCard
and American Express
Transaction Currency Code
USD 79.2%, EUR 8.9%, GBP 6.6%, CAD 5.3%
Registered accounts
The number of unregistered user accounts is higher than the number of registered user accounts.
-----> Transactions from unregistered users might be considered higher risk
• Fraudsters often use proxy IPs to obfuscate their true location
-----> There are 1,092 suspicious transactions originating from Washington, indicating a potential fraud risk.
• The presence of a large number of transactions from these 30 states may raise suspicions about the legitimacy of those transactions
Transaction Hours Distribution
Most transactions happen between hour 10 and hour 20 of the day, which corresponds to 10:00 AM – 8:00 PM.
Tag the Data: Labeling Our Data for Supervised Learning
In our fraud detection project, the target variable is 'Label', which indicates whether a transaction is fraudulent or not.
• 0: Non-fraudulent transaction
• 1: Fraudulent transaction
We observe that the tagged data is imbalanced: there are fewer instances of the positive class than of the negative class.
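A quick way to see this imbalance, assuming the transactions DataFrame from the exploration sketch above:

    # Count and proportion of each class in the target variable
    print(transactions["Label"].value_counts())
    print(transactions["Label"].value_counts(normalize=True))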
Handling missing values
o The dataset had missing values in certain features, which can affect the quality and accuracy of the analysis.
Numerical values: using Multiple Imputation by Chained Equations (MICE), an iterative imputation method. It uses observed values from other variables to estimate missing values.
Categorical values: using the mode of the column, i.e. replacing missing values with the most frequent category in each categorical column.
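A sketch of both strategies with scikit-learn, using IterativeImputer (a MICE-style imputer) for numerical columns and a most-frequent SimpleImputer for categorical ones; selecting the columns automatically by dtype is an assumption:

    from sklearn.experimental import enable_iterative_imputer  # noqa: F401
    from sklearn.impute import IterativeImputer, SimpleImputer

    num_cols = transactions.select_dtypes(include="number").columns
    cat_cols = transactions.select_dtypes(include="object").columns

    # MICE-style iterative imputation for numerical features
    mice = IterativeImputer(max_iter=10, random_state=42)
    transactions[num_cols] = mice.fit_transform(transactions[num_cols])

    # Mode (most frequent category) imputation for categorical features
    mode = SimpleImputer(strategy="most_frequent")
    transactions[cat_cols] = mode.fit_transform(transactions[cat_cols])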
Outliers Detection & Handling
o One widely used method for identifying and handling outliers is "Winsorizing"
------> It's a data transformation technique that involves capping extreme values in a dataset at a specified percentile
------> It preserves the distributional characteristics of the original data while reducing the effect of outliers
(Box plots of the features after handling outliers)
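As an illustrative sketch, capping each numerical feature at the 1st and 99th percentiles with SciPy; the exact percentile limits are an assumption:

    import numpy as np
    from scipy.stats.mstats import winsorize

    # Cap extreme values at the 1st and 99th percentiles, column by column
    for col in num_cols:
        transactions[col] = np.asarray(winsorize(transactions[col], limits=(0.01, 0.01)))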
02 Feature Engineering
Feature engineering is a critical step to enhance the model's predictive power
Encoding Categorical Variables
Machine learning models typically expect numerical inputs, so categorical variables need to be encoded into a numeric representation before feeding them to the model.
Label Encoding: each category is mapped to an integer value.
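A minimal label-encoding sketch with scikit-learn; fitting one encoder per categorical column is an assumption about how it was applied:

    from sklearn.preprocessing import LabelEncoder

    # Map each category of every categorical column to an integer code
    for col in cat_cols:
        transactions[col] = LabelEncoder().fit_transform(transactions[col].astype(str))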
Point-biserial correlation
The correlation
between the binary
target variable and the
continuous features
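A sketch of this check with SciPy, computing the correlation of each continuous feature with the binary 'Label':

    from scipy.stats import pointbiserialr

    # Point-biserial correlation of each continuous feature with the target
    for col in num_cols:
        if col == "Label":
            continue
        r, p = pointbiserialr(transactions["Label"], transactions[col])
        print(f"{col}: r={r:.3f}, p={p:.3g}")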
03 Selection & Training of the Model
Split the data
The training set (80%) is used to train the model.
The test set (20%) is used to evaluate the model's performance on unseen data; the model makes predictions on this set.
The target variable is 'Label': the variable we want to predict.
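A sketch of the 80/20 split with scikit-learn; stratifying on the label is an assumption that simply keeps the fraud ratio similar in both sets:

    from sklearn.model_selection import train_test_split

    X = transactions.drop(columns=["Label"])
    y = transactions["Label"]

    # 80% training / 20% test split
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=42
    )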
Train the model
Using LightGBM:
For this classification task, LightGBM is an efficient and powerful open-source machine learning framework specifically designed for gradient boosting. It combines speed, memory efficiency, and accuracy.
During the training process, LightGBM calculates the importance of each feature based on how much it contributes to the model's accuracy.
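A minimal LightGBM training sketch using its scikit-learn interface; the hyperparameters shown are illustrative assumptions, not the project's tuned values:

    import pandas as pd
    import lightgbm as lgb

    # Gradient-boosted trees classifier (hyperparameters are illustrative)
    model = lgb.LGBMClassifier(
        n_estimators=500,
        learning_rate=0.05,
        num_leaves=31,
        class_weight="balanced",  # one common way to account for the class imbalance
        random_state=42,
    )
    model.fit(X_train, y_train)

    # Feature importances accumulated during training
    importances = pd.Series(model.feature_importances_, index=X_train.columns)
    print(importances.sort_values(ascending=False).head(10))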
04 Model Evaluation
ROC-AUC: Receiver Operating Characteristic - Area Under the Curve
It's a metric used to evaluate the performance of binary classification models
It measures the area under
the ROC curve, which is a
graphical representation of
the model's true positive rate
against the false positive rate
at different classification
thresholds
The ROC-AUC score is a useful metric for evaluating classifiers, especially in imbalanced
datasets where accuracy alone can be misleading
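A sketch of computing this score on the held-out test set, using the model's predicted fraud probabilities:

    from sklearn.metrics import roc_auc_score

    # Probability of the positive (fraud) class for each test transaction
    y_prob = model.predict_proba(X_test)[:, 1]
    print("ROC-AUC:", roc_auc_score(y_test, y_prob))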
The score is 0.81, which suggests that the model is performing well at distinguishing between fraud and non-fraud transactions.
Thank you for your attention!