Online Fraud Detection

The document summarizes the key steps in an online fraud detection workflow using machine learning: 1. Data is explored and preprocessed, including handling missing values, outliers, and imbalanced classes. 2. Feature engineering is applied to encode categorical variables and select important predictive features. 3. A LightGBM gradient boosting model is trained on labeled transaction data to classify fraudulent and non-fraudulent transactions. 4. The model is evaluated on a test set using the ROC-AUC metric, achieving a score of 0.81 indicating good performance in distinguishing fraud.

Uploaded by

farahzayani82
Copyright
© All Rights Reserved

Online Fraud Detection
Fighting financial crime with machine learning.

Introduction

Everyone is exposed to financial fraud: if you're selling or buying something online, or providing financial services, you face fraud risks every day.

For businesses, scams are especially damaging because you're not only losing money but also customers who may no longer trust you.

So detecting and preventing fraud is essential.
Online Fraud

• E-commerce
• Social media
• Digital advertising
• Online banking
Steps

01 Data Exploration & Preprocessing
02 Feature Engineering
03 Selection & Training of the Model
04 Model Evaluation
05 Model Testing
01 Data Exploration & Preprocessing

We first explored our datasets to gain insight into their structure and characteristics.

The main dataset contains 200,000 transaction records with 55 features. The dataset of fraud transactions contains 8,640 transactions with 8 features.
Visualization

We used various visualizations, such as histograms, pie charts, and box plots, to understand the distributions of the features and the relationships between them.
Distribution of Payment Type

• Credit Card: 90.5%
• PayPal: 9.47%
• Inicis Payment: 0.001%
• Direct Debit: 0.0005%
Distribution of Card Type

• Visa is the most commonly used card, followed by MasterCard and American Express
Transaction Currency code

• USD: 79.2%
• EUR: 8.9%
• GBP: 6.6%
• CAD: 5.3%
Registered accounts

• The number of unregistered user accounts is higher than the number of registered user accounts

-----> Transactions from unregistered users might be considered higher risk
• Fraudsters often use proxy IPs to obfuscate their true location

-----> There are 1092 suspicious transactions originating from Washington, which points to a potential fraud risk.

• The presence of a large number of transactions from these 30 states may raise suspicions about the legitimacy of those transactions
Transaction Hours distribution

• Most transactions happen between hours 10 and 20 of the day, which corresponds to 10:00 AM to 08:00 PM
Tag the data : Labeling Our Data for Supervised Learning

In our fraud detection project, the target variable is 'Label', which indicates whether a transaction is fraudulent or not

• 0: Non-fraudulent transaction
• 1: Fraudulent transaction

• We observe that the tagged data is imbalanced: there are fewer instances of the positive class than of the negative class
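
A quick way to quantify the imbalance is to count the rows per class. The sketch below uses a made-up label list (95 non-fraudulent rows, 5 fraudulent ones); the real proportions in the dataset are not stated in the slides.

```python
from collections import Counter

# Hypothetical labels: 0 = non-fraudulent, 1 = fraudulent.
labels = [0] * 95 + [1] * 5

counts = Counter(labels)
total = len(labels)
for cls in sorted(counts):
    print(f"class {cls}: {counts[cls]} rows ({counts[cls] / total:.1%})")

# Ratio of negative to positive rows; values far above 1 indicate imbalance.
imbalance_ratio = counts[0] / counts[1]
```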
Handling missing values

o The dataset had missing values in certain features, which can affect the quality and accuracy of the analysis

Numerical values
• Using Multiple Imputation by Chained Equations (MICE), which is an iterative imputation method
It uses observed values from other variables to estimate missing values

Categorical values
• Using the mode of the column
Replace missing values with the most frequent category in each categorical column
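
The two strategies can be sketched as follows. This is an illustration under assumptions, not the project's actual code: it uses scikit-learn's `IterativeImputer`, which is inspired by MICE but returns a single imputed dataset, and the small arrays and the `card_type` column are invented for the example.

```python
from collections import Counter

import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables the import below)
from sklearn.impute import IterativeImputer

# Hypothetical numeric features with missing entries (np.nan).
numeric = np.array([
    [10.0, 1.0],
    [20.0, 2.0],
    [np.nan, 3.0],
    [40.0, np.nan],
    [50.0, 5.0],
])

# Each feature with gaps is modelled from the other features, in repeated rounds.
imputer = IterativeImputer(max_iter=10, random_state=0)
numeric_filled = imputer.fit_transform(numeric)

# Hypothetical categorical column; None marks a missing value.
card_type = ["Visa", "MasterCard", None, "Visa", None]

# Mode imputation: fill gaps with the most frequent observed category.
mode = Counter(v for v in card_type if v is not None).most_common(1)[0][0]
card_type_filled = [mode if v is None else v for v in card_type]
```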
Outliers Detection & Handling
o One widely used method for handling outliers is "Winsorizing"

------> It's a data transformation technique that caps extreme values in a dataset at a specified percentile
------> It largely preserves the distributional characteristics of the original data while reducing the effect of outliers
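
A minimal sketch of Winsorizing in plain Python, assuming nearest-rank percentiles. The percentile limits and the `amounts` values are illustrative; the slides do not say which percentiles were used.

```python
import math


def winsorize(values, lower_pct=5, upper_pct=95):
    """Cap values below/above the given percentiles (nearest-rank method)."""
    ordered = sorted(values)
    n = len(ordered)

    def percentile(pct):
        # Nearest rank: smallest value with at least pct% of the data at or below it.
        rank = max(1, math.ceil(pct * n / 100))
        return ordered[rank - 1]

    low, high = percentile(lower_pct), percentile(upper_pct)
    return [min(max(v, low), high) for v in values]


# Hypothetical transaction amounts with one extreme value.
amounts = [12, 15, 14, 13, 16, 15, 14, 13, 12, 5000]
capped = winsorize(amounts, lower_pct=10, upper_pct=90)
```

Here the extreme amount is pulled down to the 90th-percentile value, while every other row keeps its original value.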

After Handling Outliers
02 Feature Engineering

Feature engineering is a critical step to enhance the model's predictive power

Encoding Categorical Variables
Machine learning models typically expect numerical inputs, so categorical variables need to be encoded into a numeric representation before feeding them to the model

Label Encoding
Each category is mapped to an integer value
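
Label encoding can be sketched in a few lines. The helper name `fit_label_encoder` and the sample column are hypothetical; note that the resulting integers are identifiers only and carry no real ordering.

```python
def fit_label_encoder(values):
    """Map each distinct category to an integer, assigned in sorted order."""
    return {category: code for code, category in enumerate(sorted(set(values)))}


# Hypothetical categorical column.
payment_type = ["Credit Card", "PayPal", "Credit Card", "Direct Debit"]

mapping = fit_label_encoder(payment_type)
encoded = [mapping[v] for v in payment_type]
```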
Point-biserial correlation

• It measures the correlation between the binary target variable and each continuous feature
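
As an illustration, the point-biserial coefficient can be computed from the two group means, the overall standard deviation, and the group sizes. The sketch below uses the standard formula with the population standard deviation; the `labels` and `amount` values are invented.

```python
import math
import statistics


def point_biserial(labels, feature):
    """Point-biserial correlation between a 0/1 label and a continuous feature."""
    group1 = [x for x, y in zip(feature, labels) if y == 1]
    group0 = [x for x, y in zip(feature, labels) if y == 0]
    n1, n0, n = len(group1), len(group0), len(feature)
    spread = statistics.pstdev(feature)  # population standard deviation
    mean_gap = statistics.mean(group1) - statistics.mean(group0)
    return mean_gap / spread * math.sqrt(n1 * n0 / n**2)


# Hypothetical data: fraudulent rows (label 1) have larger amounts.
labels = [0, 0, 0, 1, 1]
amount = [10.0, 12.0, 11.0, 30.0, 28.0]
r = point_biserial(labels, amount)
```

A value close to +1 or -1 suggests the feature separates the two classes well, while a value near 0 suggests little linear association with the label.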
03 Selection & Training of the Model

Split the data

• The training set (80%) is used to train the model.
• The test set (20%) is used to evaluate the model's performance on unseen data: the model makes predictions on this set.

The target variable is 'Label': the variable we want to predict
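
A minimal sketch of an 80/20 split using a seeded shuffle. The helper `train_test_split_rows` and the synthetic rows are hypothetical. With imbalanced labels, a stratified split that keeps the fraud rate equal in both sets is often preferable; that refinement is not shown here.

```python
import random


def train_test_split_rows(rows, test_fraction=0.2, seed=42):
    """Shuffle row indices, then split them into train and test subsets."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    n_test = round(len(rows) * test_fraction)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return [rows[i] for i in train_idx], [rows[i] for i in test_idx]


# Hypothetical rows: one feature plus the 'Label' target.
rows = [{"amount": float(i), "Label": int(i % 10 == 0)} for i in range(100)]
train, test = train_test_split_rows(rows)
```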


Train the model

Using LightGBM:
For this classification task we use LightGBM, an efficient and powerful open-source machine learning framework specifically designed for gradient boosting. It combines speed, memory efficiency, and accuracy.

During the training process, LightGBM calculates the importance of each feature based on how much it contributes to the model's splits (how often the feature is used, or how much gain its splits bring).
04 Model Evaluation

ROC-AUC: Receiver Operating Characteristic - Area Under the Curve

It's a performance metric used to evaluate binary classification models

• It measures the area under the ROC curve, which is a graphical representation of the model's true positive rate against the false positive rate at different classification thresholds
The ROC-AUC score is a useful metric for evaluating classifiers, especially in imbalanced
datasets where accuracy alone can be misleading
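
To make the metric concrete, the area under the ROC curve equals the probability that a randomly chosen fraudulent transaction receives a higher score than a randomly chosen non-fraudulent one (ties count as one half). The sketch below computes it that way on invented labels and scores; it compares every pair, so it is meant for small examples only.

```python
def roc_auc(labels, scores):
    """ROC-AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in positives:
        for q in negatives:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5  # a tie counts as half a correct ranking
    return wins / (len(positives) * len(negatives))


# Hypothetical labels and predicted fraud scores.
labels = [0, 0, 1, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8, 0.7, 0.9]
auc = roc_auc(labels, scores)
```

A score of 0.5 corresponds to random ranking and 1.0 to a perfect separation of the two classes.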

The score is 0.81, which suggests that the model is performing well in distinguishing between fraud and non-fraud transactions.
Thank you for your attention!
