SlideShare a Scribd company logo
Real Time
Fraud Detection
Patterns and reference architectures
Ted Malaska // PSA Gwen Shapira // Software
Engineer
2
• Intro
• Review Problem
• Quick overview of key technology
• High level architecture
• Deep Dive into NRT Processing
• Completing the Puzzle – Micro-batch, Ingest and Batch
Overview
©2014 Cloudera, Inc. All rights reserved.
3©2014 Cloudera, Inc. All rights reserved.
• 15 years of moving data
• Formerly consultant
• Now Cloudera Engineer:
– Sqoop Committer
– Kafka
– Flume
• @gwenshap
Gwen Shapira
4
• Ted Malaska (PSA at Cloudera)
• Hadoop for ~5 years
• Contributed to
– HDFS, MapReduce, Yarn, HBase, Spark, Avro,
– Kite, Pig, Navigator, Cloudera Manager, Flume, Kafke, Sqoop, Accumulo
– And working on a Sentry Patch
• Co-Author to O’Reilly Hadoop Application Architectures
• Worked with about 70 companies in 8 countries
• Marvel Fan Boy
• Runner
Hello
©2014 Cloudera, Inc. All rights reserved.
5
The Problem
©2014 Cloudera, Inc. All rights reserved.
6
Credit Card Transaction Fraud
©2014 Cloudera, Inc. All rights reserved.
7
Ikea Meat Balls
©2014 Cloudera, Inc. All rights reserved.
8
Coupon Fraud
©2014 Cloudera, Inc. All rights reserved.
9
Video Game Strategy
©2014 Cloudera, Inc. All rights reserved.
10
Health Insurance Fraud
©2014 Cloudera, Inc. All rights reserved.
11
• Typical Atomic Card Fraud Detection
• Ikea Meat Ball
• Multi Coupons Combinations
• OP or Negative Video Games Strategies
• Ad Serving
• Health Insurance Fraud
• Kid Coming Home From School
Review of the Problem
©2014 Cloudera, Inc. All rights reserved.
12
How do we React
• Human Brain at Tennis
– Muscle Memory
– Reaction Thought
– Reflective Meditation
©2014 Cloudera, Inc. All rights reserved.
13
Overview of
Key Technologies
©2014 Cloudera, Inc. All rights reserved.
14
Kafka
©2014 Cloudera, Inc. All Rights Reserved.
15©2014 Cloudera, Inc. All rights reserved.
• Messages are organized into topics
• Producers push messages
• Consumers pull messages
• Kafka runs in a cluster. Nodes are called
brokers
The Basics
16©2014 Cloudera, Inc. All rights reserved.
Topics, Partitions and Logs
17©2014 Cloudera, Inc. All rights reserved.
Each partition is a log
18©2014 Cloudera, Inc. All rights reserved.
Each Broker has many partitions
Partition 0 Partition 0
Partition 1 Partition 1
Partition 2
Partition 1
Partition 0
Partition 2 Partion 2
19©2014 Cloudera, Inc. All rights reserved.
Producers load balance between partitions
Partition 0
Partition 1
Partition 2
Partition 1
Partition 0
Partition 2
Partition 0
Partition 1
Partion 2
Client
20©2014 Cloudera, Inc. All rights reserved.
Producers load balance between partitions
Partition 0
Partition 1
Partition 2
Partition 1
Partition 0
Partition 2
Partition 0
Partition 1
Partion 2
Client
21©2014 Cloudera, Inc. All rights reserved.
Consumers
Consumer Group Y
Consumer Group X
Consumer
Kafka Cluster
Topic
Partition A (File)
Partition B (File)
Partition C (File)
Consumer
Consumer
Consumer
Order retained with in
partition
Order retained with in
partition but not over
partitionsOffSetX
OffSetX
OffSetX
OffSetYOffSetYOffSetY
Off sets are kept per
consumer group
22
Flume
23
Sources Interceptors Selectors Channels Sinks
Flume Agent
Short Intro to Flume
Twitter, logs, JMS,
webserver, Kafka
Mask, re-format,
validate…
DR, critical
Memory, file,
Kafka
HDFS, HBase,
Solr
24
Flume and/or Kafka
©2014 Cloudera, Inc. All rights reserved.
Flume
UpStream
Flume Source
Interceptor
Flume Channel
Flume Sink
Down Stream
Selector
Can Be KafkaCan Be KafkaCan Be Kafka
25
Interceptors
• Mask fields
• Validate information
against external source
• Extract fields
• Modify data format
• Filter or split events
©2014 Cloudera, Inc. All rights reserved.
26
SparkStreaming
27
Spark Streaming Example
©2014 Cloudera, Inc. All rights reserved.
1. val conf = new SparkConf().setMaster("local[2]”)
2. val ssc = new StreamingContext(conf, Seconds(1))
3. val lines = ssc.socketTextStream("localhost", 9999)
4. val words = lines.flatMap(_.split(" "))
5. val pairs = words.map(word => (word, 1))
6. val wordCounts = pairs.reduceByKey(_ + _)
7. wordCounts.print()
8. SSC.start()
28
Spark Streaming Example
©2014 Cloudera, Inc. All rights reserved.
1. val conf = new SparkConf().setMaster("local[2]”)
2. val sc = new SparkContext(conf)
3. val lines = sc.textFile(path, 2)
4. val words = lines.flatMap(_.split(" "))
5. val pairs = words.map(word => (word, 1))
6. val wordCounts = pairs.reduceByKey(_ + _)
7. wordCounts.print()
29
DStream
DStream
DStream
Spark Streaming
Confidentiality Information Goes Here
Single Pass
Source Receiver RDD
Source Receiver RDD
RDD
Filter Count Print
Source Receiver RDD
RDD
RDD
Single Pass
Filter Count Print
Pre-first
Batch
First
Batch
Second
Batch
30
DStream
DStream
DStreamSpark Streaming
Confidentiality Information Goes Here
Single Pass
Source Receiver RDD
Source Receiver RDD
RDD
Filter Count
Print
Source Receiver RDD
RDD
RDD
Single Pass
Filter Count
Pre-first
Batch
First
Batch
Second
Batch
Stateful RDD 1
Print
Stateful RDD 2
Stateful RDD 1
31
Spark Streaming and HBase
©2014 Cloudera, Inc. All rights reserved.
Driver
Walker Node
Configs
Executor
Static Space
Configs
HConnection
Tasks Tasks
Walker Node
Executor
Static Space
Configs
HConnection
Tasks Tasks
32
High Level
Architecture
©2014 Cloudera, Inc. All rights reserved.
33
Real-Time Event Processing Approach
©2014 Cloudera, Inc. All rights reserved.
Hadoop Cluster II
Storage Processing
SolR
Hadoop Cluster I
ClientClient
Flume Agents
Hbase /
Memory
Spark
Streaming
HDFS
Hive/Im
pala
Map/Re
duce
Spark
Search
Automated &
Manual
Analytical
Adjustments
and Pattern
detection
Fetching &
Updating Profiles
Adjusting NRT Stats
HDFSEventSink
SolR Sink
Batch Time Adjustments
Automated &
Manual
Review of
NRT Changes
and Counters
Local Cache
Kafka
Clients:
(Swipe
here!)
Web App
34
NRT Processing
©2014 Cloudera, Inc. All rights reserved.
35
Focus on NRT First
©2014 Cloudera, Inc. All rights reserved.
Hadoop Cluster II
Storage Processing
SolR
Hadoop Cluster I
ClientClient
Flume Agents
Hbase /
Memory
Spark
Streaming
HDFS
Hive/Im
pala
Map/Re
duce
Spark
Search
Automated &
Manual
Analytical
Adjustments
and Pattern
detection
Fetching &
Updating Profiles
Adjusting NRT Stats
HDFSEventSink
SolR Sink
Batch Time Adjustments
Automated &
Manual
Review of
NRT Changes
and Counters
Local Cache
Kafka
Clients:
(Swipe
here!)
Web App
NRT Event Processing with Context
36
Streaming Architecture – NRT Event Processing
©2014 Cloudera, Inc. All rights reserved.
Flume Source
Flume Source
Kafka
Initial Events Topic
Flume Source
Flume Interceptor
Event Processing Logic
Local
Memory
HBase
Client
Kafka
Answer Topic
HBase
KafkaConsumer
KafkaProducer
Able to respond with
in 10s of
milliseconds
37
Partitioned NRT Event Processing
©2014 Cloudera, Inc. All rights reserved.
Flume Source
Flume Source
Kafka
Initial Events Topic
Flume Source
Flume Interceptor
Event Processing Logic
Local
Memory
HBase
Client
Kafka
Answer Topic
HBase
KafkaConsumer
KafkaProducer
Topic
Partition A
Partition B
Partition C
Producer
Partitione
r
Producer
Partitione
r
Producer
Partitione
r
Custom Partitioner
Better use of local
memory
38
Completing the
Puzzle
©2014 Cloudera, Inc. All rights reserved.
39
Micro Batching
©2014 Cloudera, Inc. All rights reserved.
Hadoop Cluster II
Storage Processing
SolR
Hadoop Cluster I
ClientClient
Flume Agents
Hbase /
Memory
Spark
Streaming
HDFS
Hive/Im
pala
Map/Re
duce
Spark
Search
Automated &
Manual
Analytical
Adjustments
and Pattern
detection
Fetching &
Updating Profiles
Adjusting NRT Stats
HDFSEventSink
SolR Sink
Batch Time Adjustments
Automated &
Manual
Review of
NRT Changes
and Counters
Local Cache
Kafka
Clients:
(Swipe
here!)
Web App
Micro Batching
Micro Batching
Micro Batching
40
Complex Topologies
©2014 Cloudera, Inc. All rights reserved.
Kafka
Initial Events Topic
Spark Streaming
KafkaDirect
Connection
Dag Topologies
Kafka
Initial Events Topic
Spark Streaming
Kafka Receivers Dag Topologies
Kafka Receivers
Kafka Receivers
• Manages Offset
• Stores Offset is RDD
• No longer needs HDFS for initial RDD check
pointing
• Lets Kafka Manage Offsets
• Uses HDFS for initial RDD recovery
1.3
1.2
41
MicroBatch Bad-Input Handling
©2014 Cloudera, Inc. All rights reserved.
0 1 2 3 4 5 6 7 8 9
1
0
1
1
1
2
1
3
Kafka – incoming events topic
Dag Topologies
0 1 2 3 4 5 6 7 8 9
1
0
1
1
1
2
1
3
Kafka – bad events topic
0 1 2 3 4 5 6 7 8 9
1
0
1
1
1
2
1
3
Kafka – resolved events topic
0 1 2 3 4 5 6 7 8 9
1
0
1
1
1
2
1
3
Kafka – results topic
42
Ingestion
©2014 Cloudera, Inc. All rights reserved.
Hadoop Cluster II
Storage Processing
SolR
Hadoop Cluster I
ClientClient
Flume Agents
Hbase /
Memory
Spark
Streaming
HDFS
Hive/Im
pala
Map/Re
duce
Spark
Search
Automated &
Manual
Analytical
Adjustments
and Pattern
detection
Fetching &
Updating Profiles
Adjusting NRT Stats
HDFSEventSink
SolR Sink
Batch Time Adjustments
Automated &
Manual
Review of
NRT Changes
and Counters
Local Cache
Kafka
Clients:
(Swipe
here!)
Web App
Ingestion
Ingestion
43
Ingestion
©2014 Cloudera, Inc. All rights reserved.
Flume HDFS Sink
Kafka Cluster
Topic
Partition A
Partition B
Partition C
Sink
Sink
Sink
HDFS
Flume SolR Sink
Sink
Sink
Sink
SolR
Flume Hbase Sink
Sink
Sink
Sink
HBase
44
Reflective Thoughts
©2014 Cloudera, Inc. All rights reserved.
Hadoop Cluster II
Storage Processing
SolR
Hadoop Cluster I
ClientClient
Flume Agents
Hbase /
Memory
Spark
Streaming
HDFS
Hive/Im
pala
Map/Re
duce
Spark
Search
Automated &
Manual
Analytical
Adjustments
and Pattern
detection
Fetching &
Updating Profiles
Adjusting NRT Stats
HDFSEventSink
SolR Sink
Batch Time Adjustments
Automated &
Manual
Review of
NRT Changes
and Counters
Local Cache
Kafka
Clients:
(Swipe
here!)
Web App
Research and Searching
©2014 Cloudera, Inc. All rights reserved.

More Related Content

What's hot (20)

PPTX
Ransomware by lokesh
Lokesh Bysani
 
PDF
SYMMETRIC CRYPTOGRAPHY
Santosh Naidu
 
PDF
IBM Qradar
Coenraad Smith
 
PPTX
Security Information and Event Management (SIEM)
k33a
 
PPTX
Big data and Hadoop
Rahul Agarwal
 
PPTX
Cybersecurity 1. intro to cybersecurity
sommerville-videos
 
PPTX
Threat Modeling In 2021
Adam Shostack
 
PPTX
Web technologies lesson 1
nhepner
 
PPTX
Cyber crime ppt
Gracy Joseph
 
PPT
Phishing attacks ppt
Aryan Ragu
 
PPTX
Presentation on Cloud computing
Vijay Bhanu Thodupunoori
 
PPTX
Basic cryptography
Perfect Training Center
 
PPTX
Computer Security 101
Progressive Integrations
 
PDF
Secure Design: Threat Modeling
Narudom Roongsiriwong, CISSP
 
PPTX
tahdid mustaqil ishi.pptx
SharofjonMuxtorov
 
PDF
Building a Data Pipeline from Scratch - Joe Crobak
Hakka Labs
 
PPTX
Map Reduce
Prashant Gupta
 
PPTX
Encryption
keith dias
 
PPTX
Splunk Overview
Splunk
 
PPTX
Cyber security presentation
sweetpeace1
 
Ransomware by lokesh
Lokesh Bysani
 
SYMMETRIC CRYPTOGRAPHY
Santosh Naidu
 
IBM Qradar
Coenraad Smith
 
Security Information and Event Management (SIEM)
k33a
 
Big data and Hadoop
Rahul Agarwal
 
Cybersecurity 1. intro to cybersecurity
sommerville-videos
 
Threat Modeling In 2021
Adam Shostack
 
Web technologies lesson 1
nhepner
 
Cyber crime ppt
Gracy Joseph
 
Phishing attacks ppt
Aryan Ragu
 
Presentation on Cloud computing
Vijay Bhanu Thodupunoori
 
Basic cryptography
Perfect Training Center
 
Computer Security 101
Progressive Integrations
 
Secure Design: Threat Modeling
Narudom Roongsiriwong, CISSP
 
tahdid mustaqil ishi.pptx
SharofjonMuxtorov
 
Building a Data Pipeline from Scratch - Joe Crobak
Hakka Labs
 
Map Reduce
Prashant Gupta
 
Encryption
keith dias
 
Splunk Overview
Splunk
 
Cyber security presentation
sweetpeace1
 

Similar to Fraud Detection Architecture (20)

PDF
Fraud Detection using Hadoop
hadooparchbook
 
PPTX
Fraud Detection for Israel BigThings Meetup
Gwen (Chen) Shapira
 
PPTX
Real Time Data Processing Using Spark Streaming
Hari Shreedharan
 
PPTX
Spark Streaming & Kafka-The Future of Stream Processing
Jack Gudenkauf
 
PPTX
Spark Streaming& Kafka-The Future of Stream Processing by Hari Shreedharan of...
Data Con LA
 
PPTX
End to End Streaming Architectures
Cloudera, Inc.
 
PDF
Streaming architecture patterns
hadooparchbook
 
PPTX
Real Time Data Processing using Spark Streaming | Data Day Texas 2015
Cloudera, Inc.
 
PPTX
Real Time Data Processing Using Spark Streaming
Hari Shreedharan
 
PPTX
Spark+flume seattle
Hari Shreedharan
 
PPTX
Realtime Detection of DDOS attacks using Apache Spark and MLLib
Ryan Bosshart
 
PDF
Fraud Detection with Hadoop
markgrover
 
PDF
Meetup: Streaming Data Pipeline Development
Timothy Spann
 
PPTX
Event Detection Pipelines with Apache Kafka
DataWorks Summit
 
PPTX
Real time analytics with Kafka and SparkStreaming
Ashish Singh
 
PPTX
Ingest and Stream Processing - What will you choose?
DataWorks Summit/Hadoop Summit
 
PPTX
Data Architectures for Robust Decision Making
Gwen (Chen) Shapira
 
PPTX
Have your cake and eat it too
Gwen (Chen) Shapira
 
PPTX
Have your Cake and Eat it Too - Architecture for Batch and Real-time processing
DataWorks Summit
 
PPTX
Intro to Apache Spark
Cloudera, Inc.
 
Fraud Detection using Hadoop
hadooparchbook
 
Fraud Detection for Israel BigThings Meetup
Gwen (Chen) Shapira
 
Real Time Data Processing Using Spark Streaming
Hari Shreedharan
 
Spark Streaming & Kafka-The Future of Stream Processing
Jack Gudenkauf
 
Spark Streaming& Kafka-The Future of Stream Processing by Hari Shreedharan of...
Data Con LA
 
End to End Streaming Architectures
Cloudera, Inc.
 
Streaming architecture patterns
hadooparchbook
 
Real Time Data Processing using Spark Streaming | Data Day Texas 2015
Cloudera, Inc.
 
Real Time Data Processing Using Spark Streaming
Hari Shreedharan
 
Spark+flume seattle
Hari Shreedharan
 
Realtime Detection of DDOS attacks using Apache Spark and MLLib
Ryan Bosshart
 
Fraud Detection with Hadoop
markgrover
 
Meetup: Streaming Data Pipeline Development
Timothy Spann
 
Event Detection Pipelines with Apache Kafka
DataWorks Summit
 
Real time analytics with Kafka and SparkStreaming
Ashish Singh
 
Ingest and Stream Processing - What will you choose?
DataWorks Summit/Hadoop Summit
 
Data Architectures for Robust Decision Making
Gwen (Chen) Shapira
 
Have your cake and eat it too
Gwen (Chen) Shapira
 
Have your Cake and Eat it Too - Architecture for Batch and Real-time processing
DataWorks Summit
 
Intro to Apache Spark
Cloudera, Inc.
 
Ad

More from Gwen (Chen) Shapira (20)

PPTX
Velocity 2019 - Kafka Operations Deep Dive
Gwen (Chen) Shapira
 
PPTX
Lies Enterprise Architects Tell - Data Day Texas 2018 Keynote
Gwen (Chen) Shapira
 
PPTX
Gluecon - Kafka and the service mesh
Gwen (Chen) Shapira
 
PPTX
Multi-Cluster and Failover for Apache Kafka - Kafka Summit SF 17
Gwen (Chen) Shapira
 
PPTX
Papers we love realtime at facebook
Gwen (Chen) Shapira
 
PPTX
Kafka reliability velocity 17
Gwen (Chen) Shapira
 
PPTX
Multi-Datacenter Kafka - Strata San Jose 2017
Gwen (Chen) Shapira
 
PPTX
Streaming Data Integration - For Women in Big Data Meetup
Gwen (Chen) Shapira
 
PPTX
Kafka at scale facebook israel
Gwen (Chen) Shapira
 
PPTX
Kafka connect-london-meetup-2016
Gwen (Chen) Shapira
 
PPT
Kafka Reliability - When it absolutely, positively has to be there
Gwen (Chen) Shapira
 
PPTX
Nyc kafka meetup 2015 - when bad things happen to good kafka clusters
Gwen (Chen) Shapira
 
PPTX
Kafka for DBAs
Gwen (Chen) Shapira
 
PPTX
Kafka and Hadoop at LinkedIn Meetup
Gwen (Chen) Shapira
 
PPTX
Kafka & Hadoop - for NYC Kafka Meetup
Gwen (Chen) Shapira
 
PPTX
Twitter with hadoop for oow
Gwen (Chen) Shapira
 
PPTX
R for hadoopers
Gwen (Chen) Shapira
 
PPTX
Scaling ETL with Hadoop - Avoiding Failure
Gwen (Chen) Shapira
 
PPTX
Intro to Spark - for Denver Big Data Meetup
Gwen (Chen) Shapira
 
PPTX
Incredible Impala
Gwen (Chen) Shapira
 
Velocity 2019 - Kafka Operations Deep Dive
Gwen (Chen) Shapira
 
Lies Enterprise Architects Tell - Data Day Texas 2018 Keynote
Gwen (Chen) Shapira
 
Gluecon - Kafka and the service mesh
Gwen (Chen) Shapira
 
Multi-Cluster and Failover for Apache Kafka - Kafka Summit SF 17
Gwen (Chen) Shapira
 
Papers we love realtime at facebook
Gwen (Chen) Shapira
 
Kafka reliability velocity 17
Gwen (Chen) Shapira
 
Multi-Datacenter Kafka - Strata San Jose 2017
Gwen (Chen) Shapira
 
Streaming Data Integration - For Women in Big Data Meetup
Gwen (Chen) Shapira
 
Kafka at scale facebook israel
Gwen (Chen) Shapira
 
Kafka connect-london-meetup-2016
Gwen (Chen) Shapira
 
Kafka Reliability - When it absolutely, positively has to be there
Gwen (Chen) Shapira
 
Nyc kafka meetup 2015 - when bad things happen to good kafka clusters
Gwen (Chen) Shapira
 
Kafka for DBAs
Gwen (Chen) Shapira
 
Kafka and Hadoop at LinkedIn Meetup
Gwen (Chen) Shapira
 
Kafka & Hadoop - for NYC Kafka Meetup
Gwen (Chen) Shapira
 
Twitter with hadoop for oow
Gwen (Chen) Shapira
 
R for hadoopers
Gwen (Chen) Shapira
 
Scaling ETL with Hadoop - Avoiding Failure
Gwen (Chen) Shapira
 
Intro to Spark - for Denver Big Data Meetup
Gwen (Chen) Shapira
 
Incredible Impala
Gwen (Chen) Shapira
 
Ad

Recently uploaded (20)

PPTX
Rational Functions, Equations, and Inequalities (1).pptx
mdregaspi24
 
PPTX
Rocket-Launched-PowerPoint-Template.pptx
Arden31
 
PDF
apidays Helsinki & North 2025 - Monetizing AI APIs: The New API Economy, Alla...
apidays
 
PPT
Data base management system Transactions.ppt
gandhamcharan2006
 
PPTX
Numbers of a nation: how we estimate population statistics | Accessible slides
Office for National Statistics
 
PDF
Incident Response and Digital Forensics Certificate
VICTOR MAESTRE RAMIREZ
 
PPTX
apidays Munich 2025 - Building an AWS Serverless Application with Terraform, ...
apidays
 
PDF
OPPOTUS - Malaysias on Malaysia 1Q2025.pdf
Oppotus
 
PPTX
加拿大尼亚加拉学院毕业证书{Niagara在读证明信Niagara成绩单修改}复刻
Taqyea
 
PPTX
The _Operations_on_Functions_Addition subtruction Multiplication and Division...
mdregaspi24
 
PDF
Choosing the Right Database for Indexing.pdf
Tamanna
 
PPTX
recruitment Presentation.pptxhdhshhshshhehh
devraj40467
 
PPTX
Advanced_NLP_with_Transformers_PPT_final 50.pptx
Shiwani Gupta
 
PDF
apidays Helsinki & North 2025 - How (not) to run a Graphql Stewardship Group,...
apidays
 
PPTX
Human-Action-Recognition-Understanding-Behavior.pptx
nreddyjanga
 
PPT
deep dive data management sharepoint apps.ppt
novaprofk
 
PPT
01 presentation finyyyal معهد معايره.ppt
eltohamym057
 
PDF
Building Production-Ready AI Agents with LangGraph.pdf
Tamanna
 
PPTX
Module-5-Measures-of-Central-Tendency-Grouped-Data-1.pptx
lacsonjhoma0407
 
PDF
Avatar for apidays apidays PRO June 07, 2025 0 5 apidays Helsinki & North 2...
apidays
 
Rational Functions, Equations, and Inequalities (1).pptx
mdregaspi24
 
Rocket-Launched-PowerPoint-Template.pptx
Arden31
 
apidays Helsinki & North 2025 - Monetizing AI APIs: The New API Economy, Alla...
apidays
 
Data base management system Transactions.ppt
gandhamcharan2006
 
Numbers of a nation: how we estimate population statistics | Accessible slides
Office for National Statistics
 
Incident Response and Digital Forensics Certificate
VICTOR MAESTRE RAMIREZ
 
apidays Munich 2025 - Building an AWS Serverless Application with Terraform, ...
apidays
 
OPPOTUS - Malaysias on Malaysia 1Q2025.pdf
Oppotus
 
加拿大尼亚加拉学院毕业证书{Niagara在读证明信Niagara成绩单修改}复刻
Taqyea
 
The _Operations_on_Functions_Addition subtruction Multiplication and Division...
mdregaspi24
 
Choosing the Right Database for Indexing.pdf
Tamanna
 
recruitment Presentation.pptxhdhshhshshhehh
devraj40467
 
Advanced_NLP_with_Transformers_PPT_final 50.pptx
Shiwani Gupta
 
apidays Helsinki & North 2025 - How (not) to run a Graphql Stewardship Group,...
apidays
 
Human-Action-Recognition-Understanding-Behavior.pptx
nreddyjanga
 
deep dive data management sharepoint apps.ppt
novaprofk
 
01 presentation finyyyal معهد معايره.ppt
eltohamym057
 
Building Production-Ready AI Agents with LangGraph.pdf
Tamanna
 
Module-5-Measures-of-Central-Tendency-Grouped-Data-1.pptx
lacsonjhoma0407
 
Avatar for apidays apidays PRO June 07, 2025 0 5 apidays Helsinki & North 2...
apidays
 

Fraud Detection Architecture

  • 1. Real Time Fraud Detection Patterns and reference architectures Ted Malaska // PSA Gwen Shapira // Software Engineer
  • 2. 2 • Intro • Review Problem • Quick overview of key technology • High level architecture • Deep Dive into NRT Processing • Completing the Puzzle – Micro-batch, Ingest and Batch Overview ©2014 Cloudera, Inc. All rights reserved.
  • 3. 3©2014 Cloudera, Inc. All rights reserved. • 15 years of moving data • Formerly consultant • Now Cloudera Engineer: – Sqoop Committer – Kafka – Flume • @gwenshap Gwen Shapira
  • 4. 4 • Ted Malaska (PSA at Cloudera) • Hadoop for ~5 years • Contributed to – HDFS, MapReduce, Yarn, HBase, Spark, Avro, – Kite, Pig, Navigator, Cloudera Manager, Flume, Kafke, Sqoop, Accumulo – And working on a Sentry Patch • Co-Author to O’Reilly Hadoop Application Architectures • Worked with about 70 companies in 8 countries • Marvel Fan Boy • Runner Hello ©2014 Cloudera, Inc. All rights reserved.
  • 5. 5 The Problem ©2014 Cloudera, Inc. All rights reserved.
  • 6. 6 Credit Card Transaction Fraud ©2014 Cloudera, Inc. All rights reserved.
  • 7. 7 Ikea Meat Balls ©2014 Cloudera, Inc. All rights reserved.
  • 8. 8 Coupon Fraud ©2014 Cloudera, Inc. All rights reserved.
  • 9. 9 Video Game Strategy ©2014 Cloudera, Inc. All rights reserved.
  • 10. 10 Health Insurance Fraud ©2014 Cloudera, Inc. All rights reserved.
  • 11. 11 • Typical Atomic Card Fraud Detection • Ikea Meat Ball • Multi Coupons Combinations • OP or Negative Video Games Strategies • Ad Serving • Health Insurance Fraud • Kid Coming Home From School Review of the Problem ©2014 Cloudera, Inc. All rights reserved.
  • 12. 12 How do we React • Human Brain at Tennis – Muscle Memory – Reaction Thought – Reflective Meditation ©2014 Cloudera, Inc. All rights reserved.
  • 13. 13 Overview of Key Technologies ©2014 Cloudera, Inc. All rights reserved.
  • 14. 14 Kafka ©2014 Cloudera, Inc. All Rights Reserved.
  • 15. 15©2014 Cloudera, Inc. All rights reserved. • Messages are organized into topics • Producers push messages • Consumers pull messages • Kafka runs in a cluster. Nodes are called brokers The Basics
  • 16. 16©2014 Cloudera, Inc. All rights reserved. Topics, Partitions and Logs
  • 17. 17©2014 Cloudera, Inc. All rights reserved. Each partition is a log
  • 18. 18©2014 Cloudera, Inc. All rights reserved. Each Broker has many partitions Partition 0 Partition 0 Partition 1 Partition 1 Partition 2 Partition 1 Partition 0 Partition 2 Partion 2
  • 19. 19©2014 Cloudera, Inc. All rights reserved. Producers load balance between partitions Partition 0 Partition 1 Partition 2 Partition 1 Partition 0 Partition 2 Partition 0 Partition 1 Partion 2 Client
  • 20. 20©2014 Cloudera, Inc. All rights reserved. Producers load balance between partitions Partition 0 Partition 1 Partition 2 Partition 1 Partition 0 Partition 2 Partition 0 Partition 1 Partion 2 Client
  • 21. 21©2014 Cloudera, Inc. All rights reserved. Consumers Consumer Group Y Consumer Group X Consumer Kafka Cluster Topic Partition A (File) Partition B (File) Partition C (File) Consumer Consumer Consumer Order retained with in partition Order retained with in partition but not over partitionsOffSetX OffSetX OffSetX OffSetYOffSetYOffSetY Off sets are kept per consumer group
  • 23. 23 Sources Interceptors Selectors Channels Sinks Flume Agent Short Intro to Flume Twitter, logs, JMS, webserver, Kafka Mask, re-format, validate… DR, critical Memory, file, Kafka HDFS, HBase, Solr
  • 24. 24 Flume and/or Kafka ©2014 Cloudera, Inc. All rights reserved. Flume UpStream Flume Source Interceptor Flume Channel Flume Sink Down Stream Selector Can Be KafkaCan Be KafkaCan Be Kafka
  • 25. 25 Interceptors • Mask fields • Validate information against external source • Extract fields • Modify data format • Filter or split events ©2014 Cloudera, Inc. All rights reserved.
  • 27. 27 Spark Streaming Example ©2014 Cloudera, Inc. All rights reserved. 1. val conf = new SparkConf().setMaster("local[2]”) 2. val ssc = new StreamingContext(conf, Seconds(1)) 3. val lines = ssc.socketTextStream("localhost", 9999) 4. val words = lines.flatMap(_.split(" ")) 5. val pairs = words.map(word => (word, 1)) 6. val wordCounts = pairs.reduceByKey(_ + _) 7. wordCounts.print() 8. SSC.start()
  • 28. 28 Spark Streaming Example ©2014 Cloudera, Inc. All rights reserved. 1. val conf = new SparkConf().setMaster("local[2]”) 2. val sc = new SparkContext(conf) 3. val lines = sc.textFile(path, 2) 4. val words = lines.flatMap(_.split(" ")) 5. val pairs = words.map(word => (word, 1)) 6. val wordCounts = pairs.reduceByKey(_ + _) 7. wordCounts.print()
  • 29. 29 DStream DStream DStream Spark Streaming Confidentiality Information Goes Here Single Pass Source Receiver RDD Source Receiver RDD RDD Filter Count Print Source Receiver RDD RDD RDD Single Pass Filter Count Print Pre-first Batch First Batch Second Batch
  • 30. 30 DStream DStream DStreamSpark Streaming Confidentiality Information Goes Here Single Pass Source Receiver RDD Source Receiver RDD RDD Filter Count Print Source Receiver RDD RDD RDD Single Pass Filter Count Pre-first Batch First Batch Second Batch Stateful RDD 1 Print Stateful RDD 2 Stateful RDD 1
  • 31. 31 Spark Streaming and HBase ©2014 Cloudera, Inc. All rights reserved. Driver Walker Node Configs Executor Static Space Configs HConnection Tasks Tasks Walker Node Executor Static Space Configs HConnection Tasks Tasks
  • 32. 32 High Level Architecture ©2014 Cloudera, Inc. All rights reserved.
  • 33. 33 Real-Time Event Processing Approach ©2014 Cloudera, Inc. All rights reserved. Hadoop Cluster II Storage Processing SolR Hadoop Cluster I ClientClient Flume Agents Hbase / Memory Spark Streaming HDFS Hive/Im pala Map/Re duce Spark Search Automated & Manual Analytical Adjustments and Pattern detection Fetching & Updating Profiles Adjusting NRT Stats HDFSEventSink SolR Sink Batch Time Adjustments Automated & Manual Review of NRT Changes and Counters Local Cache Kafka Clients: (Swipe here!) Web App
  • 34. 34 NRT Processing ©2014 Cloudera, Inc. All rights reserved.
  • 35. 35 Focus on NRT First ©2014 Cloudera, Inc. All rights reserved. Hadoop Cluster II Storage Processing SolR Hadoop Cluster I ClientClient Flume Agents Hbase / Memory Spark Streaming HDFS Hive/Im pala Map/Re duce Spark Search Automated & Manual Analytical Adjustments and Pattern detection Fetching & Updating Profiles Adjusting NRT Stats HDFSEventSink SolR Sink Batch Time Adjustments Automated & Manual Review of NRT Changes and Counters Local Cache Kafka Clients: (Swipe here!) Web App NRT Event Processing with Context
  • 36. 36 Streaming Architecture – NRT Event Processing ©2014 Cloudera, Inc. All rights reserved. Flume Source Flume Source Kafka Initial Events Topic Flume Source Flume Interceptor Event Processing Logic Local Memory HBase Client Kafka Answer Topic HBase KafkaConsumer KafkaProducer Able to respond with in 10s of milliseconds
  • 37. 37 Partitioned NRT Event Processing ©2014 Cloudera, Inc. All rights reserved. Flume Source Flume Source Kafka Initial Events Topic Flume Source Flume Interceptor Event Processing Logic Local Memory HBase Client Kafka Answer Topic HBase KafkaConsumer KafkaProducer Topic Partition A Partition B Partition C Producer Partitione r Producer Partitione r Producer Partitione r Custom Partitioner Better use of local memory
  • 38. 38 Completing the Puzzle ©2014 Cloudera, Inc. All rights reserved.
  • 39. 39 Micro Batching ©2014 Cloudera, Inc. All rights reserved. Hadoop Cluster II Storage Processing SolR Hadoop Cluster I ClientClient Flume Agents Hbase / Memory Spark Streaming HDFS Hive/Im pala Map/Re duce Spark Search Automated & Manual Analytical Adjustments and Pattern detection Fetching & Updating Profiles Adjusting NRT Stats HDFSEventSink SolR Sink Batch Time Adjustments Automated & Manual Review of NRT Changes and Counters Local Cache Kafka Clients: (Swipe here!) Web App Micro Batching Micro Batching Micro Batching
  • 40. 40 Complex Topologies ©2014 Cloudera, Inc. All rights reserved. Kafka Initial Events Topic Spark Streaming KafkaDirect Connection Dag Topologies Kafka Initial Events Topic Spark Streaming Kafka Receivers Dag Topologies Kafka Receivers Kafka Receivers • Manages Offset • Stores Offset is RDD • No longer needs HDFS for initial RDD check pointing • Lets Kafka Manage Offsets • Uses HDFS for initial RDD recovery 1.3 1.2
  • 41. 41 MicroBatch Bad-Input Handling ©2014 Cloudera, Inc. All rights reserved. 0 1 2 3 4 5 6 7 8 9 1 0 1 1 1 2 1 3 Kafka – incoming events topic Dag Topologies 0 1 2 3 4 5 6 7 8 9 1 0 1 1 1 2 1 3 Kafka – bad events topic 0 1 2 3 4 5 6 7 8 9 1 0 1 1 1 2 1 3 Kafka – resolved events topic 0 1 2 3 4 5 6 7 8 9 1 0 1 1 1 2 1 3 Kafka – results topic
  • 42. 42 Ingestion ©2014 Cloudera, Inc. All rights reserved. Hadoop Cluster II Storage Processing SolR Hadoop Cluster I ClientClient Flume Agents Hbase / Memory Spark Streaming HDFS Hive/Im pala Map/Re duce Spark Search Automated & Manual Analytical Adjustments and Pattern detection Fetching & Updating Profiles Adjusting NRT Stats HDFSEventSink SolR Sink Batch Time Adjustments Automated & Manual Review of NRT Changes and Counters Local Cache Kafka Clients: (Swipe here!) Web App Ingestion Ingestion
  • 43. 43 Ingestion ©2014 Cloudera, Inc. All rights reserved. Flume HDFS Sink Kafka Cluster Topic Partition A Partition B Partition C Sink Sink Sink HDFS Flume SolR Sink Sink Sink Sink SolR Flume Hbase Sink Sink Sink Sink HBase
  • 44. 44 Reflective Thoughts ©2014 Cloudera, Inc. All rights reserved. Hadoop Cluster II Storage Processing SolR Hadoop Cluster I ClientClient Flume Agents Hbase / Memory Spark Streaming HDFS Hive/Im pala Map/Re duce Spark Search Automated & Manual Analytical Adjustments and Pattern detection Fetching & Updating Profiles Adjusting NRT Stats HDFSEventSink SolR Sink Batch Time Adjustments Automated & Manual Review of NRT Changes and Counters Local Cache Kafka Clients: (Swipe here!) Web App Research and Searching
  • 45. ©2014 Cloudera, Inc. All rights reserved.

Editor's Notes

  • #4: This gives me a lot of perspective regarding the use of Hadoop
  • #17: Topics are partitioned, each partition ordered and immutable. Messages in a partition have an ID, called Offset. Offset uniquely identifies a message within a partition
  • #18: Kafka retains all messages for fixed amount of time. Not waiting for acks from consumers. The only metadata retained per consumer is the position in the log – the offset So adding many consumers is cheap On the other hand, consumers have more responsibility and are more challenging to implement correctly And “batching” consumers is not a problem
  • #19: 3 partitions, each replicated 3 times.
  • #20: The choose how many replicas must ACK a message before its considered committed. This is the tradeoff between speed and reliability
  • #21: The choose how many replicas must ACK a message before its considered committed. This is the tradeoff between speed and reliability
  • #22: can read from one or more partition leader. You can’t have two consumers in same group reading the same partition. Leaders obviously do more work – but they are balanced between nodes We reviewed the basic components on the system, and it may seem complex. In the next section we’ll see how simple it actually is to get started with Kafka.
  • #24: Does not require programming.