Realtime Components & Architectures
Prepared for Big Data Madison
Ryan Bosshart // Systems Engineer
Agenda
• (Near) Real-Time Problems – Actual Cloudera Use-Cases
• Applicable Frameworks and Architectures
• DDOS Example & Code
Connected Medical Devices
• Batch:
– How does a patient’s disease progress over time?
– How does physician training affect disease state?
– How can we recommend better therapies?
• Realtime:
– What is the patient’s disease state right now?
– Alert on potential device malfunctions.
Connected Cars
• Batch:
– Manufacturer wants to know optimal charge performance.
• Real-time:
– Consumer wants to know if teen is driving car right now. How fast are they accelerating / driving?
– Vehicle Service – e.g. grab an up-to-date “diagnosis bundle” before service.
Victim’s Infrastructure
Security
• Batch analytics:
– Which countries do attacks most commonly come from?
• Realtime:
– How do we detect and stop attackers right now?
Netflow Data
Bytes   Contents    Description
0-3     srcaddr     Source IP address
4-7     dstaddr     Destination IP address
8-11    nexthop     IP address of next hop router
12-13   input       SNMP index of input interface
14-15   output      SNMP index of output interface
16-19   dPkts       Packets in the flow
20-23   dOctets     Total number of Layer 3 bytes in the packets of the flow
24-27   first       SysUptime at start of flow
28-31   last        SysUptime at the time the last packet of the flow was received
32-33   srcport     TCP/UDP source port number or equivalent
34-35   dstport     TCP/UDP destination port number or equivalent
36      pad1        Unused (zero) bytes
37      tcp_flags   Cumulative OR of TCP flags
38      prot        IP protocol type (for example, TCP = 6; UDP = 17)
39      tos         IP type of service (ToS)
40-41   src_as      Autonomous system number of the source, either origin or peer
42-43   dst_as      Autonomous system number of the destination, either origin or peer
44      src_mask    Source address prefix mask bits
45      dst_mask    Destination address prefix mask bits
46-47   pad2        Unused (zero) bytes
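The byte layout above can be decoded with a few lines of plain Scala. The following is a minimal sketch, not the code used in the talk: the object name `NetflowV5`, the case class `FlowRecord` and its field names are illustrative assumptions, and the sketch assumes the 48-byte record is in network byte order (big-endian) and has already been separated from the export packet header.

```scala
import java.nio.{ByteBuffer, ByteOrder}

object NetflowV5 {
  // One 48-byte flow record; field names follow the table (hypothetical Scala names).
  final case class FlowRecord(
    srcAddr: String, dstAddr: String, nextHop: String,
    input: Int, output: Int,
    dPkts: Long, dOctets: Long, first: Long, last: Long,
    srcPort: Int, dstPort: Int,
    tcpFlags: Int, prot: Int, tos: Int,
    srcAs: Int, dstAs: Int, srcMask: Int, dstMask: Int)

  val RecordLength = 48

  // Render a 32-bit value as a dotted-quad IPv4 address.
  private def ipv4(raw: Int): String =
    Seq(24, 16, 8, 0).map(shift => (raw >>> shift) & 0xff).mkString(".")

  def parse(bytes: Array[Byte]): FlowRecord = {
    require(bytes.length >= RecordLength, s"need $RecordLength bytes, got ${bytes.length}")
    val b = ByteBuffer.wrap(bytes).order(ByteOrder.BIG_ENDIAN)
    // The JVM has no unsigned types, so widen and mask each field.
    def u8(): Int = b.get() & 0xff
    def u16(): Int = b.getShort() & 0xffff
    def u32(): Long = b.getInt() & 0xffffffffL

    val srcAddr = ipv4(b.getInt())   // bytes 0-3
    val dstAddr = ipv4(b.getInt())   // bytes 4-7
    val nextHop = ipv4(b.getInt())   // bytes 8-11
    val input = u16()                // bytes 12-13
    val output = u16()               // bytes 14-15
    val dPkts = u32()                // bytes 16-19
    val dOctets = u32()              // bytes 20-23
    val first = u32()                // bytes 24-27
    val last = u32()                 // bytes 28-31
    val srcPort = u16()              // bytes 32-33
    val dstPort = u16()              // bytes 34-35
    b.get()                          // byte 36: pad1, skipped
    val tcpFlags = u8()              // byte 37
    val prot = u8()                  // byte 38
    val tos = u8()                   // byte 39
    val srcAs = u16()                // bytes 40-41
    val dstAs = u16()                // bytes 42-43
    val srcMask = u8()               // byte 44
    val dstMask = u8()               // byte 45
    // bytes 46-47: pad2, not read
    FlowRecord(srcAddr, dstAddr, nextHop, input, output, dPkts, dOctets, first, last,
      srcPort, dstPort, tcpFlags, prot, tos, srcAs, dstAs, srcMask, dstMask)
  }
}
```

Calling `NetflowV5.parse(recordBytes)` on each 48-byte slice of an export packet would yield one `FlowRecord` per flow. Masking is needed because ports, counters and AS numbers are unsigned on the wire.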
Ingesting and Processing Netflow Data
Diagram labels: IP Traffic, Annotate, Netflow, Pub-Sub, Analyze Data & Train Model, Classify Events as DDOS or Legit, Analyze Long Term, IP Geolocation
https://github.com/Markus-Go/bonesi
Pipeline stages:
1. Ingest and Annotate
2. Stage Netflow Events
3. Store & process
4. Store Model
5. Process Realtime Events
6. Alert and Analyze
Need to Process in Different Ways
• Stream Ingestion – low latency persistence to HDFS, HBase, Solr, etc.
• Near Real-Time Processing with External Context – alerting, flagging, transforms, filtering.
• Complex Near Real-Time Processing – complex aggregations, windowed computations, machine learning, etc.
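As an illustration of a windowed computation, the sketch below counts flows per source address inside a time window and flags sources above a threshold. It is plain Scala rather than a streaming framework, and the names `Event`, `countsInWindow`, `flagged` and the threshold rule are assumptions made for the example, not something taken from the slides.

```scala
object WindowedCounts {
  // Hypothetical minimal event: arrival time in seconds and the flow's source address.
  final case class Event(timestampSec: Long, srcAddr: String)

  // Count events per source address in the window (windowEndSec - windowSec, windowEndSec].
  def countsInWindow(events: Seq[Event], windowEndSec: Long, windowSec: Long): Map[String, Int] =
    events
      .filter(e => e.timestampSec > windowEndSec - windowSec && e.timestampSec <= windowEndSec)
      .groupBy(_.srcAddr)
      .map { case (addr, es) => addr -> es.size }

  // Source addresses whose count in the window is strictly above the threshold.
  def flagged(events: Seq[Event], windowEndSec: Long, windowSec: Long, threshold: Int): Set[String] =
    countsInWindow(events, windowEndSec, windowSec)
      .collect { case (addr, n) if n > threshold => addr }
      .toSet
}
```

A streaming engine would evaluate the same logic incrementally as the window slides forward. Here the window is recomputed from scratch to keep the example short.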
Need to Persist in Different Ways
• Kafka – pub-sub messaging, fast, scalable, durable
• Solr – natural language search, low-latency, scalable
• HBase – online, real-time gets, puts, micro-scans
• HDFS – analytical SQL, scans.
Architecture Patterns for Ingest and Annotation
Ingest…
Diagram labels: IP Traffic, Annotate, Netflow, IP Geolocation
https://github.com/Markus-Go/bonesi
Step 1: Ingest and Annotate
Flume - Capture and Ingest Streaming Data
Diagram: Logs → Flume → HDFS
Sources and Sinks (in slide order): kafka, jms, log4j, directory, thrift, solr, elasticsearch, hbase, kafka
Flume – Interceptors
Diagram labels: NetFlow Logs, Flume Source, Flume Interceptor, Memory, HBase Client, GeoDB, “netflow” topic
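The annotation step an interceptor performs can be sketched in plain Scala: look up the event's source address in a geolocation table and attach the result as a header. This is a sketch under stated assumptions, not Flume's interceptor API. The `Event` shape, the `intercept` name, the `country` header key and the in-memory `geoDb` table are all hypothetical, and the table entries are made-up mappings for documentation address ranges.

```scala
object GeoAnnotator {
  // Hypothetical event shape: a NetFlow source address plus free-form headers.
  final case class Event(srcAddr: String, headers: Map[String, String] = Map.empty)

  // Stand-in for the GeoDB: CIDR prefix -> country code. Entries are invented
  // for illustration and use documentation-only address ranges.
  val geoDb: Seq[(String, String)] = Seq(
    "192.0.2.0/24" -> "US",
    "198.51.100.0/24" -> "DE")

  // Convert a dotted-quad IPv4 address to its 32-bit value.
  private def toLong(ip: String): Long =
    ip.split('.').foldLeft(0L)((acc, octet) => (acc << 8) | octet.toInt)

  // True if the address falls inside the given CIDR prefix.
  private def inCidr(ip: String, cidr: String): Boolean = {
    val parts = cidr.split('/')
    val bits = parts(1).toInt
    val mask = if (bits == 0) 0L else (0xffffffffL << (32 - bits)) & 0xffffffffL
    (toLong(ip) & mask) == (toLong(parts(0)) & mask)
  }

  // Return a copy of the event with a "country" header added.
  def intercept(event: Event): Event = {
    val country = geoDb.collectFirst { case (cidr, cc) if inCidr(event.srcAddr, cidr) => cc }
    event.copy(headers = event.headers + ("country" -> country.getOrElse("unknown")))
  }
}
```

Holding the lookup table in memory keeps the per-event cost to a scan of the prefix list. A production lookup would more likely use a prefix tree or a dedicated geolocation library.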
Ingest with StreamSets
Intelligent
Monitoring
Adaptable
Flows
Continuous
Platform
Streaming
Sanitization
GeoDB
Ingest…
Diagram labels: IP Traffic, Netflow, IP Geolocation
Step 1: Ingest and Annotate
Real-time Pub-Sub
Apache Kafka
Pub-Sub
Diagram labels: Netflow, Pub-Sub, Analyze Data & Train Model, Classify Events as DDOS or Legit, IP Geolocation
Why Kafka?
2009
Why Kafka? Increasing complexity
2009, 2014
Why Kafka? Decoupling
2014, 2015+ ?
What is Kafka?
• Kafka is a distributed, topic-oriented, partitioned, replicated commit log.
• Kafka is also a pub-sub messaging system.
• Messages can be text (e.g. syslog), but binary is best (preferably Avro!).
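The "partitioned commit log" idea can be made concrete with a toy model: a topic is a fixed set of append-only partitions, a key selects the partition, and readers fetch by offset. This is an in-memory sketch only. `MiniLog`, `Topic`, `append` and `read` are hypothetical names rather than Kafka's client API, replication and distribution are not modeled, and partition selection here uses `hashCode`, which is a simplification of what a real Kafka producer does.

```scala
import scala.collection.mutable.ArrayBuffer

object MiniLog {
  // A topic is a fixed set of partitions; each partition is an append-only sequence.
  final class Topic(numPartitions: Int) {
    require(numPartitions > 0, "a topic needs at least one partition")
    private val partitions = Vector.fill(numPartitions)(ArrayBuffer.empty[String])

    // Append a message. The key picks the partition; returns (partition, offset).
    def append(key: String, value: String): (Int, Long) = {
      val p = Math.floorMod(key.hashCode, numPartitions)
      partitions(p) += value
      (p, partitions(p).size - 1L)
    }

    // Read up to `max` messages from a partition, starting at `offset`.
    // Reading does not remove anything, so several consumers can read independently.
    def read(partition: Int, offset: Long, max: Int): Seq[String] =
      partitions(partition).slice(offset.toInt, offset.toInt + max).toList
  }
}
```

Because messages with the same key land in the same partition, their relative order is preserved, and each consumer only needs to remember the next offset it wants per partition.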
Possible Consumer Architectures
©2014 Cloudera, Inc. All rights reserved.
Kafka Cluster: Topic with Partition A, Partition B, Partition C
• Flume HDFS Sink (Sink, Sink, Sink) → HDFS
• Flume Solr Sink (Sink, Sink, Sink) → Solr
• Flume Solr Sink (Sink, Sink, Sink) → HBase
• Spark Streaming DirectStream Topology
Capture and Ingest Streaming Data – Now with Kafka!
Diagram: Logs → Flume → HDFS
Flume components: Kafka Source, Kafka Channel, HDFS Sink
Diagram: Logs → Kafka → HDFS
Processing and Consumption
Processing & Consumption
Diagram labels: Kafka; HDFS (Storage); Train Model in Spark (Batch); Analyze Long Term Trends; All Events; Realtime Events; Apply Model on DStreams; Spark Streaming; Read Model; Alerts; Classified IPs; Impala (SQL)
Unification of Batch & Streaming
// Create data set from Hadoop file
val dataset = sparkContext.hadoopFile("file")
// Join each batch in stream with the dataset
kafkaDStream.transform { batchRDD =>
  batchRDD.join(dataset).filter(...)
}
Interoperability
// Learn model offline
val model = KMeans.train(dataset, ...)
// Apply model online on stream (featurize is a user-supplied function)
val kafkaStream = KafkaUtils.createDirectStream(...)
kafkaStream.map { event => model.predict(featurize(event)) }
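The "apply model online" step reduces to assigning each event's feature vector to its nearest cluster centre, which is what a trained k-means model does at prediction time. The sketch below shows that step in plain Scala without Spark. The centroids are passed in by the caller, `featurize` is a hypothetical feature choice (log-scaled packet and byte counts) rather than the one used in the demo, and deciding which cluster counts as DDOS traffic is left out.

```scala
object NearestCentroid {
  type Vec = Array[Double]

  // Squared Euclidean distance between two vectors of equal length.
  private def squaredDistance(a: Vec, b: Vec): Double = {
    require(a.length == b.length, "dimension mismatch")
    a.indices.map(i => (a(i) - b(i)) * (a(i) - b(i))).sum
  }

  // Index of the centroid closest to the feature vector.
  def predict(centroids: Seq[Vec], features: Vec): Int = {
    require(centroids.nonEmpty, "at least one centroid is required")
    centroids.indices.minBy(i => squaredDistance(centroids(i), features))
  }

  // Hypothetical featurizer: packets and bytes of one flow, log-scaled.
  def featurize(dPkts: Long, dOctets: Long): Vec =
    Array(math.log1p(dPkts.toDouble), math.log1p(dOctets.toDouble))
}
```

Since prediction only needs the centroids, a model trained in batch can be stored and then loaded by the streaming job, which is the hand-off the slide describes.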
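“Micro-batch” Architecture
// Twitter receiver from the spark-streaming-twitter module; None uses default credentials
val tweets = TwitterUtils.createStream(ssc, None)
// getTags is a user-supplied function that extracts hashtags from a status
val hashTags = tweets.flatMap(status => getTags(status))
hashTags.saveAsTextFiles("hdfs://...")
Diagram: the tweets DStream is divided into batch @ t, batch @ t+1, batch @ t+2; each batch passes through flatMap to form the hashTags DStream and is then saved.
Stream composed of small (1-10s) batch computations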
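The micro-batch idea can be shown without Spark: cut the stream into fixed-length time slices and run the same batch transformation on each slice. This is a plain-Scala sketch under stated assumptions. `Record`, `batches` and `hashTagsPerBatch` are hypothetical names, `getTags` is a simple stand-in for the function of the same name on the slide, and intervals with no records are omitted rather than emitted as empty batches.

```scala
object MicroBatch {
  // Hypothetical timestamped record standing in for one item of a stream.
  final case class Record(timestampMs: Long, text: String)

  // Split records into consecutive batches of `batchMs` milliseconds, ordered by time.
  def batches(records: Seq[Record], batchMs: Long): Seq[Seq[Record]] =
    records.groupBy(_.timestampMs / batchMs).toSeq.sortBy(_._1).map(_._2)

  // Stand-in for the slide's getTags: whitespace-separated tokens starting with '#'.
  def getTags(text: String): Seq[String] =
    text.split("\\s+").filter(_.startsWith("#")).toSeq

  // Apply the same flatMap to every batch, one batch at a time.
  def hashTagsPerBatch(records: Seq[Record], batchMs: Long): Seq[Seq[String]] =
    batches(records, batchMs).map(batch => batch.flatMap(r => getTags(r.text)))
}
```

For example, with a one-second batch length, records at 100 ms and 900 ms fall into the first batch and a record at 1500 ms into the second, so the transformation runs twice.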
Streaming to HDFS
Real-Time Analytics in Hadoop Today
RT Detection in the Real World = Storage Complexity
Diagram labels: Incoming Data (Messaging System); Logs; Spark Streaming OR Cron Job; HBase; Parquet File; New Partition; Most Recent Partition; Historic Data; Impala, Spark, Hive on HDFS; /newdata/smallfile; /yesterday/largefile
• Wait for running operations to complete
• Define new Impala partition referencing the newly written Parquet file
Real-Time Analytics in Hadoop with Kudu
Simpler Architecture, Superior Performance over Hybrid Approaches
Diagram labels: Incoming Data (Messaging System); Impala, Spark on Kudu; Reporting Request
Demo
Using Netflow Data & Detecting a DDOS Attack
Questions?

Realtime Detection of DDOS attacks using Apache Spark and MLLib
