1 © Hortonworks Inc. 2011–2018. All rights reserved.
Solving Cybersecurity at Scale
Laurence Da Luz & Mo Kamel
What Are We Talking About?
Cybersecurity Challenges
Solving Cybersecurity At Scale
Anatomy of Apache Metron
Use Case Walkthrough
CYBERSECURITY IS
A BIG DATA PROBLEM
Big Traffic, Big Trouble
Complexity Problem
• Too many point solutions
• Too many dashboards
• Too hard to correlate data across silos
• Cybersecurity staff overwhelmed with too many alerts
Big Traffic, Big Trouble
+ Capability Problem
• Huge volumes & limited storage
• Inconsistent data from multiple sources
• Real-time context is crucial
• Missing advanced analytics
• Alert fatigue – “Many False Positives”
Big Traffic, Big Trouble
+ People Problem
• Skill shortage around the globe
• Staff inefficiency & high cost
• Low-value work of data gathering and cleansing
• Scaling by adding people is an impractical solution
Big Traffic, Big Trouble
+ Security Problem
• Distracting and advanced attacks
• Lack of security context
• Asset Classifications
• Prioritization and Scoring
• Full access to historical data
A Community Solution
Open Source Solution
• Volume
• Variety
• Value
• Automation
• Realtime
• Threat Intel
Advanced Use Cases
Open Source Solution
• User behavior
• Entity behavior
• Advanced Analytics
Solving Cybersecurity at Scale
An architecture for real-time cybersecurity analytics
[Figure: Real-time processing cyber security engine, running on Hortonworks Data Platform and Hortonworks DataFlow.
Telemetry data sources: security endpoint devices (FireEye, Palo Alto, BlueCoat, etc.); machine-generated logs (AD, app/web server, firewall, VPN, etc.); IDS (Suricata, Snort, etc.); network data (PCAP, NetFlow, Bro, etc.); threat intelligence feeds (Soltra, OpenTaxi, third-party feeds).
Telemetry data collectors: performance network ingest probes; real-time enrich/threat intel streams; other. Collectors feed a telemetry ingest buffer.
Cyber security stream processing pipeline: telemetry parsers → enrichment → threat intel → profiler → alert triage → indexers and writers.
Data services & integration layer, with modules: data vault, real-time search, evidentiary store, threat intelligence platform, Model as a Service, community models, data science workbench, PCAP forensics.]
• Collect security device and machine-generated logs
• Extendable data model
• Enrichment on ingest for extra context
• Behavior profiling and advanced windowing
• Flexible deployment of data science
• Alerting and triage (exposed to SOC)
Hortonworks Cybersecurity Platform (HCP) runs as an application on top of HDF and HDP
Context is everything
Enrichments
• User, group data, internal business sources
• Geospatial data, worldwide shared threat intelligence
• Model predictions, via the Model as a Service framework
• Time
Time is context | Time matters
The Profiler
• A generalized solution for extracting model features and aggregations over time from high-throughput streaming data
• Generates a profile describing the behavior of an entity: a host, user, subnet, or application
• A foundational component for both security model building and alerting in HCP
t = 1, t = 2, t = 3, …, t = n
Profile behavior across windows in time, and across multiple devices
Time is context | Time matters
The Profiler
… how do we perform behavioral profiling at real-time scale?
Data sketches provide fast, approximate answers to queries about the underlying data.
There is a variety of different types of data sketches, but general characteristics include:
• Stream friendly – each item is examined only once and can quickly update a small sketch data structure
• Scalable – effective for queries that do not scale well: count distinct, quantiles, most frequent items
• Approximate – faster results within mathematically proven error bounds
• Provide fixed-size compute and predictable space usage
[Figure: one sketch per period (0<t<1, 1<t<2, 2<t<3, …, n-1<t<n). Sketches merge into a combined sketch for a contiguous period (0<t<3) or for selected periods (0<t<1 + 2<t<3 + …).]
Data sketches are combinable.
• Allows us to slice and dice the windows and recombine them during read
• Can pick and mix sketches (skip certain days, hours, etc.)
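The combinability described above can be sketched in Python. This is a minimal, illustrative HyperLogLog, not Metron's implementation: the class name, the precision `p=12`, the SHA-1 hashing, and the `host-N` sample data are all assumptions made for the example. One sketch is kept per time window, and windows are merged at read time.

```python
import hashlib
import math


class HyperLogLog:
    """Tiny HyperLogLog: approximate distinct counts in fixed space."""

    def __init__(self, p=12):
        self.p = p                     # number of index bits
        self.m = 1 << p                # number of registers (fixed size)
        self.registers = [0] * self.m

    def add(self, item):
        # Each item is examined once: hash it to 64 bits.
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        index = h >> (64 - self.p)                    # first p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)         # remaining 64 - p bits
        rank = (64 - self.p) - rest.bit_length() + 1  # position of the leftmost 1-bit
        self.registers[index] = max(self.registers[index], rank)

    def merge(self, other):
        # Combining two sketches is an element-wise max of their registers.
        if self.p != other.p:
            raise ValueError("sketches must use the same precision")
        merged = HyperLogLog(self.p)
        merged.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]
        return merged

    def estimate(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)
        raw = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            return self.m * math.log(self.m / zeros)  # small-range correction
        return raw


def build_window(items):
    """Build the sketch for one time window."""
    sketch = HyperLogLog()
    for item in items:
        sketch.add(item)
    return sketch


# One sketch per window; the windows overlap in the hosts they saw.
window_1 = build_window(f"host-{i}" for i in range(0, 2000))
window_2 = build_window(f"host-{i}" for i in range(1500, 3500))
window_3 = build_window(f"host-{i}" for i in range(3000, 5000))

all_windows = window_1.merge(window_2).merge(window_3)  # roughly 5000 distinct hosts
skip_middle = window_1.merge(window_3)                  # "pick and mix": roughly 4000
```

Because merging is an element-wise maximum, the merged sketch is identical to one built from all the raw events, which is why windows can be recombined at read time without reprocessing the stream.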
Streaming Analytics at Scale
Algorithms out of the box
Approximation algorithms – specialized algorithms that can produce results orders of magnitude faster within mathematically proven error bounds – ideal for real-time analytics
Profiles
• HyperLogLog (cardinality) – How many servers does this user usually talk to?
• Bloom filters – Have we seen this domain before?
• T-Digest (distribution) – Personalized baselining and statistics
• Counters and descriptive statistics – Quick results and triggers for more intensive calculations
• Mixed period windows – Accounting for holidays, typical working periods, and seasons
Natural Language Processing (finding likely non-human behavior with machine learning)
• Typosquat (misspellings, homoglyphs)
• DGA (Domain Generation Algorithm)
Streaming similarity and anomaly detection
• Mean Absolute Deviation
• TLSH (locality-sensitive hashing) – Finding events similar to known bad
• GeoHash similarity
• Robust PCA
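To make the "have we seen this domain before?" question concrete, here is a minimal Bloom filter sketch in Python. It is an illustration only, not the implementation Metron ships: the class name, the bit-array size, the number of hash functions, and the sample domains are assumptions.

```python
import hashlib


class BloomFilter:
    """Fixed-size set membership with no false negatives and rare false positives."""

    def __init__(self, size_bits=8192, num_hashes=4):
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, item):
        # Derive num_hashes bit positions by salting one hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def might_contain(self, item):
        # True means "probably seen"; False means "definitely not seen".
        return all((self.bits >> pos) & 1 for pos in self._positions(item))


seen_domains = BloomFilter()
for domain in ["cnn.com", "apache.org"]:
    seen_domains.add(domain)

known = seen_domains.might_contain("cnn.com")                # always True once added
first_seen = not seen_domains.might_contain("example.test")  # new domain, worth a closer look
```

The filter never stores the domains themselves, so its memory use stays constant however many events flow through; the trade-off is a small, tunable false-positive rate.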
UNDER THE HOOD: ANATOMY OF THE
DATA ENGINEERING PIPELINE
Architecture & Capabilities
• Pipelines are created and deployed via the Metron framework; no custom Storm code is required
• An extendable Domain Specific Language (DSL) is used across Metron for querying, transformation, and configuring rules
• Core pipeline components: NiFi, Kafka, Storm, Spark, Solr
• Access and visualization: Metron UI & Zoomdata (partner)
• Generated alerts can be integrated with external systems
Acquire
[Figure: raw log lines from several devices entering the pipeline.]
• Devices generate raw log messages
• Data formats come from a variety of disparate systems and sources
• NiFi (& MiNiFi) acquire raw data and handle routing
Normalize
[Figure: a raw log line becomes a JSON record with fields a, b, c.]
• Convert all data from raw source logs into a common JSON format, which simplifies downstream analytics across devices
• Out-of-the-box device parsers: ASA, Bro, FireEye, Palo Alto, …
• General-purpose format parsers: Grok, Regex, CSV, JSON
• Custom Java-based parsers
Enrich
[Figure: the JSON record gains enrichment fields d and e, then alert: true and sev: 1.]
• Add additional information to the raw source during streaming
• Geo enrichment, HBase lookups for custom enrichments, MaaS
• Assess against threat feeds, and alert based on severity
Profile
[Figure: events at t1, t2, t3 feeding a profile that raises an alert.]
• Profiler is a separate pipeline that listens on all streaming events
• The pipeline is specialized to understand a series of actions in time across multiple devices
• Profiler generates feature sets that are stored within HBase
• Windowed features can be looped back for triage and alerting
• Batch profiling is also supported and can “seed” a feature set from historical data
Security Data Lake
• Data is indexed in Solr near-term for random access and hot tiering
• Data is stored in HDFS long-term for historical access and analytics
• SOC analysts perform security monitoring and threat hunting
• Security data scientists train against historical trends to improve alerting models
Acquire → Normalize → Enrich → Profile → Security Data Lake
End-to-end streaming data pipeline enables real-time action against cyber threats in a repeatable pattern
USE CASE
WALK THROUGH
Deploying a Use Case in Apache Metron
Squid Logs - Use Case Walkthrough
What is Squid?
• Squid is a caching proxy for the Web supporting HTTP, HTTPS, FTP, and more. It reduces bandwidth and improves response times by caching and reusing frequently requested web pages.
What does a Squid access log look like?
• When you make an outbound HTTP connection to http://www.cnn.com, an entry is added to a file called access.log. The entry includes:
• Unix epoch time
• IP of the host where the connection was made
• Domain name of the outbound connection
What Metron will do to the Squid telemetry in real time
• Convert from Unix epoch to timestamp
• Asset enrichment to enrich the IP (hostname, type of device)
• WHOIS enrichment to look up domain name information
• Threat intel to cross-reference the address with an intel feed to see if there is a hit
• Index the event into Solr and persist it in HDFS (Security Data Lake)
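The enrichment steps listed above can be illustrated with a small Python sketch. It is a simplified stand-in, not how Metron itself is configured: the field names (`epoch_seconds`, `src_ip`, `domain`), the in-memory lookup tables, and all sample values are hypothetical, and the final indexing step into Solr and HDFS is left out.

```python
from datetime import datetime, timezone

# Hypothetical lookup tables standing in for the asset, WHOIS, and threat intel stores.
ASSETS = {"10.0.0.15": {"hostname": "laptop-42", "device_type": "workstation"}}
WHOIS = {"cnn.com": {"registrar": "example-registrar"}}
THREAT_INTEL = {"bad-domain.example"}


def enrich(event):
    """Return a copy of a parsed Squid event with extra context attached."""
    enriched = dict(event)
    # 1. Convert from Unix epoch to a readable timestamp.
    enriched["timestamp_iso"] = datetime.fromtimestamp(
        event["epoch_seconds"], tz=timezone.utc
    ).isoformat()
    # 2. Asset enrichment: what do we know about the source IP?
    enriched["asset"] = ASSETS.get(event["src_ip"], {})
    # 3. WHOIS enrichment: what do we know about the domain?
    enriched["whois"] = WHOIS.get(event["domain"], {})
    # 4. Threat intel: is the domain on a feed?
    enriched["is_alert"] = event["domain"] in THREAT_INTEL
    return enriched


clean = enrich({"epoch_seconds": 0, "src_ip": "10.0.0.15", "domain": "cnn.com"})
hit = enrich({"epoch_seconds": 0, "src_ip": "10.0.0.99", "domain": "bad-domain.example"})
```

Each step only adds fields to the event, so downstream consumers see the original telemetry plus its context in a single record.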
Deploying a Use Case in Apache Metron
Step 1 NiFi TailFile
Step 2 Define Parser
Step 3 Enrichment Config
Step 4 Configure Alerts
Step 5 Configure Profiler
Step 1 – Telemetry Ingest
Streaming from NiFi to Kafka
Data is tailed from the Squid access-log files with the NiFi TailFile processor and streamed to Kafka.
Step 2 – Configuring the Squid Parser
Defining a Grok Filter for the Squid data
• Grok parser → configuration driven
• Regex-based abstraction
• Grok is suitable for structured or semi-structured logs
• Contains pre-defined mappings (for example, pre-defined Grok mappings for IP)
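Since Grok is a regex-based abstraction, the idea can be approximated with a plain Python regular expression using named groups. This is a sketch under assumptions rather than the Grok filter from the slide: the sample line is made up, it assumes Squid's native access-log layout, and the output field names are illustrative.

```python
import re
from urllib.parse import urlsplit

# Named groups play the role of Grok's %{PATTERN:field} mappings.
SQUID_PATTERN = re.compile(
    r"(?P<timestamp>\d+\.\d+)\s+"
    r"(?P<elapsed>\d+)\s+"
    r"(?P<ip_src_addr>\S+)\s+"
    r"(?P<action>[A-Z_]+)/(?P<code>\d+)\s+"
    r"(?P<bytes>\d+)\s+"
    r"(?P<method>\S+)\s+"
    r"(?P<url>\S+)\s+"
    r"(?P<ident>\S+)\s+"
    r"(?P<hierarchy>[A-Z_]+)/(?P<ip_dst_addr>\S+)\s+"
    r"(?P<content_type>\S+)"
)


def parse_squid(line):
    """Turn one raw access.log line into a flat dict, or None if it does not match."""
    match = SQUID_PATTERN.match(line.strip())
    if match is None:
        return None
    event = match.groupdict()
    # Epoch seconds with a fractional part become integer epoch milliseconds.
    event["timestamp"] = round(float(event["timestamp"]) * 1000)
    event["domain"] = urlsplit(event["url"]).hostname
    return event


# Hypothetical log line in Squid's native format.
sample = (
    "1528766038.123    256 10.0.0.15 TCP_MISS/200 1024 GET "
    "http://www.cnn.com/ - DIRECT/151.101.1.67 text/html"
)
parsed = parse_squid(sample)
```

The result is a flat dictionary that serializes directly to the common JSON format used by the rest of the pipeline.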
Step 3 – Configuring Streaming Enrichment
Enriching events with GEO data
• Leverage the out-of-the-box GEO enrichment.
• Custom enrichment sources are also supported; they are stored in HBase and are configuration driven.
Enriching against a DGA model
• Mock DGA Python model to detect malicious domains
• Deploy to Metron Model as a Service (MaaS):
$METRON_HOME/bin/maas_deploy.sh -zq node1:2181 -lmp $HOME/mock_dga -hmp /user/$USER/models -mo ADD -m 512 -n dga -v 1.0 -ni 1
• Call the model in stream within parser or enrichment configs
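For a feel of what a mock DGA model might compute, here is a toy scoring function in Python. It is purely illustrative and is not the mock_dga model deployed above: the character-entropy heuristic, the threshold of 3.0, the response shape, and the sample domains are all assumptions, and a real model would be served behind a MaaS endpoint rather than called in-process.

```python
import math
from collections import Counter


def shannon_entropy(text):
    """Bits of entropy per character; random-looking strings score higher."""
    counts = Counter(text)
    total = len(text)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def score_domain(domain, threshold=3.0):
    """Classify a domain by the entropy of its second-level label (toy heuristic)."""
    parts = domain.lower().split(".")
    label = parts[-2] if len(parts) >= 2 else parts[0]
    entropy = shannon_entropy(label) if label else 0.0
    verdict = "malicious" if entropy >= threshold else "legit"
    return {"is_malicious": verdict, "entropy": entropy}


benign = score_domain("www.cnn.com")
suspicious = score_domain("xkqzjvwpbt.com")
```

A heuristic this simple will misclassify plenty of real domains; it only shows the contract a model needs to satisfy, which is to take a field from the event and return a verdict that enrichment can attach to it.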
Step 4 – Configuring Alerts
Defining severity ratings based on threat triage rules
• Raise an alert if our DGA model finds a detection
• Set our score rating to 100 on hit
• Multiple alert rules are supported; an aggregator is defined for when multiple conditions are met
• Enrichment from GeoIP lookup
• Enrichment from DGA Python model (from Model as a Service)
Step 5 – Configuring Profiler
Finding geographic anomalies in user login behavior - an authentication log example
Profile 1: Track locations by user
• Geohashes of the locations the user has logged in from
• Multiset of geohashes per user (mapping occurrence counts)
{
  "profile": "locations_by_user",
  "foreach": "user",
  "onlyif": "hash != null && LENGTH(hash) > 0",
  "init": {
    "s": "MULTISET_INIT()"
  },
  "update": {
    "s": "MULTISET_ADD(s, hash)"
  },
  "result": "s"
}
Profile 2: Track geo distribution from centroid
• Statistical distribution of the distance between the login location and the geographic centroid of the user’s previous logins within the last 5 minutes
{
  "profile": "geo_distribution_from_centroid",
  "foreach": "'global'",
  "onlyif": "geo_distance != null",
  "init": {
    "s": "STATS_INIT()"
  },
  "update": {
    "s": "STATS_ADD(s, geo_distance)"
  },
  "result": "s"
}
These profiles will help us track whether a user is logging in from vastly differing geographic locations in a short period of time
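The semantics of the first profile (foreach, onlyif, init, update, result) can be mimicked in a few lines of Python. This is a simplified sketch, not the Profiler itself: it processes one profile period in memory, uses `collections.Counter` as the multiset, and the sample messages are made up.

```python
from collections import Counter, defaultdict


def run_locations_by_user(messages):
    """Emulate the 'locations_by_user' profile over one profile period."""
    profiles = defaultdict(Counter)              # init: one empty multiset per entity
    for msg in messages:
        geohash = msg.get("hash")
        if geohash is None or len(geohash) == 0:  # onlyif: skip messages without a hash
            continue
        profiles[msg["user"]][geohash] += 1       # foreach user, update: add to the multiset
    return dict(profiles)                         # result: the multiset for each user


# Hypothetical authentication events.
events = [
    {"user": "alice", "hash": "u4pruyd"},
    {"user": "alice", "hash": "u4pruyd"},
    {"user": "alice", "hash": "9q8yyk8"},
    {"user": "bob", "hash": ""},
    {"user": "bob", "hash": "dr5regw"},
]
by_user = run_locations_by_user(events)
```

Here `by_user["alice"]` records two logins from one geohash and one from another, which is the kind of per-user location history the later outlier check builds on.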
Step 5 – Configuring Profiler
Compute the threat given global context and per-user context
{
  "threatIntel": {
    "fieldMap": {
      "stellar" : {
        "config" : [
          "geo_distance_distr := STATS_MERGE( PROFILE_GET('geo_distribution_from_centroid', 'global', PROFILE_FIXED( 4, 'HOURS')))",
          "dist_median := STATS_PERCENTILE(geo_distance_distr, 50.0)",
          "dist_sd := STATS_SD(geo_distance_distr)",
          "geo_outlier := ABS(dist_median - geo_distance) >= 5*dist_sd",
          "is_alert := exists(is_alert) && is_alert",
          "is_alert := is_alert || (geo_outlier != null && geo_outlier == true)",
          "geo_distance_distr := null"
        ]
      }
    }
• Get the statistical distribution of the ‘geo_distance’ field for all users
• Decide if the geo_distance is an outlier by testing how many standard deviations it is from the median
• Update ‘is_alert’ accordingly. If it is true, then we need to triage the alert level
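The outlier test in the Stellar expressions above boils down to a few lines of arithmetic, sketched here in Python. The function name and sample distances are hypothetical, and using the sample standard deviation is an assumption about how `STATS_SD` behaves.

```python
import statistics


def is_geo_outlier(geo_distance, observed_distances, num_sd=5):
    """True when geo_distance is at least num_sd standard deviations from the median."""
    dist_median = statistics.median(observed_distances)
    dist_sd = statistics.stdev(observed_distances)  # sample standard deviation (assumed)
    return abs(dist_median - geo_distance) >= num_sd * dist_sd


# Hypothetical distances (from each user's login centroid) seen across all users.
global_distances = [10, 12, 11, 9, 10, 13, 11, 10]

typical_login = is_geo_outlier(12, global_distances)
far_away_login = is_geo_outlier(5000, global_distances)
```

Anchoring the comparison on the median rather than the mean keeps a handful of extreme logins from shifting the center of the baseline, although the standard deviation itself is still sensitive to them.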
Step 5 – Configuring Profiler
Triage the threat
"triageConfig" : {
  "riskLevelRules" : [
    {
      "name" : "Geographic Outlier",
      "comment" : "Determine if the user's geographic distance from the centroid of the historic logins is an outlier as compared to all users.",
      "rule" : "geo_outlier != null && geo_outlier",
      "score" : 10,
      "reason" : "FORMAT('user %s has a distance (%d) from the centroid of their last login is 5 std deviations (%f) from the median (%f)', user, geo_distance, dist_sd, dist_median)"
    }
  ],
  "aggregator" : "MAX"
}
• Because this is only a circumstantial indicator, we’ll only give this a threat score of 10
• In a normal system, there would be many rules triaging the threat. In this case the max score would be taken
• We need to ensure the SOC analyst has enough context to make a decision here
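How several triage rules combine under an aggregator can be sketched in Python as follows. This is an illustration of the idea, not Metron's triage engine: rules are modeled as plain predicates rather than Stellar expressions, and the event field names are assumptions. The scores reuse the values from the slides (100 for a DGA hit, 10 for a geographic outlier).

```python
def triage(event, rules, aggregator=max):
    """Score an event: collect the scores of all matching rules, then aggregate them."""
    scores = [rule["score"] for rule in rules if rule["predicate"](event)]
    return aggregator(scores) if scores else 0


RULES = [
    {
        "name": "Geographic Outlier",
        "predicate": lambda e: e.get("geo_outlier") is True,
        "score": 10,
    },
    {
        "name": "DGA Detection",
        "predicate": lambda e: e.get("is_malicious") == "malicious",
        "score": 100,
    },
]

both_rules = triage({"geo_outlier": True, "is_malicious": "malicious"}, RULES)
geo_only = triage({"geo_outlier": True, "is_malicious": "legit"}, RULES)
no_match = triage({"geo_outlier": False}, RULES)
summed = triage({"geo_outlier": True, "is_malicious": "malicious"}, RULES, aggregator=sum)
```

With a MAX aggregator, a circumstantial indicator such as the geographic outlier never dilutes a stronger signal; swapping in a sum would instead let weak indicators accumulate.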
Key Takeaways
• Cybersecurity is a big data problem; we need a community-driven approach to solve it
• Modern cybersecurity challenges require a modern data architecture to facilitate real-time response
• Apache Metron provides an extensible, repeatable, and configuration-driven framework for real-time cybersecurity at scale
Thank you.

More Related Content

What's hot (20)

PDF
Meet HBase 2.0 and Phoenix-5.0
DataWorks Summit
 
PDF
Deep learning 101
DataWorks Summit
 
PDF
What s new in spark 2.3 and spark 2.4
DataWorks Summit
 
PDF
Apache Hadoop YARN: State of the Union
DataWorks Summit
 
PDF
Keynote
DataWorks Summit
 
PDF
The First Mile – Edge and IoT Data Collection with Apache NiFi and MiNiFi
DataWorks Summit
 
PDF
Hadoop Operations - Past, Present, and Future
DataWorks Summit
 
PDF
Dataflow Management From Edge to Core with Apache NiFi
DataWorks Summit
 
PDF
Curing the Kafka Blindness – Streams Messaging Manager
DataWorks Summit
 
PDF
Apache Hadoop YARN: state of the union - Tokyo
DataWorks Summit
 
PDF
Containers and Big Data
DataWorks Summit
 
PDF
What’s new in Apache Spark 2.3 and Spark 2.4
DataWorks Summit
 
PDF
What is new in Apache Hive 3.0?
DataWorks Summit
 
PPTX
How to Ingest 16 Billion Records Per Day into your Hadoop Environment
DataWorks Summit
 
PDF
Scalable OCR with NiFi and Tesseract
DataWorks Summit/Hadoop Summit
 
PPTX
Navigating Idiosyncrasies of IoT Development
DataWorks Summit
 
PDF
HDF: Hortonworks DataFlow: Technical Workshop
Hortonworks
 
PDF
Data in the Cloud Crash Course
DataWorks Summit
 
PDF
Containers and Big Data
DataWorks Summit
 
PPTX
Difference between apache spark and apache nifi
GaneshJoshi47
 
Meet HBase 2.0 and Phoenix-5.0
DataWorks Summit
 
Deep learning 101
DataWorks Summit
 
What s new in spark 2.3 and spark 2.4
DataWorks Summit
 
Apache Hadoop YARN: State of the Union
DataWorks Summit
 
The First Mile – Edge and IoT Data Collection with Apache NiFi and MiNiFi
DataWorks Summit
 
Hadoop Operations - Past, Present, and Future
DataWorks Summit
 
Dataflow Management From Edge to Core with Apache NiFi
DataWorks Summit
 
Curing the Kafka Blindness – Streams Messaging Manager
DataWorks Summit
 
Apache Hadoop YARN: state of the union - Tokyo
DataWorks Summit
 
Containers and Big Data
DataWorks Summit
 
What’s new in Apache Spark 2.3 and Spark 2.4
DataWorks Summit
 
What is new in Apache Hive 3.0?
DataWorks Summit
 
How to Ingest 16 Billion Records Per Day into your Hadoop Environment
DataWorks Summit
 
Scalable OCR with NiFi and Tesseract
DataWorks Summit/Hadoop Summit
 
Navigating Idiosyncrasies of IoT Development
DataWorks Summit
 
HDF: Hortonworks DataFlow: Technical Workshop
Hortonworks
 
Data in the Cloud Crash Course
DataWorks Summit
 
Containers and Big Data
DataWorks Summit
 
Difference between apache spark and apache nifi
GaneshJoshi47
 

Similar to Solving Cybersecurity at Scale (20)

PPTX
A streaming architecture for Cyber Security - Apache Metron
Simon Elliston Ball
 
PDF
Apache Metron in the Real World
Dave Russell
 
PDF
Hortonworks sqrrl webinar v5.pptx
Hortonworks
 
PDF
Data in Motion - Data at Rest - Hortonworks a Modern Architecture
Mats Johansson
 
PPTX
Just the sketch: advanced streaming analytics in Apache Metron
DataWorks Summit
 
PPTX
Hortonworks Open Connected Data Platforms for IoT and Predictive Big Data Ana...
DataWorks Summit
 
PDF
Storm Demo Talk - Denver Apr 2015
Mac Moore
 
PDF
Big Traffic, Big Trouble: Big Data - Tokyo
DataWorks Summit
 
PDF
Big Traffic, Big Trouble: Big Data Security Analytics
DataWorks Summit
 
PPTX
Enterprise data science at scale
Carolyn Duby
 
PPTX
Hortonworks Data In Motion Webinar Series Pt. 2
Hortonworks
 
PDF
Apache Metron in the Real World
DataWorks Summit
 
PDF
Joseph Witt
AFCEA International
 
PPTX
Unlocking insights in streaming data
Carolyn Duby
 
PPTX
Make Streaming Analytics work for you: The Devil is in the Details
DataWorks Summit/Hadoop Summit
 
PDF
Hadoop Summit Tokyo HDP Sandbox Workshop
DataWorks Summit/Hadoop Summit
 
PDF
Hortonworks and Red Hat Webinar_Sept.3rd_Part 1
Hortonworks
 
PDF
Enterprise IIoT Edge Processing with Apache NiFi
Timothy Spann
 
KEY
Agile analytics applications on hadoop
Hortonworks
 
KEY
Hortonworks: Agile Analytics Applications
russell_jurney
 
A streaming architecture for Cyber Security - Apache Metron
Simon Elliston Ball
 
Apache Metron in the Real World
Dave Russell
 
Hortonworks sqrrl webinar v5.pptx
Hortonworks
 
Data in Motion - Data at Rest - Hortonworks a Modern Architecture
Mats Johansson
 
Just the sketch: advanced streaming analytics in Apache Metron
DataWorks Summit
 
Hortonworks Open Connected Data Platforms for IoT and Predictive Big Data Ana...
DataWorks Summit
 
Storm Demo Talk - Denver Apr 2015
Mac Moore
 
Big Traffic, Big Trouble: Big Data - Tokyo
DataWorks Summit
 
Big Traffic, Big Trouble: Big Data Security Analytics
DataWorks Summit
 
Enterprise data science at scale
Carolyn Duby
 
Hortonworks Data In Motion Webinar Series Pt. 2
Hortonworks
 
Apache Metron in the Real World
DataWorks Summit
 
Joseph Witt
AFCEA International
 
Unlocking insights in streaming data
Carolyn Duby
 
Make Streaming Analytics work for you: The Devil is in the Details
DataWorks Summit/Hadoop Summit
 
Hadoop Summit Tokyo HDP Sandbox Workshop
DataWorks Summit/Hadoop Summit
 
Hortonworks and Red Hat Webinar_Sept.3rd_Part 1
Hortonworks
 
Enterprise IIoT Edge Processing with Apache NiFi
Timothy Spann
 
Agile analytics applications on hadoop
Hortonworks
 
Hortonworks: Agile Analytics Applications
russell_jurney
 
Ad

More from DataWorks Summit (20)

PPTX
Data Science Crash Course
DataWorks Summit
 
PPTX
Floating on a RAFT: HBase Durability with Apache Ratis
DataWorks Summit
 
PPTX
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
DataWorks Summit
 
PDF
HBase Tales From the Trenches - Short stories about most common HBase operati...
DataWorks Summit
 
PPTX
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
DataWorks Summit
 
PPTX
Managing the Dewey Decimal System
DataWorks Summit
 
PPTX
Practical NoSQL: Accumulo's dirlist Example
DataWorks Summit
 
PPTX
HBase Global Indexing to support large-scale data ingestion at Uber
DataWorks Summit
 
PPTX
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
DataWorks Summit
 
PPTX
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
DataWorks Summit
 
PPTX
Supporting Apache HBase : Troubleshooting and Supportability Improvements
DataWorks Summit
 
PPTX
Security Framework for Multitenant Architecture
DataWorks Summit
 
PDF
Presto: Optimizing Performance of SQL-on-Anything Engine
DataWorks Summit
 
PPTX
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
DataWorks Summit
 
PPTX
Extending Twitter's Data Platform to Google Cloud
DataWorks Summit
 
PPTX
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
DataWorks Summit
 
PPTX
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
DataWorks Summit
 
PPTX
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
DataWorks Summit
 
PDF
Computer Vision: Coming to a Store Near You
DataWorks Summit
 
PPTX
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
DataWorks Summit
 
Data Science Crash Course
DataWorks Summit
 
Floating on a RAFT: HBase Durability with Apache Ratis
DataWorks Summit
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
DataWorks Summit
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
DataWorks Summit
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
DataWorks Summit
 
Managing the Dewey Decimal System
DataWorks Summit
 
Practical NoSQL: Accumulo's dirlist Example
DataWorks Summit
 
HBase Global Indexing to support large-scale data ingestion at Uber
DataWorks Summit
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
DataWorks Summit
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
DataWorks Summit
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
DataWorks Summit
 
Security Framework for Multitenant Architecture
DataWorks Summit
 
Presto: Optimizing Performance of SQL-on-Anything Engine
DataWorks Summit
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
DataWorks Summit
 
Extending Twitter's Data Platform to Google Cloud
DataWorks Summit
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
DataWorks Summit
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
DataWorks Summit
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
DataWorks Summit
 
Computer Vision: Coming to a Store Near You
DataWorks Summit
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
DataWorks Summit
 
Ad

Recently uploaded (20)

PDF
Smart Trailers 2025 Update with History and Overview
Paul Menig
 
PDF
Reverse Engineering of Security Products: Developing an Advanced Microsoft De...
nwbxhhcyjv
 
PPTX
UiPath Academic Alliance Educator Panels: Session 2 - Business Analyst Content
DianaGray10
 
PDF
Log-Based Anomaly Detection: Enhancing System Reliability with Machine Learning
Mohammed BEKKOUCHE
 
PDF
Transcript: New from BookNet Canada for 2025: BNC BiblioShare - Tech Forum 2025
BookNet Canada
 
PDF
Achieving Consistent and Reliable AI Code Generation - Medusa AI
medusaaico
 
PDF
Using FME to Develop Self-Service CAD Applications for a Major UK Police Force
Safe Software
 
PPTX
AUTOMATION AND ROBOTICS IN PHARMA INDUSTRY.pptx
sameeraaabegumm
 
PPTX
✨Unleashing Collaboration: Salesforce Channels & Community Power in Patna!✨
SanjeetMishra29
 
PDF
Why Orbit Edge Tech is a Top Next JS Development Company in 2025
mahendraalaska08
 
PDF
Predicting the unpredictable: re-engineering recommendation algorithms for fr...
Speck&Tech
 
PDF
Learn Computer Forensics, Second Edition
AnuraShantha7
 
PDF
NewMind AI - Journal 100 Insights After The 100th Issue
NewMind AI
 
PDF
LLMs.txt: Easily Control How AI Crawls Your Site
Keploy
 
PDF
Windsurf Meetup Ottawa 2025-07-12 - Planning Mode at Reliza.pdf
Pavel Shukhman
 
PDF
Exolore The Essential AI Tools in 2025.pdf
Srinivasan M
 
PDF
Empower Inclusion Through Accessible Java Applications
Ana-Maria Mihalceanu
 
PDF
SFWelly Summer 25 Release Highlights July 2025
Anna Loughnan Colquhoun
 
PPTX
Q2 FY26 Tableau User Group Leader Quarterly Call
lward7
 
PDF
"AI Transformation: Directions and Challenges", Pavlo Shaternik
Fwdays
 
Smart Trailers 2025 Update with History and Overview
Paul Menig
 
Reverse Engineering of Security Products: Developing an Advanced Microsoft De...
nwbxhhcyjv
 
UiPath Academic Alliance Educator Panels: Session 2 - Business Analyst Content
DianaGray10
 
Log-Based Anomaly Detection: Enhancing System Reliability with Machine Learning
Mohammed BEKKOUCHE
 
Transcript: New from BookNet Canada for 2025: BNC BiblioShare - Tech Forum 2025
BookNet Canada
 
Achieving Consistent and Reliable AI Code Generation - Medusa AI
medusaaico
 
Using FME to Develop Self-Service CAD Applications for a Major UK Police Force
Safe Software
 
AUTOMATION AND ROBOTICS IN PHARMA INDUSTRY.pptx
sameeraaabegumm
 
✨Unleashing Collaboration: Salesforce Channels & Community Power in Patna!✨
SanjeetMishra29
 
Why Orbit Edge Tech is a Top Next JS Development Company in 2025
mahendraalaska08
 
Predicting the unpredictable: re-engineering recommendation algorithms for fr...
Speck&Tech
 
Learn Computer Forensics, Second Edition
AnuraShantha7
 
NewMind AI - Journal 100 Insights After The 100th Issue
NewMind AI
 
LLMs.txt: Easily Control How AI Crawls Your Site
Keploy
 
Windsurf Meetup Ottawa 2025-07-12 - Planning Mode at Reliza.pdf
Pavel Shukhman
 
Exolore The Essential AI Tools in 2025.pdf
Srinivasan M
 
Empower Inclusion Through Accessible Java Applications
Ana-Maria Mihalceanu
 
SFWelly Summer 25 Release Highlights July 2025
Anna Loughnan Colquhoun
 
Q2 FY26 Tableau User Group Leader Quarterly Call
lward7
 
"AI Transformation: Directions and Challenges", Pavlo Shaternik
Fwdays
 

Solving Cybersecurity at Scale

  • 1. 1 © Hortonworks Inc. 2011–2018. All rights reserved. Solving Cybersecurity at Scale Laurence Da Luz & Mo Kamel
  • 2. 2 © Hortonworks Inc. 2011–2018. All rights reserved. What Are We Talking About? Cybersecurity Challenges Solving Cybersecurity At Scale Anatomy of Apache Metron Use Case Walkthrough
  • 3. 3 © Hortonworks Inc. 2011–2018. All rights reserved. 3 CYBERSECURITY IS A BIG DATA PROBLEM
  • 4. 4 © Hortonworks Inc. 2011–2018. All rights reserved. Big Traffic, Big Trouble Complexity Problem • Too many point solutions • Too many dashboards • Too hard to correlate data across silos • Cybersecurity staff overwhelmed with too many alerts
  • 5. 5 © Hortonworks Inc. 2011–2018. All rights reserved. Big Traffic, Big Trouble + Capability Problem • Huge volumes & Limited Storage • Inconsistent data from multiple sources • Real-time context is crucial • Missing Adv. Analytics • Alert Fatigue – “Many False Positives”
  • 6. 6 © Hortonworks Inc. 2011–2018. All rights reserved. Big Traffic, Big Trouble + People Problem • Skill Shortage around the globe • Staff inefficiency & high cost • Low value work of data gathering and cleansing • Impractical solution scaling people
  • 7. 7 © Hortonworks Inc. 2011–2018. All rights reserved. Big Traffic, Big Trouble + Security Problem • Distracting and Adv. attacks • Lake of security context • Asset Classifications • Prioritization and Scoring • Full access to historical data
  • 8. 8 © Hortonworks Inc. 2011–2018. All rights reserved.
  • 9. 9 © Hortonworks Inc. 2011–2018. All rights reserved. A Community Solution Open Source Solution • Volume • Variety • Value • Automation • Realtime • Threat Intel
  • 10. 10 © Hortonworks Inc. 2011–2018. All rights reserved. Advanced Use Cases Open Source Solution • Users Behavior • Entities Behavior • Advanced Analytics
  • 11. 11 © Hortonworks Inc. 2011–2018. All rights reserved.
  • 12. 12 © Hortonworks Inc. 2011–2018. All rights reserved. Solving Cybersecurity at Scale An architecture for real-time cybersecurity analytics REAL-TIME PROCESSING CYBER SECURITY ENGINE Cyber Security Stream Processing Pipeline Telemetry Data Sources Telemetry Data Collectors Telemetry Parsers Enrichment Threat Intel Profiler Alert Triage Indexers and Writers SecurityEndPoint Devices (Fireye,PaloAlto, BlueCoat,etc.) Machine GeneratedLogs (AD,App/Web Server,firewall, VPN,etc.) IDS (Suricata,Snort, etc.) NetworkData PCAP,Netflow,Bro, etc.) ThreatIntelligence Feeds (Soltra,OpenTaxi third-partyfeeds) Performance NetworkIngest Probes Real-Time Enrich/Threat IntelStreams /Other… DataVault Real-TimeSearch EvidentiaryStore ThreatIntelligence Platform ModelasaService CommunityModels DataScience Workbench PCAPForensics Modules Data Services & Integration Layer Telemetry Ingest Buffer HORTONWORKS DATA PLATFORMHORTONWORKS DATA FLOW
  • 13. Solving Cybersecurity at Scale: An architecture for real-time cybersecurity analytics. [Same architecture diagram as slide 12, with callouts] • Collect security device and machine-generated logs • Extendable data model • Enrichment on ingest for extra context • Behavior profiling and advanced windowing • Flexible deployment of data science • Alerting and triage (exposed to SOC) • Hortonworks Cybersecurity Platform runs as an application on top of HDF and HDP
  • 17. Context is everything: Enrichments • User, group data, internal business sources • Geospatial data, worldwide shared threat intelligence • Model predictions, via the Model as a Service framework • Time
  • 18. Time is context | Time matters: The Profiler • A generalized solution for extracting model features and aggregations over time from high-throughput streaming data • Generates a profile describing the behavior of an entity: a host, user, subnet or application • A foundational component for both security model building and alerting in HCP • Profile behavior across windows in time (t = 1, t = 2, t = 3, …, t = n), and across multiple devices
  • 19. Time is context | Time matters: The Profiler. How do we perform behavioral profiling at real-time scale? Data sketches provide fast, approximate answers to queries about the underlying data. There are many types of data sketches, but general characteristics include: • Stream friendly: each item is examined only once and quickly updates a small sketch data structure • Scalable: effective for queries that otherwise do not scale well, such as count distinct, quantiles and most frequent items • Approximate: faster results within mathematically proven error bounds • Fixed-size compute and predictable space usage. [Diagram: one sketch per period (0<t<1, 1<t<2, 2<t<3, …, n-1<t<n); a combined sketch for 0<t<3; a combined sketch for 0<t<1 + 2<t<3 + …] Data sketches are combinable, which lets us slice and dice the windows and recombine them at read time. We can pick and mix sketches (skip certain days, hours, etc.).
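The combinable property described on this slide can be sketched with a small HyperLogLog in Python. This is an illustrative toy, not Metron's implementation: the class, the SHA-1 based hashing, the register count and the `host-N` names are all assumptions made for the example. It builds one sketch per time window and then combines windows 1 and 3 at read time, skipping window 2.

```python
import hashlib
import math


class HyperLogLog:
    """Minimal HyperLogLog cardinality sketch with 2**p registers (toy example)."""

    def __init__(self, p=12):
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m

    def add(self, item):
        # Derive a 64-bit hash; the first p bits pick a register.
        digest = hashlib.sha1(item.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        index = h >> (64 - self.p)
        rest = h & ((1 << (64 - self.p)) - 1)
        # Rank = position of the leftmost 1-bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        if rank > self.registers[index]:
            self.registers[index] = rank

    def merge(self, other):
        # Combining two sketches is a register-wise maximum.
        merged = HyperLogLog(self.p)
        merged.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]
        return merged

    def estimate(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)
        raw = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small-range correction (linear counting).
            return self.m * math.log(self.m / zeros)
        return raw


def sketch_of(items):
    sketch = HyperLogLog()
    for item in items:
        sketch.add(item)
    return sketch


# One sketch per time window (hypothetical hosts contacted in each window).
window_1 = sketch_of(f"host-{i}" for i in range(0, 1000))
window_2 = sketch_of(f"host-{i}" for i in range(500, 1500))
window_3 = sketch_of(f"host-{i}" for i in range(1000, 2000))

# Pick and mix at read time: combine windows 1 and 3, skip window 2.
combined = window_1.merge(window_3)
print(round(combined.estimate()))  # approximately 2000 distinct hosts
```

Because merging is a register-wise maximum, the combined sketch is identical to one built directly from the union of the two windows, which is what makes recombining windows at read time cheap.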
  • 20. Streaming Analytics at Scale: Algorithms out of the box. Approximation algorithms are specialized algorithms that can produce results orders of magnitude faster within mathematically proven error bounds, which makes them ideal for real-time analytics. Profiles: • HyperLogLog (cardinality): how many servers does this user usually talk to? • Bloom filters: have we seen this domain before? • T-Digest (distribution): personalized baselining and statistics • Counters and descriptive statistics: quick results and triggers for more intensive calculations • Mixed-period windows: accounting for holidays, typical working periods and seasons. Natural Language Processing (finding likely non-human behavior with machine learning): • Typosquat (misspellings, homoglyphs) • DGA (Domain Generation Algorithm). Streaming similarity and anomaly detection: • Mean Absolute Deviation • TLSH (Locality Sensitive Hashing): finding events similar to known bad • GeoHash similarity • Robust PCA
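As a rough illustration of the "have we seen this domain before?" question, the following is a minimal Bloom filter in Python. It is a sketch under stated assumptions, not the filter Metron ships: the class name, the bit-array size, the number of hash functions and the SHA-256 based hashing are all choices made for this example.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter: no false negatives, a small chance of false positives."""

    def __init__(self, num_bits=8192, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, item):
        # Derive num_hashes bit positions by salting the item with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def might_contain(self, item):
        # True means "possibly seen"; False means "definitely not seen".
        return all((self.bits >> pos) & 1 for pos in self._positions(item))


seen_domains = BloomFilter()
for domain in ["cnn.com", "apache.org", "hortonworks.com"]:
    seen_domains.add(domain)

print(seen_domains.might_contain("cnn.com"))
print(seen_domains.might_contain("never-seen-before.example"))
```

The filter uses a fixed amount of memory regardless of how many domains are added, at the cost of occasionally answering "possibly seen" for a domain that was never added.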
  • 21. UNDER THE HOOD: ANATOMY OF THE DATA ENGINEERING PIPELINE
  • 22. Architecture & Capabilities
  • 23. Architecture & Capabilities • Pipelines are created and deployed via the Metron framework; no custom Storm code is required • An extendable Domain Specific Language (DSL) is used across Metron for querying, transformation, and configuring rules • Core pipeline components: NiFi, Kafka, Storm, Spark, Solr • Access and visualization: Metron UI and Zoomdata (partner) • Generated alerts can be integrated with external systems
  • 24. Architecture & Capabilities: Acquire • Devices generate raw log messages in data formats from a variety of disparate systems and sources • NiFi (and MiNiFi) acquire raw data and handle routing
  • 25. Architecture & Capabilities: Acquire → Normalize • Converting all data from raw source logs into a common JSON format simplifies downstream analytics across devices • Out-of-the-box device parsers: ASA, Bro, FireEye, Palo Alto, … • General-purpose format parsers: Grok, Regex, CSV, JSON • Custom Java-based parsers
  • 26. Architecture & Capabilities: Acquire → Normalize → Enrich • Geo enrichment, HBase lookups for custom enrichments, and MaaS add additional information to the raw source during streaming • Assess against threat feeds, and alert based on severity
  • 27. Architecture & Capabilities: Acquire → Normalize → Enrich → Profile • The Profiler is a separate pipeline that listens on all streaming events • The pipeline is specialized to understand a series of actions in time across multiple devices • The Profiler generates feature sets that are stored within HBase • Windowed features can be looped back for triage and alerting • Batch profiling is also supported and can "seed" a feature set from historical data
  • 28. Architecture & Capabilities: Acquire → Normalize → Enrich → Profile → Security Data Lake • Data is indexed in Solr near-term for random access and hot tiering • Data is stored in HDFS long-term for historical access and analytics • SOC analysts perform security monitoring and threat hunting • Security data scientists train against historical trends to improve alerting models
  • 29. Architecture & Capabilities: Acquire → Normalize → Enrich → Profile → Security Data Lake • The end-to-end streaming data pipeline enables real-time action against cyber threats in a repeatable pattern
  • 30. USE CASE WALK THROUGH
  • 31. Deploying a Use Case in Apache Metron: Squid Logs - Use Case Walkthrough. What is Squid? • Squid is a caching proxy for the Web supporting HTTP, HTTPS, FTP, and more. It reduces bandwidth and improves response times by caching and reusing frequently requested web pages. What does a Squid access log look like? • When you make an outbound HTTP connection to http://www.cnn.com, an entry is added to a file called access.log. [Log entry shown on the slide, with callouts:] Unix epoch time • IP of host where connection was made • Domain name of the outbound connection
  • 32. Deploying a Use Case in Apache Metron: Squid Logs - Use Case Walkthrough (continues slide 31). What Metron will do to the Squid telemetry in real time: • Convert from Unix epoch to timestamp • Asset enrichment to enrich the IP (hostname, type of device) • WHOIS enrichment to look up domain name information • Threat intel to cross-reference the address with an intel feed to see if there is a hit • Index the event into Solr and persist it in HDFS (Security Data Lake)
  • 33. Deploying a Use Case in Apache Metron
  • 34. Deploying a Use Case in Apache Metron • Step 1: NiFi TailFile • Step 2: Define Parser • Step 3: Enrichment Config • Step 4: Configure Alerts • Step 5: Configure Profiler
  • 35. Step 1 – Telemetry Ingest: Streaming from NiFi to Kafka. Data is tailed from the Squid access-log files.
  • 36. Step 2 – Configuring the Squid Parser: Defining a Grok filter for the Squid data • Grok parser → config driven • Regex-based abstraction • Grok is suitable for structured or semi-structured logs • Contains pre-defined mappings (for example, pre-defined Grok mappings for IP)
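To make the parsing step concrete, here is a hedged Python sketch that plays the role of a Grok pattern by using a regular expression with named groups. It is not the Metron Grok statement from the slide (which is not captured in this transcript). The sample line follows Squid's native access.log layout and is illustrative only, and the output field names (`ip_src_addr`, `ip_dst_addr`, `domain`, and so on) are assumptions chosen for the example.

```python
import re
from datetime import datetime, timezone
from urllib.parse import urlsplit

# Named groups stand in for Grok's pre-defined mappings (numbers, IPs, words).
SQUID_LINE = re.compile(
    r"(?P<timestamp>\d+\.\d+)\s+(?P<elapsed>\d+)\s+(?P<ip_src_addr>\S+)\s+"
    r"(?P<action>[^/\s]+)/(?P<code>\d+)\s+(?P<bytes>\d+)\s+(?P<method>\S+)\s+"
    r"(?P<url>\S+)\s+(?P<ident>\S+)\s+"
    r"(?P<hierarchy>[^/\s]+)/(?P<ip_dst_addr>\S+)\s+(?P<content_type>\S+)"
)

# Illustrative access.log entry (assumed, not copied from the slide).
SAMPLE_LINE = (
    "1467011157.401 415 127.0.0.1 TCP_MISS/200 337891 GET "
    "http://www.cnn.com/ - DIRECT/199.27.79.73 text/html"
)


def parse_squid_line(line):
    """Return a flat dict for one access.log line, or None if it does not match."""
    match = SQUID_LINE.match(line.strip())
    if match is None:
        return None
    event = match.groupdict()
    epoch_seconds = float(event["timestamp"])
    # Convert Unix epoch seconds to epoch milliseconds and to a UTC timestamp.
    event["timestamp"] = int(round(epoch_seconds * 1000))
    event["timestamp_utc"] = datetime.fromtimestamp(
        epoch_seconds, tz=timezone.utc
    ).isoformat()
    event["elapsed"] = int(event["elapsed"])
    event["code"] = int(event["code"])
    event["bytes"] = int(event["bytes"])
    # Domain name of the outbound connection, used by later enrichments.
    event["domain"] = urlsplit(event["url"]).hostname
    return event


parsed = parse_squid_line(SAMPLE_LINE)
print(parsed["ip_src_addr"], parsed["domain"], parsed["timestamp_utc"])
```

The result is the kind of flat, JSON-friendly record that the enrichment and triage steps can then work on.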
  • 37. Step 3 – Configuring Streaming Enrichment. Enriching events with GEO data: • Leverage the out-of-the-box GEO enrichment • Custom enrichment sources are also supported; they are stored in HBase and configuration driven. Enriching against a DGA model: • Mock DGA Python model to detect malicious domains • Deploy to Metron Model as a Service (MaaS): $METRON_HOME/bin/maas_deploy.sh -zq node1:2181 -lmp $HOME/mock_dga -hmp /user/$USER/models -mo ADD -m 512 -n dga -v 1.0 -ni 1 • Call the model in stream within parser or enrichment configs
  • 38. Step 4 – Configuring Alerts: Defining severity ratings based on threat triage rules • Raise an alert if our DGA model finds a detection • Set our score rating to 100 on a hit • Multiple alert rules are supported • An aggregator is defined for when multiple conditions are met
  • 39. [Screenshot callouts] • Enrichment from GEOIP lookup • Enrichment from DGA Python model (from Model as a Service)
  • 40. Step 5 – Configuring Profiler: Finding geographic anomalies in user login behavior - an authentication log example. Profile 1: Track locations by user • geohashes of the locations the user has logged in from • multiset of geohashes per user (mapping occurrence counts) { "profile": "locations_by_user", "foreach": "user", "onlyif": "hash != null && LENGTH(hash) > 0", "init": { "s": "MULTISET_INIT()" }, "update": { "s": "MULTISET_ADD(s, hash)" }, "result": "s" } Profile 2: Track geo distribution from centroid • Statistical distribution of the distance between login location and the geographic centroid of the user's previous logins from within the last 5 minutes { "profile": "geo_distribution_from_centroid", "foreach": "'global'", "onlyif": "geo_distance != null", "init": { "s": "STATS_INIT()" }, "update": { "s": "STATS_ADD(s, geo_distance)" }, "result": "s" } These profiles help us track whether a user is logging in from vastly different geographic locations in a short period of time
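The first profile can be read as "keep one multiset of geohashes per user". A minimal Python sketch of that logic follows, assuming events arrive as dictionaries with `user` and `hash` fields. The function name and the sample geohashes are hypothetical, and the real Profiler also flushes one result per profile period, which this sketch leaves out.

```python
from collections import Counter, defaultdict


def build_locations_by_user(events):
    """Mimic the locations_by_user profile over a batch of events."""
    profiles = defaultdict(Counter)  # foreach: user / init: empty multiset
    for event in events:
        geohash = event.get("hash")
        if geohash:  # onlyif: hash != null && LENGTH(hash) > 0
            profiles[event["user"]][geohash] += 1  # update: MULTISET_ADD(s, hash)
    return profiles


# Hypothetical authentication events.
events = [
    {"user": "alice", "hash": "u4pruy"},
    {"user": "alice", "hash": "u4pruy"},
    {"user": "alice", "hash": "9q8yyk"},
    {"user": "bob", "hash": ""},
    {"user": "bob", "hash": None},
]

profiles = build_locations_by_user(events)
print(dict(profiles["alice"]))
```

Events with a missing or empty geohash are skipped, so a user who only produces such events never gets a profile entry.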
  • 41. Step 5 – Configuring Profiler: Compute the threat given global context and per-user context • Get the statistical distribution of the 'geo_distance' field for all users • Decide if the geo_distance is an outlier by testing how many standard deviations it is from the median • Update 'is_alert' accordingly. If this is true, then we need to triage the alert level { "threatIntel": { "fieldMap": { "stellar" : { "config" : [ "geo_distance_distr := STATS_MERGE( PROFILE_GET('geo_distribution_from_centroid', 'global', PROFILE_FIXED( 4, 'HOURS')))", "dist_median := STATS_PERCENTILE(geo_distance_distr, 50.0)", "dist_sd := STATS_SD(geo_distance_distr)", "geo_outlier := ABS(dist_median - geo_distance) >= 5*dist_sd", "is_alert := exists(is_alert) && is_alert", "is_alert := is_alert || (geo_outlier != null && geo_outlier == true)", "geo_distance_distr := null" ] } }
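The outlier rule in this configuration is plain arithmetic: flag a login when its distance from the centroid is at least five standard deviations away from the median distance. The Python sketch below mirrors that test under stated assumptions: the function name and sample distances are hypothetical, and it uses the population standard deviation, whereas the slide does not say which variant `STATS_SD` computes.

```python
import statistics


def is_geo_outlier(history, geo_distance, num_sd=5.0):
    """Mirror of: ABS(dist_median - geo_distance) >= 5 * dist_sd."""
    dist_median = statistics.median(history)
    # Assumption: population standard deviation of the historical distances.
    dist_sd = statistics.pstdev(history)
    return abs(dist_median - geo_distance) >= num_sd * dist_sd


# Hypothetical distances (in km) between recent logins and the users' centroids.
history = [10.0, 12.0, 11.0, 9.0, 10.0, 13.0, 11.0]

print(is_geo_outlier(history, 12.0))
print(is_geo_outlier(history, 500.0))
```

Note that when every historical distance is identical the standard deviation is zero, so the rule as written flags every event; a production rule would probably want a minimum sample size or a floor on the deviation.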
  • 42. Step 5 – Configuring Profiler: Triage the threat • Because this is only a circumstantial indicator, we only give it a threat score of 10 • In a normal system, there would be many rules triaging the threat; in this case the max score would be taken • We need to ensure the SOC analyst has enough context to make a decision here "triageConfig" : { "riskLevelRules" : [ { "name" : "Geographic Outlier", "comment" : "Determine if the user's geographic distance from the centroid of the historic logins is an outlier as compared to all users.", "rule" : "geo_outlier != null && geo_outlier", "score" : 10, "reason" : "FORMAT('user %s has a distance (%d) from the centroid of their last login is 5 std deviations (%f) from the median (%f)', user, geo_distance, dist_sd, dist_median)" } ], "aggregator" : "MAX" }
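How several triage rules combine under a MAX aggregator can be sketched in a few lines of Python. This is an illustration of the idea rather than Metron's triage engine: the rule predicates are plain Python callables instead of Stellar expressions, the `is_malicious` field name for the DGA rule is hypothetical, and returning 0 when no rule fires is an assumption.

```python
def triage(event, rules, aggregator=max):
    """Score an event: collect the scores of all matching rules, then aggregate."""
    scores = [rule["score"] for rule in rules if rule["rule"](event)]
    # Assumption: an event that matches no rule gets a score of 0.
    return aggregator(scores) if scores else 0


RULES = [
    {
        "name": "Geographic Outlier",
        "rule": lambda e: bool(e.get("geo_outlier")),
        "score": 10,
    },
    {
        "name": "DGA Detection",
        # Hypothetical field written by the DGA model enrichment.
        "rule": lambda e: e.get("is_malicious") == "malicious",
        "score": 100,
    },
]

print(triage({"geo_outlier": True}, RULES))
print(triage({"geo_outlier": True, "is_malicious": "malicious"}, RULES))
print(triage({"geo_outlier": False}, RULES))
```

With MAX, a single strong indicator such as the DGA hit dominates the score, while circumstantial indicators like the geographic outlier only matter when nothing stronger fires.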
  • 43. Key Takeaways • Cybersecurity is a big data problem, and we need a community-driven approach to solve it • Modern cybersecurity challenges require a modern data architecture to facilitate real-time response • Apache Metron provides an extensible, repeatable, and configuration-driven framework for real-time cybersecurity at scale
  • 44. Thank you.