building a system for machine and
event-oriented data
e. sammer | @esammer
big data day la 2016
© 2015 Rocana, Inc. All Rights Reserved.
context: it’s important
what we do
• we build a system for the operation of modern data centers
• triage and diagnostics, exploration, trends, advanced analytics of complex
systems
• our data: logs, metrics, human activity, anything that occurs in the data center
• “enterprise software” (i.e. we build for others.)
• today: how we built what we built
our typical customer use cases
• millions of events / sec, sub-second end-to-end latency, full fidelity retention,
critical use cases
• quality of service - “are credit card transactions happening fast enough?”
• fraud detection - “detect, investigate, prosecute, and learn from fraud.”
• forensic diagnostics - “what really caused the outage last friday?”
• security - “who’s doing what, where, when, why, and how, and is that ok?”
• user behavior - “capture and correlate user behavior with system performance,
then feed it to downstream systems in realtime.”
depth: 3 meters
high level architecture – data acquisition
high level architecture – processing, storage, query
guarantees
• no single point of failure exists
• all components scale horizontally[1]
• data retention and latency are functions of cost, not tech[1]
• every event is delivered provided no more than N - 1 failures occur (where N is
the kafka replication level)
• all operations, including upgrade, are online[2]
• every event is (or appears to be) delivered exactly once[3]
[1] we’re positive there’s a limit, but thus far it has been cost.
[2] from the user’s perspective, at a system level.
[3] when queried via our UI. lots of details here.
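The "appears to be delivered exactly once" guarantee can be illustrated with a minimal sketch. It assumes an at-least-once pipeline and read-time deduplication on the event `id` field; the slides do not say how the product actually achieves this, so treat the approach and the `deduplicate` helper as hypothetical.

```python
from typing import Any, Dict, Iterable, Iterator

Event = Dict[str, Any]


def deduplicate(events: Iterable[Event]) -> Iterator[Event]:
    """Yield each event id once, keeping the first copy seen.

    Assumption: redelivered events carry the same `id` as the original.
    """
    seen = set()
    for event in events:
        if event["id"] in seen:
            continue  # a redelivery from an upstream retry; hide it from the reader
        seen.add(event["id"])
        yield event


# Hypothetical stream in which event "b" was delivered twice.
delivered = [{"id": "a"}, {"id": "b"}, {"id": "b"}, {"id": "c"}]
visible_ids = [e["id"] for e in deduplicate(delivered)]
```

The in-memory `seen` set only works for a bounded result set, such as one query's output; a long-running stream would need a bounded or windowed structure instead.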
events
modeling our world
• everything is an event
• each event contains a timestamp, type, location, host, service, body, and
type-specific attributes (k/v pairs)
• build specialized aggregates as necessary - just optimized views of the data
event schema
{
  id: string,
  ts: long,
  event_type_id: int,
  location: string,
  host: string,
  service: string,
  body: [ null, string ],
  attributes: map<string>
}
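For readers who think in code, the schema above could be mirrored as a small Python type. This is a sketch only: the field names come from the slide, while the epoch-millisecond unit for `ts` and every sample value are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass(frozen=True)
class Event:
    """Python mirror of the slide's event schema."""
    id: str
    ts: int                       # event time; epoch milliseconds is an assumption
    event_type_id: int
    location: str
    host: str
    service: str
    body: Optional[str] = None    # the [ null, string ] union on the slide
    attributes: Dict[str, str] = field(default_factory=dict)  # type-specific k/v pairs


# All values below are hypothetical sample data.
example = Event(
    id="evt-0001",
    ts=1467331200000,
    event_type_id=100,
    location="dc-east",
    host="web01",
    service="dhclient",
    attributes={"syslog_severity": "6"},
)
```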
event types
• some event types are standard
– syslog, http, log4j, generic text record, …
• users define custom event types
• producers populate event type
• transformations can turn an event of type A into B
• event type metadata tells downstream systems how to interpret body and
attributes
ex: generic syslog event
event_type_id: 100, // rfc3164, rfc5424 (syslog)
body: … // raw syslog message bytes
attributes: { // extracted fields from body
  syslog_message: "DHCPACK from 10.10.0.1 (xid=0x45b63bdc)",
  syslog_severity: "6", // info severity
  syslog_facility: "3", // daemon facility
  syslog_process: "dhclient",
  syslog_pid: "668",
  …
}
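A rough sketch of how attributes like these might be extracted from a raw RFC 3164 line. The regular expression is deliberately simplified, and the sample line (timestamp and `host01`) is invented; only the PRI arithmetic (facility = PRI // 8, severity = PRI % 8) comes from the syslog RFCs. The real parser is not shown in the slides.

```python
import re
from typing import Dict

# Simplified RFC 3164 layout: <PRI>TIMESTAMP HOST TAG[PID]: MESSAGE
_SYSLOG_RE = re.compile(
    r"^<(?P<pri>\d{1,3})>"
    r"(?P<timestamp>[A-Z][a-z]{2}\s+\d{1,2} \d{2}:\d{2}:\d{2}) "
    r"(?P<host>\S+) "
    r"(?P<process>[^\[\s:]+)(?:\[(?P<pid>\d+)\])?: "
    r"(?P<message>.*)$"
)


def parse_syslog(line: str) -> Dict[str, str]:
    """Extract the attribute names used on the slide from one syslog line."""
    match = _SYSLOG_RE.match(line)
    if match is None:
        raise ValueError("not a recognizable RFC 3164 line")
    pri = int(match.group("pri"))
    attrs = {
        "syslog_message": match.group("message"),
        "syslog_severity": str(pri % 8),    # low three bits of PRI
        "syslog_facility": str(pri // 8),   # remaining high bits of PRI
        "syslog_process": match.group("process"),
    }
    if match.group("pid") is not None:
        attrs["syslog_pid"] = match.group("pid")
    return attrs


# PRI 30 = facility 3 (daemon) * 8 + severity 6 (info).
attrs = parse_syslog(
    "<30>Oct 11 22:14:15 host01 dhclient[668]: "
    "DHCPACK from 10.10.0.1 (xid=0x45b63bdc)"
)
```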
ex: generic http event
event_type_id: 102, // generic http event
body: … // raw http log message bytes
attributes: {
  http_req_method: "GET",
  http_req_vhost: "w2a-demo-02",
  http_req_path: "/api/v1/search?q=service%3Asshd&p=1&s=200",
  http_req_query: "q=service%3Asshd&p=1&s=200",
  http_resp_code: "200",
  …
}
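The relationship between `http_req_path` and `http_req_query` in this example can be reproduced with the Python standard library. A sketch, assuming the query attribute is simply the part of the request target after the `?`; the `params` dictionary is an extra illustration, not an attribute from the slide.

```python
from urllib.parse import parse_qs, urlsplit

request_target = "/api/v1/search?q=service%3Asshd&p=1&s=200"
parts = urlsplit(request_target)

attributes = {
    "http_req_path": request_target,   # the slide keeps the full target here
    "http_req_query": parts.query,     # everything after the '?'
}

# Decoded query parameters; %3A becomes ':'.
params = parse_qs(parts.query)
```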
stream processing
a reminder…
data processing
• each processing job gets a full stream of the fire hose, decides what it wants to
consider or operate on
• output of “non-terminal” jobs is always just events
• result: all processing jobs are composable
• many jobs take user rules or configuration from our ui
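The composability claim follows from every non-terminal job having the same shape: events in, events out. A minimal sketch using Python generators, where direct chaining stands in for jobs reading from and writing back to the shared stream. The job names and the event type ids 9001 and 9002 are hypothetical; only id 102 (generic http) comes from the slides.

```python
from typing import Any, Callable, Dict, Iterable, Iterator

Event = Dict[str, Any]
Job = Callable[[Iterable[Event]], Iterator[Event]]

HTTP = 102           # generic http event, from the slides
HTTP_ERROR = 9001    # hypothetical derived event type
ERROR_COUNT = 9002   # hypothetical summary event type


def flag_http_errors(events: Iterable[Event]) -> Iterator[Event]:
    """Pass every event through and emit a derived event for each HTTP 5xx."""
    for e in events:
        yield e
        code = e["attributes"].get("http_resp_code", "")
        if e["event_type_id"] == HTTP and code.startswith("5"):
            yield {"event_type_id": HTTP_ERROR, "host": e["host"],
                   "attributes": {"http_resp_code": code}}


def count_errors_per_host(events: Iterable[Event]) -> Iterator[Event]:
    """Look only at HTTP_ERROR events; emit one count event per host."""
    counts: Dict[str, int] = {}
    for e in events:
        if e["event_type_id"] == HTTP_ERROR:
            counts[e["host"]] = counts.get(e["host"], 0) + 1
    for host, n in sorted(counts.items()):
        yield {"event_type_id": ERROR_COUNT, "host": host,
               "attributes": {"count": str(n)}}


def compose(*jobs: Job) -> Job:
    """Chain jobs; possible only because each one maps events to events."""
    def run(events: Iterable[Event]) -> Iterator[Event]:
        stream: Iterator[Event] = iter(events)
        for job in jobs:
            stream = job(stream)
        return stream
    return run


firehose = [
    {"event_type_id": HTTP, "host": "web01", "attributes": {"http_resp_code": "200"}},
    {"event_type_id": HTTP, "host": "web01", "attributes": {"http_resp_code": "503"}},
    {"event_type_id": HTTP, "host": "web02", "attributes": {"http_resp_code": "500"}},
]
results = list(compose(flag_http_errors, count_errors_per_host)(firehose))
```

The second job consumes the whole input before emitting, which is a batch simplification; a streaming version would emit per window.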
the jobs
• transformation engine: configuration-based data transformation
• metric aggregation: olap cube construction of time series data (e.g. host 17
user cpu time)
• model build/eval: train/evaluate various kinds of models (e.g. anomaly
detection)
• trigger engine: detect complex patterns in the stream, emit events on match
(e.g. complex event processing, automated workflow, alerting)
• action engine: perform some action upon receiving a specific event type (e.g.
email notification, 3rd party api invocation)
• storage: write all the things to hdfs
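As an illustration of the trigger engine idea (detect a pattern, emit an event on match), here is a small threshold rule: emit an alert when a host produces a given number of matching events inside a time window. The rule, the event type ids, and the assumption that events arrive in timestamp order per host are all hypothetical; the real engine's rule language is not described in the slides.

```python
from collections import defaultdict, deque
from typing import Any, Deque, Dict, Iterable, Iterator

Event = Dict[str, Any]

LOGIN_FAILURE = 500   # hypothetical event type to match
ALERT = 9100          # hypothetical event type emitted on match


def threshold_trigger(events: Iterable[Event], match_type: int,
                      threshold: int, window_ms: int) -> Iterator[Event]:
    """Emit an ALERT event when `threshold` matches occur within `window_ms`."""
    recent: Dict[str, Deque[int]] = defaultdict(deque)
    for e in events:
        if e["event_type_id"] != match_type:
            continue  # the job sees the full stream but ignores other types
        timestamps = recent[e["host"]]
        timestamps.append(e["ts"])
        # Drop matches that have fallen out of the window.
        while timestamps and e["ts"] - timestamps[0] >= window_ms:
            timestamps.popleft()
        if len(timestamps) >= threshold:
            yield {"event_type_id": ALERT, "ts": e["ts"], "host": e["host"],
                   "attributes": {"matched": str(len(timestamps))}}
            timestamps.clear()  # start counting again after firing


stream = [
    {"event_type_id": LOGIN_FAILURE, "ts": 0, "host": "a"},
    {"event_type_id": LOGIN_FAILURE, "ts": 1000, "host": "a"},
    {"event_type_id": LOGIN_FAILURE, "ts": 2000, "host": "a"},
    {"event_type_id": LOGIN_FAILURE, "ts": 10000, "host": "a"},
]
alerts = list(threshold_trigger(stream, LOGIN_FAILURE, threshold=3, window_ms=5000))
```

Because the output is itself an event, an action engine style job could consume `ALERT` events without knowing anything about the rule that produced them.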
transformation use cases
event feedback loops
metrics and time series
aggregation
• used for host/service metrics
• two halves: on write and on query
• data model: (dimensions) => (aggregates)
• on write
– reduce(a: A, b: A) => A over window
– store “base” aggregates, all associative and commutative
• on query
– perform same aggregate or derivative aggregates
– group by the same dimensions
– we use SQL (impala+parquet+hdfs)
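The two halves above can be sketched in a few lines. On write, samples are folded into "base" aggregates (count, sum, min, max) per window and dimension tuple with an associative, commutative `merge`; on query, a derivative aggregate such as the mean is computed from the stored bases. The slides say the query side uses SQL, so the Python query function, the `Agg` type, and the sample metric are illustrative assumptions.

```python
from dataclasses import dataclass
from functools import reduce
from typing import Dict, Iterable, Tuple


@dataclass(frozen=True)
class Agg:
    """Base aggregates; each field merges associatively and commutatively."""
    count: int
    total: float
    minimum: float
    maximum: float

    @staticmethod
    def of(value: float) -> "Agg":
        return Agg(1, value, value, value)


def merge(a: Agg, b: Agg) -> Agg:
    """reduce(a: A, b: A) => A, as on the slide."""
    return Agg(a.count + b.count, a.total + b.total,
               min(a.minimum, b.minimum), max(a.maximum, b.maximum))


Key = Tuple[int, str, str]   # (window start, host, metric) dimensions


def aggregate_on_write(samples: Iterable[Tuple[int, str, str, float]],
                       window_ms: int) -> Dict[Key, Agg]:
    stored: Dict[Key, Agg] = {}
    for ts, host, metric, value in samples:
        key = (ts - ts % window_ms, host, metric)
        agg = Agg.of(value)
        stored[key] = merge(stored[key], agg) if key in stored else agg
    return stored


def mean_on_query(aggs: Iterable[Agg]) -> float:
    """Derivative aggregate: mean = sum / count over the merged bases."""
    merged = reduce(merge, aggs)
    return merged.total / merged.count


samples = [
    (0, "host17", "cpu.user", 10.0),
    (30_000, "host17", "cpu.user", 30.0),
    (60_000, "host17", "cpu.user", 50.0),
]
stored = aggregate_on_write(samples, window_ms=60_000)
overall_mean = mean_on_query(stored.values())
```

Storing sum and count instead of a per-window mean is what keeps the query-side result exact when windows are combined.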
aside: late arriving data (it’s a thing)
• never trust a (wall) clock
• producer determines event time, rest of the system uses this always
• data that shows up late always processed according to event time
• apache beam describes these issues perfectly
• this is real and you must deal with it
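A minimal sketch of what "processed according to event time" means in practice: the window an event lands in is derived from the timestamp the producer assigned, never from the clock at arrival. The `EventTimeCounter` class and the sample timestamps are hypothetical.

```python
from collections import defaultdict
from typing import Dict


def window_start(event_ts_ms: int, window_ms: int) -> int:
    """Start of the fixed window containing the producer-assigned timestamp."""
    return event_ts_ms - event_ts_ms % window_ms


class EventTimeCounter:
    """Counts events per event-time window, regardless of arrival order."""

    def __init__(self, window_ms: int) -> None:
        self.window_ms = window_ms
        self.counts: Dict[int, int] = defaultdict(int)

    def add(self, event_ts_ms: int) -> int:
        start = window_start(event_ts_ms, self.window_ms)
        self.counts[start] += 1   # a late event updates an older window
        return start


counter = EventTimeCounter(window_ms=60_000)
# Arrival order; the third event is late and belongs to the first window.
for ts in (1_000, 61_000, 2_000):
    counter.add(ts)
```

Note that the late event changes a window whose result may already have been read, which is why downstream consumers must tolerate updated aggregates.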
extension, pain, and advice
extending the system
• custom producers
• custom consumers
• event types
• parser / transformation plugins
• custom metric definition and aggregate functions
• custom processing jobs on landed data
pain (aka: the struggle is real)
• lots of tradeoffs when picking a stream processing solution
– samza: right features, but low level programming model, not supported by vendors.
missing security features.
– storm: too rigid, too slow. not supported by all Hadoop vendors.
– flink: relatively new, fledgling community. growing.
– spark streaming: tons of issues initially, but lots of community energy. improving.
• stack complexity, (relative im)maturity
• beam-style retractions required for correct, timely, efficient aggregates of
complex metrics (those that are not associative and commutative)
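A small illustration of why some metrics cannot be kept as mergeable partial results: the median of per-window medians is generally not the median of the underlying data. The numbers are invented. The link to retractions is an interpretation: when partials cannot be merged, a late or corrected input means a previously emitted result has to be withdrawn and recomputed.

```python
from statistics import median

window_a = [1, 2, 100]
window_b = [3, 4]

# Combining stored per-window medians gives the wrong answer...
median_of_medians = median([median(window_a), median(window_b)])

# ...compared with the median over all of the raw values.
true_median = median(window_a + window_b)
```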
if you’re going to try this…
• read all the literature on stream processing[1]
• treat it like the distributed systems problem it is
• understand, make, and make good on guarantees
• find the right abstractions
• never trust the hand waving or “hello worlds”
• fully evaluate the projects/products in this space
• understand it’s not just about search
[1] wait, like all of it? yeah, like all of it.
things I didn’t talk about
• reprocessing data when bad code / transformations are detected
• dealing with data quality issues (“the struggle is real” part 2)
• the user interface and all the fancy analytics
– data visualization and exploration
– event search
– anomalous trend and event detection
– metric, source, and event correlation
– motif finding
– noise reduction and dithering
• event delivery semantics (e.g. at least/most/exactly once, etc.)
questions?
thank you.
@esammer | esammer@rocana.com

Big Data Day LA 2016/ Hadoop/ Spark/ Kafka track - Building an Event-oriented Data Platform, Eric Sammer, CTO, Rocana
