1 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Next Gen Tooling for Building
Stream Analytics App
Berlin Talk
George Vetticaden
VP of Emerging Products at Hortonworks
Streaming Analytics Manager (SAM)
What is it?
• Helps users build and deploy complex stream analytics apps through a graphical interface, without writing code
• An open source, ASF-licensed tool
– Project name: Streamline, https://github.com/hortonworks/streamline
Key Design Principles
• Build stream analytics apps without specialized skill sets
• Support multiple underlying streaming engines (Storm, Spark Streaming, Flink)
• Extensibility: provide an SDK to plug in custom sources, sinks, processors, and UDFs
• Schema is a first-class citizen
Who Uses SAM?
SAM is All about Doing Real-Time Analytics on the Stream
Real-Time Analytics
Real-Time Descriptive Analytics: What is happening right now?
Real-Time Predictive Analytics: What could happen now/soon?
Real-Time Prescriptive Analytics: What should we do right now?
Showcasing SAM with a Use Case
Trucking company with a large fleet of international trucks
A truck generates millions of events for a given route; an event could be:
• 'Normal' events: starting / stopping of the vehicle
• 'Violation' events: speeding, excessive acceleration and braking, unsafe tail distance
• 'Speed' events: the speed of a driver, reported every minute
The company uses an application that monitors truck locations and violations from the truck/driver in real time
Route?
Truck?
Driver?
Analysts query a broad
history to understand if
today’s violations are
part of a larger problem
with specific routes,
trucks, or drivers
Two Streaming Sources: Geo Event and Speed Stream
• Each truck emits two different event streams
– Truck Geo Event
– Truck Speed Event
Truck Geo Event:
eventTime | eventTimeLong | eventSource | truckId | driverId | driverName | routeId | route | eventType | latitude | longitude | correlationId
2018-01-22 15:00:58.493|1516654858493| truck_geo_event |40| 23 | G Vetticaden | 1090292248 | Peoria to Ceder Rapids Route 2 | Lane Departure| 40.7 | -89.52 |1|
Truck Speed Event:
eventTime | eventTimeLong | eventSource | truckId | driverId | driverName | routeId | route | speed
2018-01-22 15:00:58.673|1516654858673| truck_speed_event|40| 23 | G Vetticaden| 1090292248 | Peoria to Ceder Rapids Route 2 | 73 |
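The two sample records are pipe-delimited, so they can be split into named fields with a few lines of code. The following is a minimal sketch, not part of the original talk. It assumes the field order eventTime, eventTimeLong, eventSource, truckId, driverId, driverName, routeId, route, followed by eventType, latitude, longitude, correlationId for geo events or speed for speed events (inferred from the slide's labels and sample values). The function and constant names are illustrative.

```python
from typing import Dict

# Field order is an assumption inferred from the slide's labels and sample values.
COMMON_FIELDS = ["eventTime", "eventTimeLong", "eventSource", "truckId",
                 "driverId", "driverName", "routeId", "route"]
GEO_FIELDS = COMMON_FIELDS + ["eventType", "latitude", "longitude", "correlationId"]
SPEED_FIELDS = COMMON_FIELDS + ["speed"]

# The two sample records from the slide, copied verbatim.
GEO_SAMPLE = ("2018-01-22 15:00:58.493|1516654858493| truck_geo_event |40| 23 "
              "| G Vetticaden | 1090292248 | Peoria to Ceder Rapids Route 2 "
              "| Lane Departure| 40.7 | -89.52 |1|")
SPEED_SAMPLE = ("2018-01-22 15:00:58.673|1516654858673| truck_speed_event|40| 23 "
                "| G Vetticaden| 1090292248 | Peoria to Ceder Rapids Route 2 | 73 |")


def parse_event(line: str) -> Dict[str, str]:
    """Split one pipe-delimited record into a field-name -> value mapping."""
    # Drop the trailing delimiter, split on '|', and trim padding around values.
    values = [v.strip() for v in line.strip().rstrip("|").split("|")]
    if len(values) < 3:
        raise ValueError("record too short to contain an eventSource")
    # The third field (eventSource) tells the two record layouts apart.
    fields = GEO_FIELDS if values[2] == "truck_geo_event" else SPEED_FIELDS
    if len(values) != len(fields):
        raise ValueError(f"expected {len(fields)} fields, got {len(values)}")
    return dict(zip(fields, values))


if __name__ == "__main__":
    print(parse_event(GEO_SAMPLE)["eventType"])
    print(parse_event(SPEED_SAMPLE)["speed"])
```

All values are kept as strings here; converting eventTimeLong, latitude, longitude, and speed to numbers would be a natural next step once a schema is agreed on.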
Common Streaming Analytics Requirements
Requirement # | Requirement Description
Req. #1 | Create streams consuming from the two Kafka topics to which NiFi delivered the enriched geo and speed streams.
Req. #2 | Join the streams of the geo and speed sensors over a time-based aggregation window.
Req. #3 | Apply rules on the stream to filter on events of interest.
Req. #4 | Calculate the average speed of each driver over a 3-minute window and create an alert for speeding drivers.
Req. #5 | Enrich the stream with features required for a machine learning (ML) model. The enrichment entails performing lookups for driver HR info, hours/miles driven in the past week, and weather info.
Req. #6 | Normalize the events in the stream to feed into the PMML model.
Req. #7 | Execute a predictive logistic regression model, built with Spark ML, on the stream to predict whether a driver is in danger of committing a violation.
Req. #8 | Alert and feed into a real-time dashboard if the model predicted a violation.
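To make Req. #2 and Req. #4 concrete, here is a rough sketch of a windowed join and a windowed average in plain Python. It is not how SAM, Storm, or Spark implement windowing; it only illustrates the idea with tumbling windows over in-memory lists. The 1-second join window, the speed limit of 80, and all function names are assumptions for illustration.

```python
from collections import defaultdict
from typing import Any, Dict, Iterable, List

Event = Dict[str, Any]

JOIN_WINDOW_MS = 1000           # assumed join window
AVG_WINDOW_MS = 3 * 60 * 1000   # 3-minute window from Req. #4
SPEED_LIMIT = 80                # assumed alert threshold


def window_start(event_time_ms: int, size_ms: int) -> int:
    """Return the start of the tumbling window that contains the timestamp."""
    return event_time_ms - (event_time_ms % size_ms)


def join_streams(geo: Iterable[Event], speed: Iterable[Event],
                 size_ms: int = JOIN_WINDOW_MS) -> List[Event]:
    """Req. #2: pair geo and speed events of the same driver in the same window."""
    index = defaultdict(list)
    for s in speed:
        index[(s["driverId"], window_start(s["eventTimeLong"], size_ms))].append(s)
    joined = []
    for g in geo:
        key = (g["driverId"], window_start(g["eventTimeLong"], size_ms))
        for s in index.get(key, []):
            joined.append({**g, "speed": s["speed"]})
    return joined


def average_speed_alerts(speed: Iterable[Event],
                         limit: float = SPEED_LIMIT) -> List[Event]:
    """Req. #4: average speed per driver and window; alert when above the limit."""
    totals = defaultdict(lambda: [0.0, 0])  # (driver, window) -> [sum, count]
    for e in speed:
        key = (e["driverId"], window_start(e["eventTimeLong"], AVG_WINDOW_MS))
        totals[key][0] += e["speed"]
        totals[key][1] += 1
    alerts = []
    for (driver, start), (total, count) in sorted(totals.items()):
        average = total / count
        if average > limit:
            alerts.append({"driverId": driver, "windowStart": start,
                           "avgSpeed": average})
    return alerts
```

A real streaming engine also has to deal with late and out-of-order events and with state that does not fit in memory, which this sketch ignores.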
Implementing these 8 Requirements in Storm – Complex and Lots of
Java Code
Implementing this with Spark Structured Streaming – Complex and
Lots of Scala Code
Implementing this with SAM – Easy as Pie
Demo Implementing
Streaming Reqs 1 - 4 with SAM
Real-Time Predictive Analytics
• Question: There are no violation events, but what might happen that I should be worried about?
• My data science team has a model that can predict that based on:
– Weather
– Roads
– Driver HR info like driver certification status, wagePlan
– Driver timesheet info like hours and miles logged over the last week
Building the Predictive Model
1. Explore a small subset of events to identify predictive features and make a hypothesis. E.g. hypothesis: “foggy weather causes driver violations”
2. Identify suitable ML algorithms to train a model – we will use classification algorithms as we have labeled events data
3. Transform enriched events data to a format that is friendly to Spark MLlib – many ML libs expect training data in a certain format
4. Train a logistic classification Spark model on YARN, with above events as training input, and iterate to fine tune generated model
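Step 4 trains a logistic classification model. The talk does this with Spark MLlib on YARN; the sketch below is not that. It is a plain-Python gradient-descent stand-in on made-up toy data, meant only to show what the training step estimates (one weight per feature plus a bias). The two features, the data, the learning rate, and the epoch count are all assumptions.

```python
import math
from typing import List, Sequence, Tuple


def sigmoid(z: float) -> float:
    """Logistic function mapping a real-valued score to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(x: Sequence[float], weights: Sequence[float], bias: float) -> float:
    """Probability of the positive class (here: a violation) for one row."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


def train_logistic(rows: Sequence[Sequence[float]], labels: Sequence[int],
                   lr: float = 0.5, epochs: int = 2000) -> Tuple[List[float], float]:
    """Fit weights and bias by batch gradient descent on the log loss."""
    weights = [0.0] * len(rows[0])
    bias = 0.0
    n = len(rows)
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for x, y in zip(rows, labels):
            error = predict_proba(x, weights, bias) - y
            for j, xi in enumerate(x):
                grad_w[j] += error * xi
            grad_b += error
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


# Toy labeled events (made up): [foggy (0 or 1), hours driven last week / 100]
ROWS = [[1, 0.7], [1, 0.6], [0, 0.4], [0, 0.3], [1, 0.5], [0, 0.6]]
LABELS = [1, 1, 0, 0, 1, 0]  # 1 = violation occurred

if __name__ == "__main__":
    w, b = train_logistic(ROWS, LABELS)
    print(predict_proba([1, 0.6], w, b) > 0.5)
```

The toy data encodes the slide's example hypothesis that foggy weather goes with violations, so the learned weight for the first feature ends up positive.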
Logistic Regression Model
Scoring the Predictive Model using SAM
5. Model Registry: Export the Spark MLlib model and import it into HDF’s Model Registry
6. Enrich with Features: Use SAM’s enrich/custom processors to enrich the event with the features required for the model
7. Transform/Normalize: Use SAM’s projection/custom processors to transform/normalize the streaming event and the features required for the model
8. Score Model: Use SAM’s PMML processor to score the model for each stream event with its required features
9. Alert / Notify / Action: Use SAM’s rule and notification processors to alert, notify and take action using the results of the model
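The enrich, normalize, score, and alert steps above can be pictured as a chain of small functions applied to each event. The sketch below is a plain-Python illustration under stated assumptions: the lookup tables, coefficients, scaling constants, and threshold are invented, and the logistic formula is evaluated directly instead of going through a PMML processor or a model registry.

```python
import math
from typing import Any, Dict, Optional

# Hypothetical lookup tables standing in for the enrichment sources.
DRIVER_HR = {23: {"certified": True, "wagePlan": "miles"},
             11: {"certified": True, "wagePlan": "hours"}}
DRIVER_TIMESHEET = {23: {"hours": 70, "miles": 3300},
                    11: {"hours": 40, "miles": 1000}}
ROUTE_WEATHER = {1090292248: {"foggy": True}, 42: {"foggy": False}}

# Hypothetical logistic regression coefficients.
WEIGHTS = {"certified": -1.0, "wage_miles": 0.5, "hours": 2.0,
           "miles": 1.0, "foggy": 1.5}
BIAS = -2.0


def enrich(event: Dict[str, Any]) -> Dict[str, Any]:
    """Step 6: add HR, timesheet, and weather features to the event."""
    enriched = dict(event)
    enriched.update(DRIVER_HR[event["driverId"]])
    enriched.update(DRIVER_TIMESHEET[event["driverId"]])
    enriched.update(ROUTE_WEATHER[event["routeId"]])
    return enriched


def normalize(enriched: Dict[str, Any]) -> Dict[str, float]:
    """Step 7: turn raw values into numeric model inputs (assumed scaling)."""
    return {
        "certified": 1.0 if enriched["certified"] else 0.0,
        "wage_miles": 1.0 if enriched["wagePlan"] == "miles" else 0.0,
        "hours": enriched["hours"] / 100.0,
        "miles": enriched["miles"] / 5000.0,
        "foggy": 1.0 if enriched["foggy"] else 0.0,
    }


def score(features: Dict[str, float]) -> float:
    """Step 8: logistic regression score, the probability of a violation."""
    z = BIAS + sum(WEIGHTS[name] * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))


def alert(event: Dict[str, Any], probability: float,
          threshold: float = 0.5) -> Optional[str]:
    """Step 9: produce an alert message when the predicted risk is high."""
    if probability > threshold:
        return f"Driver {event['driverId']}: predicted violation risk {probability:.2f}"
    return None


def process(event: Dict[str, Any]) -> Optional[str]:
    """Run one event through enrich -> normalize -> score -> alert."""
    return alert(event, score(normalize(enrich(event))))
```

Keeping each step a separate function mirrors the slide's one-processor-per-step layout and lets each step be tested on its own.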
Demo Implementing
Streaming Reqs 5 - 8
New SAM “Test” Mode
Key Highlights
• User persona: Developer
• Allows developers to test a SAM app locally without deploying to a cluster
• SAM Test Mode lifecycle
1. Create a named test
2. Mock out sources (e.g. Kafka) with test data
3. Execute the test case
4. Visualize and validate the data as it traverses the SAM DAG visualization
5. Download the test case results
• Steps 1-5 can be done via the UI or REST
• Enables writing automated tests (JUnit)
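The lifecycle above follows a familiar testing pattern: replace the live source with fixed test data, run the logic, and assert on what comes out. The sketch below shows that pattern in plain Python and does not call SAM or its REST API. The rule, the helper names, and the test records are hypothetical stand-ins.

```python
from typing import Any, Callable, Dict, Iterable, List

Event = Dict[str, Any]


def violation_rule(event: Event) -> bool:
    """Rule under test (hypothetical): keep only events that are not 'Normal'."""
    return event["eventType"] != "Normal"


def run_test_case(mock_source: Iterable[Event],
                  rule: Callable[[Event], bool]) -> List[Event]:
    """Step 3: feed the mocked source records through the rule, collect output."""
    return [event for event in mock_source if rule(event)]


# Step 2: fixed test data that takes the place of the real Kafka source.
MOCK_GEO_EVENTS = [
    {"driverId": 23, "eventType": "Normal"},
    {"driverId": 23, "eventType": "Lane Departure"},
    {"driverId": 11, "eventType": "Overspeed"},
]


def test_only_violations_pass() -> None:
    """Step 4: validate the records that come out of the rule."""
    results = run_test_case(MOCK_GEO_EVENTS, violation_rule)
    assert [e["eventType"] for e in results] == ["Lane Departure", "Overspeed"]


if __name__ == "__main__":
    test_only_violations_pass()
    print("test case passed")
```

Because the test takes its input from a list instead of a broker, it can run on a developer machine or in a CI job with no cluster available.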
Build Automated JUnit Tests and CI & CD
Pipelines with SAM
Visualizing Output of Test Mode is Great… But Users Want CI/CD
Key Highlights
• What do users really want?
– Automated unit tests: write JUnit tests for streaming apps, as they do for their traditional apps
– Continuous integration: incorporate streaming apps into their CI environments with Jenkins
– Continuous delivery: deliver new features to the business in a continuous fashion
• SAM addresses these needs. How?
– SAM has first-class support for REST.
– Everything in the UI can be done with SAM REST. This includes SAM Test Mode.
• SAM REST allows you to
– Create service pools, create environments, create test cases, set up test data, download test results, validate, and deploy the app to different environments

Next gen tooling for building streaming analytics apps: code-less development, unit and integration testing, continuous integration and delivery
