Design Patterns for Large-Scale
Real-Time Learning
Sean Owen / Director of Data Science / Cloudera

1
What We Talk About When
We Talk About Data Science

2
www.quora.com/Data-Science/What-is-the-difference-between-a-data-scientist-and-a-statistician
3
4
5
Data Science Is Exploratory Analytics?

www.tc.umn.edu/~zief0002/Comparing-Groups/blog.html
thenextweb.com/microsoft/2013/07/08/microsoft-brings-the-office-store-to-22-new-markets-adds-power-bi-an-intelligence-tool-to-office-365/

6
7
Example:
• Search, ML over Patient Data
• MapReduce for indexing, learning
• HBase for storage and fast access
• Also: Storm for incremental update
• And: relational DB for most recent derived data
• API façade for input; API for querying learning

Engineering

8

Machine Learning

engineering.cerner.com/2013/02/near-real-time-processing-over-hadoop-and-hbase/
Adding Operational Analytics

9
2014: Lab to Factory

10
Data Science Will Be Operational Analytics

11
I Built A Model. Now What?

Collect Input

Repeat

12

Build Model

Query Model
I Built A Model On Hadoop. Now What?

?

Collect Input

?
Repeat

13

Build Model

?

Query Model
Example: Oryx

14
www.mwttl.com/wp-content/uploads/2013/11/IMG_5446_edited-2_mwttl.jpg
15
cloudera/ml

+

16
Gaps to fill, and Goals
• Model Building
  • Large-scale
  • Continuous
  • Apache Hadoop™-based
  • Few, good algorithms
• Model Serving
  • Real-time query
  • Real-time update
• Algorithms
  • Parallelizable
  • Updateable
  • Works on diverse input
• Interoperable
  • PMML model format
  • Simple REST API
  • Open source

17
Large-Scale or Real-Time?
Large-Scale
Offline
Batch

vs

Real-Time
Online
Streaming

Why Don’t We Have Both?

λ!
18
Lambda Architecture
• Batch, Stream Processing are different
• Tackle separately in 2+ Layers
• Batch Layer: offline, asynchronous
• Serving / Speed Layer: real-time, incremental, approximate

… λ?

jameskinley.tumblr.com/post/37398560534/the-lambda-architecture-principles-for-architecting
19
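
The split between a batch layer and a speed layer can be sketched with a toy counter. This is only an illustration under simple assumptions: the class and function names are invented here, the "model" is a plain count per key, and the speed view is exact, whereas a real speed layer is often approximate.

```python
from collections import Counter


class BatchLayer:
    """Offline layer: recomputes a complete view from all stored input."""

    def __init__(self):
        self.log = []  # stands in for durable storage of every input event

    def append(self, event):
        self.log.append(event)

    def build_view(self):
        # Full recomputation over everything seen so far.
        return Counter(self.log)


class SpeedLayer:
    """Real-time layer: incremental view of input not yet in a batch view."""

    def __init__(self):
        self.recent = Counter()

    def update(self, event):
        self.recent[event] += 1

    def reset(self):
        # Called once a new batch view has absorbed the recent input.
        self.recent.clear()


def query(batch_view, speed_view, key):
    """Answer from the batch view, corrected by the recent increments."""
    return batch_view[key] + speed_view[key]


batch, speed = BatchLayer(), SpeedLayer()
for event in ["a", "b", "a"]:
    batch.append(event)   # input is always stored for the next batch run
    speed.update(event)   # and applied immediately for fresh answers

view = batch.build_view()  # periodic batch run
speed.reset()
```

Queries stay fresh between batch runs because new input lands in both layers; only the speed layer's small view is touched synchronously.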
Batch

20

Serving/Speed
Two Layers
• Computation Layer
  • Java-based server process
  • Client of Hadoop 2.x
  • Periodically builds “generation” from recent data and past model
  • Baby-sits MapReduce* jobs (or, locally in-core)
  • Publishes models
• Serving Layer
  • Apache Tomcat™-based server process
  • Consumes models from HDFS (or local FS)
  • Serves queries from model in memory
  • Updates from new input
  • Also writes input to HDFS
  • Replicas for scale

* Apache Spark later

21
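
The hand-off between the two layers (publish a generation, then consume the newest one) might look roughly like the sketch below. Everything specific is an assumption made for illustration: a local temporary directory stands in for HDFS, models are JSON files rather than PMML, and the zero-padded file naming and function names are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def publish_generation(model_dir, generation, model):
    """Computation layer: write a finished model as a numbered generation."""
    path = Path(model_dir) / f"{generation:05d}.json"
    path.write_text(json.dumps(model))
    return path


def load_latest_generation(model_dir):
    """Serving layer: load the newest published generation into memory."""
    files = sorted(Path(model_dir).glob("*.json"))  # zero-padding keeps order
    if not files:
        return None, None
    latest = files[-1]
    return int(latest.stem), json.loads(latest.read_text())


model_dir = tempfile.mkdtemp()  # stand-in for a shared HDFS directory
publish_generation(model_dir, 1, {"a": 2})
publish_generation(model_dir, 2, {"a": 3, "b": 1})
generation, model = load_latest_generation(model_dir)
```

Because each serving replica only reads published files, more replicas can be added without coordinating with the computation layer.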
Collaborative Filtering : ALS
• Alternating Least Squares
• Latent-factor model
• Accepts implicit or explicit feedback
• Real-time update via fold-in of input
• No cold-start
• Parallelizable

[Figure: matrix factors X and Yᵀ]

22
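
The fold-in idea can be sketched as follows: hold the item factors Y fixed and solve a small regularized least-squares problem for the new user's vector. This is ordinary explicit-feedback least squares, not necessarily the approximation Oryx applies; `solve`, `fold_in_user`, `score` and the `reg` parameter are names chosen for this illustration.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back-substitution
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def fold_in_user(Y, feedback, reg=0.1):
    """Estimate a user vector from item factors Y and {item_index: strength}.

    Solves (Y_u^T Y_u + reg * I) x = Y_u^T r over the items the user touched.
    """
    k = len(Y[0])
    A = [[reg if i == j else 0.0 for j in range(k)] for i in range(k)]
    b = [0.0] * k
    for item, strength in feedback.items():
        y = Y[item]
        for i in range(k):
            b[i] += strength * y[i]
            for j in range(k):
                A[i][j] += y[i] * y[j]
    return solve(A, b)


def score(x_u, y_i):
    """Predicted preference: dot product of user and item factors."""
    return sum(a * b for a, b in zip(x_u, y_i))


Y = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # toy item factors, k = 2
x_new = fold_in_user(Y, {0: 1.0})          # user has interacted with item 0
```

Only a k-by-k system is solved per update, which is what makes this kind of update cheap enough to run at request time.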
Clustering : k-means++
• Well-known and understood
• Parallelizable
• Clusters updateable

cwiki.apache.org/confluence/display/MAHOUT/K-Means+Clustering
23
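
A minimal sketch of two of the points above: k-means++ seeding, and updating a cluster as new points arrive. The running-mean update is one simple way to make clusters updateable and is an assumption here, not a description of Oryx's implementation; function names are illustrative.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeanspp_seeds(points, k, rng):
    """k-means++ seeding.

    The first center is uniform at random; each later center is drawn with
    probability proportional to its squared distance from the nearest center
    chosen so far. Assumes at least k distinct points.
    """
    centers = [rng.choice(points)]
    while len(centers) < k:
        weights = [min(sq_dist(p, c) for c in centers) for p in points]
        centers.append(rng.choices(points, weights=weights, k=1)[0])
    return centers


def update_cluster(center, count, point):
    """Fold one new point into a centroid using a running mean."""
    count += 1
    new_center = [c + (x - c) / count for c, x in zip(center, point)]
    return new_center, count


points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
seeds = kmeanspp_seeds(points, 2, random.Random(0))
center, count = update_cluster([0.0, 0.0], 1, (2.0, 2.0))
```

The distance-weighted draw spreads the initial centers out, and the running mean needs only the centroid and a count, so no old points have to be revisited.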
Classification / Regression : RDF
• Random Decision Forests
• Ensemble method
• Numeric, categorical features and target
• Very parallel
• Nodes updateable
• Works well on many problems

[Figure: decision tree with tests "age > 30", "female?", "income > 20000" and leaves Yes, Yes, Yes, No]

24
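
How an ensemble of decision trees reaches a prediction can be sketched with a majority vote. The trees below are hand-written for illustration, not trained; they borrow the feature names and thresholds from the figure, but their shapes are invented, and only numeric tests are shown.

```python
from collections import Counter

# A node is either a leaf label (str) or a tuple:
# (feature, threshold, subtree_if_greater, subtree_otherwise)


def predict_tree(node, example):
    """Walk one tree from the root to a leaf."""
    while not isinstance(node, str):
        feature, threshold, if_true, if_false = node
        node = if_true if example[feature] > threshold else if_false
    return node


def predict_forest(trees, example):
    """Ensemble prediction: the label most trees vote for."""
    votes = Counter(predict_tree(tree, example) for tree in trees)
    return votes.most_common(1)[0][0]


trees = [
    ("age", 30, "Yes", ("income", 20000, "Yes", "No")),
    ("income", 20000, "Yes", "No"),
    ("age", 30, ("income", 20000, "Yes", "No"), "No"),
]

label = predict_forest(trees, {"age": 25, "income": 50000})
```

Each tree is evaluated independently, which is why both training and scoring parallelize well.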
PMML
• Predictive Modeling Markup Language
• XML-based format for predictive models
• Standardized by Data Mining Group (www.dmg.org)
• Wide tool support

<PMML xmlns="http://www.dmg.org/PMML-4_1"
      version="4.1">
  <Header copyright="www.dmg.org"/>
  <DataDictionary numberOfFields="5">
    <DataField name="temperature"
               optype="continuous"
               dataType="double"/>
    …
  </DataDictionary>
  <TreeModel modelName="golfing"
             functionName="classification">
    <MiningSchema>
      <MiningField name="temperature"/>
      …
    </MiningSchema>
    <Node score="will play">
      <Node score="will play">
        <SimplePredicate field="outlook"
                         operator="equal"
                         value="sunny"/>
        …
      </Node>
    </Node>
  </TreeModel>
</PMML>

www.dmg.org/v4-1/TreeModel.html
25
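
Because PMML is plain XML, any XML library can read it. The sketch below parses a trimmed-down document modeled on the slide's example using Python's standard library. The document is filled in only for illustration (two fields instead of five, a `<True/>` predicate added at the root) and has not been validated against the PMML schema.

```python
import xml.etree.ElementTree as ET

# Namespace prefix mapping for PMML 4.1 element lookups.
PMML_NS = {"p": "http://www.dmg.org/PMML-4_1"}

doc = """<PMML xmlns="http://www.dmg.org/PMML-4_1" version="4.1">
  <Header copyright="www.dmg.org"/>
  <DataDictionary numberOfFields="2">
    <DataField name="temperature" optype="continuous" dataType="double"/>
    <DataField name="outlook" optype="categorical" dataType="string"/>
  </DataDictionary>
  <TreeModel modelName="golfing" functionName="classification">
    <MiningSchema>
      <MiningField name="temperature"/>
      <MiningField name="outlook"/>
    </MiningSchema>
    <Node score="will play">
      <True/>
      <Node score="will play">
        <SimplePredicate field="outlook" operator="equal" value="sunny"/>
      </Node>
    </Node>
  </TreeModel>
</PMML>"""

root = ET.fromstring(doc)

# List the declared input fields and their types.
fields = [
    (f.get("name"), f.get("optype"))
    for f in root.findall("p:DataDictionary/p:DataField", PMML_NS)
]

# Locate the model element and read its metadata.
model = root.find("p:TreeModel", PMML_NS)
model_name = model.get("modelName")
```

A consumer that understands the format only needs this file, not the code that trained the model, which is the interoperability point of the slide.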
HTTP REST API
• Convention for RPC-like request / response
• HTTP verbs, transport
• GET : query
• POST : add input
• Easy from browser, CLI, Java, Python, Scala, etc.

26
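
As a rough illustration of calling such an API from Python, the sketch below builds a `/recommend/{user}` URL and parses a plain-text response of quoted item names and scores, as in this slide's sample exchange. The base URL is a placeholder, and the helper names are invented here; only the path and response layout come from the slide.

```python
import csv
import io
from urllib.parse import quote
from urllib.request import urlopen


def recommend_url(base_url, user_id):
    """Build the GET URL for one user's recommendations."""
    return f"{base_url.rstrip('/')}/recommend/{quote(user_id, safe='')}"


def parse_recommendations(body):
    """Parse lines such as '"Fleet Foxes",0.7905' into (item, score) pairs."""
    rows = csv.reader(io.StringIO(body))
    return [(row[0], float(row[1])) for row in rows if row]


def fetch_recommendations(base_url, user_id):
    """Issue the GET request and parse the text/plain response body."""
    with urlopen(recommend_url(base_url, user_id)) as response:
        return parse_recommendations(response.read().decode("utf-8"))


# Hypothetical server address; no request is sent by this line.
url = recommend_url("http://localhost:8080", "jwills")
sample = parse_recommendations('"Ray LaMontagne",0.951\n"Fleet Foxes",0.7905\n')
```
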

GET /recommend/jwills

HTTP/1.1 200 OK
Content-Type: text/plain
"Ray LaMontagne",0.951
"Fleet Foxes",0.7905
"The National",0.688
"Shearwater",0.3017
Wish List
• Revamp workflow
  • Oozie?
  • Spark / Crunch-like API, not raw M/R
• De-emphasize model building
  • Well-solved
  • Bring your own
• Emphasize integration
  • PMML, etc.
• More component-ized
• Less black-box service
• More “push” options
  • Flume?
• “Pull” options
  • Kafka?
  • Hive / Impala?

27
Open Source

github.com/cloudera/oryx
100% Apache License 2.0

28

Editor's Notes

  • #3: Raymond Carver anyone?
  • #19: http://knowyourmeme.com/memes/why-not-both-why-dont-we-have-both
  • #20: Why the name lambda? Don’t see a connection to lambda calculus.