© 2014 MapR Technologies 1
© 2014 MapR Technologies 2
Me, Us
• Ted Dunning, MapR Chief Application Architect, Apache Member
– Committer and PMC member: ZooKeeper, Drill, others
– Mentor for Flink, Beam (nee Dataflow), Drill, Storm, Zeppelin
– VP Incubator
– Bought the beer at the first HUG
• MapR
– Produces first converged platform for big and fast data
– Includes data platform (files, streams, tables) + open source
– Adds major technology for performance, HA, industry standard APIs
• Contact
@ted_dunning, ted.dunning@gmail.com, tdunning@mapr.com
© 2014 MapR Technologies 3
New book on Apache Flink
Download free pdf
courtesy of MapR Technologies
mapr.com/flink-book
© 2014 MapR Technologies 4
Agenda
• Why a streaming-first architecture?
• What does fast mean?
• How do I make something fast?
• Minor pause for reality check
• First steps … heavy bottlenecks
• Real results
• Deeper insights
© 2014 MapR Technologies 5
Is this really a
revolutionary moment?
© 2014 MapR Technologies 6
Scenario:
Profile Database
© 2014 MapR Technologies 7
The task
?
POS 1
location, t, card #
yes/no?
POS 2
location, t, card #
yes/no?
© 2014 MapR Technologies 8
Traditional Solution
POS
1..n
Fraud
detector
Last card
use
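A minimal sketch of the traditional design, in which the fraud detector both reads and writes the shared "last card use" table. The slides do not say which rule the detector applies; the implausible-travel-speed rule, the 900 km/h threshold, and the names `LastUse` and `FraudDetector` are assumptions made only for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class LastUse:
    lat: float
    lon: float
    t: float  # seconds since some epoch


def distance_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points (haversine formula)."""
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = p2 - p1
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))


class FraudDetector:
    MAX_SPEED_KMH = 900.0  # hypothetical threshold, not from the slides

    def __init__(self):
        # card number -> LastUse; stands in for the shared database
        self.last_use = {}

    def check(self, card, lat, lon, t):
        """Answer yes/no for one POS request, then record this use."""
        prev = self.last_use.get(card)
        ok = True
        if prev is not None:
            hours = max((t - prev.t) / 3600.0, 1e-9)
            speed = distance_km(prev.lat, prev.lon, lat, lon) / hours
            ok = speed <= self.MAX_SPEED_KMH
        # The detector owns the write path too, which is what couples
        # every new detector instance to the same table.
        self.last_use[card] = LastUse(lat, lon, t)
        return ok
```

Because `check` reads and writes the same table, every additional detector or consumer of that table shares its load and its schema.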
© 2014 MapR Technologies 9
What Happens Next?
POS
1..n
Fraud
detector
Last card
use
POS
1..n
Fraud
detector
POS
1..n
Fraud
detector
© 2014 MapR Technologies 10
What Happens Next?
POS
1..n
Fraud
detector
Last card
use
POS
1..n
Fraud
detector
POS
1..n
Fraud
detector
© 2014 MapR Technologies 11
How to Get Service Isolation
POS
1..n
Fraud
detector
Last card
use
Updater
card activity
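The isolation idea can be sketched as follows: the detector only reads the last-use table and appends events to a card activity stream, while a separate updater consumes that stream and owns all writes to the table. `CardActivityStream`, `read_from`, and the pluggable `rule` are hypothetical stand-ins, not the MapR Streams or Flink API.

```python
class CardActivityStream:
    """Stand-in for a persistent, replayable stream (hypothetical API)."""

    def __init__(self):
        self._log = []

    def append(self, event):
        self._log.append(event)

    def read_from(self, offset):
        """Return all events at or after the given offset."""
        return self._log[offset:]


class FraudDetector:
    def __init__(self, stream, last_use, rule=lambda prev, event: True):
        self.stream = stream
        self.last_use = last_use  # the detector only ever reads this table
        self.rule = rule          # placeholder decision rule (assumption)

    def check(self, card, location, t):
        event = {"card": card, "location": location, "t": t}
        ok = self.rule(self.last_use.get(card), event)
        self.stream.append(event)  # publish; never write the table directly
        return ok


class Updater:
    """Consumes card activity and maintains the last-use table."""

    def __init__(self, stream, table):
        self.stream = stream
        self.table = table
        self.offset = 0  # each consumer tracks its own position

    def poll(self):
        events = self.stream.read_from(self.offset)
        for e in events:
            self.table[e["card"]] = (e["location"], e["t"])
        self.offset += len(events)
        return len(events)
```

Since each consumer keeps its own offset, a second consumer (for example one building card location history) can read the same stream from offset 0 without touching the detector or the updater.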
© 2014 MapR Technologies 12
New Uses of Data
POS
1..n
Fraud
detector
Last card
use
Updater
Card
location
history
Other
card activity
© 2014 MapR Technologies 13
Scaling Through Isolation
POS
1..n
Last card
use
Updater
POS
1..n
Last card
use
Updater
card activity
Fraud
detector
Fraud
detector
© 2014 MapR Technologies 14
For this to work (socially),
streaming has to be faster
than almost any requirement
© 2014 MapR Technologies 15
So how do we make something
go really fast?
© 2014 MapR Technologies 16
Make some data
move it
Process it
© 2014 MapR Technologies 17
Make some data
move it
Process it
move it
World domination
© 2014 MapR Technologies 18
Well, perhaps not quite so
simple?
© 2014 MapR Technologies 19
Interactive recommendation
query
db
Off-line analysis
Real-time event
source
Recent
history
Item
linkage
Search Recommendations
Cooccurrence
analysis
Long-term history
queue
queue
Recommendations
© 2014 MapR Technologies 20
mySQL
mySQL
files
Web-site
Auth
service
Upload
service
Image
extractor
Transcoder
User
profiles
Search
User action
logging
Recommendation
analysis
mySQL
mySQL
mySQL
Oracle
Solr
Elastic
User Generated Content
© 2014 MapR Technologies 21
Yahoo Streaming Benchmark
Ad server Filter
Group by
campaign
impressions
Campaign
info
Count
impressions Results
Project
Augment
Window by event time
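The pipeline stages named on this slide can be sketched in plain Python. This is not Flink code; the field names (`event_type`, `ad_id`, `event_time`), the "view" filter value, and the 10-second tumbling window are assumptions for illustration.

```python
from collections import Counter

WINDOW_MS = 10_000  # assumed tumbling-window size in milliseconds


def count_impressions(events, ad_to_campaign, window_ms=WINDOW_MS):
    """Count impressions per (campaign, event-time window)."""
    counts = Counter()
    for e in events:
        # Filter: keep only impression ("view") events.
        if e["event_type"] != "view":
            continue
        # Project: keep only the fields the later stages need.
        ad_id, event_time = e["ad_id"], e["event_time"]
        # Augment: look up campaign info for the ad.
        campaign = ad_to_campaign.get(ad_id)
        if campaign is None:
            continue
        # Window by event time: align to the start of the tumbling window.
        window_start = event_time - event_time % window_ms
        # Group by campaign and count.
        counts[(campaign, window_start)] += 1
    return counts
```

In a distributed run, the group-by step is where the shuffle happens: events for one campaign must reach the same counter regardless of which partition they arrived on.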
© 2014 MapR Technologies 22
Ad server Filter
Group by
campaign
impressions
Campaign
info
Count
impressions Results
Project
Augment
Window by event time
Client lock
Partitions
Threads/machine
Shuffle
© 2014 MapR Technologies 23
Ad server Filter
Group by
campaign
impressions
Campaign
info
Count
impressions Results
Project
Augment
Window by event time
Client lock
Partitions
Threads/machine
Shuffle
Threads/machine
Shuffle
© 2014 MapR Technologies 24
Ad server Filter
Group by
campaign
impressions
Campaign
info
Count
impressions Results
Project
Augment
Window by event time
Ad server impressions Filter Project
Augment
Group by
campaign
Count
impressions
Client lock
Partitions
Threads/machine
Shuffle
Threads/machine
Shuffle
© 2014 MapR Technologies 25
What we do at MapR
© 2014 MapR Technologies 26
Evolution of Data Storage
Functionality
Compatibility
Scalability
Linux
POSIX
Over decades of progress,
Unix-based systems have set the
standard for compatibility and
functionality
© 2014 MapR Technologies 27
Functionality
Compatibility
Scalability
Linux
POSIX
Hadoop
Hadoop achieves much higher
scalability by trading away
essentially all of this compatibility
Evolution of Data Storage
© 2014 MapR Technologies 28
Evolution of Data Storage
Functionality
Compatibility
Scalability
Linux
POSIX
Hadoop
MapR enhanced Apache Hadoop by
restoring the compatibility while
increasing scalability and performance
Functionality
Compatibility
Scalability
POSIX
© 2014 MapR Technologies 29
Functionality
Compatibility
Scalability
Linux
POSIX
Hadoop
Evolution of Data Storage
Adding converged tables and streams
enhances the functionality of the base
file system
© 2014 MapR Technologies 30
http://bit.ly/fastest-big-data
© 2014 MapR Technologies 31
Key Ideas
• Convergence of files, tables, streams into single platform
– All forms of persistence share common implementation base
• Very high abstraction from hardware … no need to provision
clusters for tables and files
– Common disaster recovery, security, availability models for files,
directories, tables and streams
• Very high performance levels
© 2014 MapR Technologies 32
Key Issues
• MapR itself is heavily threaded internally (as many as 50k
threads/core)
• MapR client can have multiple internal threads
• Ordering boundaries require serialization, locks or memory
contention
– At client level and also within single stream/topic/partition
• Replication, splitting, data location completely automated by default; explicit control is available
• MapR Streams and Flink run in the same cluster, but some shuffles are still required
© 2014 MapR Technologies 33
Initial Configuration
• 10 nodes in cluster
• 1 Flink task manager / node
• 72 partitions in impressions stream
• Each task manager spawns 72 generator threads
[Diagram: ad servers writing impressions; 10x72 threads, 72 partitions]
• At full speed, partition insert points wander around cluster to
avoid hot-spotting
• MapR client connection shared by all threads in task
manager. Having more client connections could help
© 2014 MapR Technologies 34
Tuning #1
• A large number of threads sharing a single client connection per node caused massive contention at the serialization point inside the client
• Switched to 3 Flink task managers per node
• 2 task managers each run 1 producer thread
– More data pushed by 1 thread than previously sent by 72
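A toy model of why fewer producer threads can help. It does not measure throughput; it only counts trips through a single lock. `ToyClient` is hypothetical and is not the MapR client, and batching inside one thread is an assumption about how a single producer can push more data per trip through the serialization point.

```python
import threading


class ToyClient:
    """Hypothetical client with one serialization point (a single lock)."""

    def __init__(self):
        self._lock = threading.Lock()
        self.lock_acquisitions = 0
        self.messages_sent = 0

    def send(self, batch):
        with self._lock:  # every caller funnels through here
            self.lock_acquisitions += 1
            self.messages_sent += len(batch)


def many_threads(client, n_threads, msgs_per_thread):
    """Each thread sends its messages one at a time."""
    def work():
        for i in range(msgs_per_thread):
            client.send([i])

    threads = [threading.Thread(target=work) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()


def one_batching_thread(client, n_msgs, batch_size):
    """A single producer buffers messages and hands them over in batches."""
    buf = []
    for i in range(n_msgs):
        buf.append(i)
        if len(buf) == batch_size:
            client.send(buf)
            buf = []
    if buf:
        client.send(buf)  # flush the final partial batch
```

With 72 threads sending 100 messages each, the lock is taken 7,200 times and the threads compete for it; one thread sending the same 7,200 messages in batches of 500 takes it 15 times with no competition.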
© 2014 MapR Technologies 35
Tuning #2
• Effective cluster-wide parallelism limited by 72 partitions in
stream
• Increasing to 300 partitions substantially improved performance
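The partition arithmetic can be checked with a short sketch. It assumes partitions are spread as evenly as possible across nodes and that a partition is read by at most one consumer thread at a time (the usual Kafka-style model); neither assumption is stated on the slide, and both function names are hypothetical.

```python
def partitions_per_node(partitions, nodes):
    """Spread partitions as evenly as possible across nodes."""
    base, extra = divmod(partitions, nodes)
    return [base + 1] * extra + [base] * (nodes - extra)


def usable_parallelism(partitions, threads):
    """Threads beyond the partition count have nothing to read."""
    return min(partitions, threads)
```

Under these assumptions, 72 partitions on 10 nodes gives an uneven 7 or 8 partitions per node, and at most 72 of the 10x72 threads can be busy. With 300 partitions, each node gets exactly 30.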
© 2014 MapR Technologies 36
The consumer
• Initial tuning had 72 consumer threads per node
• Final tuning used a single consumer thread per Flink task manager
Filter
Campaign
info
Project
Augment
Filter Project
Augment
© 2014 MapR Technologies 37
The Shuffle / Group-by
• Shuffles were also run by the single consumer task manager
• Even with shuffle, consumer processes balanced producer processes
Group by
campaign
Count
impressions Results
Window by event time
Group by
campaign
Count
impressions
© 2014 MapR Technologies 38
Tuning #3
• In separate experiments, the number of campaigns was increased from the original 100 to 1e6
• This caused the bottleneck to shift massively to the data export step
• Serving results directly from Flink memory avoids this step
© 2014 MapR Technologies 39
Final Comparisons
[Bar chart: transactions / second (millions), axis 0 to 15, comparing "Flink on MapR, no tuning" with "Flink on MapR, tuned"]
Final result for tuning was 250% improvement
No serious optimization was required, however
© 2014 MapR Technologies 40
The Moral
• Default of 10 partitions per topic is fine for large-scale multi-tenancy, but special-purpose applications may need tuning to higher levels (we ended up with 30 partitions per node)
• Asynchronous client gives effective threading with a small number of producer threads; a large number of producer threads was counter-productive
• Net speedup of 250% with tuning, so far
• Gut feel is that there is ~4x more performance still to come
© 2014 MapR Technologies 41
Me, Us
• Ted Dunning, MapR Chief Application Architect, Apache Member
– Committer and PMC member: ZooKeeper, Drill, others
– Mentor for Flink, Beam (nee Dataflow), Drill, Storm, Zeppelin
– VP Incubator
– Bought the beer at the first HUG
• MapR (www.mapr.com)
– Produces first converged platform for big and fast data
– Includes data platform (files, streams, tables) + open source
– Adds major technology for performance, HA, industry standard APIs
• Contact
@ted_dunning, ted.dunning@gmail.com, tdunning@mapr.com
© 2014 MapR Technologies 42
New book on Apache Flink
Download free pdf
courtesy of MapR Technologies
mapr.com/flink-book
© 2014 MapR Technologies 43
Streaming Architecture
by Ted Dunning and Ellen Friedman © 2016 (published by O’Reilly)
Free signed hard copies at MapR booth at Flink Forward
http://bit.ly/mapr-ebook-streams
© 2014 MapR Technologies 44
Short Books by Ted Dunning & Ellen Friedman
• Published by O’Reilly in 2014 - 2016
• For sale from Amazon or O’Reilly
• Free e-books currently available courtesy of MapR
Download pdfs: mapr.com/ebooks-pdf
© 2014 MapR Technologies 45
Thank You!
© 2014 MapR Technologies 46
Q&A
@mapr maprtech
tdunning@maprtech.com
Engage with us!
MapR
maprtech
mapr-technologies

Ted Dunning-Faster and Furiouser- Flink Drift
