© 2017 MapR Technologies 1
Update on t-digest
Contact Information
Ted Dunning, PhD
Chief Application Architect, MapR Technologies
Board member, Apache Software Foundation
O’Reilly author
Email tdunning@mapr.com tdunning@apache.org
Twitter @ted_dunning
Who We Are
• MapR Technologies
– We make a kick-ass platform for big data computing
– Support many workloads including Hadoop / Spark / HPC / Other
– Extended to allow streams and tables in basic platform
– Free for academic research / training
• Apache Software Foundation
– Culture hub for building open source communities
– Shared values around openness for contribution as well as use
– Many major projects are part of Apache
– Even more minor ones!
Basic Outline
• Why we should measure distributions
• Basic Ideas
• How t-digest works
• Recent results
• Applications
Why Is This Practically Important
• The novice came to the master and said “something is broken”
• The master replied “What has changed?”
• And the student was enlightened
Finding change is key
but what kind?
Last Night’s Latencies
• These are ping latencies from my hotel
• Looks pretty good, right?
• But what about longer term?
208.302
198.571
185.099
191.258
201.392
214.738
197.389
187.749
201.693
186.762
185.296
186.390
183.960
188.060
190.763
> mean(y$t[i])
[1] 198.6047
> sd(y$t[i])
[1] 71.43965
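As a quick check, the fifteen readings listed on the slide can be summarized on their own. This is a minimal sketch in Python rather than the R used on the slide, and the variable names are illustrative. The listed readings alone have a standard deviation of roughly 9 ms, far below the 71.4 ms reported for `y$t[i]`, which suggests the reported figures cover more samples than the fifteen shown and that a few of those samples are very large.

```python
import statistics

# The fifteen ping latencies (ms) listed on the slide.
latencies_ms = [
    208.302, 198.571, 185.099, 191.258, 201.392,
    214.738, 197.389, 187.749, 201.693, 186.762,
    185.296, 186.390, 183.960, 188.060, 190.763,
]

sample_mean = statistics.mean(latencies_ms)
# statistics.stdev is the sample standard deviation (n - 1 divisor), like R's sd().
sample_sd = statistics.stdev(latencies_ms)

print(f"mean = {sample_mean:.3f} ms, sd = {sample_sd:.3f} ms")
```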
Not So Fast …
This is long-tailed land
You have to know the distribution
of values
A single number
is simply not enough
What We Really Need Here
• I want to be able to compute the distribution from any time period
• From any subset of measurements
• With lots of keys and filters
• And not a lot of space
• Basically, any OLAP kind of query
select distribution(x) from … where … group by y,z
Idea 0 – Pre-defined bins
• So let’s assume we have bins
– Upper, lower bound, constant width
• Get a measurement, pick a bin, increment count
• Works great if you know the data
– And you have limited dynamic range (otherwise you need too many bins)
– And the distribution is fixed
• Useful, but not general enough
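The pre-defined bin scheme above can be sketched in a few lines. This is a minimal illustration in Python, not code from any library; the function name and the choice to clamp out-of-range values into the end bins are assumptions.

```python
from typing import Iterable, List


def fixed_bin_counts(values: Iterable[float], lower: float, upper: float,
                     n_bins: int) -> List[int]:
    """Count values into n_bins equal-width bins covering [lower, upper)."""
    if not upper > lower or n_bins <= 0:
        raise ValueError("need upper > lower and n_bins > 0")
    width = (upper - lower) / n_bins
    counts = [0] * n_bins
    for x in values:
        i = int((x - lower) / width)
        # Assumption: out-of-range values are clamped into the first or last bin.
        i = max(0, min(n_bins - 1, i))
        counts[i] += 1
    return counts


# 20 bins of 100 ms each: 1.0 ms and 1.1 ms land in the same bin,
# while 1000 ms and 1100 ms are separated.
counts = fixed_bin_counts([1.0, 1.1, 1000.0, 1100.0], 0.0, 2000.0, 20)
print(counts)
```

Resolving 1.0 ms from 1.1 ms with constant-width bins over the same range would take on the order of 20,000 bins, which is the dynamic-range problem the slide points at.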
Idea 1 – Exponential Bins
• Suppose we want relative accuracy in measurement space
• Latencies are positive and only matter within a few percent
– 1.1 ms versus 1.0 ms
– 1100 ms versus 1000 ms
• We can cheat by using floating point representations
– Compute bin using magic
– Count
FloatHistogram
• Assume all measurements are in the range
• Divide this range into power of 2 sub-ranges
• Sub-divide each sub-range evenly with steps
– is typical
• Relative error is bounded in measurement space
• Bin index can be computed using FP representation!
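One way to compute a bin index from the floating point representation is to keep the exponent and the top few mantissa bits of an IEEE 754 double. The sketch below is in Python and is not the code of the t-digest library's FloatHistogram; the function names and the `mantissa_bits` parameter are hypothetical. With `mantissa_bits = 4`, each power-of-2 sub-range is split into 16 even steps, so a bin is never wider than 1/16 of its lower bound.

```python
import struct


def float_bin(x: float, mantissa_bits: int = 4) -> int:
    """Bin index for a positive, finite double: its exponent bits followed by
    the top `mantissa_bits` bits of its mantissa."""
    if not (0.0 < x < float("inf")):
        raise ValueError("x must be positive and finite")
    bits = struct.unpack("<Q", struct.pack("<d", x))[0]
    # A double has 52 mantissa bits; drop all but the top few.
    return bits >> (52 - mantissa_bits)


def bin_lower_bound(index: int, mantissa_bits: int = 4) -> float:
    """Smallest double that maps to the given bin index."""
    bits = index << (52 - mantissa_bits)
    return struct.unpack("<d", struct.pack("<Q", bits))[0]


for x in (1.0, 1.1, 1000.0, 1100.0):
    i = float_bin(x)
    print(x, i, bin_lower_bound(i))
```

Because positive doubles sort in the same order as their bit patterns, the index is monotone in `x`, and counting only needs an array indexed by `float_bin(x)` minus the index of the smallest expected value.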
Fixed Size Bins
Approximate Exponential Bins
Non-linear bins are better
(sometimes)
Still not general enough
Idea 2 – Fully Adaptive Bins
• First intuition – in general, we want accuracy in terms of percentile
• Second intuition – we want better accuracy at extreme quantiles
– 50%-ile versus 50.1%-ile?
– What does 0.1% error even mean for the 99.99th percentile?
• We need bins with small counts near the edges
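A small worked example makes the second intuition concrete. The sketch below (Python, with a hypothetical helper name) expresses a fixed 0.1% quantile error as a multiple of the tail mass 1 - q: at the median it is a tiny fraction of the tail, but at the 99.99th percentile it is about ten times larger than the entire tail.

```python
def tail_error_ratio(q: float, abs_error: float) -> float:
    """Absolute quantile error expressed as a multiple of the tail mass 1 - q."""
    if not 0.0 <= q < 1.0:
        raise ValueError("q must be in [0, 1)")
    return abs_error / (1.0 - q)


for q in (0.5, 0.99, 0.9999):
    ratio = tail_error_ratio(q, 0.001)
    print(f"q = {q}: a 0.1% error is {ratio:.3g} times the tail mass")
```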
First 1% of data shown.
Left graph has 100 x 100 sample bins.
Right graph has ~130 bins, variable size
The Basic t-digest
• Take a bunch of data
• Sort it
• Group into bins
– But make the bins smaller at the beginning and end
• Remember the centroid and count of each bin
• That’s a t-digest
But Wait, You Need a Bit More
• Take a bunch of new data, old t-digest
• Sort the data and the old bins together
• Group into bins
– Note that existing bins have bigger weights
– So they might survive … or might clump
• Remember the centroid and count of each new bin
• That’s an updated t-digest
Oh … and Merging
• Take a bunch of old t-digests
• Sort the bins
• Group into mega-bins
– Respect the size constraint
• Remember the centroid and count of each new bin
• That’s a merged t-digest
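Building, updating, and merging as described above can all share one sort-and-group pass. The Python sketch below is an illustration under stated assumptions, not the library's implementation: the size constraint is modeled with an arcsine scale function k(q), a bin is allowed to grow only while its k-size stays below 1 (following the "merge if k-size < 1" rule on the backup slide), and the names `k_scale`, `regroup`, `build`, `update`, `merge`, and the parameter `delta` are hypothetical.

```python
import math
from typing import Iterable, List, Tuple

Centroid = Tuple[float, float]  # (mean, weight)


def k_scale(q: float, delta: float) -> float:
    """Assumed scale function: steep near q = 0 and q = 1, flat in the middle."""
    q = min(1.0, max(0.0, q))
    return delta / (2.0 * math.pi) * math.asin(2.0 * q - 1.0)


def regroup(centroids: Iterable[Centroid], delta: float = 100.0) -> List[Centroid]:
    """Sort centroids by mean and greedily group neighbors while the group
    spans less than one unit of k."""
    items = sorted(centroids)
    if not items:
        return []
    total = sum(w for _, w in items)
    out: List[Centroid] = []
    mean, weight = items[0]
    done = 0.0  # weight in bins already emitted
    k_left = k_scale(0.0, delta)
    for m, w in items[1:]:
        q_right = (done + weight + w) / total
        if k_scale(q_right, delta) - k_left < 1.0:
            # Absorb the item and update the weighted mean of the bin.
            weight += w
            mean += (m - mean) * w / weight
        else:
            out.append((mean, weight))
            done += weight
            k_left = k_scale(done / total, delta)
            mean, weight = m, w
    out.append((mean, weight))
    return out


def build(data: Iterable[float], delta: float = 100.0) -> List[Centroid]:
    return regroup(((x, 1.0) for x in data), delta)


def update(digest: List[Centroid], new_data: Iterable[float],
           delta: float = 100.0) -> List[Centroid]:
    return regroup(list(digest) + [(x, 1.0) for x in new_data], delta)


def merge(digests: Iterable[List[Centroid]], delta: float = 100.0) -> List[Centroid]:
    return regroup((c for d in digests for c in d), delta)


digest = build(i / 10000 for i in range(10000))
print(len(digest), digest[0], digest[len(digest) // 2])
```

With this scale function the first and last bins hold only a handful of samples while the middle bins hold a few hundred, and the number of bins stays bounded by roughly `delta` regardless of how much data is added.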
Adaptive non-linear bins are good
and general
And can be grouped
and regrouped
Results
Status
• Current release (3.x)
– Small accuracy bugs in corner cases
– Best overall is still AVLTreeDigest
• Upcoming release (4.0)
– Better accuracy in pathological cases
– Strictly bounded size
– No dynamic allocation (with MergingDigest)
– Good speed (100ns for MergingDigest, 5ns for FloatHistogram)
– Real Soon Now
Example Application
• The data:
– ~ 1 million machines
– Even more services
– Each producing thousands of measurements per second
• Store t-digest for each 5 minute period for each measurement
• Want to query any combination of keys, produce t-digest result
“what was the distribution of launch times yesterday?”
“what about last month?”
“in Europe versus in North America versus in Asia?”
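The query pattern above amounts to selecting stored digests by key and time window, then merging each group. The Python sketch below is only an illustration: the key layout `(metric, region, window_start)`, the function names, and the placeholder `concat_merge` (which concatenates centroids without compressing them) are all assumptions, and a real system would plug in a proper t-digest merge.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Tuple

Centroid = Tuple[float, float]   # (mean, weight)
Digest = List[Centroid]
Key = Tuple[str, str, int]       # hypothetical: (metric, region, window start in seconds)


def concat_merge(digests: Iterable[Digest]) -> Digest:
    """Placeholder merge: sorted concatenation of centroids, with no compression."""
    return sorted(c for d in digests for c in d)


def rollup(store: Dict[Key, Digest], metric: str, start: int, end: int,
           merge: Callable[[Iterable[Digest]], Digest] = concat_merge
           ) -> Dict[str, Digest]:
    """One merged digest per region for `metric` over windows in [start, end)."""
    groups: Dict[str, List[Digest]] = defaultdict(list)
    for (m, region, window), digest in store.items():
        if m == metric and start <= window < end:
            groups[region].append(digest)
    return {region: merge(ds) for region, ds in groups.items()}


store = {
    ("launch_ms", "eu", 0): [(1.0, 1.0)],
    ("launch_ms", "eu", 300): [(2.0, 1.0)],
    ("launch_ms", "na", 0): [(3.0, 2.0)],
    ("launch_ms", "eu", 600): [(5.0, 1.0)],
    ("other", "eu", 0): [(9.0, 1.0)],
}
print(rollup(store, "launch_ms", 0, 600))
```

Because merged digests are themselves digests, the same roll-up works for a day, a month, or any combination of keys without going back to the raw measurements.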
Collect Data
(Diagram labels: data center; web server; web-server log; log-stash; log_events; log consolidator)
And Transport to Global Analytics
(Diagram labels: data center with web servers, web-server logs, log-stash, log_events, and log consolidator; GHQ with log_events, events, Elaborate events (log-stash), Aggregate, and Signal detection)
With Many Sources
(Diagram: the data center block (web servers, web-server logs, log-stash, log_events, log consolidator) repeated three times, with one GHQ block (log_events, events, Elaborate events (log-stash), Aggregate, Signal detection))
What about visualization?
Can’t see small count bars
Good Results
Bad Results – 1% of measurements are 3x bigger
With Better Vertical Scaling
Uniform Bins
FloatHistogram Bins
With FloatHistogram
Original Ping Latency Data
Summary
• Single measurements insufficient, need distributions
• Uniform binned histograms not good
• FloatHistogram for some cases
• T-digest for general cases
• Upcoming release has super-fast and accurate versions
• Good visualization also key
(Plot: k versus q, with q from 0.0 to 1.0 and k from 0 to 10)
Q & A
T-digest
• Or we can talk about small errors in q
• Accumulate samples, sort, merge
• Merge if k-size < 1
• Interpolate using centroids in x
• Very good near extremes, no dynamic allocation
(Plot: k versus q, with q from 0.0 to 1.0 and k from 0 to 10)
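The "interpolate using centroids in x" step can be sketched as follows. This is a simplified Python illustration, not the library's algorithm: it assumes each centroid's mean sits at the middle of its weight and interpolates linearly between neighboring centroids, while a production implementation would also track the exact minimum and maximum and treat single-sample centroids specially. The function name is hypothetical.

```python
from typing import List, Tuple

Centroid = Tuple[float, float]  # (mean, weight), sorted by mean


def quantile(digest: List[Centroid], q: float) -> float:
    """Estimate the q-quantile by linear interpolation between centroid means."""
    if not digest:
        raise ValueError("empty digest")
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be in [0, 1]")
    total = sum(w for _, w in digest)
    target = q * total
    # Cumulative weight at the center of each centroid.
    centers = []
    seen = 0.0
    for mean, w in digest:
        centers.append((seen + w / 2.0, mean))
        seen += w
    if target <= centers[0][0]:
        return centers[0][1]
    for (c0, m0), (c1, m1) in zip(centers, centers[1:]):
        if target <= c1:
            return m0 + (m1 - m0) * (target - c0) / (c1 - c0)
    return centers[-1][1]


example = [(1.0, 1.0), (2.0, 2.0), (4.0, 1.0)]
print(quantile(example, 0.5), quantile(example, 0.6875))
```

Because the centroids near q = 0 and q = 1 carry very little weight, the interpolation error is smallest exactly where the extreme quantiles live.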
