Yakir Buskilla + Itai Yaffe
Nielsen
USING DRUID
FOR INTERACTIVE COUNT-DISTINCT QUERIES AT SCALE
Introduction
Yakir Buskilla
● Software Architect
● Focusing on Big Data and Machine Learning problems
Itai Yaffe
● Big Data Infrastructure Developer
● Dealing with Big Data challenges for the last 5 years
Nielsen Marketing Cloud (NMC)
● eXelate was acquired by Nielsen 2 years ago
● A leader in the Ad Tech and Marketing Tech industry
● What do we do?
○ Data as a Service (DaaS)
○ Software as a Service (SaaS)
NMC high-level architecture
The need
● Nielsen Marketing Cloud business question
○ How many unique devices have we encountered:
■ over a given date range
■ for a given set of attributes (segments, regions, etc.)
● Find, in real time, the number of distinct elements in a data stream
that may contain repeated elements
Possible solutions
● Naive: store everything
● Bit vector: store only 1 bit per device
○ 10B devices: 1.25 GB/day
○ 10B devices * 80K attributes: 100 TB/day
● Approximate
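The bit-vector figures above can be checked with a few lines of arithmetic. This is only a back-of-the-envelope sketch: it assumes decimal units (1 GB = 10^9 bytes) and one bit vector per attribute per day, neither of which the slides state explicitly.

```python
# Back-of-the-envelope check of the bit-vector sizes quoted on the slide.
DEVICES = 10_000_000_000   # 10B devices (from the slide)
ATTRIBUTES = 80_000        # 80K attributes (from the slide)

# One bit per device, 8 bits per byte.
bytes_per_vector = DEVICES // 8

# Assumption: decimal units, i.e. 1 GB = 10**9 bytes and 1 TB = 10**12 bytes.
gb_per_day = bytes_per_vector / 10**9
tb_per_day_all_attributes = bytes_per_vector * ATTRIBUTES / 10**12

print(f"one bit vector per day: {gb_per_day} GB")
print(f"one bit vector per attribute per day: {tb_per_day_all_attributes} TB")
```

Under those assumptions the script reproduces the slide's 1.25 GB/day and 100 TB/day, which is why an exact bit vector per attribute was not practical.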
Our journey
● Elasticsearch
○ Indexing data
■ Indexing 250 GB of daily data took 10 hours
■ Indexing affected query time
○ Querying
■ Low concurrency
■ Scans on all the shards of the corresponding index
What we tried
● Preprocessing
● Statistical algorithms (e.g. HyperLogLog)
● K Minimum Values (KMV)
● Estimate set cardinality
● Supports set-theoretic operations
● ThetaSketch mathematical framework - generalization of KMV
ThetaSketch
KMV intuition
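The intuition behind KMV can be sketched in a few lines: hash every item to a pseudo-uniform value in [0, 1), keep only the K smallest distinct hashes, and estimate the cardinality from how small the K-th value is. The code below is a minimal illustration, not the ThetaSketch implementation used by Druid. The helper names (`hash01`, `kmv_sketch`, `kmv_estimate`, `kmv_union`), the use of SHA-256, and the `(K - 1) / kth_value` estimator are choices made for this example.

```python
import hashlib

def hash01(item: str) -> float:
    """Map an item to a pseudo-uniform value in [0, 1) using SHA-256."""
    digest = hashlib.sha256(item.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64

def kmv_sketch(items, k):
    """Return the k smallest distinct hash values, sorted ascending.

    For brevity this hashes everything and then truncates; a streaming
    implementation would keep only k values in memory at any time.
    """
    return sorted({hash01(x) for x in items})[:k]

def kmv_estimate(sketch, k):
    """Estimate the number of distinct items from a KMV sketch."""
    if len(sketch) < k:
        return float(len(sketch))      # fewer than k distinct items: exact count
    return (k - 1) / sketch[k - 1]     # k-th smallest value sets the density

def kmv_union(a, b, k):
    """Sketch of the union: the k smallest values across both sketches."""
    return sorted(set(a) | set(b))[:k]

K = 1024
# 100,000 distinct device IDs, each appearing three times in the stream.
stream = [f"device-{i % 100_000}" for i in range(300_000)]
estimate = kmv_estimate(kmv_sketch(stream, K), K)
print(f"estimated distinct devices: {estimate:.0f} (true value: 100000)")
```

Repeated items hash to the same value, so duplicates never change the sketch, and the union of two sketches is itself a valid sketch of the combined stream. That property is what makes set-theoretic operations possible.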
ThetaSketch error
Number of Std Dev      1         2
Confidence Interval    68.27%    95.45%
K = 16,384             0.78%     1.56%
K = 32,768             0.55%     1.10%
K = 65,536             0.39%     0.78%
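The error figures above are consistent with a relative standard error of roughly 1/sqrt(K), scaled by the number of standard deviations. The slides do not state this formula, so treat the sketch below as an observation about the table rather than the library's exact error bound. The function name `relative_error` is made up for this example.

```python
import math

def relative_error(k: int, num_std_dev: int = 1) -> float:
    """Approximate relative error of a sketch of size k (assumed ~ 1/sqrt(k))."""
    return num_std_dev / math.sqrt(k)

# Print one row per sketch size: K, error at 1 std dev, error at 2 std devs.
for k in (16_384, 32_768, 65_536):
    print(f"{k:>6,}  {relative_error(k, 1):.2%}  {relative_error(k, 2):.2%}")
```

Doubling K shrinks the error only by a factor of about 1.41, while memory and storage grow linearly with K, which is the trade-off behind choosing a sketch size.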
“Very fast highly scalable columnar data-store”
DRUID
Roll-up
Timestamp     Attribute   Device ID
2016-11-15    11111       3a4c1f2d84a5c179435c1fea86e6ae02
2016-11-15    22222       3a4c1f2d84a5c179435c1fea86e6ae02
2016-11-15    11111       5dd59f9bd068f802a7c6dd832bf60d02
2016-11-15    22222       5dd59f9bd068f802a7c6dd832bf60d02
2016-11-15    33333       5dd59f9bd068f802a7c6dd832bf60d02
ThetaSketchAggregator
Timestamp     Attribute   Count Distinct
2016-11-15    11111       2
2016-11-15    22222       2
2016-11-15    33333       1
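The roll-up above can be reproduced with a small script. This is only an illustration of the grouping: exact Python sets stand in for the Theta sketches that Druid builds at ingestion time, and the attribute in the last raw row is taken to be 33333 so that it matches the rolled-up table. The function name `roll_up` is hypothetical.

```python
from collections import defaultdict

# Raw events from the slide: (timestamp, attribute, device ID).
raw_rows = [
    ("2016-11-15", "11111", "3a4c1f2d84a5c179435c1fea86e6ae02"),
    ("2016-11-15", "22222", "3a4c1f2d84a5c179435c1fea86e6ae02"),
    ("2016-11-15", "11111", "5dd59f9bd068f802a7c6dd832bf60d02"),
    ("2016-11-15", "22222", "5dd59f9bd068f802a7c6dd832bf60d02"),
    ("2016-11-15", "33333", "5dd59f9bd068f802a7c6dd832bf60d02"),
]

def roll_up(rows):
    """Group rows by (timestamp, attribute) and count distinct device IDs.

    An exact set per group is used here; Druid would store a sketch instead,
    so each rolled-up row stays small regardless of the number of devices.
    """
    devices = defaultdict(set)
    for timestamp, attribute, device_id in rows:
        devices[(timestamp, attribute)].add(device_id)
    return {key: len(ids) for key, ids in devices.items()}

rolled = roll_up(raw_rows)
for (timestamp, attribute), count in sorted(rolled.items()):
    print(timestamp, attribute, count)
```

Five raw rows collapse into three rolled-up rows, one per (timestamp, attribute) pair.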
Druid architecture
How do we use Druid
Guidelines and pitfalls
● Setup is not easy
Guidelines and pitfalls
● Monitoring your system
Guidelines and pitfalls
● Data modeling
○ Reduce the number of intersections
○ Different datasources for different use cases
Old data model
Timestamp     Attribute        Count Distinct
2016-11-15    US               XXXXXX
2016-11-15    Porsche Intent   XXXXXX
...           ...              ...
New data model
Timestamp     Attribute        Region   Count Distinct
2016-11-15    Porsche Intent   US       XXXXXX
...           ...              ...      ...
Guidelines and pitfalls
● Query optimization
○ Combine multiple queries into single query
○ Use filters
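As a rough sketch of both guidelines, a single Druid native query can carry several filtered Theta sketch aggregators plus a post-aggregation that intersects them, instead of issuing one query per attribute. The example assumes the DataSketches extension is loaded. The datasource name `device_attributes`, the dimension `attribute`, the sketch column `device_sketch`, and the attribute values are hypothetical.

```python
import json

def attribute_sketch(name: str, value: str) -> dict:
    """Filtered aggregator: a Theta sketch over rows matching one attribute."""
    return {
        "type": "filtered",
        "filter": {"type": "selector", "dimension": "attribute", "value": value},
        "aggregator": {"type": "thetaSketch", "name": name, "fieldName": "device_sketch"},
    }

query = {
    "queryType": "timeseries",
    "dataSource": "device_attributes",            # hypothetical datasource
    "granularity": "all",
    "intervals": ["2016-11-15/2016-11-16"],
    # Top-level filter: only scan rows for the attributes we care about.
    "filter": {
        "type": "or",
        "fields": [
            {"type": "selector", "dimension": "attribute", "value": "11111"},
            {"type": "selector", "dimension": "attribute", "value": "22222"},
        ],
    },
    # Two sketches computed in one pass instead of two separate queries.
    "aggregations": [
        attribute_sketch("devices_a", "11111"),
        attribute_sketch("devices_b", "22222"),
    ],
    # Intersect the two sketches and turn the result into an estimate.
    "postAggregations": [
        {
            "type": "thetaSketchEstimate",
            "name": "devices_a_and_b",
            "field": {
                "type": "thetaSketchSetOp",
                "name": "a_intersect_b",
                "func": "INTERSECT",
                "fields": [
                    {"type": "fieldAccess", "fieldName": "devices_a"},
                    {"type": "fieldAccess", "fieldName": "devices_b"},
                ],
            },
        }
    ],
}

payload = json.dumps(query, indent=2)
print(payload)
```

The resulting JSON would be posted to the broker's query endpoint. The exact host, port, and column names depend on the deployment and are not given in the slides.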
Guidelines and pitfalls
● Batch Ingestion
○ EMR Tuning
■ 140-node cluster
● 85% spot instances => ~80% cost reduction
○ Druid input file format - Parquet vs CSV
■ Reduced indexing time by 4x
■ Reduced storage usage by 10x
Guidelines and pitfalls
● Community
Summary
DRUID            ES
10TB/day         250GB/day
4 Hours/day      10 Hours/day
15GB/day         2.5TB (total)
280ms-350ms      500ms-6000ms
$55K/month       $80K/month
THANK YOU!

Editor's Notes

  • #3: Introduce ourselves and NMC.
  • #4: DaaS: a marketplace for device-level data, connecting buyers and sellers. SaaS: the Nielsen Marketing Cloud platform, which helps brands connect with their customers using our big data sets and our analytics tools.
  • #5: Our serving layer (front end) aggregates data from various online and offline sources. We aggregate around 10B events per day.
  • #6: Past… Mention “cardinality” and “real-time dashboard”. Explain the need to union and intersect.
  • #9: Bit vector: Elasticsearch/Redis is an example of such a system.
  • #10: We tried introducing a new cluster dedicated to indexing only, then using backup and restore to move the data to the second cluster. This method was very expensive and only partially helpful. Tuning for better performance did not help much either.
  • #11: Preprocessing: too many combinations; the formula length is not bounded (show some numbers). HyperLogLog: the implementation in Elasticsearch was too slow (done at query time); set operations increase the error dramatically.
  • #12: Unions and intersections increase the error. The problematic case is the intersection of a very small set with a very big set.
  • #14: The larger the K, the smaller the error. However, a larger K means more memory and storage are needed.
  • #15: So we talked about statistical algorithms, which is nice, but we needed a practical solution… Druid supports the ThetaSketch algorithm out of the box (OOTB).
  • #16: Time-series database: the first thing you need to know about Druid. Column types: timestamp, dimensions, metrics. Together they comprise a datasource. Aggregation is done at ingestion time (the outcome is much smaller in size). At query time, it's closer to a key-value search.
  • #17: We have 3 types of processes: ingestion, querying, and management. All processes are decoupled and scalable. Ingestion (real time, e.g. from Kafka; batch: talk about deep storage and how data is aggregated at ingestion time). Querying (brokers, historicals, query performance during ingestion). Lambda architecture.
  • #18: Explain the tuple and what happens during the aggregation.
  • #19: Setup is not easy. Separate config/servers/tuning. This caused the deployment to take a few months. Use the Druid recommendations for production configuration.
  • #20: Monitoring your system: Druid has built-in support for Graphite (exports many metrics).
  • #21: Data modeling: if using ThetaSketch, reduce the number of intersections (show a slide of the old and new data model). It didn't solve all use cases, but it gives you an idea of how you can approach the problem. Different datasources: e.g. lower accuracy for faster queries vs. higher accuracy with slightly slower queries.
  • #22: Combine multiple queries over the REST API. There can be billions of rows, so filter the data as part of the query.
  • #23: EMR tuning (spot instances (80% cost reduction), Druid MR production config). Use Parquet. The picture here - maybe money??
  • #25: Ingestion doesn't affect queries, and responses are sub-second even for 100s or 1000s of concurrent queries. Cost is for the entire solution (Druid cluster, EMR, etc.). With Druid and ThetaSketch, we've improved our ingestion volume, query performance, and concurrency by an order of magnitude at a lower cost than our old solution (we've achieved a more performant, scalable, cost-effective solution).