Using Druid for interactive count-distinct queries at scale @ NMC

This document discusses Nielsen's use of Druid for interactive count-distinct queries at scale. It describes Nielsen's need to calculate, in real time, the number of unique devices encountered over a given date range and set of attributes, from large volumes of streaming data. Previous attempts using Elasticsearch were slow and inefficient. Druid supports sketch algorithms such as ThetaSketch, which approximate distinct counts quickly while balancing speed and accuracy. It is a columnar data store that rolls data up at ingestion time, so queries run against pre-aggregated sketches. By moving from Elasticsearch to Druid, Nielsen reduced daily ingestion time from 10 hours to 4 hours, query response time from 500-6000 ms to 280-350 ms, and cost from $80K to $55K per month. The document also provides guidelines for setup, monitoring, data modeling, query optimization, and batch ingestion.

Yakir Buskilla + Itai Yaffe
Nielsen
USING DRUID
FOR INTERACTIVE COUNT-DISTINCT QUERIES AT SCALE

Introduction
Yakir Buskilla
● Software Architect
● Focusing on Big Data and Machine Learning problems
Itai Yaffe
● Big Data Infrastructure Developer
● Dealing with Big Data challenges for the last 5 years

Nielsen Marketing Cloud (NMC)
● eXelate was acquired by Nielsen 2 years ago
● A leader in the Ad Tech and Marketing Tech industry
● What do we do?
○ Data as a Service (DaaS)
○ Software as a Service (SaaS)

The need
● Nielsen Marketing Cloud business question
○ How many unique devices have we encountered:
■ over a given date range
■ for a given set of attributes (segments, regions, etc.)
● Find, in real time, the number of distinct elements in a data stream that may contain repeated elements

Possible solutions
● Naive: store everything
● Bit vector: store only 1 bit per device
○ 10B devices: 1.25 GB/day
○ 10B devices * 80K attributes: 100 TB/day
● Approx.: approximate
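
The bit-vector figures on this slide follow from simple arithmetic. The snippet below is a minimal sketch of that calculation, assuming decimal units (1 GB = 10^9 bytes) and one bit per device per attribute per day; the function name is illustrative and not from the talk.

```python
def bit_vector_bytes(devices: int, attributes: int = 1) -> int:
    """Bytes needed to store one bit per (device, attribute) pair."""
    bits = devices * attributes
    return bits // 8  # 8 bits per byte


DEVICES = 10_000_000_000   # 10B devices, from the slide
ATTRIBUTES = 80_000        # 80K attributes, from the slide

per_day_single = bit_vector_bytes(DEVICES)            # one bit vector
per_day_all = bit_vector_bytes(DEVICES, ATTRIBUTES)   # one bit vector per attribute

print(f"{per_day_single / 1e9:.2f} GB/day")   # 1.25 GB/day
print(f"{per_day_all / 1e12:.0f} TB/day")     # 100 TB/day
```

A single vector is cheap, but multiplying by the number of attributes is what makes the exact approach impractical and motivates the approximate option.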

Our journey
● Elasticsearch
○ Indexing data
■ 250 GB of daily data took 10 hours to index
■ Indexing affects query time
○ Querying
■ Low concurrency
■ Scans on all the shards of the corresponding index

What we tried
● Preprocessing
● Statistical algorithms (e.g., HyperLogLog)

ThetaSketch
● K Minimum Values (KMV)
● Estimate set cardinality
● Supports set-theoretic operations
● ThetaSketch mathematical framework - generalization of KMV
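
To make the KMV idea concrete, here is a toy sketch in Python. It is an illustration under stated assumptions, not the DataSketches or Druid implementation: the class `ToyThetaSketch`, the SHA-1 based `uniform_hash`, and the helper names are all hypothetical. Each item is hashed to a value in [0, 1); the sketch keeps at most K of the smallest values below a threshold `theta`, and the estimate is the number of retained values divided by `theta`.

```python
import hashlib
import heapq


def uniform_hash(item: str) -> float:
    """Map an item to a pseudo-uniform value in [0, 1) via SHA-1."""
    digest = hashlib.sha1(item.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


class ToyThetaSketch:
    """Toy KMV-style sketch: retains at most k hash values below theta."""

    def __init__(self, k: int):
        self.k = k
        self.theta = 1.0      # sampling threshold; 1.0 means "keep everything"
        self.hashes = set()   # retained hash values, all below theta
        self._heap = []       # max-heap (negated values) mirroring self.hashes

    def add(self, item: str) -> None:
        x = uniform_hash(item)
        if x >= self.theta or x in self.hashes:
            return            # above the threshold, or a repeated element
        self.hashes.add(x)
        heapq.heappush(self._heap, -x)
        if len(self.hashes) > self.k:
            largest = -heapq.heappop(self._heap)
            self.hashes.remove(largest)
            self.theta = largest   # lower the threshold to the evicted value

    def estimate(self) -> float:
        # Retained values are all the distinct hashes that fell below theta.
        return len(self.hashes) / self.theta


def _from_hashes(k: int, theta: float, hashes) -> ToyThetaSketch:
    """Build a sketch from already-hashed values, trimming to k entries."""
    kept = sorted(x for x in hashes if x < theta)
    if len(kept) > k:
        theta, kept = kept[k], kept[:k]
    out = ToyThetaSketch(k)
    out.theta = theta
    out.hashes = set(kept)
    out._heap = [-x for x in kept]
    heapq.heapify(out._heap)
    return out


def union(a: ToyThetaSketch, b: ToyThetaSketch) -> ToyThetaSketch:
    return _from_hashes(min(a.k, b.k), min(a.theta, b.theta), a.hashes | b.hashes)


def intersect(a: ToyThetaSketch, b: ToyThetaSketch) -> ToyThetaSketch:
    return _from_hashes(min(a.k, b.k), min(a.theta, b.theta), a.hashes & b.hashes)


# Two overlapping device sets: true union is 30,000 and true intersection is 10,000.
a, b = ToyThetaSketch(4096), ToyThetaSketch(4096)
for i in range(20_000):
    a.add(f"device-{i}")             # devices 0..19999
    b.add(f"device-{i + 10_000}")    # devices 10000..29999

print(round(union(a, b).estimate()), round(intersect(a, b).estimate()))
```

The set operations work because both inputs are first restricted to the smaller of the two thresholds, so the retained values remain a consistent sample of the combined stream. The intersection keeps fewer values than either input, which is why its relative error is larger.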

ThetaSketch error
K         1 std dev (68.27% confidence)   2 std dev (95.45% confidence)
16,384    0.78%                           1.56%
32,768    0.55%                           1.10%
65,536    0.39%                           0.78%
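
The error figures on this slide are consistent with a relative standard error of roughly 1/sqrt(K), scaled by the number of standard deviations. The snippet below is a small check under that assumption; the formula is inferred from the listed values rather than stated on the slide, and the function name is illustrative.

```python
import math


def relative_error(k: int, num_std_dev: int = 1) -> float:
    """Approximate relative error of a sketch with k retained entries."""
    return num_std_dev / math.sqrt(k)


for k in (16_384, 32_768, 65_536):
    one_sigma = relative_error(k, 1) * 100
    two_sigma = relative_error(k, 2) * 100
    print(f"{k:>6}  {one_sigma:.2f}%  {two_sigma:.2f}%")
```

Doubling K shrinks the error only by a factor of about 1.41, while the memory and storage per sketch double, which is the trade-off to weigh when choosing a sketch size.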

DRUID
“Very fast highly scalable columnar data-store”

Roll-up
Raw data:
Timestamp    Attribute   Device ID
2016-11-15   11111       3a4c1f2d84a5c179435c1fea86e6ae02
2016-11-15   22222       3a4c1f2d84a5c179435c1fea86e6ae02
2016-11-15   11111       5dd59f9bd068f802a7c6dd832bf60d02
2016-11-15   22222       5dd59f9bd068f802a7c6dd832bf60d02
2016-11-15   33333       5dd59f9bd068f802a7c6dd832bf60d02

After roll-up with ThetaSketchAggregator:
Timestamp    Attribute   Count Distinct
2016-11-15   11111       2
2016-11-15   22222       2
2016-11-15   33333       1
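
The roll-up on this slide can be reproduced with a few lines of Python. This is only a sketch of the grouping logic: it uses exact Python sets as a stand-in for the sketch objects that Druid would store, and the function name `roll_up` is illustrative. The third attribute is written as 33333, matching the rolled-up table.

```python
from collections import defaultdict

# (timestamp, attribute, device ID) rows, as on the slide
raw_events = [
    ("2016-11-15", "11111", "3a4c1f2d84a5c179435c1fea86e6ae02"),
    ("2016-11-15", "22222", "3a4c1f2d84a5c179435c1fea86e6ae02"),
    ("2016-11-15", "11111", "5dd59f9bd068f802a7c6dd832bf60d02"),
    ("2016-11-15", "22222", "5dd59f9bd068f802a7c6dd832bf60d02"),
    ("2016-11-15", "33333", "5dd59f9bd068f802a7c6dd832bf60d02"),
]


def roll_up(events):
    """Group rows by (timestamp, attribute) and count distinct device IDs."""
    groups = defaultdict(set)   # exact sets stand in for per-row sketches
    for timestamp, attribute, device_id in events:
        groups[(timestamp, attribute)].add(device_id)
    return {key: len(devices) for key, devices in groups.items()}


for (timestamp, attribute), count in sorted(roll_up(raw_events).items()):
    print(timestamp, attribute, count)
```

Five raw rows collapse into three rolled-up rows. With real sketches in place of sets, each rolled-up row has a bounded size no matter how many devices it summarizes.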

Guidelines and pitfalls
● Setup is not easy

Guidelines and pitfalls
● Monitoring your system

Guidelines and pitfalls
● Data modeling
○ Reduce the number of intersections
○ Different datasources for different use cases

Old data model:
Timestamp    Attribute        Count Distinct
2016-11-15   US               XXXXXX
2016-11-15   Porsche Intent   XXXXXX
...          ...              ...

New data model:
Timestamp    Attribute        Region   Count Distinct
2016-11-15   Porsche Intent   US       XXXXXX
...          ...              ...      ...
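
One reading of this slide is that the region moves from being one more attribute value to being its own dimension, so a question like "Porsche Intent devices in the US" no longer needs a sketch intersection. The snippet below contrasts the two query shapes as Druid native queries built in Python. The datasource names, dimension names, and the `device_sketch` metric name are hypothetical; the aggregator and post-aggregator types come from Druid's DataSketches extension.

```python
import json

INTERVAL = "2016-11-15/2016-11-16"


def filtered_sketch(name: str, dimension: str, value: str) -> dict:
    """A theta sketch aggregator restricted to rows where dimension == value."""
    return {
        "type": "filtered",
        "filter": {"type": "selector", "dimension": dimension, "value": value},
        "aggregator": {"type": "thetaSketch", "name": name, "fieldName": "device_sketch"},
    }


# Old model: region is just another attribute, so two sketches must be intersected.
old_query = {
    "queryType": "timeseries",
    "dataSource": "devices_by_attribute",          # hypothetical name
    "granularity": "all",
    "intervals": [INTERVAL],
    "aggregations": [
        filtered_sketch("us", "attribute", "US"),
        filtered_sketch("porsche", "attribute", "Porsche Intent"),
    ],
    "postAggregations": [{
        "type": "thetaSketchEstimate",
        "name": "unique_devices",
        "field": {
            "type": "thetaSketchSetOp",
            "name": "us_and_porsche",
            "func": "INTERSECT",
            "fields": [
                {"type": "fieldAccess", "fieldName": "us"},
                {"type": "fieldAccess", "fieldName": "porsche"},
            ],
        },
    }],
}

# New model: region is a dimension, so a plain filter replaces the intersection.
new_query = {
    "queryType": "timeseries",
    "dataSource": "devices_by_attribute_region",   # hypothetical name
    "granularity": "all",
    "intervals": [INTERVAL],
    "filter": {"type": "and", "fields": [
        {"type": "selector", "dimension": "attribute", "value": "Porsche Intent"},
        {"type": "selector", "dimension": "region", "value": "US"},
    ]},
    "aggregations": [
        {"type": "thetaSketch", "name": "unique_devices", "fieldName": "device_sketch"},
    ],
}

print(json.dumps(new_query, indent=2))
```

The second query touches one pre-aggregated sketch per matching row and avoids the extra error that intersections introduce, at the cost of a datasource with more rows.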

Guidelines and pitfalls
● Query optimization
○ Combine multiple queries into single query
○ Use filters
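
As an illustration of both bullets, the helper below builds one Druid native query that returns distinct-device counts for several attributes at once, instead of issuing one request per attribute, and applies a top-level filter so only relevant rows are scanned. It is a sketch under assumptions: the datasource, dimension, and metric names are hypothetical, and the query is only constructed here, not sent (a broker would normally receive it as a JSON POST to its `/druid/v2` endpoint).

```python
import json


def build_combined_query(attributes, interval, region):
    """One timeseries query with a filtered theta sketch aggregator per attribute."""
    return {
        "queryType": "timeseries",
        "dataSource": "devices_by_attribute_region",   # hypothetical name
        "granularity": "all",
        "intervals": [interval],
        # Top-level filter: restrict the scan to the requested region and attributes.
        "filter": {"type": "and", "fields": [
            {"type": "selector", "dimension": "region", "value": region},
            {"type": "in", "dimension": "attribute", "values": list(attributes)},
        ]},
        # One aggregator per attribute, so a single request answers all of them.
        "aggregations": [
            {
                "type": "filtered",
                "filter": {"type": "selector", "dimension": "attribute", "value": attr},
                "aggregator": {
                    "type": "thetaSketch",
                    "name": f"uniques_{attr}",
                    "fieldName": "device_sketch",      # hypothetical metric name
                },
            }
            for attr in attributes
        ],
    }


query = build_combined_query(["11111", "22222", "33333"], "2016-11-15/2016-11-16", "US")
print(json.dumps(query, indent=2))
```

Batching this way trades several round trips for one larger request, which can matter when a dashboard needs many counts for the same date range.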

Guidelines and pitfalls
● Batch Ingestion
○ EMR Tuning
■ 140-node cluster
● 85% spot instances => ~80% cost reduction
○ Druid input file format - Parquet vs CSV
■ Reduced indexing time by 4x
■ Reduced storage used by 10x

Summary
                      ES               DRUID
Ingested data         250GB/day        10TB/day
Ingestion time        10 Hours/day     4 Hours/day
Stored data           2.5TB (total)    15GB/day
Query response time   500ms-6000ms     280ms-350ms
Cost                  $80K/month       $55K/month

Remaining slides (by slide number)
4. NMC high-level architecture
6-7. The need
12. KMV intuition
16. Druid architecture
17. How do we use Druid
23. Guidelines and pitfalls ● Community
25. THANK YOU!

Editor's Notes

#3: Intro of us + NMC
#4: DaaS = marketplace for device-level data connecting buyers and sellers. SaaS = Nielsen Marketing Cloud platform, which helps brands connect with their customers using our big data sets and our analytics tools.
#5: Our serving layer (front end) aggregates data from various online + offline sources. We aggregate around 10B events per day.
#6: Past… Mention “cardinality” and “real-time dashboard”. Explain the need to union and intersect.
#9: Bit vector - Elasticsearch/Redis is an example of such a system.
#10: We tried introducing a new cluster dedicated to indexing only, and then using backup and restore to the second cluster. This method was very expensive and only partially helpful. Tuning for better performance didn't help much either.
#11: Preprocessing - too many combinations; the formula length is not bounded (show some numbers). HyperLogLog - the implementation in Elasticsearch was too slow (done at query time); set operations increase the error dramatically.
#12: Unions and intersections increase the error. The problematic case is the intersection of a very small set with a very big set.
#14: The larger the K, the smaller the error. However, a larger K means more memory and storage are needed.
#15: So we talked about statistical algorithms, which is nice, but we needed a practical solution… Druid supports the ThetaSketch algorithm out of the box (OOTB).
#16: Time-series database - the first thing you need to know about Druid. Column types: timestamp, dimensions, metrics. Together they comprise a datasource. Aggregation is done at ingestion time (the outcome is much smaller in size). At query time, it's closer to a key-value search.
#17: We have 3 types of processes - ingestion, querying, management. All processes are decoupled and scalable. Ingestion (real time - e.g., from Kafka; batch - talk about deep storage, how data is aggregated at ingestion time). Querying (brokers, historicals, query performance during ingestion). Lambda architecture.
#18: Explain the tuple and what is happening during the aggregation
#19: Setup is not easy. Separate config/servers/tuning. Caused the deployment to take a few months. Use the Druid recommendation for production configuration.
#20: Monitoring your system. Druid has built-in support for Graphite (exports many metrics).
#21: Data modeling. If using ThetaSketch, reduce the number of intersections (show a slide of the old and new data model). It didn't solve all use cases, but it gives you an idea of how you can approach the problem. Different datasources - e.g., lower accuracy for faster queries vs. higher accuracy with slightly slower queries.
#22: Combine multiple queries over the REST API. There can be billions of rows, so filter the data as part of the query.
#23: EMR tuning (spot instances (80% cost reduction), Druid MR prod config). Use Parquet. The picture here - maybe money??
#25: Ingestion doesn't affect queries + sub-second response for even 100s or 1000s of concurrent queries. Cost is for the entire solution (Druid cluster, EMR, etc.). With Druid and ThetaSketch, we've improved our ingestion volume, query performance, and concurrency by an order of magnitude at a lower cost compared to our old solution (we've achieved a more performant, scalable, cost-effective solution).