Bullet: A Real Time Data Query Engine
A REAL TIME DATA QUERY ENGINE
Michael Natkovich, Akshai Sarma
3
ALLOW MYSELF TO INTRODUCE … MYSELF
• Akshai Sarma
• asarma@yahoo-inc.com
• Senior Engineer
• 4+ years of solving data problems at Yahoo
4
ALLOW MYSELF TO INTRODUCE … MYSELF
• Michael Natkovich
• mln@yahoo-inc.com
• Director Engineer
• 10+ years of causing data problems at Yahoo
5
THE WHY
6
INSTRUMENTATION
• Is code added to web pages or apps to track usage and behavior
• User Identity
• Engagement
• Location
• Drives all data applications
• Targeting
• Personalization
• Analytics
7
CYCLE OF SADNESS
• Instrumentation validation is unbearably slow
• Needs to be seconds not minutes or hours
• Needs to be easy to query
• Needs programmatic access
8
EXISTING OPTIONS
Type        | Latency    | Downside
Batch       | Hours      | Too slow
Mini-Batch  | 20 Minutes | Faster, but still too slow
Streaming   | 2 Seconds  | Fast enough, but no way to query
• The various ways to obtain data were either:
• Not fast enough
• Impossible to query
9
THE WHAT
10
TYPICAL QUERYING
[Figure: diagram with the labels Data Flow, Persistence, and Queries]
11
ATYPICAL QUERYING
[Figure: diagram with the labels Data Flow, Old Un-Queryable Data, Current Queryable Data, Future Queryable Data, Query Engine, and Query Results]
12
BULLET
• Retrieves data that arrives after query submission – Look Forward
• No persistence layer
• Light-weight, fast, and scalable
• UI for Ad-Hoc queries
• API for programmatic querying
• Pluggable interface to integrate with streaming data
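The look-forward idea can be illustrated with a minimal Python sketch. This is not Bullet's code: the `LookForwardEngine` class, its method names, and the sample records are hypothetical, chosen only to show that a query sees records that arrive after it is submitted and that nothing is persisted.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Record = Dict[str, object]


@dataclass
class Query:
    """A registered query: a predicate plus the records matched so far."""
    predicate: Callable[[Record], bool]
    results: List[Record] = field(default_factory=list)


class LookForwardEngine:
    """Hypothetical engine that stores active queries, never the data."""

    def __init__(self) -> None:
        self.queries: Dict[str, Query] = {}

    def submit(self, query_id: str, predicate: Callable[[Record], bool]) -> None:
        self.queries[query_id] = Query(predicate)

    def on_record(self, record: Record) -> None:
        # Each arriving record is checked against every active query, then dropped.
        for query in self.queries.values():
            if query.predicate(record):
                query.results.append(record)

    def finish(self, query_id: str) -> List[Record]:
        # Remove the query and hand back whatever it matched.
        return self.queries.pop(query_id).results


engine = LookForwardEngine()
engine.on_record({"country": "US", "event": "click"})  # before the query: never seen
engine.submit("q1", lambda r: r["country"] == "US")
engine.on_record({"country": "US", "event": "view"})   # after the query: matched
engine.on_record({"country": "CA", "event": "view"})   # after the query: filtered out
results = engine.finish("q1")
```

Only the record that arrived after submission and passed the filter ends up in `results`; the earlier matching record is gone because there is no persistence layer to read it back from.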
13
QUERYING IN BULLET
• Supports filtering and logical operators on typed data
• Supports aggregations
• Group By, Count Distincts, Top K, Distributions
• DataSketches based
• Queries have life spans
• All queries run for a specified time window
• Raw queries can terminate early if they have seen a minimum number of records
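A query life span with early termination might look like the following Python sketch. The function `run_raw_query`, its parameters, and the timestamped stream are assumptions made for illustration; they are not Bullet's query format or API.

```python
from typing import Callable, Dict, Iterable, List, Tuple

Record = Dict[str, object]


def run_raw_query(stream: Iterable[Tuple[float, Record]],
                  predicate: Callable[[Record], bool],
                  start: float,
                  duration_s: float,
                  max_records: int) -> List[Record]:
    """Collect matching records until the window closes or max_records is reached."""
    matched: List[Record] = []
    for arrival_time, record in stream:
        if arrival_time >= start + duration_s:
            break  # the query's life span has expired
        if predicate(record):
            matched.append(record)
            if len(matched) >= max_records:
                break  # early termination: enough records seen
    return matched


# (arrival time in seconds, record); the last record arrives after a 30 s window.
stream = [(0.5, {"n": 1}), (1.0, {"n": 2}), (1.5, {"n": 3}), (31.0, {"n": 4})]

early = run_raw_query(stream, lambda r: True, start=0.0, duration_s=30.0, max_records=2)
timed_out = run_raw_query(stream, lambda r: r["n"] != 2, start=0.0, duration_s=30.0, max_records=10)
```

The first call stops after two records even though the window is still open; the second runs until the window closes and so never sees the record that arrives at 31 s.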
14
DATASKETCHES
• Sketches are a class of stochastic streaming algorithms
• Provide approximate results (if data is too large)
• Provable error bounds
• Fixed memory footprint
• Mergeable, allowing for parallel processing
Sketches logo from https://datasketches.github.io
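To make the fixed-memory and mergeable properties concrete, here is a toy K-Minimum-Values (KMV) distinct-count sketch in Python. It is a teaching sketch under stated assumptions, not the DataSketches library and not necessarily the algorithm Bullet uses: the `KMVSketch` class, the SHA-256 based hash, and `k = 512` are all choices made for this example.

```python
import hashlib
from typing import Set


def hash01(value: str) -> float:
    """Map a value to a deterministic pseudo-random float in [0, 1)."""
    digest = hashlib.sha256(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:6], "big") / 2**48


class KMVSketch:
    """Keeps only the k smallest hash values seen, so memory is bounded by k."""

    def __init__(self, k: int = 512) -> None:
        self.k = k
        self.mins: Set[float] = set()

    def update(self, value: str) -> None:
        self.mins.add(hash01(value))
        if len(self.mins) > self.k:
            self.mins.remove(max(self.mins))  # evict the largest hash

    def merge(self, other: "KMVSketch") -> "KMVSketch":
        # The k smallest hashes of the union are the k smallest of both sketches.
        merged = KMVSketch(self.k)
        merged.mins = set(sorted(self.mins | other.mins)[: self.k])
        return merged

    def estimate(self) -> float:
        if len(self.mins) < self.k:
            return float(len(self.mins))       # small input: the count is exact
        return (self.k - 1) / max(self.mins)   # approximate once the sketch is full


left, right, whole = KMVSketch(), KMVSketch(), KMVSketch()
for i in range(5000):
    value = f"user-{i}"
    (left if i % 2 == 0 else right).update(value)  # split the data across two sketches
    whole.update(value)                            # and also feed one sketch everything
merged = left.merge(right)
```

Merging the two half-data sketches yields exactly the same state as the sketch that saw all the data, which is what allows sketches to be built in parallel and combined later. With `k = 512` the relative error is typically a few percent.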
15
DEMO
16
17
[Figure: Bullet query flow diagram. Components: Bullet WS, Request Processor, Data Processor, Combiner. Data sources: performance stats, sensor data, user activity, IoT data. Edge labels: Query, Query & ID, Data Stream, Data Records, Matching Events & ID, Results.]
18
USE CASES: BEYOND INSTRUMENTATION VALIDATION
• See sample values of a field
• What’s a country code look like?
• Cardinality of fields for Druid ingestion
• 10s, 100s, 1000s of unique values?
• Check that a new experiment is running
• Is data coming in for all my test buckets?
19
THE HOW
20
ARCHITECTURE
21
STORM TERMINOLOGY
• Tuple: The basic unit of data in Storm
• Stream: An unbounded set of tuples
• Spout: Source of tuples (Kafka, Flume, etc.)
• Bolt: Tuple processor
• Topology: A DAG of Spouts and Bolts
• DRPC: Distributed Remote Procedure Call
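The terms above can be mapped onto a toy Python pipeline. This does not use Storm's API: the function names are hypothetical, Python generators stand in for streams, and the stream here is finite where a Storm stream is unbounded.

```python
from typing import Iterable, Iterator, Tuple

Event = Tuple[str, int]  # stands in for a Storm tuple: (event type, user id)


def event_spout() -> Iterator[Event]:
    """Source of tuples; a real spout would read from Kafka, Flume, etc."""
    yield ("click", 1)
    yield ("view", 2)
    yield ("click", 3)


def filter_bolt(stream: Iterable[Event], wanted: str) -> Iterator[Event]:
    """A bolt that passes through only tuples of the wanted event type."""
    for event in stream:
        if event[0] == wanted:
            yield event


def count_bolt(stream: Iterable[Event]) -> int:
    """A terminal bolt that counts the tuples it receives."""
    return sum(1 for _ in stream)


# A topology is a DAG of spouts and bolts; this one is a simple chain.
click_count = count_bolt(filter_bolt(event_spout(), "click"))
```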
Storm logo from https://storm.apache.org
22
23
PLUGGABLE INTERFACE
1. Run Bullet on your data
• Write a Spout/Topology to read data from Kafka, Flume, HDFS, etc.
• Convert data into a Bullet Record (Avro)
2. Plug in a schema (if you need the UI)
• JSON based
• Provides field names, types, and descriptions
3. Plug in a default starting query (Optional for the UI)
• Example: A query based on a UI user's cookie
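Steps 1 and 2 can be sketched in Python as follows. The schema layout, the field names, the type names, and the `to_record` helper are hypothetical and are not Bullet's actual schema format or record class; the slides only say that the schema is JSON based and provides field names, types, and descriptions.

```python
import json
from typing import Any, Dict, List

# Hypothetical JSON schema: one entry per field with a name, type, and description.
SCHEMA_JSON = """
[
  {"name": "user_id", "type": "STRING", "description": "Hashed user identifier"},
  {"name": "event_time", "type": "LONG", "description": "Epoch milliseconds"},
  {"name": "tags", "type": "MAP", "description": "Free-form key/value attributes"}
]
"""

# Assumed mapping from schema type names to Python types.
CASTS = {"STRING": str, "LONG": int, "MAP": dict}


def to_record(raw: Dict[str, Any], schema: List[Dict[str, str]]) -> Dict[str, Any]:
    """Keep only schema fields and cast them to their declared types."""
    record: Dict[str, Any] = {}
    for field_def in schema:
        value = raw.get(field_def["name"])
        # Missing fields become None; present fields are cast to the declared type.
        record[field_def["name"]] = None if value is None else CASTS[field_def["type"]](value)
    return record


schema = json.loads(SCHEMA_JSON)
raw_event = {"user_id": 42, "event_time": "1500000000000",
             "tags": {"page": "home"}, "debug": True}
record = to_record(raw_event, schema)
```

The `debug` field is dropped because it is not in the schema, and the remaining values are cast to their declared types so that queries can filter on typed data.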
24
PERFORMANCE
• Scaling for data
1. Scale pluggable data processing component
2. Scale Filter Bolts for handling data volume
• Scaling for simultaneous queries
1. Scale Filter Bolts
2. Scale DRPC components – DRPC servers primarily
3. Scale Join Bolts
25
TEST HARDWARE
• Storm 1.0 Cluster
• 2 x Intel E5-2680v3 (12 Core, 24 Threads) – 48 V. Cores
• 256 GB RAM
• 10 G Network Interface
• Multi-tenant
• Reading data from a Kafka 0.10 cluster
• In the same data center so network delays are minimal
26
DATA
• Average size: 4.33 KiB compressed (1.2 compression ratio)
• Data Volume: Records per second (R/s), Mebibytes per second (MiB/s)
• 92 top-level fields
• 62 Strings
• 4 Longs
• 23 Maps
• 3 Lists of Maps
27
FINDING A SINGLE GENERATED RECORD
• Data Volume : 67,400 R/s and 104 MiB/s
• Average of 100 Bullet queries to find a single generated record
Timestamp                  | Delay (ms)
Query received in Bullet   | 0
Record generated           | 31.3
Record submitted to Kafka  | 357.9
Record received in Bullet  | 1008.2
Record found in Bullet     | 1015.4
Query finished in Bullet   | 1018.3
• Bullet latency is 1018.3 – 1008.2 = 10.1 ms
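The timeline can be broken into per-stage delays with a few lines of Python. The numbers are the ones reported on this slide; the variable names are only for illustration.

```python
# Cumulative delays in milliseconds, as reported on the slide.
timeline_ms = [
    ("Query received in Bullet", 0.0),
    ("Record generated", 31.3),
    ("Record submitted to Kafka", 357.9),
    ("Record received in Bullet", 1008.2),
    ("Record found in Bullet", 1015.4),
    ("Query finished in Bullet", 1018.3),
]

# Time spent reaching each stage from the previous one.
stage_ms = {
    later: round(t_later - t_earlier, 1)
    for (_, t_earlier), (later, t_later) in zip(timeline_ms, timeline_ms[1:])
}

# Bullet's own share: from receiving the record to finishing the query.
bullet_latency_ms = round(
    stage_ms["Record found in Bullet"] + stage_ms["Query finished in Bullet"], 1
)

for stage, delta in stage_ms.items():
    print(f"{stage}: {delta} ms")
print(f"Bullet latency: {bullet_latency_ms} ms")
```

Most of the roughly one second end to end is spent before the record reaches Bullet (326.6 ms getting into Kafka and 650.3 ms getting from Kafka to Bullet); Bullet itself accounts for 7.2 + 2.9 = 10.1 ms.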
28
SCALING FOR DATA: GOALS
• Read the data
• Catch up on data backlog at > 5 : 1 ratio (5s of backlog in 1s)
• Support 400 Raw Bullet queries concurrently
• Max record finding latency < 200 ms at 400 queries
29
SCALING FOR DATA: CPU
30
SCALING FOR DATA: MEMORY
31
SCALING FOR DATA: SUMMARY
• For 400 Raw queries and data reading goals
• CPU to Memory ratio
• 1 core : 1.2 GiB
• CPU to Data ratio
• 1 core : 856 R/s
• 1 core : 3.4 MiB/s
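As a rough worked example, the ratios above can be turned into a back-of-the-envelope sizing helper. The target rate of 100,000 R/s is hypothetical, and the calculation assumes the ratios scale linearly, which the slides do not claim; the ratios themselves were measured on the authors' hardware and data for 400 Raw queries.

```python
import math
from typing import Tuple

RECORDS_PER_CORE = 856  # R/s handled per core, from the summary above
GIB_PER_CORE = 1.2      # memory per core, from the summary above


def size_cluster(target_records_per_s: float) -> Tuple[int, float]:
    """Estimate cores and memory for a target record rate (linear scaling assumed)."""
    cores = math.ceil(target_records_per_s / RECORDS_PER_CORE)
    memory_gib = round(cores * GIB_PER_CORE, 1)
    return cores, memory_gib


# Hypothetical target: 100,000 records per second.
cores, memory_gib = size_cluster(100_000)
print(f"{cores} cores, {memory_gib} GiB")
```

For 100,000 R/s this gives 117 cores (100,000 / 856 ≈ 116.8, rounded up) and about 140.4 GiB of memory.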
32
SCALING FOR QUERIES: GOALS
• Fixed Data Volume : 68,400 R/s and 105 MiB/s
• Latency to find a record after the record is first seen by Bullet
• As number of Filter Bolts (1 V. Core, 1024 MiB RAM) varies
• As number of simultaneous Raw queries varies
• Each query runs for 30s and looks for 10 generated records
• Want max latency < 200 ms
33
SCALING FOR QUERIES: CPU
34
SCALING FOR QUERIES: LATENCY
35
SCALING FOR QUERIES: DRPC
• 3 DRPC servers on our test Storm cluster
• 2 x Intel E5620 (4 cores, 8 Threads) - 16 V. Cores
• 24 GB RAM
• 10 G Network
• About 700 simultaneous Bullet queries
• Horizontally scalable
• Blocking threads at the moment
• Async implementation in Storm 2.0
36
ANNOUNCING OPEN SOURCE
• We are on GitHub!
• Documentation: https://yahoo.github.io/bullet-docs
• Contributions, ideas, feedback welcome!
Component | Repo
Storm     | https://github.com/yahoo/bullet-storm
WS        | https://github.com/yahoo/bullet-service
UI        | https://github.com/yahoo/bullet-ui
Record    | https://github.com/yahoo/bullet-record
37
SUMMARY
• Wanted to validate instrumentation but ended up with generic querying
• Query any data that can be plugged into Storm
• Queries first, then data → Look-forward querying
• Persists no data → Light-weight and cheap!
• Fetch Raw data
• Aggregate: Group By, Top K, Distributions, Count Distinct
38
FUTURE WORK
• Considering Pub/Sub queue to receive and send queries and results
• Allows Bullet implementations on other Stream processors
• Incremental updates
• WebSockets or SSE to push results
• Streaming results
• Additive results
• Security
• SQL interface
39
THANKS
• Nathan Speidel
• Cat Utah
• Marcus Svedman
• Satish Vanimisetti
40
LINKS
• Contact Us
• Developers : bullet-dev@googlegroups.com
• Users : bullet-users@googlegroups.com
• Documentation : https://yahoo.github.io/bullet-docs
• DataSketches: https://datasketches.github.io
41
APPENDIX
42
COUNT DISTINCT: NAIVE
1. Read Input
2. Round Robin
3. Extract Field
4. Send to Combiner
5. Count Distincts
Overwhelm Single Combiner
43
COUNT DISTINCT: TYPICAL
1. Read Input
2. Round Robin
3. Extract Field
4. Hash Partition
5. Count Distincts
6. Send Count
7. Combined Counts
Vulnerable to Data Skew
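The skew problem with hash partitioning can be seen in a small Python experiment. The data set (one hot value making up 90% of the records), the worker count, and the SHA-256 based hash are assumptions made for this illustration.

```python
import hashlib
from collections import Counter


def stable_hash(value: str) -> int:
    """Deterministic hash, so every run partitions the data the same way."""
    return int.from_bytes(hashlib.sha256(value.encode("utf-8")).digest()[:8], "big")


# Skewed field values: one hot value dominates the stream.
values = ["US"] * 900 + [f"country-{i}" for i in range(100)]
workers = 4

# Hash partitioning: every record with the same value lands on the same worker.
hash_load = Counter(stable_hash(v) % workers for v in values)

# Round robin: records are spread evenly regardless of their values.
round_robin_load = Counter(i % workers for i in range(len(values)))

print("hash partition load:", dict(hash_load))
print("round robin load:   ", dict(round_robin_load))
```

Under hash partitioning one worker receives at least the 900 records of the hot value, while round robin gives each of the four workers exactly 250 records.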
44
COUNT DISTINCT: SKETCHES
1. Read Input
2. Round Robin
3. Build Sketch
4. Send to Combiner
5. Merge Sketches
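The sketch-based steps can be walked through in Python with a deliberately simple mergeable sketch. This example uses a linear-counting bitmap, which is a stand-in chosen for brevity and not the DataSketches algorithm; the bitmap size `M`, the helper names, and the generated records are all assumptions.

```python
import hashlib
import math
import random
from typing import Iterable, List

M = 1 << 16  # bitmap size in bits: the fixed memory budget of each sketch


def build_sketch(values: Iterable[str]) -> int:
    """Step 3: set one bit per hashed value; duplicates hit the same bit."""
    bitmap = 0
    for value in values:
        h = int.from_bytes(hashlib.sha256(value.encode("utf-8")).digest()[:8], "big")
        bitmap |= 1 << (h % M)
    return bitmap


def merge_sketches(sketches: List[int]) -> int:
    """Step 5: merging is a bitwise OR, so the order of workers does not matter."""
    merged = 0
    for sketch in sketches:
        merged |= sketch
    return merged


def estimate(bitmap: int) -> float:
    """Linear-counting estimate; only valid while some bits are still zero."""
    zeros = M - bin(bitmap).count("1")
    return -M * math.log(zeros / M)


# Step 1: read input (20,000 records drawn from at most 5,000 distinct values).
rng = random.Random(42)
records = [f"user-{rng.randrange(5000)}" for _ in range(20000)]

# Step 2: round robin across workers, so skewed values cannot overload one worker.
workers = 4
partitions = [records[w::workers] for w in range(workers)]

# Steps 3 to 5: each worker builds a sketch; the combiner merges them.
worker_sketches = [build_sketch(p) for p in partitions]
combined = merge_sketches(worker_sketches)

exact = len(set(records))
approx = estimate(combined)
print(f"exact={exact} approx={approx:.0f}")
```

Because the same value can appear on several workers, adding up per-worker distinct counts would overcount; merging the sketches does not, and the combiner only receives one fixed-size sketch per worker instead of every extracted field value.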
