Bullet: A Real Time Data Query Engine

A REAL TIME DATA QUERY ENGINE
Michael Natkovich, Akshai Sarma

3
ALLOW MYSELF TO INTRODUCE … MYSELF
• Akshai Sarma
• asarma@yahoo-inc.com
• Senior Engineer
• 4+ years of solving data problems at Yahoo

4
ALLOW MYSELF TO INTRODUCE … MYSELF
• Michael Natkovich
• mln@yahoo-inc.com
• Director Engineer
• 10+ years of causing data problems at Yahoo

6
INSTRUMENTATION
• Is code added to web pages or apps to track usage and behavior
• User Identity
• Engagement
• Location
• Drives all data applications
• Targeting
• Personalization
• Analytics

7
CYCLE OF SADNESS
• Instrumentation validation is unbearably slow
• Needs to take seconds, not minutes or hours
• Needs to be easy to query
• Needs programmatic access

8
EXISTING OPTIONS
Type        Latency     Downside
Batch       Hours       Too slow
Mini-Batch  20 Minutes  Faster, but still too slow
Streaming   2 Seconds   Fast enough, but no way to query
• The various ways to obtain data were either:
• Not fast enough
• Impossible to query

10
TYPICAL QUERYING
Data Flow
Persistence
Queries

11
ATYPICAL QUERYING
[Diagram: Data Flow divided into Old Un-Queryable Data, Current Queryable Data, and Future Queryable Data, with a Query Engine producing Query Results]

12
BULLET
• Retrieves data that arrives after query submission – Look Forward
• No persistence layer
• Light-weight, fast, and scalable
• UI for Ad-Hoc queries
• API for programmatic querying
• Pluggable interface to integrate with streaming data
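
The look-forward idea above can be sketched in a few lines. This is a minimal illustration, not Bullet's implementation: the class `LookForwardEngine` and its methods are hypothetical names, and the only state kept is the set of active queries plus the matches collected for them.

```python
from typing import Callable, Dict, List

Record = Dict[str, object]


class LookForwardEngine:
    """Hypothetical engine: keeps active queries only, never the data stream."""

    def __init__(self) -> None:
        self._next_id = 0
        self._filters: Dict[int, Callable[[Record], bool]] = {}
        self._results: Dict[int, List[Record]] = {}

    def submit(self, predicate: Callable[[Record], bool]) -> int:
        """Register a query; it only sees records that arrive from now on."""
        query_id = self._next_id
        self._next_id += 1
        self._filters[query_id] = predicate
        self._results[query_id] = []
        return query_id

    def on_record(self, record: Record) -> None:
        """Check a record against the queries active right now, then drop it."""
        for query_id, predicate in self._filters.items():
            if predicate(record):
                self._results[query_id].append(record)

    def finish(self, query_id: int) -> List[Record]:
        """Stop the query and hand back whatever it matched."""
        del self._filters[query_id]
        return self._results.pop(query_id)


engine = LookForwardEngine()
engine.on_record({"page": "home", "country": "US"})  # before any query: not kept
qid = engine.submit(lambda r: r["country"] == "US")
engine.on_record({"page": "mail", "country": "US"})
engine.on_record({"page": "news", "country": "CA"})
results = engine.finish(qid)
```

The record that arrived before the query was submitted is never returned, because nothing is persisted.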

13
QUERYING IN BULLET
• Supports filtering with logical operators on typed data
• Supports aggregations
• Group By, Count Distincts, Top K, Distributions
• DataSketches based
• Queries have life spans
• All queries run for a specified time window
• Raw queries can terminate early if they have seen a minimum number of records
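
A Raw query with a life span and early termination might look like the following sketch. The `RawQuery` class, its field names, and the sample records are assumptions made for illustration; they do not reflect Bullet's actual query format.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Record = Dict[str, object]


@dataclass
class RawQuery:
    """Hypothetical Raw query: a filter, a time window, and a record limit."""
    predicate: Callable[[Record], bool]
    start: float       # submission time in seconds
    duration: float    # life span in seconds
    max_records: int   # finish early once this many records have matched
    matched: List[Record] = field(default_factory=list)

    def is_done(self, now: float) -> bool:
        expired = now >= self.start + self.duration
        full = len(self.matched) >= self.max_records
        return expired or full

    def consume(self, record: Record, now: float) -> None:
        if not self.is_done(now) and self.predicate(record):
            self.matched.append(record)


def is_slow_us(record: Record) -> bool:
    # Filter with a logical AND over a string field and a numeric field.
    return record["country"] == "US" and record["duration_ms"] > 100


query = RawQuery(is_slow_us, start=0.0, duration=30.0, max_records=2)
stream = [
    (1.0, {"country": "US", "duration_ms": 250}),
    (2.0, {"country": "CA", "duration_ms": 900}),
    (3.0, {"country": "US", "duration_ms": 120}),
    (4.0, {"country": "US", "duration_ms": 500}),  # ignored: limit already reached
]
for now, record in stream:
    query.consume(record, now)
finished_early = query.is_done(now=4.0)
```

Here the query is done after 3 seconds of its 30 second window because it has already seen its two records.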

14
DATASKETCHES
• Sketches are a class of stochastic streaming algorithms
• Provide approximate results (when the data is too large for exact answers)
• Provable error bounds
• Fixed memory footprint
• Mergeable, allowing for parallel processing
Sketches logo from https://datasketches.github.io
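
To make the properties above concrete, here is a small K-Minimum-Values sketch for counting distinct values. It is a teaching sketch under stated assumptions, not the DataSketches library's code: the class name `KmvSketch`, the choice of SHA-256 for hashing, and `k = 2048` are all illustrative. It keeps at most `k` hash values (fixed memory) and two sketches can be merged (parallel processing).

```python
import hashlib
import heapq


def _hash01(value: str) -> float:
    """Map a value to a pseudo-uniform float in [0, 1) using SHA-256."""
    digest = hashlib.sha256(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


class KmvSketch:
    """Keeps only the k smallest distinct hash values seen so far."""

    def __init__(self, k: int = 2048) -> None:
        self.k = k
        self._heap = []     # negated hashes, so the root is the largest kept hash
        self._kept = set()  # kept hashes, used to skip duplicates

    def update(self, value: str) -> None:
        self._offer(_hash01(value))

    def _offer(self, h: float) -> None:
        if h in self._kept:
            return
        if len(self._heap) < self.k:
            heapq.heappush(self._heap, -h)
            self._kept.add(h)
        elif h < -self._heap[0]:
            evicted = -heapq.heapreplace(self._heap, -h)
            self._kept.discard(evicted)
            self._kept.add(h)

    def merge(self, other: "KmvSketch") -> "KmvSketch":
        """Return a new sketch equivalent to having seen both inputs."""
        merged = KmvSketch(min(self.k, other.k))
        for h in self._kept | other._kept:
            merged._offer(h)
        return merged

    def estimate(self) -> float:
        if len(self._heap) < self.k:
            return float(len(self._heap))  # fewer than k hashes kept: exact count
        return (self.k - 1) / -self._heap[0]


a, b = KmvSketch(), KmvSketch()
for i in range(0, 15_000):
    a.update(f"user-{i}")
for i in range(10_000, 25_000):
    b.update(f"user-{i}")
approx_union = a.merge(b).estimate()  # true distinct count is 25,000
```

A larger `k` tightens the estimate at the cost of memory; the merged sketch never holds more than `k` hashes no matter how much data each side saw.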

17
[Diagram: query flow. Components: Bullet WS, Request Processor, Data Processor, Combiner. Data Stream sources: Performance Stats, Sensor Data, User Activity, IoT Data. Arrows: Query, Query & ID, Data Records, Matching Events & ID, Results]
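
The flow in this diagram can be sketched as three small components that pass IDs along with queries and matches. All class and method names below are hypothetical and the components run in one process here, whereas the slide shows them as separate distributed stages.

```python
import itertools
from collections import defaultdict


class RequestProcessor:
    """Tags each incoming query with an ID."""

    def __init__(self):
        self._ids = itertools.count(1)

    def accept(self, query):
        return next(self._ids), query


class DataProcessor:
    """Matches records against every registered query and tags matches with the query ID."""

    def __init__(self):
        self.queries = {}

    def register(self, query_id, query):
        self.queries[query_id] = query

    def process(self, record):
        return [(qid, record) for qid, q in self.queries.items() if q(record)]


class Combiner:
    """Groups matching events by query ID so results can be returned per query."""

    def __init__(self):
        self.matches = defaultdict(list)

    def collect(self, tagged_records):
        for qid, record in tagged_records:
            self.matches[qid].append(record)

    def results(self, qid):
        return self.matches.pop(qid, [])


requests, combiner = RequestProcessor(), Combiner()
processors = [DataProcessor(), DataProcessor()]

qid, query = requests.accept(lambda r: r["value"] >= 4)
for p in processors:                      # the query and its ID go to every data processor
    p.register(qid, query)

records = [{"type": "sensor", "value": i} for i in range(6)]
for i, record in enumerate(records):      # records are spread across data processors
    combiner.collect(processors[i % 2].process(record))

found = combiner.results(qid)
```

Because each match carries the query ID, the Combiner does not need to know which data processor produced it.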

18
USE CASES: BEYOND INSTRUMENTATION VALIDATION
• See sample values of a field
• What’s a country code look like?
• Cardinality of fields for Druid ingestion
• 10s, 100s, 1000s of unique values?
• Check that a new experiment is running
• Is data coming in for all my test buckets?

21
STORM TERMINOLOGY
• Tuple: The basic unit of data in Storm
• Stream: An unbounded sequence of tuples
• Spout: Source of tuples (Kafka, Flume, etc.)
• Bolt: Tuple processor
• Topology: A DAG of Spouts and Bolts
• DRPC: Distributed Remote Procedure Call
Storm logo from https://storm.apache.org

23
PLUGGABLE INTERFACE
1. Run Bullet on your data
• Write a Spout/Topology to read data from Kafka, Flume, HDFS etc.
• Convert data into a Bullet Record (AVRO)
2. Plug in a schema (if you need the UI)
• JSON based
• Provides field names, types, and descriptions
3. Plug in a default starting query (Optional for the UI)
• Example: A query based on the UI user's cookie
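
As a rough sketch of steps 1 and 2, the snippet below converts a raw message into a typed record using a JSON schema of field names, types, and descriptions. The schema layout, the field names, and the `to_record` helper are all hypothetical; this is not the Bullet Record API or Bullet's real schema format, and it uses plain dictionaries rather than AVRO.

```python
import json
from typing import Any, Dict, List

# Hypothetical schema: field names, types, and descriptions, as the slide describes.
SCHEMA_JSON = """
[
  {"name": "user_id", "type": "STRING", "description": "Hypothetical user identifier"},
  {"name": "duration_ms", "type": "LONG", "description": "Hypothetical page load time"},
  {"name": "tags", "type": "MAP", "description": "Hypothetical free-form attributes"}
]
"""

TYPE_CHECKS = {"STRING": str, "LONG": int, "MAP": dict}


def to_record(raw_message: str, schema: List[Dict[str, str]]) -> Dict[str, Any]:
    """Keep only schema fields whose values have the declared type."""
    raw = json.loads(raw_message)
    record = {}
    for column in schema:
        value = raw.get(column["name"])
        expected = TYPE_CHECKS[column["type"]]
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(value, expected) and not isinstance(value, bool):
            record[column["name"]] = value
    return record


schema = json.loads(SCHEMA_JSON)
message = '{"user_id": "u1", "duration_ms": "fast", "tags": {"exp": "b"}, "debug": true}'
record = to_record(message, schema)
```

In this example the mistyped `duration_ms` and the unknown `debug` field are dropped, so downstream queries only ever see typed fields.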

24
PERFORMANCE
• Scaling for data
1. Scale pluggable data processing component
2. Scale Filter Bolts for handling data volume
• Scaling for simultaneous queries
1. Scale Filter Bolts
2. Scale DRPC components – DRPC servers primarily
3. Scale Join Bolts

25
TEST HARDWARE
• Storm 1.0 Cluster
• 2 x Intel E5-2680v3 (12 Core, 24 Threads) – 48 V. Cores
• 256 GB RAM
• 10 G Network Interface
• Multi-tenant
• Reading data from a Kafka 0.10 cluster
• In the same data center so network delays are minimal

26
DATA
• Average size: 4.33 KiB compressed (1.2 compression ratio)
• Data Volume: Records per second (R/s), Mebibytes per second (MiB/s)
• 92 top-level fields
• 62 Strings
• 4 Longs
• 23 Maps
• 3 Lists of Maps

27
FINDING A SINGLE GENERATED RECORD
• Data Volume : 67,400 R/s and 104 MiB/s
• Average of 100 Bullet queries to find a single generated record
Timestamp                   Delay (ms)
Query received in Bullet    0
Record generated            31.3
Record submitted to Kafka   357.9
Record received in Bullet   1008.2
Record found in Bullet      1015.4
Query finished in Bullet    1018.3
• Bullet latency is 1018.3 – 1008.2 = 10.1 ms
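
The breakdown can be recomputed from the timestamps in the table. The snippet only does arithmetic on the figures from this slide; the variable names are illustrative.

```python
# Cumulative delays in milliseconds, copied from the table on this slide.
timestamps_ms = {
    "Query received in Bullet": 0.0,
    "Record generated": 31.3,
    "Record submitted to Kafka": 357.9,
    "Record received in Bullet": 1008.2,
    "Record found in Bullet": 1015.4,
    "Query finished in Bullet": 1018.3,
}

# Time spent between each pair of consecutive events.
events = list(timestamps_ms.items())
step_ms = {
    f"{earlier} -> {later}": round(t_later - t_earlier, 1)
    for (earlier, t_earlier), (later, t_later) in zip(events, events[1:])
}

# Latency attributable to Bullet: from receiving the record to finishing the query.
bullet_latency_ms = round(
    timestamps_ms["Query finished in Bullet"] - timestamps_ms["Record received in Bullet"], 1
)

for step, delta in step_ms.items():
    print(f"{step}: {delta} ms")
print(f"Bullet latency: {bullet_latency_ms} ms")
```

Most of the roughly one second end to end is spent before the record reaches Bullet.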

28
SCALING FOR DATA: GOALS
• Read the data
• Catch up on data backlog at > 5 : 1 ratio (5s of backlog in 1s)
• Support 400 Raw Bullet queries concurrently
• Max record finding latency < 200 ms at 400 queries

31
SCALING FOR DATA: SUMMARY
• For 400 Raw queries and data reading goals
• CPU to Memory ratio
• 1 core : 1.2 GiB
• CPU to Data ratio
• 1 core : 856 R/s
• 1 core : 3.4 MiB/s

32
SCALING FOR QUERIES: GOALS
• Fixed Data Volume : 68,400 R/s and 105 MiB/s
• Latency to find a record after the record is first seen by Bullet
• As number of Filter Bolts (1 V. Core, 1024 MiB RAM) varies
• As number of simultaneous Raw queries varies
• Each query runs for 30s and looks for 10 generated records
• Want max latency < 200 ms

34
SCALING FOR QUERIES: LATENCY

35
SCALING FOR QUERIES: DRPC
• 3 DRPC servers on our test Storm cluster
• 2 x Intel E5620 (4 cores, 8 Threads) - 16 V. Cores
• 24 GB RAM
• 10 G Network
• About 700 simultaneous Bullet queries
• Horizontally scalable
• Blocking threads at the moment
• Async implementation in Storm 2.0

36
ANNOUNCING OPEN SOURCE
• We are on GitHub!
• Documentation: https://yahoo.github.io/bullet-docs
• Contributions, ideas, feedback welcome!
Component  Repo
Storm      https://github.com/yahoo/bullet-storm
WS         https://github.com/yahoo/bullet-service
UI         https://github.com/yahoo/bullet-ui
Record     https://github.com/yahoo/bullet-record

37
SUMMARY
• Wanted to validate instrumentation but ended up with generic querying
• Query any data that can be plugged into Storm
• Queries first, then data → Look-forward querying
• Persists no data → Light-weight and cheap!
• Fetch Raw data
• Aggregate: Group By, Top K, Distributions, Count Distinct

38
FUTURE WORK
• Considering Pub/Sub queue to receive and send queries and results
• Allows Bullet implementations on other Stream processors
• Incremental updates
• WebSockets or SSE to push results
• Streaming results
• Additive results
• Security
• SQL interface

39
THANKS
• Nathan Speidel
• Cat Utah
• Marcus Svedman
• Satish Vanimisetti

40
LINKS
• Contact Us
• Developers : bullet-dev@googlegroups.com
• Users : bullet-users@googlegroups.com
• Documentation : https://yahoo.github.io/bullet-docs
• DataSketches: https://datasketches.github.io

42
COUNT DISTINCT: NAIVE
1. Read Input
2. Round Robin
3. Extract Field
4. Send to Combiner
5. Count Distincts
Overwhelm Single Combiner

43
COUNT DISTINCT: TYPICAL
1. Read Input
2. Round Robin
3. Extract Field
4. Hash Partition
5. Count Distincts
6. Send Count
7. Combine Counts
Vulnerable to Data Skew
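
The skew problem is easy to reproduce. In this sketch, which uses made-up data and a hypothetical `partition_of` helper, hash partitioning gives an exact distinct count, but one frequent value sends most of the traffic to a single counter.

```python
import hashlib
from collections import Counter


def partition_of(value: str, partitions: int) -> int:
    """Hash partition: the same value always lands on the same counter."""
    digest = hashlib.sha256(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % partitions


def count_distinct_partitioned(values, partitions: int):
    seen = [set() for _ in range(partitions)]
    load = Counter()
    for value in values:
        p = partition_of(value, partitions)
        seen[p].add(value)   # each counter tracks the distinct values it owns
        load[p] += 1         # records handled by each counter
    # Partitions own disjoint values, so per-partition counts can simply be summed.
    return sum(len(s) for s in seen), load


# Made-up skewed input: one hot value plus 1,000 unique ones.
values = ["hot-user"] * 9_000 + [f"user-{i}" for i in range(1_000)]
total, load = count_distinct_partitioned(values, partitions=4)
loads = [load[p] for p in range(4)]
print(total, loads)
```

The count is correct, but the counter that owns the hot value handles at least 9,000 of the 10,000 records while the others sit mostly idle.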

44
COUNT DISTINCT: SKETCHES
1. Read Input
2. Round Robin
3. Build Sketch
4. Send to Combiner
5. Merge Sketches
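
The five steps above can be sketched with any mergeable sketch. To keep the example short it uses a linear counting bitmap, which is a swapped-in technique chosen for brevity and not what DataSketches or Bullet use; the bitmap size `M`, the helper names, and the input data are assumptions.

```python
import hashlib
import math
from itertools import cycle

M = 1 << 14  # bits per sketch: fixed memory regardless of input size


def bit_index(value: str) -> int:
    digest = hashlib.sha256(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % M


def build_sketch(values) -> int:
    """Step 3: each worker sets one bit per value it sees."""
    bitmap = 0
    for value in values:
        bitmap |= 1 << bit_index(value)
    return bitmap


def merge(sketches) -> int:
    """Step 5: merging is a bitwise OR, so order and partitioning do not matter."""
    merged = 0
    for sketch in sketches:
        merged |= sketch
    return merged


def estimate(bitmap: int) -> float:
    """Linear counting estimate: -M * ln(fraction of bits still zero)."""
    zeros = M - bin(bitmap).count("1")
    if zeros == 0:
        raise ValueError("sketch is saturated; use a larger M")
    return -M * math.log(zeros / M)


# Steps 1 and 2: read skewed input and deal it out round robin to 4 workers.
values = ["hot-user"] * 9_000 + [f"user-{i}" for i in range(2_000)]
workers = [[] for _ in range(4)]
for worker, value in zip(cycle(workers), values):
    worker.append(value)

# Steps 3 to 5: build a sketch per worker, send them to the combiner, merge.
approx = estimate(merge(build_sketch(w) for w in workers))  # true count is 2,001
```

Round robin keeps every worker equally loaded even with a hot value, and the combiner receives one fixed-size sketch per worker instead of every extracted field.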
