Druid Ingestion: From 3 hr to 5 min
Shivji Kumar Jha, Staff Engineer, Nutanix
Sachidananda Maharana, MTS 4, Nutanix
Challenges, Mitigations & Learnings
Druid Summit 2023 : Changing Druid Ingestion from 3 hours to 5 minutes
About Us
Shivji Kumar Jha
Staff Engineer,
CPaaS Data Platform,
Nutanix
Sachidananda Maharana
Sr Engineer / OLAP Ninja
CPaaS Team, Nutanix
● Software Engineer & Regular Speaker / Meetups
● Excited about:
○ Distributed Databases & Streaming
○ Open-Source Software & Communities
○ MySQL, Postgres, Pulsar/NATS, Druid/Clickhouse
● Regular Platform Engineer
● Excited about:
○ Distributed OLAP Databases
○ Open-Source Enthusiast
Contents
Druid 101
How we use Druid
Re-architecture : What & Why
Impact On Druid components
How we fixed the issues
State of Bugs we filed / fixed
Druid 101
• Open-source, Apache 2.0 License and under Apache Foundation
• Columnar data store designed for high-performance.
• Supports Real-time and Batch ingestion.
• Segment Oriented Storage
• Distributed and modular architecture, horizontally scalable for most parts
• Supports Data tiering – Keep cold data in cheaper storage!
What we love about Druid!
Modularity - Separation of Concerns
Modularity – Simplicity* : Ease to deploy , Upgrade, Migrate, Manage
Modularity – Flexibility - Scale only what you need, Retain based on retention rules on tiers
Modularity - Built for Cloud
Durability – Object Store (S3 or Nutanix Objects for instance) for Deep Storage
Durability - SQL database for metadata
Admin Dashboard – easier debugging and monitoring
[Diagram: Druid architecture, write and read paths]
Ingestion & Query Patterns
● IPFix log files are collected from clouds.
○ IPFIX : IP Flow Information Export
○ Summarizes network data packets to track IP actions
● We enrich the data and store it in an S3 bucket.
● S3 data is ingested into Druid.
● Serves analytics dashboards in a slice-and-dice manner.
● Used for an ML engine as well.
Druid in Numbers : 3+ years in Prod
[Dashboard screenshots: last 24 hrs of ingestion, cluster size]
Data Model for our Apps
● Analytics Apps as part of Nutanix Dashboard
● Customers can slice and dice data given some filters
● Multi-tenant Use Case
● Druid Data source per customer per use case
● Enable features for some data sources
○ Phased rollout for new Druid features
○ Druid Version Upgrades
○ App redesign requiring a change in Druid ingestion or query.
● Workflow engine (Temporal) for pipeline.
● Java based Workers backed by Postgres storage for state.
Change in Requirements
● Change in Requirement: Batch (3 hours) to 5 minutes
● Earlier:
○ Agent collects data, dumps to S3.
○ Cron runs every 3 hours, ingests from S3 to Druid
○ SLA : 3 hours
● New Design:
○ SLA : 15 minutes
○ Agent collects data, dumps to S3 every 5 minutes.
○ Ingestion Pipeline ingests into Druid at a pace Druid can handle.
○ Ingestion Pipeline absorbs backpressure.
● Release Plan
○ Data sources uploaded to cluster in a phased manner
Before: old batch system
Cron : 3 hrs
Change: Batch to near-real-time system
Nudge state machine, absorb backpressure
Cron : 5 mins
Batch to near-real-time system
Cron : 5 mins
Druid Ingestion Tasks
[Diagram: datasources onboarded to the Druid database in phases: Datasource 1, then 2 and 3, up to Datasource N]
Proof of the Pudding!
Proof of the Pudding (2)!
Proof of the Pudding (3)!
Summary: When Druid was struggling (Overlord on fire)
● Ingested smaller but more frequent tasks.
● Onboarding a few large datasources was fine for a day.
● More confidence 
● Onboarded all datasources at once:
○ Task queue kept increasing (up to 25K). Overlord overwhelmed after 5K.
○ Soon, Overlord machine CPU usage was at 100%.
● All the tasks were stuck in pending state.
● Task count was 12x higher than before, but each task was smaller.
● Middle Managers were sitting idle, no incoming tasks.
● Task states were not updating properly as the Overlord was overwhelmed.
Druid Overlord
Getting the Overlord Alive
[Diagram: Overlord process, first on a bigger VM, then also with a bigger metadata DB instance]
Handling the Overlord…
● Vertically scaled the Overlord. Didn’t help! No support for horizontal scaling.
● Changed configs:
druid.indexer.runner.type : httpRemote (no ZK for task assignment)
druid.indexer.queue.maxSize : 5000 (throttle, don’t give up)
● Set max pending tasks per datasource for an interval to 1
GET /druid/indexer/v1/pendingTasks?datasource=ds1
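The per-datasource throttle can be sketched as follows. This is a simplified illustration, assuming a caller-supplied function wrapping the `pendingTasks` endpoint above and a task-submit hook; both are injected so the logic stays testable without a live cluster, and the names are hypothetical:

```python
# Throttle sketch: submit a new ingestion task for a datasource only when it
# has no pending task for the target interval (max pending per interval = 1).
# `get_pending_tasks` would wrap GET /druid/indexer/v1/pendingTasks?datasource=...
# and `submit_task` would POST the task spec to the Overlord.
def maybe_submit(datasource, interval, get_pending_tasks, submit_task):
    pending = get_pending_tasks(datasource)
    if any(t.get("interval") == interval for t in pending):
        return False  # back off: Druid already has work queued for this interval
    submit_task(datasource, interval)
    return True

# Stubbed usage: ds1 already has a pending task for the 09:00 interval,
# so only the 10:00 interval gets submitted.
pending_state = {"ds1": [{"id": "task1", "interval": "2023-01-01T09/PT1H"}]}
submitted = []
fetch = lambda ds: pending_state.get(ds, [])
submit = lambda ds, iv: submitted.append((ds, iv))

maybe_submit("ds1", "2023-01-01T09/PT1H", fetch, submit)  # skipped
maybe_submit("ds1", "2023-01-01T10/PT1H", fetch, submit)  # submitted
```

This is how the pipeline "absorbs backpressure": skipped intervals simply stay queued on the pipeline side until Druid has capacity.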
Filed GitHub issues so you don’t hit these…
Fixed Issues (PR)
Making the DB functional…
● Queries from the Overlord to Postgres for task metadata were taking a long time.
● Added more CPU to the DB server.
● Improvements:
○ Lower Overlord CPU utilization
○ Fewer pending tasks
○ Task slot utilization graph looks stable
Scaling Middle Managers
[Diagram: Middle Managers running Peon processes, scaled out with more VMs]
Tiering Middle Managers
[Diagram: bigger compaction tasks moved to a separate tier; more slots per Middle Manager, then right-sized to fewer VMs]
Summary : Scaling Middle Managers
● Increased the number of Middle Managers so that more task slots were available for the Overlord to assign tasks.
● Then increased the number of slots per Middle Manager, as the new tasks were small, i.e. had fewer files to ingest.
● Created a separate tier for compaction, as these tasks took more resources than the index tasks.
● Then right-sized the Middle Manager count in each tier by reducing it.
12 MMs * 5 slots => 24 MMs * 5 slots
24 MMs * 5 slots => 12 MMs * 10 slots
12 MMs * 10 slots => 10 MMs * 10 slots + 2 MMs * 5 slots (tiering)
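The slot arithmetic above is easy to sanity-check:

```python
# Total task slots at each stage of the Middle Manager scaling journey
# described in the summary above.
stages = {
    "initial": 12 * 5,               # 12 MMs * 5 slots  = 60
    "scaled_out": 24 * 5,            # 24 MMs * 5 slots  = 120
    "denser": 12 * 10,               # 12 MMs * 10 slots = 120 (same capacity, fewer VMs)
    "right_sized": 10 * 10 + 2 * 5,  # index tier + compaction tier = 110
}
```

Note that moving from 24 small MMs to 12 denser ones keeps total capacity constant; the final tiered layout trades a little capacity for isolation of compaction work.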
Coordinator Issues
Summary of Coordinator crisis…
● Happy Overlord.
● But issues in Coordinator now:
○ Huge number of small segments.
○ Unavailable segments count increasing.
○ Coordinator CPU usage increasing
○ Coordinator cycle is taking too long to complete
Fixing Coordinator
Handling the Coordinator
[Diagram: Coordinator process on a bigger VM, same (already scaled-up) metadata DB]
Handling the Coordinator…
● Increased the Coordinator instance type, as it is not horizontally scalable.
● Tried a few coordinator dynamic configs:
maxSegmentsToMove: 1000
percentOfSegmentsToConsiderPerMove: 25
(reduces the number of segments considered per coordinator cycle)
useRoundRobinSegmentAssignment: true
(assign segments in round-robin fashion first; lazily reassign with the chosen balancer strategy later)
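The dynamic configs above go to the Coordinator as a single JSON payload. A sketch of building that payload (the POST itself, to `/druid/coordinator/v1/config` on the Coordinator host, is left out so the snippet stays self-contained):

```python
import json

# Coordinator dynamic config payload combining the settings from the slides.
# In practice this JSON is POSTed to the Coordinator's dynamic-config endpoint:
#   http://<coordinator>/druid/coordinator/v1/config
dynamic_config = {
    "maxSegmentsToMove": 1000,
    "percentOfSegmentsToConsiderPerMove": 25,
    "useRoundRobinSegmentAssignment": True,
}
payload = json.dumps(dynamic_config)
```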
Handling the Coordinator…
● We saw this error in coordinator logs during auto compaction for many datasources:
“is larger than inputSegmentSize[2147483648]”
● Removing this setting from the auto compaction config resolved the issue:
inputSegmentSizeBytes: 100TB
● This is no longer an issue from Druid 25 onwards.
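The fix amounts to dropping the `inputSegmentSizeBytes` key from each datasource's auto-compaction config. A hedged sketch, with the config shape abbreviated and the datasource name hypothetical:

```python
# Remove inputSegmentSizeBytes from an auto-compaction config so the
# coordinator no longer rejects intervals "larger than inputSegmentSize[...]".
# (Not needed from Druid 25 onwards, where this limit was removed.)
def drop_input_segment_size(compaction_config: dict) -> dict:
    cleaned = dict(compaction_config)  # copy, don't mutate the original
    cleaned.pop("inputSegmentSizeBytes", None)
    return cleaned

# Hypothetical abbreviated config for one datasource, with the 100 TB setting.
cfg = {"dataSource": "ds1", "inputSegmentSizeBytes": 100 * 1024**4}
cleaned = drop_input_segment_size(cfg)
```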
Handling the Historicals
● Until auto compaction is done:
○ More segments to scan per query
○ More processing power needed on Historicals
● Cold data has HIGHER segment granularity
○ Compaction done!
● Hot data has LOWER segment granularity
○ Compaction NOT done yet!
[Diagram: queries for recent data hit current Historicals holding smaller, uncompacted segments; older Historicals serve larger, compacted segments]
Happy State!!!
Summary
● Once we stabilized both the Druid ingestion and query pipelines, we onboarded all customers in a phased manner.
● Set an optimal task queue size.
● Increased the Middle Manager count to absorb the initial burst of tasks.
● Right-sized the Overlord and Coordinator once onboarding was complete.
● Know the Overlord and Coordinator settings well.
Thank You
Questions?
Shivji Kumar Jha
linkedin.com/in/shivjijha/
slideshare.net/shiv4289/presentations/
youtube.com/@shivjikumarjha
Sachidananda Maharana
https://www.linkedin.com/in/sachidanandamaharana/
