Analyzing Petabyte Scale Financial Data with Apache Pinot and Apache Kafka
Xiaoman Dong @Stripe
Joey Pereira @Stripe
Agenda
● Tracking funds at Stripe
● Quick intro on Pinot
● Challenges: scale and latency
● Optimizations for a large table
Tracking funds at Stripe
Stripe is complicated
Tracking funds at Stripe
Ledger, the financial source of truth
● Unified data format for financial activity
● Exhaustively covers all activity
● Centralized observability
Tracking funds at Stripe
Modelling as state machines
Successful payment
Tracking funds at Stripe
Observability
Transaction-level investigation
● What action caused the transition
● Why it transitioned
● When it transitioned
● Looking at transitions across multiple systems and teams
Tracking funds at Stripe
Modelling as state machines
Incomplete states are balances
Tracking funds at Stripe
Observability
Aggregating state balances
Tracking funds at Stripe
Observability
Detection
[Chart: amount ($$) by date of state’s first transition]
Tracking funds at Stripe
Query patterns
● Look up one state transition
○ by ID or other properties
● Look up one state, inspect it
○ listing transitions with sorting, paging, and summaries
● Aggregate many states
This is easy... until we have:
● Hundreds of billions of rows
● States with hundreds of millions of transitions
● Need for fresh, real-time data
● Queries with sub-second latency, serving interactive UI
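For illustration, here is a minimal sketch of the lookup and aggregation patterns issued through the open-source pinotdb Python client. The broker host, table name (transitions), and columns (transition_id, state_id, amount) are hypothetical placeholders, not our actual schema.

```python
from pinotdb import connect

# Connect to a Pinot broker (host and port are placeholders).
conn = connect(host="pinot-broker.internal", port=8099,
               path="/query/sql", scheme="http")
cur = conn.cursor()

# Pattern 1: look up one state transition by ID.
cur.execute("SELECT * FROM transitions WHERE transition_id = 'trn_123' LIMIT 1")
print(cur.fetchone())

# Pattern 3: aggregate many states at once.
cur.execute("""
    SELECT state_id, SUM(amount) AS balance
    FROM transitions
    GROUP BY state_id
    ORDER BY balance DESC
    LIMIT 10
""")
for state_id, balance in cur.fetchall():
    print(state_id, balance)
```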
World before Pinot
Tracking funds at Stripe
Two complicated systems
World with Pinot
Tracking funds at Stripe
● One system for serving all cases
● Simple and elegant
● No more multiple copies of data
Quick intro on Pinot
Pinot Distributed Architecture
* (courtesy of blog https://www.confluent.io/blog/real-time-analytics-with-kafka-and-pinot/ )
Our challenges
Query Latency
Data Freshness
Data Scale
Challenge #1:
Data Scale: the Largest Single Table in Pinot
One cluster to serve all major queries
Huge tables
● Each with hundreds of billions of rows
● 700 TB storage on disk, after 2x replication
Pinot numbers
● Offline segments: ~60k segments per table
● Real-time table: 64 partitions
Hosted on AWS EC2 instances
● ~1000 small hosts (4000 vCPUs) with attached SSDs
● Instance config selected based on performance and cost
The largest Pinot table in the world!
Challenge #2:
Data freshness: Kafka Ingestion
What Pinot + Kafka Brings
The Pinot broker provides a merged view of offline and real-time data
● Real-time Kafka ingestion delivers second-level data freshness
● The merged view lets us query the whole data set as one single table
Financial Data in Real Time (1/2)
Avoiding duplication is critical for financial systems
● A Flink deduplication job as upstream
● Exactly-once Kafka sink used in Flink
Exactly-once from Flink to Pinot
● Kafka transactional consumer enabled in Pinot
● Atomic update of Kafka offset and Pinot segment
● Result: 1:1 mapping from Flink output to Pinot
● No extra effort needed for us
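On the Pinot side, honoring Kafka transactions is a consumer-level setting. Below is an illustrative fragment of a real-time table’s streamConfigs, written as a Python dict; the topic and broker values are placeholders, and exact property names can vary by Pinot version.

```python
# Illustrative fragment of a Pinot REALTIME table config (placeholder values).
stream_configs = {
    "streamType": "kafka",
    "stream.kafka.topic.name": "ledger-transitions",   # placeholder topic
    "stream.kafka.consumer.type": "lowlevel",
    "stream.kafka.broker.list": "kafka:9092",          # placeholder brokers
    "stream.kafka.consumer.factory.class.name":
        "org.apache.pinot.plugin.stream.kafka20.KafkaConsumerFactory",
    # Only read committed transactional messages, so Flink's exactly-once
    # sink output maps 1:1 onto ingested Pinot rows.
    "stream.kafka.isolation.level": "read_committed",
}
```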
Financial Data in Real Time (2/2)
● Alternative solution: deduplication within Pinot directly
○ Pinot’s real-time upsert feature is a nice option to explore
○ Sustained 200k+ QPS into the Pinot offline table in our experiments
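For comparison, a hedged sketch of what that in-Pinot alternative looks like: an upsert table declares a primary key in the schema and an upsertConfig in the table config. All names below are hypothetical.

```python
# Illustrative fragments for Pinot real-time upsert (placeholder names).
schema_fragment = {
    "schemaName": "transitions",
    "primaryKeyColumns": ["transition_id"],  # deduplication key
}

table_config_fragment = {
    "tableName": "transitions_REALTIME",
    # Later rows with the same primary key replace earlier ones.
    "upsertConfig": {"mode": "FULL"},
}
```

Upsert also requires that rows with the same key land in the same Kafka partition, i.e., producers must key their writes.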
Challenge #3:
Drive Down the Query Latency
Optimizations Applied (1/4)
● Partitioning - Hashing data across Pinot servers
○ The most powerful optimization tool in Pinot
○ Map partitions to servers: Pinot becomes a key-value store
Depending on query type, partitioning can improve query latency by 2x~10x.
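A minimal sketch of the two table-config pieces involved (the column name and partition count are placeholders): the declared partition function must match how the data was actually partitioned when the segments were built, and the partition pruner lets the broker skip servers that cannot contain the filtered key.

```python
# Illustrative fragment of a Pinot table config for partition-based pruning.
table_config_fragment = {
    "tableIndexConfig": {
        "segmentPartitionConfig": {
            "columnPartitionMap": {
                # 'account_id' is a hypothetical partitioning column.
                "account_id": {"functionName": "Murmur", "numPartitions": 64}
            }
        }
    },
    "routing": {
        # Prune segments whose partition cannot match the query filter,
        # so a key lookup touches only one server, key-value style.
        "segmentPrunerTypes": ["partition"]
    },
}
```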
Optimizations Applied (2/4)
● Sorting - Organize data between segments
○ Sorting is powerful when done in the Spark ETL job; we can arrange how the rows are divided into segments
○ Column min/max values can help avoid scanning segments
○ Grouping the same value into the same segment can reduce storage cost and speed up pre-aggregations
In our production data set, sorting roughly improves aggregation query latency by 2x.
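As a sketch of the ETL side (column names and paths are hypothetical): repartitioning by the lookup key and sorting within partitions keeps each key in as few segments as possible, which is what makes the min/max pruning and pre-aggregation effective.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pinot-segment-prep").getOrCreate()

df = spark.read.parquet("s3://ledger/transitions/")  # placeholder input

# Co-locate all rows of an account in one output partition (=> one segment
# after segment build), then order them so the column is sorted on disk and
# segment min/max metadata becomes tight enough to prune whole segments.
(df.repartition(4096, "account_id")                   # placeholder partition count
   .sortWithinPartitions("account_id", "transition_ts")
   .write.parquet("s3://ledger/transitions_sorted/"))  # placeholder output
```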
Optimization Applied (3/4)
● Bloom filter - Quickly prune out a Pinot segment
○ Best friend of key-value-style lookup queries
○ Works best when there are very few hits in the filter
○ Configurable in Pinot: control the false-positive rate or the total size
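A sketch of that knob (the column name and limits are placeholders): Pinot lets you bound either the false-positive rate or the filter’s size per column.

```python
# Illustrative fragment: bloom filter on a hypothetical lookup column.
table_index_config_fragment = {
    "bloomFilterConfigs": {
        "transition_id": {
            "fpp": 0.03,                # target false-positive probability
            "maxSizeInBytes": 1048576,  # cap the filter at 1 MiB
            "loadOnHeap": False,
        }
    }
}
```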
Optimization Applied (4/4)
● Pre-aggregation by star-tree index
○ Pinot supports a specialized pre-aggregation called the “star-tree index”
○ Pre-aggregates several columns to avoid computation at query time
○ The star-tree index balances disk space against query time for aggregations with multiple dimensions
Query latency improvement (accounts with billion-level transactions): ~30 seconds vs. 300 milliseconds
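A hedged sketch of a star-tree definition (dimension and metric names are hypothetical): SUMs are pre-materialized across combinations of the listed dimensions, trading disk space for query time exactly as described above.

```python
# Illustrative fragment: star-tree index over hypothetical columns.
star_tree_index_configs = [
    {
        # Dimensions the tree can aggregate across.
        "dimensionsSplitOrder": ["account_type", "currency", "day"],
        "skipStarNodeCreationForDimensions": [],
        # Pre-computed aggregations, expressed as FUNCTION__column pairs.
        "functionColumnPairs": ["SUM__amount", "COUNT__*"],
        # Stop splitting once a node covers this few records.
        "maxLeafRecords": 10000,
    }
]
```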
The Combined Power of Four Optimizations
● Together they can reduce query latency to sub-second for any large table
○ Works well for our hundreds of billions of rows
○ Most of the time, tables are small and only some of the optimizations are needed
● We chose the optimizations to speed up all 5 production queries
○ Some queries need only the bloom filter
○ Partitioning and sorting are applied for critical queries
Real time ingestion needs extra care
Optimizing real time ingestion (1/2)
With 3 days of real-time data in Pinot, we saw 2~3 seconds of added latency
● Pinot real-time segments are often very small
● The number of real-time servers is limited by the Kafka partition count (max 64 servers in our case)
● Each real-time server ends up with many small segments
● Real-time servers have high I/O and high CPU during queries
Optimizing real time ingestion (2/2)
Latency returned to sub-second after adopting tiered storage
● Tiered storage assigns segments to different storage hosts based on time
● Moves real-time segments onto dedicated servers as soon as possible
● Uses more servers to process queries over real-time segments
● Avoids query slowdowns caused by back pressure on Kafka consumers
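For reference, a sketch of a time-based tier in the table config (tag name and age threshold are placeholders): segments older than the threshold are relocated to servers carrying the given tag.

```python
# Illustrative fragment: move older segments off the real-time servers.
tier_configs = [
    {
        "name": "agedTier",
        "segmentSelectorType": "time",
        "segmentAge": "3d",            # placeholder: segments older than 3 days
        "storageType": "pinot_server",
        "serverTag": "aged_OFFLINE",   # placeholder instance tag
    }
]
```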
Production Query Latency Chart
Hundreds of billions of rows, ~700 TB of data, all at sub-second latency.
Financial Precision
● Precise numbers are critical for financial data processing
● Java BigDecimal is the answer for Pinot
● Pinot supports BigDecimal via BINARY columns (currently)
○ Computation (e.g., sum) is done by UDF-style scalar functions
○ The star-tree index can be applied to BigDecimal columns
○ Works for all our use cases
○ No significant performance penalty observed
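To make the BINARY-column approach concrete, here is a hedged Python sketch of one plausible encoding (scale plus unscaled value). This is illustrative only, not necessarily Pinot’s internal byte layout; whatever format is chosen must be shared by the writers and the UDF-style scalar functions that compute over it.

```python
from decimal import Decimal

def encode_decimal(value: Decimal) -> bytes:
    """One plausible exact-decimal encoding: 2-byte scale, then the
    unscaled value as big-endian two's complement (illustrative only)."""
    scale = -value.as_tuple().exponent
    unscaled = int(value.scaleb(scale))                 # value * 10**scale, exactly
    length = max(1, (unscaled.bit_length() + 8) // 8)   # leave room for the sign bit
    return (scale.to_bytes(2, "big", signed=True)
            + unscaled.to_bytes(length, "big", signed=True))

def decode_decimal(blob: bytes) -> Decimal:
    scale = int.from_bytes(blob[:2], "big", signed=True)
    unscaled = int.from_bytes(blob[2:], "big", signed=True)
    return Decimal(unscaled).scaleb(-scale)

assert decode_decimal(encode_decimal(Decimal("19.99"))) == Decimal("19.99")
```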
Conclusion
With Pinot and Kafka working together, we have created the largest Pinot table in the world, representing financial funds-flow graphs.
● Hundreds of billions of edges
● Seconds of data freshness
● Precise financial number support
● Exactly-once Kafka semantics
● Sub-second query latency
Future Plans
● Reduce hardware cost by applying tiered storage to the offline table
○ Use HDD-based hosts for months-old data
● Multi-region Pinot cluster
● Try out many of Pinot’s exciting new features
Thanks and Questions (We are hiring!)
(Backup Slides)
Summarizing
Tracking funds at Stripe
● Ledger models financial activity as state machines
● Transitions are immutable, append-only logs in Kafka
● Everything is transaction-level
● Incomplete states are represented by balances
● Two core use cases: transaction-level queries and aggregation analytics
● The current system is unscalable and complex
Pinot and Kafka work in synergy
Detect problems in hundreds of billions of rows (cont’d)
How do we detect issues in a graph of half a trillion nodes?
1) Sum all money in/out per node; focus only on nodes with a non-zero sum
Now we have 20 million nodes with non-zero sums; how do we analyze them?
2) Group by:
a) Day of first transaction seen -- a time series
b) Sign of the sum (negative/positive flow)
c) Node properties such as type
We now have a time series plus fields we can slice and dice: an OLAP cube.
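As an illustration, the group-by step could be expressed as a single Pinot SQL aggregation over a pre-computed per-node balance table. Table and column names are hypothetical.

```python
# Hypothetical Pinot SQL for step 2: turn non-zero nodes into an OLAP cube
# sliced by time, flow sign, and node type.
DETECTION_QUERY = """
SELECT
  first_txn_day,
  CASE WHEN balance > 0 THEN 'positive' ELSE 'negative' END AS flow_sign,
  node_type,
  COUNT(*)     AS nodes,
  SUM(balance) AS stuck_amount
FROM node_balances          -- hypothetical per-node rollup (output of step 1)
WHERE balance <> 0
GROUP BY first_txn_day,
         CASE WHEN balance > 0 THEN 'positive' ELSE 'negative' END,
         node_type
"""
```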
Modelling as state machines
Tracking funds at Stripe
[Diagram: transitions and the resulting state balances]
Modelling as state machines
Balances of incomplete payment
Tracking funds at Stripe
Modelling as state machines
Balances of successful payment
Tracking funds at Stripe
Observability
Aggregating state balances
Tracking funds at Stripe
Why is this challenging?
Tracking funds at Stripe
● Data volume: handling hundreds of billions of records
● Data freshness: getting real-time processing
● Query latency: making analytics usable for interactive internal UIs
● Achieving all three at once: difficult!
Modelling as state machines
Dozens and dozens of states
Tracking funds at Stripe
Double-Entry Bookkeeping
● Internal funds flow is represented by a directed graph
● Graph edges are recorded as double-entry bookkeeping entries
● Nodes in the graph are modeled as accounts
● Accounts should eventually have zero balances
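A tiny Python sketch of that invariant (account names are made up): every movement of funds is recorded as a balancing debit/credit pair, so a pass-through account that has fully cleared nets to zero.

```python
from collections import defaultdict
from decimal import Decimal

# Each funds-flow edge is recorded as two entries: a debit and a credit.
entries = [
    ("customer_card",   Decimal("-19.99")), ("stripe_clearing", Decimal("19.99")),
    ("stripe_clearing", Decimal("-19.99")), ("merchant_payout", Decimal("19.99")),
]

balances = defaultdict(Decimal)
for account, amount in entries:
    balances[account] += amount

# A cleared pass-through account sums to zero; a persistent non-zero
# balance is exactly the "stuck funds" signal we look for.
assert balances["stripe_clearing"] == Decimal("0")
```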
Detect problems in hundreds of billions of rows
Money in/out of a graph node should sum to zero (“cleared”).
Stuck funds over time = Revenue Loss
● One card swipe could create 10+ nodes
● Hundreds of billions of unique nodes, and increasing
Lessons Learned
● Metadata becomes heavy for huge tables
○ An O(n²) algorithm is no good when processing 60k segments
○ Avoid sending 1k+ segment names across 100+ servers
○ Metadata handling is important when aiming for sub-second latency
● Tail effect on p99/p95 latencies when we have 1000 servers
○ Occasional hiccups on a server become high-probability events and drag down p99/p95 query latency
○ Keep the number of servers queried as small as possible (partitioning, server grouping, etc.)
Clearing Time Series (Exploring)
Pinot Segment File Storage
Financial Data in Real Time (1/2)
● We have an upstream Flink deduplication job in place
● No duplication allowed
○ Pinot’s real-time primary key is a nice option to explore
○ Sustained 200k+ QPS into Pinot offline tables in our deduplication experiments (after optimization)
○ An upstream Flink deduplication job may be the best choice
● Exactly-once consumption from Kafka to Pinot
○ Kafka transactional consumer enabled in Pinot
○ 1:1 mapping of Kafka messages to table rows
○ Critical for financial data processing
Table Design Optimization Iterations
● It takes 2~3 days for the Spark ETL job to process the full data set
● Scale up only after the design is optimized
○ Shadow production queries
○ Rebuild the whole data set when needed
● General rule of thumb: the fewer segments scanned, the better
Kafka Ingestion Optimization (2/2)
● Partitioning/sharding in real-time tables (experimented; see the sketch after this list)
○ Needs a streaming job to shuffle the Kafka topic by key
○ Helps query performance for the real-time table
○ Worth adopting
● Merging small segments into larger segments
○ Needs a cron-style job to do the work
○ Helps pruning and scanning
○ Not a bottleneck for us
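A minimal sketch of such a shuffle using the confluent-kafka Python client (topic names and the key field are placeholders). Note that this plain consume/re-produce loop ignores delivery guarantees; our production pipeline does the equivalent in Flink.

```python
import json
from confluent_kafka import Consumer, Producer

consumer = Consumer({
    "bootstrap.servers": "kafka:9092",              # placeholder brokers
    "group.id": "ledger-shuffle",
    "auto.offset.reset": "earliest",
})
producer = Producer({"bootstrap.servers": "kafka:9092"})
consumer.subscribe(["ledger-transitions-raw"])      # placeholder source topic

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    record = json.loads(msg.value())
    # Re-produce keyed by account: Kafka's default partitioner hashes the key,
    # so all rows for one account land in the same partition of the new topic,
    # lining up with the partition column configured on the Pinot table.
    producer.produce("ledger-transitions-by-account",  # placeholder target topic
                     key=record["account_id"], value=msg.value())
    producer.poll(0)  # serve delivery callbacks
```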
