Analyzing Petabyte Scale Financial Data with Apache Pinot and Apache Kafka
Xiaoman Dong @Stripe
Joey Pereira @Stripe
Agenda
● Tracking funds at Stripe
● Quick intro on Pinot
● Challenges: scale and latency
● Optimizations for a large table
Tracking funds at Stripe
Stripe is complicated
Tracking funds at Stripe
Ledger, the financial source of truth
● Unified data format for financial activity
● Exhaustively covers all activity
● Centralized observability
Tracking funds at Stripe
Modelling as state machines
Successful payment
Tracking funds at Stripe
Observability
Transaction-level investigation
● What action caused the transition
● Why it transitioned
● When it transitioned
● Looking at transitions across multiple systems and teams
Tracking funds at Stripe
Modelling as state machines
Incomplete states are balances
Tracking funds at Stripe
Observability
Aggregating state balances
Tracking funds at Stripe
Observability
Detection
[Chart: amount ($$) by date of state’s first transition]
Tracking funds at Stripe
Query patterns
● Look up one state transition
○ by ID or other properties
● Look up one state, inspect it
○ listing transitions with sorting, paging, and summaries
● Aggregate many states
This is easy... until we have:
● Hundreds of billions of rows
● States with hundreds of millions of transitions
● Need for fresh, real-time data
● Queries with sub-second latency, serving interactive UI
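For illustration, here is a minimal sketch of the lookup and aggregation patterns issued through the open-source pinotdb Python client. The broker host, table name (transitions), and columns (transition_id, state_id, amount) are hypothetical placeholders, not our actual schema.

```python
from pinotdb import connect

# Connect to a Pinot broker (host and port are placeholders).
conn = connect(host="pinot-broker.internal", port=8099,
               path="/query/sql", scheme="http")
cur = conn.cursor()

# Pattern 1: look up one state transition by ID.
cur.execute("SELECT * FROM transitions WHERE transition_id = 'trn_123' LIMIT 1")
print(cur.fetchone())

# Pattern 3: aggregate many states at once.
cur.execute("""
    SELECT state_id, SUM(amount) AS balance
    FROM transitions
    GROUP BY state_id
    ORDER BY balance DESC
    LIMIT 10
""")
for state_id, balance in cur.fetchall():
    print(state_id, balance)
```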
World before Pinot
Tracking funds at Stripe
Two complicated systems
World with Pinot
Tracking funds at Stripe
● One system for serving all cases
● Simple and elegant
● No more multiple copies of data
Quick intro on Pinot
Pinot Distributed Architecture
* (courtesy of blog https://www.confluent.io/blog/real-time-analytics-with-kafka-and-pinot/ )
Our challenges
Query Latency
Data Freshness
Data Scale
Challenge #1:
Data Scale: the Largest Single Table in Pinot
One cluster to serve all major queries
Huge tables
● Each with hundreds of billions of rows
● 700 TB storage on disk, after 2x replication
Pinot numbers
● Offline segments: ~60k segments per table
● Real-time table: 64 partitions
Hosted on AWS EC2 instances
● ~1000 small hosts (4000 vCPUs) with attached SSDs
● Instance config selected based on performance and cost
The largest Pinot table in the world!
Challenge #2:
Data freshness: Kafka Ingestion
What Pinot + Kafka Brings
The Pinot broker provides a merged view of offline and real-time data
● Real-time Kafka ingestion delivers second-level data freshness
● The merged view lets us query the whole data set as one single table
Financial Data in Real Time (1/2)
Avoiding duplication is critical for financial systems
● A Flink deduplication job as upstream
● Exactly-once Kafka sink used in Flink
Exactly-once from Flink to Pinot
● Kafka transactional consumer enabled in Pinot
● Atomic update of Kafka offset and Pinot segment
● Result: 1:1 mapping from Flink output to Pinot
● No extra effort needed for us
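On the Pinot side, honoring Kafka transactions is a consumer-level setting. Below is an illustrative fragment of a real-time table’s streamConfigs, written as a Python dict; the topic and broker values are placeholders, and exact property names can vary by Pinot version.

```python
# Illustrative fragment of a Pinot REALTIME table config (placeholder values).
stream_configs = {
    "streamType": "kafka",
    "stream.kafka.topic.name": "ledger-transitions",   # placeholder topic
    "stream.kafka.consumer.type": "lowlevel",
    "stream.kafka.broker.list": "kafka:9092",          # placeholder brokers
    "stream.kafka.consumer.factory.class.name":
        "org.apache.pinot.plugin.stream.kafka20.KafkaConsumerFactory",
    # Only read committed transactional messages, so Flink's exactly-once
    # sink output maps 1:1 onto ingested Pinot rows.
    "stream.kafka.isolation.level": "read_committed",
}
```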
Financial Data in Real Time (2/2)
● Alternative solution: deduplication within Pinot directly
○ Pinot’s real-time upsert feature is a nice option to explore
○ Sustained 200k+ QPS into the Pinot offline table in our experiments
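For comparison, a hedged sketch of what that in-Pinot alternative looks like: an upsert table declares a primary key in the schema and an upsertConfig in the table config. All names below are hypothetical.

```python
# Illustrative fragments for Pinot real-time upsert (placeholder names).
schema_fragment = {
    "schemaName": "transitions",
    "primaryKeyColumns": ["transition_id"],  # deduplication key
}

table_config_fragment = {
    "tableName": "transitions_REALTIME",
    # Later rows with the same primary key replace earlier ones.
    "upsertConfig": {"mode": "FULL"},
}
```

Upsert also requires that rows with the same key land in the same Kafka partition, i.e., producers must key their writes.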
Challenge #3:
Drive Down the Query Latency
Optimizations Applied (1/4)
● Partitioning - Hashing data across Pinot servers
○ The most powerful optimization tool in Pinot
○ Map partitions to servers: Pinot becomes a key-value store
Depending on query type, partitioning can improve query latency by 2x~10x.
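A minimal sketch of the two table-config pieces involved (the column name and partition count are placeholders): the declared partition function must match how the data was actually partitioned when the segments were built, and the partition pruner lets the broker skip servers that cannot contain the filtered key.

```python
# Illustrative fragment of a Pinot table config for partition-based pruning.
table_config_fragment = {
    "tableIndexConfig": {
        "segmentPartitionConfig": {
            "columnPartitionMap": {
                # 'account_id' is a hypothetical partitioning column.
                "account_id": {"functionName": "Murmur", "numPartitions": 64}
            }
        }
    },
    "routing": {
        # Prune segments whose partition cannot match the query filter,
        # so a key lookup touches only one server, key-value style.
        "segmentPrunerTypes": ["partition"]
    },
}
```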
Optimizations Applied (2/4)
● Sorting - Organize data between segments
○ Sorting is powerful when done in the Spark ETL job; we can arrange how the rows are divided into segments
○ Column min/max values can help avoid scanning segments
○ Grouping the same value into the same segment can reduce storage cost and speed up pre-aggregations
In our production data set, sorting roughly improves aggregation query latency by 2x.
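As a sketch of the ETL side (column names and paths are hypothetical): repartitioning by the lookup key and sorting within partitions keeps each key in as few segments as possible, which is what makes the min/max pruning and pre-aggregation effective.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pinot-segment-prep").getOrCreate()

df = spark.read.parquet("s3://ledger/transitions/")  # placeholder input

# Co-locate all rows of an account in one output partition (=> one segment
# after segment build), then order them so the column is sorted on disk and
# segment min/max metadata becomes tight enough to prune whole segments.
(df.repartition(4096, "account_id")                   # placeholder partition count
   .sortWithinPartitions("account_id", "transition_ts")
   .write.parquet("s3://ledger/transitions_sorted/"))  # placeholder output
```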
Optimization Applied (3/4)
● Bloom filter - Quickly prune out a Pinot segment
○ Best friend of key-value-style lookup queries
○ Works best when there are very few hits in the filter
○ Configurable in Pinot: control the false-positive rate or the total size
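A sketch of that knob (the column name and limits are placeholders): Pinot lets you bound either the false-positive rate or the filter’s size per column.

```python
# Illustrative fragment: bloom filter on a hypothetical lookup column.
table_index_config_fragment = {
    "bloomFilterConfigs": {
        "transition_id": {
            "fpp": 0.03,                # target false-positive probability
            "maxSizeInBytes": 1048576,  # cap the filter at 1 MiB
            "loadOnHeap": False,
        }
    }
}
```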
Optimization Applied (4/4)
● Pre-aggregation by star-tree index
○ Pinot supports a specialized pre-aggregation called the “star-tree index”
○ Pre-aggregates several columns to avoid computation at query time
○ The star-tree index balances disk space against query time for aggregations with multiple dimensions
Query latency improvement (accounts with billion-level transactions): ~30 seconds vs. 300 milliseconds
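A hedged sketch of a star-tree definition (dimension and metric names are hypothetical): SUMs are pre-materialized across combinations of the listed dimensions, trading disk space for query time exactly as described above.

```python
# Illustrative fragment: star-tree index over hypothetical columns.
star_tree_index_configs = [
    {
        # Dimensions the tree can aggregate across.
        "dimensionsSplitOrder": ["account_type", "currency", "day"],
        "skipStarNodeCreationForDimensions": [],
        # Pre-computed aggregations, expressed as FUNCTION__column pairs.
        "functionColumnPairs": ["SUM__amount", "COUNT__*"],
        # Stop splitting once a node covers this few records.
        "maxLeafRecords": 10000,
    }
]
```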
The Combined Power of Four Optimizations
● Together they can reduce query latency to sub-second for any large table
○ Works well for our hundreds of billions of rows
○ Most of the time, tables are small and only some of the optimizations are needed
● We chose the optimizations to speed up all 5 production queries
○ Some queries need only the bloom filter
○ Partitioning and sorting are applied for critical queries
Real time ingestion needs extra care
Optimizing real time ingestion (1/2)
With 3 days of real-time data in Pinot, we saw 2~3 seconds of added latency
● Pinot real-time segments are often very small
● The number of real-time servers is limited by the Kafka partition count (max 64 servers in our case)
● Each real-time server ends up with many small segments
● Real-time servers have high I/O and high CPU during queries
Optimizing real time ingestion (2/2)
Latency returned to sub-second after adopting tiered storage
● Tiered storage assigns segments to different storage hosts based on time
● Moves real-time segments onto dedicated servers as soon as possible
● Uses more servers to process queries over real-time segments
● Avoids query slowdowns caused by back pressure on Kafka consumers
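For reference, a sketch of a time-based tier in the table config (tag name and age threshold are placeholders): segments older than the threshold are relocated to servers carrying the given tag.

```python
# Illustrative fragment: move older segments off the real-time servers.
tier_configs = [
    {
        "name": "agedTier",
        "segmentSelectorType": "time",
        "segmentAge": "3d",            # placeholder: segments older than 3 days
        "storageType": "pinot_server",
        "serverTag": "aged_OFFLINE",   # placeholder instance tag
    }
]
```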
Production Query Latency Chart
Hundreds of billions of rows, ~700 TB of data, all at sub-second latency.
Financial Precision
● Precise numbers are critical for financial data processing
● Java BigDecimal is the answer for Pinot
● Pinot supports BigDecimal via BINARY columns (currently)
○ Computation (e.g., sum) is done by UDF-style scalar functions
○ The star-tree index can be applied to BigDecimal columns
○ Works for all our use cases
○ No significant performance penalty observed
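To make the BINARY-column approach concrete, here is a hedged Python sketch of one plausible encoding (scale plus unscaled value). This is illustrative only, not necessarily Pinot’s internal byte layout; whatever format is chosen must be shared by the writers and the UDF-style scalar functions that compute over it.

```python
from decimal import Decimal

def encode_decimal(value: Decimal) -> bytes:
    """One plausible exact-decimal encoding: 2-byte scale, then the
    unscaled value as big-endian two's complement (illustrative only)."""
    scale = -value.as_tuple().exponent
    unscaled = int(value.scaleb(scale))                 # value * 10**scale, exactly
    length = max(1, (unscaled.bit_length() + 8) // 8)   # leave room for the sign bit
    return (scale.to_bytes(2, "big", signed=True)
            + unscaled.to_bytes(length, "big", signed=True))

def decode_decimal(blob: bytes) -> Decimal:
    scale = int.from_bytes(blob[:2], "big", signed=True)
    unscaled = int.from_bytes(blob[2:], "big", signed=True)
    return Decimal(unscaled).scaleb(-scale)

assert decode_decimal(encode_decimal(Decimal("19.99"))) == Decimal("19.99")
```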
Conclusion
With Pinot and Kafka working together, we have created the largest Pinot table in the world, representing financial funds-flow graphs.
● Hundreds of billions of edges
● Seconds of data freshness
● Precise financial number support
● Exactly-once Kafka semantics
● Sub-second query latency
Future Plans
● Reduce hardware cost by applying tiered storage to the offline table
○ Use HDD-based hosts for months-old data
● Multi-region Pinot cluster
● Try out many of Pinot’s exciting new features
Thanks and Questions (We are hiring!)
(Backup Slides)
Summarizing
Tracking funds at Stripe
● Ledger models financial activity as state machines
● Transitions are immutable, append-only logs in Kafka
● Everything is transaction-level
● Incomplete states are represented by balances
● Two core use cases: transaction-level queries and aggregation analytics
● The current system is unscalable and complex
Pinot and Kafka work in synergy
Detect problems in hundreds of billions of rows (cont’d)
How do we detect issues in a graph of half a trillion nodes?
1) Sum all money in/out per node; focus only on nodes with a non-zero sum
Now we have 20 million nodes with non-zero sums; how do we analyze them?
2) Group by:
a) Day of first transaction seen -- a time series
b) Sign of the sum (negative/positive flow)
c) Node properties such as type
We now have a time series plus fields we can slice and dice: an OLAP cube.
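As an illustration, the group-by step could be expressed as a single Pinot SQL aggregation over a pre-computed per-node balance table. Table and column names are hypothetical.

```python
# Hypothetical Pinot SQL for step 2: turn non-zero nodes into an OLAP cube
# sliced by time, flow sign, and node type.
DETECTION_QUERY = """
SELECT
  first_txn_day,
  CASE WHEN balance > 0 THEN 'positive' ELSE 'negative' END AS flow_sign,
  node_type,
  COUNT(*)     AS nodes,
  SUM(balance) AS stuck_amount
FROM node_balances          -- hypothetical per-node rollup (output of step 1)
WHERE balance <> 0
GROUP BY first_txn_day,
         CASE WHEN balance > 0 THEN 'positive' ELSE 'negative' END,
         node_type
"""
```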
Modelling as state machines
Tracking funds at Stripe
[Diagram: transitions and the resulting state balances]
Modelling as state machines
Balances of incomplete payment
Tracking funds at Stripe
Modelling as state machines
Balances of successful payment
Tracking funds at Stripe
Observability
Aggregating state balances
Tracking funds at Stripe
Why is this challenging?
Tracking funds at Stripe
● Data volume: handling hundreds of billions of records
● Data freshness: getting real-time processing
● Query latency: making analytics usable for interactive internal UIs
● Achieving all three at once: difficult!
Modelling as state machines
Dozens and dozens of states
Tracking funds at Stripe
Double-Entry Bookkeeping
● Internal funds flow is represented by a directed graph
● Graph edges are recorded as double-entry bookkeeping entries
● Nodes in the graph are modeled as accounts
● Accounts should eventually have zero balances
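A tiny Python sketch of that invariant (account names are made up): every movement of funds is recorded as a balancing debit/credit pair, so a pass-through account that has fully cleared nets to zero.

```python
from collections import defaultdict
from decimal import Decimal

# Each funds-flow edge is recorded as two entries: a debit and a credit.
entries = [
    ("customer_card",   Decimal("-19.99")), ("stripe_clearing", Decimal("19.99")),
    ("stripe_clearing", Decimal("-19.99")), ("merchant_payout", Decimal("19.99")),
]

balances = defaultdict(Decimal)
for account, amount in entries:
    balances[account] += amount

# A cleared pass-through account sums to zero; a persistent non-zero
# balance is exactly the "stuck funds" signal we look for.
assert balances["stripe_clearing"] == Decimal("0")
```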
Detect problems in hundreds of billions of rows
Money in/out of a graph node should sum to zero (“cleared”).
Stuck funds over time = Revenue Loss
● One card swipe could create 10+ nodes
● Hundreds of billions of unique nodes, and increasing
Lessons Learned
● Metadata becomes heavy for huge tables
○ An O(n²) algorithm is no good when processing 60k segments
○ Avoid sending 1k+ segment names across 100+ servers
○ Metadata handling is important when aiming for sub-second latency
● Tail effect on p99/p95 latencies when we have 1000 servers
○ Occasional hiccups on a server become high-probability events and drag down p99/p95 query latency
○ Keep the number of servers queried as small as possible (partitioning, server grouping, etc.)
Clearing Time Series (Exploring)
Pinot Segment File Storage
Financial Data in Real Time (1/2)
● We have an upstream Flink deduplication job in place
● No duplication allowed
○ Pinot’s real-time primary key is a nice option to explore
○ Sustained 200k+ QPS into Pinot offline tables in our deduplication experiments (after optimization)
○ An upstream Flink deduplication job may be the best choice
● Exactly-once consumption from Kafka to Pinot
○ Kafka transactional consumer enabled in Pinot
○ 1:1 mapping of Kafka messages to table rows
○ Critical for financial data processing
Table Design Optimization Iterations
● It takes 2~3 days for the Spark ETL job to process the full data set
● Scale up only after the design is optimized
○ Shadow production queries
○ Rebuild the whole data set when needed
● General rule of thumb: the fewer segments scanned, the better
Kafka Ingestion Optimization (2/2)
● Partitioning/sharding in real-time tables (experimented; see the sketch after this list)
○ Needs a streaming job to shuffle the Kafka topic by key
○ Helps query performance for the real-time table
○ Worth adopting
● Merging small segments into larger segments
○ Needs a cron-style job to do the work
○ Helps pruning and scanning
○ Not a bottleneck for us
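A minimal sketch of such a shuffle using the confluent-kafka Python client (topic names and the key field are placeholders). Note that this plain consume/re-produce loop ignores delivery guarantees; our production pipeline does the equivalent in Flink.

```python
import json
from confluent_kafka import Consumer, Producer

consumer = Consumer({
    "bootstrap.servers": "kafka:9092",              # placeholder brokers
    "group.id": "ledger-shuffle",
    "auto.offset.reset": "earliest",
})
producer = Producer({"bootstrap.servers": "kafka:9092"})
consumer.subscribe(["ledger-transitions-raw"])      # placeholder source topic

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    record = json.loads(msg.value())
    # Re-produce keyed by account: Kafka's default partitioner hashes the key,
    # so all rows for one account land in the same partition of the new topic,
    # lining up with the partition column configured on the Pinot table.
    producer.produce("ledger-transitions-by-account",  # placeholder target topic
                     key=record["account_id"], value=msg.value())
    producer.poll(0)  # serve delivery callbacks
```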
