Build User-Facing
Analytics Application
That Scales
Using StarRocks, a Linux Foundation Project
Albert Wong, albert.wong@celerdata.com
Agenda
● Trends and challenges in today's
user-facing analytics
● How StarRocks solves the challenges of
user-facing analytics
● Case studies of how StarRocks is being
used for user-facing analytics
Community User Quote
“Real-time analytics is like trying to drink
from a fire hose. The data is constantly
flowing, and it can be difficult to keep up.
But if you can do it, the insights you gain can
be invaluable.”
Profiting from near-real-time knowledge: the Battle of Waterloo, Nathan Rothschild, and the London Stock Exchange.
Trends in user-facing analytics
User-facing analytics (UFA) is a rapidly growing field that is transforming the way businesses
deliver insights to their users. UFA empowers users to explore and analyze data for themselves,
without the need for technical expertise. This can lead to a number of benefits, such as:
1. Improved decision making
2. Increased user engagement
3. Reduced reliance on IT
Key trends in user-facing analytics:
● Self-service Analytics
● Embedded Analytics
● Real-Time Analytics
● Augmented Analytics
● Conversational Analytics
Trends in OLAP databases
Online analytical processing (OLAP) databases are evolving rapidly to meet the demands of
modern data analytics. Here are some of the key trends in OLAP databases:
1. Cloud Native
2A. Sub-second vs. Second/Minute Query Response Time
2B. Streaming vs. Batch Data
2C. Mutable vs. Immutable Data
2D. Remote (Object) Storage vs. Local (SSD) Storage
2E. Open Table Format vs. Product Native Storage Format
3. Data Warehouse vs. Data Lake vs. Data Lakehouse
Trends in OLAP databases
Open Lakehouse vs Proprietary / Hybrid Lakehouse
[Diagram labels: Proprietary / Hybrid, Open, Open Storage; layers: Compute, Table Format, Storage Format.]
Challenges in user-facing analytics
User-facing analytics (UFA) is a powerful tool that can help businesses deliver insights to their
users, but it is not without its challenges. Here are some of the key challenges with UFA:
1. Ingestion speed and data updates: User-facing analytics requires high-quality
data that is available in real time. The logic and mathematics required to handle
unbounded data sets differ from those for static data sets.
2. Query response time: User-facing analytics needs to deliver insights
with sub-second response times, rather than queries that take seconds or minutes.
3. Scalability: User-facing analytics solutions need to scale to handle large volumes
of data and high concurrency. Performance and cost are always put to the test as a data
set grows over time.
4. Cost: User-facing analytics solutions can be expensive to implement and maintain.
How StarRocks solves the
challenges of user-facing
analytics
StarRocks is an open-source query engine that
delivers data warehouse performance on the
data lake.
StarRocks enables:
● Direct queries on the data lake with data warehouse performance.
● Sub-second joins and aggregations on billions of rows.
● Serving hundreds of thousands of concurrent end-user requests.
● No need for an external data transformation tool for denormalization or pre-aggregation.
● Compatibility with standard SQL protocol and Trino dialect.
● Significantly more queries per second at lower latency on less hardware.
● Increased flexibility and capability for your data lake or analytics system.
● Reduced cost and complexity by eliminating the need for costly data transformation pipelines.
Low Latency Queries Solution
Sub-second query with thousands of QPS and beyond
High Concurrency
Sub-second query with thousands of QPS and beyond
IO Bound Workload
● Less complex queries such as point
queries.
● Very High QPS.
● Typically requires latency in the low tens
of milliseconds on massive amounts of data
(billions of rows).
CPU Bound Workload
● Complex OLAP-style queries with JOINs
and AGGs.
● Interactive analytics.
● Reports, dashboards, etc.
High Concurrency Solution
IO Bound Workload
Scan the least amount of data possible
1. Bucketing: Fine-grained control over how the
data is distributed in your cluster.
● Avoid data skew.
● Choose a high cardinality column that
often appears in your `WHERE` clauses.
2. Indexing: Use the prefix index (order key), Bitmap index,
and Bloom Filter index to minimize disk access.
3. IO-optimized Hardware:
● Configure (multiple) high-performance
SSD drives.
● Good network in terms of latency and
throughput.
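The bucketing advice above can be illustrated with a toy model. The sketch below is plain Python, not StarRocks code: `bucket_of`, `rows_per_bucket`, `skew`, the sample column values, and the bucket count of 8 are all hypothetical. It only shows why hashing a high-cardinality column spreads rows evenly, while a low-cardinality column leaves some buckets overloaded and others empty.

```python
import hashlib
from collections import Counter

NUM_BUCKETS = 8  # hypothetical bucket count, chosen only for the illustration


def bucket_of(value: str, num_buckets: int = NUM_BUCKETS) -> int:
    """Map a bucketing-column value to a bucket using a stable hash."""
    digest = hashlib.sha256(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_buckets


def rows_per_bucket(values, num_buckets: int = NUM_BUCKETS):
    """Count how many rows land in each bucket."""
    counts = Counter(bucket_of(v, num_buckets) for v in values)
    return [counts.get(b, 0) for b in range(num_buckets)]


def skew(counts) -> float:
    """Largest bucket divided by the ideal even share (1.0 is perfectly even)."""
    return max(counts) / (sum(counts) / len(counts))


# 80,000 rows bucketed two ways (both columns are made-up sample data).
user_ids = [f"user-{i}" for i in range(80_000)]           # high cardinality
countries = ["US", "DE", "JP"] * 26_666 + ["US", "DE"]    # low cardinality

by_user = rows_per_bucket(user_ids)
by_country = rows_per_bucket(countries)

print("by user_id:", by_user, "skew = %.2f" % skew(by_user))
print("by country:", by_country, "skew = %.2f" % skew(by_country))
```

With only three distinct country values, at most three of the eight buckets can receive any rows, so the nodes holding those buckets do all the work for every scan.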
CPU Bound Workload
Pre-computation or reusing previous results to take load off
your CPUs
1. Intelligent Caching:
● Final Result Cache: For the folks who keep
hitting the same button just for fun.
● Query Cache (Intermediate Result Cache):
Reuse partial results even if the queries are not
identical.
2. Pre-computation:
● Denormalization through Column Partial
Update.
● Pre-aggregation through MVs and Aggregate
Table.
● Generated Column (3.1): Materialize a column.
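The idea of reusing partial results for non-identical queries can be sketched with a toy model. The code below is plain Python and makes no claim about how StarRocks implements its query cache: `PartialSumCache`, the per-day partitions, and the sample rows are hypothetical. It caches one partial aggregate per partition, so a second query with a different but overlapping range only scans the partitions it has not seen.

```python
from collections import defaultdict

# Toy fact table: (day, amount) rows, grouped into one "partition" per day.
ROWS = [(day, amount) for day in range(1, 8) for amount in (10, 20, 30)]

PARTITIONS = defaultdict(list)
for day, amount in ROWS:
    PARTITIONS[day].append(amount)


class PartialSumCache:
    """Caches SUM(amount) per partition so overlapping queries share work."""

    def __init__(self):
        self._partials = {}   # day -> cached partial sum
        self.scans = 0        # number of partitions actually scanned

    def _partial_sum(self, day):
        if day not in self._partials:
            self.scans += 1   # cache miss: scan this partition once
            self._partials[day] = sum(PARTITIONS[day])
        return self._partials[day]

    def total(self, first_day, last_day):
        """SUM(amount) over first_day..last_day, merged from per-partition partials."""
        return sum(self._partial_sum(d) for d in range(first_day, last_day + 1))


cache = PartialSumCache()
days_1_to_5 = cache.total(1, 5)   # scans days 1-5
days_3_to_7 = cache.total(3, 7)   # different query: reuses days 3-5, scans only 6-7
print(days_1_to_5, days_3_to_7, cache.scans)
```

A final result cache would only help if the second query were identical to the first; caching at the partition level is what lets the two different ranges share work.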
Fresh Mutable Data
Mutable data is important!
Existing solutions rely on merge on read, a huge
compromise on query performance.
Fresh Mutable Data Solution
Delete-and-insert mechanism with a Primary Key index.
Supports real-time mutable data with no compromise on query
performance.
Approach            Pros                                         Cons
Delete and insert   Simple, efficient for read-heavy workloads   Can be inefficient for write-heavy workloads
Merge on read       Can be efficient for write-heavy workloads   More complex than delete and insert
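The trade-off in the table can be made concrete with a small in-memory model. This is a sketch under simplifying assumptions, not StarRocks's storage engine: the class names, the `pk_index` dictionary, and the sample order rows are all hypothetical. It shows where each approach pays its cost: merge on read does the work at query time, while delete and insert does it at write time.

```python
class MergeOnReadTable:
    """Writes append new versions; every read must merge versions by key."""

    def __init__(self):
        self.versions = []                    # (key, value) in write order

    def upsert(self, key, value):
        self.versions.append((key, value))    # cheap write: append only

    def scan(self):
        latest = {}
        for key, value in self.versions:      # the read pays for the merge
            latest[key] = value
        return latest


class DeleteAndInsertTable:
    """Writes use a primary key index to retire the old row; reads skip retired rows."""

    def __init__(self):
        self.rows = []                        # (key, value) in write order
        self.deleted = set()                  # positions of retired rows
        self.pk_index = {}                    # key -> position of the live row

    def upsert(self, key, value):
        old_pos = self.pk_index.get(key)
        if old_pos is not None:
            self.deleted.add(old_pos)         # the write pays for the index lookup
        self.pk_index[key] = len(self.rows)
        self.rows.append((key, value))

    def scan(self):
        # No per-key merge: every row not marked deleted is the current version.
        return {k: v for pos, (k, v) in enumerate(self.rows) if pos not in self.deleted}


updates = [("order-1", "created"), ("order-2", "created"), ("order-1", "shipped")]
mor, dai = MergeOnReadTable(), DeleteAndInsertTable()
for key, value in updates:
    mor.upsert(key, value)
    dai.upsert(key, value)

print(mor.scan() == dai.scan())
```

Both tables return the same current state; the difference is that the delete-and-insert scan is a simple filter over rows, which is why it suits read-heavy, user-facing workloads.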
Resource Isolation
Why Resource Isolation is Needed:
● Maintaining Business-Critical Operations: Uninterrupted and undisturbed performance for key tasks.
● Quality of Service Assurance: Resource assurance for varying user groups in a multi-tenant environment.
Current Approach: Physical Isolation:
● Pros:
○ Effective isolation of resources.
● Cons:
○ Low Hardware Utilization: Results in underutilized resources due to lack of sharing or
overprovisioning.
○ High Costs: Provisioning for each user group's peak demand leads to idle resources and escalating
costs, especially in large-scale operations.
○ Gets worse when you grow to tens of thousands of tenants.
Resource Isolation Solution
Resource Group: Supports over-provisioning of CPU resources.
Three types of workloads defined:
● Short query: Time sensitive queries with dedicated resource (no over-provisioning).
● Query queue: Queries that are not time sensitive but are critical (cannot be killed).
● The rest: supports customized rules to kill big queries.
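The three workload types above can be read as a simple routing policy. The following sketch is a toy Python model, not StarRocks resource group syntax or behavior: the `Query` fields, the `route` function, and the 60 CPU-second kill threshold are assumptions made up for the illustration.

```python
from dataclasses import dataclass


@dataclass
class Query:
    name: str
    time_sensitive: bool
    critical: bool
    cpu_seconds: float          # hypothetical estimated cost of the query


BIG_QUERY_CPU_SECONDS = 60.0    # hypothetical kill threshold for "the rest"


def route(query: Query) -> str:
    """Return the workload class and policy for a query, following the three types."""
    if query.time_sensitive:
        return "short_query: dedicated CPU, no over-provisioning"
    if query.critical:
        return "query_queue: waits for resources, never killed"
    if query.cpu_seconds > BIG_QUERY_CPU_SECONDS:
        return "rest: killed by big-query rule"
    return "rest: runs on shared CPU"


queries = [
    Query("dashboard_tile", time_sensitive=True, critical=False, cpu_seconds=0.2),
    Query("nightly_billing", time_sensitive=False, critical=True, cpu_seconds=900.0),
    Query("runaway_adhoc", time_sensitive=False, critical=False, cpu_seconds=300.0),
    Query("small_adhoc", time_sensitive=False, critical=False, cpu_seconds=3.0),
]
for q in queries:
    print(q.name, "->", route(q))
```

The point of the model is that only the first class reserves hardware; the other two share CPU, which is what allows over-provisioning without giving up predictable latency for time-sensitive queries.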
StarRocks is an open-source
query engine that delivers data
warehouse performance on the
data lake.
Reduce the time and cost of
developing data analytics projects.
● No ingestion and data copying.
● Stop paying to denormalize
data that may never be
queried.
● Keep the tools you’ve been
using.
Sub-second queries while serving
millions of users.
● Multi-dimensional interactive
analytics through on-the-fly
computations.
● Single source of truth on open
data lake analytics, with no
external system for caching or
pre-computation.
Make real-time decisions in all
business scenarios, especially where
up-to-date data is needed, as in
logistics.
● Real-time updates without
sacrificing query performance.
● Simpler real-time data pipeline.
● Replace stateful stream
processing jobs
(denormalization &
pre-aggregation) with efficient
on-the-fly computation.
Data warehouse query performance on the
data lakehouse with no data copying: Natively
integrates with open data lake including Hive,
Hudi, Iceberg, and Delta Lake.
Ditch Denormalization: Perform joins on
multiple tables with millions of rows in seconds.
No need for external data transformation tool:
Perform on-demand pre-computation (like
denormalization) within StarRocks, eliminating
another processing tool in your data pipeline.
Intelligent Query Planning
● Cost-based optimizer generates
optimized query plan.
● Global runtime filter.
Efficient query execution
● In-memory data shuffling enables fast
and scalable JOIN operation.
● C++ SIMD-optimized columnar storage
and vectorized query executions deliver
the industry's fastest query
performance.
High concurrency
● Built-in Materialized View.
● An Intelligent caching system.
● Secondary Indexes.
Ditch stream processing tools for
denormalization: StarRocks’ multi-table
performance allows you to ditch the rigid and
expensive stateful stream processing jobs for
denormalization and pre-aggregation.
Real-time updates: Primary key tables
support real-time data updates with no
impact on query performance.
Real-time Query Performance at scale: The
synchronized materialized view (MV) further
accelerates aggregated queries at scale for
real-time analytics.
Benchmark
StarRocks offers 2.2x the performance of ClickHouse and 8.9x the performance of Apache Druid®
in wide-table scenarios, out of the box, using the product-native table format.
Benchmark
StarRocks delivers 5.54x the query performance of Trino in multi-table scenarios using the
Apache Iceberg table format with Parquet files.
Use Case: Tableau
Dashboard at Airbnb
The Airbnb Tableau Dashboard project is designed to serve both
internal and external users by providing interactive dashboards. It
requires a quick response to user queries. However, the query
latency of previous solutions was over 10 minutes, which was not
acceptable, and the project was suspended until StarRocks was
adopted.
StarRocks Solution:
● StarRocks connects directly to Tableau and works very well
with it.
● A query over 3 tables (0.5B rows, 6B rows, 100M rows) with 4 joins,
3 distinct counts, and JSON functions and regex at the same time
responds in just 3.6s.
● Reduces query response time from the minutes level to the
sub-second level.
Use Case: Game and
User Behavior Analytics
at Tencent IEG
● Game data analysis and user behavior analysis for 400+ games.
● Operation reports need to be real-time.
● Previously used ClickHouse for real-time analysis and Trino for
ad-hoc queries, but wanted to consolidate them into one system.
● Uses Iceberg + COS storage and needs better performance.
● Needs elasticity for ad-hoc queries to reduce cost.
StarRocks Solution:
● Uses StarRocks Primary Key tables to solve the update problem.
● Uses compute nodes on Kubernetes (k8s) for auto-scaling.
● Much better performance for ad-hoc queries.
Use Case: Trust
Analytics at Airbnb
To enhance security, Airbnb needs a real-time fraud detection
system (Trust Analytics) to identify various attacks and take
actions ASAP. This system must support ad-hoc queries and
real-time updates.
StarRocks Solution:
● StarRocks hosts real-time updated datasets via Primary
Key.
● Dataset import from Kafka has a sub-minute delay.
● StarRocks provides second-level query latency for
complex joins.
● Alerting can be achieved by just running a SQL query
regularly.
History of StarRocks and CelerData
StarRocks was designed to address the challenges of real-time analytics, including the need to support
high concurrency, low latency, and a wide range of analytical workloads. StarRocks also offers a number
of features that are not available in other real-time analytics databases, such as the ability to query data
directly from data lakes.
2020
Birth of StarRocks
StarRocks is created as a commercialized fork of the
Apache Doris database. Over time, 90% of the
original codebase has been re-written.
2022
CelerData is founded
CelerData is founded as a company to develop and
commercialize StarRocks.
2023
StarRocks moves to Linux Foundation
CelerData contributes StarRocks to the Linux
Foundation and moves to Apache 2.0 license.
2023
CelerData Cloud Launched
CelerData launches its managed cloud service for
StarRocks.
2023
Benchmarks outperform competition
The latest TPC-DS and SSB benchmarks show 2x-9x
the performance of Trino, ClickHouse, and
Apache Druid.
Thank you.
● Website starrocks.io
● Managed Service cloud.celerdata.com
StarRocks Project
