How we evolved data pipeline at Celtra and what we learned along the way
whoami
Grega Kespret
Director of Engineering, Analytics @ Celtra
San Francisco
@gregakespret
github.com/gregakespret
slideshare.net/gregak
Data Pipeline
Big Data Problems vs. Big Data Problems
This is not going to be a talk about Data Science
Creative Management Platform
700,000 Ads Built
70,000 Campaigns
5000 Brands
6G Analytics Events / Day
1TB Compressed Data / Day
Celtra in Numbers
300 Metrics
200 Dimensions
Growth
The analytics platform at Celtra has experienced tremendous growth over
the past few years in terms of size, complexity, number of users, and
variety of use cases.
[Growth chart: Business, Impressions, Data Volume, Data Complexity]
How we evolved data pipeline at Celtra and what we learned along the way
• INSERT INTO impressions in track.php => 1 row per impression
• SELECT creativeId, placementId, sdk, platform, COUNT(*) sessions
FROM impressions
WHERE campaignId = ...
GROUP BY creativeId, placementId, sdk, platform
• It was obvious this wouldn't scale, pretty much an anti-pattern
MySQL for everything
[Diagram: Trackers → Raw impressions → Read replica → REST API → Client-facing dashboards]
GET /api/analytics?
metrics=sessions,creativeLoads,sessionsWithInteraction
&dimensions=campaignId,campaignName
&filters.accountId=d4950f2c
&filters.utcDate.gte=2018-01-01
&filters.utcDate.lt=2018-04-01
&sort=-sessions&limit=25&format=json
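Conceptually, a request like the one above maps onto an aggregate query of roughly this shape. This is only an illustrative sketch: the table and column names (sessions, creativeLoaded, hasInteraction) are assumptions, not the actual schema behind the API.
-- Illustrative only; names are assumed
SELECT
  campaignId,
  campaignName,
  COUNT(*) AS sessions,
  SUM(CASE WHEN creativeLoaded THEN 1 ELSE 0 END) AS creativeLoads,
  SUM(CASE WHEN hasInteraction THEN 1 ELSE 0 END) AS sessionsWithInteraction
FROM sessions
WHERE accountId = 'd4950f2c'
  AND utcDate >= '2018-01-01' AND utcDate < '2018-04-01'
GROUP BY campaignId, campaignName
ORDER BY sessions DESC
LIMIT 25;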
REST API
[Diagram: Trackers → Raw impressions → Read replica → REST API → Client-facing dashboards]
BigQuery (private beta)
[Diagram: Trackers → Raw impressions (BigQuery) → REST API → Client-facing dashboards]
Hive + Events + Cube aggregates
[Diagram: Trackers → Events (S3) → Hive ETL → Cube Aggregates → Cubes (read replica) → REST API → Client-facing dashboards]
• Separated write and read paths
• Write events to S3, read and process with Hive, store to MySQL, asynchronously
• Tried & tested, conservative solution
• Conceptually almost 1:1 to BigQuery (almost the same SQL)
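A hedged sketch of what the Hive aggregation step might have looked like conceptually; the table names, columns, partition layout, and the sessionId field are assumptions, not the actual production jobs:
-- Illustrative HiveQL only; schema is assumed
INSERT OVERWRITE TABLE cube_aggregates PARTITION (utcDate = '2013-06-01')
SELECT
  campaignId,
  creativeId,
  placementId,
  sdk,
  platform,
  COUNT(DISTINCT sessionId) AS sessions   -- sessionId is an assumed field on the events
FROM events
WHERE utcDate = '2013-06-01'
GROUP BY campaignId, creativeId, placementId, sdk, platform;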
• Immutable, append-only set of raw data
• Store on S3
s3://celtra-mab/events/2017-09-22/13-15/3fdc5604/part0000.events.gz
• Point in time facts about what happened
• Bread and butter of our analytics data
• JSON, one event per line
- server & client timestamp
- crc32 to detect invalid content
- index to detect missing events
- instantiation to deduplicate events
- accountId to partition the data
- name, implicitly defines the schema (very sparse)
Event data
• Replaced Hive with Spark (the new kid on the block)
• Spark 0.5 in production
• sessionization
Spark + Events + Cube aggregates
[Diagram: Trackers → Raw Events (S3) → Analyzer → Cube Aggregates → Cubes (read replica) → REST API → Client-facing dashboards · Services · Exploratory data analysis · Reports · Adhoc queries]
Sessionization == combining discrete events into sessions
• Complex relationships between events
• Patterns more interesting than sums and counts of events
• Easier to troubleshoot/debug with context
• Able to check for/enforce causality (if X happened, Y must also have happened)
• De-duplication possible (no skewed rates because of outliers)
• Later events reveal information about earlier arriving events (e.g. session duration, attribution, etc.)
[Diagram: Events → Sessions]
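Conceptually, sessionization groups raw events by their session key and derives per-session facts such as duration; a minimal SQL sketch of the idea (field names like sessionId, clientTime, and name are assumptions). The real pipeline does this in Spark/Scala, because de-duplication, validation, and causality checks need far more logic than a GROUP BY:
-- Conceptual sketch only; production sessionization runs in Spark
SELECT
  sessionId,
  MIN(clientTime) AS sessionStart,
  MAX(clientTime) - MIN(clientTime) AS sessionDuration,
  SUM(CASE WHEN name = 'interaction' THEN 1 ELSE 0 END) AS interactions
FROM events
GROUP BY sessionId;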
• Expressive computation layer: Provides nice functional abstractions over distributed collections
• Get full expressive power of Scala for ETL
• Complex ETL: de-duplicate, sessionize, clean, validate, emit facts
• Shuffle needed for sessionization
• Seamless integration with S3
• Speed of innovation
Spark
(we don't really use groupByKey any more)
Simplified Data Model
[Entity diagram: Session → Unit view → Page view → Interaction, with related entities Account, Campaign, Creative, Placement; example attributes: sdk, time, supplier, unit name, time on page]
1. Denormalize
✓ Speed (aggregations without joins)
✗ Expensive storage
2. Normalize
✗ Speed (joins)
✓ Cheap storage
Denormalize vs. normalize
• Idea: store sessions (unaggregated data after sessionization) instead of cubes
• Columnar MPP database
• Normalized logical schema, denormalized physical schema (projections)
• Use pre-join projections to move the join step into the load (see the sketch below)
Vertica
[Diagram: Trackers → Raw Events (S3) → Analyzer → Raw Sessions (Vertica, 8 VPC nodes) → REST API → Client-facing dashboards]
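To illustrate the pre-join projection idea, a heavily hedged Vertica sketch; the table, column, and key names are assumptions, and the real projections were certainly more involved. The dimension join is resolved at load time, so queries hit a denormalized physical projection while the logical schema stays normalized:
-- Hedged sketch; assumes PK/FK constraints between sessions and creatives are defined
CREATE PROJECTION sessions_prejoin AS
SELECT s.id, s.sdk, s.platform, s.creativeId, c.name AS creativeName, c.campaignId
FROM sessions s
JOIN creatives c ON s.creativeId = c.id
ORDER BY s.creativeId;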
Hash Joins vs. Merge Joins
• Always use MERGE JOINS (as opposed to HASH JOINS)
• Great performance and fast POC => start with implementation
• Pre-production load testing => a lot of problems:
- Inserts do not scale (VER-25968)
- Merge-join is single-threaded (VER-27781)
- Non-transitivity of pre-join projections
- Remove column problem
- Cannot rebuild data
- ...
• Decision to abandon the project
Vertica
[Diagram: Trackers → Raw Events (S3) → Analyzer → Raw Sessions (Vertica) → REST API → Client-facing dashboards]
• Complex setup and configuration required
• Analyses not reproducible and repeatable
• No collaboration
• Moving data between different stages in troubleshooting/analysis lifecycle (e.g. Scala for aggregations, R for viz)
• Heterogeneity of the various components (Spark in production, something else for exploratory data analysis)
• Analytics team (3 people) bottleneck
Difficult to Analyze the Data Collected
[Diagram: Trackers → Raw Events (S3) → Analyzer → Cube Aggregates → Cubes (read replica) → REST API → Client-facing dashboards · Services · Exploratory data analysis · Reports · Adhoc queries]
• Needed flexibility, not provided by precomputed aggregates (unique counting, order statistics, outliers, etc.)
• Needed answers to questions that existing data model did not support
• Visualizations
• For example:
• Analyzing effects of placement position on engagement rates
• Troubleshooting 95th percentile of ad loading time performance (see the sketch below)
Difficult to Analyze the Data Collected
[Diagram: Trackers → Raw Events (S3) → Analyzer → Cube Aggregates → Cubes (read replica) → REST API → Client-facing dashboards · Services · Exploratory data analysis · Reports · Adhoc queries]
[Diagram: the same pipeline, with Exploratory data analysis now done in Databricks]
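For instance, the 95th-percentile question above is easy to ask over raw, unaggregated session data but impossible to answer from pre-aggregated cubes. A hedged sketch (column names such as adLoadTimeMs are assumptions):
-- Works in Snowflake or Spark SQL over unaggregated sessions; names are assumed
SELECT
  platform,
  APPROX_PERCENTILE(adLoadTimeMs, 0.95) AS p95LoadTimeMs
FROM sessions
WHERE utcDate >= '2018-03-01'
GROUP BY platform
ORDER BY p95LoadTimeMs DESC;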
✗ Complex ETL repeated in adhoc queries (slow, error-prone)
✗ Slow to make schema changes in cubes (e.g. adding / removing metric)
✗ Keep cubes small (no geo, hourly, external dimensions)
✗ Recompute cube from events to add new breakdowns to existing metric (slow, not exactly deterministic)
Some of the problems
[Diagram: Trackers → Raw Events (S3) → Analyzer → Cube Aggregates → Cubes (read replica) → REST API → Client-facing dashboards · Services · Exploratory data analysis (Databricks) · Reports · Adhoc queries]
Idea: Split ETL, Materialize Sessions
Part 1 (complex): deduplication, sessionization, cleaning, validation, external dependencies
Part 2 (simple): aggregating across different dimensions
[Diagram: Trackers → Raw Events (S3) → ETL (Analyzer) → Raw Sessions → ETL → Cube Aggregates → Cubes (read replica) → REST API → Client-facing dashboards · Services · Exploratory data analysis (Databricks) · Reports · Adhoc queries]
Requirements
• Fully managed service
• Columnar storage format
• Support for complex nested structures
• Schema evolution possible
• Data rewrites possible
• Scale compute resources separately from storage
Data Warehouse to store sessions
Nice-to-Haves
• Transactions
• Partitioning
• Skipping
• Access control
• Appropriate for OLAP use case
Final Contenders for New Data Warehouse
• Too many small files problem => file stitching
• No consistency guarantees over set of files on S3 => secondary index | convention
• Liked one layer vs. separate Query layer (Spark), Metadata layer (HCatalog), Storage format layer (Parquet), Data layer (S3)
Spark + HCatalog + Parquet + S3
We really wanted a database-like abstraction with transactions, not a file format!
Operational tasks with Vertica:
• Replace failed node
• Refresh projection
• Restart database with one node down
• Remove dead node from DNS
• Ensure enough (at least 2x) disk space available for rewrites
• Backup data
• Archive data
Why We Wanted a Managed Service
We did not want to deal with these tasks
1. Denormalize
✓ Speed (aggregations without joins)
✗ Expensive storage
2. Normalize
✗ Speed (joins)
✓ Cheap storage
3. Normalized logical schema, denormalized physical schema (Vertica use case)
✓ Speed (move the join step into the load)
✓ Cheap storage
4. Nested objects: pre-group the data on each grain
✓ Speed (a "join" between parent and child is essentially free)
✓ Cheap storage
Support for complex nested structures
{
"id": "s1523120401x263215af605420x35562961",
"creativeId": "f21d6f4f",
"actualDeviceType": "Phone",
"loaded": true,
"rendered": true,
"platform": "IOS",
...
"unitShows": [
{
"unit": { ... },
"unitVariantShows": [
{
"unitVariant": { ... },
"screenShows": [
{
"screen": { ... },
},
{
"screen": {... },
}
],
"hasInteraction": false
}
],
"screenDepth": "2",
"hasInteraction": false
}
]
}
Pre-group the data on each grain
Flat + Normalized vs. Nested
Find top 10 pages on creative units with most interactions on average
Flat vs. Nested Queries
SELECT
creativeId,
us.name unitName,
ss.name pageName,
AVG(COUNT(*)) avgInteractions
FROM
sessions s
JOIN unitShows us ON us.id = s.id
JOIN screenShows ss ON ss.usid = us.id
JOIN interactions i ON i.ssid = ss.id
GROUP BY 1, 2, 3
ORDER BY avgInteractions DESC LIMIT 10
Flat: joins require a unique ID at every grain
Nested: the distributed join is turned into a local join
SELECT
creativeId,
unitShows.value:name unitName,
screenShows.value:name pageName,
AVG(ARRAY_SIZE(screenShows.value:interactions)) avgInteractions
FROM
sessions,
LATERAL FLATTEN(json:unitShows) unitShows,
LATERAL FLATTEN(unitShows.value:screenShows) screenShows
GROUP BY 1, 2, 3
ORDER BY avgInteractions DESC LIMIT 10
Sessions: Path to production
[Diagram: Trackers → Raw Events (S3) → Analyzer → Raw Sessions · ETL → Cube Aggregates → Cubes (read replica) → REST API → Client-facing dashboards · Services · Reports · Adhoc queries · Exploratory data analysis (Databricks)]
1. Backfilling
- Materialized historical sessions of the last 2 years and inserted them into Snowflake
- Verify counts for each day versus cube aggregates, investigate discrepancies & fix (see the verification sketch below)
2. "Soft deploy", period of mirrored writes
- Add SnowflakeSink to Analyzer (./bin/analyzer --sinks mysql,snowflake)
- Sink to Cubes and Snowflake in one run (sequentially)
- Investigate potential problems of the Snowflake sink in production, compare the data, etc.
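The per-day count verification in step 1 might look roughly like this; a sketch only, with assumed table and column names:
-- Compare daily session counts in the new Raw Sessions table against the existing cube aggregates
SELECT s.utcDate, s.sessions AS fromSessions, c.sessions AS fromCubes
FROM (SELECT utcDate, COUNT(*) AS sessions FROM sessions GROUP BY utcDate) s
FULL OUTER JOIN (SELECT utcDate, SUM(sessions) AS sessions FROM cube_aggregates GROUP BY utcDate) c
  ON s.utcDate = c.utcDate
WHERE COALESCE(s.sessions, 0) <> COALESCE(c.sessions, 0)
ORDER BY 1;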
Sessions in production
[Diagram: Trackers → Raw Events (S3) → Analyzer → Raw Sessions; Cube filler → Cube Aggregates → Cubes (read replica) → REST API → Client-facing dashboards · Services · Reports · Adhoc queries · Exploratory data analysis (Databricks)]
3. Split Analyzer into Event analyzer and Cube filler
- Add Cubefiller
- Switch pipeline to Analyzer -> Sessions, Cubefiller -> Cubes
- Refactor and clean up analyzer
Sessions in production
[Diagram: Trackers → Raw Events (S3) → Analyzer → Raw Sessions; Cube filler → Cube Aggregates → Cubes (read replica) → REST API → Client-facing dashboards · Services · Reports · Adhoc queries · Exploratory data analysis (Databricks)]
✓ Data processed once & consumed many times (from sessions)
✗ Slow to make schema changes in cubes (e.g. adding / removing metric)
✗ Keep cubes small (no geo, hourly, external dimensions)
✗ Recompute cube from events to add new breakdowns to existing metric (slow, not exactly deterministic)
Move Cubes to Snowflake
[Diagram: Trackers → Raw Events (S3) → Analyzer → Raw Sessions; Cube filler → Cube Aggregates → Cubes (read replica) → REST API → Client-facing dashboards · Services · Reports · Exploratory data analysis (Databricks)]
1. Backfilling & Testing
- Export Cube aggregates data from MySQL and import into Snowflake
- Test performance of single queries, parallel queries for different use cases: Dashboard, Reporting BI through Report generator, ...
2. "Soft deploy", period of mirrored writes
- Add SnowflakeSink to Cubefiller (./bin/cube-filler --sinks mysql,snowflake)
- Sink to MySQL Cubes and Snowflake Cubes in one run
- Investigate potential problems of the Snowflake sink in production, compare the data, etc.
Move Cubes to Snowflake
[Diagram: Trackers → Raw Events (S3) → Analyzer → Raw Sessions; Cube filler → Cube Aggregates → Cubes (read replica) → REST API → Client-facing dashboards · Services · Reports · Exploratory data analysis (Databricks)]
3. REST API can read from Snowflake Cube aggregates (SnowflakeAnalyzer)
4. Validating SnowflakeAnalyzer results on 1% of requests:
/**
 * Proxies requests to a main analyzer and a percent of requests to the
 * secondary analyzer and compares responses. When responses do not match,
 * it logs the context needed for troubleshooting.
 */
class ExperimentAnalyzer extends Analyzer
5. Gradually switch REST API to get data from Snowflake instead of MySQL
- Start with 10% of requests, increase to 100%
- This allows us to see how Snowflake performs under load
Check matches and response times
Cube aggregates in Snowflake
[Diagram: Trackers → Raw Events (S3) → Analyzer → Raw Sessions; Cube filler → Cube Aggregates (Snowflake) → REST API → Client-facing dashboards · Services · Reports · Exploratory data analysis (Databricks)]
✓ Data processed once & consumed many times (from sessions)
✓ Fast schema changes in cubes (e.g. adding / removing metric)
✓ Can support geographical, hourly, external dimensions
✗ Recompute cube from sessions to add new breakdowns to existing metric (slow, but deterministic)
Compute cube aggregates with SQL
[Diagram: Trackers → Raw Events (S3) → Analyzer → Raw Sessions; Cube filler → Cube Aggregates (Snowflake) → REST API → Client-facing dashboards · Services · Reports · Exploratory data analysis (Databricks)]
Part 2 (simple): aggregating across different dimensions
• Move computation to data, not data to computation
• Instead of exporting sessions to a Spark cluster only to compute aggregates, do it in the database with SELECT … FROM sessions GROUP BY ... (see the sketch below)
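A hedged sketch of what such an in-database aggregation might look like; the dimension and metric names are assumptions, and the real cube definitions are richer:
-- Illustrative only: compute one day of a cube directly in the warehouse
INSERT INTO cube_aggregates (utcDate, campaignId, creativeId, sdk, platform, sessions, sessionsWithInteraction)
SELECT
  utcDate,
  campaignId,
  creativeId,
  sdk,
  platform,
  COUNT(*) AS sessions,
  COUNT_IF(hasInteraction) AS sessionsWithInteraction
FROM sessions
WHERE utcDate = '2018-04-01'
GROUP BY 1, 2, 3, 4, 5;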
Compute cube aggregates with SQL
[Diagram: Trackers → Raw Events (S3) → Analyzer → Raw Sessions; Cube filler → Cube Aggregates (Snowflake) → REST API → Client-facing dashboards · Services · Reports · Exploratory data analysis (Databricks)]
1. Period of double writes
./bin/cube-filler --logic sql,cubefiller --sink snowflakecubes
2. Compare differences, fix problems
Compute cube aggregates with SQL
[Diagram: Trackers → Raw Events (S3) → Analyzer → Raw Sessions → ETL → Cube Aggregates → REST API → Client-facing dashboards · Services · Reports · Exploratory data analysis (Databricks)]
3. Remove Spark completely from cubefiller
• REST API can read directly from Raw sessions OR from Cube Aggregates, generating the same response
• Session schema: known, well defined (by the Session Scala model) and enforced
• Latest Session model: the authoritative source for the sessions schema
• Historical sessions conform to the latest Session model: we can de-serialize any historical session
• Readers should ignore fields not in the Session model: we do not guarantee to preserve this data
• Computing metrics and dimensions from the Session model is time-invariant: computed 1 year ago or today, the numbers must be the same
How we handle schema evolution
Schema Evolution
Change in Session model         | Top level / scalar column                   | Nested / VARIANT column
Rename field                    | ALTER TABLE tbl RENAME COLUMN col1 TO col2; | data rewrite (!)
Remove field                    | ALTER TABLE tbl DROP COLUMN col;            | batch together in next rewrite
Add field, no historical values | ALTER TABLE tbl ADD COLUMN col type;        | no change necessary
Also considered views for VARIANT schema evolution:
- For complex scenarios you have to use a Javascript UDF => lose the benefits of columnar access
- Not good for practical use
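As an illustration of the view approach that was considered (and rejected) for VARIANT schema evolution, a hedged sketch that renames a field inside the VARIANT without rewriting any data; the field and view names are hypothetical:
-- Hypothetical: expose newField in place of oldField without touching stored data
CREATE OR REPLACE VIEW sessions_renamed AS
SELECT
  id,
  OBJECT_INSERT(OBJECT_DELETE(json, 'oldField'), 'newField', json:oldField) AS json
FROM sessions;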
• They are sometimes necessary
• We have the ability to do data rewrites
• Rewrite must maintain sort order for fast access (UPDATE breaks it!)
• Did rewrite of 150TB of (compressed) data in December
• Complex and time consuming, so we fully automate them
• Costly, so we batch multiple changes together
• Javascript UDFs are our default approach for rewrites of data in VARIANT
Data Rewrites
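A hedged sketch of how such a rewrite might be structured so the result stays sorted; the sort key and table names are assumptions, and transform stands for a JavaScript UDF like the one on the next slide:
-- Rewrite into a new table with an explicit sort; an in-place UPDATE would break the sort order
CREATE OR REPLACE TABLE sessions_rewritten AS
SELECT accountId, utcDate, transform(json) AS json
FROM sessions
ORDER BY accountId, utcDate;
-- Swap the rewritten table in atomically
ALTER TABLE sessions_rewritten SWAP WITH sessions;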
• Expressive power of Javascript (vs. SQL)
• Run on the whole VARIANT record
• (Almost) constant performance
• More readable and understandable
• For changing a single field, OBJECT_INSERT/OBJECT_DELETE are preferred
Inline Rewrites with Javascript UDFs
CREATE OR REPLACE FUNCTION transform("json" variant)
RETURNS VARIANT
LANGUAGE JAVASCRIPT
AS '
// modify json
return json;
';
SELECT transform(json) FROM sessions;
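For the single-field case mentioned above, where OBJECT_INSERT/OBJECT_DELETE are preferred over a whole-record UDF, a hedged sketch with hypothetical field names:
-- Overwrite (or add) one field inside the VARIANT; the TRUE flag allows updating an existing key
SELECT OBJECT_INSERT(json, 'deviceType', 'Phone', TRUE) FROM sessions;
-- Drop one field
SELECT OBJECT_DELETE(json, 'obsoleteField') FROM sessions;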
• There are many ways to develop a data pipeline, none of them are perfect
• Make sure that load/scale testing is a required part of your go-to-production plan
• Rollups make things cheaper, but at a great expense later
• Good abstractions survive a long time (e.g. Analytics API)
• Evolve pipeline modularly
• Maintain data consistency before/after
Stuff you have seen today...
Thank You.
P.S. We're hiring
Editor's Notes
• #13: Got the private beta through connections. Almost no documentation, something about SQL. Migration was a couple of days (a big factor). Amazing response times at first, also for bigger campaigns. In time, response times got slower and slower and nobody knew why; we talked to Google, and nobody mentioned they were doing full-table scans. At some point all campaign queries took 30s, a little later >60s => browser timeouts and total breakage. The combination of popularity and retries (because of the >60s responses) caused us to go over query quotas on some days, which was an even bigger disaster.
• #14: Separated write and read paths. Write events to S3, read and process with Hive, store to MySQL, asynchronously. Tried & tested, conservative solution. Conceptually almost 1:1 to BigQuery (almost the same SQL).
• #29: Managed service. Schema evolution: how fast are schema changes (adding a column, removing a column, adding a column with a default value, computing a new column from other data, changing a field type)? Do we need a data rewrite to change the schema? If so, how long does it take to migrate all data to the new schema, and what is the estimated cost of a rewrite? Storage and nested objects: support for complex nested data structures; columnar storage (are nested objects "columnarized"?); data storage format, open or closed? Workload management: scaling compute resources separately from storage.