How we evolved data pipeline at Celtra and what we learned along the way
whoami
Grega Kespret
Director of Engineering, Analytics @ Celtra
San Francisco
@gregakespret
github.com/gregakespret
slideshare.net/gregak
Data Pipeline
Big Data Problems vs. Big Data Problems
This is not going to be a talk about Data Science
Creative Management Platform
700,000 Ads Built
70,000 Campaigns
5000 Brands
6G Analytics Events / Day
1TB Compressed Data / Day
Celtra in Numbers
300 Metrics
200 Dimensions
Growth
The analytics platform at Celtra has experienced tremendous growth over
the past few years in terms of size, complexity, number of users, and
variety of use cases.
[Growth chart: Business, Impressions, Data Volume, Data Complexity]
How we evolved data pipeline at Celtra and what we learned along the way
• INSERT INTO impressions in track.php => 1 row per impression
• SELECT creativeId, placementId, sdk, platform, COUNT(*) sessions
FROM impressions
WHERE campaignId = ...
GROUP BY creativeId, placementId, sdk, platform
• It was obvious this wouldn't scale, pretty much an anti-pattern
MySQL for everything
[Diagram: Trackers → Raw impressions → Read replica → REST API → Client-facing dashboards]
GET /api/analytics?
metrics=sessions,creativeLoads,sessionsWithInteraction
&dimensions=campaignId,campaignName
&filters.accountId=d4950f2c
&filters.utcDate.gte=2018-01-01
&filters.utcDate.lt=2018-04-01
&sort=-sessions&limit=25&format=json
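Conceptually, a request like the one above maps onto an aggregate query of roughly this shape. This is only an illustrative sketch: the table and column names (sessions, creativeLoaded, hasInteraction) are assumptions, not the actual schema behind the API.
-- Illustrative only; names are assumed
SELECT
  campaignId,
  campaignName,
  COUNT(*) AS sessions,
  SUM(CASE WHEN creativeLoaded THEN 1 ELSE 0 END) AS creativeLoads,
  SUM(CASE WHEN hasInteraction THEN 1 ELSE 0 END) AS sessionsWithInteraction
FROM sessions
WHERE accountId = 'd4950f2c'
  AND utcDate >= '2018-01-01' AND utcDate < '2018-04-01'
GROUP BY campaignId, campaignName
ORDER BY sessions DESC
LIMIT 25;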
REST API
[Diagram: Trackers → Raw impressions → Read replica → REST API → Client-facing dashboards]
BigQuery (private beta)
[Diagram: Trackers → Raw impressions (BigQuery) → REST API → Client-facing dashboards]
Hive + Events + Cube aggregates
[Diagram: Trackers → Events (S3) → Hive ETL → Cube Aggregates → Cubes (read replica) → REST API → Client-facing dashboards]
• Separated write and read paths
• Write events to S3, read and process with Hive, store to MySQL, asynchronously
• Tried & tested, conservative solution
• Conceptually almost 1:1 to BigQuery (almost the same SQL)
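A hedged sketch of what the Hive aggregation step might have looked like conceptually; the table names, columns, partition layout, and the sessionId field are assumptions, not the actual production jobs:
-- Illustrative HiveQL only; schema is assumed
INSERT OVERWRITE TABLE cube_aggregates PARTITION (utcDate = '2013-06-01')
SELECT
  campaignId,
  creativeId,
  placementId,
  sdk,
  platform,
  COUNT(DISTINCT sessionId) AS sessions   -- sessionId is an assumed field on the events
FROM events
WHERE utcDate = '2013-06-01'
GROUP BY campaignId, creativeId, placementId, sdk, platform;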
• Immutable, append-only set of raw data
• Store on S3
s3://celtra-mab/events/2017-09-22/13-15/3fdc5604/part0000.events.gz
• Point in time facts about what happened
• Bread and butter of our analytics data
• JSON, one event per line
- server & client timestamp
- crc32 to detect invalid content
- index to detect missing events
- instantiation to deduplicate events
- accountId to partition the data
- name, implicitly defines the schema (very sparse)
Event data
• Replaced Hive with Spark (the new kid on the block)
• Spark 0.5 in production
• sessionization
Spark + Events + Cube aggregates
[Diagram: Trackers → Raw Events (S3) → Analyzer → Cube Aggregates → Cubes (read replica) → REST API → Client-facing dashboards · Services · Exploratory data analysis · Reports · Adhoc queries]
Sessionization == combining discrete events into sessions
• Complex relationships between events
• Patterns more interesting than sums and counts of events
• Easier to troubleshoot/debug with context
• Able to check for/enforce causality (if X happened, Y must also have happened)
• De-duplication possible (no skewed rates because of outliers)
• Later events reveal information about earlier arriving events (e.g. session duration, attribution, etc.)
[Diagram: Events → Sessions]
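Conceptually, sessionization groups raw events by their session key and derives per-session facts such as duration; a minimal SQL sketch of the idea (field names like sessionId, clientTime, and name are assumptions). The real pipeline does this in Spark/Scala, because de-duplication, validation, and causality checks need far more logic than a GROUP BY:
-- Conceptual sketch only; production sessionization runs in Spark
SELECT
  sessionId,
  MIN(clientTime) AS sessionStart,
  MAX(clientTime) - MIN(clientTime) AS sessionDuration,
  SUM(CASE WHEN name = 'interaction' THEN 1 ELSE 0 END) AS interactions
FROM events
GROUP BY sessionId;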
• Expressive computation layer: Provides nice functional abstractions over distributed collections
• Get full expressive power of Scala for ETL
• Complex ETL: de-duplicate, sessionize, clean, validate, emit facts
• Shuffle needed for sessionization
• Seamless integration with S3
• Speed of innovation
Spark
(we don't really use groupByKey any more)
Simplified Data Model
[Entity diagram: Session → Unit view → Page view → Interaction, with related entities Account, Campaign, Creative, Placement; example attributes: sdk, time, supplier, unit name, time on page]
1. Denormalize
✓ Speed (aggregations without joins)
✗ Expensive storage
2. Normalize
✗ Speed (joins)
✓ Cheap storage
Denormalize vs. normalize
• Idea: store sessions (unaggregated data after sessionization) instead of cubes
• Columnar MPP database
• Normalized logical schema, denormalized physical schema (projections)
• Use pre-join projections to move the join step into the load (see the sketch below)
Vertica
[Diagram: Trackers → Raw Events (S3) → Analyzer → Raw Sessions (Vertica, 8 VPC nodes) → REST API → Client-facing dashboards]
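To illustrate the pre-join projection idea, a heavily hedged Vertica sketch; the table, column, and key names are assumptions, and the real projections were certainly more involved. The dimension join is resolved at load time, so queries hit a denormalized physical projection while the logical schema stays normalized:
-- Hedged sketch; assumes PK/FK constraints between sessions and creatives are defined
CREATE PROJECTION sessions_prejoin AS
SELECT s.id, s.sdk, s.platform, s.creativeId, c.name AS creativeName, c.campaignId
FROM sessions s
JOIN creatives c ON s.creativeId = c.id
ORDER BY s.creativeId;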
Hash Joins vs. Merge Joins
• Always use MERGE JOINS (as opposed to HASH JOINS)
• Great performance and fast POC => start with implementation
• Pre-production load testing => a lot of problems:
- Inserts do not scale (VER-25968)
- Merge-join is single-threaded (VER-27781)
- Non-transitivity of pre-join projections
- Remove column problem
- Cannot rebuild data
- ...
• Decision to abandon the project
Vertica
[Diagram: Trackers → Raw Events (S3) → Analyzer → Raw Sessions (Vertica) → REST API → Client-facing dashboards]
• Complex setup and configuration required
• Analyses not reproducible and repeatable
• No collaboration
• Moving data between different stages in troubleshooting/analysis lifecycle (e.g. Scala for aggregations, R for viz)
• Heterogeneity of the various components (Spark in production, something else for exploratory data analysis)
• Analytics team (3 people) bottleneck
Difficult to Analyze the Data Collected
[Diagram: Trackers → Raw Events (S3) → Analyzer → Cube Aggregates → Cubes (read replica) → REST API → Client-facing dashboards · Services · Exploratory data analysis · Reports · Adhoc queries]
• Needed flexibility, not provided by precomputed aggregates (unique counting, order statistics, outliers, etc.)
• Needed answers to questions that existing data model did not support
• Visualizations
• For example:
• Analyzing effects of placement position on engagement rates
• Troubleshooting 95th percentile of ad loading time performance (see the sketch below)
Difficult to Analyze the Data Collected
[Diagram: Trackers → Raw Events (S3) → Analyzer → Cube Aggregates → Cubes (read replica) → REST API → Client-facing dashboards · Services · Exploratory data analysis · Reports · Adhoc queries]
[Diagram: the same pipeline, with Exploratory data analysis now done in Databricks]
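For instance, the 95th-percentile question above is easy to ask over raw, unaggregated session data but impossible to answer from pre-aggregated cubes. A hedged sketch (column names such as adLoadTimeMs are assumptions):
-- Works in Snowflake or Spark SQL over unaggregated sessions; names are assumed
SELECT
  platform,
  APPROX_PERCENTILE(adLoadTimeMs, 0.95) AS p95LoadTimeMs
FROM sessions
WHERE utcDate >= '2018-03-01'
GROUP BY platform
ORDER BY p95LoadTimeMs DESC;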
✗ Complex ETL repeated in adhoc queries (slow, error-prone)
✗ Slow to make schema changes in cubes (e.g. adding / removing metric)
✗ Keep cubes small (no geo, hourly, external dimensions)
✗ Recompute cube from events to add new breakdowns to existing metric (slow, not exactly deterministic)
Some of the problems
[Diagram: Trackers → Raw Events (S3) → Analyzer → Cube Aggregates → Cubes (read replica) → REST API → Client-facing dashboards · Services · Exploratory data analysis (Databricks) · Reports · Adhoc queries]
Idea: Split ETL, Materialize Sessions
Part 1 (complex): deduplication, sessionization, cleaning, validation, external dependencies
Part 2 (simple): aggregating across different dimensions
[Diagram: Trackers → Raw Events (S3) → ETL (Analyzer) → Raw Sessions → ETL → Cube Aggregates → Cubes (read replica) → REST API → Client-facing dashboards · Services · Exploratory data analysis (Databricks) · Reports · Adhoc queries]
Requirements
• Fully managed service
• Columnar storage format
• Support for complex nested structures
• Schema evolution possible
• Data rewrites possible
• Scale compute resources separately from storage
Data Warehouse to store sessions
Nice-to-Haves
• Transactions
• Partitioning
• Skipping
• Access control
• Appropriate for OLAP use case
Final Contenders for New Data Warehouse
• Too many small files problem => file stitching
• No consistency guarantees over set of files on S3 => secondary index | convention
• Liked one layer vs. separate Query layer (Spark), Metadata layer (HCatalog), Storage format layer (Parquet), Data layer (S3)
Spark + HCatalog + Parquet + S3
We really wanted a database-like abstraction with transactions, not a file format!
Operational tasks with Vertica:
• Replace failed node
• Refresh projection
• Restart database with one node down
• Remove dead node from DNS
• Ensure enough (at least 2x) disk space available for rewrites
• Backup data
• Archive data
Why We Wanted a Managed Service
We did not want to deal with these tasks
1. Denormalize
✓ Speed (aggregations without joins)
✗ Expensive storage
2. Normalize
✗ Speed (joins)
✓ Cheap storage
3. Normalized logical schema, denormalized physical schema (Vertica use case)
✓ Speed (move the join step into the load)
✓ Cheap storage
4. Nested objects: pre-group the data on each grain
✓ Speed (a "join" between parent and child is essentially free)
✓ Cheap storage
Support for complex nested structures
{
"id": "s1523120401x263215af605420x35562961",
"creativeId": "f21d6f4f",
"actualDeviceType": "Phone",
"loaded": true,
"rendered": true,
"platform": "IOS",
...
"unitShows": [
{
"unit": { ... },
"unitVariantShows": [
{
"unitVariant": { ... },
"screenShows": [
{
"screen": { ... },
},
{
"screen": {... },
}
],
"hasInteraction": false
}
],
"screenDepth": "2",
"hasInteraction": false
}
]
}
Pre-group the data on each grain
Flat + Normalized vs. Nested
Find top 10 pages on creative units with most interactions on average
Flat vs. Nested Queries
SELECT
creativeId,
us.name unitName,
ss.name pageName,
AVG(COUNT(*)) avgInteractions
FROM
sessions s
JOIN unitShows us ON us.id = s.id
JOIN screenShows ss ON ss.usid = us.id
JOIN interactions i ON i.ssid = ss.id
GROUP BY 1, 2, 3
ORDER BY avgInteractions DESC LIMIT 10
Flat: joins require a unique ID at every grain
Nested: the distributed join is turned into a local join
SELECT
creativeId,
unitShows.value:name unitName,
screenShows.value:name pageName,
AVG(ARRAY_SIZE(screenShows.value:interactions)) avgInteractions
FROM
sessions,
LATERAL FLATTEN(json:unitShows) unitShows,
LATERAL FLATTEN(unitShows.value:screenShows) screenShows
GROUP BY 1, 2, 3
ORDER BY avgInteractions DESC LIMIT 10
Sessions: Path to production
[Diagram: Trackers → Raw Events (S3) → Analyzer → Raw Sessions · ETL → Cube Aggregates → Cubes (read replica) → REST API → Client-facing dashboards · Services · Reports · Adhoc queries · Exploratory data analysis (Databricks)]
1. Backfilling
- Materialized historical sessions of the last 2 years and inserted them into Snowflake
- Verify counts for each day versus cube aggregates, investigate discrepancies & fix (see the verification sketch below)
2. "Soft deploy", period of mirrored writes
- Add SnowflakeSink to Analyzer (./bin/analyzer --sinks mysql,snowflake)
- Sink to Cubes and Snowflake in one run (sequentially)
- Investigate potential problems of the Snowflake sink in production, compare the data, etc.
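The per-day count verification in step 1 might look roughly like this; a sketch only, with assumed table and column names:
-- Compare daily session counts in the new Raw Sessions table against the existing cube aggregates
SELECT s.utcDate, s.sessions AS fromSessions, c.sessions AS fromCubes
FROM (SELECT utcDate, COUNT(*) AS sessions FROM sessions GROUP BY utcDate) s
FULL OUTER JOIN (SELECT utcDate, SUM(sessions) AS sessions FROM cube_aggregates GROUP BY utcDate) c
  ON s.utcDate = c.utcDate
WHERE COALESCE(s.sessions, 0) <> COALESCE(c.sessions, 0)
ORDER BY 1;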
Sessions in production
[Diagram: Trackers → Raw Events (S3) → Analyzer → Raw Sessions; Cube filler → Cube Aggregates → Cubes (read replica) → REST API → Client-facing dashboards · Services · Reports · Adhoc queries · Exploratory data analysis (Databricks)]
3. Split Analyzer into Event analyzer and Cube filler
- Add Cubefiller
- Switch pipeline to Analyzer -> Sessions, Cubefiller -> Cubes
- Refactor and clean up analyzer
Sessions in production
[Diagram: Trackers → Raw Events (S3) → Analyzer → Raw Sessions; Cube filler → Cube Aggregates → Cubes (read replica) → REST API → Client-facing dashboards · Services · Reports · Adhoc queries · Exploratory data analysis (Databricks)]
✓ Data processed once & consumed many times (from sessions)
✗ Slow to make schema changes in cubes (e.g. adding / removing metric)
✗ Keep cubes small (no geo, hourly, external dimensions)
✗ Recompute cube from events to add new breakdowns to existing metric (slow, not exactly deterministic)
Move Cubes to Snowflake
[Diagram: Trackers → Raw Events (S3) → Analyzer → Raw Sessions; Cube filler → Cube Aggregates → Cubes (read replica) → REST API → Client-facing dashboards · Services · Reports · Exploratory data analysis (Databricks)]
1. Backfilling & Testing
- Export Cube aggregates data from MySQL and import into Snowflake
- Test performance of single queries, parallel queries for different use cases: Dashboard, Reporting BI through Report generator, ...
2. "Soft deploy", period of mirrored writes
- Add SnowflakeSink to Cubefiller (./bin/cube-filler --sinks mysql,snowflake)
- Sink to MySQL Cubes and Snowflake Cubes in one run
- Investigate potential problems of the Snowflake sink in production, compare the data, etc.
Move Cubes to Snowflake
[Diagram: Trackers → Raw Events (S3) → Analyzer → Raw Sessions; Cube filler → Cube Aggregates → Cubes (read replica) → REST API → Client-facing dashboards · Services · Reports · Exploratory data analysis (Databricks)]
3. REST API can read from Snowflake Cube aggregates (SnowflakeAnalyzer)
4. Validating SnowflakeAnalyzer results on 1% of requests:
/**
 * Proxies requests to a main analyzer and a percent of requests to the
 * secondary analyzer and compares responses. When responses do not match,
 * it logs the context needed for troubleshooting.
 */
class ExperimentAnalyzer extends Analyzer
5. Gradually switch REST API to get data from Snowflake instead of MySQL
- Start with 10% of requests, increase to 100%
- This allows us to see how Snowflake performs under load
Check matches and response times
Cube aggregates in Snowflake
[Diagram: Trackers → Raw Events (S3) → Analyzer → Raw Sessions; Cube filler → Cube Aggregates (Snowflake) → REST API → Client-facing dashboards · Services · Reports · Exploratory data analysis (Databricks)]
✓ Data processed once & consumed many times (from sessions)
✓ Fast schema changes in cubes (e.g. adding / removing metric)
✓ Can support geographical, hourly, external dimensions
✗ Recompute cube from sessions to add new breakdowns to existing metric (slow, but deterministic)
Compute cube aggregates with SQL
[Diagram: Trackers → Raw Events (S3) → Analyzer → Raw Sessions; Cube filler → Cube Aggregates (Snowflake) → REST API → Client-facing dashboards · Services · Reports · Exploratory data analysis (Databricks)]
Part 2 (simple): aggregating across different dimensions
• Move computation to data, not data to computation
• Instead of exporting sessions to a Spark cluster only to compute aggregates, do it in the database with SELECT … FROM sessions GROUP BY ... (see the sketch below)
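A hedged sketch of what such an in-database aggregation might look like; the dimension and metric names are assumptions, and the real cube definitions are richer:
-- Illustrative only: compute one day of a cube directly in the warehouse
INSERT INTO cube_aggregates (utcDate, campaignId, creativeId, sdk, platform, sessions, sessionsWithInteraction)
SELECT
  utcDate,
  campaignId,
  creativeId,
  sdk,
  platform,
  COUNT(*) AS sessions,
  COUNT_IF(hasInteraction) AS sessionsWithInteraction
FROM sessions
WHERE utcDate = '2018-04-01'
GROUP BY 1, 2, 3, 4, 5;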
Compute cube aggregates with SQL
[Diagram: Trackers → Raw Events (S3) → Analyzer → Raw Sessions; Cube filler → Cube Aggregates (Snowflake) → REST API → Client-facing dashboards · Services · Reports · Exploratory data analysis (Databricks)]
1. Period of double writes
./bin/cube-filler --logic sql,cubefiller --sink snowflakecubes
2. Compare differences, fix problems
Compute cube aggregates with SQL
[Diagram: Trackers → Raw Events (S3) → Analyzer → Raw Sessions → ETL → Cube Aggregates → REST API → Client-facing dashboards · Services · Reports · Exploratory data analysis (Databricks)]
3. Remove Spark completely from cubefiller
• REST API can read directly from Raw sessions OR from Cube Aggregates, generating the same response
• Session schema: known, well defined (by the Session Scala model) and enforced
• Latest Session model: the authoritative source for the sessions schema
• Historical sessions conform to the latest Session model: we can de-serialize any historical session
• Readers should ignore fields not in the Session model: we do not guarantee to preserve this data
• Computing metrics and dimensions from the Session model is time-invariant: computed 1 year ago or today, the numbers must be the same
How we handle schema evolution
Schema Evolution
Change in Session model         | Top level / scalar column                   | Nested / VARIANT column
Rename field                    | ALTER TABLE tbl RENAME COLUMN col1 TO col2; | data rewrite (!)
Remove field                    | ALTER TABLE tbl DROP COLUMN col;            | batch together in next rewrite
Add field, no historical values | ALTER TABLE tbl ADD COLUMN col type;        | no change necessary
Also considered views for VARIANT schema evolution:
- For complex scenarios you have to use a Javascript UDF => lose the benefits of columnar access
- Not good for practical use
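As an illustration of the view approach that was considered (and rejected) for VARIANT schema evolution, a hedged sketch that renames a field inside the VARIANT without rewriting any data; the field and view names are hypothetical:
-- Hypothetical: expose newField in place of oldField without touching stored data
CREATE OR REPLACE VIEW sessions_renamed AS
SELECT
  id,
  OBJECT_INSERT(OBJECT_DELETE(json, 'oldField'), 'newField', json:oldField) AS json
FROM sessions;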
• They are sometimes necessary
• We have the ability to do data rewrites
• Rewrite must maintain sort order for fast access (UPDATE breaks it!)
• Did rewrite of 150TB of (compressed) data in December
• Complex and time consuming, so we fully automate them
• Costly, so we batch multiple changes together
• Javascript UDFs are our default approach for rewrites of data in VARIANT
Data Rewrites
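A hedged sketch of how such a rewrite might be structured so the result stays sorted; the sort key and table names are assumptions, and transform stands for a JavaScript UDF like the one on the next slide:
-- Rewrite into a new table with an explicit sort; an in-place UPDATE would break the sort order
CREATE OR REPLACE TABLE sessions_rewritten AS
SELECT accountId, utcDate, transform(json) AS json
FROM sessions
ORDER BY accountId, utcDate;
-- Swap the rewritten table in atomically
ALTER TABLE sessions_rewritten SWAP WITH sessions;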
• Expressive power of Javascript (vs. SQL)
• Run on the whole VARIANT record
• (Almost) constant performance
• More readable and understandable
• For changing a single field, OBJECT_INSERT/OBJECT_DELETE are preferred
Inline Rewrites with Javascript UDFs
CREATE OR REPLACE FUNCTION transform("json" variant)
RETURNS VARIANT
LANGUAGE JAVASCRIPT
AS '
// modify json
return json;
';
SELECT transform(json) FROM sessions;
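For the single-field case mentioned above, where OBJECT_INSERT/OBJECT_DELETE are preferred over a whole-record UDF, a hedged sketch with hypothetical field names:
-- Overwrite (or add) one field inside the VARIANT; the TRUE flag allows updating an existing key
SELECT OBJECT_INSERT(json, 'deviceType', 'Phone', TRUE) FROM sessions;
-- Drop one field
SELECT OBJECT_DELETE(json, 'obsoleteField') FROM sessions;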
• There are many ways to develop a data pipeline, none of them are perfect
• Make sure that load/scale testing is a required part of your go-to-production plan
• Rollups make things cheaper, but at a great expense later
• Good abstractions survive a long time (e.g. Analytics API)
• Evolve pipeline modularly
• Maintain data consistency before/after
Stuff you have seen today...
Thank You.
P.S. We're hiring
Editor's Notes
• #13: Got the private beta through connections. Almost no documentation, something about SQL. Migration was a couple of days (a big factor). Amazing response times at first, also for bigger campaigns. In time, response times got slower and slower and nobody knew why; we talked to Google, and nobody mentioned they were doing full-table scans. At some point all campaign queries took 30s, a little later >60s => browser timeouts and total breakage. The combination of popularity and retries (because of the >60s responses) caused us to go over query quotas on some days, which was an even bigger disaster.
• #14: Separated write and read paths. Write events to S3, read and process with Hive, store to MySQL, asynchronously. Tried & tested, conservative solution. Conceptually almost 1:1 to BigQuery (almost the same SQL).
• #29: Managed service. Schema evolution: how fast are schema changes (adding a column, removing a column, adding a column with a default value, computing a new column from other data, changing a field type)? Do we need a data rewrite to change the schema? If so, how long does it take to migrate all data to the new schema, and what is the estimated cost of a rewrite? Storage and nested objects: support for complex nested data structures; columnar storage (are nested objects "columnarized"?); data storage format, open or closed? Workload management: scaling compute resources separately from storage.