#EntSAIS18
Strava Labs: Exploring a Billion Activity Dataset from Athletes with Apache Spark
Drew Robb, Infrastructure/Data at Strava
Outline
1. Activity Stream Data “Lake”
2. Grade Adjusted Pace
3. Global Heatmap
a. Rasterize
b. Normalize
c. Recursion/Zoom out
d. Store+Serve
4. Clusterer
5. Final Thoughts
Data “Lake”
- Activity Stream: array of measurement data over time from a single Run/Ride/etc.
- Typically at a 1/second rate
- Stream types: time, position (lat, lng), elevation, heart rate, ...
Data “Lake”
- Billions of streams, each with thousands of points
- Canonical stream storage: 1 file per stream in S3!
- Keyed by autoincrement id!!
- Over 10 billion total S3 files!
- Can’t actually bulk read: too error prone, too expensive
- Need a replica of the data in a different format suitable for bulk reads with Spark
Data “Lake”
- Delta encoding + compression
- Parquet + S3 multi-partition Hive table
- ~100MB per S3 file (streams for ~30,000 activities)
- New data batches appended as new partitions
- Handling updates: the entire table is periodically rewritten to a new S3 path, then the metastore is updated
- Bulk read from the original data store: cost ~$5,000, time: 3 days
- Bulk read from the optimized replica: cost ~$5, time: 5 min
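The delta encoding step above can be sketched as follows. This is an illustrative sketch, not Strava's actual encoder: for a 1 Hz time stream the deltas are almost all 1, so a generic compressor applied afterwards shrinks them dramatically.

```scala
object DeltaEncoding {
  // Delta-encode a stream: keep the first value, then successive differences.
  // For 1 Hz activity streams most deltas are tiny and repetitive, which a
  // generic compressor (e.g. gzip/zstd) then squeezes very well.
  def encode(values: Seq[Long]): Seq[Long] =
    if (values.isEmpty) Seq.empty
    else values.head +: values.zip(values.tail).map { case (a, b) => b - a }

  // Invert the encoding with a running sum.
  def decode(deltas: Seq[Long]): Seq[Long] =
    if (deltas.isEmpty) Seq.empty
    else deltas.scanLeft(0L)(_ + _).tail
}
```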
Grade Adjusted Pace
- A synthetic running pace that reflects how fast you would have run in the absence of hills
- Helps athletes compare performance on flat vs hilly runs
- Requires a precise understanding of how running difficulty is affected by elevation gradient
Grade Adjusted Pace
- Previous model: derived from metabolic measurements of a few athletes running on a treadmill
- New model: fit from actual running data drawn from millions of real-world runs
- Find smooth data sample windows
- Only draw data from near-maximum efforts for each athlete
Global Heatmap
- 700 million activities
- 1.4 trillion GPS points
- 10 billion miles
- 100,000 years of exercise
- 10TB input
- 500GB output
- 5 layers by activity type
- https://strava.com/heatmap
Rasterize
Web Mercator Tile Math
- Pixel coordinate system for the Earth’s surface
- Each zoom level: 4x as many tiles at 2x the resolution
- Each tile is 256x256 pixels
def fromPoint(lat, lng, zoom) = TilePixel(
  zoom   = zoom,
  tileY  = floor((lat + 90 ) / 180 * 2^zoom),
  tileX  = floor((lng + 180) / 360 * 2^zoom),
  pixelY = floor((256 * (lat + 90 ) / 180 * 2^zoom) % 256),
  pixelX = floor((256 * (lng + 180) / 360 * 2^zoom) % 256))
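A runnable version of the tile math above. This sketch keeps the slide's simplified linear-latitude mapping; true Web Mercator additionally passes latitude through the Mercator projection before tiling.

```scala
object TileMath {
  case class TilePixel(zoom: Int, tileX: Int, tileY: Int, pixelX: Int, pixelY: Int)

  // Map a (lat, lng) point to its tile and within-tile pixel at a zoom level.
  // There are 2^zoom tiles per axis and 256 pixels per tile edge.
  // Note: points exactly on the far edge (lat = 90, lng = 180) would need clamping.
  def fromPoint(lat: Double, lng: Double, zoom: Int): TilePixel = {
    val n  = 1 << zoom                            // tiles per axis
    val gx = (lng + 180.0) / 360.0 * n * 256.0    // global pixel x
    val gy = (lat + 90.0)  / 180.0 * n * 256.0    // global pixel y
    TilePixel(zoom,
      tileX  = (gx / 256.0).toInt,
      tileY  = (gy / 256.0).toInt,
      pixelX = (gx % 256.0).toInt,
      pixelY = (gy % 256.0).toInt)
  }
}
```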
Rasterize
- Convert vectorized lat/lng line data into discrete pixel lines (Bresenham’s line algorithm)
- Aggregate the count of total passes over each pixel
- Total pixel count: ~7 trillion!
- Total tile count: ~30 million
- Total memory load of tiles is 30TB; avoid simultaneous allocation by having many more tasks than cores
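The per-pixel aggregation above amounts to tracing each GPS segment with Bresenham's algorithm and counting visits. A minimal local sketch (integer pixel coordinates assumed already computed from lat/lng; the real job does this aggregation distributed in Spark):

```scala
object Rasterize {
  // Bresenham's line algorithm: every integer pixel on the segment (x0,y0)-(x1,y1).
  def line(x0: Int, y0: Int, x1: Int, y1: Int): Seq[(Int, Int)] = {
    val dx = math.abs(x1 - x0); val sx = if (x0 < x1) 1 else -1
    val dy = -math.abs(y1 - y0); val sy = if (y0 < y1) 1 else -1
    var x = x0; var y = y0; var err = dx + dy
    val out = scala.collection.mutable.ArrayBuffer((x, y))
    while (x != x1 || y != y1) {
      val e2 = 2 * err
      if (e2 >= dy) { err += dy; x += sx }
      if (e2 <= dx) { err += dx; y += sy }
      out += ((x, y))
    }
    out.toSeq
  }

  // Count how many times each pixel is crossed by any segment.
  def counts(segments: Seq[(Int, Int, Int, Int)]): Map[(Int, Int), Int] =
    segments
      .flatMap { case (x0, y0, x1, y1) => line(x0, y0, x1, y1) }
      .groupBy(identity)
      .map { case (pixel, hits) => pixel -> hits.size }
}
```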
Rasterize
Boundary case: adjacent points in different tiles
Rasterize
Solution: Add virtual points on tile boundary
Rasterize
- Add noise to “uncorrect” GPS road corrections
- Other filtering of erroneous data
Normalize
Normalize
- The raw count of activity per pixel has domain [0, inf)
- To visualize with a colormap we need [0, 1]
- Tiles have dramatically different distributions of raw count values
- General and parameter-free solution:
  - Normalize using the Cumulative Distribution Function (CDF) of raw count values within each tile
  - Approximate the CDF by sampling (sample 2^8 out of 2^16 values)
  - Evaluate the CDF with binary search in the sorted sample array
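The sampled-CDF trick above can be sketched like this: take a random sample of a tile's raw counts, sort it, and normalize any value by binary-searching its rank. An illustrative sketch, not Strava's exact code (with duplicate values `binarySearch` may land on any equal element, which is fine for an approximation):

```scala
object CdfNormalize {
  import java.util.Arrays

  // Approximate a tile's CDF by a sorted random sample of its raw pixel counts.
  def sampleCdf(raw: Array[Double], sampleSize: Int, seed: Long = 42L): Array[Double] = {
    val rng = new scala.util.Random(seed)
    val sample = Array.fill(sampleSize)(raw(rng.nextInt(raw.length)))
    Arrays.sort(sample)
    sample
  }

  // Evaluate the CDF at v: the fraction of sampled values <= v, via binary search.
  def evaluate(sorted: Array[Double], v: Double): Double = {
    val i = Arrays.binarySearch(sorted, v)
    val rank = if (i >= 0) i + 1 else -(i + 1)   // insertion point when not found
    rank.toDouble / sorted.length
  }
}
```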
Problem: the normalization function is different for every tile!
=> Artifacts along tile boundaries
Normalize
Solution:
- Averaging: each tile’s CDF is calculated from data for that tile plus some radius of neighboring tiles
Normalize
Solution:
- Averaging: each tile’s CDF is calculated from data for that tile plus some radius of neighboring tiles
- Bilinear smoothing:
  - Every pixel value is a weighted average of the CDFs of the 4 nearest tiles
  - Continuously varying, pixel-perfect normalization!
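The bilinear step can be sketched as: evaluate the CDFs of the four nearest tiles at a pixel's raw value and blend them with bilinear weights. A minimal sketch with hypothetical `cdf` functions; the weights come from the pixel's fractional position between the tile centers and always sum to 1, so the blended value stays in [0, 1]:

```scala
object BilinearBlend {
  // Blend the normalization functions (CDFs) of the 4 nearest tiles at a pixel.
  // (fx, fy) is the pixel's fractional position between the four tile centers,
  // each in [0, 1].
  def blend(v: Double,
            cdf00: Double => Double, cdf10: Double => Double,
            cdf01: Double => Double, cdf11: Double => Double,
            fx: Double, fy: Double): Double = {
    val w00 = (1 - fx) * (1 - fy); val w10 = fx * (1 - fy)
    val w01 = (1 - fx) * fy;       val w11 = fx * fy
    w00 * cdf00(v) + w10 * cdf10(v) + w01 * cdf01(v) + w11 * cdf11(v)
  }
}
```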
Recursion / Zoom Out
Recursion / Zoom Out
def finish(data, zoom) {
  write(normalize(data), zoom)
  if (zoom > 0) {
    finish(zoomOut(data), zoom - 1)
  }
}
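The `zoomOut(data)` step can be sketched as: halve every global pixel coordinate and re-sum, which merges each 2x2 block of child pixels into one parent pixel. A local sketch of what the Spark job does (the `normalize`/`write` steps are omitted; counts are assumed keyed by non-negative global pixel coordinates):

```scala
object ZoomOut {
  // One zoom-out step over counts keyed by global pixel coordinates: halving
  // both coordinates merges each 2x2 block of child pixels into one parent pixel.
  def zoomOut(counts: Map[(Int, Int), Long]): Map[(Int, Int), Long] =
    counts.toSeq
      .map { case ((x, y), c) => ((x / 2, y / 2), c) }
      .groupBy(_._1)
      .map { case (pixel, cs) => pixel -> cs.map(_._2).sum }
}
```

Each application shrinks the data by roughly 4x, which is exactly the exponential decay visible in the per-level byte counts below.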
Recursion / Zoom Out
- Each iteration has ~1/4th as much data
- Reduce the number of tasks so as not to waste time
- The exponential speedup of stage runtime is incredibly satisfying:
Level 16 written, 136928925282 bytes
Level 15 written, 53804156232 bytes
Level 14 written, 19795342663 bytes
Level 13 written, 6996622566 bytes
Level 12 written, 2474617135 bytes
Level 11 written, 897467724 bytes
Level 10 written, 331303516 bytes
Level 9 written, 122600962 bytes
Level 8 written, 44957727 bytes
Level 7 written, 16254495 bytes
Level 6 written, 5728247 bytes
Level 5 written, 1978297 bytes
Level 4 written, 673066 bytes
Level 3 written, 228565 bytes
Level 2 written, 78380 bytes
Level 1 written, 27417 bytes
Store+Serve
Store+Serve
Idea: group spatially adjacent tiles to get similar file sizes. Consider the tile data as a quad tree.
- Starting from the leaves, walk up the tree, accumulating total file size at each node.
- When the accumulated file size exceeds a threshold, write all tiles under that node as one file.
- Direct S3 write (no hadoopFS API)
- Repeat for each level of tile data.
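The quad-tree walk above can be sketched as follows. A simplified illustration (a hypothetical `pack` helper, not Strava's code): sibling groups are repeatedly merged under their parent tile, and once a merged group reaches the size threshold it is emitted as one file; anything left at the root is flushed at the end.

```scala
object TilePacking {
  // A pending subtree: the tiles it contains and their accumulated byte size.
  case class Group(tiles: List[(Int, Int)], size: Long)

  // Bottom-up packing of one zoom level's tiles keyed by (x, y) with byte sizes.
  def pack(leaves: Map[(Int, Int), Long], leafZoom: Int,
           threshold: Long): List[List[(Int, Int)]] = {
    var pending: Map[(Int, Int), Group] =
      leaves.map { case (xy, size) => xy -> Group(List(xy), size) }
    var emitted = List.empty[List[(Int, Int)]]
    for (_ <- leafZoom until 0 by -1) {
      // Merge siblings under their parent tile one level up.
      pending = pending.toSeq
        .groupBy { case ((x, y), _) => (x / 2, y / 2) }
        .map { case (parent, children) =>
          parent -> Group(children.flatMap(_._2.tiles).toList,
                          children.map(_._2.size).sum)
        }
      // Emit any group that has reached the file-size threshold.
      val (full, partial) = pending.partition(_._2.size >= threshold)
      emitted = emitted ++ full.values.map(_.tiles)
      pending = partial
    }
    emitted ++ pending.values.map(_.tiles)   // flush whatever reached the root
  }
}
```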
Tile Packing
Zoom 2:
- (1,0)
- (2,2)
Zoom 1:
- (0,2), (1,2), (0,3), (1,3)
- (3,2), (2,3), (3,3)
Zoom 0:
- Everything else
Store+Serve
Reads:
- Reconstruct the tree of tile groups in memory
- Traverse the tree to find a tile’s location
- Cache the tile group file: nearby tiles are also likely to be served
- Persisted data is 1 byte per pixel; the colormap and PNG encoding are applied at read time
- The final rendered PNG is also cached by CloudFront
Clusterer
- Discover sets of spatially similar activities using clustering
- Compute patterns in aggregated activity properties
- Build a catalog of routes for every cluster (a few million clusters)
Clusterer
- Locality Sensitive Hashing (SuperMinHash, O. Ertl, 2017) of quantized lat/lng shingles from activity streams
- MinHash pairwise similarity of activities: a 10,000x speedup over direct stream-based methods
- Similarity join on collisions to efficiently find a sample of highly similar pairs
- Single-linkage clustering globally on the pairs
- Split locally and repeat, rehashing a sample of activities in each cluster
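The shingling + MinHash step can be sketched as: quantize each stream's lat/lng points to a grid, treat the set of grid cells as shingles, and compare short MinHash signatures instead of the raw streams. A plain-MinHash sketch for illustration; the talk uses SuperMinHash, a lower-variance variant of the same idea:

```scala
object MinHashSketch {
  import scala.util.hashing.MurmurHash3

  // Quantize lat/lng points to grid cells; the set of cells is the "shingle" set.
  def shingles(points: Seq[(Double, Double)], cell: Double): Set[(Long, Long)] =
    points.map { case (lat, lng) =>
      (math.floor(lat / cell).toLong, math.floor(lng / cell).toLong)
    }.toSet

  // k-slot MinHash signature: for each of k hash functions (here, Murmur3
  // salted with the slot index), keep the minimum hash over the set.
  def signature(s: Set[(Long, Long)], k: Int): Seq[Int] =
    (0 until k).map(i => s.map(x => MurmurHash3.productHash((x, i))).min)

  // The fraction of matching slots estimates the Jaccard similarity.
  def similarity(a: Seq[Int], b: Seq[Int]): Double =
    a.zip(b).count { case (x, y) => x == y }.toDouble / a.length
}
```

Because similar routes share most grid cells, their signatures collide in many slots, which is what the similarity join on hash collisions exploits.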
Spark Environment
- Spark on Mesos on AWS
- Jobs submitted with Metronome/Marathon
- Isolated context with a job-specific Docker image for the driver + executors
- Mesos cluster running multiple Spark contexts
- Scale up instances based on unscheduled tasks
- Prevent scale-down on instances with active executors
- 100 i3.2xlarge machines (8 CPUs, 60GB RAM, 1.7TB SSD)
- 5 hours for a full heatmap build
Spark Lessons
- Parameter tuning seems unavoidable in well-utilized distributed systems
- With Scala, managing application dependency conflicts can be hell (guava, netty, logging frameworks…)
- Optimizing locally can be very misleading; you must profile on a real cluster
- Even with fantastic configuration management tools, lots of time is wasted on configuration
- Building from scratch may be worthwhile to fully understand performance
- Invest in generating representative sample data to use in prototypes
- Spilling to disk can use a surprising amount of IOPS
- Probably still a lot more optimizations are possible
Resources
Strava Global Heatmap: strava.com/heatmap
Strava Labs: labs.strava.com
Strava Engineering Blog: medium.com/strava-engineering
We are hiring Data Scientists: strava.com/careers
Strava Labs Clusterer*: labs.strava.com/clusterer
*Currently offline, will be back soon
