Re-imagine Data Monitoring with whylogs and Apache Spark
Andy Dang
Co-Founder & Lead Engineer, WhyLabs

Outline
● ML Data Challenges: How traditional data analysis techniques fail ML data pipelines
● Lightweight Profiling for Big ML Data: Profiling techniques for detecting data quality problems
● The Open Source whylogs Library: Building the standard for data logging

ML Lifecycle
[Figure: ML lifecycle diagram. Source: Google Cloud AI]

Issues encountered in production (small sample)...
...or it simply doesn’t work, and nobody knows why...
● Experiment/production environment mismatch
● Wrong model version deployed
● Underprovisioned hardware
● Inappropriate hardware
● Latency/SLA issues
● Data permissions misconfigured
● Untracked changes broke prod
● Traffic sent to the wrong model
● Computational instability
● Customers gaming the model (adversarial attacks)
● PII data exposed
● Expected accuracy doesn’t materialize
● Pre-processing mismatch in experiments vs. production
● Retrained on faulty data
● Accuracy improves on one segment, regresses in others
● Outliers predicted incorrectly
● Bias identified
● Correlation with protected features
● Overfitting on training/test
● Surge in missing values
● Surge in duplicates
● Poor performance on new categories / customer segments
● Poor performance on outliers
● Data quality issues affect accuracy
● Production data doesn’t match test/training
● Accuracy is decaying over time
● Data drift in inputs
● Concept drift in outputs
● Extreme predictions for out-of-distribution data
● Model not generalizing on new data / new segments
● Major customer behavior shift

Issues encountered in production (small sample)...
issues caused by data
● Experiment/production environment mismatch
● Wrong model version deployed
● Underprovisioned hardware
● Inappropriate hardware
● Latency/SLA issues
● Data permissions misconfigured
● Untracked changes broke prod
● Traffic sent to the wrong model
● Computational instability
● Customers gaming the model (adversarial attacks)
● PII data exposed
● Expected accuracy doesn’t materialize
● Pre-processing mismatch in experiments vs. production
● Retrained on faulty data
● Accuracy improves on one segment, regresses in others
● Outliers predicted incorrectly
● Bias identified
● Correlation with protected features
● Overfitting on training/test
● Surge in missing values
● Surge in duplicates
● Poor performance on new categories / customer segments
● Poor performance on outliers
● Data quality issues affect accuracy
● Production data doesn’t match test/training
● Accuracy is decaying over time
● Data drift in inputs
● Concept drift in outputs
● Extreme predictions for out-of-distribution data
● Model not generalizing on new data / new segments
● Major customer behavior shift

Data monitoring starts with logging
● Data Logs (i.e. data profiling)
● Model Metadata
● Pipeline Metadata
"Data profiling refers to the analysis of information [...] in order to clarify the structure, content, relationships, and derivation rules of the data" [Wikipedia]

Data logs: sampling vs. profiling
Sampling
Pros
● Easy to build
● Little upfront design
● Log & raw data analysis identical
Cons
● I/O & storage
● Noisy
● Requires statistical analysis
● Rare events & outliers
● Min/max, unique values, etc.
Profiling
Pros
● Scalable & lightweight
● Flexible & configurable
● Rare events and outlier-dependent metrics
● Directly interpretable results
Cons
● Data dependent output format
● No existing widespread solutions
● Mathematical & engineering challenges

Data logs: must be accurate
[Figure] Median: errors in the estimate of the median for sampling vs. profiling for various distributions. Mean absolute error and mean relative (fractional) absolute error are shown.

Data logs: must be scalable
Dataset        Size    # of entries  # of features  Memory consumption  Output size
Lending Club   1.6GB   2.2M          151            14MB                7.4MB
NYC Tickets    1.9GB   10.8M         43             14MB                2.3MB
Pain pills     75GB    178M          42             15MB                2MB

Logging ML data at scale
Four key paradigms:
● Approximations rather than exact results
● Lightweight
● Additive
● Batch and streaming support
Profile: a collection of lightweight metrics that provide these properties

Lightweight
Old approach: process -> process -> process -> process -> Data Warehouse/Data Lake -> Processing Engine
New approach: every process step runs profiling alongside it -> Profile Store -> Analysis
Only feasible if:
● Profiling is fast
● Profiling is not memory intensive
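
One way to see why profiling inside each process step can stay fast and memory-light is a single-pass summary whose state does not grow with the number of records. The sketch below uses Welford's algorithm for a running mean and variance. The class name `RunningStats` is hypothetical and is not part of whylogs.

```python
from dataclasses import dataclass


@dataclass
class RunningStats:
    """Constant-memory summary of a numeric stream (Welford's algorithm)."""
    count: int = 0
    mean: float = 0.0
    m2: float = 0.0  # running sum of squared deviations from the mean

    def update(self, x: float) -> None:
        # One pass, O(1) work and O(1) memory per record.
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self) -> float:
        # Population variance; 0.0 for an empty stream.
        return self.m2 / self.count if self.count else 0.0


stats = RunningStats()
for value in range(1, 1001):  # stands in for records flowing through a pipeline step
    stats.update(float(value))
```

The state is three numbers regardless of how many records pass through, which is the property the "Only feasible if" bullets ask for.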

Additive
Exact median: dataset 1, dataset 2, dataset 3 -> sort (shuffle) -> reduce step -> Median
With profiles: dataset 1 -> profile 1, dataset 2 -> profile 2, dataset 3 -> profile 3; add(profile 1, profile 2, profile 3) -> Estimated Median
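
The additive idea can be illustrated with a deliberately simple stand-in: a fixed-bin histogram whose bin counts are added to merge profiles, followed by a median estimate read off the merged counts. This is a minimal sketch under stated assumptions, not the sketch algorithm whylogs uses. `HistogramProfile` and its methods are hypothetical names, and the value range must be known up front.

```python
from dataclasses import dataclass, field
from typing import Iterable, List


@dataclass
class HistogramProfile:
    """Fixed-bin histogram over [lo, hi); mergeable by adding bin counts."""
    lo: float
    hi: float
    bins: int
    counts: List[int] = field(default_factory=list)

    def __post_init__(self) -> None:
        if not self.counts:
            self.counts = [0] * self.bins

    def track(self, values: Iterable[float]) -> "HistogramProfile":
        width = (self.hi - self.lo) / self.bins
        for v in values:
            idx = int((v - self.lo) / width)
            idx = min(max(idx, 0), self.bins - 1)  # clamp out-of-range values
            self.counts[idx] += 1
        return self

    def add(self, other: "HistogramProfile") -> "HistogramProfile":
        # Merging needs only the two count arrays, never the raw data.
        assert (self.lo, self.hi, self.bins) == (other.lo, other.hi, other.bins)
        merged = [a + b for a, b in zip(self.counts, other.counts)]
        return HistogramProfile(self.lo, self.hi, self.bins, merged)

    def estimated_median(self) -> float:
        total = sum(self.counts)
        width = (self.hi - self.lo) / self.bins
        target, seen = total / 2, 0
        for i, c in enumerate(self.counts):
            if c and seen + c >= target:
                # Interpolate linearly inside the bin that holds the median.
                return self.lo + (i + (target - seen) / c) * width
            seen += c
        raise ValueError("empty profile")


datasets = [range(0, 300), range(300, 700), range(700, 1000)]
profiles = [HistogramProfile(0, 1000, 100).track(d) for d in datasets]
merged = profiles[0].add(profiles[1]).add(profiles[2])
```

The merged profile is identical to one built from all the data at once, so no sort or shuffle of the raw records is needed. The price is that the median is only accurate to within one bin width.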

Batch and streaming support
Batch: partition 1, partition 2, partition 3, ..., partition n are each profiled independently inside the Spark/Hive query engine. No shuffle!
Streaming: day 0, day 1, day 2, day 3, ... are each profiled separately; sum(profiles) combines them.
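
The sum(profiles) step can be sketched with a tiny additive summary. The example assumes a hypothetical `ColumnSummary` type (not a whylogs class) that tracks count, nulls, total, minimum, and maximum, and defines `+` so that Python's built-in `sum` combines daily profiles.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class ColumnSummary:
    count: int = 0                    # non-null values seen
    nulls: int = 0                    # missing values seen
    total: float = 0.0
    minimum: float = float("inf")
    maximum: float = float("-inf")

    @staticmethod
    def of(values: Iterable[Optional[float]]) -> "ColumnSummary":
        """Profile one batch (a partition, or one day of data) in a single pass."""
        summary = ColumnSummary()
        for v in values:
            one = ColumnSummary(nulls=1) if v is None else ColumnSummary(1, 0, v, v, v)
            summary = summary + one
        return summary

    def __add__(self, other: "ColumnSummary") -> "ColumnSummary":
        # Every field combines without looking at the raw data again.
        return ColumnSummary(
            self.count + other.count,
            self.nulls + other.nulls,
            self.total + other.total,
            min(self.minimum, other.minimum),
            max(self.maximum, other.maximum),
        )


days = [[1.0, None, 3.0], [10.0], [], [2.0, 2.0]]   # day 0 .. day 3
daily_profiles = [ColumnSummary.of(day) for day in days]
period_profile = sum(daily_profiles, ColumnSummary())
```

Because addition is associative, the same code serves both cases: partitions of one batch can be profiled in parallel and combined, and daily profiles can be rolled up into weekly or monthly ones later.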

Approximate Statistics
● Using Stochastic Streaming Algorithms
○ Model the problem as a stochastic process
○ Apache DataSketches is the open source implementation
● Statistics that we focus on at the moment:
○ Histograms
○ Frequent items
○ Cardinality
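
To make the cardinality item concrete, here is a minimal K-Minimum-Values sketch, a textbook technique for approximate distinct counts. It is an illustration only and not the Apache DataSketches implementation. `KMVSketch`, `_hash01`, and the choice of k = 256 are assumptions made for this example.

```python
import bisect
import hashlib
from typing import Iterable, List


def _hash01(item: str) -> float:
    """Map an item to a pseudo-uniform float in [0, 1)."""
    digest = hashlib.sha1(item.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


class KMVSketch:
    """Keeps only the k smallest hash values seen, so memory is O(k)."""

    def __init__(self, k: int = 256) -> None:
        self.k = k
        self.values: List[float] = []  # sorted, distinct, at most k entries

    def update(self, items: Iterable[str]) -> "KMVSketch":
        for item in items:
            h = _hash01(item)
            if len(self.values) == self.k and h >= self.values[-1]:
                continue  # cannot be among the k smallest
            pos = bisect.bisect_left(self.values, h)
            if pos < len(self.values) and self.values[pos] == h:
                continue  # item already seen
            self.values.insert(pos, h)
            del self.values[self.k:]
        return self

    def merge(self, other: "KMVSketch") -> "KMVSketch":
        # The k smallest hashes of a union come from the two k-smallest sets.
        out = KMVSketch(self.k)
        out.values = sorted(set(self.values) | set(other.values))[: self.k]
        return out

    def estimate(self) -> float:
        if len(self.values) < self.k:
            return float(len(self.values))  # fewer than k distinct hashes: exact
        return (self.k - 1) / self.values[-1]


a = KMVSketch().update(f"user-{i}" for i in range(6000))
b = KMVSketch().update(f"user-{i}" for i in range(4000, 10000))
merged = a.merge(b)  # estimates distinct users across both batches
```

The estimate is approximate (relative error on the order of 1/sqrt(k)), but the sketch stays at k numbers no matter how many items are seen, and two sketches merge into exactly the sketch of the combined data.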

whylogs: The Data Logging Library
● Multi-language support: Python + Java
● Supports both data engineering and data science workflows
● Extensibility: image support today; text, video, audio & embeddings support to come
● Growing list of integrations

whylogs: Python
● A few lines of code to start logging
● Integrate with popular data science libraries
● Out of the box visualization utilities

whylogs in Apache Spark
Data Lake (col1, col2, …, coln) -> partition 1, partition 2, ..., partition k -> one profile per partition -> merge(profile1, profile2, …, profilek) -> global profile
Each profile holds: Schema, Metadata, Sketches, Metrics
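
The per-partition profile followed by a merge into a global profile can be imitated in plain Python, with lists standing in for Spark partitions. This is a hedged sketch of the pattern only: `profile_partition` and `merge` are hypothetical helpers, not the whylogs Spark API, and the "profile" here records nothing more than the value types seen per column. In Spark the same shape would typically be a per-partition map followed by a reduce.

```python
from collections import Counter
from functools import reduce
from typing import Any, Dict, Iterable

Profile = Dict[str, Counter]  # column name -> counts of observed value types


def profile_partition(rows: Iterable[Dict[str, Any]]) -> Profile:
    """Profile one partition: tally the value types seen in each column."""
    profile: Profile = {}
    for row in rows:
        for column, value in row.items():
            profile.setdefault(column, Counter())[type(value).__name__] += 1
    return profile


def merge(left: Profile, right: Profile) -> Profile:
    """Combine two partition profiles column by column."""
    merged = {column: Counter(counts) for column, counts in left.items()}
    for column, counts in right.items():
        merged.setdefault(column, Counter()).update(counts)
    return merged


rows = [
    {"col1": 1, "col2": "a"},
    {"col1": 2, "col2": None},
    {"col1": 3.5, "col2": "b"},
    {"col1": 4, "col2": "c"},
]
partitions = [rows[:2], rows[2:3], rows[3:]]  # stand-in for k partitions
global_profile = reduce(merge, map(profile_partition, partitions))
```

Only the small per-partition profiles travel to the merge step, so the raw rows never have to be moved between workers.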

Catch distribution drift in a few lines of code
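
As an illustration of what a drift check on profiles can look like, the sketch below compares two histograms of the same feature using the Hellinger distance. It is an assumption-laden example rather than the whylogs API: the function name, the sample counts, and the 0.2 threshold are all hypothetical.

```python
import math
from typing import Sequence


def hellinger_distance(p_counts: Sequence[float], q_counts: Sequence[float]) -> float:
    """Hellinger distance between two histograms with identical bins.

    Returns 0 for identical distributions and 1 for non-overlapping ones.
    """
    if len(p_counts) != len(q_counts):
        raise ValueError("histograms must share the same bins")
    p_total, q_total = sum(p_counts), sum(q_counts)
    if p_total == 0 or q_total == 0:
        raise ValueError("histograms must not be empty")
    # Bhattacharyya coefficient of the two normalized histograms.
    bc = sum(
        math.sqrt((p / p_total) * (q / q_total))
        for p, q in zip(p_counts, q_counts)
    )
    return math.sqrt(max(0.0, 1.0 - bc))


baseline = [5, 20, 50, 20, 5]   # hypothetical training-time histogram
today = [0, 5, 20, 50, 25]      # hypothetical production histogram
DRIFT_THRESHOLD = 0.2           # assumed alert level; tune per feature
drifted = hellinger_distance(baseline, today) > DRIFT_THRESHOLD
```

Because the comparison runs on histogram counts alone, it can be applied to stored profiles without going back to the raw data.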

Scalable monitoring at input feature granularity

Monitoring layer for ML applications

Thank you!
Help build the open standard for data logging!
bit.ly/whylogs
andy@whylabs.ai
@andy_dng
