Re-imagine Data Monitoring with whylogs and Apache Spark
Andy Dang
Co-Founder & Lead Engineer, WhyLabs
Outline
● ML Data Challenges
○ How traditional data analysis techniques fail ML data pipelines
● Lightweight Profiling for Big ML Data
○ Profiling techniques for detecting data quality problems
● The Open Source whylogs Library
○ Building the standard for data logging
ML Lifecycle
[Figure: ML lifecycle diagram. Source: Google Cloud AI]
Issues encountered in production (small sample)...
...or it simply doesn’t work, and nobody knows why...
● Experiment/production environment mismatch
● Wrong model version deployed
● Underprovisioned hardware
● Inappropriate hardware
● Latency/SLA issues
● Data permissions misconfigured
● Untracked changes broke prod
● Traffic sent to the wrong model
● Computational instability
● Customers gaming the model (adversarial attacks)
● PII data exposed
● Expected accuracy doesn’t materialize
● Pre-processing mismatch in experiments vs. production
● Retrained on faulty data
● Accuracy improves on one segment, regresses in others
● Outliers predicted incorrectly
● Bias identified
● Correlation with protected features
● Overfitting on training/test
● Surge in missing values
● Surge in duplicates
● Poor performance on new categories
● Poor performance on new customer segments
● Poor performance on outliers
● Data quality issues affect accuracy
● Production data doesn’t match test/training
● Accuracy is decaying over time
● Data drift in inputs
● Concept drift in outputs
● Extreme predictions for out-of-distribution data
● Model not generalizing on new data / new segments
● Major customer behavior shift
Issues encountered in production (small sample)... issues caused by data
[Slide repeats the list above with the issues caused by data highlighted.]
Data monitoring starts with logging
● Data Logs (i.e. data profiling)
● Model Metadata
● Pipeline Metadata
Data profiling refers to the analysis of information [...] in order to clarify the structure, content, relationships, and derivation rules of the data [Wikipedia]
Data logs: sampling vs. profiling

Sampling
Pros
● Easy to build
● Little upfront design
● Log & raw data analysis identical
Cons
● I/O & storage
● Noisy
● Requires statistical analysis
● Rare events & outliers
● Min/max, unique values, etc.

Profiling
Pros
● Scalable & lightweight
● Flexible & configurable
● Rare events and outlier-dependent metrics
● Directly interpretable results
Cons
● Data-dependent output format
● No existing widespread solutions
● Mathematical & engineering challenges
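
As a rough illustration of the rare-event point in the comparison above, the sketch below contrasts a 1% uniform sample with a single-pass min/max tracker on synthetic data. The data, the sample rate, and the helper name `streaming_min_max` are assumptions made for this example; none of them come from the talk.

```python
import random


def streaming_min_max(values):
    """Track min and max in one pass with constant memory (hypothetical helper)."""
    lo, hi = float("inf"), float("-inf")
    for v in values:
        lo = min(lo, v)
        hi = max(hi, v)
    return lo, hi


rng = random.Random(0)
# Synthetic feature: 100,000 values around 100, plus one rare outlier.
data = [rng.gauss(100.0, 10.0) for _ in range(100_000)]
data[12_345] = 10_000.0

# Sampling approach: keep 1% of the rows and analyze only those.
sample = rng.sample(data, k=1_000)
sample_max = max(sample)

# Profiling approach: look at every row once, keep only the summary.
profile_min, profile_max = streaming_min_max(data)

print("max seen by sample :", sample_max)
print("max seen by profile:", profile_max)
```

A 1% sample contains any single given record only about 1% of the time, so `sample_max` will usually sit far below the outlier, while the single-pass tracker always reports it.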
Data logs: must be accurate
[Figure] Median: errors in the estimate of the median for sampling vs. profiling for various distributions. Mean absolute error and mean relative (fractional) absolute error are shown.
Data logs: must be scalable
Dataset        Size    # of entries   # of features   Memory consumption   Output size
Lending Club   1.6GB   2.2M           151             14MB                 7.4MB
NYC Tickets    1.9GB   10.8M          43              14MB                 2.3MB
Pain pills     75GB    178M           42              15MB                 2MB
Logging ML data at scale
Four key paradigms:
● Approximations rather than exact results
● Lightweight
● Additive
● Batch and streaming support
Profile: a collection of lightweight metrics that provide these properties
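
To make the idea of a profile concrete, here is a minimal sketch of a numeric column profile that is lightweight (constant memory), additive (two profiles merge into one), and usable on a stream (one value at a time). The class name `NumericProfile` and its fields are hypothetical and are not the whylogs data model.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class NumericProfile:
    """Hypothetical constant-size summary of one numeric column."""
    count: int = 0
    null_count: int = 0
    total: float = 0.0
    minimum: float = float("inf")
    maximum: float = float("-inf")

    def track(self, value: Optional[float]) -> None:
        """Update the summary with one value; works for batch or streaming input."""
        self.count += 1
        if value is None:
            self.null_count += 1
            return
        self.total += value
        self.minimum = min(self.minimum, value)
        self.maximum = max(self.maximum, value)

    def merge(self, other: "NumericProfile") -> "NumericProfile":
        """Additive property: combine two summaries without touching raw data."""
        return NumericProfile(
            count=self.count + other.count,
            null_count=self.null_count + other.null_count,
            total=self.total + other.total,
            minimum=min(self.minimum, other.minimum),
            maximum=max(self.maximum, other.maximum),
        )

    @property
    def mean(self) -> Optional[float]:
        non_null = self.count - self.null_count
        return self.total / non_null if non_null else None


first, second = NumericProfile(), NumericProfile()
for v in [1.0, 2.0, None]:
    first.track(v)
second.track(3.0)

combined = first.merge(second)
print(combined)
```

These particular metrics happen to be exact; the approximation paradigm matters for statistics such as quantiles and distinct counts, which cannot be kept exactly in constant memory.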
Lightweight
Old approach: process -> process -> process -> process -> Data Warehouse/Data Lake -> Processing Engine
New approach: profiling runs alongside each process step -> Profile Store -> Analysis
Only feasible if:
● Profiling is fast
● Profiling is not memory intensive
Additive
Exact median: dataset 1, dataset 2, dataset 3 -> sort (shuffle) -> reduce step -> Median
With profiles: dataset 1 -> profile 1, dataset 2 -> profile 2, dataset 3 -> profile 3; add(profile 1, profile 2, profile 3) -> Estimated Median
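
The additive path can be sketched with a deliberately simple mergeable summary: a fixed-range histogram whose bin counts are added together. This is a stand-in chosen for brevity and is an assumption of this example; it needs the value range up front, which production quantile sketches do not. The class name `HistogramProfile` is hypothetical.

```python
import random
import statistics


class HistogramProfile:
    """Fixed-range, fixed-bin histogram that merges by adding bin counts."""

    def __init__(self, lo: float, hi: float, bins: int = 1000):
        self.lo, self.hi, self.bins = lo, hi, bins
        self.counts = [0] * bins
        self.n = 0

    def track(self, x: float) -> None:
        idx = int((x - self.lo) / (self.hi - self.lo) * self.bins)
        idx = min(max(idx, 0), self.bins - 1)  # clamp out-of-range values
        self.counts[idx] += 1
        self.n += 1

    def merge(self, other: "HistogramProfile") -> "HistogramProfile":
        if (self.lo, self.hi, self.bins) != (other.lo, other.hi, other.bins):
            raise ValueError("profiles must share the same binning")
        merged = HistogramProfile(self.lo, self.hi, self.bins)
        merged.counts = [a + b for a, b in zip(self.counts, other.counts)]
        merged.n = self.n + other.n
        return merged

    def quantile(self, q: float) -> float:
        """Estimate a quantile by interpolating inside the bin that holds it."""
        if self.n == 0:
            raise ValueError("empty profile")
        target = q * self.n
        width = (self.hi - self.lo) / self.bins
        running = 0
        for i, c in enumerate(self.counts):
            if c and running + c >= target:
                return self.lo + (i + (target - running) / c) * width
            running += c
        return self.hi


rng = random.Random(42)
datasets = [[rng.gauss(mu, 5.0) for _ in range(10_000)] for mu in (40.0, 50.0, 60.0)]

# Each dataset is profiled on its own; no dataset ever sees another's rows.
profiles = []
for data in datasets:
    p = HistogramProfile(0.0, 100.0)
    for x in data:
        p.track(x)
    profiles.append(p)

merged = profiles[0].merge(profiles[1]).merge(profiles[2])
estimated_median = merged.quantile(0.5)

# The exact answer needs all raw rows in one place and a global sort.
exact_median = statistics.median(x for data in datasets for x in data)
print(estimated_median, exact_median)
```

The estimate is only as fine as the bin width (0.1 here), which is the accuracy traded away to avoid the sort and shuffle.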
Batch and streaming support
Batch (Spark/Hive query engine): partition 1, partition 2, partition 3, ..., partition n are each profiled independently. No shuffle!
Streaming: day 0, day 1, day 2, day 3, ... are each profiled independently, then combined with sum(profiles)
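
The same map-then-add shape covers both cases. The sketch below profiles each partition independently and then sums the results, first within a day and then across days. It uses plain Python with `collections.Counter` as a toy additive profile; the row data, column names, and function names are made up for the example. In Spark this would roughly correspond to an RDD `mapPartitions` followed by `reduce`, though that wiring is not shown here.

```python
from collections import Counter
from functools import reduce


def profile_partition(rows):
    """Profile one partition in a single pass; returns an additive Counter."""
    profile = Counter()
    for row in rows:
        profile["rows"] += 1
        for column, value in row.items():
            if value is None:
                profile[column + ".nulls"] += 1
    return profile


def add_profiles(a, b):
    """Merging only adds small summaries, so no raw rows move between workers."""
    return a + b


# Hypothetical input: two days of data, each split into two partitions.
days = {
    "day 0": [
        [{"amount": 12.5, "state": "WA"}, {"amount": None, "state": "CA"}],
        [{"amount": 7.0, "state": None}],
    ],
    "day 1": [
        [{"amount": None, "state": "NY"}],
        [{"amount": 3.2, "state": "WA"}, {"amount": 9.9, "state": "OR"}],
    ],
}

# Batch: one profile per partition, summed into one profile per day.
daily_profiles = {
    day: reduce(add_profiles, (profile_partition(p) for p in partitions), Counter())
    for day, partitions in days.items()
}

# Streaming: daily profiles roll up into a longer window by the same addition.
rollup = reduce(add_profiles, daily_profiles.values(), Counter())
print(dict(rollup))
```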
Approximate Statistics
● Using Stochastic Streaming Algorithms
○ Model the problem as a stochastic process
○ Apache DataSketches is the open source implementation
● Statistics that we focus on at the moment:
○ Histograms
○ Frequent items
○ Cardinality
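
For a feel of how an approximate, mergeable statistic works, here is a small cardinality estimator based on the K-Minimum-Values idea: hash every item to a number in [0, 1), keep only the k smallest hashes, and infer the distinct count from how tightly they are packed. It is written for illustration only; it is not the Apache DataSketches API, and the class name `KMVSketch` is hypothetical.

```python
import hashlib
import heapq


class KMVSketch:
    """K-Minimum-Values distinct-count estimator (illustrative, not a library API)."""

    def __init__(self, k: int = 1024):
        self.k = k
        self._heap = []        # negated hashes, so heap[0] is the largest kept hash
        self._members = set()  # kept hashes, for duplicate detection

    @staticmethod
    def _hash(item) -> float:
        digest = hashlib.sha1(repr(item).encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)

    def _offer(self, h: float) -> None:
        if h in self._members:
            return
        if len(self._heap) < self.k:
            heapq.heappush(self._heap, -h)
            self._members.add(h)
        elif h < -self._heap[0]:
            evicted = -heapq.heappushpop(self._heap, -h)
            self._members.discard(evicted)
            self._members.add(h)

    def update(self, item) -> None:
        self._offer(self._hash(item))

    def merge(self, other: "KMVSketch") -> "KMVSketch":
        """The k smallest hashes of the union come from the two kept sets."""
        merged = KMVSketch(self.k)
        for h in self._members | other._members:
            merged._offer(h)
        return merged

    def estimate(self) -> float:
        if len(self._heap) < self.k:
            return float(len(self._heap))  # fewer than k distinct hashes: exact
        return (self.k - 1) / (-self._heap[0])


a, b = KMVSketch(), KMVSketch()
for i in range(0, 15_000):
    a.update(f"user-{i}")
for i in range(10_000, 30_000):
    b.update(f"user-{i}")

# The two streams overlap, yet the merged sketch estimates the distinct union.
union_estimate = a.merge(b).estimate()
print(round(union_estimate))
```

Memory stays bounded by k regardless of stream length, and the relative error shrinks roughly as 1/sqrt(k).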
whylogs: The Data Logging Library
● Multi-language support: Python + Java
● Supports both data engineering and data science workflows
● Extensibility: image support today; text, video, audio & embeddings support to come
● Growing integration list: [integration logos shown on slide]
whylogs: Python
● A few lines of code to start logging
● Integrates with popular data science libraries
● Out-of-the-box visualization utilities
whylogs in Apache Spark
Data Lake (col1, col2, …, coln) -> partition 1, partition 2, ..., partition k -> one profile per partition -> merge(profile1, profile2, …, profilek) -> global profile
Each profile contains: Schema, Metadata, Sketches, Metrics
Simple Spark API
PySpark support
Catch distribution drift in a few lines of code
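
The code shown on this slide is not part of the transcript. As a stand-in, here is one way drift could be checked from two histogram profiles that share the same bins, using the Hellinger distance. The choice of distance, the bin counts, and the 0.1 threshold are all assumptions of this sketch, not the whylogs method.

```python
import math


def hellinger(counts_a, counts_b):
    """Hellinger distance between two histograms over identical bins (0 = same, 1 = disjoint)."""
    if len(counts_a) != len(counts_b):
        raise ValueError("histograms must use the same bins")
    total_a, total_b = sum(counts_a), sum(counts_b)
    if total_a == 0 or total_b == 0:
        raise ValueError("histograms must not be empty")
    squared = sum(
        (math.sqrt(a / total_a) - math.sqrt(b / total_b)) ** 2
        for a, b in zip(counts_a, counts_b)
    )
    return math.sqrt(squared / 2)


# Hypothetical bin counts for one feature, taken from stored profiles.
baseline = [120, 300, 410, 150, 20]
today_similar = [118, 310, 400, 152, 20]
today_shifted = [20, 150, 410, 300, 120]

DRIFT_THRESHOLD = 0.1  # arbitrary value chosen for this example

for name, current in [("similar", today_similar), ("shifted", today_shifted)]:
    distance = hellinger(baseline, current)
    print(name, round(distance, 4), "DRIFT" if distance > DRIFT_THRESHOLD else "ok")
```

Because the comparison runs on profiles rather than raw rows, it costs the same whether the underlying dataset has a thousand rows or a billion.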
Scalable monitoring at input feature granularity
Monitoring layer for ML applications
Help build the open standard for data logging!
bit.ly/whylogs
andy@whylabs.ai
@andy_dng
Thank you!
