Data Quality with or without Apache Spark and its ecosystem
Serge Smertin, Sr. Resident Solutions Architect at Databricks
▪ Intro
▪ Dimensions
▪ Frameworks
▪ TLDR
▪ Outro
About me
▪ Worked in all stages of the data lifecycle for the past 14 years
▪ Built data science platforms from scratch
▪ Tracked cyber criminals through massively scaled data forensics
▪ Built anti-PII analysis measures for the payments industry
▪ Now bringing Databricks strategic customers to the next level as a full-time job
“Data quality requires a certain level of sophistication within a company to even understand that it’s a problem.”
— Colleen Graham, “Performance Management Driving BI Spending”, InformationWeek, February 14, 2006
https://www.informationweek.com/performance-management-driving-bi-spending/d/d-id/1040552
Data Catalogs → Data Profiling → ETL → Quality Checks → Metrics repository → Alerting → Noise filtering → Dashboards / On-call
The Quality Checks stage breaks down into these dimensions:
▪ Completeness
▪ Consistency
▪ Uniqueness
▪ Timeliness
▪ Relevance
▪ Accuracy
▪ Validity
Record level:
- Stream-friendly
- Quarantine invalid data
- Debug and re-process
- Make sure to (re-)watch the “Make reliable ETL easy on Delta Lake” talk

Database level:
- Batch-friendly
- See the health of the entire pipeline
- Detect processing anomalies
- Reconciliation testing
- Mutual information analysis
- This talk
Expertise: Data owners and Subject Matter Experts define the ideal shape of the data. This may not fully cover all aspects when the number of datasets outgrows the SME team.

Exploration: Often the only way for larger orgs, where expertise still has to be developed internally. It may lead to incomplete data coverage and missed signals about problems in data pipelines.

Automation: Semi-supervised code generation based on data profiling results. It may overfit alerting with rules that are too strict by default, resulting in more noise than signal.
A few solutions exist in the open-source community, either as libraries or as complete stand-alone platforms, that can be used to assure a certain level of data quality, especially when data is imported continuously.
Success Keys: “1” if check(s) succeeded for a given row; the result is averaged. Streaming-friendly.

Domain Keys: the check compares an incoming batch with the existing dataset - e.g. unique keys (sketched below).

Dataset Metrics: materialised synthetic aggregations - e.g. is this batch |2σ| records different from the previous one? (sketched below)

Reconciliation Tests: repeat the computation in a separate, simplified pipeline and validate the results - e.g. double-entry bookkeeping.
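To make the Domain Keys and Dataset Metrics ideas concrete, here is a minimal PySpark sketch; the table names (incoming_batch, existing, batch_counts) and columns (id, row_count) are hypothetical, not from the talk, and an ambient SparkSession `spark` is assumed, as on Databricks:

from pyspark.sql import functions as F

batch = spark.table('incoming_batch')    # hypothetical staging table
existing = spark.table('existing')       # hypothetical target table

# Domain-key check: keys arriving in this batch must not already exist
# in the target dataset.
duplicate_keys = batch.join(existing, on='id', how='inner').count()
if duplicate_keys > 0:
    raise ValueError(f'{duplicate_keys} keys from the batch already exist')

# Dataset-metric check: flag the batch if its row count deviates by more
# than 2σ from the historical mean kept in a metrics repository.
stats = (spark.table('batch_counts')     # hypothetical metrics table
         .agg(F.avg('row_count').alias('mu'),
              F.stddev('row_count').alias('sigma'))
         .first())
row_count = batch.count()
if abs(row_count - stats['mu']) > 2 * stats['sigma']:
    print(f'Batch of {row_count} records is more than 2σ from the mean')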
If you “build your own everything”, consider embedding Deequ. It has constraint suggestion among advanced enterprise features like data profiling and anomaly detection out of the box, though the documentation is not that extensive, and you may want to fork it internally.
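As a minimal sketch of what embedding pydeequ looks like (the demo table and column "a" are assumptions), define a check and run it with a VerificationSuite; the next section shows how such checks can be generated instead of hand-written:

from pydeequ.checks import Check, CheckLevel
from pydeequ.verification import VerificationSuite, VerificationResult

# A hand-written check: column "a" must be fully populated and unique.
check = (Check(spark, CheckLevel.Warning, 'Manual check')
         .isComplete('a')
         .isUnique('a'))

result = (VerificationSuite(spark)
          .onData(spark.table('demo'))
          .addCheck(check)
          .run())

# One row per constraint, with its status and failure message.
VerificationResult.checkResultsAsDataFrame(spark, result).show(truncate=False)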
Deequ code generation

from pydeequ.suggestions import *

# Profile the table and let Deequ suggest constraints for each column.
suggestionResult = (
    ConstraintSuggestionRunner(spark)
    .onData(spark.table('demo'))
    .addConstraintRule(DEFAULT())
    .run())

# Print runnable pydeequ code for the suggestions, skipping rules
# whose name contains 'Fractional'.
print('from pydeequ.checks import *')
print('check = (Check(spark, CheckLevel.Warning, "Generated check")')
for suggestion in suggestionResult['constraint_suggestions']:
    if 'Fractional' in suggestion['suggesting_rule']:
        continue
    print(f'    {suggestion["code_for_constraint"]}')
print(')')
from pydeequ.checks import *
check = (Check(spark, CheckLevel.Warning, "Generated check")
    .isComplete("b")
    .isNonNegative("b")
    .isComplete("a")
    .isNonNegative("a")
    .isUnique("a")
    .hasCompleteness("c", lambda x: x >= 0.32,
                     "It should be above 0.32!"))
Great Expectations is a less enterprise-y data validation platform written in Python that focuses on supporting Apache Spark among other data sources, like Postgres, Pandas, BigQuery, and so on.
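A minimal sketch of validating a Spark DataFrame with the (older) SparkDFDataset wrapper; the demo table and column "a" are carried over from the examples above, not prescribed by Great Expectations:

from great_expectations.dataset import SparkDFDataset

gdf = SparkDFDataset(spark.table('demo'))

# Each expectation returns a result with success=True/False plus
# observed statistics such as the unexpected-value count.
print(gdf.expect_column_values_to_not_be_null('a'))
print(gdf.expect_column_values_to_be_unique('a'))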
Pandas Profiling
▪ Exploratory Data Analysis simplified by generating an HTML report
▪ Native bi-directional integration with Great Expectations
▪ great_expectations profile DATASOURCE
▪ (pandas_profiling.ProfileReport(pandas_df).to_expectation_suite())
https://pandas-profiling.github.io/pandas-profiling/
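Spelled out in full, the round trip might look like this; sampling via limit() and the demo table are illustrative assumptions:

import pandas_profiling

# Profile a (sampled) Pandas frame, render the EDA report, and turn the
# findings into a Great Expectations expectation suite.
pandas_df = spark.table('demo').limit(10000).toPandas()
report = pandas_profiling.ProfileReport(pandas_df)
report.to_file('demo_profile.html')
suite = report.to_expectation_suite()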
Apache Griffin may be the most enterprise-oriented solution with a user interface available, given that it has been an Apache top-level project backed by eBay since 2016, but it is not as easily embeddable into existing applications, because it requires a standalone deployment along with JSON DSL definitions for rules.
Completeness
SELECT AVG(IF(c IS NOT NULL, 1, 0)) AS isComplete FROM demo
(equivalents shown in Deequ, PySpark, Great Expectations, and plain SQL)
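A PySpark equivalent of the SQL above, as a minimal sketch against the same demo table:

from pyspark.sql import functions as F

# Average of a 0/1 indicator: the fraction of non-null values in column "c".
(spark.table('demo')
 .select(F.avg(F.when(F.col('c').isNotNull(), 1).otherwise(0))
         .alias('isComplete'))
 .show())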
Uniqueness
SELECT (COUNT(DISTINCT c) / COUNT(1)) AS isUnique FROM demo
(equivalents shown in Deequ, Great Expectations, PySpark, and plain SQL)
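A PySpark sketch of the same uniqueness ratio:

from pyspark.sql import functions as F

# Distinct values divided by total rows; 1.0 means column "c" is fully unique.
# Note: COUNT(DISTINCT c) ignores NULLs, and so does countDistinct.
(spark.table('demo')
 .select((F.countDistinct('c') / F.count(F.lit(1))).alias('isUnique'))
 .show())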
Validity
SELECT AVG(IF(a < b, 1, 0)) AS isValid FROM demo
(equivalents shown in Deequ, Great Expectations, PySpark, and plain SQL)
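A PySpark sketch of the validity rule, with columns "a" and "b" as in the SQL:

from pyspark.sql import functions as F

# Fraction of rows satisfying the business rule a < b.
(spark.table('demo')
 .select(F.avg(F.when(F.col('a') < F.col('b'), 1).otherwise(0))
         .alias('isValid'))
 .show())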
Timeliness
SELECT NOW() - MAX(rawEventTime) AS delay FROM processed_events
(diagram: raw events → processed events)
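The same lag in PySpark, a sketch assuming processed_events carries a rawEventTime timestamp column:

from pyspark.sql import functions as F

# Seconds between now and the newest raw event that made it through the pipeline.
(spark.table('processed_events')
 .agg((F.unix_timestamp(F.current_timestamp())
       - F.unix_timestamp(F.max('rawEventTime'))).alias('delay_seconds'))
 .show())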
Honorable Mentions
• https://github.com/FRosner/drunken-data-quality
• https://github.com/databrickslabs/dataframe-rules-engine
Make sure to (re-)watch the “Make reliable ETL easy on Delta Lake” talk
Feedback
Your feedback is important to us.
Don’t forget to rate and review the sessions.