Democratizing Data Quality at Zillow through a Centralized Platform
Smit Shah
Yuliana Havryshchuk
Who We Are
Data Governance Platform Team @ Zillow
● Smit Shah, Senior Software Development Engineer, Big Data
● Yuliana Havryshchuk, Software Development Engineer, Big Data
Agenda
● What is Zillow?
● Data Quality Challenges
● Centralized Data Quality Platform
○ Architecture
○ Self-Service
○ Pipeline integration
● Key Takeaways
Zillow
About Zillow
● Reimagining real estate to make it
easier to unlock life’s next chapter
● Offer customers an on-demand
experience for selling, buying,
renting and financing with
transparency and nearly seamless
end-to-end service
● Most-visited real estate website in
the United States
* As of Q4-2020
Data Quality Challenges
Why Monitor Data Quality?
● Data fuels many customer-facing
and internal services at Zillow that
rely on high-quality data
○ Zestimate
○ Zillow Offers
○ Zillow Premier Agent
○ Econ and many more
● Reliable performance of ML and
services requires a certain level of
data quality
Challenges We Faced
● No standard way to monitor quality
● Lack of visibility into data health
● No known lineage between data and processes
Centralized Data Quality Platform
5 Pillars for Data Quality Platform
● Standardize Data Quality Rules
● Increase Visibility of Data Health
● Enable Safe Evolution of Rules
● Support Built-in Alerting
● Integrate with Data Lineage
Platform Architecture
* As of May 2021
Self-Service Capabilities
Self-Service Onboarding - Goals
● Must be scalable
● Must be accessible to all user archetypes
● Must require minimal configuration
Self-Service Onboarding - Data Discovery
* These values are simulated
Self-Service Onboarding - Example
* These values are simulated
id  name               type       page_views  data_date
1   123 Green St       house      709         2021-05-01
2   47 Walker Rd       townhouse  132         2021-05-01
    1225 City St #901  condo      800         2021-05-01
4   47 Walker Ave      test       600         2021-05-01
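A minimal sketch of how rule-based checks over the example rows might look. The two rules (a non-null `id` and an allowed set of `type` values), the `Rule` class and the `evaluate` helper are assumptions made for illustration. The slides do not show the actual rules or the platform's API.

```python
from dataclasses import dataclass
from typing import Callable

# Simulated rows from the example slide; the third row has no id in the source.
ROWS = [
    {"id": 1, "name": "123 Green St", "type": "house", "page_views": 709, "data_date": "2021-05-01"},
    {"id": 2, "name": "47 Walker Rd", "type": "townhouse", "page_views": 132, "data_date": "2021-05-01"},
    {"id": None, "name": "1225 City St #901", "type": "condo", "page_views": 800, "data_date": "2021-05-01"},
    {"id": 4, "name": "47 Walker Ave", "type": "test", "page_views": 600, "data_date": "2021-05-01"},
]


@dataclass(frozen=True)
class Rule:
    """A hypothetical column-level rule: a name, a column and a predicate."""
    name: str
    column: str
    check: Callable[[object], bool]


# Assumed rules, chosen only to illustrate the idea.
ALLOWED_TYPES = {"house", "townhouse", "condo"}
RULES = [
    Rule("id_not_null", "id", lambda value: value is not None),
    Rule("type_in_allowed_set", "type", lambda value: value in ALLOWED_TYPES),
]


def evaluate(rows, rules):
    """Return a mapping of rule name to the indexes of rows that violate it."""
    return {
        rule.name: [i for i, row in enumerate(rows) if not rule.check(row[rule.column])]
        for rule in rules
    }


violations = evaluate(ROWS, RULES)
print(violations)  # {'id_not_null': [2], 'type_in_allowed_set': [3]}
```

Under these assumed rules, the row without an id and the row with type "test" are the ones flagged.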
Self-Service Onboarding - Rule-based
* These values are simulated
Self-Service Monitoring - Rule-based
* These values are simulated
Self-Service Onboarding - Example
* These values are simulated
id  name          type   page_views  data_date
1   123 Green St  house  709         2021-05-01
1   123 Green St  house  820         2021-05-02
1   123 Green St  house  12          2021-05-03
1   123 Green St  house  760         2021-05-04
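To illustrate why a metric monitor is useful for a series like this one, here is a small sketch that flags the drop to 12 page views with a robust z-score (median and median absolute deviation). This is a stand-in technique chosen for brevity, not the algorithm used by the platform or by Luminaire, and the threshold of 3.5 is an assumed, commonly used default.

```python
from statistics import median

# Simulated daily page_views for listing id 1, taken from the example rows.
SERIES = [
    ("2021-05-01", 709),
    ("2021-05-02", 820),
    ("2021-05-03", 12),
    ("2021-05-04", 760),
]

THRESHOLD = 3.5  # assumed cut-off for the robust z-score


def robust_z_scores(values):
    """Score each value by its distance from the median, scaled by the MAD."""
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        # All deviations collapse to zero; this simple sketch reports no anomalies.
        return [0.0] * len(values)
    return [0.6745 * (v - med) / mad for v in values]


scores = robust_z_scores([value for _, value in SERIES])
anomalies = [(day, value) for (day, value), z in zip(SERIES, scores) if abs(z) > THRESHOLD]
print(anomalies)  # [('2021-05-03', 12)]
```

Four points are far too few for real anomaly detection; the sketch only shows the shape of the check.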
Self-Service Onboarding - Metrics
* These values are simulated
Overview Metric
* These values are simulated
Self-Service Onboarding - Monitoring
Behind the Scenes
● Rule-based monitors turn into contracts
● Metrics monitors turn into ML-based anomaly detection
● Register data quality requirements in config stores
● Dynamically generate validation pipelines
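One way to picture the last two points is a config entry per dataset that is expanded into validation steps at run time. The sketch below is a guess at the general shape only: the dataset name, the config layout and the step names are all hypothetical and do not come from the talk.

```python
# Hypothetical config store: data quality requirements registered per dataset.
CONFIG_STORE = {
    "listing_page_views": {
        "contracts": [
            {"column": "id", "kind": "not_null"},
            {"column": "type", "kind": "allowed_values",
             "values": ["house", "townhouse", "condo"]},
        ],
        "metrics": [
            {"column": "page_views", "aggregate": "sum"},
        ],
    }
}


def build_validation_steps(dataset, config_store):
    """Expand a dataset's registered requirements into an ordered list of steps."""
    spec = config_store[dataset]
    steps = []
    # Rule-based monitors become contract evaluation steps.
    for contract in spec.get("contracts", []):
        steps.append({"step": "evaluate_contract", "dataset": dataset, **contract})
    # Metric monitors become a compute step followed by an anomaly detection step.
    for metric in spec.get("metrics", []):
        label = f"{metric['aggregate']}({metric['column']})"
        steps.append({"step": "compute_metric", "dataset": dataset, **metric})
        steps.append({"step": "detect_anomaly", "dataset": dataset, "metric": label})
    return steps


steps = build_validation_steps("listing_page_views", CONFIG_STORE)
for step in steps:
    print(step["step"], step.get("column", step.get("metric")))
```

Keeping the requirements as data rather than code means a new check is a config change, and the pipeline definition can be regenerated without hand edits.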
Validation Libraries
Built in-house:
● Luminaire Contract Evaluation Library (Scala) for rule-based constraints
● Luminaire Anomaly Detection Library (Python) for time-series metrics
○ https://github.com/zillow/luminaire
Pipeline Integration
Pipeline Integration (before)
Producers
Consumers
Pipeline Integration (after)
Producers
Consumers
Validation Results
● Alert data users if any checks fail
● Integrate with pipeline execution to prevent propagation
● Provide visibility through data discovery tool
● Provide common understanding between producers and consumers
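The first two points can be sketched as a gate that sits between validation and publication: failed checks trigger an alert and stop the run before downstream consumers see the data. The function names, the `DataQualityError` exception and the callback style are assumptions for illustration, not the platform's actual interface.

```python
class DataQualityError(Exception):
    """Raised when a dataset fails validation and must not be propagated."""


def gate(dataset, results, alert):
    """Alert on failed checks and raise so the pipeline stops before publishing."""
    failed = [name for name, passed in results.items() if not passed]
    if failed:
        message = f"{dataset}: failed checks: {', '.join(failed)}"
        alert(message)
        raise DataQualityError(message)


def run_pipeline(dataset, results, publish, alert):
    """Publish the dataset only if every validation check passed."""
    gate(dataset, results, alert)
    publish(dataset)


# Usage with in-memory stand-ins for the alerting and publishing systems.
alerts, published = [], []

try:
    run_pipeline(
        "listing_page_views",
        {"id_not_null": False, "type_in_allowed_set": True},
        published.append,
        alerts.append,
    )
except DataQualityError as err:
    print("blocked:", err)

run_pipeline(
    "listing_page_views",
    {"id_not_null": True, "type_in_allowed_set": True},
    published.append,
    alerts.append,
)
print(alerts)
print(published)
```

Raising an exception makes the failure visible to whatever orchestrates the pipeline, which is one simple way to prevent bad data from reaching consumers.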
Future Direction
● Tighter integration between components
● Expand libraries to support more use-cases
● Move from detection to diagnosis
● Validation for streaming data
Key Takeaways
● 5 pillars that helped us build a robust platform: standardization,
visibility, evolution, alerting, lineage
● Alerting on data quality issues early allows proactive response
● Producing quality data increases trust in data and improves decisions
made
● Data quality is a shared responsibility, and collaboration is needed to
be successful
Questions?
Thank you!
https://www.zillow.com/careers/
