Semantic Validation For Kafka® Data Quality
Diwei Jiang, Xinli Shang
Uber | Kafka London Summit 2024
Speaker Introduction
● Diwei Jiang
○ Senior Software Engineer @ Uber Streaming Data
● Xinli Shang
○ Senior Engineering Manager @ Uber Streaming Data
○ Apache® Parquet PMC chair, Presto® committer
Agenda
● Uber Kafka & Data Lake architecture
● Motivation
● Semantic Validation
● Use cases in both Streaming and Data Lake
● Future work
Uber Streaming & Data Lake Architecture
[Architecture diagram. Labels: Ingestion; Online Storage; Events; Telemetry; Feeds; Kafka; Data Lake; Compute Fabric; Real-Time Analytics; Batch Analytics; Stream Processing; Complex Processing; Interactive ETL; In-memory (Pinot) storage; Data Platform & Tools; Data Workflow (Piper, uWorc); BI Tools (QueryBuilder, Dashbuilder); Metadata Platform (Databook, Quality, Lineage); Security; Global Data Warehouse; 1000 services]
Uber Data Flows
Corrupted Data is a Poison Pill
● Catastrophic impact on the business
● Difficult to detect in a timely manner
● Recovery process is costly
Semantic Validation
What’s Semantic Validation?
Verifies the content of the data being transmitted through Kafka topics.
Example types of Constraints:
● Number Constraint:
○ e.g. payment amount, age
● String Constraint:
○ e.g. product name length, address format
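A minimal sketch of the two constraint types above. The class names `NumberConstraint` and `StringConstraint`, their fields, and the example bounds are hypothetical and only illustrate the idea; they are not taken from the talk.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class NumberConstraint:
    """Hypothetical numeric constraint with inclusive lower and upper bounds."""
    minimum: Optional[float] = None
    maximum: Optional[float] = None

    def is_valid(self, value: float) -> bool:
        if self.minimum is not None and value < self.minimum:
            return False
        if self.maximum is not None and value > self.maximum:
            return False
        return True


@dataclass(frozen=True)
class StringConstraint:
    """Hypothetical string constraint: maximum length and/or full-match regex."""
    max_length: Optional[int] = None
    pattern: Optional[str] = None

    def is_valid(self, value: str) -> bool:
        if self.max_length is not None and len(value) > self.max_length:
            return False
        if self.pattern is not None and re.fullmatch(self.pattern, value) is None:
            return False
        return True


# Illustrative bounds only.
age = NumberConstraint(minimum=0, maximum=150)
product_name = StringConstraint(max_length=64)
```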
Design Goals
● Platform Integration & Reusability
○ Consistent with the existing schema evolution flow.
○ Centralized validation flows.
● User Customization
○ Give users the flexibility to customize validation behavior and configure alerting.
● Timely Detection
○ Validate on the producer side, before data enters Kafka.
Current Enforcement Limitations
● Limitations of current checks:
○ Relying on checks in application code to verify data integrity can be insufficient.
○ Validations in code are often implemented downstream, as reactive fixes after an outage.
● Absence of built-in support in Avro:
○ Avro lacks native mechanisms for expressing semantic constraints within schemas.
○ Custom validation outside Avro leads to inconsistency and complexity in data pipelines.
Architecture
- Teams can easily access their schemas and update constraints.
- Application services rely on the producer client to fetch the schema and validate records.
- The validator emits metrics for records that fail validation, and the monitoring system sends out alerts.
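The producer-side flow can be sketched roughly as follows. `ValidatingProducer`, the `send` callable, and the metric names are hypothetical stand-ins for a real Kafka producer client and metrics system. Fetching the schema is omitted, so constraints are passed in directly as per-field predicates.

```python
from collections import Counter
from typing import Callable, Dict, List

# Hypothetical shape: field name -> predicate returning True when the value is valid.
Constraints = Dict[str, Callable[[object], bool]]


class ValidatingProducer:
    """Illustrative wrapper; `send` stands in for a real producer's send call."""

    def __init__(self, constraints: Constraints, send: Callable[[dict], None]):
        self.constraints = constraints
        self.send = send
        self.metrics = Counter()  # stand-in for a real metrics client

    def violations(self, record: dict) -> List[str]:
        """Return the names of fields whose values break their constraint."""
        return [
            field
            for field, check in self.constraints.items()
            if field in record and not check(record[field])
        ]

    def produce(self, record: dict) -> None:
        for field in self.violations(record):
            # A monitoring system would alert on this counter.
            self.metrics[f"validation_failed.{field}"] += 1
        self.send(record)  # alert-only in this sketch: the record is still produced


sent = []
producer = ValidatingProducer({"amount": lambda v: 0 <= v <= 10_000}, sent.append)
producer.produce({"amount": 25})
producer.produce({"amount": -5})
```

In this sketch the invalid record is counted but still sent, which matches an alert-only mode; dropping it instead would be a separate policy choice.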
UI & Schema Evolution
● Users create constraints on fields
● The frontend validates the constraint format
● Constraint change -> version change
Constraint Examples
● Numeric type
● String type
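As an illustration only, numeric and string constraints could be attached to Avro fields as an extra attribute, since Avro permits attributes it does not define as metadata. The `constraints` attribute name and its keys (`min`, `max`, `maxLength`) are assumptions and are not the format shown in the talk.

```python
import json

# Hypothetical: the "constraints" attribute and its keys are illustrative only.
schema_json = """
{
  "type": "record",
  "name": "Payment",
  "fields": [
    {"name": "amount", "type": "double",
     "constraints": {"min": 0, "max": 10000}},
    {"name": "product_name", "type": "string",
     "constraints": {"maxLength": 64}}
  ]
}
"""

schema = json.loads(schema_json)

# Collect the constraints per field so a validator can look them up by name.
constraints_by_field = {
    field["name"]: field.get("constraints", {}) for field in schema["fields"]
}
```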
Reusing Constraints
● Predefined constraints
○ The address regex is predefined in the schema backend.
● Object-level constraints
○ Future plan: adding custom constraints to a shared object (e.g. BillingEntry) allows centralized validation of the same object across schemas. The object-level validation design is a work in progress.
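A rough sketch of predefined constraints referenced by name. The registry, the constraint names, and the deliberately simplistic regexes are hypothetical placeholders; the actual address regex is not given in the slides.

```python
import re

# Hypothetical registry; names and patterns are placeholders.
PREDEFINED_PATTERNS = {
    "address": re.compile(r"[0-9]+ [A-Za-z0-9 .,'-]+"),
    "country_code": re.compile(r"[A-Z]{2}"),
}


def matches_predefined(name: str, value: str) -> bool:
    """Validate `value` against the predefined pattern registered under `name`."""
    return PREDEFINED_PATTERNS[name].fullmatch(value) is not None


# Two schemas reference the same constraint by name instead of copying the regex.
rider_schema = {"pickup_address": "address"}
billing_schema = {"billing_address": "address", "country": "country_code"}
```

Referencing a constraint by name means a fix to the pattern applies to every schema that uses it.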
Encoding and Validation
● Validation happens during encoding
● Different rules for each data type
● Sampling mechanism
● Per-record encoding P99 latency: ~130 μs with validation, ~100 μs without
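One way to picture sampled validation during encoding, as a sketch under stated assumptions: JSON stands in for Avro encoding, and the function name, the `sample_rate` parameter, and the return shape are all hypothetical.

```python
import json
import random
from typing import Callable, Optional


def encode_with_sampled_validation(
    record: dict,
    validate: Callable[[dict], bool],
    sample_rate: float,
    rng: Optional[random.Random] = None,
) -> tuple:
    """Encode `record`, validating it only with probability `sample_rate`.

    Returns (payload, validated, valid). Records that are not sampled are
    reported as valid because they were never checked.
    """
    rng = rng or random
    validated = rng.random() < sample_rate
    valid = validate(record) if validated else True
    payload = json.dumps(record, sort_keys=True).encode("utf-8")
    return payload, validated, valid
```

A lower `sample_rate` reduces the validation overhead per record, at the cost of letting some invalid records through unchecked.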
Open Questions #1
Should we drop the bad data directly?
Trade-offs of each option:
○ Drop invalid data: prevents bad data from entering, but causes data loss
○ Alert only: non-disruptive, but does not prevent polluted data from flowing in
○ Set up a DLQ for the producer: increased maintenance cost
○ Insert a new header: delegates identifying polluted data to consumers
Decision: discarding invalid data directly is an opt-in configuration; otherwise, the first phase only creates alerts.
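The decision can be sketched as a per-topic policy. The `OnViolation` enum, the `handle` function, and the `alerts` list are hypothetical names used for illustration.

```python
from enum import Enum
from typing import Callable, List


class OnViolation(Enum):
    ALERT_ONLY = "alert_only"  # default in this sketch: produce and alert
    DROP = "drop"              # opt-in: discard the invalid record


def handle(record: dict, is_valid: bool, policy: OnViolation,
           send: Callable[[dict], None], alerts: List[dict]) -> bool:
    """Apply the policy; return True if the record was produced."""
    if is_valid:
        send(record)
        return True
    alerts.append(record)  # stand-in for emitting a metric or alert
    if policy is OnViolation.DROP:
        return False
    send(record)
    return True
```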
Open Questions #2
Backward compatibility for constraint updates:
Day 0: a user sets the constraint to the range 0-100.
Day 1: the user updates the constraint to 0-90.
Data with a value of 95 is no longer considered valid. Do we allow this change when the user updates the schema?
- If a topic has multiple producers, one using the latest schema may start to trigger more violation errors, causing inconsistency.
- We decided to allow this in the first phase, but warn users when they update the schema.
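A sketch of the warning for the range example above, assuming constraints are inclusive `(low, high)` pairs. The function name and message text are hypothetical.

```python
from typing import Optional, Tuple

Range = Tuple[float, float]  # inclusive (low, high)


def narrowing_warning(old: Range, new: Range) -> Optional[str]:
    """Return a warning if `new` rejects values that `old` accepted, else None."""
    old_low, old_high = old
    new_low, new_high = new
    if new_low > old_low or new_high < old_high:
        return (
            f"Range narrowed from [{old_low}, {old_high}] to "
            f"[{new_low}, {new_high}]; previously valid values may now fail."
        )
    return None
```

Calling `narrowing_warning((0, 100), (0, 90))` returns a warning string, while widening the range to `(0, 120)` returns `None`.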
Semantic Validation for both Online and Offline
- Offline paths can extend the validator logic on consume
- This allows each consumer pipeline to configure different behavior
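On the consume side, a pipeline might reuse the same validator and decide for itself what to do with invalid records. This is a sketch only; `partition_records` is a hypothetical helper.

```python
from typing import Callable, Iterable, List, Tuple


def partition_records(
    records: Iterable[dict], is_valid: Callable[[dict], bool]
) -> Tuple[List[dict], List[dict]]:
    """Split consumed records into (valid, invalid) using the shared validator."""
    valid, invalid = [], []
    for record in records:
        (valid if is_valid(record) else invalid).append(record)
    return valid, invalid
```

One pipeline could load only the valid records, while another writes the invalid ones to a quarantine table for inspection.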
Limitations
Sampling cannot guarantee thorough validation.
● Backpressure based on real-time capacity, to maximize the sample while keeping latency low
● Progressive validation when error patterns emerge
● An auditing service that consumes the topic and performs comprehensive validation
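Progressive validation could, for example, raise the sample rate while the observed error rate is high and lower it again once errors subside. The function below is a sketch; the doubling and halving rule and every threshold in it are assumptions, not values from the talk.

```python
def next_sample_rate(current: float, error_rate: float,
                     threshold: float = 0.01,
                     floor: float = 0.05, ceiling: float = 1.0) -> float:
    """Double the sample rate while errors exceed `threshold`; otherwise halve it.

    The result is clamped to the range [floor, ceiling].
    """
    proposed = current * 2 if error_rate > threshold else current / 2
    return max(floor, min(ceiling, proposed))
```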
Future Work
● Productionize it
● Upstream to OSS
● Dynamic sampling
● Comprehensive auditing
● Reusable constraints, cross-field constraints
Q & A
Send questions to: shangxinli@apache.org
