Semantic Validation: Enforcing Kafka Data Quality Through Schema-Driven Verification

Semantic Validation For Kafka® Data Quality
Diwei Jiang, Xinli Shang

Uber | Kafka London Summit 2024
Speaker Introduction
● Diwei Jiang
○ Senior Software Engineer @ Uber Streaming Data
● Xinli Shang
○ Senior Engineering Manager @ Uber Streaming Data
○ Apache® Parquet PMC chair, Presto® committer

Agenda
● Uber Kafka & Data Lake architecture
● Motivation
● Semantic Validation
● Use cases in both Streaming and Data Lake
● Future work

Uber Streaming & Data Lake Architecture
[Architecture diagram; labels: 1000 services; Ingestion; Online Storage; Events; Telemetry; Feeds; Kafka; Data Lake; Compute Fabric; Real-Time Analytics; Batch Analytics; Stream Processing; Complex Processing; Data Platform & Tools; Data Workflow (Piper, uWorc); BI Tools (QueryBuilder, Dashbuilder); Metadata Platform (Databook, Quality, Lineage); Interactive ETL; In-memory (Pinot) storage; Security; Global Data Warehouse]

Corrupted Data is a Poison Pill
● Catastrophic impact on the business
● Difficult to detect in a timely manner
● Recovery process is costly

Semantic Validation
What’s Semantic Validation?
Verifies the content of the data being transmitted through Kafka topics.
Example types of Constraints:
● Number Constraint:
○ e.g. payment amount, age
● String Constraint:
○ e.g. product name length, address format
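
The two constraint types above can be sketched as small value checks. This is only an illustration: the class names, parameters, and the address pattern are hypothetical and are not taken from the talk.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class NumberConstraint:
    """Hypothetical numeric constraint with inclusive bounds."""
    minimum: Optional[float] = None
    maximum: Optional[float] = None

    def is_valid(self, value: float) -> bool:
        if self.minimum is not None and value < self.minimum:
            return False
        if self.maximum is not None and value > self.maximum:
            return False
        return True


@dataclass(frozen=True)
class StringConstraint:
    """Hypothetical string constraint: max length and/or a full-match regex."""
    max_length: Optional[int] = None
    pattern: Optional[str] = None

    def is_valid(self, value: str) -> bool:
        if self.max_length is not None and len(value) > self.max_length:
            return False
        if self.pattern is not None and re.fullmatch(self.pattern, value) is None:
            return False
        return True


# Illustrative constraints for the examples named on the slide.
age = NumberConstraint(minimum=0, maximum=150)
product_name = StringConstraint(max_length=64)
# Assumed, simplified address format: street number, space, street name.
address = StringConstraint(pattern=r"\d+ [A-Za-z ]+")
```

Keeping each constraint as plain data (bounds, length, pattern) rather than arbitrary code is what would let it be stored alongside a schema and evaluated by a shared validator.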

Design Goals
● Platform Integration & reusability
○ Consistent with the existing schema evolution flow.
○ Centralize validation flows.
● User Customizations
○ Provide users with the flexibility to customize validation behavior and configure alerting.
● Timely Detection
○ Validate on the producer side before data enters Kafka.

Current Enforcement Limitations
● Limitations of current checks:
○ Relying on checks in application code to verify data integrity can be insufficient.
○ Validations in code are often implemented downstream, as reactive fixes after an outage.
● Absence of built-in support in Avro:
○ Avro lacks native mechanisms for expressing semantic constraints within schemas.
○ Custom validation outside Avro leads to inconsistency and complexity in data pipelines.

Architecture
- Teams can easily access their schemas and update constraints.
- Application services depend on the producer client to fetch the schema and validate data.
- The validator emits metrics for failed data, and the monitoring system sends out alerts.

UI & Schema Evolution
● Users create constraints on fields
● The frontend validates the constraint format
● Constraint change -> version change

Constraint Examples
● Numeric type
● String type
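
The slide's actual examples did not survive in this transcript. As a rough sketch of what a numeric and a string constraint could look like, the snippet below attaches them to fields as extra schema attributes; the Avro specification permits attributes it does not define as metadata. The attribute name `constraints` and its keys are assumptions for illustration, not the format used in the talk.

```python
import json

# Avro-style record schema. The "constraints" attribute is hypothetical metadata.
SCHEMA_JSON = """
{
  "type": "record",
  "name": "Payment",
  "fields": [
    {"name": "amount", "type": "double",
     "constraints": {"min": 0, "max": 10000}},
    {"name": "product_name", "type": "string",
     "constraints": {"max_length": 64}},
    {"name": "note", "type": "string"}
  ]
}
"""


def field_constraints(schema_json: str) -> dict:
    """Return {field name: constraints} for fields that declare constraints."""
    schema = json.loads(schema_json)
    return {
        field["name"]: field["constraints"]
        for field in schema["fields"]
        if "constraints" in field
    }


constraints = field_constraints(SCHEMA_JSON)
```

Fields without the attribute, such as `note` here, are simply left unvalidated.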

Reusing Constraints
● Predefined constraints
○ The address regex is predefined in the schema backend.
● Object level constraints
○ Future plan: adding custom constraints for a shared object (e.g. BillingEntry) allows centralized validation of the same object across schemas. The object level validation design is a work in progress.

Encoding and Validation
● Validating during encoding
● Different rules for each data type
● Sampling mechanism
● Per-record encoding P99 latency is ~130 μs with validation and ~100 μs without
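
A minimal sketch of validating a sampled fraction of records inside the encode step is shown below. It assumes a fixed sample rate and uses JSON as a stand-in for Avro encoding; `SamplingEncoder`, its fields, and the violation counter are hypothetical names, not the producer client's API.

```python
import json
import random
from typing import Callable, Dict, Optional

Validator = Callable[[object], bool]


class SamplingEncoder:
    """Encodes every record; validates only a sampled fraction of them."""

    def __init__(self, validators: Dict[str, Validator], sample_rate: float,
                 rng: Optional[random.Random] = None):
        self.validators = validators      # field name -> check function
        self.sample_rate = sample_rate    # fraction of records to validate
        self.rng = rng or random.Random()
        self.violations = 0               # stand-in for an emitted metric

    def encode(self, record: dict) -> bytes:
        # random() is in [0.0, 1.0), so rate 1.0 validates all, 0.0 none.
        if self.rng.random() < self.sample_rate:
            for field, check in self.validators.items():
                if field in record and not check(record[field]):
                    self.violations += 1
        # JSON stands in for the real Avro encoding in this sketch.
        return json.dumps(record).encode("utf-8")


rules = {"amount": lambda v: 0 <= v <= 10000}
encoder = SamplingEncoder(rules, sample_rate=1.0)
payload = encoder.encode({"amount": -5})
```

In this sketch an invalid record is still encoded and only counted, which corresponds to an alert-only behavior.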

Open Questions #1
Should we drop the bad data directly?
Here are the trade-offs of each option:
○ Drop invalid data: prevents bad data from entering, but causes data loss
○ Alert only: non-disruptive, but won't prevent polluted data from flowing in
○ Set up a DLQ for the producer: increased maintenance cost
○ Insert a new header: delegates identifying polluted data to consumers
Decision: discarding invalid data directly is an opt-in configuration; otherwise, we only create alerts in the first phase.
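
The decision can be illustrated with a small policy switch: alert-only by default, with dropping as an opt-in. The enum, function name, and alert list below are hypothetical; the talk does not show the real configuration.

```python
from enum import Enum
from typing import Callable, List, Optional


class OnInvalid(Enum):
    ALERT_ONLY = "alert_only"  # default: record is still produced
    DROP = "drop"              # opt-in: record is discarded


def apply_policy(record: dict,
                 is_valid: Callable[[dict], bool],
                 policy: OnInvalid,
                 alerts: List[str]) -> Optional[dict]:
    """Return the record to produce, or None if it should be dropped."""
    if is_valid(record):
        return record
    # Every violation raises an alert, whatever the policy.
    alerts.append(f"constraint violation: {record}")
    if policy is OnInvalid.DROP:
        return None
    return record


alerts: List[str] = []
bad = {"amount": -5}
non_negative = lambda r: r["amount"] >= 0
kept = apply_policy(bad, non_negative, OnInvalid.ALERT_ONLY, alerts)
dropped = apply_policy(bad, non_negative, OnInvalid.DROP, alerts)
```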

Open Questions #2
Backward compatibility for constraint updates:
Day 0: the user sets the constraint to a range (0-100)
Day 1: the user updates the constraint to (0-90)
Now data with a value of 95 is no longer considered valid. Do we allow this change when the user updates the schema?
- If a topic has multiple producers, one of them with the latest schema may start to trigger more violation errors, causing inconsistency.
- We decided to allow this for the first phase, but warn users when they update the schema.
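
The "allow but warn" behavior can be sketched as a check that detects when a new range is narrower than the old one. The function names and warning text are made up for this example, and ranges are assumed to be inclusive (low, high) pairs.

```python
from typing import List, Tuple

Range = Tuple[float, float]  # inclusive (low, high)


def is_narrowing(old: Range, new: Range) -> bool:
    """True if some value valid under `old` is invalid under `new`."""
    return new[0] > old[0] or new[1] < old[1]


def review_update(field: str, old: Range, new: Range) -> List[str]:
    """Allow the update, but return warnings to show the user."""
    if is_narrowing(old, new):
        return [
            f"Constraint on '{field}' narrowed from {old} to {new}: "
            "values that were valid before may now trigger violations."
        ]
    return []


warnings_day1 = review_update("score", (0, 100), (0, 90))
warnings_widen = review_update("score", (0, 100), (0, 120))
```

With the Day 0 and Day 1 ranges from the slide, the update is flagged because 95 was valid under (0, 100) and is not under (0, 90).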

Semantic Validation for both Online and Offline
- Offline paths can extend the validator logic upon consumption
- This allows each consumer pipeline the flexibility to configure different behavior

Limitations
Sampling cannot guarantee thorough validation.
● Backpressure based on real-time capacity, to maximize the sample while keeping latency low
● Progressive validation when error pattern trends emerge.
● Auditing service to consume topic and perform comprehensive validation
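
One possible reading of progressive validation is a sample rate that rises when violations are seen and decays back when data looks clean. The sketch below is an assumption about how that could work, not the design from the talk; the class name and the step factors are hypothetical.

```python
class ProgressiveSampler:
    """Raises the sample rate on violations and decays it on clean records."""

    def __init__(self, base_rate: float = 0.01, max_rate: float = 1.0,
                 step_up: float = 2.0, step_down: float = 0.9):
        self.base_rate = base_rate
        self.max_rate = max_rate
        self.step_up = step_up        # multiplier applied after a violation
        self.step_down = step_down    # multiplier applied after a clean record
        self.rate = base_rate

    def observe(self, valid: bool) -> float:
        """Update and return the sample rate after one validated record."""
        if valid:
            self.rate = max(self.base_rate, self.rate * self.step_down)
        else:
            self.rate = min(self.max_rate, self.rate * self.step_up)
        return self.rate


sampler = ProgressiveSampler(base_rate=0.1)
after_two_violations = [sampler.observe(False), sampler.observe(False)]
```

The rate stays between `base_rate` and `max_rate`, so a burst of bad records pushes validation toward full coverage while normal traffic keeps the overhead low.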

Future Work
● Productionize it
● Upstream to OSS
● Dynamic sampling
● Comprehensive auditing
● Reusable constraints, cross field constraints

Q & A
Send questions to: shangxinli@apache.org
