Invisible Interfaces
Zhenzhong Xu (@zhenzhongxu)
Current 22 - Oct, 2022
Considerations for Abstracting Complexities of
a Real-time ML Platform
The discovery of something invisible
Ancient Greek name for
amber: elektron
Thales of Miletus
The endeavor to make it
useful
Ubiquitous
Easy and responsive
Just works!
the invisible interface
About Zhenzhong Xu
● Building real-time ML platform @Claypot
● Real-time Data Infrastructure @ Netflix
● Cloud infra @ Microsoft
"There's been an explosion of ML use cases that … don't make sense if they aren't in real
time. More and more people are doing ML in production,
and most cases have to be streamed."
Ali Ghodsi, Databricks CEO
Fraud prevention Personalization Customer support Dynamic pricing
Trending products Risk assessment Robotics Ads
ETA Network analysis Sentiment analysis Object detection
…
AIIA survey (2022) - https://ai-infrastructure.org/ai-infrastructure-ecosystem-report-of-2022/
Data Science
Realtime ML Platform
Data Infrastructure
Exploration &
Research
Model Architecture
& Tuning
Model Analysis
& Selection
Ingestion &
Transport
Security &
Governance
Multi-tenancy
Isolation
Data Sources Storage Query & Compute
Business Decision
Optimization
Workflow
Orchestration
Analytics /
Visualization
Model Serving
Model Training
Model
Monitoring
Model
Evaluation
Feature
Materialization
Label
Materialization
Data
Monitoring
Data Model Flow
Data Flow
Data Flow
Data Flow
Data Flow
Product
Ecosystem
Analytics
ecosystem
Combine your data and model loops: why you need both to be fast
Data Loop   Model Loop   Challenge/Value
Slow        Slow         Low freshness, low quality. Out-of-date models; predictions and training use stale data; model drift results in low model accuracy.
Slow        Fast         Low freshness, low quality. Model training is bottlenecked by the availability of fresh data. Prediction latency is high, or predictions use stale data.
Fast        Slow         High freshness, low quality. Fresh data is available for predictions, training, and observability. Slow model iteration results in out-of-date models and lower accuracy.
Fast        Fast         High freshness, high quality. You want your ML ecosystem to be here.
Online Customer Service
Use Case Example
● Suggest diagnostic runbook
● Proactive in-the-moment remediation action
● Fraud prevention vs detection
Define model features
● average transaction amount from past 14
days
● request channel encoding
● text embedding similarity score
Data Scientists
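To make the first feature above concrete, here is a minimal Python sketch of a 14-day average transaction amount. The function name `avg_amount_past_days`, the `(timestamp, amount)` tuple layout, and the sample data are assumptions made for this illustration; the talk does not prescribe an implementation.

```python
from datetime import datetime, timedelta

def avg_amount_past_days(transactions, as_of, days=14):
    """Mean amount of transactions in the window (as_of - days, as_of].

    `transactions` is a list of (timestamp, amount) tuples (assumed layout).
    Returns None when the window is empty.
    """
    start = as_of - timedelta(days=days)
    amounts = [amount for ts, amount in transactions if start < ts <= as_of]
    return sum(amounts) / len(amounts) if amounts else None

# Hypothetical sample data.
txns = [
    (datetime(2022, 9, 1), 100.0),   # older than 14 days, ignored
    (datetime(2022, 9, 25), 40.0),
    (datetime(2022, 10, 3), 60.0),
]
feature = avg_amount_past_days(txns, as_of=datetime(2022, 10, 4))
```

The `as_of` parameter matters: the same definition has to be evaluated "as of now" when serving and "as of the label time" when building training data.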
What’s the appropriate level of complexity the
ML platform should expose?
ML Platforms: What’s preventing Ubiquitous?
DWH
(Snowflake / BigQuery / S3)
Predictions
1
Offline batch prediction
● Use cases: churn prediction, user LTV, risk planning, etc.
2
BI
Batch job to
generate predictions
(e.g. Airflow + Spark)
App
DWH
Prediction
requests
Batch job to
generate features
Prediction
service
3
Online prediction with batch features
● Batch features: computed offline, e.g. product embeddings
● Use cases: recsys
KV store
4
For low-latency online access
Write to
offline
2
Batch
features
Write to
online
1
2
Joined batch
features
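The flow on this slide can be sketched in a few lines of Python, under loud assumptions: plain dictionaries stand in for the DWH and the KV store, and `batch_job`, `publish`, and `predict` are hypothetical names with a toy threshold in place of a real model.

```python
def batch_job(raw_rows):
    """Hypothetical offline job: aggregate (product_id, price) rows into features."""
    acc = {}
    for product_id, price in raw_rows:
        entry = acc.setdefault(product_id, {"n": 0, "total": 0.0})
        entry["n"] += 1
        entry["total"] += price
    return {pid: {"avg_price": e["total"] / e["n"]} for pid, e in acc.items()}

offline_store = {}  # stands in for the DWH (training reads from here)
kv_store = {}       # stands in for the low-latency KV store (serving reads from here)

def publish(features):
    """Write the same batch features to both the offline and the online store."""
    offline_store.update(features)
    kv_store.update(features)

def predict(product_id, default_avg=0.0):
    """Toy prediction service: look up precomputed features, then score."""
    feats = kv_store.get(product_id, {"avg_price": default_avg})
    return 1.0 if feats["avg_price"] > 10 else 0.0

publish(batch_job([("a", 8.0), ("a", 16.0), ("b", 5.0)]))
score = predict("a")
```

Serving only does a key lookup, so request latency stays low, but the features are as stale as the last batch run.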
App
DWH
Prediction
requests
Batch job to
generate features
Prediction
service
3
Online prediction with on-demand features
● On-demand features: queried from transactional stores, e.g. # orders in the last 30 mins
● Use cases: recsys
KV store
4
For low-latency online access
Write to
offline
2
Batch
features
Write to
online
1
2
TX store
(e.g. Postgres,
Cassandra)
Joined
features
4
Transactions
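A rough sketch of the "orders in the last 30 mins" feature queried from a transactional store, using an in-memory SQLite database as a stand-in for Postgres or Cassandra. The `orders` table, its columns, and the function name `orders_last_30_min` are assumptions for illustration only.

```python
import sqlite3

# In-memory SQLite stands in for the transactional store (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (user_id TEXT, created_at INTEGER)")  # epoch seconds
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("u1", 1000), ("u1", 2500), ("u1", 2700), ("u2", 2600)],
)

def orders_last_30_min(conn, user_id, now):
    """Count a user's orders in the window (now - 30 min, now]."""
    row = conn.execute(
        "SELECT COUNT(*) FROM orders "
        "WHERE user_id = ? AND created_at > ? AND created_at <= ?",
        (user_id, now - 30 * 60, now),
    ).fetchone()
    return row[0]

count = orders_last_30_min(conn, "u1", now=2800)
```

The feature is fresh because it is computed per request, but every prediction now adds query load and latency on the transactional store.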
App
DWH
Batch job to
generate features
Prediction
service
3
Online prediction with streaming features
● Online features: computed online,
○ e.g. distance between two locations, count/percentile in the last 30 mins
KV store
Write to
offline
2
Write to
online
2
Real-time
transport
Logs
Stream feature
extraction
Feature
service
5 4
Batch
features
1
4
Prediction
requests
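The "count in the last 30 mins" example can be sketched as an incrementally maintained sliding window. This is a simplified illustration, not a stream processor: the class name `SlidingWindowCount` is hypothetical, timestamps are plain seconds, and events are assumed to arrive in timestamp order.

```python
from collections import deque

class SlidingWindowCount:
    """Count of events in the window (now - window_seconds, now].

    Assumes events are added in non-decreasing timestamp order.
    """

    def __init__(self, window_seconds=30 * 60):
        self.window = window_seconds
        self.events = deque()

    def add(self, ts):
        """Record one event observed at time `ts` (seconds)."""
        self.events.append(ts)

    def value(self, now):
        """Evict expired events, then return the current count."""
        cutoff = now - self.window
        while self.events and self.events[0] <= cutoff:
            self.events.popleft()
        return len(self.events)

counter = SlidingWindowCount()
for ts in [0, 600, 1700, 1900]:
    counter.add(ts)
current = counter.value(now=1900)
```

A real streaming engine adds what this sketch leaves out: out-of-order events, state checkpointing, and partitioning by key.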
Combining offline and online data
Time
DWH
Stream
transaction behavior over
the last 6 months
T-7 days
T-1 day to T-6 months
Backfilling
challenge
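One way to picture the challenge: a six-month feature has to stitch a DWH aggregate together with a stream aggregate without counting the overlapping days twice. The sketch below assumes integer day indices, a daily-count layout for the DWH, and an explicit `cutoff` day at which responsibility passes from DWH to stream; all names are hypothetical.

```python
def combine(dwh_daily, stream_events, cutoff, now, horizon):
    """Event count over (now - horizon, now], split at `cutoff`.

    dwh_daily:     {day: count} from the warehouse, used for days before `cutoff`
    stream_events: list of event days from the stream, used for days >= `cutoff`
    """
    start = now - horizon
    offline = sum(c for day, c in dwh_daily.items() if start < day < cutoff)
    online = sum(1 for day in stream_events if cutoff <= day <= now)
    return offline + online

# Hypothetical data: the stream retains roughly the last 7 days and overlaps the DWH.
dwh_daily = {1: 5, 100: 2, 170: 3}
stream_events = [172, 173, 175, 175, 180]
total = combine(dwh_daily, stream_events, cutoff=173, now=180, horizon=180)
```

Choosing and moving that cutoff consistently, for every feature and every backfill, is the part a platform would need to abstract away.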
Backfill in Lambda Architecture
Data Source
In-motion Compute
At-rest Compute
Online
Storage
Offline
Storage
Online Query
(serving)
Mixed Query
(backfill)
Offline Query
(training)
Backfill in Kappa Architecture
Data Source
In-motion Compute
(Backfill from historical log)
Materialized
Views
Online Query
(serving)
Offline Query
(training)
batch transformation
streaming transformation
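The Kappa idea, one transformation applied both to a replay of the historical log and to live events, can be sketched as follows. The event layout `(user, amount)` and the names `running_total` and `process` are assumptions for this example.

```python
def running_total(state, event):
    """The single transformation, shared by backfill and live processing."""
    user, amount = event
    state[user] = state.get(user, 0.0) + amount
    return state

def process(events, state=None):
    """Fold events into a materialized view (a dict keyed by user)."""
    state = {} if state is None else state
    for event in events:
        running_total(state, event)
    return state

historical_log = [("u1", 10.0), ("u2", 5.0), ("u1", 2.5)]
live_events = [("u2", 1.0)]

view = process(historical_log)      # backfill: replay the retained log
view = process(live_events, view)   # then keep consuming live events
```

Because there is only one code path, backfilled and live results cannot diverge; the cost is that the log must retain enough history to replay.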
Unified Backfill
Data Source
In-motion Compute
(intelligent backfill from dual
sources)
Materialized
Views
Online Query
(serving)
Offline Query
(training)
batch transformation
streaming transformation
DWH backed
logs
Orchestration & Governance
Abstracted Unified Backfill
Data Source
In-motion Compute
(intelligent backfill from dual
sources)
Materialized
Views
Online Query
(serving)
Offline Query
(training)
batch transformation
streaming transformation
DWH backed
logs
Orchestration & Governance
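A minimal sketch of what "backfill from dual sources" could look like: history is read from DWH-backed logs up to the oldest event the stream still retains, then the stream takes over. The handoff rule, the `(timestamp, payload)` layout, and the name `unified_source` are assumptions; the talk does not specify how the platform does this.

```python
def unified_source(dwh_rows, stream_rows):
    """Yield (timestamp, payload) events in timestamp order from two sources.

    DWH rows are used only for timestamps older than the oldest event the
    stream retains, so overlapping events are not emitted twice.
    """
    by_ts = lambda row: row[0]
    handoff = min((ts for ts, _ in stream_rows), default=float("inf"))
    for row in sorted((r for r in dwh_rows if r[0] < handoff), key=by_ts):
        yield row
    for row in sorted(stream_rows, key=by_ts):
        yield row

# Hypothetical data: the event at timestamp 3 exists in both sources.
dwh = [(1, "a"), (2, "b"), (3, "c")]
stream = [(3, "c"), (4, "d")]
events = list(unified_source(dwh, stream))
```

Downstream, a single transformation can consume `events` without knowing which source each record came from.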
Build model features
● Should I declare features in SQL or Python?
● How do I join existing intent classification
results to my new feature?
● What confidence can I get before checking in
my code?
Data Scientists
Does the ML platform speak the same language
as the users?
ML Platforms: What’s preventing easy and responsive?
Questions for ML Platforms:
● Can users express or declare what they need to control in a single
coherent interface?
● Can the platform understand the intent and drive the underlying system?
● Can user and platform communicate interactively, in a timely fashion?
● Can the user understand their options and tradeoffs without reading a
300-page manual?
● How much integration effort is needed to plug a model into existing data
streams?
Online Prediction: Latency vs. Staleness
Latency
Request Prediction
Feature
computation
Prediction
retrieval
Feature
retrieval
Prediction
computation
Raw
data
Staleness
Feature type     Staleness       Latency
RT Feature       No staleness*   Low (10s of ms to 1s of sec)
NRT Feature      > secs          Lower (10s-100s ms)
Batch Feature    > hours         Lower (10s-100s ms)
*Computation takes time, and latency includes the computation time. Feature performance depends on the source technology and shared traffic pattern.
What about tradeoffs?
● Three dimensions!
● Can choose 2!
● Have to be flexible on the 3rd
● Need clean abstractions for full
freedom
Correctness Low cost
Low latency
1. Fast & Correct
2. Cheap & Correct
3. Fast & Cheap
reference: Open Problems in Stream Processing: A Call To Action, Tyler Akidau (2019)
Python vs SQL vs (Scala)?
Python vs SQL? ≈
Intermediate representation (IR)
Compute Engines
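To illustrate why the frontend language matters less once there is an intermediate representation, here is a toy IR for a grouped aggregation that can be rendered as SQL or evaluated directly in Python. The `Agg` dataclass and the `to_sql` and `evaluate` helpers are invented for this sketch and do not correspond to any particular engine.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Agg:
    """Toy IR node: aggregate `column` with `func`, grouped by `group_by`."""
    func: str       # "avg", "sum", or "count"
    column: str
    group_by: str

def to_sql(ir, table):
    """Render the IR as a SQL string for a SQL-speaking engine."""
    return (f"SELECT {ir.group_by}, {ir.func.upper()}({ir.column}) "
            f"FROM {table} GROUP BY {ir.group_by}")

def evaluate(ir, rows):
    """Evaluate the same IR over a list of dict rows in plain Python."""
    groups = {}
    for row in rows:
        groups.setdefault(row[ir.group_by], []).append(row[ir.column])
    reducers = {"avg": lambda v: sum(v) / len(v), "sum": sum, "count": len}
    return {key: reducers[ir.func](values) for key, values in groups.items()}

ir = Agg(func="avg", column="amount", group_by="user_id")
rows = [
    {"user_id": "u1", "amount": 10.0},
    {"user_id": "u1", "amount": 20.0},
    {"user_id": "u2", "amount": 5.0},
]
sql = to_sql(ir, "transactions")
result = evaluate(ir, rows)
```

An arbitrary UDF does not fit into a closed node set like this one, which is one way to read the catch raised on the next slide.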
There is a catch!
UDF…
Don’t invent a new language/DSL!
Evolve existing ones to make it better.
Connector ecosystem is getting more mature.
Nice, but what about event schema and envelope standards?
Deploy model features
● Should I duplicate the feature results in a
different table?
● Which team do I need to inform about the
change?
● Do I need to worry about
training/prediction skew?
Data Scientists
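On the skew question, one common safeguard is a point-in-time lookup: each training row gets the feature value that was known at the label's timestamp, never a later one. The sketch below assumes a sorted `(timestamp, value)` history per entity and uses hypothetical names; it shows the general technique, not the speaker's implementation.

```python
from bisect import bisect_right

def feature_as_of(history, ts):
    """Latest feature value with timestamp <= ts, or None if none exists.

    `history` is a list of (timestamp, value) tuples sorted by timestamp.
    """
    times = [t for t, _ in history]
    i = bisect_right(times, ts)
    return history[i - 1][1] if i else None

# Hypothetical feature history and labelled events for one entity.
history = [(10, 1.0), (20, 2.0), (30, 3.0)]
labels = [(5, 0), (20, 1), (25, 0)]

# Each training row sees only what serving would have seen at that time.
training_rows = [(ts, feature_as_of(history, ts), y) for ts, y in labels]
```

If training instead joined on the latest value (3.0 for every row), the model would learn from information that was not available at prediction time.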
What symptoms are there indicating your
platform is not trusted?
ML Platforms: What doesn’t just work?
● My freedom and your responsibility
● Producer and consumer tension
● Users are forced to choose between basic requirements
Offline / Online consistencies
Sharing and reusing
Schema evolution
SWE Practices
You are part of the
endeavor to make
real-time data useful!
● Ubiquitous
● Easy and responsive
● Just works!
https://zhenzhongxu.com/
zhenzhong@claypot.ai
the invisible interface
