Architectures of AI systems
Engineering for Big Data & AI
HCMC, Sep 6th 2019
Herve Roussel, herve@quod.ai
What is Data Engineering?
Is this data engineering?
UploadData.java
upload_data.py
cat console.log | grep "ERROR" > errors.log
Is this data engineering?
Data engineering?
Event data
⇓
Program
⇓
Transformed data
Backend vs Data?
cat console.log | grep "ERROR" > errors.log
Is this data engineering?
Event data
Transform
Transformed data
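The shell pipeline above is already a tiny transform: it reads event data (log lines), keeps the lines that match a rule, and writes transformed data. A minimal Python sketch of the same idea, assuming a hypothetical `keep_errors` function and in-memory sample lines instead of real files:

```python
from typing import Iterable, Iterator


def keep_errors(lines: Iterable[str]) -> Iterator[str]:
    """Transform: keep only the lines that contain the substring ERROR."""
    for line in lines:
        if "ERROR" in line:
            yield line


# Hypothetical sample of event data (stands in for console.log).
console_log = [
    "2019-09-06 10:00:01 INFO server started",
    "2019-09-06 10:00:05 ERROR cannot reach database",
    "2019-09-06 10:00:09 INFO request served",
]

# Transformed data (stands in for errors.log).
errors_log = list(keep_errors(console_log))
```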
What is Big Data Engineering?
Where is Big Data?
How to query news feed?
SELECT *
FROM posts
INNER JOIN friends
WHERE ...
ORDER BY posts.timestamp DESC
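A runnable version of the naive news feed query could look like the sketch below, using SQLite from the Python standard library. The table columns, the join condition and the `WHERE` clause are assumptions, since the slide elides them; `news_feed` is a hypothetical helper.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE posts (id INTEGER PRIMARY KEY, author_id INTEGER,
                        content TEXT, timestamp INTEGER);
    CREATE TABLE friends (user_id INTEGER, friend_id INTEGER);
    -- Hypothetical sample data: user 10 is friends with users 20 and 30.
    INSERT INTO posts VALUES (1, 20, 'hello', 100),
                             (2, 30, 'world', 200),
                             (3, 40, 'not a friend', 300);
    INSERT INTO friends VALUES (10, 20), (10, 30);
    """
)


def news_feed(viewer_id: int) -> list:
    """Return the posts written by the viewer's friends, newest first."""
    rows = conn.execute(
        """
        SELECT posts.id, posts.content
        FROM posts
        INNER JOIN friends ON friends.friend_id = posts.author_id
        WHERE friends.user_id = ?
        ORDER BY posts.timestamp DESC
        """,
        (viewer_id,),
    )
    return rows.fetchall()


feed = news_feed(10)
```

A single query like this answers none of the per-post questions that follow (visibility, moderation, ranking), which is where the pipeline comes in.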
Notify? Web, mobile?
Who can see this?
Racist? Vulgar?
Is this a face? Who’s this? Friend? Celebrity?
Courtney likes. Is that good or bad?
Paddy commented. Is that good or bad?
Chris posted. Is that good or bad?
Anybody tagged?
What rank in feed?
Copyright violation?
Is Big Data just for big companies?
300K QPS [R]
6K QPS [W]
As of JULY 8, 2013
1B+ QPM [P]
250M+ QPM [R]
400M LOC [P]
1.8 TB per year [P]
Data Engineering
Event data
⇓
Program
⇓
Augmented data
Event data
Transform
Augmented data
Big Data Engineering + AI
Pipeline (Transform)
Source (Event data)
Sink (Augmented data)
What is a source?
Synchronous (10-100 ms)
Where is data coming from?
Main data
Event source
Why split?
Asynchronous (3-5 s)
What’s in an event data?
Post
{
  id: 12345,
  content: "hello world",
  created_at: …
  updated_at: …
  author_id: 67890,
  …
}
PostCreatedEvent
{
  story_id: 12345,
  type: "story_posted",
  …
}
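The split between main data and event data can be sketched as follows. This is a minimal illustration, not the speaker's implementation: `create_post`, the in-memory `main_db` dictionary and the `event_log` list are hypothetical stand-ins for a real database and event store, and the fields the slide elides (timestamps and others) are left out.

```python
from dataclasses import asdict, dataclass


@dataclass
class Post:
    id: int
    content: str
    author_id: int


@dataclass
class PostCreatedEvent:
    story_id: int
    type: str = "story_posted"


def create_post(main_db: dict, event_log: list, post: Post) -> None:
    """Write the record to main data, then append a small event about it."""
    main_db[post.id] = post  # the synchronous path serves the user
    event_log.append(asdict(PostCreatedEvent(story_id=post.id)))  # read later by the pipeline


main_db: dict = {}
event_log: list = []
create_post(main_db, event_log, Post(id=12345, content="hello world", author_id=67890))
```

The event carries only a reference to the post and a type, so downstream jobs can decide for themselves what extra data to load.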
Job 1
Job 2
Scheduler
What’s batch processing?
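Batch processing means that a scheduler periodically hands the events accumulated so far to a set of jobs. A minimal sketch, assuming two hypothetical jobs and a `run_batch` function that stands in for one scheduler tick (a real scheduler would trigger it on a timetable):

```python
from collections import Counter
from typing import Any, Callable, Dict, List

Event = Dict[str, Any]
Job = Callable[[List[Event]], Any]


def count_by_type(batch: List[Event]) -> Dict[str, int]:
    """Job 1 (hypothetical): count events per type."""
    return dict(Counter(event["type"] for event in batch))


def collect_story_ids(batch: List[Event]) -> List[int]:
    """Job 2 (hypothetical): list the distinct stories touched in this batch."""
    return sorted({event["story_id"] for event in batch})


def run_batch(jobs: Dict[str, Job], batch: List[Event]) -> Dict[str, Any]:
    """One scheduler tick: run every job once over the same batch of events."""
    return {name: job(batch) for name, job in jobs.items()}


# Hypothetical events accumulated since the previous run.
batch = [
    {"type": "story_posted", "story_id": 1},
    {"type": "story_posted", "story_id": 2},
    {"type": "story_liked", "story_id": 1},
]
report = run_batch({"job_1": count_by_type, "job_2": collect_story_ids}, batch)
```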
Which DB for event source?
● Volume?
● Velocity? QPS reads? QPS writes?
● Latency?
● Cost? Storage & R/W
● How to write?
○ Integrity?
○ Consistency?
○ Durability?
○ Version?
● How to read?
○ Random access or sequential?
○ Full text search?
○ Geo distance?
How to store events?
            MySQL   MongoDB   JSON on S3 (or GCS)
30 GB       OK      Good      Very good
10K WPS     OK      Good      Very good
1K RPS      OK      Good      Very good
Range read  OK      Good      Very good
Cost        $$      $$$       $
                 MySQL   MongoDB
30 GB            OK      Good
10K WPS          OK      Good
1K RPS           OK      Good
Sequential read  OK      Good
Cost             $$      $$$
How to store events?
Who wants to become architect?
Job 1
Job 2
Scheduler
What’s the problem with batch?
LATENCY
How to process real-time?
Stream processing
How can 2 processes talk?
QUEUE
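A queue decouples the producer of events from the consumer that processes them. The sketch below uses two threads and `queue.Queue` from the Python standard library as a stand-in; between real processes or machines a message broker would play this role. The `STOP` sentinel and the event fields are assumptions made for the example.

```python
import queue
import threading

events: queue.Queue = queue.Queue()
processed = []
STOP = object()  # sentinel telling the consumer that no more events will come


def producer() -> None:
    """Put three hypothetical events on the queue, then signal the end."""
    for story_id in (1, 2, 3):
        events.put({"type": "story_posted", "story_id": story_id})
    events.put(STOP)


def consumer() -> None:
    """Take events off the queue in arrival order until the sentinel shows up."""
    while True:
        event = events.get()
        if event is STOP:
            break
        processed.append(event["story_id"])


threads = [threading.Thread(target=producer), threading.Thread(target=consumer)]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()
```

Neither side calls the other directly: the producer returns as soon as the event is queued, and the consumer works at its own pace.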
Why not use database?
Criterion        Importance  MySQL             Kafka              Redis
10K WPS          1.0         5                 10                 10
1K RPS           1.0         5                 10                 10
Sequential read  1.0         10 (with B-TREE)  10                 10 (using Lists)
Order guarantee  0.2         10                0                  10
Durability       0.1         10                5 (but perf. hit)  0
Deployability    0.5         10                5                  7.5
Score                        5.6 / 10          6.6 / 10           7.15 / 10
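The scores can be reproduced as a weighted sum of the ratings. The slide does not say how the sum is normalized; dividing by 5 matches all three printed scores, so the sketch assumes that. `NORMALIZER` and `score` are names introduced for the example.

```python
# Importance weight of each criterion, as given in the table.
WEIGHTS = {
    "10K WPS": 1.0,
    "1K RPS": 1.0,
    "Sequential read": 1.0,
    "Order guarantee": 0.2,
    "Durability": 0.1,
    "Deployability": 0.5,
}

# Ratings out of 10, as given in the table.
RATINGS = {
    "MySQL": {"10K WPS": 5, "1K RPS": 5, "Sequential read": 10,
              "Order guarantee": 10, "Durability": 10, "Deployability": 10},
    "Kafka": {"10K WPS": 10, "1K RPS": 10, "Sequential read": 10,
              "Order guarantee": 0, "Durability": 5, "Deployability": 5},
    "Redis": {"10K WPS": 10, "1K RPS": 10, "Sequential read": 10,
              "Order guarantee": 10, "Durability": 0, "Deployability": 7.5},
}

NORMALIZER = 5.0  # assumption: inferred from the printed scores, not stated on the slide


def score(ratings: dict) -> float:
    """Weighted sum of the ratings, divided by the assumed normalizer."""
    return sum(WEIGHTS[name] * ratings[name] for name in WEIGHTS) / NORMALIZER


scores = {system: round(score(ratings), 2) for system, ratings in RATINGS.items()}
```

Changing a weight and recomputing shows how sensitive the ranking is to what the team considers important.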
Why not database?
What is a transform?
Transforms
Source
Sink
Functional vs OOP
Librarian
.startShift()
Catalog.open() Library.close()
Books.create()
Operations on things
Add more things
find(book)
assign(book)
Things with operations
Add more operations
remove(book)
load_cover(book)
Functional vs OOP
find_similar(vid_uploaded)
transcribe_captions(vid_uploaded)
Things with operations
Add more operations
alert_subscribers(vid_uploaded)
generate_thumbnails(vid_uploaded)
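In the functional style, each operation is a standalone function over the same event, and adding an operation means adding a function. A minimal sketch using three of the function names from the slide; the bodies, the `video_id` field and the `TRANSFORMS` registry are placeholders invented for the example.

```python
from typing import Callable, Dict, List

Event = Dict[str, str]


def generate_thumbnails(vid_uploaded: Event) -> str:
    return f"thumbnails generated for {vid_uploaded['video_id']}"


def alert_subscribers(vid_uploaded: Event) -> str:
    return f"subscribers alerted for {vid_uploaded['video_id']}"


def transcribe_captions(vid_uploaded: Event) -> str:
    return f"captions transcribed for {vid_uploaded['video_id']}"


# Registry of the operations to run on every "video uploaded" event.
TRANSFORMS: List[Callable[[Event], str]] = [generate_thumbnails, alert_subscribers]


def run_all(vid_uploaded: Event) -> List[str]:
    """Apply every registered operation to the same event."""
    return [transform(vid_uploaded) for transform in TRANSFORMS]


before = run_all({"video_id": "v1"})
TRANSFORMS.append(transcribe_captions)  # a new operation; existing code is untouched
after = run_all({"video_id": "v1"})
```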
What’s supporting data?
Transform
Supporting data
event
{
  id: 12345,
  type: "story_posted",
  user_id: 67890,
  coordinates: [ 10.76, 106.66 ]
}
Friends or city DB
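A transform often needs supporting data that is not in the event itself, for example a city database to turn coordinates into a city name. A rough sketch, assuming a tiny in-memory `CITY_DB` with approximate coordinates and a nearest-neighbour lookup in plain degrees (a real system would use a proper geo index and distance function):

```python
import math
from typing import Any, Dict

# Hypothetical supporting data: city name -> approximate (latitude, longitude).
CITY_DB = {
    "Ho Chi Minh City": (10.82, 106.63),
    "Hanoi": (21.03, 105.85),
}


def nearest_city(lat: float, lon: float) -> str:
    """Return the city whose coordinates are closest to the given point."""
    return min(CITY_DB, key=lambda name: math.dist((lat, lon), CITY_DB[name]))


def add_city(event: Dict[str, Any]) -> Dict[str, Any]:
    """Transform: return a copy of the event augmented with a city name."""
    lat, lon = event["coordinates"]
    return {**event, "city": nearest_city(lat, lon)}


event = {
    "id": 12345,
    "type": "story_posted",
    "user_id": 67890,
    "coordinates": [10.76, 106.66],
}
augmented = add_city(event)
```

The original event is left untouched; the augmented copy is what gets written to the sink.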
Who uses ext. supporting data?
API vs Pipeline: availability?
API: requests in thread
Pipeline: long running
API vs Pipeline: performance?
100ms
⇓
10ms
100 ms × 300,000 = 30,000 s ≈ 8.3 h
⇓
10 ms × 300,000 = 3,000 s = 50 min
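When items are processed one after another, total time is per-item latency times the number of items. A small sketch to check the arithmetic for 300,000 items (the slide does not say what is being counted); `total_seconds` is a hypothetical helper:

```python
def total_seconds(latency_ms: float, items: int) -> float:
    """Total time in seconds to process `items` sequentially."""
    return latency_ms / 1000 * items


slow_hours = total_seconds(100, 300_000) / 3600   # roughly 8.3 hours
fast_minutes = total_seconds(10, 300_000) / 60    # roughly 50 minutes
```

A 90 ms saving per item is barely noticeable in a single API request, but it separates an overnight job from a coffee break in a pipeline.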
Where is the data coming from?
Is this a face? Who’s this? Friend? Celebrity?
Data pipelines & AI
Transform
AI model
How can 2 processes talk?
Transform
AI model
What is a sink?
Which DB to sink to?
What to do with the sink?
Write Read
Data scientist
Sales
What are the read use cases?
Aggregation: give me a summary report of last month’s activity
Full text search: give me posts that contain the words Donald Trump, Trump or President
Bulk data, filtered: give me all posts by female users, age 18-35
ACID
Denormalization: good or bad?
What is BCNF?
What’s distributed data systems?
Why re-run the pipeline?
Transform
AI model
Transform v2
Idempotency & backfill
f(f(x)) = f(x)
POST "/BankAccount/AddFunds"
{ value: 1000, token: TX123 }
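The token in the request is what makes a replay safe: applying the same request twice must leave the same state as applying it once. A minimal sketch, assuming a hypothetical in-memory `BankAccount` that remembers the tokens it has already applied:

```python
class BankAccount:
    """Toy account whose add_funds operation is idempotent per token."""

    def __init__(self) -> None:
        self.balance = 0
        self._seen_tokens: set = set()

    def add_funds(self, value: int, token: str) -> int:
        """Apply the deposit once per token; replays leave the balance unchanged."""
        if token not in self._seen_tokens:
            self._seen_tokens.add(token)
            self.balance += value
        return self.balance


account = BankAccount()
first = account.add_funds(1000, "TX123")
replay = account.add_funds(1000, "TX123")  # retry or backfill of the same request
other = account.add_funds(500, "TX124")    # a different request is applied normally
```

Without the token check, re-running the pipeline over old events would credit the account again on every run.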
Another reason for backfill?
What if the AI model improves?
Transform
AI model v2
AI systems ≠ traditional systems?
93.2%
Probabilistic
Deterministic
Store output of model v1 or v2?
AI Model v1
( accuracy: 83.1% )
AI Model v2
( accuracy: ?? )
What have we learned?
Source: Uber Engineering
[DE] Collect data
[DE] Process data
[DS] Build DL model
[BE/FE] Use DL model in app
[DA] Validate DL model
Which NFR for Big Data?
• Scalability
• Availability
• Interoperability
• Portability
• Modifiability
• Maintainability
• Testability
• Usability
• Buildability
• Deployability
• Ease of Development
• Performance
• Security
• Localization
• Legal
• Reusability
• Supportability
• Monitorability
Main data
+
Materialized view
Event data
⇓
Pipeline
⇓
Augmented data
What have we learned?
Want to learn more about
AI & Big Data?
We’re hiring:
● Big Data Engineer, in training (Java)
● Big Data Engineer (Java)
● Data Scientist (Python)
http://bit.ly/quod-ai-join
Herve Roussel, herve@quod.ai

Grokking TechTalk #33: Architecture of AI-First Systems - Engineering for Big Data & AI