Data Engineering
Boston
© Copyright: Ashish Mrig for Data Engineering Boston. All rights reserved
Design Choices for Cloud
Data Platforms
By Ashish Mrig
About Me
★ 18+ years in the Data Engineering trenches
★ Managing a DE & Analytics team at Wayfair
★ Favorite Buzzword (official version): Repeatable-Scalable-Performant Architecture
★ Favorite Buzzword (broken record version): “We’re not just a SQL shop”
★ Scariest Phrase: “Did we back up this table?”
★ Follow: Medium Blog Twitter: @DataEngBoston
Evolution of Databases - A Brief History
RDBMS
• Developed by IBM researchers in the 1970s as an effective way to store and access data
• Shared-everything architecture
Pros
• Works well at a reasonable volume (< 4 TB) for OLTP/OLAP use cases
• ACID compliant - easy add/update/delete
Cons
• Severe performance & scaling limitations at larger volumes
• 80% of DB time is spent in unproductive work - latching, record-level locking, buffer pools etc.
MPP Appliance (Massively Parallel Processing)
• In MPP, each processor works on a different part of the task
• Mostly sold as a pre-packaged commercial ‘appliance’
Pros
• Supports a massive amount of parallel reads & writes
• An order of magnitude more performant than an RDBMS
Cons
• Scaling out an appliance can be exorbitantly expensive
• Requires on-prem provisioning and DBA support
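To make "each processor works on a different part of the task" concrete, here is a minimal sketch in Python. The data is split into partitions, each worker aggregates only its own partition, and a coordinator combines the partial results. The names `partition`, `partial_sum` and `parallel_sum` are illustrative only, and a thread pool stands in for the separate nodes of a real MPP system.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(rows, n_parts):
    """Split rows into n_parts slices (round-robin distribution)."""
    return [rows[i::n_parts] for i in range(n_parts)]


def partial_sum(rows):
    """Work done by one 'node': aggregate only its own slice."""
    return sum(rows)


def parallel_sum(rows, n_workers=4):
    """Fan the slices out to workers, then combine the partial results."""
    parts = partition(rows, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(partial_sum, parts))
    return sum(partials)


sales = list(range(1, 1001))
total = parallel_sum(sales)
print(total)
```

The same split, work, combine shape applies to scans, joins and aggregations in an MPP engine; only the final combine step needs to see results from every node.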
Distributed File System Based Databases
• Developed by a pair-programming duo at Google in the early 2000s to scale its search engine
• In January 2008, Hadoop, developed largely at Yahoo, became a top-level Apache Software Foundation (ASF) project
Pros
• Near-unlimited scale-out
• Runs on cheap commodity hardware
Cons
• Three-level stack - lots of disk reads + M/R + Hive (or Pig) - so performance is slow
• M/R jobs are written in Java, which adds complexity
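For readers who have not written an M/R job, the three phases can be sketched in a few lines of Python. This is a single-process, in-memory illustration of the programming model only; the function names are made up for the example, and a real Hadoop job would typically be written in Java and would write intermediate results to disk between phases.

```python
from collections import defaultdict


def map_phase(line):
    """Emit (word, 1) pairs for one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group all values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Collapse all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


counts = word_count(["big data big cluster", "data lake"])
print(counts)
```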
Cloud based Distributed MPP Engine
• Multiple vendors offer a full array of choices - from fully managed to self-service
• New paradigm: data warehouse as a service
Pros
• Easy to provision, low upfront cost
• Minimal devops support required: provisioning, scaling, patching, upgrading etc. are managed automatically
• Can scale up or down based on demand; pay for what you use
Cons
• Costs can mount over time
• Tied to a specific vendor's technology - a barrier to exit
Distributed In-memory Query/Data Processing Engines
• Fast as MPP, easy as RDBMS and scalable as Hadoop
• Separation of compute and storage
Pros
• No disk reads and a powerful SQL interface
• Extremely performant
Cons
• Memory is relatively expensive
• Data I/O cost
NoSQL Databases
• Evolved from the need for transactional capability at Big Data scale (impedance mismatch between relational & object-oriented data structures)
• Improved data access performance via some combination of handling larger data volumes, reduced latency, and improved throughput
Type               Example                       Usage
In-memory caching  Aerospike, Redis, Memcached   Session info, shopping cart, user profile, preferences
Document           MongoDB                       Blogging platform, content management platform
Column family      HBase, Cassandra              Log aggregation, real-time analytics
Graph              Neo4j                         Social networks, spatial data, recommendation engine
Quiz Time!
What are the two most popular databases on the cloud?
Most Popular Databases*
* Stack Overflow survey of 73,317 developers
Top Ranked Cloud Data Platforms*
* Stack Overflow survey of 62,061 developers
Database Use Cases
1. OLAP: Data Warehouse / Data Lake - Trends & Insights
2. OLTP: Transactional Systems
3. Everything Else: Key-Value store, Document, Graph/Network, Time Series, Search DBs etc.
One Size Fits All Doesn’t Exist Anymore
Design Considerations for Cloud Data Platforms
❏ Businesses are not willing to take downtime
❏ Need to scale quickly and painlessly as your business grows
❏ Eventually everyone will want data & analysis as quickly as
possible (near real time)
❏ More data is getting collected, processed & analyzed
Design Considerations: Platform Strategy
Cloud Native
• Scalability: Scales near effortlessly
• Typical design strategy: Data in a combination of object storage & cloud-native databases
• Benefits: Low upfront cost; easy to scale
• Challenges: Locked in with one vendor; cost can mount over time
• Tip → Ideal usage: Growth companies, small companies, or companies with small IT
Hybrid
• Scalability: Needs strategy & planning to scale
• Typical design strategy: Raw/semi-cooked data on-prem, customer-facing data in the cloud
• Benefits: Total cost is somewhat predictable/manageable
• Challenges: Migration challenges; high upfront cost
• Tip → Ideal usage: Companies with mature IT
On-Prem Only
• Scalability: Doesn’t scale
• Tip → Migrate to cloud or hybrid
Design Considerations: Tech Stack
Tip → Every component of the data pipeline should run on distributed compute with the ability to scale
● Python (AWS Lambda)
● Orchestration (Airflow)
● APIs (API Gateways or Mesh)
● Container (Kubernetes)
There is always a new shiny tool on the block - Snowflake/Synapse/Druid/BigQuery/Presto...
Tip → Future-proof your system
● Separate storage from compute
● Store granular data on object storage (S3, GCS, Azure Blob Storage)
● Use a specialized tech stack for each use case, e.g. Spark or Glue for ETL, Redshift or BigQuery for analytic queries, Druid for dashboards etc.
● Persist customer-facing or low-latency use case data in the query engines
Tip → To the extent possible, use dynamic compute for ETL & query engines
● All major cloud providers have built-in support
● Third-party tools (e.g. Qubole, Databricks) provide more fine-grained control
Design Considerations: Cost
Use Case: Provisioned Cloud Database
Example: Redshift
Assumption: Compute-intensive workload; medium-size 20-node cluster, always on
Node type: dc2.8xlarge reserved instance, 1-year contract, no upfront; $3.8/hour/node
Total annual cost: 3.8 x 24 x 365 x 20 = $665,760 (roughly ⅔ of a million dollars)
Additional costs: External data processing, S3 storage, devops ~ $1.5 - 2M
Ideal use case:
- Heavy-duty data transformation
- Compute needs are round the clock but predictable
- Have devops support
Tip → Cost optimization:
- Use managed-storage nodes (RA3) if possible
- Size the cluster efficiently
- Offload low-usage data to S3
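The annual compute figure is plain multiplication of node-hours by the hourly rate, and it is worth keeping as a small function so other cluster sizes can be compared quickly. The sketch below uses only the numbers from the slide ($3.8 per node-hour, 20 nodes, always on); the function name is illustrative, and actual Redshift pricing varies by region, node type and contract. With the slide's inputs the result comes to $665,760 per year.

```python
HOURS_PER_YEAR = 24 * 365


def annual_cluster_cost(hourly_rate_per_node, node_count, hours=HOURS_PER_YEAR):
    """Annual cost of an always-on cluster billed per node-hour."""
    return hourly_rate_per_node * node_count * hours


# Inputs taken from the slide.
redshift_annual = annual_cluster_cost(hourly_rate_per_node=3.8, node_count=20)
print(f"${redshift_annual:,.0f} per year")

# Halving the cluster halves the bill, which is why sizing matters.
half_cluster = annual_cluster_cost(hourly_rate_per_node=3.8, node_count=10)
print(f"${half_cluster:,.0f} per year")
```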
Design Considerations: Cost (continued)
Use Case: Serverless Cloud Database
Example: BigQuery
Assumption: Compute-intensive workload
Cost unit: Computational units called slots
Slot types: On-demand / Flat-rate / Flex
Slot pricing: $5 per TB of data scanned (max 2000 slots) / $2000 per month per 100 slots / $4 per hour per 100 slots
Storage cost: $0.02 per GB per month
How many slots your workload needs: It depends (especially when the workload is not always predictable or uniform)
Tip → Cost optimization:
- Monitor usage with on-demand slots and run the numbers
- Don’t let slots sit idle: distribute the workload evenly
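"Run the numbers" can be as simple as comparing the two monthly bills. The sketch below uses the slide's prices ($5 per TB scanned on demand, $2000 per month per 100 flat-rate slots). The function names are illustrative, and rounding slot requests up to whole blocks of 100 is an assumption made for the example rather than a statement of how BigQuery bills; current pricing may also differ from the slide.

```python
ON_DEMAND_PER_TB = 5.0          # USD per TB scanned (from the slide)
FLAT_RATE_PER_100_SLOTS = 2000  # USD per month (from the slide)


def on_demand_cost(tb_scanned):
    """Monthly on-demand bill for a given scan volume."""
    return tb_scanned * ON_DEMAND_PER_TB


def flat_rate_cost(slots):
    """Monthly flat-rate bill; assumes slots are bought in blocks of 100."""
    blocks = -(-slots // 100)  # ceiling division
    return blocks * FLAT_RATE_PER_100_SLOTS


def cheaper_option(tb_scanned_per_month, slots_needed):
    """Return the cheaper pricing model and its monthly cost."""
    od = on_demand_cost(tb_scanned_per_month)
    fr = flat_rate_cost(slots_needed)
    return ("on-demand", od) if od <= fr else ("flat-rate", fr)


print(cheaper_option(tb_scanned_per_month=100, slots_needed=500))
print(cheaper_option(tb_scanned_per_month=3000, slots_needed=500))
```

Under these prices the break-even for 500 slots is 2000 TB scanned per month: below that, on-demand is cheaper; above it, flat-rate wins, provided the slots are kept busy.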
Design Considerations: Cost (conclusion)
1. Cloud provides ease of scale but can also act like a credit card with unlimited spend (and no incentive to optimize)
2. Put checks & balances in place: alerts, notifications, dashboards, monitoring etc. to proactively manage cost
3. Avoid shared tenancy like the plague for production or critical workloads - to the extent possible, control your own destiny by reserving compute
4. Building your own ETL & query framework (for the core dataset) is like spending cash: every incentive to be frugal is baked in, and you have more levers to pull
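One possible shape for the "checks & balances" in point 2 is a daily job that projects month-end spend and raises an alert before the budget is blown. The sketch below is hypothetical: `check_spend`, the linear projection and the 80% warning threshold are assumptions chosen for illustration, and the month-to-date figure would come from whatever billing export the cloud provider offers.

```python
def check_spend(month_to_date, monthly_budget, day_of_month,
                days_in_month=30, warn_ratio=0.8):
    """Project month-end spend linearly and classify it against the budget."""
    projected = month_to_date / day_of_month * days_in_month
    if projected > monthly_budget:
        return "over", projected
    if projected >= warn_ratio * monthly_budget:
        return "warn", projected
    return "ok", projected


# Ten days into the month with a 10,000 USD budget.
for spent in (2000, 2800, 5000):
    status, projected = check_spend(spent, 10000, day_of_month=10)
    print(f"spent={spent} projected={projected:,.0f} status={status}")
```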
Design Considerations: How Data Should be Stored
Tip → Column stores can be an order of magnitude faster than row stores for OLAP
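The reason column stores suit OLAP is that an analytic query usually touches a few columns across many rows. A toy illustration in Python, with made-up order data: in the row layout every full record is visited to sum one field, while in the columnar layout only the needed column is read. This shows the access pattern only; it is not a benchmark, and real engines add compression and vectorized execution on top.

```python
rows = [
    {"order_id": 1, "region": "east", "amount": 120.0},
    {"order_id": 2, "region": "west", "amount": 80.0},
    {"order_id": 3, "region": "east", "amount": 50.0},
]


def to_columns(records):
    """Pivot a list of row dicts into one list per column."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


columns = to_columns(rows)

# Row layout: every record is visited even though only `amount` is needed.
total_from_rows = sum(record["amount"] for record in rows)

# Columnar layout: only the `amount` column is read.
total_from_columns = sum(columns["amount"])

print(total_from_rows, total_from_columns)
```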
Design Considerations: Build Frameworks
● Build ETL with an embedded data quality framework
● Data test rails and continuous integration support
● Potential to deploy within containers
● Multiple data ingestion (batch/event) & delivery (file/table/event/API) modes
● Programmatic ETL: ability to change platform and runtime environment (the same SQL, with some modifications, can run on Redshift or Spark) based on traffic and/or cost
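An "embedded data quality framework" can start very small: each ETL step carries its own checks and refuses to hand data downstream if any of them fail. The sketch below is one hypothetical way to wire that up in Python; `not_null`, `unique`, `run_step` and `DataQualityError` are names invented for this example, not part of any library.

```python
class DataQualityError(Exception):
    """Raised when a step's output fails one or more checks."""


def not_null(column):
    def check(rows):
        return all(row.get(column) is not None for row in rows)
    check.__name__ = f"not_null({column})"
    return check


def unique(column):
    def check(rows):
        values = [row.get(column) for row in rows]
        return len(values) == len(set(values))
    check.__name__ = f"unique({column})"
    return check


def run_step(extract, transform, checks):
    """Run one ETL step; raise instead of returning data that fails a check."""
    rows = [transform(row) for row in extract()]
    failed = [check.__name__ for check in checks if not check(rows)]
    if failed:
        raise DataQualityError(f"failed checks: {failed}")
    return rows


def extract_orders():
    return [{"id": 1, "amount": "10.5"}, {"id": 2, "amount": "3"}]


def cast_amount(row):
    return {**row, "amount": float(row["amount"])}


clean = run_step(extract_orders, cast_amount, [not_null("id"), unique("id")])
print(clean)
```

Because the checks are plain functions, the same ones can run in a continuous integration job against sample data and in production against each batch.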
Recommendations
1. SingleStore if you need both OLAP & OLTP workloads at low latency
2. Presto if you have federated data sources and large data volumes
3. BigQuery for serverless data storage & compute, if you have no devops and a variety of analytics needs
4. Druid for querying extremely high-volume raw data when you need low-latency results, typically in a dashboard (real-time analytics)
5. Redshift for a very high-volume data warehouse/data lake in the cloud if you have some devops support
6. Snowflake if you don’t have a full-stack data engineering team and need a cloud-based data warehouse engine
7. Cassandra, Bigtable or DynamoDB for low-latency transactional use cases
8. Apache Hudi if you need incremental processing and schema evolution is important
9. Store data on object storage (like S3) in a proprietary file storage format and point any of the above services to it - Priceless!!
Thank You For Attending The Meetup!
Q & A
