Design Choices for Cloud Data Platforms

© Copyright: Ashish Mrig for Data Engineering Boston. All rights reserved
By Ashish Mrig

About Me
★ 18+ years in the Data Engineering trenches
★ Managing a DE & Analytics team at Wayfair
★ Favorite Buzzword (official version): Repeatable-Scalable-Performant Architecture
★ Favorite Buzzword (broken record version): “We’re not just a SQL shop”
★ Scariest Phrase: “Did we back up this table?”
★ Follow: Medium Blog; Twitter: @DataEngBoston

Evolution of Databases - A Brief History

RDBMS
• Developed by IBM researchers in the 1970s as an effective way to store and access data
• Shared everything architecture
Pros:
• Works well with a reasonable volume (< 4TB) for OLTP/OLAP use cases
• ACID compliant - easy add/update/delete
Cons:
• Severe performance & scaling limitations for larger volumes
• 80% of DB time is spent in unproductive work - latching, record-level locking, buffer pools, etc.

MPP Appliance (Massively Parallel Processing)
• In MPP each processor works on a different part of the task
• Mostly sold as pre-packaged commercial ‘Appliance’
Pros:
• Supports a massive amount of parallel reads & writes
• An order of magnitude more performant than RDBMS
Cons:
• Scaling out an appliance can be exorbitantly expensive
• Requires on-prem provisioning and DBA support
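The idea that each processor works on a different part of the task can be sketched in a few lines of Python. This is a toy illustration only, not how any particular appliance is implemented: the function names and the sample data are hypothetical, and threads stand in for the separate processors of a real MPP system.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(rows, num_workers):
    """Deal rows round-robin into num_workers partitions."""
    return [rows[i::num_workers] for i in range(num_workers)]


def partial_sum(partition):
    """Work done by one worker on its own slice of the data."""
    return sum(partition)


def parallel_sum(rows, num_workers=4):
    """Each worker aggregates its partition; a coordinator merges the partial results."""
    partitions = split_into_partitions(rows, num_workers)
    # Threads are a stand-in for independent processors (an assumption of this sketch).
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partials = list(pool.map(partial_sum, partitions))
    return sum(partials)


order_totals = list(range(1, 101))  # hypothetical order amounts
total = parallel_sum(order_totals)
print(total)
```

The same split, work, merge shape is what lets an MPP engine add workers to shorten a scan, as long as the data can be partitioned evenly.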

Distributed File System Based Databases
• Developed by a pair-programming duo at Google in the early 2000s to scale their search engine
• In January 2008, Yahoo released Hadoop as an open-source project to the ASF (Apache Software Foundation).
Pros:
• Unlimited scale out
• Cheap commodity hardware
Cons:
• Three-level stack - lots of disk reads + M/R + Hive (or Pig) - so performance is slow
• M/R is written in Java - increased complexity

Cloud based Distributed MPP Engine
• Multiple vendors offering a full array of choices - from managed to self-service
• New paradigm: data warehouse as a service
Pros:
• Easy to provision, low upfront cost
• Minimal devops support required; functions like provisioning, scaling, patching, upgrading, etc. are managed automatically
• Ability to scale up or down based on demand; pay for what you use
Cons:
• Costs can mount over time
• Tied to a specific vendor's technology, which creates a barrier to exit

Distributed In-memory Query/Data Processing Engines
• As fast as MPP, as easy as RDBMS, and as scalable as Hadoop
• Separation of compute and storage
Pros:
• No disk reads and a powerful SQL interface
• Extremely performant
Cons:
• Memory is relatively expensive
• Data I/O cost

NoSQL Databases
• Evolved from the need for transactional capability at Big Data scale (impedance mismatch between relational & object-oriented data structures)
• Improved data access performance via some combination of handling larger data volumes, reduced latency, and improved throughput
Type                Example                        Usage
In-memory caching   Aerospike, Redis, Memcached    Session info, shopping cart, user profile, preferences
Document            MongoDB                        Blogging platform, content management platform
Column family       HBase, Cassandra               Log aggregation, real-time analytics
Graph               Neo4j                          Social networks, spatial data, recommendation engine

Quiz Time!
What are the two most popular databases on the cloud?

Most Popular Databases*
* Survey of 73,317 developers by Stack Overflow

Top Ranked Cloud Data Platforms*
* Survey of 62,061 developers by Stack Overflow

Database Use Cases
1. OLAP: Data Warehouse / Data Lake - Trends & Insights
2. OLTP: Transactional Systems
3. Everything Else: Key-Value store, Document, Graph/Network, Time Series, Search DBs, etc.
One Size Fits All Doesn’t Exist Anymore

Design Considerations for Cloud Data Platforms

❏ Businesses are not willing to take downtime
❏ Need to scale quickly and painlessly as your business grows
❏ Eventually everyone will want data & analysis as quickly as possible (near real time)
❏ More data is getting collected, processed & analyzed

Design Considerations: Platform Strategy

Scalability
• Cloud Native: Scales near effortlessly
• Hybrid: Needs strategy & planning to scale
• On-Prem Only: Doesn’t scale
Typical Design Strategy
• Cloud Native: Data in a combination of object storage & cloud-native databases
• Hybrid: Raw/semi-cooked data on-prem, customer-facing data in the cloud
Benefits
• Cloud Native: Low upfront cost; easy to scale
• Hybrid: Total cost is somewhat predictable/manageable
Challenges
• Cloud Native: Locked in with one vendor; cost can mount over time
• Hybrid: Migration challenges
• On-Prem Only: High upfront cost
Tip → Ideal Usage
• Cloud Native: For growth companies, small companies, or companies with small IT
• Hybrid: For companies with mature IT
• On-Prem Only: Migrate to Cloud or Hybrid

Design Considerations: Tech Stack
Tip → Every component of the data pipeline should run on distributed compute with the ability to scale
● Python (AWS Lambda)
● Orchestration (Airflow)
● APIs (API Gateways or Mesh)
● Container (Kubernetes)
There is always a new shiny tool on the block - Snowflake/Synapse/Druid/BigQuery/Presto...
Tip → Future Proof Your System
● Separate storage from compute
● Store granular data on object storage (S3, GCS, Azure Blob Storage)
● Use a specialized tech stack for each use case, e.g. Spark or Glue for ETL, Redshift or BigQuery for analytic queries, Druid for dashboards, etc.
● Persist customer-facing or low-latency use case data in the query engines
Tip → To the extent possible, use dynamic compute for ETL & query engines
● All major cloud providers have built-in support
● Third-party tools (e.g. Qubole, Databricks) provide more fine-grained control
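One way to picture keeping granular data on cloud storage, independent of any single compute engine, is a date-partitioned file layout that different engines can each be pointed at. The sketch below is only an illustration: the bucket name, dataset name, helper functions, and the choice of Parquet files are all assumptions, not something prescribed by the slides.

```python
from datetime import date
from pathlib import PurePosixPath


def partition_key(dataset, event_date, part):
    """Build a Hive-style, date-partitioned object key for one data file."""
    return str(
        PurePosixPath(dataset)
        / f"dt={event_date.isoformat()}"
        / f"part-{part:05d}.parquet"  # Parquet is an assumed file format
    )


def object_uri(bucket, key, scheme="s3"):
    """Prefix the key with a storage scheme and bucket (both hypothetical here)."""
    return f"{scheme}://{bucket}/{key}"


key = partition_key("events/page_views", date(2021, 3, 14), 7)
uri = object_uri("example-data-lake", key)
print(uri)
```

Because the layout lives in storage rather than inside one database, swapping or adding a query engine later does not require moving the data.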

Design Considerations: Cost
Use Case: Provisioned Cloud Database
Example: Redshift
Assumption: Compute-intensive workload; medium-size 20-node cluster, always on
Node type: dc2.8xlarge reserved instance, 1-year contract, no upfront; $3.8/hour/node
Total Annual Cost: 3.8 x 24 x 365 x 20 = $665,760 (roughly ⅔ of a million dollars)
Additional Costs: External data processing, S3 storage, devops ~ $1.5 - 2M
Ideal Use Case:
- Heavy-duty data transformation
- Compute needs are round the clock but predictable
- Have devops support
Tip → Cost Optimization:
- Use managed storage nodes (RA3) if possible
- Size the cluster efficiently
- Offload low-usage data to S3
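The annual compute figure for the always-on cluster can be reproduced from the stated hourly rate and node count. This is a minimal sketch using only the numbers on the slide; the function and constant names are hypothetical, and it covers compute only, not the additional costs.

```python
HOURS_PER_DAY = 24
DAYS_PER_YEAR = 365


def annual_cluster_cost(hourly_rate_per_node, node_count):
    """Annual cost of a cluster that runs every hour of the year."""
    return hourly_rate_per_node * HOURS_PER_DAY * DAYS_PER_YEAR * node_count


# $3.8 per hour per node, 20 nodes, as stated on the slide.
compute_cost = annual_cluster_cost(3.8, 20)
print(f"${compute_cost:,.0f} per year")
```

The product comes to about $665,760 per year, and halving the node count halves the bill, which is why sizing the cluster efficiently appears in the optimization tips.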

Design Considerations: Cost (continued)
Use Case: Serverless Cloud Database
Example: BigQuery
Assumption: Compute-intensive workload
Cost Unit: Computational units called slots
Slot Types: On Demand / Flat Rate / Flex
Slot Pricing:
- On Demand: $5 per TB of data scanned (max 2000 slots)
- Flat Rate: $2000 per month per 100 slots
- Flex: $4 per hour per 100 slots
Storage Cost: $0.02 per GB per month
How many slots does your workload need? It depends (especially when the workload is not always predictable or the same)
Tip → Cost Optimization:
- Monitor the usage with on-demand slots and run the numbers
- Don’t let slots sit idle: distribute the workload evenly
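"Running the numbers" on the three slot pricing options might look like the sketch below. It uses only the prices quoted on the slide; the workload figures (500 slots, 1,200 TB scanned per month, 100 flex hours) are hypothetical, and the sketch assumes the chosen slot count is sufficient for the workload, which in practice has to be measured.

```python
ON_DEMAND_PER_TB = 5.0            # $ per TB of data scanned
FLAT_RATE_PER_100_SLOTS = 2000.0  # $ per month per 100 slots
FLEX_PER_100_SLOTS_HOUR = 4.0     # $ per hour per 100 slots


def on_demand_cost(tb_scanned):
    return tb_scanned * ON_DEMAND_PER_TB


def flat_rate_cost(slots):
    return slots / 100 * FLAT_RATE_PER_100_SLOTS


def flex_cost(slots, hours):
    return slots / 100 * FLEX_PER_100_SLOTS_HOUR * hours


def break_even_tb(slots):
    """Monthly TB scanned at which flat rate costs the same as on demand."""
    return flat_rate_cost(slots) / ON_DEMAND_PER_TB


def cheapest_plan(tb_scanned, slots, flex_hours):
    """Name of the lowest-cost option for one month of the given workload."""
    costs = {
        "on_demand": on_demand_cost(tb_scanned),
        "flat_rate": flat_rate_cost(slots),
        "flex": flex_cost(slots, flex_hours),
    }
    return min(costs, key=costs.get)


# Hypothetical monthly workload.
print(break_even_tb(500))
print(cheapest_plan(tb_scanned=1200, slots=500, flex_hours=100))
```

Under these assumptions a bursty workload that needs slots for only a fraction of the month favors flex, while a steady workload scanning more than the break-even volume favors flat rate.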

Design Considerations: Cost (conclusion)
1. Cloud provides ease of scale but can also act as a credit card with unlimited spend (and no incentive to optimize)
2. Put checks & balances in place: alerts, notifications, dashboards, monitoring, etc. to proactively manage cost
3. Avoid shared tenancy like the plague for production or critical workloads - to the extent possible, control your own destiny by reserving compute
4. Building your own ETL & query framework (for the core dataset) is like spending cash: every incentive to be frugal is baked in, and you will have more levers to pull
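A simple form of the checks and balances in point 2 is a run-rate projection compared against a monthly budget. The sketch below is a hypothetical helper, not a feature of any cloud provider: the thresholds, the linear projection, and the sample spend figures are all assumptions.

```python
def projected_month_end_spend(spend_to_date, day_of_month, days_in_month):
    """Linear run-rate projection of spend to the end of the month."""
    return spend_to_date / day_of_month * days_in_month


def budget_alert(spend_to_date, day_of_month, days_in_month,
                 monthly_budget, warn_ratio=0.8):
    """Return 'critical', 'warning', or 'ok' based on projected spend."""
    projected = projected_month_end_spend(
        spend_to_date, day_of_month, days_in_month
    )
    if projected > monthly_budget:
        return "critical"
    if projected > warn_ratio * monthly_budget:
        return "warning"
    return "ok"


# Hypothetical: $30,000 spent by day 10 of a 30-day month, $80,000 budget.
status = budget_alert(30_000, 10, 30, monthly_budget=80_000)
print(status)
```

Wiring a check like this into a daily job that notifies the team gives an early signal well before the invoice arrives.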

Design Considerations: How Data Should be Stored
Tip → Column stores are an order of magnitude faster than row stores for OLAP
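A rough intuition for why column stores suit OLAP: an aggregate over one field only has to read that field's values, whereas a row store reads every field of every row. The toy sketch below counts values touched rather than measuring time; the sample rows and function names are hypothetical, and real engines gain further from compression and other techniques not modeled here.

```python
rows = [
    {"order_id": 1, "customer": "a", "region": "east", "amount": 120.0},
    {"order_id": 2, "customer": "b", "region": "west", "amount": 80.0},
    {"order_id": 3, "customer": "a", "region": "east", "amount": 200.0},
    {"order_id": 4, "customer": "c", "region": "west", "amount": 50.0},
]


def to_columns(rows):
    """Pivot a list of row dicts into one list of values per column."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def row_store_sum(rows, field):
    """Scan whole rows: every field of each row counts as read."""
    total, values_read = 0.0, 0
    for row in rows:
        values_read += len(row)
        total += row[field]
    return total, values_read


def column_store_sum(columns, field):
    """Scan a single column: only that field's values are read."""
    column = columns[field]
    return sum(column), len(column)


columns = to_columns(rows)
print(row_store_sum(rows, "amount"))
print(column_store_sum(columns, "amount"))
```

Both calls return the same total, but the row-oriented scan touches four times as many values in this four-column example, and the gap widens as tables get wider.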

Design Considerations: Build Frameworks
● Build ETL with an embedded data quality framework
● Data test rails and continuous integration support
● Potential to deploy within containers
● Multiple data ingestion (batch/event) & delivery (file/table/event/API) modes
● Programmatic ETL: ability to change platforms and runtime environments (the same SQL, with some modifications, can be run on Redshift or Spark) based on traffic and/or cost
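An ETL step with data quality checks embedded in it could be sketched as follows. This is a minimal illustration under stated assumptions: the check functions, the `DataQualityError` exception, and the sample records are all hypothetical, and a production framework would add logging, thresholds, and reporting.

```python
class DataQualityError(Exception):
    """Raised when one or more embedded quality checks fail."""


def check_not_null(rows, column):
    failures = sum(1 for row in rows if row.get(column) is None)
    return {"check": f"not_null({column})", "failures": failures}


def check_unique(rows, column):
    values = [row.get(column) for row in rows]
    return {"check": f"unique({column})", "failures": len(values) - len(set(values))}


def run_etl_step(rows, transform, checks):
    """Transform the rows, then run every check before handing the output on."""
    output = [transform(row) for row in rows]
    results = [check(output) for check in checks]
    failed = [result for result in results if result["failures"] > 0]
    if failed:
        raise DataQualityError(failed)
    return output, results


def to_typed_row(row):
    """Hypothetical transform: cast the amount field to float when present."""
    amount = row["amount"]
    return {"id": row["id"], "amount": float(amount) if amount is not None else None}


checks = [
    lambda rows: check_not_null(rows, "amount"),
    lambda rows: check_unique(rows, "id"),
]

clean = [{"id": 1, "amount": "10.5"}, {"id": 2, "amount": "3"}]
output, results = run_etl_step(clean, to_typed_row, checks)
print(output)
```

Because the checks run inside the step, bad data stops the pipeline at the point of failure instead of surfacing later in a dashboard; the same functions can also be exercised in continuous integration.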

Recommendations
1. SingleStore if you need both OLAP & OLTP workloads at low latency
2. Presto if you have federated data sources and large data volumes
3. BigQuery for serverless data storage & compute, if you have no devops and a variety of analytics needs
4. Druid for querying extremely high-volume raw data when you need low-latency results, typically in a dashboard (real-time analytics)
5. Redshift for a very high-volume data warehouse/data lake in the cloud if you have some devops support
6. Snowflake if you don’t have a full-stack data engineering team and need a cloud-based data warehouse engine
7. Cassandra, Bigtable, or DynamoDB for low-latency transactional use cases
8. Apache Hudi if you need incremental processing and schema evolution is important
9. Store data on object storage (like S3) in a proprietary file storage format and point any of the above services to it - Priceless!!

Thank You For Attending The Meetup!
Q & A
