Xuan Wang, Databricks
Cloud Cost Management
and Apache Spark
#DSSAIS13
Introduction
● Goal of this talk
○ share our experience in managing cloud costs
○ tools and technologies
○ lessons learnt and good practices
○ go wide rather than go deep
● Why do we care about cloud cost?
○ growth in cloud revenue in Q1 2018: Amazon: 49%, Microsoft: 58%
Databricks’ Unified Analytics Platform
● Collaborative notebooks (data engineers, data scientists)
● Databricks Runtime: Delta, SQL, Streaming; powered by: XGBoost
● Cloud native service
● Unifies data engineers and data scientists
● Unifies data and AI technologies
● Eliminates infrastructure complexity
Three paths toward cost control
● Native reporting from cloud providers
○ Good general information and support
○ Limited options, does not scale as the environment grows
● Commercial tools
○ More detail and flexibility, connectors to raw data
○ Not enough customization, additional charges
● In-house solutions
○ Most flexible, deeper understanding of the costs
○ Opportunity costs
Challenges in cloud cost control
● overwhelming and complex usage details
○ need to convert data into insights/actions
● gaps between “hands” and “wallets”
○ developers consume resources without realizing the charges
● evolving cloud landscape
○ external: new services, new discounts, ...
○ internal: new use cases, new architecture, ...
Our solutions
● Raw data: cost and usage, S3 access logs, S3 inventory, EC2/RDS snapshots, reserved instances, ...
● Data lake: Databricks Delta
● Analytics: Databricks Notebooks, monitors and alerts, BI tools (Superset, Tableau, ...)
● The data problem: ETL and attribute costs
● The process problem: prioritize, optimize, monitor, automate
The data problem
● cost and usage report (detailed billing)
○ CSV, grouped by month, updated daily
● EC2/RDS snapshots and reserved instances
○ JSON, from REST API
● S3 inventory
○ CSV/ORC, snapshot, updated daily/weekly
● S3 access logs
○ raw logs in text, updated multiple times a day
Data pipelines with Spark
Challenges
● Data corruption
● Multiple jobs/staging tables
● Reliability and consistency
Pipeline: Raw Data → ETL → Data Lake → Analytics → Insight
Databricks Delta: Analytics Ready Data
● Input: lots of new data (customer data, click streams, sensor data (IoT), video/speech, …)
● Data lake: Databricks Delta
● Output: reporting, machine learning, alerting, dashboards
1. Data Reliability
○ ACID-compliant transactions
○ Schema enforcement & evolution
2. Query Performance
○ Very fast at scale
○ Indexing & caching (10-100x faster)
3. Simplified Architecture
○ Unify batch & streaming
○ Early data availability for analytics
ETL: AWS cost and usage
ETL: AWS s3 access logs
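A minimal sketch of the parsing step that an ETL job over raw S3 access logs needs, assuming the standard space-delimited S3 server access log layout (bucket owner, bucket, timestamp, remote IP, requester, request ID, operation, key, request URI, status, error code, bytes sent, ...). It is plain Python rather than the Spark job from the talk; the function name `parse_line` and the sample values are hypothetical. Malformed lines return `None` so that a pipeline can count or quarantine corrupt records instead of failing.

```python
import re
from datetime import datetime

# Leading fields of an S3 server access log record, in order.
LOG_PATTERN = re.compile(
    r'(?P<bucket_owner>\S+) (?P<bucket>\S+) \[(?P<time>[^\]]+)\] '
    r'(?P<remote_ip>\S+) (?P<requester>\S+) (?P<request_id>\S+) '
    r'(?P<operation>\S+) (?P<key>\S+) (?P<request_uri>"[^"]*"|-) '
    r'(?P<http_status>\S+) (?P<error_code>\S+) (?P<bytes_sent>\S+)'
)


def parse_line(line):
    """Parse one log line into a dict, or return None if it is malformed."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    try:
        record["time"] = datetime.strptime(record["time"], "%d/%b/%Y:%H:%M:%S %z")
        # S3 writes "-" when a numeric field has no value.
        sent = record["bytes_sent"]
        record["bytes_sent"] = 0 if sent == "-" else int(sent)
    except ValueError:
        return None
    record["request_uri"] = record["request_uri"].strip('"')
    return record


# Hypothetical sample line; owner, requester, and request IDs are placeholders.
SAMPLE = (
    'OWNERID my-bucket [06/Feb/2019:00:00:38 +0000] 192.0.2.3 REQUESTER REQID '
    'REST.GET.OBJECT images/spark.tar "GET /my-bucket/images/spark.tar HTTP/1.1" '
    '200 - 1024 1024 7 5 "-" "aws-cli" -'
)

record = parse_line(SAMPLE)
print(record["bucket"], record["operation"], record["bytes_sent"])
```

Parsed records like these can then be appended to a table such as `s3_access_logs`, with `bucket` as a natural column to filter and cluster on.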
Manage Databricks Delta tables
● Create table
CREATE TABLE s3_access_logs USING delta LOCATION '$path'
● Optimize table
OPTIMIZE s3_access_logs ZORDER BY bucket
● Query table
SELECT * FROM s3_access_logs WHERE bucket = 'my-bucket'
Files layout & statistics: File1, File2, File3
Delta Logs:
File1: min='a', max='g'
File2: min='g', max='n'
File3: min='o', max='z'
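The per-file min/max statistics above are what make the `WHERE bucket = 'my-bucket'` query cheap: any file whose range cannot contain the value can be skipped without being read. The sketch below illustrates only that idea in plain Python, using the three ranges from the slide; the `FileStats` type and `files_to_scan` function are hypothetical names, not part of Databricks Delta.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileStats:
    name: str
    min_value: str
    max_value: str


# Statistics copied from the slide.
FILES = [
    FileStats("File1", "a", "g"),
    FileStats("File2", "g", "n"),
    FileStats("File3", "o", "z"),
]


def files_to_scan(files, value):
    """Return the names of files whose [min, max] range could contain value."""
    return [f.name for f in files if f.min_value <= value <= f.max_value]


print(files_to_scan(FILES, "my-bucket"))  # ['File2']
```

Only File2 has to be read for `'my-bucket'`. Clustering the table by `bucket` with `ZORDER BY` keeps these ranges narrow, so an equality filter usually touches few files.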
Attributions
● Rule-based attributions
○ accounts
■ dedicated accounts for different teams / use cases
○ tagging
■ tag resources with budget groups
○ manual rules
■ should avoid this as much as possible
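The three kinds of rules above can be sketched as a small lookup chain. This is an illustration under assumptions, not the speaker's implementation: the account IDs, the `budget_group` tag key, the manual rule, and the precedence (dedicated account, then tag, then manual rule) are all hypothetical.

```python
from collections import defaultdict

# Hypothetical mapping from dedicated accounts to budget groups.
ACCOUNT_TO_GROUP = {"111111111111": "data-eng"}

# Hypothetical manual rules as (predicate, budget group); kept short on purpose.
MANUAL_RULES = [
    (lambda item: item["resource"].startswith("shared-logs"), "platform"),
]


def attribute(item):
    """Return the budget group for one billing line item."""
    group = ACCOUNT_TO_GROUP.get(item["account"])
    if group:
        return group
    tag = item.get("tags", {}).get("budget_group")
    if tag:
        return tag
    for predicate, rule_group in MANUAL_RULES:
        if predicate(item):
            return rule_group
    return "unattributed"


def cost_by_group(items):
    """Sum line-item cost per budget group."""
    totals = defaultdict(float)
    for item in items:
        totals[attribute(item)] += item["cost"]
    return dict(totals)


ITEMS = [
    {"account": "111111111111", "resource": "i-0001", "cost": 10.0},
    {"account": "222222222222", "resource": "i-0002", "cost": 5.0,
     "tags": {"budget_group": "ml"}},
    {"account": "222222222222", "resource": "shared-logs-bucket", "cost": 2.5},
    {"account": "222222222222", "resource": "i-0003", "cost": 1.0},
]

print(cost_by_group(ITEMS))
```

Tracking the size of the `unattributed` bucket over time shows how well tagging is being adopted and how many manual rules are still needed.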
The process problem
● Prioritize
○ high data transfer cost
● Optimize
○ reserved instance purchases
● Monitor
○ predictions and alerts
● Automate
○ auto-shutdown unused resources
Story: high data transfer cost
● Observation
○ Cross-region data transfers are expensive
○ Two buckets cost about $1k/day
● Root cause
○ downloading Spark images
Story: high data transfer cost
● Actions
○ Distribute images to multiple regions
○ Monitor cross-region cost
● Results
○ Significantly reduced cost
○ Faster cluster creation
Optimization: reserved instances
● Reserved instances (RI)
○ 1-yr/3-yr commitment in exchange for discounts
○ cons: risk of underutilized instances, upfront cost
○ pros: significant discounts, availability
● Challenges
○ non-trivial to decide how much RI to purchase
○ need to predict the future
Optimization: reserved instances
● Assign budgets to teams
● Provide tool to compute the optimal RI to buy
● Define process for RI purchase requests and approvals
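One way to frame "the optimal RI to buy" is as a search over the number of reserved instances that minimizes total cost against an hourly demand history. The sketch below assumes a single instance type, prices in integer cents per hour, and that future demand resembles the sample; the function names and numbers are hypothetical and the talk does not describe how its tool works.

```python
def total_cost(demand, reserved, on_demand_rate, reserved_rate):
    """Cost of serving hourly demand with `reserved` RIs plus on-demand overflow."""
    return sum(
        reserved * reserved_rate + max(0, d - reserved) * on_demand_rate
        for d in demand
    )


def optimal_reserved(demand, on_demand_rate, reserved_rate):
    """Return the RI count in 0..max(demand) with the lowest total cost."""
    if not demand:
        return 0
    candidates = range(max(demand) + 1)
    return min(
        candidates,
        key=lambda r: total_cost(demand, r, on_demand_rate, reserved_rate),
    )


# Hypothetical instance counts for six consecutive hours.
DEMAND = [2, 2, 4, 8, 4, 2]
ON_DEMAND_CENTS = 10  # assumed on-demand price per instance-hour
RESERVED_CENTS = 6    # assumed effective RI price per instance-hour

best = optimal_reserved(DEMAND, ON_DEMAND_CENTS, RESERVED_CENTS)
print(best, total_cost(DEMAND, best, ON_DEMAND_CENTS, RESERVED_CENTS))  # 2 172
```

Here the baseline of two instances is worth reserving because it is busy every hour, while a third instance would be used only half the time, below the 60% utilization at which an RI at this price breaks even.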
Monitor and alerts
● Why
○ prevent cost regressions
○ get ahead of “bill shock”
● Challenges
○ different patterns for different use cases
○ changing baselines
Monitor and alerts
Picture from Sharma, S., Swayne, D.A. & Obimbo, C. Energ. Ecol. Environ. (2016) 1: 123.
https://doi.org/10.1007/s40974-016-0011-1
Monitor and alerts
● adaptive prediction with change point detection
● alerts for each budget group
● scheduled jobs and dashboards
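A rough sketch of adaptive prediction with a change point: find the split of a daily cost series that most reduces the squared error of two fitted lines, refit a line on the data after the split, and alert when the next observation exceeds the forecast by a margin. The talk does not specify its detection method; the single-split search, the `min_gain` threshold, and the 20% tolerance are all assumptions made for illustration.

```python
def fit_line(ys):
    """Least-squares fit y = a + b*x for x = 0..n-1; returns (a, b)."""
    n = len(ys)
    mean_x = (n - 1) / 2
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(ys))
    b = sxy / sxx if sxx else 0.0
    return mean_y - b * mean_x, b


def sse(ys):
    """Sum of squared residuals of the fitted line."""
    a, b = fit_line(ys)
    return sum((y - (a + b * x)) ** 2 for x, y in enumerate(ys))


def find_change_point(ys, min_segment=3, min_gain=0.5):
    """Index of the best single split, or 0 if no split improves the fit enough."""
    base_error = sse(ys)
    best_index, best_error = 0, base_error
    for i in range(min_segment, len(ys) - min_segment + 1):
        error = sse(ys[:i]) + sse(ys[i:])
        if error < best_error:
            best_index, best_error = i, error
    # Accept the split only if it removes at least min_gain of the error.
    if best_error < (1 - min_gain) * base_error:
        return best_index
    return 0


def predict_next(ys):
    """Forecast the next value from a line fitted after the change point."""
    recent = ys[find_change_point(ys):]
    a, b = fit_line(recent)
    return a + b * len(recent)


def should_alert(ys, actual, tolerance=0.2):
    """True when the observed cost exceeds the forecast by more than tolerance."""
    return actual > predict_next(ys) * (1 + tolerance)


# Hypothetical daily cost for one budget group: the baseline doubles on day 6.
DAILY_COST = [10.0] * 6 + [20.0] * 6

print(find_change_point(DAILY_COST), predict_next(DAILY_COST))  # 6 20.0
```

Refitting after the change point is what keeps the baseline from lagging: a single line over all twelve days would extrapolate a rising trend that the recent data does not show.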
Auto-shutdown with Custodian
● https://github.com/capitalone/cloud-custodian
○ Rule-based cloud infrastructure management tool
Summary
● in-house solutions because
○ flexibility and deeper understanding of cost and usage
● cost attribution - a data problem
○ ETL: Databricks Delta for analytics ready data
○ explore: Databricks Notebooks and BI tools via JDBC
○ attribute: rule-based, tagging is important
● cost control - a process problem
○ prioritize: get the work done!
○ optimize: distributed ownership, centralized tools
○ monitor: change points + basic linear model
○ automate: Custodian for managing cloud infrastructure
Thank you
xwang@databricks.com
https://www.linkedin.com/in/xuanwang2
https://databricks.com/product/unified-analytics-platform
BACKUP SLIDES
New: Databricks Delta
Extends Apache Spark to simplify data reliability and performance
Data Reliability + Fast Analytics
The data problem
● cost and usage report
○ detailed usage and billing information, by hour
○ CSV, delivered at least once a day
● S3 inventory
○ list of all objects in S3 and associated metadata
○ CSV/ORC, delivered daily/weekly
● S3 access logs
○ requests/API calls to access S3 objects
○ raw logs, delivered frequently
● EC2/RDS snapshots and reserved instances
○ JSON from REST API
