Building a Turbo-fast Data Warehousing Platform with Databricks
Parviz Deyhim
Agenda
• Introduction to Databricks
• Building an end-to-end data warehouse platform
• Infrastructure
• Data ingest
• ETL
• Performance optimizations
• Process & Visualize
• Securing your platform
• Conclusion
About the Speakers
Parviz Deyhim (Speaker)
Parviz works with a variety of customers, helping them adopt
Apache Spark and architect scalable data processing platforms
with Databricks Cloud. Before joining Databricks, Parviz worked
at AWS as a big-data solutions architect.
Denny Lee (Moderator)
Denny is a Technology Evangelist with Databricks. Before
joining Databricks, Denny worked as a Senior Director of Data
Sciences Engineering at Concur and was part of the incubation
team that built Hadoop on Windows and Azure (currently known as
HDInsight).
Introduction to Databricks
We are Databricks, the company behind Spark
• Founded by the creators of Apache Spark
• Contributed ~75% of the Spark code in 2014
• Created Databricks Cloud, a cloud-based big data platform
on top of Spark that makes big data simple
A typical big data project is far from ideal
• Import and explore data: weeks to prepare, then explore data and find insights
• Get a cluster up and running: months to build, weeks to provision in existing
• Build and deploy data applications: months of re-engineering to deploy as an application
For each new project, it takes months to get results
How Databricks, powered by Spark, helps our customers
Databricks helps you harness the power of Spark:
• No infrastructure management
• Interactive workflow
• Collaboration across the organization
• Experiment to production instantly
Spark:
• Speed: 100x faster than MapReduce
• Flexibility
• Ease of use
• Unified: Spark SQL + ML + Streaming + graph processing
“Light switch” Spark clusters in the cloud
3rd Party Applications
Interactive workspace with notebooks
Production Pipeline Scheduler
Databricks Internal Data Warehouse Use Case
Databricks Internal DWH Use Case
Today: Collect logs from deployed customer clusters
Our goal:
○ Understand customer behavior
○ Create reports for various teams (e.g., customer success and support)
Stages
• Build & Maintain Infrastructure
• Data Ingest
• Transform & Store
• Process & Visualize
Challenges of Building a Data Warehouse
Datacenter or Cloud?
• Build/rent data center or use a public cloud offering?
Picking the right resources
• If datacenter: what server sizes and types? Storage?
• In cloud: what instance size, and how large a disk/SSD to use?
Deployment and Automation
• How to automate the deployment process: Chef, Puppet, CloudFormation, etc.
Maintenance
• How to perform seamless upgrades?
Securing the platform
• How to encrypt datasets?
• Controls, Policies, Audits
Databricks Hosted Platform
Managed and automated hosted platform
• Fully deployed on AWS
• Create resources with a single click
• Zero touch maintenance
Compute Resources
Automatic Instance Provisioning
• R3.2xlarge instances
• Use SSD for caching
• No EBS
• Deployed in major regions, with more coming
Networking: VPC
Security & Isolation with AWS VPC
Networking: Enhanced Networking
High performance node to node
connectivity with placement groups
Integration with AWS services
S3
Kinesis
RDS
Redshift
...
Databricks Demo
Customer Data Sources
Customers have a variety of data sources
Cloud storage: S3
Databases: MySQL, NoSQL
APIs: Facebook, Salesforce, etc.
Datasets often need to be joined
Traditional Approach
Traditionally, data warehouses require data to be copied
Common question: How do I move my datasets to Databricks?
A schema must be created before data is copied
Traditional Approach: Challenges
Moving Data:
• Very expensive and time-consuming
• Creates inconsistency as data gets updated
Predefined Schema:
• Challenging to change the schema for different use cases
Databricks Approach: Data Sources
De-coupling compute from storage
● Leverage S3. No HDFS
Read directly from data sources
● Eliminate the need to copy data
Schema definition on read
● SparkSQL
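As a rough illustration of schema definition on read, the sketch below leaves raw JSON lines untouched and applies a schema only when a query reads them. It is plain Python rather than Spark SQL, and the field names (`customer`, `event`, `duration_ms`, `cluster`) and the `read_with_schema` helper are hypothetical.

```python
import json

# Raw records stay exactly as they were written; nothing is reshaped up front.
RAW_LINES = [
    '{"customer": "acme", "event": "login", "duration_ms": "120"}',
    '{"customer": "globex", "event": "query", "duration_ms": "950", "cluster": "c-1"}',
]


def read_with_schema(lines, schema):
    """Parse raw JSON lines, keeping only the schema's fields and casting each one.

    `schema` maps a field name to a casting function. Fields missing from a
    record come back as None, so one schema can cover uneven records.
    """
    rows = []
    for line in lines:
        record = json.loads(line)
        row = {}
        for field, cast in schema.items():
            value = record.get(field)
            row[field] = cast(value) if value is not None else None
        rows.append(row)
    return rows


# Two use cases read the same raw data with different schemas.
usage_schema = {"customer": str, "duration_ms": int}
ops_schema = {"customer": str, "cluster": str}

usage_rows = read_with_schema(RAW_LINES, usage_schema)
ops_rows = read_with_schema(RAW_LINES, ops_schema)
```

Because the schema lives with the reader and not with the stored data, a new use case only needs a new schema, not a new copy of the data.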
Spark Data Sources Support
Databricks Use Case
Different data sources
• Customer metrics on S3
• Internal CRM
Need a single view of our customers
Databricks Use Case
We use Spark to join datasets
Databricks Demo
1. Reading data from external API
2. Reading usage logs data from S3
3. Joining usage and external datasets
Link
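The demo itself is not reproduced here. As a stand-in, the sketch below shows the shape of the third step, joining usage records with records from an external source to get a single view per customer. It is plain Python over in-memory rows, not the Spark code used in the demo, and the `customer_id` key, the field names, and the `join_usage_with_crm` helper are assumptions made for illustration.

```python
def join_usage_with_crm(usage_rows, crm_rows, key="customer_id"):
    """Left-join usage rows to CRM rows on `key` using a simple hash join."""
    crm_by_key = {row[key]: row for row in crm_rows}
    joined = []
    for usage in usage_rows:
        crm = crm_by_key.get(usage[key], {})
        # Usage fields win on a name clash; unmatched usage rows are kept as-is.
        joined.append({**crm, **usage})
    return joined


usage_rows = [
    {"customer_id": 1, "notebook_runs": 42},
    {"customer_id": 2, "notebook_runs": 7},
    {"customer_id": 3, "notebook_runs": 1},
]
crm_rows = [
    {"customer_id": 1, "account": "Acme", "tier": "enterprise"},
    {"customer_id": 2, "account": "Globex", "tier": "trial"},
]

customer_view = join_usage_with_crm(usage_rows, crm_rows)
```

A left join is used so that usage from customers missing in the CRM still shows up in the combined view.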
Data Transformation
Need to transform data before the join operation
• Aggregation
• Consolidation
• Data cleansing
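A minimal sketch of the three transformation types, written in plain Python over in-memory rows so it stays self-contained. The field names (`customer`, `bytes_read`) and the two "region" sources are hypothetical; in Spark the same steps would be expressed as DataFrame operations.

```python
from collections import defaultdict


def cleanse(rows):
    """Data cleansing: drop rows without a customer, normalize names and types."""
    cleaned = []
    for row in rows:
        name = (row.get("customer") or "").strip().lower()
        if not name:
            continue
        cleaned.append({"customer": name, "bytes_read": int(row.get("bytes_read") or 0)})
    return cleaned


def consolidate(*sources):
    """Consolidation: merge rows from several sources into one list."""
    merged = []
    for source in sources:
        merged.extend(source)
    return merged


def aggregate(rows):
    """Aggregation: sum bytes_read per customer."""
    totals = defaultdict(int)
    for row in rows:
        totals[row["customer"]] += row["bytes_read"]
    return dict(totals)


region_a = [{"customer": " Acme ", "bytes_read": "100"}, {"customer": None, "bytes_read": "5"}]
region_b = [{"customer": "acme", "bytes_read": 50}, {"customer": "Globex", "bytes_read": None}]

totals = aggregate(cleanse(consolidate(region_a, region_b)))
```

Cleansing runs before aggregation here so that `" Acme "` and `"acme"` are counted as the same customer.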
Databricks Demo
Link
ETL: Common Approaches
Two common approaches
• Offline
• Streaming/real-time
Extract & Transform
Offline ET
• Data gets stored in raw format (as is)
• A recurring job performs ET on the dataset
• The new transformed dataset gets stored for later processing
Advantage
• Easy and quick to set up
Disadvantages
• Traditionally slow process
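The offline flow above can be sketched as one run of a recurring job: read the raw dataset, transform it, and store the result as a new dataset while leaving the raw data alone. This is a plain-Python sketch using local temporary files in place of S3; the file names, the fields, and the `run_offline_et` helper are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def run_offline_et(raw_path, out_path):
    """One run of a recurring job: read raw lines, transform, store a new dataset."""
    transformed = []
    for line in Path(raw_path).read_text().splitlines():
        if not line.strip():
            continue
        record = json.loads(line)
        # Example transformation: keep two fields and convert milliseconds to seconds.
        transformed.append({
            "customer": record["customer"],
            "duration_s": record["duration_ms"] / 1000,
        })
    Path(out_path).write_text("\n".join(json.dumps(r) for r in transformed))
    return len(transformed)


workdir = Path(tempfile.mkdtemp())
raw_file = workdir / "raw.jsonl"
raw_file.write_text(
    '{"customer": "acme", "duration_ms": 1500}\n'
    '{"customer": "globex", "duration_ms": 250}\n'
)
out_file = workdir / "transformed.jsonl"

count = run_offline_et(raw_file, out_file)
```

A scheduler would call `run_offline_et` on an interval; later processing reads only the transformed file.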
Databricks Jobs
• Schedule production workflows using notebooks or JARs
• Create pipelines
• Monitor results
Databricks Demo
Jobs
Performance Optimizations
Storing data in parquet
Partitioning dataset
Spark caching
• JVM
• SSD
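To make the partitioning item concrete, here is a small plain-Python sketch of the idea: rows are grouped under `column=value` names, the directory naming Spark uses for partitioned output, so a query for one value reads only that partition. The `date` column, the sample rows, and both helper functions are hypothetical.

```python
from collections import defaultdict


def partition_by(rows, column):
    """Group rows into partitions keyed by a 'column=value' directory name."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[f"{column}={row[column]}"].append(row)
    return dict(partitions)


def read_partition(partitions, column, value):
    """Return the rows for one partition value without touching other partitions."""
    return partitions.get(f"{column}={value}", [])


logs = [
    {"date": "2015-06-01", "customer": "acme", "events": 10},
    {"date": "2015-06-01", "customer": "globex", "events": 4},
    {"date": "2015-06-02", "customer": "acme", "events": 7},
]

partitions = partition_by(logs, "date")
june_first = read_partition(partitions, "date", "2015-06-01")
```

Choosing a column that queries commonly filter on, such as a date, is what lets most of the data be skipped.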
Spark allows data to be stored in different data sources
Parquet: Efficient columnar storage format for data warehousing use cases
Optimization: Parquet
Columnar
• Faster scans
• Better compression
Optimized for storage
• Memory
• Disk
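The sketch below illustrates why a columnar layout gives faster scans and better compression. It is plain Python, not the Parquet format itself: the sample rows are hypothetical, and run-length encoding stands in as a simplified example of the kind of encoding a columnar format can apply.

```python
rows = [
    {"customer": "acme", "region": "us-west", "bytes_read": 100},
    {"customer": "globex", "region": "us-west", "bytes_read": 250},
    {"customer": "initech", "region": "us-east", "bytes_read": 75},
]


def to_columns(rows):
    """Pivot row-oriented records into one list of values per column."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def run_length_encode(values):
    """Collapse consecutive repeats into [value, count] pairs."""
    encoded = []
    for value in values:
        if encoded and encoded[-1][0] == value:
            encoded[-1][1] += 1
        else:
            encoded.append([value, 1])
    return encoded


columns = to_columns(rows)

# A scan over one column touches only that column's values.
total_bytes = sum(columns["bytes_read"])

# Values of the same column sit next to each other, so repeats compress well.
region_rle = run_length_encode(columns["region"])
```

In a row layout, the same sum would have to read every field of every record to reach `bytes_read`.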
Optimization: Caching (JVM)
Spark caching: JVM
Advantages
• Fast memory access
Disadvantages
• GC pressure
• No durability after JVM crash
Optimization: Caching (SSD)
Spark caching: SSD
Advantages
• Survives JVM and instance crash

Disadvantages
• Much slower than JVM caching
Databricks Use Case:
Storing aggregate data in Parquet
Databricks Demo
Link
Databricks Visualizations
Notebook Visualizations
1. Built-in graphing capabilities
2. ggplot and matplotlib
3. D3 visualizations
Databricks Visualizations
Notebook Visualizations (DEMO)
D3/SVG
3rd Party Visualizations
Zoomdata
Securing Your Platform
Secure Platform
Encryption
1. In flight: SSL
2. At rest: S3 Encryption
Secure Platform
User Management: ACLs
Notebooks read-write-execute
Admin users
Secure Platform
On Our Roadmap
S3 KMS encryption
Single Sign On (SSO)
AD/LDAP support
IAM Roles for Spark nodes
Thank you
