Building a Turbo-fast Data Warehousing Platform with Databricks
Parviz Deyhim
Agenda
• Introduction to Databricks
• Building an end-to-end data warehouse platform
• Infrastructure
• Data ingest
• ETL
• Performance optimizations
• Process & Visualize
• Securing your platform
• Conclusion
About the Speakers
Parviz Deyhim (Speaker)
Parviz works with a variety of customers, helping them adopt Apache Spark and architect scalable data processing platforms with Databricks Cloud. Prior to joining Databricks, Parviz worked at AWS as a big-data solutions architect.
Denny Lee (Moderator)
Denny is a Technology Evangelist with Databricks. Prior to joining Databricks, Denny worked as a Senior Director of Data Sciences Engineering at Concur and was part of the incubation team that built Hadoop on Windows and Azure (currently known as HDInsight).
Introduction to Databricks
We are Databricks, the company behind Spark
• Founded by the creators of Apache Spark
• Contributed ~75% of the Spark code in 2014
• Created Databricks Cloud, a cloud-based big data platform on top of Spark, to make big data simple
A typical big data project is far from ideal
• Import and explore data: weeks to prepare, then explore data and find insights
• Get a cluster up and running: months to build, weeks to provision in existing
• Build and deploy data applications: months of re-engineering to deploy as an application
For each new project, it takes months to get results
How Databricks, powered by Spark, helps our customers
• No infrastructure management
• Interactive workflow
• Collaboration across the organization
• Experiment to production instantly
• 100x faster than MapReduce
• Spark SQL + ML + Streaming + Graph processing
Speed, Flexibility, Ease-of-use, Unified
Databricks helps you harness the power of Spark
“Light switch” Spark clusters in the cloud
3rd Party Applications
Interactive workspace with notebooks
Production Pipeline Scheduler
Databricks Internal Data Warehouse Use Case
Databricks Internal DWH Use Case
Today: Collect logs from deployed customer clusters
Our Goal:
○ Understand customer behavior
○ Create reports for various teams (e.g. customer success & support)
Stages
• Build & Maintain Infrastructure
• Data Ingest
• Transform & Store
• Process & Visualize
Challenges of Building a Data Warehouse
Datacenter or Cloud?
• Build/rent data center or use a public cloud offering?
Picking the right resources
• If datacenter: what server sizes and types? Storage?
• In cloud: what instance size, and how large a disk/SSD to use?
Deployment and Automation
• How to automate the deployment process:
• Chef, Puppet, CloudFormation, etc.
Maintenance
• How to perform seamless upgrades?
Securing the platform
• How to encrypt datasets?
• Controls, Policies, Audits
Databricks Hosted Platform
Managed and automated hosted platform
• Fully deployed on AWS
• Create resources with a single click
• Zero touch maintenance
Compute Resources
Automatic Instance Provisioning
• R3.2xlarge instances
• Use SSD for caching
• No EBS
• Deployed in major regions, with more coming
Networking: VPC
Security & Isolation with AWS VPC
Networking: Enhanced Networking
High-performance node-to-node connectivity with placement groups
Integration with AWS services
S3
Kinesis
RDS
Redshift
...
Databricks Demo
Customer Data Sources
Customers have a variety of data sources
Cloud storage: S3
Databases: MySQL, NoSQL
APIs: Facebook, Salesforce, etc.
Often required to join datasets
Traditional Approach
Traditionally, data warehouses require data to be copied
Common Question: How do I move my datasets to Databricks?
Traditional Approach
Required to create a schema before data is copied
Traditional Approach: Challenges
Moving Data:
• Very expensive and time-consuming
• Creates inconsistency as data gets updated
Predefined Schema:
• Challenging to change the schema for different use cases
Databricks Approach: Data Sources
De-coupling compute from storage
● Leverage S3. No HDFS
Read directly from data sources
● Eliminate the need to copy data
Schema definition on read
● Spark SQL
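A rough illustration of schema definition on read, written in plain Python rather than the Spark SQL API. Raw records stay in storage untouched, and each reader applies its own schema when it loads them. The record fields (`customer`, `nodes`, `ts`, `extra`) and the helper `read_with_schema` are hypothetical.

```python
import json

# Raw log lines, stored as-is (in the talk's setup these would live in S3).
RAW_LINES = [
    '{"customer": "acme", "nodes": "4", "ts": "2015-03-01"}',
    '{"customer": "globex", "nodes": "12", "ts": "2015-03-01", "extra": "x"}',
]


def read_with_schema(lines, schema):
    """Parse raw JSON lines, keeping only the schema's fields and casting
    each one with the schema's type; fields missing from a record become None."""
    rows = []
    for line in lines:
        record = json.loads(line)
        rows.append({
            name: cast(record[name]) if name in record else None
            for name, cast in schema.items()
        })
    return rows


# Two use cases read the same raw data with different schemas; nothing is copied.
usage_rows = read_with_schema(RAW_LINES, {"customer": str, "nodes": int})
audit_rows = read_with_schema(RAW_LINES, {"customer": str, "ts": str, "extra": str})
```

Changing the schema for a new use case means passing a different mapping at read time; the stored data is never rewritten.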
Spark Data Sources Support
Databricks Use Case
Different data sources
• Customer metrics on S3
• Internal CRM
Need a single view of our customers
Databricks Use Case
We use Spark to join datasets
Databricks Demo
1. Reading data from external API
2. Reading usage logs data from S3
3. Joining usage and external datasets
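The three demo steps can be sketched in plain Python as a hash join. This is not the demo notebook: the sample records, the `customer_id` key, and the `hash_join` helper are made up for illustration, and in Spark the same idea would be expressed as a DataFrame or SQL join.

```python
from collections import defaultdict

# Step 1 (stand-in): account records as they might come back from a CRM API.
crm_accounts = [
    {"customer_id": 1, "name": "acme", "plan": "pro"},
    {"customer_id": 2, "name": "globex", "plan": "trial"},
]

# Step 2 (stand-in): usage log records as they might be read from S3.
usage_logs = [
    {"customer_id": 1, "cluster_hours": 10.0},
    {"customer_id": 1, "cluster_hours": 2.5},
    {"customer_id": 3, "cluster_hours": 7.0},  # no matching CRM account
]


def hash_join(left, right, key):
    """Inner join: index the left side by key, then probe it with the right side."""
    index = defaultdict(list)
    for row in left:
        index[row[key]].append(row)
    joined = []
    for row in right:
        for match in index.get(row[key], []):
            joined.append({**match, **row})
    return joined


# Step 3: a single view of each customer's usage.
customer_usage = hash_join(crm_accounts, usage_logs, "customer_id")
```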
Data Transformation
Need to transform data before the join operation
• Aggregation
• Consolidation
• Data cleansing
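As a small sketch of these three transformations, assuming made-up usage records with a `customer` name and a `cluster_hours` value (plain Python, not the Spark code from the demo):

```python
from collections import defaultdict

raw_usage = [
    {"customer": " Acme ", "cluster_hours": "10"},
    {"customer": "acme", "cluster_hours": "2.5"},
    {"customer": "Globex", "cluster_hours": None},  # missing value
    {"customer": "globex", "cluster_hours": "4"},
]


def cleanse(rows):
    """Data cleansing: drop rows with missing hours and cast hours to float."""
    return [
        {"customer": r["customer"], "cluster_hours": float(r["cluster_hours"])}
        for r in rows
        if r["cluster_hours"] is not None
    ]


def consolidate(rows):
    """Consolidation: map spelling variants of a customer name to one key."""
    return [{**r, "customer": r["customer"].strip().lower()} for r in rows]


def aggregate(rows):
    """Aggregation: total cluster hours per customer."""
    totals = defaultdict(float)
    for r in rows:
        totals[r["customer"]] += r["cluster_hours"]
    return dict(totals)


hours_by_customer = aggregate(consolidate(cleanse(raw_usage)))
```

Running the steps before the join keeps the join key consistent, so variants such as " Acme " and "acme" match the same customer.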
Databricks Demo
ETL: Common Approaches
Two common approaches
• Offline
• Streaming/real-time
Extract & Transform
Offline ET
• Data gets stored in raw format (as is)
• A recurring job performs ET on the dataset
• The new transformed dataset gets stored for later processing
Advantage
• Easy and quick to set up
Disadvantage
• Traditionally a slow process
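One run of such an offline job might look roughly like the following sketch. It uses local files in place of cloud storage; the directory layout, the `.jsonl` file format, the record fields, and the function name `run_offline_et` are all assumptions for illustration. A scheduler would call the function on a recurring basis.

```python
import json
import tempfile
from pathlib import Path


def run_offline_et(raw_dir: Path, out_dir: Path) -> int:
    """One run of the recurring job: read every raw file, transform each
    record, and write the result as a new dataset. Returns records written."""
    out_dir.mkdir(parents=True, exist_ok=True)
    written = 0
    for raw_file in sorted(raw_dir.glob("*.jsonl")):
        out_file = out_dir / raw_file.name
        with raw_file.open() as src, out_file.open("w") as dst:
            for line in src:
                record = json.loads(line)
                # Example transform: keep two fields and normalize the name.
                dst.write(json.dumps({
                    "customer": record["customer"].lower(),
                    "nodes": int(record["nodes"]),
                }) + "\n")
                written += 1
    return written


# Demo run against a temporary directory standing in for raw storage.
with tempfile.TemporaryDirectory() as tmp:
    raw, transformed = Path(tmp) / "raw", Path(tmp) / "transformed"
    raw.mkdir()
    (raw / "2015-03-01.jsonl").write_text(
        '{"customer": "ACME", "nodes": "4", "debug": 1}\n'
    )
    records_written = run_offline_et(raw, transformed)
    first_output = (transformed / "2015-03-01.jsonl").read_text()
```

The raw files are left untouched, so the job can be rerun with a different transform without ingesting the data again.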
Databricks Jobs
• Schedule production workflows using notebooks or JARs
• Create pipelines
• Monitor results
Databricks Demo
Jobs
Performance Optimizations
Storing data in Parquet
Partitioning datasets
Spark caching
• JVM
• SSD
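To illustrate why partitioning helps, here is a minimal plain-Python sketch in which rows are grouped by a `date` column and a query for one date scans only that group. The sample events and the helper names are hypothetical; a real layout would typically be one directory per partition value in storage.

```python
from collections import defaultdict

events = [
    {"date": "2015-03-01", "customer": "acme", "hours": 3.0},
    {"date": "2015-03-01", "customer": "globex", "hours": 1.0},
    {"date": "2015-03-02", "customer": "acme", "hours": 5.0},
]


def partition_by(rows, column):
    """Group rows by a column value, like one directory per value on disk."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[column]].append(row)
    return dict(partitions)


def total_hours(partitions, date):
    """Answer a query for one date by scanning only that date's partition."""
    scanned = partitions.get(date, [])
    return sum(r["hours"] for r in scanned), len(scanned)


by_date = partition_by(events, "date")
hours, rows_scanned = total_hours(by_date, "2015-03-02")
```

Here the query reads one row instead of three; the saving grows with the number of partitions the query can skip.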
Spark allows data to be stored in different data sources
Parquet: Efficient columnar storage format for data warehousing use-cases
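The idea behind a columnar layout can be sketched in a few lines of plain Python. This does not read or write Parquet files; the sample rows are invented, and run-length encoding stands in here for the kinds of encodings a columnar format can apply.

```python
rows = [
    {"customer": "acme", "region": "us-west", "hours": 3.0},
    {"customer": "globex", "region": "us-west", "hours": 1.0},
    {"customer": "initech", "region": "us-west", "hours": 5.0},
]


def to_columns(rows):
    """Pivot row-oriented records into one list per column."""
    return {name: [r[name] for r in rows] for name in rows[0]}


def run_length_encode(values):
    """Collapse runs of repeated values into [value, count] pairs."""
    encoded = []
    for v in values:
        if encoded and encoded[-1][0] == v:
            encoded[-1][1] += 1
        else:
            encoded.append([v, 1])
    return encoded


columns = to_columns(rows)
# A query over one column touches only that column's values.
total_hours = sum(columns["hours"])
# Values in a column share a type and often repeat, which compresses well.
region_rle = run_length_encode(columns["region"])
```

An aggregate over `hours` never touches `customer` or `region`, which is where the faster scans come from.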
Optimization: Parquet
Columnar
• Faster scans
Better compression
Optimized for storage
• Memory
• Disk
Optimization: Caching (JVM)
Spark caching: JVM
Advantages
• Fast memory access
Disadvantages
• GC pressure
• No durability after JVM crash
Optimization: Caching (SSD)
Spark caching: SSD
Advantages
• Survives JVM and instance crash

Disadvantages
• Much slower than JVM caching
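The trade-off between the two caching tiers can be illustrated with a small two-level cache in plain Python. This is only an analogy for Spark's JVM and SSD caching, not its implementation: the `TwoTierCache` class, the pickle files, and the cache key are all hypothetical.

```python
import pickle
import tempfile
from pathlib import Path


class TwoTierCache:
    """Look in memory first, then on local disk, then recompute."""

    def __init__(self, disk_dir):
        self.memory = {}
        self.disk_dir = Path(disk_dir)

    def _path(self, key):
        return self.disk_dir / f"{key}.pkl"

    def get(self, key, compute):
        """Return (value, tier), where tier names where the value came from."""
        if key in self.memory:
            return self.memory[key], "memory"
        path = self._path(key)
        if path.exists():
            value = pickle.loads(path.read_bytes())
            self.memory[key] = value
            return value, "disk"
        value = compute()
        self.memory[key] = value
        path.write_bytes(pickle.dumps(value))
        return value, "computed"


with tempfile.TemporaryDirectory() as tmp:
    cache = TwoTierCache(tmp)
    first = cache.get("daily_totals", lambda: {"acme": 12.5})
    second = cache.get("daily_totals", lambda: {"acme": 12.5})
    # Simulate a process restart: the in-memory tier is gone, the disk tier is not.
    restarted = TwoTierCache(tmp)
    third = restarted.get("daily_totals", lambda: {"acme": 12.5})
```

The memory tier answers fastest but is lost when the process dies, while the disk tier costs a read and a deserialization yet still avoids recomputing the value.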
Databricks Use Case: Storing aggregate data in Parquet
Databricks Demo
Databricks Visualizations
Notebook Visualizations
1. Built-in graphing capabilities
2. ggplot and matplotlib
3. D3 visualizations
Databricks Visualizations
Notebook Visualizations (DEMO)
D3/SVG
3rd Party Visualizations
Zoomdata
Securing Your Platform
Secure Platform
Encryption
1. In flight: SSL
2. At rest: S3 Encryption
Secure Platform
User Management: ACLs
Notebooks read-write-execute
Admin users
Secure Platform
On Our Roadmap
S3 KMS encryption
Single Sign On (SSO)
AD/LDAP support
IAM Roles for Spark nodes
Thank you
