Databricks’ Data Pipelines:
Journey and Lessons Learned
Yu Peng, Burak Yavuz
07/06/2016
Who Are We
Yu Peng
Data Engineer at Databricks
Building Databricks’ next-generation data pipeline
on top of Apache Spark
BS from Xiamen University
Ph.D. from The University of Hong Kong
Burak Yavuz
Software Engineer at Databricks
Contributor to Spark since Spark 1.1
Maintainer of Spark Packages
BS in Mechanical Engineering at Bogazici University
MS in Management Science & Engineering at Stanford University
Building a data pipeline is hard
• At-least-once or exactly-once semantics
• Fault tolerance
• Resource management
• Scalability
• Maintainability
Apache® Spark™ + Databricks = Our Solution
• All ETL jobs are built on top of Apache Spark
• Unified solution, everything in the same place
• All ETL jobs are run on Databricks platform
• Platform for Data Engineers and Scientists
• Test out new Spark and Databricks features
Apache, Apache Spark and Spark are trademarks of the Apache Software Foundation
Classic Lambda Data Pipeline
[Diagram: services (service 0, service 1, ..., service x), each with a log collector, feed a centralized messaging system; Delta ETL and Batch ETL sit between the messaging system and a storage system.]
Databricks Data Pipeline Overview
[Diagram, built up over several slides. Sources: customer deployments (Customer Dep 0, Dep 1, Dep 2, ...) and a Databricks deployment ("Databricks Dep"), each made of clusters (Cluster 0, Cluster 1, Cluster 2, ...) in which services run alongside a log-daemon. Transport: Amazon Kinesis. Inside the Databricks deployment: a sync daemon writes raw record batches (json) to DBFS (Databricks Filesystem); ETL jobs, run as Databricks Jobs, produce tables (parquet); data analysis runs on the tables. A real-time analysis component is also shown.]
Log collection (Log-daemon)
• Fault tolerance and at-least-once semantics
• Streaming
• Batch
• Spark History Server
• Multi-tenant and config driven
• Spark container
Log Daemon Architecture
[Diagram: each service (Service 1, Service 2, ..., Service x) writes to active.log; log rotation produces hourly files such as 2015-11-30-20.log and 2015-11-30-19.log. Inside the log daemon, each log stream (logStream1, logStream2, ..., logStreamX) has its own reader and producer, and the daemon keeps state files. A message producer sends to Kinesis topics (topic-1, topic-2).]
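One way a reader/producer pair with a state file can give at-least-once delivery is to persist the read offset only after a send succeeds, so a crash between send and commit causes a resend rather than a loss. The sketch below illustrates that idea only. The names `ship_new_lines` and `send`, and the JSON state-file layout, are hypothetical; log rotation is left out, and the slides do not show the actual log-daemon code.

```python
import json
import os
import tempfile


def load_offset(state_path):
    """Return the last committed byte offset, or 0 if no state file exists yet."""
    if not os.path.exists(state_path):
        return 0
    with open(state_path) as f:
        return json.load(f)["offset"]


def commit_offset(state_path, offset):
    """Persist the offset atomically: write a temp file, then rename over the old one."""
    tmp_path = state_path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump({"offset": offset}, f)
    os.replace(tmp_path, state_path)


def ship_new_lines(log_path, state_path, send):
    """Send every complete line after the committed offset, committing after each send.

    `send` stands in for a message producer (hypothetical; for example a Kinesis put).
    If it raises, the offset is not advanced, so the line is retried on the next run.
    """
    offset = load_offset(state_path)
    shipped = 0
    with open(log_path, "rb") as f:
        f.seek(offset)
        while True:
            line = f.readline()
            if not line.endswith(b"\n"):
                break  # end of file or a partially written line: pick it up later
            send(line.rstrip(b"\n").decode("utf-8"))
            offset += len(line)
            commit_offset(state_path, offset)
            shipped += 1
    return shipped
```

Because the commit follows the send, a consumer downstream may see the same line twice after a crash, which is why a later deduplication step is useful.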
Sync Daemon
• Read from Kinesis and Write to DBFS
• Buffer and write in batches (128 MB or 5 Mins)
• Partitioned by date
• A long running Apache Spark job
• Easy to scale up and down
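The "128 MB or 5 Mins" rule can be read as a buffer that flushes on whichever limit is reached first. The following is a minimal sketch under that reading, not the sync daemon itself: `BatchBuffer`, the `write_batch` sink, and the `raw/date=.../batch-....json` path layout are all hypothetical, and the partition date is taken from the flush time purely for illustration.

```python
from datetime import datetime, timezone

MAX_BYTES = 128 * 1024 * 1024   # size threshold from the slide: 128 MB
MAX_AGE_SECONDS = 5 * 60        # age threshold from the slide: 5 minutes


class BatchBuffer:
    """Buffer records and flush when either the size or the age limit is hit."""

    def __init__(self, write_batch, max_bytes=MAX_BYTES, max_age=MAX_AGE_SECONDS):
        self.write_batch = write_batch  # hypothetical sink: write_batch(path, records)
        self.max_bytes = max_bytes
        self.max_age = max_age
        self.records = []
        self.size = 0
        self.started_at = None

    def add(self, record, now):
        """Add one record (a str); `now` is a Unix timestamp in seconds."""
        if self.started_at is None:
            self.started_at = now
        self.records.append(record)
        self.size += len(record.encode("utf-8"))
        if self.size >= self.max_bytes or now - self.started_at >= self.max_age:
            self.flush(now)

    def flush(self, now):
        """Write the buffered records under a date-partitioned path and reset."""
        if not self.records:
            return
        day = datetime.fromtimestamp(now, tz=timezone.utc).strftime("%Y-%m-%d")
        path = f"raw/date={day}/batch-{int(now)}.json"
        self.write_batch(path, self.records)
        self.records, self.size, self.started_at = [], 0, None
```

Passing `now` in explicitly keeps the flush logic deterministic and easy to test; a long-running job would supply the current time itself.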
ETL Jobs
[Diagram: in the Databricks deployment, two Databricks Jobs read raw records from the Databricks Filesystem and write ETL tables (Parquet) to the Databricks Filesystem.]
• Delta job (every 10 mins): new files, current day, no dedup, append
• Batch job (daily): all files, previous day, dedup, overwrite
ETL Jobs
• Use the same code for Delta and Batch jobs
• Run as scheduled Databricks jobs
• Use spot instances and fall back to on-demand
• Deliver to Databricks as parquet tables
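Sharing code between the two jobs can be pictured as one transform function called from two thin wrappers. The slides pair the delta job with append and no dedup, and the batch job with dedup and overwrite. The sketch below follows that split on plain Python lists; the record fields (`id`, `service`, `message`) and the function names are hypothetical, and the real jobs run on Apache Spark rather than in-memory lists.

```python
import json


def parse_record(raw_line):
    """Shared transform: parse one raw JSON record into a flat row (fields are hypothetical)."""
    rec = json.loads(raw_line)
    return {"id": rec["id"], "service": rec["service"], "message": rec["message"]}


def delta_job(new_raw_lines, table):
    """Frequent job: transform only the new records and append them, without dedup."""
    table.extend(parse_record(line) for line in new_raw_lines)
    return table


def batch_job(all_raw_lines):
    """Daily job: reprocess the whole day, dedup by id, and return a replacement table."""
    seen = {}
    for line in all_raw_lines:
        row = parse_record(line)
        seen.setdefault(row["id"], row)  # keep the first copy of each id
    return list(seen.values())
```

The delta output is available quickly but may contain duplicates from at-least-once delivery; the batch output replaces it once the day is complete.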
Lessons Learned
- Partition Pruning can save a lot of time and money
Reduced query time from 2800 seconds to just 15 seconds.
Don’t partition on too many levels, as that makes metadata discovery slower and more expensive.
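Partition pruning works because the partition value is encoded in the file path, so a query that filters on the partition column can skip whole directories without opening any file. The sketch below shows the idea on a list of paths; `prune_partitions` and the `date=...` layout are illustrative assumptions, not Spark's implementation.

```python
def prune_partitions(paths, column, wanted_values):
    """Keep only files whose `column=value` path segment matches a wanted value."""
    wanted = set(wanted_values)
    kept = []
    for path in paths:
        for segment in path.split("/"):
            key, sep, value = segment.partition("=")
            if sep and key == column and value in wanted:
                kept.append(path)
                break
    return kept
```

For a table partitioned by date, a filter on a single day reduces the files to scan to that day's directory, which is the kind of reduction behind the query-time improvement reported on the slide.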
Lessons Learned
- High S3 costs: Lots of LIST Requests
Metadata discovery on S3 is expensive. Spark SQL tries to refresh its
metadata cache even after write operations.
Running It All in Databricks - Jobs
Running It All in Databricks - Spark
Data Analysis & Tools
We get the data in. What’s next?
● Monitoring
● Debugging
● Usage Analysis
● Product Design (A/B testing)
Debugging
Access to logs in a matter of seconds thanks to Apache Spark.
Monitoring
Monitor logs by log level. Bug introduced on 2016-05-26 01:00:00 UTC. Fix deployed in 2 hours.
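Monitoring by log level amounts to counting log lines per time bucket and level, so that a jump in errors stands out. A minimal sketch follows, assuming a hypothetical line layout of `YYYY-MM-DD HH:MM:SS LEVEL message`; the slides do not show the actual log format or the query used.

```python
from collections import Counter


def count_by_hour_and_level(lines):
    """Count log lines per (hour, level); assumes 'YYYY-MM-DD HH:MM:SS LEVEL message'."""
    counts = Counter()
    for line in lines:
        parts = line.split(" ", 3)
        if len(parts) < 3:
            continue  # skip lines that do not match the assumed layout
        date, time, level = parts[0], parts[1], parts[2]
        counts[(f"{date} {time[:2]}:00", level)] += 1
    return counts
```

Plotting the ERROR counts per hour from such a result is one way a regression introduced at a given hour becomes visible.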
Usage Analysis + Product Design
SparkR + ggplot2 = Match made in heaven
Summary
Databricks + Apache Spark create a unified platform for:
- ETL
- Data Warehousing
- Data Analysis
- Real-time analytics
DevOps issues are off the table:
- No need to manage a huge cluster
- Jobs are isolated, they don’t cannibalize each other’s resources
- Can launch any Spark version
Ongoing & Future Work
Structured Streaming
- Reduce Complexity of pipeline:
Sync Daemon + Delta + Batch Jobs => Single Streaming Job
- Reduce Latency
Availability of data in seconds instead of minutes
- Event Time Dashboards
Try Apache Spark with Databricks
http://databricks.com/try
Thank you.
Have questions about ETL with Spark?
Join us at the Databricks Booth, 3:45-6:00 pm!
