SparkSQL:
A Compiler from Queries to RDDs
Sameer Agarwal
Spark Summit | Boston | February 9th 2017
About Me
• Software Engineer at Databricks (Spark Core/SQL)
• PhD in Databases (AMPLab, UC Berkeley)
• Research on BlinkDB (Approximate Queries in Spark)
Background: What is an RDD?
• Dependencies
• Partitions
• Compute function: Partition => Iterator[T]

To the engine, the compute function is opaque computation and the records it produces are opaque data: Spark cannot look inside either one to optimize the job.
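As a minimal sketch of this interface (simplified, not Spark's actual class hierarchy), an RDD boils down to three pieces, and both the function and the records it yields are black boxes to Spark:

// Toy sketch of the RDD abstraction (illustrative, not Spark's real API).
trait Partition { def index: Int }

trait RDD[T] {
  def dependencies: Seq[RDD[_]]           // lineage: which RDDs this one is derived from
  def partitions: Array[Partition]        // how the dataset is split across the cluster
  def compute(p: Partition): Iterator[T]  // opaque function producing opaque records
}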
RDD Programming Model
Construct an execution DAG using low-level RDD operators.
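For example, a word-count pipeline built from low-level operators (a hedged sketch assuming a SparkContext named sc and a hypothetical input path): each call adds a node to the DAG, and every closure is a black box to Spark.

// Each operator extends the DAG; Spark cannot look inside the closures.
val counts = sc.textFile("hdfs://...")            // hypothetical input path
  .flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)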
SQL/Structured Programming Model
• High-level APIs (SQL, DataFrame/Dataset): Programs describe what data operations are needed without specifying how to execute these operations
• More efficient: An optimizer can automatically find the most efficient plan to execute a query
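The same word count in the structured API (a sketch assuming a SparkSession named spark): the program states what to compute, and the optimizer decides how.

import org.apache.spark.sql.functions._

// Declarative version: Spark sees named columns and operations, not closures,
// so Catalyst is free to reorder, prune, and optimize.
val counts = spark.read.textFile("hdfs://...")    // hypothetical input path
  .select(explode(split(col("value"), " ")).as("word"))
  .groupBy("word")
  .count()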
Spark SQL Overview

SQL AST / DataFrame / Dataset → Query Plan → Optimized Query Plan → RDDs

Catalyst handles the plan-to-plan transformations, representing users' programs as trees; Tungsten handles the final step from optimized plan to RDDs.
How Catalyst Works: An Overview

SQL AST / DataFrame / Dataset → Query Plan → Optimized Query Plan → RDDs

Catalyst represents users' programs as trees and rewrites them through successive transformations.
Trees: Abstractions of Users’ Programs
SELECT sum(v)
FROM (
SELECT
t1.id,
1 + 2 + t1.value AS v
FROM t1 JOIN t2
WHERE
t1.id = t2.id AND
t2.id > 50 * 1000) tmp
Expression
• An expression represents a new value, computed based on input values
• e.g. 1 + 2 + t1.value
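In the DataFrame API the same tree can be built and inspected directly (a hedged sketch; Column.expr exposes the underlying Catalyst expression):

import org.apache.spark.sql.functions._

// lit and col build Catalyst expression trees under the hood.
val v = lit(1) + lit(2) + col("value")
println(v.expr)  // prints the underlying tree, e.g. ((1 + 2) + 'value)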
Query Plan

Aggregate [sum(v)]
└─ Project [t1.id, 1+2+t1.value AS v]
   └─ Filter [t1.id = t2.id AND t2.id > 50*1000]
      └─ Join
         ├─ Scan (t1)
         └─ Scan (t2)
Logical Plan
• A Logical Plan describes computation on datasets without defining how to conduct the computation
• The tree above is a logical plan: it says join, filter, project, and aggregate, but not which join algorithm or data formats to use
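Each stage of this pipeline is inspectable on any DataFrame (a hedged sketch assuming a DataFrame named df):

// QueryExecution exposes the plan at each stage of Catalyst's pipeline.
df.queryExecution.logical        // logical plan, straight from the parser/API
df.queryExecution.optimizedPlan  // after Catalyst's optimization rules
df.queryExecution.executedPlan   // physical plan selected for execution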
Physical Plan
• A Physical Plan describes computation on datasets with specific definitions on how to conduct the computation

Hash-Aggregate [sum(v)]
└─ Project [t1.id, 1+2+t1.value AS v]
   └─ Filter [t1.id = t2.id AND t2.id > 50*1000]
      └─ Sort-Merge Join
         ├─ Parquet Scan (t1)
         └─ JSON Scan (t2)
How Catalyst Works: An Overview

SQL AST / DataFrame / Dataset (Java/Scala) → Query Plan → Optimized Query Plan → RDDs

Catalyst's transformations rewrite the trees that represent users' programs.
Transform
• A function associated with every tree, used to implement a single rule

Before: 1 + 2 + t1.value (evaluates 1 + 2 for every row)

Add
├─ Add
│  ├─ Literal(1)
│  └─ Literal(2)
└─ Attribute(t1.value)

After: 3 + t1.value (evaluates 1 + 2 once)

Add
├─ Literal(3)
└─ Attribute(t1.value)
Transform
• A transform is defined as a partial function
• Partial function: a function that is defined for only a subset of its possible arguments

val expression: Expression = ...
expression.transform {
  case Add(Literal(x, IntegerType), Literal(y, IntegerType)) =>
    Literal(x + y)
}

The case clause determines whether the partial function is defined for a given input.
Transform

Applying the rule to 1 + 2 + t1.value matches the inner Add(Literal(1), Literal(2)) node and rewrites it:

val expression: Expression = ...
expression.transform {
  case Add(Literal(x, IntegerType), Literal(y, IntegerType)) =>
    Literal(x + y)
}

Add                           Add
├─ Add                  =>    ├─ Literal(3)
│  ├─ Literal(1)              └─ Attribute(t1.value)
│  └─ Literal(2)
└─ Attribute(t1.value)

Result: 3 + t1.value
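The same mechanism in a self-contained form (a toy expression tree, not Catalyst's real classes; the rule is applied bottom-up here for simplicity):

// Toy Catalyst-style trees: transform recursively applies a partial function.
sealed trait Expr {
  def transform(rule: PartialFunction[Expr, Expr]): Expr = {
    // Rewrite children first, then try the rule on the resulting node.
    val rewritten = this match {
      case Add(l, r) => Add(l.transform(rule), r.transform(rule))
      case leaf      => leaf
    }
    if (rule.isDefinedAt(rewritten)) rule(rewritten) else rewritten
  }
}
case class Literal(value: Int) extends Expr
case class Attribute(name: String) extends Expr
case class Add(left: Expr, right: Expr) extends Expr

// 1 + 2 + t1.value
val expr: Expr = Add(Add(Literal(1), Literal(2)), Attribute("t1.value"))

val folded = expr.transform {
  case Add(Literal(x), Literal(y)) => Literal(x + y)  // constant-folding rule
}
// folded == Add(Literal(3), Attribute("t1.value")), i.e. 3 + t1.value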
Combining Multiple Rules

Predicate Pushdown: the filter on t2 alone is moved below the join, so fewer rows are joined.

Before:
Aggregate [sum(v)]
└─ Project [t1.id, 1+2+t1.value AS v]
   └─ Filter [t1.id = t2.id AND t2.id > 50*1000]
      └─ Join
         ├─ Scan (t1)
         └─ Scan (t2)

After:
Aggregate [sum(v)]
└─ Project [t1.id, 1+2+t1.value AS v]
   └─ Join [t1.id = t2.id]
      ├─ Scan (t1)
      └─ Filter [t2.id > 50*1000]
         └─ Scan (t2)
Combining Multiple Rules

Constant Folding: 1 + 2 becomes 3 and 50 * 1000 becomes 50000 at planning time.

Aggregate [sum(v)]
└─ Project [t1.id, 3+t1.value AS v]
   └─ Join [t1.id = t2.id]
      ├─ Scan (t1)
      └─ Filter [t2.id > 50000]
         └─ Scan (t2)
Combining Multiple Rules

Column Pruning: only the columns each side of the join actually needs are read.

Aggregate [sum(v)]
└─ Project [t1.id, 3+t1.value AS v]
   └─ Join [t1.id = t2.id]
      ├─ Project [t1.id, t1.value]
      │  └─ Scan (t1)
      └─ Filter [t2.id > 50000]
         └─ Project [t2.id]
            └─ Scan (t2)
Combining Multiple Rules

Before transformations:
Aggregate [sum(v)]
└─ Project [t1.id, 1+2+t1.value AS v]
   └─ Filter [t1.id = t2.id AND t2.id > 50*1000]
      └─ Join
         ├─ Scan (t1)
         └─ Scan (t2)

After transformations:
Aggregate [sum(v)]
└─ Project [t1.id, 3+t1.value AS v]
   └─ Join [t1.id = t2.id]
      ├─ Project [t1.id, t1.value]
      │  └─ Scan (t1)
      └─ Filter [t2.id > 50000]
         └─ Project [t2.id]
            └─ Scan (t2)
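You can watch these rules fire on any query (a hedged sketch assuming tables t1 and t2 are registered in a SparkSession named spark):

// Prints the parsed, analyzed, and optimized logical plans plus the physical
// plan; pushdown, folding, and pruning are visible in the optimized plan.
spark.sql("""
  SELECT sum(v)
  FROM (SELECT t1.id, 1 + 2 + t1.value AS v
        FROM t1 JOIN t2
        WHERE t1.id = t2.id AND t2.id > 50 * 1000) tmp
""").explain(true)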
Spark SQL Overview

SQL AST / DataFrame / Dataset → Query Plan → Optimized Query Plan → RDDs

Catalyst covers the plan-to-plan transformations over trees; Tungsten covers the final step from optimized plan to RDDs.
Scan → Filter → Project → Aggregate

select count(*) from store_sales
where ss_item_sk = 1000
G. Graefe, Volcano - An Extensible and Parallel Query Evaluation System. In IEEE Transactions on Knowledge and Data Engineering, 1994
Volcano Iterator Model
• Standard for 30 years: almost all databases do it
• Each operator is an "iterator" that consumes records from its input operator

class Filter(
    child: Operator,
    predicate: (Row => Boolean))
  extends Operator {

  def next(): Row = {
    // Pull rows from the child until one satisfies the predicate,
    // or the child is exhausted (signalled here by null).
    var current = child.next()
    while (current != null && !predicate(current)) {
      current = child.next()
    }
    current
  }
}
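To make the per-row cost concrete, here is a hedged sketch of the missing pieces around the Filter above (a toy Operator and Scan, not Spark's classes, to be defined before Filter): pulling each row through the pipeline costs at least one virtual next() call per operator.

// Toy Volcano-style pieces (illustrative only; Row is a boxed Long here
// so that null can mark end-of-input).
type Row = java.lang.Long

abstract class Operator { def next(): Row }

class Scan(data: Iterator[Row]) extends Operator {
  def next(): Row = if (data.hasNext) data.next() else null
}

// One chain of virtual calls per row.
val filtered = new Filter(new Scan(Iterator[Row](1L, 1000L, 7L)), (r: Row) => r == 1000L)
var row = filtered.next()
while (row != null) { println(row); row = filtered.next() }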
Downsides of the Volcano Model
1. Too many virtual function calls
o at least 3 calls for each row in Aggregate
2. Extensive memory access
o “row” is a small segment in memory (or in L1/L2/L3 cache)
3. Can’t take advantage of modern CPU features
o SIMD, pipelining, prefetching, branch prediction, ILP, instruction
cache, …
Whole-stage Codegen: Spark as a "Compiler"

For the pipeline Scan → Filter → Project → Aggregate, the generated code collapses all four operators into a single loop:

long count = 0;
for (long ss_item_sk : store_sales) {
  if (ss_item_sk == 1000) {
    count += 1;
  }
}
Whole-stage Codegen
• Fusing operators together so the generated code looks like hand-optimized code:
- Identify chains of operators ("stages")
- Compile each stage into a single function
- Functionality of a general-purpose execution engine; performance as if the system were hand-built just to run your query

T. Neumann, Efficiently compiling efficient query plans for modern hardware. In VLDB 2011
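Whole-stage codegen is visible in the physical plan: since Spark 2.0, fused operators are prefixed with * in explain output. A hedged sketch:

// Operators marked with '*' run inside a single generated function
// (one fused "stage").
spark.range(1000)
  .filter("id = 500")
  .selectExpr("count(*)")
  .explain()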
Putting it All Together
Operator Benchmarks: Cost/Row (ns)
5-30x speedups
Operator Benchmarks: Cost/Row (ns)
Radix sort: 10-100x speedups
Operator Benchmarks: Cost/Row (ns)
Shuffling is still the bottleneck
Operator Benchmarks: Cost/Row (ns)
10x speedup
TPC-DS (Scale Factor 1500, 100 cores)
[Chart: query time per TPC-DS query, Spark 2.0 vs. Spark 1.6; lower is better]
What’s Next?
Spark 2.2 and beyond
1. SPARK-16026: Cost-Based Optimizer
- Leverage table/column-level statistics to optimize joins and aggregates
- Statistics Collection Framework (Spark 2.1); see the sketch after this list
- Cost-Based Optimizer (Spark 2.2)
2. Boosting Spark's Performance on Many-Core Machines
- In-memory / single-node shuffle
3. Improving the quality of generated code and better integration with the in-memory column format in Spark
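The statistics that feed the cost-based optimizer are collected explicitly. A hedged sketch of the Spark 2.1-era commands (table t1 and its columns are illustrative):

// Table-level statistics: row count and size in bytes.
spark.sql("ANALYZE TABLE t1 COMPUTE STATISTICS")
// Column-level statistics (distinct count, min/max, null count) used to
// cost joins and aggregates.
spark.sql("ANALYZE TABLE t1 COMPUTE STATISTICS FOR COLUMNS id, value")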
Thank you.