Apache Spark
Syed
Solutions Engineer - Big Data
mail.syed786@gmail.com
info.syedacademy@gmail.com
+91-9030477368
Spark SQL: Relational Data Processing in Spark
Challenges and Solutions
Challenges:
• Perform ETL to and from various (semi- or unstructured) data sources.
• Perform advanced analytics (e.g. machine learning, graph processing) that are hard to express in relational systems.
Solutions:
• A DataFrame API that can perform relational operations on both external data sources and Spark’s built-in RDDs.
• A highly extensible optimizer, Catalyst, that uses features of Scala to add composable rules, control code generation, and define extensions.
About Spark SQL
• Part of the core distribution since Spark 1.0 (April 2014)
• Runs SQL / HiveQL queries, optionally alongside or replacing existing Hive deployments
[Charts: # of commits per month and # of contributors]
Improvement upon Existing Art
• Engine does not understand the structure of the data in RDDs or the semantics of user functions → limited optimization.
• Can only be used to query external data in the Hive catalog → limited data sources.
• Can only be invoked via SQL strings from Spark → error prone.
• Hive optimizer is tailored for MapReduce → difficult to extend.
Programming Interface
DataFrame
• A distributed collection of rows with the same schema (RDDs suffer from type erasure).
• Can be constructed from external data sources or from RDDs; essentially an RDD of Row objects (called SchemaRDD before Spark 1.3).
• Supports relational operators (e.g. where, groupBy) as well as Spark operations.
• Evaluated lazily → an unmaterialized logical plan (see the sketch below).
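A minimal PySpark sketch of both construction routes and of lazy evaluation. It uses the modern SparkSession entry point (the deck's sqlCtx is the Spark 1.x equivalent); the JSON path is a placeholder:

from pyspark.sql import SparkSession, Row

spark = SparkSession.builder.appName("dataframe-demo").getOrCreate()

# From an external data source (placeholder path)
df_json = spark.read.json("people.json")

# From an RDD of Row objects
rdd = spark.sparkContext.parallelize([Row(name="Ann", age=34),
                                      Row(name="Bo", age=21)])
df = spark.createDataFrame(rdd)

# Nothing executes yet: where() just extends the logical plan
adults = df.where(df.age > 18)
adults.show()  # the action materializes the plan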
Data Model
• Nested data model.
• Supports both primitive SQL types (boolean, integer, double, decimal, string, date, timestamp) and complex types (structs, arrays, maps, and unions); also user-defined types.
• First-class support for complex data types (example schema below).
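A sketch of a nested schema in PySpark; the field names are hypothetical and spark is the session from the earlier sketch:

from pyspark.sql.types import (StructType, StructField, StringType,
                               ArrayType, MapType)

# A person with a nested address struct, a list of phone numbers,
# and a map of free-form attributes
schema = StructType([
    StructField("name", StringType()),
    StructField("address", StructType([
        StructField("city", StringType()),
        StructField("zip", StringType()),
    ])),
    StructField("phones", ArrayType(StringType())),
    StructField("attrs", MapType(StringType(), StringType())),
])

nested = spark.createDataFrame(
    [("Ann", ("Pune", "411001"), ["555-0100"], {"tier": "gold"})],
    schema,
)
nested.printSchema()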
DataFrame Operations
• Relational operations (select, where, join, groupBy) via a DSL.
• Operators take expression objects.
• Operators build up an abstract syntax tree (AST), which is then optimized by Catalyst.
• Alternatively, register the DataFrame as a temporary SQL table and query it with traditional SQL strings (both routes sketched below).
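Both routes in PySpark, assuming the df with name and age columns from the first sketch; createOrReplaceTempView is the modern name for Spark 1.x's registerTempTable:

from pyspark.sql import functions as F

# DSL: operators take expression objects and build an AST for Catalyst
young = df.where(F.col("age") < 21).select("name")

# Or register as a temp table and use SQL strings
df.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people WHERE age < 21")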
Advantages over Relational Query Languages
• Holistic optimization across functions composed in different languages.
• Control structures (e.g. if, for).
• Logical plan analyzed eagerly → identifies schema-related code errors on the fly (see below).
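Eager analysis in action: referencing a nonexistent column fails when the plan is built, not when an action runs. A small sketch (AnalysisException lives in pyspark.sql.utils in most versions):

from pyspark.sql.utils import AnalysisException

try:
    df.select("nmae")   # misspelled column; no action invoked
except AnalysisException as e:
    print("caught at plan-building time:", e)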
Querying Native Datasets
• Infer column names and types directly from data objects (via reflection in Java and Scala, and data sampling in Python, which is dynamically typed).
• Native objects are accessed in place to avoid expensive data format transformations.
• Benefits:
• Run relational operations on existing Spark programs.
• Combine RDDs with external structured data.
• Columnar storage, with hot columns cached in memory (sketch below).
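A sketch of schema inference over native Python objects, plus in-memory columnar caching of a hot table, continuing from the earlier sketches (the view name is illustrative):

rows = spark.sparkContext.parallelize([Row(name="Ann", age=34),
                                       Row(name="Bo", age=21)])
inferred = spark.createDataFrame(rows)   # schema inferred from the objects
inferred.printSchema()                   # name: string, age: long

# Cache for repeated queries; Spark caches tables in a columnar format
inferred.createOrReplaceTempView("people_native")
spark.catalog.cacheTable("people_native")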
Plan Optimization & Execution
SQL AST or DataFrame → Unresolved Logical Plan → (Analysis, using the Catalog) → Logical Plan → (Logical Optimization) → Optimized Logical Plan → (Physical Planning) → Physical Plans → (Cost Model) → Selected Physical Plan → (Code Generation) → RDDs
DataFrames and SQL share the same optimization/execution pipeline.
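Each stage of this pipeline can be inspected from PySpark; the stage headers in the comments are illustrative of what explain prints:

spark.sql("SELECT name FROM people WHERE age > 21").explain(extended=True)
# == Parsed Logical Plan ==     (unresolved)
# == Analyzed Logical Plan ==   (attributes resolved via the catalog)
# == Optimized Logical Plan ==  (after Catalyst's rule-based optimization)
# == Physical Plan ==           (the selected physical plan)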
Analysis: Unresolved Logical Plan → Logical Plan (using the Catalog)
• An attribute is unresolved if its type is not known or it is not matched to an input table (e.g. col in SELECT col FROM sales).
• To resolve attributes:
• Look up relations by name in the catalog.
• Map named attributes to the input provided by the operator’s children.
• Assign a unique ID to references to the same value.
• Propagate and coerce types through expressions (e.g. 1 + col).
Logical Optimization: Logical Plan → Optimized Logical Plan
• Applies standard rule-based optimizations: constant folding, predicate pushdown, projection pruning, null propagation, boolean expression simplification, etc.
• ~800 lines of code.
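Constant folding is easy to observe from PySpark: Catalyst evaluates the constant expression at planning time, so the physical plan should show the folded literal rather than an addition:

from pyspark.sql import functions as F

df.select((F.lit(1) + F.lit(2)).alias("x")).explain()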
Physical Planning: Optimized Logical Plan → Physical Plans → Selected Physical Plan
• Generates one or more physical plans from the optimized logical plan, e.g. pipelining projections and filters into a single map.
• A cost model selects among the candidate physical plans.
Example: Predicate Pushdown and Column Pruning
• Logical Plan: Filter → Join(events file, users table)
• Physical Plan: Join(Scan(events), Filter(Scan(users))) — the filter is pushed below the join
• With predicate pushdown and column pruning: Join(OptimizedScan(events), OptimizedScan(users)) — predicates and required columns are pushed into the scans themselves
An Example Catalyst Transformation
1. Find filters on top of projections.
2. Check that the filter can be evaluated without the result of the project.
3. If so, switch the operators.
Original Plan: Project(name) → Filter(id = 1) → Project(id, name) → People
After Filter Push-Down: Project(name) → Project(id, name) → Filter(id = 1) → People
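Catalyst itself is written in Scala and pattern-matches over its own tree classes; here is a toy Python sketch of the same rule shape, on hypothetical miniature plan nodes:

from dataclasses import dataclass

@dataclass
class Project:
    columns: list
    child: object

@dataclass
class Filter:
    column: str      # the column the predicate needs
    child: object

def push_filter_below_project(plan):
    # Rule: a Filter on top of a Project can be swapped below it when the
    # filter only needs a column that is still available under the Project.
    if (isinstance(plan, Filter) and isinstance(plan.child, Project)
            and plan.column in plan.child.columns):
        proj = plan.child
        return Project(proj.columns, Filter(plan.column, proj.child))
    return plan

plan = Filter("id", Project(["id", "name"], "People"))
print(push_filter_below_project(plan))
# Project(columns=['id', 'name'], child=Filter(column='id', child='People'))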
Code Generation
• Relies on Scala’s quasiquotes to simplify code generation.
• Catalyst transforms a SQL expression tree into an abstract syntax tree (AST) for Scala code that evaluates the expression, then compiles and runs the generated code.
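In recent Spark versions (3.0+), the generated code for a query can be printed from PySpark; note that modern whole-stage codegen emits Java rather than the quasiquote-based Scala of early releases:

spark.sql("SELECT name FROM people WHERE age > 21").explain(mode="codegen")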
Spark SQL: Declarative Big Data Processing
Let developers create and run Spark programs faster:
• Write less code
• Read less data
• Let the optimizer do the hard work
Write Less Code: Compute an Average

Using RDDs:
data = sc.textFile(...).map(lambda line: line.split("\t"))
data.map(lambda x: (x[0], [int(x[1]), 1])) \
    .reduceByKey(lambda x, y: [x[0] + y[0], x[1] + y[1]]) \
    .map(lambda x: [x[0], x[1][0] / x[1][1]]) \
    .collect()

Using DataFrames:
from pyspark.sql.functions import avg
sqlCtx.table("people") \
    .groupBy("name") \
    .agg(avg("age")) \
    .collect()

Using SQL:
SELECT name, avg(age)
FROM people
GROUP BY name

Using Pig:
P = load '/people' as (name, age);
G = group P by name;
R = foreach G generate … AVG(G.age);
Extensible Input & Output
Spark’s Data Source API allows optimizations like column pruning and filter pushdown into custom data sources.
Built-in and external sources include JSON, JDBC, and more… (sketch below)
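A sketch of the unified reader API for two of these sources; the path, URL, and table name are placeholders:

events = spark.read.json("events.json")

users = (spark.read.format("jdbc")
         .option("url", "jdbc:postgresql://host/db")
         .option("dbtable", "users")
         .load())

# Column pruning + filter pushdown: only the selected column and the
# matching rows need to be produced by the JDBC source.
users.select("name").where(users.age > 21).explain()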
Dataset
A Dataset is a strongly typed collection of domain-specific objects that can be transformed in parallel using functional or relational operations. Each Dataset also has an untyped view called a DataFrame, which is a Dataset of Row.
Operations available on Datasets are divided into transformations and actions. Transformations produce new Datasets; actions trigger computation and return results. Example transformations include map, filter, select, and aggregate (groupBy). Example actions include count, show, and writing data out to file systems.
Datasets are "lazy", i.e. computations are only triggered when an action is invoked. Internally, a Dataset represents a logical plan that describes the computation required to produce the data. When an action is invoked, Spark's query optimizer optimizes the logical plan and generates a physical plan for efficient execution in a parallel and distributed manner. To explore the logical plan as well as the optimized physical plan, use the explain function.
To efficiently support domain-specific objects, an Encoder is required. The encoder maps the domain-specific type T to Spark's internal type system. For example, given a class Person with two fields, name (string) and age (int), an encoder tells Spark to generate code at runtime to serialize a Person object into a binary structure. This binary structure often has a much lower memory footprint and is optimized for efficiency in data processing (e.g. in a columnar format). To understand the internal binary representation of the data, use the schema function.
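The typed Dataset API exists only in Scala and Java; a PySpark sketch using the untyped DataFrame view (a Dataset of Row) to illustrate transformations vs. actions, explain, and schema:

ds = spark.range(1000)              # a Dataset of longs, in Scala terms
evens = ds.filter(ds.id % 2 == 0)   # transformation: builds the plan lazily

print(evens.count())                # action: triggers optimization + execution
evens.explain()                     # logical and physical plans
print(evens.schema)                 # schema of the internal representation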
Thank you!
www.syedacademy.com
mail.syed786@gmail.com
info.syedacademy@gmail.com
+91-9030477368