SPARK SQL
Relational Data Processing in Spark
Junting Lou
lou8@illinois.edu
Earlier Attempts
■ MapReduce
– Powerful, low-level, procedural programming interface.
– Onerous and requires manual optimization
■ Pig, Hive, Dremel, Shark
– Take advantage of declarative queries to provide richer automatic
optimizations.
– The relational approach alone is insufficient for many big data applications:
■ ETL to/from semi-/unstructured data sources (e.g. JSON) requires custom code
■ Advanced analytics (ML and graph processing) are challenging to express in a relational system.
Spark SQL (2014)
■ A new module in Apache Spark that integrates relational processing with Spark’s
functional programming API.
■ Offers much tighter integration between relational and procedural processing through a declarative DataFrame API.
■ Includes a highly extensible optimizer, Catalyst, that makes it easy to add data
sources, optimization rules, and data types.
Apache Spark (2010)
■ General cluster computing system.
■ One of the most widely-used systems with a “language-integrated” API.
■ One of the most active open source projects for big data processing.
■ Manipulates (e.g., map, filter, reduce) distributed collections called Resilient Distributed Datasets (RDDs).
■ RDDs are evaluated lazily.
Scala RDD Example:
Counts lines containing “ERROR” in an HDFS file
val lines = spark.textFile("hdfs://...")              // spark: the SparkContext; transformations are lazy
val errors = lines.filter(s => s.contains("ERROR"))
println(errors.count())                                // count() is the output operation that triggers work
■ Each RDD (lines, errors) represents a “logical plan” to compute a dataset, but Spark waits until certain output operations, such as count, to launch a computation.
■ Spark will pipeline reading the lines, applying the filter, and computing the count.
■ No intermediate materialization needed.
■ Useful but limited.
Shark
■ First effort to build a relational interface on Spark.
■ Modified the Apache Hive system to run on Spark and added traditional RDBMS optimizations.
■ Showed good performance and opportunities for integration with Spark programs.
■ Challenges
– Could only query external data stored in the Hive catalog, and was thus not useful for relational queries on data inside a Spark program (e.g., the errors RDD above).
– Inconvenient and error-prone to work with.
– The Hive optimizer was tailored for MapReduce and was difficult to extend.
Goals for Spark SQL
■ Support relational processing both within Spark programs and on external data sources, using a programmer-friendly API.
■ Provide high performance using established DBMS techniques.
■ Easily support new data sources, including semi-structured data and external
databases amenable to query federation.
■ Enable extension with advanced analytics algorithms such as graph processing and
machine learning.
Programming Interface
DataFrame API:
val ctx = new HiveContext(sc)                 // sc: an existing SparkContext
val users = ctx.table("users")
val young = users.where(users("age") < 21)
println(young.count())
■ A DataFrame is equivalent to a table in a relational database
■ Can be manipulated in similar ways to “native” RDDs.
Data Model
■ Uses a nested data model based on Hive for tables and DataFrames
– Supports all major SQL data types
■ Supports user-defined types
■ Able to model data from a variety of sources and formats (e.g., Hive, relational databases, JSON, and native objects in Java/Scala/Python); see the schema sketch below.
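A minimal sketch of such a nested schema expressed with Spark SQL's type API (the field names are illustrative, not from the slides):
import org.apache.spark.sql.types._

// Nested data model: structs, arrays, and maps, mirroring Hive's type system
val userSchema = StructType(Seq(
  StructField("name", StringType),
  StructField("age", IntegerType),
  StructField("addresses", ArrayType(StructType(Seq(
    StructField("city", StringType),
    StructField("zip", StringType))))),
  StructField("prefs", MapType(StringType, StringType))
))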
DataFrame Operations
employees
  .join(dept, employees("deptId") === dept("id"))
  .where(employees("gender") === "female")
  .groupBy(dept("id"), dept("name"))
  .agg(count("name"))

users.where(users("age") < 21)
  .registerTempTable("young")
ctx.sql("SELECT count(*), avg(age) FROM young")
■ All of these operators build up an abstract syntax tree (AST) of the expression, which is
then passed to Catalyst for optimization.
■ The DataFrames registered in the catalog can still be unmaterialized views, so that
optimizations can happen across SQL and the original DataFrame expressions.
■ Integration in a full programming language (DataFrames can be passed between languages but still benefit from optimization across the whole plan).
Querying Native Datasets
■ Allows users to construct DataFrames directly against RDDs of objects native to the programming
language.
■ Automatically infers the schema and types of the objects.
■ Accesses the native objects in place, extracting only the fields used in each query (avoiding expensive conversions).
case class User(name: String, age: Int)
// Create an RDD of User objects
val usersRDD = spark.parallelize(List(User("Alice", 22), User("Bob", 19)))
// View the RDD as a DataFrame (in Spark 1.3+ this requires import ctx.implicits._)
val usersDF = usersRDD.toDF
// Relational operations now run directly against the native objects
println(usersDF.where(usersDF("age") < 21).count())
In-Memory Caching
The columnar cache can reduce the memory footprint by an order of magnitude.
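For example (a small sketch, assuming the ctx context and the "young" temporary table registered on the DataFrame Operations slide):
// Cache the table in the in-memory columnar format
ctx.cacheTable("young")
// Subsequent queries are served from the compressed columnar cache
ctx.sql("SELECT avg(age) FROM young").show()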
User-Defined Functions
Spark SQL supports inline definition of UDFs, avoiding the complicated packaging and registration process found in other database systems.
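A minimal sketch of an inline UDF (the function name and predicate are illustrative):
// Register a Scala closure as a UDF callable from SQL
ctx.udf.register("isMinor", (age: Int) => age < 21)
ctx.sql("SELECT name FROM users WHERE isMinor(age)")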
Catalyst Optimizer
■ Based on functional programming constructs in Scala.
■ Easy to add new optimization techniques and features,
– Especially to tackle various problems when dealing with “big data” (e.g., semi-structured data and advanced analytics)
■ Enable external developers to extend the optimizer.
– Data source specific rules that can push filtering or aggregation into external
storage systems
– Support for new data types
■ Supports rule-based and cost-based optimization
■ First production-quality query optimizer built on such a language (Scala).
Trees
Scala code for the tree of the expression x + (1 + 2): Add(Attribute(x), Add(Literal(1), Literal(2)))
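The paper models such trees with ordinary Scala case classes; a simplified sketch (the real node classes live in Catalyst's own packages):
abstract class TreeNode
case class Literal(value: Int) extends TreeNode          // a constant
case class Attribute(name: String) extends TreeNode      // a column of the input row
case class Add(left: TreeNode, right: TreeNode) extends TreeNode

// The tree for x + (1 + 2)
val tree = Add(Attribute("x"), Add(Literal(1), Literal(2)))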
Rules
■ Trees can be manipulated using rules, which are functions from a tree to another
tree.
– Use a set of pattern matching functions that find and replace subtrees with a
specific structure.
– tree.transform {
    case Add(Literal(c1), Literal(c2)) => Literal(c1 + c2)
  }
– tree.transform {
    case Add(Literal(c1), Literal(c2)) => Literal(c1 + c2)
    case Add(left, Literal(0)) => left
    case Add(Literal(0), right) => right
  }
■ Catalyst groups rules into batches, and executes each batch until it reaches a fixed point (see the sketch below).
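A self-contained sketch of what such a rule does, using the toy TreeNode classes above and a hand-rolled bottom-up traversal in place of Catalyst's transform; rerunning the rules until the tree stops changing is what “reaching a fixed point” means:
def simplify(node: TreeNode): TreeNode = node match {
  case Add(l, r) => (simplify(l), simplify(r)) match {
    case (Literal(a), Literal(b)) => Literal(a + b)   // constant folding
    case (left, Literal(0))       => left             // x + 0 => x
    case (Literal(0), right)      => right            // 0 + x => x
    case (left, right)            => Add(left, right)
  }
  case other => other
}

simplify(Add(Attribute("x"), Add(Literal(1), Literal(2))))
// => Add(Attribute("x"), Literal(3))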
Using Catalyst
Analysis
SELECT col FROM sales
■ Takes input from the SQL parser or from a DataFrame object
■ Unresolved attributes: not yet matched to an input table, or of unknown type
■ A Catalog object tracks the tables in all data sources
■ Around 1000 lines of rules
Logical Optimization
■ Applies standard rule-based optimizations to the logical plan
– Constant folding
– Predicate pushdown
– Projection pruning
– Null propagation
– Boolean expression simplification
– …
■ Extremely easy to add rules for specific situations (see the sketch below)
■ Around 800 lines of rules
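As an illustration of how small such a rule can be, here is a CombineFilters-style rewrite over toy plan nodes (not Catalyst's real classes):
abstract class Plan
case class Scan(table: String) extends Plan
case class Filter(pred: String, child: Plan) extends Plan

// Merge adjacent filters into a single predicate
def combineFilters(plan: Plan): Plan = plan match {
  case Filter(p1, Filter(p2, child)) => combineFilters(Filter(s"($p1) AND ($p2)", child))
  case Filter(p, child)              => Filter(p, combineFilters(child))
  case other                         => other
}

combineFilters(Filter("age < 21", Filter("gender = 'female'", Scan("users"))))
// => Filter("(age < 21) AND (gender = 'female')", Scan("users"))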
Physical Planning
■ Takes a logical plan and generates one or more physical plans (see the sketch below).
■ Cost-based
– Selects a plan using a cost model (currently only used to select the join algorithm)
■ Rule-based:
– Pipelining projections or filters into one Spark map operation
– Push operations from the logical plan into data sources that support predicate
or projection pushdown.
■ Around 500 lines of rules.
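From user code, the plan Catalyst ultimately selects can be inspected; a quick sketch, assuming the young DataFrame from the earlier example (exact output varies by Spark version):
// Prints the parsed, analyzed, and optimized logical plans plus the physical plan
young.explain(true)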
Code Generation
■ Generates Java bytecode to run on each machine.
■ Relies on Scala quasiquotes to programmatically construct ASTs for the generated code
■ Transforms a tree representing a SQL expression into an AST for Scala code that evaluates that expression.
■ Compiles (the Scala compiler applies further optimizations) and runs the generated code.
■ Around 700 lines of rules
// Quasiquotes (q"...") require: import scala.reflect.runtime.universe._
def compile(node: Node): AST = node match {
  case Literal(value) => q"$value"                                  // constant
  case Attribute(name) => q"row.get($name)"                         // field access on the input row
  case Add(left, right) => q"${compile(left)} + ${compile(right)}"  // recursive composition
}
Performance with quasiquote-based code generation (chart)
Extension Points
■ Catalyst’s design around composable rules makes it easy to extend.
■ Data Sources
– CSV, Avro, Parquet, etc. (see the sketch below)
■ User-Defined Types (UDTs)
– Mapping user-defined types to structures composed of Catalyst’s built-in types.
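A hedged sketch of a minimal external data source using the Spark 1.3-era sources API (the class name and column are made up for illustration):
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, TableScan}
import org.apache.spark.sql.types._

// A relation exposing a fixed one-column schema and a full-table scan;
// Catalyst can then plan queries over it like any other table.
class NumbersRelation(val sqlContext: SQLContext) extends BaseRelation with TableScan {
  override def schema: StructType = StructType(Seq(StructField("value", IntegerType)))
  override def buildScan(): RDD[Row] =
    sqlContext.sparkContext.parallelize(1 to 10).map(n => Row(n))
}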
Advanced Analytics Features
Specifically designed to handle “big data”
■ A schema inference algorithm for JSON and other semi-structured data (see the sketch below).
■ A new high-level API for Spark’s machine learning library.
■ Supports query federation, allowing a single program to efficiently query disparate
sources.
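For instance, JSON schema inference (a minimal sketch using the Spark 1.x jsonFile API; the path and field names are hypothetical):
// Scans the records and infers a nested schema automatically
val tweets = ctx.jsonFile("hdfs://.../tweets.json")
tweets.printSchema()
tweets.registerTempTable("tweets")
ctx.sql("SELECT user.name, text FROM tweets WHERE retweetCount > 10")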
Integration with Spark’s Machine Learning Library
SQL Performance
Conclusion
■ Extends Spark with a declarative DataFrame API to allow relational processing,
offering benefits such as automatic optimization, and letting users write complex
pipelines that mix relational and complex analytics.
■ Supports a wide range of features tailored to large-scale data analysis, including
semi-structured data, query federation, and data types for machine learning.

Editor's Notes

  • #2: In practice, most data pipelines would ideally be expressed with a combination of both relational queries and complex procedural algorithms; yet relational and procedural systems have remained largely disjoint.
  • #4: Resilient Distributed Datasets (RDD) is a fundamental data structure of Spark. It is an immutable distributed collection of objects. Each dataset in RDD is divided into logical partitions, which may be computed on different nodes of the cluster.
  • #5: because the engine does not understand the structure of the data in RDDs
  • #6: Hive: a data warehouse built on top of Apache Hadoop, providing data summarization, query, and analysis through a SQL-like interface over data stored in various databases and file systems that integrate with Hadoop.
  • #9: A DataFrame is equivalent to a table in a relational database, and can also be manipulated in similar ways to the “native” distributed collections in Spark (RDDs).
  • #11: support all common relational operators, including projection (select), filter (where), join, and aggregations (groupBy).
  • #12: Traditional object-relational mapping often incurs expensive conversions that translate an entire object into a different format.
  • #15: Each node has a node type and zero or more children
  • #17: (1) Analyzing a logical plan to resolve references, (2) logical plan optimization, (3) physical planning, and (4) code generation to compile parts of the query to Java bytecode. In the physical planning phase, Catalyst may generate multiple plans and compare them based on cost; all other phases are purely rule-based.
  • #18: The named attribute col is initially unresolved; assigning it a unique ID during analysis allows optimization of expressions such as col = col.
  • #19: Filtering and projection are pushed ahead of the join.
  • #21: Quasiquotes are parsed by the Scala compiler at compile time and represent ASTs for the code within.
  • #24: These features all build on the Catalyst framework.
  • #25: One-pass schema inference algorithm based on finding the “most specific supertype” of each field.
  • #26: The DataFrame API makes exchanging data between pipeline stages much easier and supports different languages.
  • #27: Most selective to least selective; Impala chooses a smarter join plan.
  • #28: The logical plan is constructed in Python, and all physical execution is compiled down into native Spark code as JVM bytecode. The DataFrame version avoids the expensive allocation of key-value pairs that occurs in hand-written Scala code, and avoids the cost of saving the whole result of the SQL query to HDFS.