THE PUSHDOWN
OF EVERYTHING
Stephan Kessler
Santiago Mola
Who are we?
Stephan Kessler
Developer @ SAP, Walldorf
o SAP HANA Vora team
o Integration of Vora query engine with
Apache Spark.
o Bringing new features and performance
improvements to Apache Spark.
o Before joining SAP: PhD and M.Sc. at the
Karlsruhe Institute of Technology.
o Research on privacy in databases and
sensor networks.
Santiago Mola
Developer @ Stratio, Madrid
o Working with the SAP HANA Vora team
o Focus on Apache Spark SQL extensions and data
sources implementation.
o Bootstrapped Stratio Sparkta, worked on Stratio
Ingestion and helped customers to build stream
processing solutions.
o Previously: CTO at Bitsnbrains, M.Sc. at Polytechnic
University of Valencia.
SAP HANA Vora
• SAP HANA Vora is a SQL-on-Hadoop solution based on:
– In-Memory columnar query execution engine with built-in query
compilation
– Spark SQL extensions (will be Open Source soon!):
• OLAP extensions
• Hierarchy queries
• Extended Data Sources API (‘Push Down Everything’)
Spark SQL and the Data Sources API
(Architecture diagram: MLlib, Streaming, … and Spark SQL sit on top of the Spark Core Engine; the Data Sources API connects Spark SQL to data sources such as CSV, HANA, and Vora.)
Motivation
• “The fastest way of processing data is not processing it at all!”
• The Data Sources API allows deferring the computation of filters and
projections to the ‘source’
– Less I/O spent reading
– Less memory spent
• But: Data Sources can also be full-blown databases
– Deferring parts of the logical plan leads to
additional benefits
→ The Pushdown of Everything
Pushed down:
– Project: Column1
– Filter: Column2 > 20
– Average: Column2
Implementing a Data Source
1. Create a ‘DefaultSource’ class that implements the trait
(Schema)RelationProvider
trait SchemaRelationProvider {
def createRelation(
sqlContext: SQLContext, parameters: Map[String, String],
schema: StructType): BaseRelation
}
2. The returned ‘BaseRelation’ can implement the following traits
– TableScan
– PrunedScan
– PrunedFilteredScan
Full Scan
• The most basic form of reading data: read it all, sequentially.
• Implement the trait TableScan
trait TableScan {
def buildScan(): RDD[Row]
}
• SQL: SELECT * FROM table
Pruned Scan
• Read all rows, but only a subset of columns
• Implement the trait PrunedScan
trait PrunedScan {
def buildScan(requiredColumns: Array[String]): RDD[Row]
}
• SQL: SELECT <column list> FROM table
Pruned Filtered Scan
• Can filter which rows are fetched (predicate push down).
• Implement the trait PrunedFilteredScan
trait PrunedFilteredScan {
def buildScan(requiredColumns: Array[String], filters: Array[Filter]): RDD[Row]
}
• SQL: SELECT <column list> FROM table WHERE <predicate>
• Spark SQL allows basic predicates here (e.g. EqualTo, GreaterThan).
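Taken together, the three traits form a progression in how much work the source takes over. The sketch below is a hedged, self-contained model in plain Scala rather than the real Spark API: Seq[Map[String, Any]] stands in for RDD[Row], and InMemorySource and the two Filter cases are hypothetical illustrations.

```scala
// Simplified stand-ins for Spark's Filter hierarchy (illustration only).
sealed trait Filter
case class EqualTo(attribute: String, value: Any) extends Filter
case class GreaterThan(attribute: String, value: Int) extends Filter

// The three scan traits, with Seq[Map[String, Any]] standing in for RDD[Row].
trait TableScan { def buildScan(): Seq[Map[String, Any]] }
trait PrunedScan { def buildScan(requiredColumns: Array[String]): Seq[Map[String, Any]] }
trait PrunedFilteredScan {
  def buildScan(requiredColumns: Array[String],
                filters: Array[Filter]): Seq[Map[String, Any]]
}

// A toy in-memory source implementing all three scan styles.
class InMemorySource(data: Seq[Map[String, Any]])
    extends TableScan with PrunedScan with PrunedFilteredScan {

  // Full scan: every row, every column.
  def buildScan(): Seq[Map[String, Any]] = data

  // Pruned scan: every row, but only the requested columns.
  def buildScan(requiredColumns: Array[String]): Seq[Map[String, Any]] =
    data.map(_.filter { case (col, _) => requiredColumns.contains(col) })

  // Pruned filtered scan: drop rows first, then prune columns.
  def buildScan(requiredColumns: Array[String],
                filters: Array[Filter]): Seq[Map[String, Any]] = {
    val kept = data.filter { row =>
      filters.forall {
        case EqualTo(col, v)     => row(col) == v
        case GreaterThan(col, v) => row(col).asInstanceOf[Int] > v
      }
    }
    kept.map(_.filter { case (col, _) => requiredColumns.contains(col) })
  }
}
```

Each step returns strictly less data: the full scan ships everything, the pruned scan drops columns, and the filtered scan drops rows before they ever leave the source.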
How does it work?
Assume the following table attendees
Query:
SELECT hometown, AVG(age) FROM attendees
WHERE hometown = ‘Amsterdam’
GROUP BY hometown

Name      Age  Hometown
Peter     23   London
John      30   New York
Stephan   72   Karlsruhe
…         …    …
How does it work?
Query:
SELECT hometown, AVG(age) FROM attendees
WHERE hometown = ‘Amsterdam’
GROUP BY hometown
The query is parsed into this logical plan (each node’s input is listed below it):
Aggregate (hometown, AVG(age))
└ Filter hometown = ‘Amsterdam’
  └ Relation (datasource) Attendees
Example with TableScan
Logical plan:
Aggregate (hometown, AVG(age))
└ Filter hometown = ‘Amsterdam’
  └ Relation (datasource) Attendees
Physical plan (after planning):
Aggregate (hometown, AVG(age))
└ Filter hometown = ‘Amsterdam’
  └ PhysicalRDD (full scan)
SQL representation:
– executed by the source:
SELECT name, age, hometown
FROM attendees
– executed in Spark:
SELECT hometown, AVG(age)
FROM source
WHERE hometown = ‘Amsterdam’
GROUP BY hometown
Example with PrunedScan
Logical plan:
Aggregate (hometown, AVG(age))
└ Filter hometown = ‘Amsterdam’
  └ Relation (datasource) Attendees
Physical plan (after planning):
Aggregate (hometown, AVG(age))
└ Filter hometown = ‘Amsterdam’
  └ PhysicalRDD (pruned: age, hometown)
SQL representation:
– executed by the source:
SELECT age, hometown
FROM attendees
– executed in Spark:
SELECT hometown, AVG(age)
FROM source
WHERE hometown = ‘Amsterdam’
GROUP BY hometown
Example with PrunedFilteredScan
Logical plan:
Aggregate (hometown, AVG(age))
└ Filter hometown = ‘Amsterdam’
  └ Relation (datasource) Attendees
Physical plan (after planning):
Aggregate (hometown, AVG(age))
└ Filter hometown = ‘Amsterdam’
  └ PhysicalRDD (pruned: age, hometown; filtered: hometown = ‘Amsterdam’)
SQL representation:
– executed by the source:
SELECT age, hometown
FROM attendees
WHERE hometown = ‘Amsterdam’
– executed in Spark:
SELECT hometown, AVG(age)
FROM source
WHERE hometown = ‘Amsterdam’
GROUP BY hometown
How can we improve this?
• Some sources can do more than filtering and pruning
– aggregation, joins, ...
• Some sources can execute more complex filters and functions
– Example: SELECT col1 + 1 WHERE col2 + col3 < col4
• The default Data Sources API cannot push these down
– even though they might be trivial for the data source to execute
• This leads to unnecessary work
– fetching more data than needed
– not using the optimizations of the source
Enter the Catalyst Source API
• We added a new interface that data sources can implement to signal that
they can push down complex queries.
• Complexity of pushed down queries is arbitrary
– functions, set operators, joins, deeply nested subqueries, …
– even data source UDFs that are not supported in Spark
trait CatalystSource {
def isMultiplePartitionExecution(relations: Seq[CatalystSource]): Boolean
def supportsLogicalPlan(plan: LogicalPlan): Boolean
def supportsExpression(expr: Expression): Boolean
def logicalPlanToRDD(plan: LogicalPlan): RDD[Row]
}
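To illustrate how a planner might consult supportsLogicalPlan, here is a hedged toy model: the miniature plan ADT and ToySource below are hypothetical stand-ins, not Spark’s Catalyst classes. The source accepts plans built from relations, filters, and aggregates, and rejects anything containing a join, so the planner falls back to a plain scan in that case.

```scala
// A miniature logical-plan ADT (illustration only, not Catalyst's).
sealed trait LogicalPlan
case class Relation(table: String) extends LogicalPlan
case class Filter(condition: String, child: LogicalPlan) extends LogicalPlan
case class Aggregate(groupBy: String, agg: String, child: LogicalPlan) extends LogicalPlan
case class Join(left: LogicalPlan, right: LogicalPlan) extends LogicalPlan

// A toy source that can push down filters and aggregates but not joins.
class ToySource {
  def supportsLogicalPlan(plan: LogicalPlan): Boolean = plan match {
    case Relation(_)            => true
    case Filter(_, child)       => supportsLogicalPlan(child)
    case Aggregate(_, _, child) => supportsLogicalPlan(child)
    case Join(_, _)             => false // not supported: planner must fall back
  }
}

// Planner decision: push the whole plan down if the source supports it,
// otherwise scan the relation and compute the rest in Spark.
def plan(source: ToySource, p: LogicalPlan): String =
  if (source.supportsLogicalPlan(p)) "push down whole plan"
  else "fall back to scan + Spark execution"
```

This is the point of the fine-grained supports* methods: a data source only ever receives plans it has declared it can handle, and everything else stays in Spark.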
Partitioned and Holistic sources
• Data sources that can compute queries that operate on a holistic data set
– HANA, Cassandra, PostgreSQL, MongoDB
• Data sources that can compute queries that operate only over each
partition
– Vora, Parquet, ORC, PostgreSQL instances in Postgres XL
• Some can do both (to some degree)
• Our planner extensions allow optimizing push down for both cases if the
data source implements the CatalystSource API.
Partitioned vs. Holistic Sources
(Deployment diagram: in the partitioned case, each physical node runs a Spark Worker, a Vora Engine, and an HDFS Data Node side by side; in the holistic case, a Spark Worker talks to SAP HANA, PostgreSQL, … as a whole.)
Example with CatalystSource
(partitioned execution)
Logical plan:
Aggregate (hometown, AVG(age))
└ Filter hometown = ‘Amsterdam’
  └ Relation (datasource) Attendees
Physical plan (after planning):
Aggregate (hometown, SUM(PartialSum) / SUM(PartialCount))
└ PhysicalRDD (CatalystSource)
SQL representation:
– executed by each partition of the source:
SELECT hometown,
SUM(age) AS PartialSum,
COUNT(age) AS PartialCount
FROM attendees
WHERE hometown = ‘Amsterdam’
GROUP BY hometown
– executed in Spark:
SELECT hometown,
SUM(PartialSum) / SUM(PartialCount)
FROM source
GROUP BY hometown
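The partial-aggregation rewrite above can be reproduced with plain Scala collections standing in for the source’s partitions (a sketch of the idea, not Vora’s implementation): each partition computes a partial SUM and COUNT per group, and the final AVG is assembled from the partials.

```scala
// Toy partitions of (hometown, age) rows, already filtered to 'Amsterdam'.
val partitions: Seq[Seq[(String, Int)]] = Seq(
  Seq(("Amsterdam", 20), ("Amsterdam", 30)),
  Seq(("Amsterdam", 40))
)

// Pushed down to each partition: GROUP BY hometown with partial SUM and COUNT.
val partials: Seq[(String, (Int, Int))] = partitions.flatMap { part =>
  part.groupBy(_._1).map { case (town, rows) =>
    (town, (rows.map(_._2).sum, rows.size)) // (PartialSum, PartialCount)
  }
}

// Executed in Spark: merge the partials and finish
// AVG = SUM(PartialSum) / SUM(PartialCount).
val avgByTown: Map[String, Double] = partials
  .groupBy(_._1)
  .map { case (town, ps) =>
    val sum = ps.map(_._2._1).sum
    val cnt = ps.map(_._2._2).sum
    (town, sum.toDouble / cnt)
  }
// avgByTown("Amsterdam") == 30.0
```

Note why the rewrite is necessary: averaging the per-partition averages (25.0 and 40.0) would give the wrong answer, whereas summing the partials yields the exact global AVG.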
Example with CatalystSource
(holistic source)
Logical plan:
Aggregate (hometown, AVG(age))
└ Filter hometown = ‘Amsterdam’
  └ Relation (datasource) Attendees
Physical plan (after planning):
PhysicalRDD (CatalystSource)
SQL representation (the whole query is executed by the source):
SELECT hometown, AVG(age)
FROM attendees
WHERE hometown = ‘Amsterdam’
GROUP BY hometown
Returned Rows
Assumption: table size is n rows
• TableScan / PrunedScan returns n rows:
SELECT name, age, hometown
FROM attendees
• PrunedFilteredScan returns < n rows:
SELECT age, hometown
FROM attendees
WHERE hometown = ‘Amsterdam’
• CatalystSource returns << n rows (one per distinct hometown):
SELECT hometown,
SUM(age) AS PartialSum,
COUNT(age) AS PartialCount
FROM attendees
WHERE hometown = ‘Amsterdam’
GROUP BY hometown
Advantages
• A single interface covers all queries.
• CatalystSource subsumes TableScan, PrunedScan, PrunedFilteredScan.
• Fine-grained control of features supported by the data source
• Incremental implementation of a data source possible
– Start by supporting projections and filters, then add more
• Opens the door to tighter integration with all kinds of databases.
– Dramatic performance improvements possible.
Current disadvantages and limitations
• Implementing CatalystSource for a rich data source (e.g., one supporting
full SQL) is a complex task.
• Current implementation relies on (some) Spark APIs that are unstable.
– Backwards compatibility is not guaranteed.
• Pushing down a complex query could be slower than not pushing it down
– Examples:
• the query overloads the data source
• the query generates a result larger than its input tables
– CatalystSource implementors can work around this by marking such
queries as unsupported
What are the next steps?
• Improve the API to make it simpler for implementors
– add utilities to generate SQL
– add matchers that simplify working with logical plans
• Provide a stable API
– CatalystSource implementations should work with different Spark
versions without modification.
• Provide a common trait to reduce boilerplate code
– Example: A data source implementing CatalystSource should not
need to implement TableScan, PrunedScan or PrunedFilteredScan.
Summary
• Extension of the Data Sources API to push down arbitrary logical plans
• Leverages the functionality of the source to process less data
• Part of SAP HANA Vora
• Will be open sourced soon
Thank you!
stephan.kessler@sap.com smola@stratio.com


Editor's Notes

  • #3: Notes: Quick slide: about 1 minute
  • #4: Notes: 1 or 2 minutes about SAP HANA Vora.
  • #5: Notes: 30 seconds about Data Sources API intro: Data Sources API defines how Spark SQL can interact with an external source of data. The Data Source can represent a file format on HDFS, a relation database, a web service…
  • #13: With TableScan, everything is pulled from the data source: every row with every column. Then all further steps are performed in Spark. Clarification: Here are three columns: Logical plan. Physical plan. A SQL representation with the query that is executed in the data source and the query that is executed in Spark SQL. This is just an idealization, it does not mean that the data source actually uses SQL or that Spark SQL uses it internally.
  • #15: With PrunedScan, we fetch all rows with a subset of columns. This can reduce I/O considerably.
  • #16: PrunedFilteredScan works like PrunedScan, but adds a filter on rows according to a condition. This is equivalent to adding a WHERE clause.