Stephan Kessler and Santiago Mola presented SAP HANA Vora, which extends Spark SQL's data sources API to allow "pushing down" more of a SQL query's logical plan to the data source for execution. This "Pushdown of Everything" approach leverages data sources' capabilities to process less data and optimize query execution. They described how data sources can implement interfaces like TableScan, PrunedScan, and the new CatalystSource interface to support pushing down projections, filters, and more complex queries respectively. While this approach has advantages in performance, challenges include the complexity of implementing CatalystSource and ensuring compatibility across Spark versions. Future work aims to improve the API and provide utilities to simplify implementation.
Related topics: