Integrating Apache Phoenix with Distributed Query Engines

Vincent Poon
Thomas D’Silva

Outline
● Presto Connector
● Spark Connector
● Demo

What is Presto?
● Presto is an open source distributed SQL query engine for
running interactive analytic queries on big datasets
○ latency sensitive use cases
■ Visualizations, dashboards, notebooks, BI tools
○ queries in seconds or minutes
● Developed at Facebook, contributed to open source (2013)
● ANSI SQL compliant
● AWS Athena, Google BigQuery

Source: https://www.slideshare.net/GuorongLIANG/facebook-presto-presentation/14

Source: https://www.slideshare.net/Electrum/presto-fast-sql-on-everything

Presto-Phoenix Connector
https://github.com/prestosql/presto/pull/672

Phoenix MapReduce framework
● Useful for long-running queries that read most/all of the data
● Steps:
○ Run query through planner to get QueryPlan
○ Set up parallel scans for QueryPlan
■ One scan per region (or guidepost with stats)
○ Extract HBase scans from final QueryPlan
○ Create one Mapper per scan, execute in YARN
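
The "one scan per region" step above can be illustrated with a minimal sketch. It assumes row keys are plain strings and that the region (or guidepost) boundaries are already known and sorted. KeyRangeScan and RegionScanPlanner are hypothetical names used only for illustration; in Phoenix the scans come from the QueryPlan.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for an HBase scan over the row-key range [startKey, stopKey). */
final class KeyRangeScan {
    final String startKey;
    final String stopKey;

    KeyRangeScan(String startKey, String stopKey) {
        this.startKey = startKey;
        this.stopKey = stopKey;
    }
}

/** Hypothetical planner: cuts one query range into one scan per region. */
final class RegionScanPlanner {
    /**
     * Splits [start, stop) at every boundary that falls strictly inside the range.
     * Assumes regionBoundaries is sorted ascending. Passing guideposts instead of
     * region boundaries yields finer-grained scans with the same logic.
     */
    static List<KeyRangeScan> planScans(String start, String stop, List<String> regionBoundaries) {
        List<KeyRangeScan> scans = new ArrayList<>();
        String cursor = start;
        for (String boundary : regionBoundaries) {
            if (boundary.compareTo(cursor) > 0 && boundary.compareTo(stop) < 0) {
                scans.add(new KeyRangeScan(cursor, boundary));
                cursor = boundary;
            }
        }
        // The last scan covers whatever remains up to the end of the query range.
        scans.add(new KeyRangeScan(cursor, stop));
        return scans;
    }
}
```

Each returned scan would then be handed to its own Mapper, so the degree of parallelism follows the number of regions or guideposts the query touches.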

Phoenix connector for Presto
● Similar to Phoenix MapReduce:
● Steps:
○ Run query through planner, get parallel HBase scans
○ Create a Presto Split for each scan
■ reuses the Presto JDBC connector code
■ wrap each scan in a ResultSet
○ Splits executed by Presto Workers
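
A rough sketch of the split-per-scan idea, using only the JDK: each split wraps one scan range, a pool of worker threads executes the splits, and the rows are concatenated. ScanSplit and SplitRunner are hypothetical names, and the integer "table" is an assumption for illustration; this is not the Presto SPI or the connector's actual code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical split: one unit of work wrapping one scan over [startInclusive, stopExclusive). */
final class ScanSplit implements Callable<List<Integer>> {
    private final List<Integer> tableKeys; // toy "table": just its row keys
    private final int startInclusive;
    private final int stopExclusive;

    ScanSplit(List<Integer> tableKeys, int startInclusive, int stopExclusive) {
        this.tableKeys = tableKeys;
        this.startInclusive = startInclusive;
        this.stopExclusive = stopExclusive;
    }

    @Override
    public List<Integer> call() {
        List<Integer> rows = new ArrayList<>();
        for (Integer key : tableKeys) {
            if (key >= startInclusive && key < stopExclusive) {
                rows.add(key);
            }
        }
        return rows;
    }
}

/** Hypothetical runner: the thread pool plays the role of the workers. */
final class SplitRunner {
    static List<Integer> runAll(List<ScanSplit> splits, int workers) {
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        try {
            List<Integer> all = new ArrayList<>();
            // invokeAll returns futures in the same order as the submitted splits.
            for (Future<List<Integer>> result : pool.invokeAll(splits)) {
                all.addAll(result.get());
            }
            return all;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while running splits", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("split failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```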


Future work
● Integrate with new Presto work on pushdown of complex operations
(aggregations, joins, etc.)
○ Currently only predicates are pushed down to Phoenix
○ Phoenix can then push down further to HBase coprocessors
● Integrate Phoenix stats with Presto cost-based optimizer
○ Join reordering based on stats

Background
Most current cluster programming models are
based on acyclic data flow from stable storage
to stable storage
[Figure: acyclic data flow, Input → Map tasks → Reduce tasks → Output]

Motivation
Benefits of data flow: runtime can decide
where to run tasks and can automatically
recover from failures

Motivation
Acyclic data flow is inefficient for applications
that repeatedly reuse a working set of data:
»Iterative algorithms (machine learning, graphs)
»Interactive data mining tools (R, Excel, Python)
With current frameworks, apps reload data
from stable storage on each query

Spark Goals
Extend the MapReduce model to better support
two common classes of analytics apps:
»Iterative algorithms (machine learning, graphs)
»Interactive data mining
Enhance programmability:
»Integrate into Scala programming language
»Allow interactive use from Scala interpreter
(slides from https://svn.apache.org/repos/asf/spark/talks/overview.pptx)

Solution: Resilient Distributed Datasets (RDDs)
Allow apps to keep working sets in memory for
efficient reuse
Retain the attractive properties of MapReduce
»Fault tolerance, data locality, scalability
Support a wide range of applications

Programming Model
Resilient distributed datasets (RDDs)
»Immutable, partitioned collections of objects
»Created through parallel transformations (map, filter,
groupBy, join, …) on data in stable storage
»Can be cached for efficient reuse
Actions on RDDs
»Count, reduce, collect, save, …
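
To make the transformation/action split concrete, here is a toy single-machine sketch. ToyDataset is a hypothetical class, not Spark's RDD: transformations only record how to compute the data, an action triggers the computation, and cache() keeps the result for later actions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;
import java.util.function.Predicate;
import java.util.function.Supplier;

/** Toy, single-machine illustration of the RDD model; hypothetical, not Spark's API. */
final class ToyDataset<T> {
    private final Supplier<List<T>> compute; // recipe for producing the elements
    private boolean cacheEnabled;
    private List<T> cached;                  // filled by the first action if cache() was called

    private ToyDataset(Supplier<List<T>> compute) {
        this.compute = compute;
    }

    /** Creates a dataset from "stable storage"; nothing is loaded yet. */
    static <T> ToyDataset<T> fromStorage(Supplier<List<T>> loader) {
        return new ToyDataset<T>(loader);
    }

    /** Transformation: lazily records a map step. */
    <R> ToyDataset<R> map(Function<T, R> f) {
        return new ToyDataset<R>(() -> {
            List<R> out = new ArrayList<>();
            for (T item : materialize()) {
                out.add(f.apply(item));
            }
            return out;
        });
    }

    /** Transformation: lazily records a filter step. */
    ToyDataset<T> filter(Predicate<T> p) {
        return new ToyDataset<T>(() -> {
            List<T> out = new ArrayList<>();
            for (T item : materialize()) {
                if (p.test(item)) {
                    out.add(item);
                }
            }
            return out;
        });
    }

    /** Marks this dataset to be kept in memory once it has been computed. */
    ToyDataset<T> cache() {
        cacheEnabled = true;
        return this;
    }

    /** Action: forces the computation and returns the number of elements. */
    long count() {
        return materialize().size();
    }

    private List<T> materialize() {
        if (cached != null) {
            return cached;
        }
        List<T> result = compute.get();
        if (cacheEnabled) {
            cached = result;
        }
        return result;
    }
}
```

With a cached dataset, a second action (or a further transformation built on it) reuses the in-memory result instead of reloading from storage, which is the working-set reuse the Motivation slide describes.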

Phoenix-Spark connector (Datasource V1)
● Spark supports JDBC, but parallelizes queries only on numeric columns
● Connector uses splits provided by Phoenix to read/write data
● Supports column projection and simple filter push down
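
For contrast with Phoenix-provided splits, the first bullet refers to splitting a JDBC read into range predicates on a numeric column. The sketch below shows one way such predicates can be derived from a lower bound, an upper bound, and a partition count. It is a simplified illustration, not Spark's exact code, and NumericRangePartitioner is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: derives one WHERE predicate per partition from numeric bounds. */
final class NumericRangePartitioner {
    static List<String> wherePredicates(String column, long lower, long upper, int numPartitions) {
        if (numPartitions < 1 || lower >= upper) {
            throw new IllegalArgumentException("need numPartitions >= 1 and lower < upper");
        }
        List<String> predicates = new ArrayList<>();
        if (numPartitions == 1) {
            predicates.add("1 = 1"); // a single partition reads everything
            return predicates;
        }
        long stride = Math.max(1, (upper - lower) / numPartitions);
        for (int i = 0; i < numPartitions; i++) {
            long from = lower + i * stride;
            long to = from + stride;
            if (i == 0) {
                // The first partition is open below and also picks up NULL values.
                predicates.add(column + " < " + to + " OR " + column + " IS NULL");
            } else if (i == numPartitions - 1) {
                // The last partition is open above.
                predicates.add(column + " >= " + from);
            } else {
                predicates.add(column + " >= " + from + " AND " + column + " < " + to);
            }
        }
        return predicates;
    }
}
```

Ranges like these are blind to how the data is physically laid out, which is why the connector prefers the splits Phoenix itself reports.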

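case class PhoenixRelation(tableName: String, zkUrl: String...) extends BaseRelation with PrunedFilteredScan {
  // Reads only requiredColumns and pushes the supported filters down to Phoenix
  override def buildScan(requiredColumns: Array[String], filters: Array[Filter]): RDD[Row] = { new PhoenixRDD(..) }
  // Schema of the Phoenix table as seen by Spark
  override def schema: StructType = {...}
  // Filters the connector cannot push down; Spark re-applies these itself
  override def unhandledFilters(filters: Array[Filter]): Array[Filter] = {....}
}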
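
A minimal sketch of what column projection and simple filter pushdown amount to: the required columns become the SELECT list, the filters the source understands become the WHERE clause, and the rest are reported back as unhandled. SimpleFilter and PushdownQueryBuilder are hypothetical stand-ins, and the choice of which filter kinds are pushable is an assumption; this is not the connector's code.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a source filter; kind is "=", ">" or "contains". */
final class SimpleFilter {
    final String column;
    final String kind;
    final Object value;

    SimpleFilter(String column, String kind, Object value) {
        this.column = column;
        this.kind = kind;
        this.value = value;
    }
}

/** Hypothetical builder showing projection and filter pushdown as SQL text. */
final class PushdownQueryBuilder {
    // Assumption for illustration: only equality and greater-than can be pushed down.
    static boolean canPushDown(SimpleFilter f) {
        return f.kind.equals("=") || f.kind.equals(">");
    }

    static String buildSelect(String table, String[] requiredColumns, List<SimpleFilter> filters) {
        StringBuilder sql = new StringBuilder("SELECT ");
        sql.append(requiredColumns.length == 0 ? "1" : String.join(", ", requiredColumns));
        sql.append(" FROM ").append(table);
        List<String> conditions = new ArrayList<>();
        for (SimpleFilter f : filters) {
            if (canPushDown(f)) {
                conditions.add(f.column + " " + f.kind + " " + literal(f.value));
            }
        }
        if (!conditions.isEmpty()) {
            sql.append(" WHERE ").append(String.join(" AND ", conditions));
        }
        return sql.toString();
    }

    /** Filters left for the engine to evaluate after the scan. */
    static List<SimpleFilter> unhandledFilters(List<SimpleFilter> filters) {
        List<SimpleFilter> rest = new ArrayList<>();
        for (SimpleFilter f : filters) {
            if (!canPushDown(f)) {
                rest.add(f);
            }
        }
        return rest;
    }

    private static String literal(Object value) {
        if (value instanceof String) {
            return "'" + ((String) value).replace("'", "''") + "'";
        }
        return String.valueOf(value);
    }
}
```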

Uses NewHadoopRDD to read/write data from a Phoenix table
val phoenixRDD = sc.newAPIHadoopRDD(phoenixConf,
  classOf[PhoenixInputFormat[PhoenixRecordWritable]], // class of MR input format
  classOf[NullWritable],                              // class of key
  classOf[PhoenixRecordWritable])                     // class of value

Datasource V1
● No support for pushing down limit or aggregates
● No support for statistics
● Depends on upper-level RDD API

Datasource V2 (evolving)
public interface DataSourceReader {
  StructType readSchema();
  List<InputPartition<InternalRow>> planInputPartitions();
}
public interface SupportsPushDownFilters extends DataSourceReader {
  Filter[] pushFilters(Filter[] filters);
  Filter[] pushedFilters();
}
public interface SupportsPushDownRequiredColumns extends DataSourceReader {
  void pruneColumns(StructType requiredSchema);
}
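
The call sequence implied by these interfaces can be sketched with simplified stand-in types: the engine first offers filters and the required columns, and only then asks for partitions, which reflect both. ToyPhoenixReader is hypothetical, and it uses plain strings in place of Filter, StructType and InputPartition, so it illustrates the protocol only.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy reader; method names echo the interfaces above, the types are simplified stand-ins. */
final class ToyPhoenixReader {
    private final String table;
    private final List<String> keyRanges; // one entry per parallel scan, e.g. "ID < 100"
    private final List<String> pushed = new ArrayList<>();
    private List<String> columns;         // current, possibly pruned, schema

    ToyPhoenixReader(String table, List<String> fullSchema, List<String> keyRanges) {
        this.table = table;
        this.columns = new ArrayList<>(fullSchema);
        this.keyRanges = new ArrayList<>(keyRanges);
    }

    /** Keeps the filters the source can evaluate and returns the rest to the engine. */
    List<String> pushFilters(List<String> filters) {
        List<String> residual = new ArrayList<>();
        for (String filter : filters) {
            // Assumption for illustration: this toy source cannot evaluate LIKE.
            if (filter.contains(" LIKE ")) {
                residual.add(filter);
            } else {
                pushed.add(filter);
            }
        }
        return residual;
    }

    List<String> pushedFilters() {
        return new ArrayList<>(pushed);
    }

    void pruneColumns(List<String> requiredColumns) {
        columns = new ArrayList<>(requiredColumns);
    }

    List<String> readSchema() {
        return new ArrayList<>(columns);
    }

    /** One query per key range, carrying the pruned columns and the pushed filters. */
    List<String> planInputPartitions() {
        List<String> queries = new ArrayList<>();
        for (String range : keyRanges) {
            List<String> where = new ArrayList<>(pushed);
            where.add(range);
            queries.add("SELECT " + String.join(", ", columns) + " FROM " + table
                    + " WHERE " + String.join(" AND ", where));
        }
        return queries;
    }
}
```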

Future Work
SPARK-22386
● Limit Pushdown
● Aggregate Pushdown
● Support clustering for writes
