Integrating Apache Phoenix with Distributed Query Engines
Vincent Poon
Thomas D’Silva
Outline
● Presto Connector
● Spark Connector
● Demo
What is Presto?
● Presto is an open source distributed SQL query engine for
running interactive analytic queries on big datasets
○ latency sensitive use cases
■ Visualizations, dashboards, notebooks, BI tools
○ queries in seconds or minutes
● Developed at Facebook, contributed to open source (2013)
● ANSI SQL compliant
● AWS Athena, Google BigQuery
Source: https://www.slideshare.net/GuorongLIANG/facebook-presto-presentation/14
Source: https://www.slideshare.net/Electrum/presto-fast-sql-on-everything
Presto-Phoenix Connector
https://github.com/prestosql/presto/pull/672
Phoenix MapReduce framework
● Useful for long running queries that read most/all of the data
● Steps:
○ Run query through planner to get QueryPlan
○ Set up parallel scans for QueryPlan
■ One scan per region (or guidepost with stats)
○ Extract HBase scans from final QueryPlan
○ Create one Mapper per scan, execute in YARN
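The "one scan per region" step above can be pictured with a small, self-contained Java sketch. It is a toy model rather than Phoenix code: row keys are plain integers, and `RegionScans`, `Scan`, and `parallelScans` are hypothetical names introduced only for illustration. Phoenix itself derives its scans from the QueryPlan and the region (or guidepost) boundaries.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model: row keys are ints and a table is split into regions at sorted boundary keys. */
final class RegionScans {

    /** A half-open key range [start, end) standing in for one HBase scan. */
    static final class Scan {
        final int start;
        final int end;

        Scan(int start, int end) {
            this.start = start;
            this.end = end;
        }

        @Override
        public String toString() {
            return "[" + start + ", " + end + ")";
        }
    }

    /**
     * Splits the query range [start, end) at every boundary that falls strictly inside it.
     * Boundaries must be sorted ascending; they play the role of region start keys
     * (or guideposts when statistics are available).
     */
    static List<Scan> parallelScans(int start, int end, int[] boundaries) {
        List<Scan> scans = new ArrayList<>();
        int cursor = start;
        for (int b : boundaries) {
            if (b <= cursor) continue; // boundary at or before the current position
            if (b >= end) break;       // boundary at or past the end of the query range
            scans.add(new Scan(cursor, b));
            cursor = b;
        }
        if (cursor < end) scans.add(new Scan(cursor, end));
        return scans;
    }

    public static void main(String[] args) {
        // Regions split at keys 100, 200, 300; the query covers keys 50 up to (not including) 250.
        for (Scan s : parallelScans(50, 250, new int[] {100, 200, 300})) {
            System.out.println("one mapper for scan " + s);
        }
    }
}
```

Each returned range would become the input of one Mapper, so the degree of parallelism follows the number of regions (or guideposts) the query touches.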
Phoenix connector for Presto
● Similar to Phoenix MapReduce:
● Steps:
○ Run query through planner, get parallel HBase scans
○ Create a Presto Split for each scan
■ reuses the Presto JDBC connector code
■ wrap each scan in a ResultSet
○ Splits executed by Presto Workers
Phoenix MapReduce limitations
Future work
● Integrate with new Presto work on pushdown of complex operations
(aggregations, joins, etc.)
○ Currently only predicates are pushed down to Phoenix
○ Phoenix can then push down further to HBase coprocessors
● Integrate Phoenix stats with Presto cost-based optimizer
○ Join reordering based on stats
Background
Most current cluster programming models are
based on acyclic data flow from stable storage
to stable storage
[Figure: acyclic data flow, Input → Map (×3) → Reduce (×2) → Output]
Motivation
Benefits of data flow: runtime can decide
where to run tasks and can automatically
recover from failures
Motivation
Acyclic data flow is inefficient for applications
that repeatedly reuse a working set of data:
»Iterative algorithms (machine learning, graphs)
»Interactive data mining tools (R, Excel, Python)
With current frameworks, apps reload data
from stable storage on each query
Spark Goals
Extend the MapReduce model to better support
two common classes of analytics apps:
»Iterative algorithms (machine learning, graphs)
»Interactive data mining
Enhance programmability:
»Integrate into Scala programming language
»Allow interactive use from Scala interpreter
(slides from https://svn.apache.org/repos/asf/spark/talks/overview.pptx)
Solution: Resilient Distributed Datasets (RDDs)
Allow apps to keep working sets in memory for
efficient reuse
Retain the attractive properties of MapReduce
»Fault tolerance, data locality, scalability
Support a wide range of applications
Programming Model
Resilient distributed datasets (RDDs)
»Immutable, partitioned collections of objects
»Created through parallel transformations (map, filter,
groupBy, join, …) on data in stable storage
»Can be cached for efficient reuse
Actions on RDDs
»Count, reduce, collect, save, …
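To make the transformation/action split and the benefit of caching concrete, here is a toy, single-machine stand-in written in plain Java. It is not the Spark API: `ToyDataset` and `ToyDatasetDemo` are hypothetical names, and nothing here is partitioned or fault tolerant. Only the idea is carried over: transformations lazily record a recipe, actions run it, and a cached dataset avoids re-reading the source.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;
import java.util.function.Predicate;
import java.util.function.Supplier;

/** Toy stand-in for an RDD: lazily computed from a recipe and optionally cached. */
final class ToyDataset<T> {
    private final Supplier<List<T>> recipe; // how to (re)compute the data
    private boolean cacheRequested;
    private List<T> cached;                 // filled by the first action after cache()

    private ToyDataset(Supplier<List<T>> recipe) {
        this.recipe = recipe;
    }

    static <T> ToyDataset<T> fromSource(Supplier<List<T>> source) {
        return new ToyDataset<T>(source);
    }

    // Transformations: build a new dataset and do no work yet.
    <R> ToyDataset<R> map(Function<T, R> f) {
        return new ToyDataset<R>(() -> {
            List<R> out = new ArrayList<>();
            for (T item : materialize()) out.add(f.apply(item));
            return out;
        });
    }

    ToyDataset<T> filter(Predicate<T> keep) {
        return new ToyDataset<T>(() -> {
            List<T> out = new ArrayList<>();
            for (T item : materialize()) if (keep.test(item)) out.add(item);
            return out;
        });
    }

    /** Marks this dataset to be kept in memory once it has been computed. */
    ToyDataset<T> cache() {
        cacheRequested = true;
        return this;
    }

    // Actions: trigger the computation.
    long count() {
        return materialize().size();
    }

    List<T> collect() {
        return new ArrayList<>(materialize());
    }

    private List<T> materialize() {
        if (cached != null) return cached;
        List<T> data = recipe.get();
        if (cacheRequested) cached = data;
        return data;
    }
}

final class ToyDatasetDemo {
    /** Runs two actions over the same filtered dataset and returns how often the source was read. */
    static int sourceReads(boolean useCache) {
        AtomicInteger reads = new AtomicInteger();
        ToyDataset<Integer> evens = ToyDataset
                .<Integer>fromSource(() -> {
                    reads.incrementAndGet(); // stands in for a read from stable storage
                    return Arrays.asList(1, 2, 3, 4, 5, 6);
                })
                .filter(x -> x % 2 == 0);
        if (useCache) evens.cache();
        evens.count();                    // first action
        evens.map(x -> x * 10).collect(); // second action over the same working set
        return reads.get();
    }

    public static void main(String[] args) {
        System.out.println("source reads with cache:    " + sourceReads(true));
        System.out.println("source reads without cache: " + sourceReads(false));
    }
}
```

With the cache enabled the source is read once for both actions; without it, each action re-runs the whole recipe and the source is read twice.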
Phoenix-Spark connector (Datasource V1)
● Spark supports JDBC, but parallelizes queries only on numeric columns
● Connector uses splits provided by Phoenix to read/write data
● Supports column projection and simple filter pushdown
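Spark's generic JDBC source splits a read using the `partitionColumn`, `lowerBound`, `upperBound`, and `numPartitions` options, turning a value range into one WHERE predicate per partition. The sketch below is a simplified illustration of that idea in plain Java, not Spark's actual implementation; `JdbcStrides` and `partitionPredicates` are hypothetical names. It shows why range arithmetic on the column is needed, which is the step the Phoenix connector avoids by using Phoenix's own splits.

```java
import java.util.ArrayList;
import java.util.List;

/** Simplified sketch of stride-based partitioning over a numeric column. */
final class JdbcStrides {

    /** Returns one WHERE predicate per partition, covering the whole column between them. */
    static List<String> partitionPredicates(String column, long lower, long upper, int numPartitions) {
        if (numPartitions < 1 || upper - lower < numPartitions) {
            throw new IllegalArgumentException("need at least one value per partition");
        }
        long stride = (upper - lower) / numPartitions;
        List<String> predicates = new ArrayList<>();
        long current = lower;
        for (int i = 0; i < numPartitions; i++) {
            String lowerBound = (i == 0) ? null : column + " >= " + current;
            current += stride;
            String upperBound = (i == numPartitions - 1) ? null : column + " < " + current;
            if (lowerBound == null && upperBound == null) {
                predicates.add("1 = 1"); // single partition: no restriction
            } else if (lowerBound == null) {
                // First partition is open below and also picks up NULL values.
                predicates.add(upperBound + " OR " + column + " IS NULL");
            } else if (upperBound == null) {
                predicates.add(lowerBound); // last partition is open above
            } else {
                predicates.add(lowerBound + " AND " + upperBound);
            }
        }
        return predicates;
    }

    public static void main(String[] args) {
        for (String p : partitionPredicates("ID", 0, 100, 4)) {
            System.out.println("SELECT * FROM T WHERE " + p);
        }
    }
}
```

The table name `T` and column `ID` are placeholders. Because the split points come from arithmetic on the bounds, skewed data can produce uneven partitions, whereas splits provided by Phoenix follow the table's actual region boundaries.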
Phoenix-Spark connector (Datasource V1)
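case class PhoenixRelation(tableName: String, zkUrl: String...) extends
BaseRelation with PrunedFilteredScan {
  // Read only the required columns and push the given filters down to Phoenix
  override def buildScan(requiredColumns: Array[String], filters:
  Array[Filter]): RDD[Row] = { new PhoenixRDD(..) }
  // Spark schema for the Phoenix table
  override def schema: StructType = {...}
  // Filters Phoenix cannot evaluate; Spark applies these after the scan
  override def unhandledFilters(filters: Array[Filter]): Array[Filter] =
  {....}
}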
Phoenix-Spark connector (Datasource V1)
Uses NewHadoopRDD to read/write data from a Phoenix table
val phoenixRDD = sc.newAPIHadoopRDD(phoenixConf,
  classOf[PhoenixInputFormat[PhoenixRecordWritable]], // class of MR input format
  classOf[NullWritable], // class of key
  classOf[PhoenixRecordWritable]) // class of value
Datasource V1
● No support for pushing down limit or aggregates
● No support for statistics
● Depends on upper level RDD API
Datasource V2 (evolving)
public interface DataSourceReader {
    StructType readSchema();
    List<InputPartition<InternalRow>> planInputPartitions();
}

public interface SupportsPushDownFilters extends DataSourceReader {
    Filter[] pushFilters(Filter[] filters);
    Filter[] pushedFilters();
}

public interface SupportsPushDownRequiredColumns extends DataSourceReader {
    void pruneColumns(StructType requiredSchema);
}
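In this version of the API, `pushFilters` hands the source all candidate filters and expects back the ones Spark must still evaluate after the scan, while `pushedFilters` reports what the source accepted. A minimal sketch of that contract in plain Java follows. `SimpleFilter` and `PushdownSketch` are hypothetical stand-ins (the real types live in Spark), and the set of supported operators is an assumption chosen for the example, not a statement about what Phoenix can push down.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical, simplified filter: column, comparison operator, literal value. */
final class SimpleFilter {
    final String column;
    final String op;
    final Object value;

    SimpleFilter(String column, String op, Object value) {
        this.column = column;
        this.op = op;
        this.value = value;
    }

    /** Renders the filter as SQL, quoting string literals. */
    String toSql() {
        String literal = (value instanceof String)
                ? "'" + ((String) value).replace("'", "''") + "'"
                : String.valueOf(value);
        return column + " " + op + " " + literal;
    }
}

final class PushdownSketch {
    // Assumption for this example: the source can evaluate only simple comparisons.
    private static final List<String> SUPPORTED_OPS = Arrays.asList("=", "<", "<=", ">", ">=");

    private final List<SimpleFilter> pushed = new ArrayList<>();

    /** Keeps the filters the source can evaluate; returns those the engine must apply after the scan. */
    List<SimpleFilter> pushFilters(List<SimpleFilter> filters) {
        pushed.clear();
        List<SimpleFilter> remaining = new ArrayList<>();
        for (SimpleFilter f : filters) {
            if (SUPPORTED_OPS.contains(f.op)) {
                pushed.add(f);
            } else {
                remaining.add(f);
            }
        }
        return remaining;
    }

    /** Reports the filters accepted by the last pushFilters call. */
    List<SimpleFilter> pushedFilters() {
        return new ArrayList<>(pushed);
    }

    /** Builds the WHERE clause sent to the source, or an empty string if nothing was pushed. */
    String whereClause() {
        if (pushed.isEmpty()) return "";
        StringBuilder sb = new StringBuilder("WHERE ");
        for (int i = 0; i < pushed.size(); i++) {
            if (i > 0) sb.append(" AND ");
            sb.append(pushed.get(i).toSql());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        PushdownSketch reader = new PushdownSketch();
        List<SimpleFilter> remaining = reader.pushFilters(Arrays.asList(
                new SimpleFilter("ID", ">", 100),
                new SimpleFilter("NAME", "LIKE", "a%")));
        System.out.println(reader.whereClause());
        System.out.println(remaining.size() + " filter(s) left for the engine");
    }
}
```

Returning the unsupported filters instead of silently dropping them keeps results correct: the engine re-applies whatever the source could not evaluate.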
Future Work
SPARK-22386
● Limit Pushdown
● Aggregate Pushdown
● Support clustering for writes
Demo
