1 ©	Hortonworks	 Inc.	2011	– 2017.	All	Rights	Reserved
Apache	Spark	– Apache	HBase Connector
Feature	Rich	and	Efficient	Access	to	HBase
through	Spark	SQL
Weiqing Yang	
Mingjie Tang	
October,	2017
2 ©	Hortonworks	 Inc.	2011	– 2017.	All	Rights	Reserved
About	Authors
à Weiqing Yang
• Contribute	to	Apache	Spark,	Apache	Hadoop,	Apache	HBase,	Apache	Ambari
• Software	Engineer	at	Hortonworks
à Mingjie Tang
• Spark SQL, Spark MLlib, Spark Streaming, Data Mining, Machine Learning
• Software	Engineer	at	Hortonworks
à …	All	Other	SHC	Contributors
3 ©	Hortonworks	 Inc.	2011	– 2016.	All	Rights	Reserved
Agenda
Motivation
Overview
Architecture	&	Implementation
Usage	&	Demo
4 ©	Hortonworks	 Inc.	2011	– 2017.	All	Rights	Reserved
Motivation
à Limited	Spark	Support	in	HBase Upstream
– RDD	level
– But	Spark	Is	Moving	to	DataFrame/Dataset
à Existing	Connectors	in	DataFrame Level
– Complicated	Design
• Embedding	Optimization	Plan	inside	Catalyst	Engine
• Stability	Impact	with	Coprocessor
• Serialized	RDD	Lineage	to	HBase
– Heavy	Maintenance	Overhead
5 ©	Hortonworks	 Inc.	2011	– 2016.	All	Rights	Reserved
Overview
6 ©	Hortonworks	 Inc.	2011	– 2017.	All	Rights	Reserved
Apache Spark – Apache HBase Connector (SHC)
à Combine	Spark	and	HBase
– Spark	Catalyst	Engine	for	Query	Plan	and	Optimization
– HBase as	Fast	Access	KV	Store
– Implement a Standard External Data Source with Built-in Filters; Easy to Maintain
à Full-Fledged DataFrame Support
– Spark SQL
– Language-Integrated Query
à High	Performance
– Partition	Pruning,	Data	Locality,	Column	Pruning,	Predicate	Pushdown
– Use the Spark unhandledFilters API (see the sketch below)
– Cache	Spark	HBase Connections
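A note on the unhandledFilters API: a Spark BaseRelation reports back the filters it could not evaluate itself, and Spark re-applies only those, so predicates the connector fully handles are not evaluated twice. The sketch below is illustrative only; RowKeyFilteringRelation and rowKeyColumn are assumed names, not SHC's actual classes.

import org.apache.spark.sql.sources.{BaseRelation, EqualTo, Filter, GreaterThan, LessThan}

// Illustrative sketch: claim simple row-key comparisons as fully handled by the
// data source; return everything else so Spark re-applies those filters itself.
trait RowKeyFilteringRelation extends BaseRelation {
  def rowKeyColumn: String

  override def unhandledFilters(filters: Array[Filter]): Array[Filter] =
    filters.filterNot {
      case EqualTo(attr, _)     => attr == rowKeyColumn
      case GreaterThan(attr, _) => attr == rowKeyColumn
      case LessThan(attr, _)    => attr == rowKeyColumn
      case _                    => false
    }
}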
7 ©	Hortonworks	 Inc.	2011	– 2017.	All	Rights	Reserved
Data	Coder	&	Data	Schema
à Support	Different	Data	Coders
– PrimitiveType: Native Support for Java Primitive Types
– Avro:	Native	Support	Avro	Encoding/Decoding
– Phoenix:	Phoenix	Encoding/Decoding
– Plug-In	Data	Coder
– Can Run on Top of Existing HBase Tables
à Support	Composite	Key
– def cat = s"""{
|"table":{"namespace":"default", "name":"shcExampleTable", "tableCoder":"Phoenix"},
|"rowkey":"key1:key2",
|"columns":{
|"col00":{"cf":"rowkey", "col":"key1", "type":"string"},
|"col01":{"cf":"rowkey", "col":"key2", "type":"int"},
…
...
8 ©	Hortonworks	 Inc.	2011	– 2016.	All	Rights	Reserved
Architecture	&	Implementation
9 ©	Hortonworks	 Inc.	2011	– 2017.	All	Rights	Reserved
Architecture
…...
Driver
Executor Executor Executor
Region	
Server
Region	
Server
Region	
Server…...
Spark
HBase
Picture	1.	SHC	architecture
Host	1
10 ©	Hortonworks	 Inc.	2011	– 2017.	All	Rights	Reserved
Architecture
…...
Driver
Executor Executor Executor
Region	
Server
Region	
Server
Region	
Server…...
Picture	1.	SHC	architecture
Task
Query
Partition
Filters,	Required	
Columns
RS	start/end	
point
sqlContext.sql("select	
count(col1)	from	table1	
where	key	<	'row050'")
PP P
Scans
BulkGets
11 ©	Hortonworks	 Inc.	2011	– 2017.	All	Rights	Reserved
Implementation
…...
Driver
Executor Executor Executor
Region	
Server
Region	
Server
Region	
Server…...
Picture	1.	SHC	architecture
Task
Query
Partition
Filters, Required Columns
Partition Pruning: Tasks Only Performed in the Region Servers Holding the Requested Data
PP P
Scans
BulkGets
Filters -> Multiple Scan Ranges ∩ (Start point, End point)
RS	start/end	
point
12 ©	Hortonworks	 Inc.	2011	– 2017.	All	Rights	Reserved
Implementation
…...
Driver
Executor Executor Executor
Region	
Server
Region	
Server
Region	
Server…...
Picture	1.	SHC	architecture
Task
Query
Partition
Filters,	Required	
Columns
RS	start/end	
point
Data	Locality:	Move	
Computation	to	Data.	
PP P
Scans
BulkGets
RDD partitions have a preferred location:
override def getPreferredLocations(partition: Partition): Seq[String] =
  Seq(regionServer.hostName)
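For context, this is how a Spark RDD generally advertises data locality to the scheduler. A minimal generic sketch; HBaseScanPartition and HBaseScanRDD are illustrative names, not SHC's actual classes.

import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

// Each partition remembers the host of the region server it will read from.
case class HBaseScanPartition(index: Int, regionServerHost: String) extends Partition

class HBaseScanRDD(sc: SparkContext, parts: Seq[HBaseScanPartition])
  extends RDD[String](sc, Nil) {

  override protected def getPartitions: Array[Partition] = parts.toArray[Partition]

  // The scheduler prefers launching each task on the region server holding the data.
  override protected def getPreferredLocations(split: Partition): Seq[String] =
    Seq(split.asInstanceOf[HBaseScanPartition].regionServerHost)

  override def compute(split: Partition, context: TaskContext): Iterator[String] =
    Iterator.empty // a real implementation would issue the Scans/BulkGets here
}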
13 ©	Hortonworks	 Inc.	2011	– 2017.	All	Rights	Reserved
Implementation
…...
Driver
Executor Executor Executor
Region	
Server
Region	
Server
Region	
Server…...
Picture	1.	SHC	architecture
Task
Query
Partition
Filters,	Required	
Columns
RS	start/end	
point
Column Pruning: Required Columns
Predicate Pushdown: HBase Built-in Filters
PP P
Filters,	Required	
Columns
Filters,	Required	
Columns
Scans
BulkGets
Filters,	Required	
Columns
14 ©	Hortonworks	 Inc.	2011	– 2017.	All	Rights	Reserved
Implementation
…...
Driver
Executor Executor Executor
Region	
Server
Region	
Server
Region	
Server…...
Picture	1.	SHC	architecture
Task
Query
Partition
Filters,	Required	
Columns
RS	start/end	
point
Scan	and	BulkGets:	Grouped	
by	region	server.	
PP P
Scans
BulkGets
WHERE column > x AND column < y maps to a Scan; WHERE column = x maps to a Get (see the sketch below).
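As a concrete illustration (assuming df is the DataFrame loaded through SHC on the usage slides, with the row key mapped to the column "key"):

import sqlContext.implicits._ // for the $"..." column syntax

// Row-key range predicate -> translated into HBase Scans over the matching regions
val ranged = df.filter($"key" >= "row010" && $"key" < "row050")

// Row-key point predicates -> translated into BulkGets, grouped per region server
val points = df.filter($"key" === "row005" || $"key" === "row042")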
15 ©	Hortonworks	 Inc.	2011	– 2016.	All	Rights	Reserved
Usage	&	Demo
16 ©	Hortonworks	 Inc.	2011	– 2017.	All	Rights	Reserved
How	to	Use	SHC?
à Github
– https://github.com/hortonworks-spark/shc	 	
à SHC	Examples
– https://github.com/hortonworks-spark/shc/tree/master/examples
à Apache	HBase Jira
– https://issues.apache.org/jira/browse/HBASE-14789
17 ©	Hortonworks	 Inc.	2011	– 2017.	All	Rights	Reserved
Demo
à Interactive	Jobs	through	Spark Shell
à Batch	Jobs
18 ©	Hortonworks	 Inc.	2011	– 2017.	All	Rights	Reserved
Acknowledgement
à HBase Community	&	Spark	Community
à All	SHC Contributors,	Zhan	Zhang
19 ©	Hortonworks	 Inc.	2011	– 2017.	All	Rights	Reserved
Reference
à Hortonworks	Public	Repo
– http://repo.hortonworks.com/content/repositories/releases/com/hortonworks/
à Apache	Spark
– http://spark.apache.org/
à Apache	HBase
– https://hbase.apache.org/
20 ©	Hortonworks	 Inc.	2011	– 2017.	All	Rights	Reserved
Thanks
Q	&	A
Emails:	
wyang@hortonworks.com
21 ©	Hortonworks	 Inc.	2011	– 2017.	All	Rights	Reserved
BACKUP
22 ©	Hortonworks	 Inc.	2011	– 2017.	All	Rights	Reserved
Kerberos	Cluster
à Kerberos	Ticket
– kinit -kt foo.keytab foouser, or pass a Principal/Keytab
à Long	Running	Service
– --principal, --keytab (see the spark-submit sketch below)
à Multiple	Secure	HBase Clusters
– Spark Only Supports a Single Secure HBase Cluster
– Use the SHC Credential Manager
– Refer to the LRJobAccessing2Clusters Example on GitHub
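For the long-running service case, a typical spark-submit invocation looks like the sketch below. The class name, jar, keytab path and principal are placeholders; --principal, --keytab and --files are standard Spark options.

spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --principal foouser@EXAMPLE.COM \
  --keytab /etc/security/keytabs/foo.keytab \
  --files /etc/hbase/conf/hbase-site.xml \
  --class com.example.LongRunningSHCJob \
  shc-example-assembly.jar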
23 ©	Hortonworks	 Inc.	2011	– 2017.	All	Rights	Reserved
Usage
Define	the	catalog	for	the	schema	mapping:
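The catalog itself is a JSON string mapping DataFrame columns to HBase columns. A minimal sketch consistent with the columns used on the following slides; the table name, column families, and types here are illustrative:

def catalog = s"""{
  |"table":{"namespace":"default", "name":"table1", "tableCoder":"PrimitiveType"},
  |"rowkey":"key",
  |"columns":{
    |"col0":{"cf":"rowkey", "col":"key", "type":"string"},
    |"col1":{"cf":"cf1", "col":"col1", "type":"boolean"},
    |"col4":{"cf":"cf4", "col":"col4", "type":"int"},
    |"col7":{"cf":"cf7", "col":"col7", "type":"string"}
  |}
|}""".stripMargin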
24 ©	Hortonworks	 Inc.	2011	– 2017.	All	Rights	Reserved
Usage
à Prepare	the	data	and	populate	the	HBase table
val data = (0 to 255).map { i => HBaseRecord(i, "extra") }
sc.parallelize(data).toDF.write.options(
  Map(HBaseTableCatalog.tableCatalog -> catalog, HBaseTableCatalog.newTable -> "5"))
  .format("org.apache.spark.sql.execution.datasources.hbase")
  .save()
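HBaseRecord is not shown on the slide; a minimal sketch matching the catalog above (field names, the zero-padded row-key format, and the derived values are assumptions) could be:

case class HBaseRecord(col0: String, col1: Boolean, col4: Int, col7: String)

object HBaseRecord {
  // Zero-padded row keys ("row000" ... "row255") keep lexicographic ordering sensible
  def apply(i: Int, t: String): HBaseRecord =
    HBaseRecord("row%03d".format(i), i % 2 == 0, i, s"String$i: $t")
}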
25 ©	Hortonworks	 Inc.	2011	– 2017.	All	Rights	Reserved
Usage
à Load	the	DataFrame
def withCatalog(cat:	String):	DataFrame =	{
sqlContext
.read
.options(Map(HBaseTableCatalog.tableCatalog->cat))
.format("org.apache.spark.sql.execution.datasources.hbase")
.load()
}
val df =	withCatalog(catalog)
26 ©	Hortonworks	 Inc.	2011	– 2017.	All	Rights	Reserved
Usage
à Query
Language-integrated query:
val s = df.filter(
    ($"col0" <= "row050" && $"col0" > "row040") ||
    ($"col0" === "row005" && ($"col4" === 1 || $"col4" === 42)))
  .select("col0", "col1", "col4")
SQL:
df.registerTempTable("table")
sqlContext.sql("select count(col1) from table").show
27 ©	Hortonworks	 Inc.	2011	– 2017.	All	Rights	Reserved
Usage
à Work	with	different	data	sources
//	Part	1:	write	data	into	Hive	table	and	read	data	from	it
val df1	=	sql("SELECT	*	FROM	shcHiveTable")
// Part 2: read data from HBase table
val df2	=	withCatalog(cat)
//	Part	3:	join the two dataframes
val s1	=	df1.filter($"key"	<=	"40").select("key",	"col1")
val s2	=	df2.filter($"key"	<=	"20"	&&	$"key"	>=	"1").select("key",	"col2")
val result =		s1.join(s2,	Seq("key"))
result.show()
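Optionally, the joined result can be written back to HBase through the same catalog-driven path shown earlier (resultCatalog below is an assumed catalog describing the target table):

result.write.options(
  Map(HBaseTableCatalog.tableCatalog -> resultCatalog, HBaseTableCatalog.newTable -> "5"))
  .format("org.apache.spark.sql.execution.datasources.hbase")
  .save()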
