1 ©	Hortonworks	 Inc.	2011	– 2017.	All	Rights	Reserved
Apache	Spark	– Apache	HBase Connector
Feature	Rich	and	Efficient	Access	to	HBase
through	Spark	SQL
Weiqing Yang	
Mingjie Tang	
October,	2017
2 ©	Hortonworks	 Inc.	2011	– 2017.	All	Rights	Reserved
About	Authors
à Weiqing Yang
• Contribute	to	Apache	Spark,	Apache	Hadoop,	Apache	HBase,	Apache	Ambari
• Software	Engineer	at	Hortonworks
à Mingjie Tang
• Spark SQL, Spark MLlib, Spark Streaming, Data Mining, Machine Learning
• Software	Engineer	at	Hortonworks
à …	All	Other	SHC	Contributors
3 ©	Hortonworks	 Inc.	2011	– 2016.	All	Rights	Reserved
Agenda
Motivation
Overview
Architecture	&	Implementation
Usage	&	Demo
4 ©	Hortonworks	 Inc.	2011	– 2017.	All	Rights	Reserved
Motivation
à Limited	Spark	Support	in	HBase Upstream
– RDD	level
– But	Spark	Is	Moving	to	DataFrame/Dataset
à Existing	Connectors	in	DataFrame Level
– Complicated	Design
• Embedding	Optimization	Plan	inside	Catalyst	Engine
• Stability	Impact	with	Coprocessor
• Serialized	RDD	Lineage	to	HBase
– Heavy	Maintenance	Overhead
5 ©	Hortonworks	 Inc.	2011	– 2016.	All	Rights	Reserved
Overview
6 ©	Hortonworks	 Inc.	2011	– 2017.	All	Rights	Reserved
Apache Spark – Apache HBase Connector (SHC)
à Combine	Spark	and	HBase
– Spark	Catalyst	Engine	for	Query	Plan	and	Optimization
– HBase as	Fast	Access	KV	Store
– Implement a Standard External Data Source with Built-in Filters; Easy to Maintain
à Full-Fledged DataFrame Support
– Spark SQL
– Language-Integrated Query
à High	Performance
– Partition	Pruning,	Data	Locality,	Column	Pruning,	Predicate	Pushdown
– Use the Spark unhandledFilters API (see the sketch below)
– Cache	Spark	HBase Connections
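A note on the unhandledFilters API: a Spark BaseRelation reports back the filters it could not evaluate itself, and Spark re-applies only those, so predicates the connector fully handles are not evaluated twice. The sketch below is illustrative only; RowKeyFilteringRelation and rowKeyColumn are assumed names, not SHC's actual classes.

import org.apache.spark.sql.sources.{BaseRelation, EqualTo, Filter, GreaterThan, LessThan}

// Illustrative sketch: claim simple row-key comparisons as fully handled by the
// data source; return everything else so Spark re-applies those filters itself.
trait RowKeyFilteringRelation extends BaseRelation {
  def rowKeyColumn: String

  override def unhandledFilters(filters: Array[Filter]): Array[Filter] =
    filters.filterNot {
      case EqualTo(attr, _)     => attr == rowKeyColumn
      case GreaterThan(attr, _) => attr == rowKeyColumn
      case LessThan(attr, _)    => attr == rowKeyColumn
      case _                    => false
    }
}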
7 ©	Hortonworks	 Inc.	2011	– 2017.	All	Rights	Reserved
Data	Coder	&	Data	Schema
à Support	Different	Data	Coders
– PrimitiveType: Native Support for Java Primitive Types
– Avro:	Native	Support	Avro	Encoding/Decoding
– Phoenix:	Phoenix	Encoding/Decoding
– Plug-In	Data	Coder
– Can Run on Top of Existing HBase Tables
à Support	Composite	Key
– def cat = s"""{
|"table":{"namespace":"default", "name":"shcExampleTable", "tableCoder":"Phoenix"},
|"rowkey":"key1:key2",
|"columns":{
|"col00":{"cf":"rowkey", "col":"key1", "type":"string"},
|"col01":{"cf":"rowkey", "col":"key2", "type":"int"},
…
...
8 ©	Hortonworks	 Inc.	2011	– 2016.	All	Rights	Reserved
Architecture	&	Implementation
9 ©	Hortonworks	 Inc.	2011	– 2017.	All	Rights	Reserved
Architecture
…...
Driver
Executor Executor Executor
Region	
Server
Region	
Server
Region	
Server…...
Spark
HBase
Picture	1.	SHC	architecture
Host	1
10 ©	Hortonworks	 Inc.	2011	– 2017.	All	Rights	Reserved
Architecture
…...
Driver
Executor Executor Executor
Region	
Server
Region	
Server
Region	
Server…...
Picture	1.	SHC	architecture
Task
Query
Partition
Filters,	Required	
Columns
RS	start/end	
point
sqlContext.sql("select	
count(col1)	from	table1	
where	key	<	'row050'")
PP P
Scans
BulkGets
11 ©	Hortonworks	 Inc.	2011	– 2017.	All	Rights	Reserved
Implementation
…...
Driver
Executor Executor Executor
Region	
Server
Region	
Server
Region	
Server…...
Picture	1.	SHC	architecture
Task
Query
Partition
Filters, Required Columns
Partition Pruning: Tasks Only Performed in the Region Servers Holding the Requested Data
PP P
Scans
BulkGets
Filters -> Multiple Scan Ranges ∩ (Start point, End point)
RS	start/end	
point
12 ©	Hortonworks	 Inc.	2011	– 2017.	All	Rights	Reserved
Implementation
…...
Driver
Executor Executor Executor
Region	
Server
Region	
Server
Region	
Server…...
Picture	1.	SHC	architecture
Task
Query
Partition
Filters,	Required	
Columns
RS	start/end	
point
Data	Locality:	Move	
Computation	to	Data.	
PP P
Scans
BulkGets
RDD partitions have a preferred location:
override def getPreferredLocations(partition: Partition): Seq[String] =
  Seq(regionServer.hostName)
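For context, this is how a Spark RDD generally advertises data locality to the scheduler. A minimal generic sketch; HBaseScanPartition and HBaseScanRDD are illustrative names, not SHC's actual classes.

import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

// Each partition remembers the host of the region server it will read from.
case class HBaseScanPartition(index: Int, regionServerHost: String) extends Partition

class HBaseScanRDD(sc: SparkContext, parts: Seq[HBaseScanPartition])
  extends RDD[String](sc, Nil) {

  override protected def getPartitions: Array[Partition] = parts.toArray[Partition]

  // The scheduler prefers launching each task on the region server holding the data.
  override protected def getPreferredLocations(split: Partition): Seq[String] =
    Seq(split.asInstanceOf[HBaseScanPartition].regionServerHost)

  override def compute(split: Partition, context: TaskContext): Iterator[String] =
    Iterator.empty // a real implementation would issue the Scans/BulkGets here
}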
13 ©	Hortonworks	 Inc.	2011	– 2017.	All	Rights	Reserved
Implementation
…...
Driver
Executor Executor Executor
Region	
Server
Region	
Server
Region	
Server…...
Picture	1.	SHC	architecture
Task
Query
Partition
Filters,	Required	
Columns
RS	start/end	
point
Column Pruning: Required Columns
Predicate Pushdown: HBase Built-in Filters
PP P
Filters,	Required	
Columns
Filters,	Required	
Columns
Scans
BulkGets
Filters,	Required	
Columns
14 ©	Hortonworks	 Inc.	2011	– 2017.	All	Rights	Reserved
Implementation
…...
Driver
Executor Executor Executor
Region	
Server
Region	
Server
Region	
Server…...
Picture	1.	SHC	architecture
Task
Query
Partition
Filters,	Required	
Columns
RS	start/end	
point
Scan	and	BulkGets:	Grouped	
by	region	server.	
PP P
Scans
BulkGets
WHERE column > x AND column < y maps to a Scan; WHERE column = x maps to a Get (see the sketch below).
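As a concrete illustration (assuming df is the DataFrame loaded through SHC on the usage slides, with the row key mapped to the column "key"):

import sqlContext.implicits._ // for the $"..." column syntax

// Row-key range predicate -> translated into HBase Scans over the matching regions
val ranged = df.filter($"key" >= "row010" && $"key" < "row050")

// Row-key point predicates -> translated into BulkGets, grouped per region server
val points = df.filter($"key" === "row005" || $"key" === "row042")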
15 ©	Hortonworks	 Inc.	2011	– 2016.	All	Rights	Reserved
Usage	&	Demo
16 ©	Hortonworks	 Inc.	2011	– 2017.	All	Rights	Reserved
How	to	Use	SHC?
à Github
– https://github.com/hortonworks-spark/shc	 	
à SHC	Examples
– https://github.com/hortonworks-spark/shc/tree/master/examples
à Apache	HBase Jira
– https://issues.apache.org/jira/browse/HBASE-14789
17 ©	Hortonworks	 Inc.	2011	– 2017.	All	Rights	Reserved
Demo
à Interactive	Jobs	through	Spark Shell
à Batch	Jobs
18 ©	Hortonworks	 Inc.	2011	– 2017.	All	Rights	Reserved
Acknowledgement
à HBase Community	&	Spark	Community
à All	SHC Contributors,	Zhan	Zhang
19 ©	Hortonworks	 Inc.	2011	– 2017.	All	Rights	Reserved
Reference
à Hortonworks	Public	Repo
– http://repo.hortonworks.com/content/repositories/releases/com/hortonworks/
à Apache	Spark
– http://spark.apache.org/
à Apache	HBase
– https://hbase.apache.org/
20 ©	Hortonworks	 Inc.	2011	– 2017.	All	Rights	Reserved
Thanks
Q	&	A
Emails:	
wyang@hortonworks.com
21 ©	Hortonworks	 Inc.	2011	– 2017.	All	Rights	Reserved
BACKUP
22 ©	Hortonworks	 Inc.	2011	– 2017.	All	Rights	Reserved
Kerberos	Cluster
à Kerberos	Ticket
– kinit -kt foo.keytab foouser, or pass a Principal/Keytab
à Long	Running	Service
– --principal, --keytab (see the spark-submit sketch below)
à Multiple	Secure	HBase Clusters
– Spark Only Supports a Single Secure HBase Cluster
– Use the SHC Credential Manager
– Refer to the LRJobAccessing2Clusters Example on GitHub
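For the long-running service case, a typical spark-submit invocation looks like the sketch below. The class name, jar, keytab path and principal are placeholders; --principal, --keytab and --files are standard Spark options.

spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --principal foouser@EXAMPLE.COM \
  --keytab /etc/security/keytabs/foo.keytab \
  --files /etc/hbase/conf/hbase-site.xml \
  --class com.example.LongRunningSHCJob \
  shc-example-assembly.jar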
23 ©	Hortonworks	 Inc.	2011	– 2017.	All	Rights	Reserved
Usage
Define	the	catalog	for	the	schema	mapping:
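The catalog itself is a JSON string mapping DataFrame columns to HBase columns. A minimal sketch consistent with the columns used on the following slides; the table name, column families, and types here are illustrative:

def catalog = s"""{
  |"table":{"namespace":"default", "name":"table1", "tableCoder":"PrimitiveType"},
  |"rowkey":"key",
  |"columns":{
    |"col0":{"cf":"rowkey", "col":"key", "type":"string"},
    |"col1":{"cf":"cf1", "col":"col1", "type":"boolean"},
    |"col4":{"cf":"cf4", "col":"col4", "type":"int"},
    |"col7":{"cf":"cf7", "col":"col7", "type":"string"}
  |}
|}""".stripMargin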
24 ©	Hortonworks	 Inc.	2011	– 2017.	All	Rights	Reserved
Usage
à Prepare	the	data	and	populate	the	HBase table
val data = (0 to 255).map { i => HBaseRecord(i, "extra") }
sc.parallelize(data).toDF.write.options(
  Map(HBaseTableCatalog.tableCatalog -> catalog, HBaseTableCatalog.newTable -> "5"))
  .format("org.apache.spark.sql.execution.datasources.hbase")
  .save()
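HBaseRecord is not shown on the slide; a minimal sketch matching the catalog above (field names, the zero-padded row-key format, and the derived values are assumptions) could be:

case class HBaseRecord(col0: String, col1: Boolean, col4: Int, col7: String)

object HBaseRecord {
  // Zero-padded row keys ("row000" ... "row255") keep lexicographic ordering sensible
  def apply(i: Int, t: String): HBaseRecord =
    HBaseRecord("row%03d".format(i), i % 2 == 0, i, s"String$i: $t")
}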
25 ©	Hortonworks	 Inc.	2011	– 2017.	All	Rights	Reserved
Usage
à Load	the	DataFrame
def withCatalog(cat:	String):	DataFrame =	{
sqlContext
.read
.options(Map(HBaseTableCatalog.tableCatalog->cat))
.format("org.apache.spark.sql.execution.datasources.hbase")
.load()
}
val df =	withCatalog(catalog)
26 ©	Hortonworks	 Inc.	2011	– 2017.	All	Rights	Reserved
Usage
à Query
Language-integrated query:
val s = df.filter(
    ($"col0" <= "row050" && $"col0" > "row040") ||
    ($"col0" === "row005" && ($"col4" === 1 || $"col4" === 42)))
  .select("col0", "col1", "col4")
SQL:
df.registerTempTable("table")
sqlContext.sql("select count(col1) from table").show
27 ©	Hortonworks	 Inc.	2011	– 2017.	All	Rights	Reserved
Usage
à Work	with	different	data	sources
//	Part	1:	write	data	into	Hive	table	and	read	data	from	it
val df1	=	sql("SELECT	*	FROM	shcHiveTable")
// Part 2: read data from HBase table
val df2	=	withCatalog(cat)
//	Part	3:	join the two dataframes
val s1	=	df1.filter($"key"	<=	"40").select("key",	"col1")
val s2	=	df2.filter($"key"	<=	"20"	&&	$"key"	>=	"1").select("key",	"col2")
val result =		s1.join(s2,	Seq("key"))
result.show()
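Optionally, the joined result can be written back to HBase through the same catalog-driven path shown earlier (resultCatalog below is an assumed catalog describing the target table):

result.write.options(
  Map(HBaseTableCatalog.tableCatalog -> resultCatalog, HBaseTableCatalog.newTable -> "5"))
  .format("org.apache.spark.sql.execution.datasources.hbase")
  .save()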
