Hspark index conf

© 2016 IBM Corporation
HSpark – Enable Spark SQL on NoSQL HBase tables
Bo Meng, Yan Zhou
@IBM Index, Feb 2018

Agenda
 Introduction of HBase
 Mapping HBase to SQL
 HSpark & its Data Types
 HSpark DDL and Query
 To-dos
 Demo

HBase – A short introduction
 HBase is an open source, distributed sorted map modeled after Google's BigTable
 Apache top-level project since 2010
 Apache 2.0 License
 Part of Hadoop ecosystem
 Widely used – Adobe, Airbnb, Facebook, LinkedIn, Netflix, Yahoo, etc.
 Current stable version 1.2.6

HBase is
 Distributed NoSQL/key-value store that uses HDFS for actual storage
 Modeled after Google BigTable
 Column-oriented
 Multiversion Concurrency Control (MVCC)
 Dynamic Schema
 Distributed and Scalable Storage Engine
HBase is not
 A SQL Database (SQL parser, optimizers and relational data models, etc.)
 No traditional DBA needed

Advantage of HBase
 High performance
 Row keys are always sorted
 Row key scans can be narrowed with filters
 Columnar storage (column families)
 Support of advanced in-DB processing
 High Scalability
 Hadoop
 ZooKeeper
 Other advantages
 Flexible schema
 Real-time ingests/queries
 Fault-tolerance
 etc.

SQL vs HBase
Name     Age   Sex
Tom      21    Male
Bob
Andrew (???)
Row Key + Column Key + Timestamp (version) => Value
Row Key    Column Key   Timestamp   Value
0000001H   info:name    123         Tom
0000001H   info:age     123         21
0000001H   info:sex     123         Male
0000002H   info:name    124         Bob
0000002H   info:name    125         Andrew
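
The key-to-value mapping above can be sketched as a plain dictionary. This is only an illustrative in-memory model, not HBase client code; the names `cells`, `get_latest` and `get_row` are hypothetical. It shows why "Andrew" has no natural slot in the relational table: it is a newer version of the same cell that held "Bob".

```python
# Hypothetical in-memory model of the logical layout:
# (row key, column key, timestamp) -> value
cells = {
    ("0000001H", "info:name", 123): "Tom",
    ("0000001H", "info:age", 123): "21",
    ("0000001H", "info:sex", 123): "Male",
    ("0000002H", "info:name", 124): "Bob",
    ("0000002H", "info:name", 125): "Andrew",
}


def get_latest(cells, row_key, column_key):
    """Return the value with the highest timestamp for one cell, or None."""
    versions = [
        (ts, value)
        for (row, col, ts), value in cells.items()
        if row == row_key and col == column_key
    ]
    return max(versions)[1] if versions else None


def get_row(cells, row_key):
    """Assemble the newest value of every column present in a row."""
    columns = {col for (row, col, _ts) in cells if row == row_key}
    return {col: get_latest(cells, row_key, col) for col in sorted(columns)}
```

Reading row 0000002H returns only the newest name, and the row simply has no age or sex cells, which is what "dynamic schema" means in practice.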

DDL using HBase Shell
 create 'test', 'cf' # create table "test" with column family "cf"
 put 'test', 'row1', 'cf:name', 'Bob' # add 3 records to "test", each addressed by column family + qualifier
 put 'test', 'row2', 'cf:name', 'Tom'
 put 'test', 'row3', 'cf:name', 'Andrew'
 list # list all user tables
 scan 'test' # list all records in table "test"
 get 'test', 'row1' # list all cells in table "test" whose row key equals "row1"
 delete 'test', 'row1', 'cf:name' # delete the cell "cf:name" of row "row1" in table "test"
 disable 'test' # disable table "test" (required before it can be dropped)
 drop 'test' # delete the table

Mapping HBase to SQL (a possible approach)
HBase                          SQL term
Namespace                      Database
Table                          Table
Row key (multi-dimensional)    Key columns
Column families + qualifiers   Non-key columns
Byte array                     Data types (Int, Double, Float, Long, Date, Timestamp, etc.)

Example
create table customers (id int not null, name string not null, age int not null, salary float,
primary key (id, name))
 HBase
Table name: logical name -> customers, physical name -> hcustomers (for example)
Rowkey: (id + name) in byte array format
Column Family + Qualifier: column:age, column:salary

How HSpark fits in
 Combines the Spark optimizer with HBase filtering/pruning capabilities for a performance edge
 Seamlessly integrated into the Spark ecosystem
 Provides a Spark SQL/Dataset interface to HBase users
 Similar Technologies
 Apache Phoenix
 Spark connectors in HBase

HSpark – High Performance Spark on HBase
 Running Spark on HBase
 Leverage Spark’s framework such as parser, optimizer and execution
 HBase will be one of the data sources
 Metadata table will also be stored on HBase
 Uses enhanced DDL to manage HBase and the same SQL to query
 Optimal Spark Data Encoding into HBase
 Extensible to support other NoSQL DBs
 Advanced predicate analysis to precisely prune partitions/rows/columns
o Logical disjunction, conjunction and negation are all used to prune the data to be accessed, in contrast to other big data engines where only logical conjunction is supported
SELECT * FROM students WHERE country = 'US' OR country = 'Canada'
Only rows for the US or Canada are accessed, instead of a full table scan
o BulkGet is used to fetch a list of “point data”
 Optimizations based upon multi-dimensional row key compositions to minimize data scans
 Bulk load into HBase is optimized for tabular data from Spark
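
A minimal sketch of how a disjunction can be turned into a small set of key ranges instead of a full scan. It is not HSpark's implementation: it assumes `country` is the leading part of the row key, uses a hypothetical `|` separator, and simulates a sorted table with a Python list. The names `prefix_range`, `union_ranges` and `scan` are illustrative only.

```python
from bisect import bisect_left


def prefix_range(prefix: bytes):
    """Half-open [start, stop) range covering every key that starts with prefix."""
    # Assumes the prefix is non-empty and its last byte is not 0xFF.
    return (prefix, prefix[:-1] + bytes([prefix[-1] + 1]))


def union_ranges(ranges):
    """Logical OR: merge half-open ranges that overlap or touch."""
    merged = []
    for start, stop in sorted(ranges):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], stop))
        else:
            merged.append((start, stop))
    return merged


def scan(sorted_keys, ranges):
    """Visit only the keys inside the given ranges of a sorted key list."""
    hits = []
    for start, stop in ranges:
        lo = bisect_left(sorted_keys, start)
        hi = bisect_left(sorted_keys, stop)
        hits.extend(sorted_keys[lo:hi])
    return hits
```

For `country = 'US' OR country = 'Canada'`, the two prefix ranges are merged with `union_ranges` and passed to `scan`, so rows for other countries are never touched.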

HSpark Supported Data Types (subset of Spark data types)
 String (variable length)
 Byte (1), Short (2), Int (4), Long (8)
 Float (4), Double (8)
 Boolean (1)
 Date (4)
 Timestamp (8)
 Every data type must convert to and from a byte array
 The ordering must be preserved in byte-array form; plain two's complement does not do this for signed integers, since -1 compares greater than 1 byte-wise
Integer   Binary using Integer.toBinaryString()
1         00000000000000000000000000000001
-1        11111111111111111111111111111111
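
One common way to make signed integers sort correctly as bytes is to flip the sign bit before writing them big-endian. The sketch below illustrates that general technique; it is an assumption that HSpark does something similar, and the function names `encode_int32` and `decode_int32` are hypothetical.

```python
import struct


def encode_int32(value: int) -> bytes:
    """Encode a signed 32-bit int so that byte order matches numeric order."""
    # Flipping the sign bit maps -2**31..2**31-1 onto 0..2**32-1 monotonically.
    return struct.pack(">I", (value & 0xFFFFFFFF) ^ 0x80000000)


def decode_int32(data: bytes) -> int:
    """Inverse of encode_int32."""
    (raw,) = struct.unpack(">I", data)
    # Undo the sign-bit flip, then reinterpret the 4 bytes as a signed int.
    return struct.unpack(">i", struct.pack(">I", raw ^ 0x80000000))[0]
```

With this encoding, -1 becomes 7FFFFFFF and 1 becomes 80000001, so a byte-wise comparison places -1 before 1.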

HSpark Supported Data Types (subset of Spark data types)
 Use as few bytes as possible
 Example: Key columns (Date, String, Boolean) -> Row key
 Solution 1: column length (3) + offset 1 + column 1 length + offset 2 + column 2 length + offset 3 + column 3 length + data 1 (date) + data 2 (string) + data 3 (boolean)
 Solution 2: data 1 (4 bytes) + data 2 (string) + 00H + data 3 (1 byte)
 Create the byte array as fast as possible (reuse the memory block, etc.)
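
Solution 2 can be sketched as follows. This is an illustration under stated assumptions, not HSpark's encoder: the date is stored as days since 1970-01-01 in 4 order-preserving bytes, the string is UTF-8 and must not contain a 00H byte, and the boolean is a single byte. The names `encode_row_key` and `decode_row_key` are hypothetical.

```python
import struct
from datetime import date, timedelta

EPOCH = date(1970, 1, 1)


def encode_row_key(day: date, name: str, flag: bool) -> bytes:
    """Layout: date (4 bytes) + string + 00H terminator + boolean (1 byte)."""
    days = (day - EPOCH).days
    # Sign-bit flip keeps dates before the epoch sorted ahead of later ones.
    date_part = struct.pack(">I", (days & 0xFFFFFFFF) ^ 0x80000000)
    name_part = name.encode("utf-8")
    if b"\x00" in name_part:
        raise ValueError("string key column must not contain a 00H byte")
    return date_part + name_part + b"\x00" + (b"\x01" if flag else b"\x00")


def decode_row_key(key: bytes):
    """Split a row key back into (date, string, boolean)."""
    (raw,) = struct.unpack(">I", key[:4])
    days = raw ^ 0x80000000
    if days >= 0x80000000:  # reinterpret as a signed 32-bit value
        days -= 1 << 32
    end = key.index(b"\x00", 4)  # terminator of the variable-length string
    return (EPOCH + timedelta(days=days), key[4:end].decode("utf-8"), key[end + 1] == 1)
```

Only the variable-length column needs a terminator, so the key for (Date, String, Boolean) costs 6 bytes plus the string, with no offset or length table.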

Converting the predicates into HBase domain
 Logical predicates are already optimized by Spark
 Predicates are divided into 2 groups: those HBase can handle, and the rest
 For HBase-doable predicates
 “Not” will be pushed down (eliminate “Not”)
 Reduce the predicates based on HBase regions
 Handle key columns (rowkeys) and non-key columns (column families) using different filters
 Scan the data set with filters
 Construct the final result based on the scan
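
The "Not" elimination step can be illustrated with De Morgan's laws and flipped comparison operators. The sketch below uses a hypothetical tuple representation of predicates, not Spark's or HSpark's expression classes, and `push_not_down` is an illustrative name.

```python
# Each comparison operator and its logical negation.
NEGATED_OP = {"=": "!=", "!=": "=", "<": ">=", ">=": "<", ">": "<=", "<=": ">"}


def push_not_down(pred, negate=False):
    """Rewrite a predicate tree so that it contains no "not" nodes.

    Node shapes (hypothetical):
      ("not", p), ("and", p, q), ("or", p, q), ("cmp", column, op, value)
    """
    kind = pred[0]
    if kind == "not":
        # A "not" toggles the pending negation and disappears.
        return push_not_down(pred[1], not negate)
    if kind in ("and", "or"):
        if negate:
            # De Morgan: not (p and q) == (not p) or (not q), and vice versa.
            kind = "or" if kind == "and" else "and"
        return (kind, push_not_down(pred[1], negate), push_not_down(pred[2], negate))
    _, column, op, value = pred
    return ("cmp", column, NEGATED_OP[op] if negate else op, value)
```

After this rewrite every leaf is a plain comparison, which is the form that maps directly onto key ranges and column filters.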

HSpark syntax
 create database
 create table
 insert data
 load data
 query
 drop table
 drop database
 …
 The Spark SQL parser parses each statement; HBase-specific information is added as properties
 All Spark SQL queries are supported, literally

Enabling the quick test
 HSpark SQL shell
 test out the HSpark in the shell
 integrate with Spark job submit
 Python shell

HSpark statistics
 LOC: 9341 (main), 5902 (test)
 Version number will be same as Spark version number
 Currently 2.2.0
 JDK 8, Scala 2.11.8
 Spark 2.2.0
 Source code is available on GitHub
 Current devs: Yan Zhou and me
 README has information to set up the environment and run the tests
 Contributors / Testers (any improvements) are welcome

HSpark performance
 Queries (TPC-DS, 10M records)
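 1-key range
select count(1) from store_sales where (ss_item_sk = 99 and ss_ticket_number > 1000)
 2-key range
select count(1) from store_sales where (ss_item_sk = 99 and ss_ticket_number > 1000) or (ss_item_sk = 5000 and ss_ticket_number < 20000)
 3-key range
select count(1) from store_sales where (ss_item_sk = 99 and ss_ticket_number > 1000) or (ss_item_sk = 5000 and ss_ticket_number < 20000) or (ss_item_sk = 28000 and ss_ticket_number <= 10000)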
 Aggregate on the 2nd key
 select count(1) from store_sales group by ss_ticket_number

HSpark performance (TPC-DS, 10M records)
[Chart: Queries, Phoenix vs. HSpark, for 1-key, 2-key, 3-key and Aggregation; data labels: 0.03, 4.29, 4.44, 79, 0.18, 0.22, 0.27, 37]
[Chart: Bulk load, Phoenix vs. HSpark, with no presplit and 6 presplits; data labels: 1093, 762, 557, 185]

HSpark To-Dos
 More performance benchmarking (on-going)
 More data types support – Decimal, Array, Map, etc.
 HBase co-processor support
 More tests / documentation / code improvements
 Update on Spark Package website

Future Plans
 Support of other NOSQL/KV stores
 Graph DB support
 Transaction Support

HSpark Demo (Video)
 Links:
 IBM code pattern
https://developer.ibm.com/code/patterns/use-spark-sql-to-access-nosql-hbase-tables/
 Youtube video
https://www.youtube.com/watch?v=E1GPJMn0qF0
https://www.youtube.com/watch?v=f4PvL6E1LOo
 Use HSpark shell
 Create the table
 Import data
 Queries
