© 2016 IBM Corporation
HSpark – Enable Spark SQL on NoSQL HBase tables
Bo Meng, Yan Zhou
@IBM Index, Feb 2018
Agenda
 Introduction to HBase
 Mapping HBase to SQL
 HSpark & its Data Types
 HSpark DDL and Query
 To-dos
 Demo
HBase – A short introduction
 HBase is an open source, distributed sorted map modeled after Google's BigTable
 Apache top-level project since 2010
 Apache 2.0 License
 Part of Hadoop ecosystem
 Widely used – Adobe, Airbnb, Facebook, LinkedIn, Netflix, Yahoo, etc.
 Current stable version 1.2.6
HBase is
 Distributed NoSQL/key-value store that uses HDFS for actual storage
 Modeled after Google BigTable
 Column-oriented
 Multiversion Concurrency Control (MVCC)
 Dynamic schema
 Distributed and scalable storage engine
HBase is not
 A SQL database (no SQL parser, optimizer, or relational data model)
 A system that needs a traditional DBA
Advantages of HBase
 High performance
   Row keys are always sorted
   Row-key scans are controlled by different filters
   Columnar storage (column families)
   Support for advanced in-DB processing
 High scalability
   Hadoop
   ZooKeeper
 Other advantages
   Flexible schema
   Real-time ingest and queries
   Fault tolerance
   etc.
SQL vs HBase
Name     Age   Sex
Tom      21    Male
Bob
Andrew (???)

Row Key + Column Key + Timestamp (version) => Value

Row Key    Column Key   Timestamp   Value
0000001H   info:name    123         Tom
0000001H   info:age     123         21
0000001H   info:sex     123         Male
0000002H   info:name    124         Bob
0000002H   info:name    125         Andrew
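
The cell model above can be sketched in a few lines of Python. This only illustrates the logical model (a map from row key, column key and timestamp to a value), not how HBase stores data; the names `cells` and `latest` are hypothetical.

```python
# Logical HBase model: (row key, column key, timestamp) -> value.
cells = {
    ("0000001H", "info:name", 123): "Tom",
    ("0000001H", "info:age", 123): "21",
    ("0000001H", "info:sex", 123): "Male",
    ("0000002H", "info:name", 124): "Bob",
    ("0000002H", "info:name", 125): "Andrew",
}


def latest(cells, row_key):
    """Return {column key: value} for one row, keeping the newest version of each cell."""
    newest = {}  # column key -> (timestamp, value)
    for (row, column, ts), value in cells.items():
        if row == row_key and (column not in newest or ts > newest[column][0]):
            newest[column] = (ts, value)
    return {column: value for column, (ts, value) in newest.items()}


print(latest(cells, "0000001H"))  # {'info:name': 'Tom', 'info:age': '21', 'info:sex': 'Male'}
print(latest(cells, "0000002H"))  # {'info:name': 'Andrew'}
```

Read this way, row 0000002H shows the newest name (Andrew) and simply has no age or sex cells, which is why a fixed SQL row does not map onto it directly.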
DDL using HBase Shell
 create 'test', 'cf' # create table "test" with column family "cf"
 put 'test', 'row1', 'cf:name', 'Bob' # add 3 cells to "test", each addressed by row key and family:qualifier
 put 'test', 'row2', 'cf:name', 'Tom'
 put 'test', 'row3', 'cf:name', 'Andrew'
 list # list all user tables
 scan 'test' # list all the records in the table "test"
 get 'test', 'row1' # list the cells in the table "test" whose row key equals "row1"
 delete 'test', 'row1', 'cf:name' # delete the cell cf:name of row "row1" in table "test"
 disable 'test' # disable the table "test" (required before it can be dropped)
 drop 'test' # delete the table
Mapping HBase to SQL (a possible approach)
HBase                          SQL term
Namespace                      Database
Table                          Table
Row key (multi-dimensional)    Key columns
Column families + qualifiers   Non-key columns
Byte array                     Data types (Int, Double, Float, Long, Date, Timestamp, etc.)
Example
create table customers (
  id int not null,
  name string not null,
  age int not null,
  salary float,
  primary key (id, name)
)
 HBase
   Table name: logical name -> customers; physical name -> hcustomers (for example)
   Row key: (id + name) in byte array format
   Column family + qualifier: column:age, column:salary
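
As a rough sketch of this mapping, the Python below splits one `customers` row into a row key and a set of cells. The helper `to_hbase_cells` is hypothetical, the row key is left as a tuple rather than a byte array, and representing a NULL by the absence of a cell is an assumption, not something the slides state.

```python
KEY_COLUMNS = ["id", "name"]   # primary key (id, name) -> row key
COLUMN_FAMILY = "column"       # family name taken from the example above


def to_hbase_cells(row):
    """Split one relational row into (row key, {family:qualifier: value})."""
    # A real implementation would encode the key columns into one byte array.
    row_key = tuple(row[name] for name in KEY_COLUMNS)
    cells = {
        f"{COLUMN_FAMILY}:{name}": value
        for name, value in row.items()
        if name not in KEY_COLUMNS and value is not None  # assumption: NULL -> no cell
    }
    return row_key, cells


row_key, cells = to_hbase_cells({"id": 7, "name": "Tom", "age": 21, "salary": None})
print(row_key)  # (7, 'Tom')
print(cells)    # {'column:age': 21}
```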
How HSpark fits in
 Combines the Spark optimizer with HBase's filtering and pruning capabilities for a performance edge
 Seamlessly integrated into the Spark ecosystem
 Provides the Spark SQL/Dataset interface to HBase users
 Similar technologies
   Apache Phoenix
   Spark connectors in HBase
HSpark – High Performance Spark on HBase
 Running Spark on HBase
 Leverages Spark's framework: parser, optimizer and execution engine
 HBase is one of the data sources
 The metadata table is also stored in HBase
 Enhanced DDL to manage HBase; the same SQL to query
 Optimal encoding of Spark data into HBase
 Extensible to support other NoSQL DBs
 Advanced predicate analysis to precisely prune partitions/rows/columns
  o Logical disjunction, conjunction and negation are all used to prune the data to be accessed, in contrast to other big data engines where only logical conjunction is supported
    SELECT * FROM students WHERE country = 'US' OR country = 'Canada'
    Only rows for the US or Canada are accessed, instead of a full table scan
  o BulkGet is used to fetch a list of "point data"
 Optimizations based on multi-dimensional row key composition to minimize data scans
 Bulk load into HBase is optimized for tabular data from Spark
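
The effect of pruning on a disjunction can be sketched as follows. This is an illustration only, assuming `country` is the first dimension of a sorted row key; the data and the helper `scan_prefixes` are hypothetical and do not reflect HSpark's actual code.

```python
from bisect import bisect_left

# Sorted row keys, as HBase keeps them; the first key dimension is the country.
rows = sorted([
    ("Brazil", 1), ("Canada", 2), ("Canada", 5), ("France", 3),
    ("Mexico", 4), ("US", 6), ("US", 7), ("Vietnam", 8),
])


def scan_prefixes(rows, countries):
    """Visit only the key ranges whose first dimension is one of `countries`."""
    hits, touched = [], 0
    for country in sorted(set(countries)):
        start = bisect_left(rows, (country,))  # first key >= (country,)
        i = start
        while i < len(rows) and rows[i][0] == country:
            hits.append(rows[i])
            i += 1
        touched += i - start
    return hits, touched


# WHERE country = 'US' OR country = 'Canada' becomes two short range scans.
hits, touched = scan_prefixes(rows, ["US", "Canada"])
print(hits)     # [('Canada', 2), ('Canada', 5), ('US', 6), ('US', 7)]
print(touched)  # 4 of the 8 rows are read
```

Each branch of the OR turns into its own key range, so the work grows with the number of matching rows rather than with the size of the table.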
HSpark Supported Data Types (subset of Spark data types)
 String (variable length)
 Byte (1), Short (2), Int (4), Long (8)
 Float (4), Double (8)
 Boolean (1)
 Date (4)
 Timestamp (8)
 Every data type must be convertible to a byte array and back
 Ordering must be preserved in the byte array form; plain two's complement does not do this, because compared as unsigned bytes -1 sorts after 1:
Integer   Binary (Integer.toBinaryString(), padded to 32 bits)
1         00000000000000000000000000000001
-1        11111111111111111111111111111111
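
The slides do not spell out the encoding HSpark uses to preserve order. One common technique, shown here as a sketch under that assumption, is to flip the sign bit of the big-endian value so that unsigned byte-wise comparison agrees with numeric order. The function names are hypothetical.

```python
import struct


def raw_int32(value):
    """Plain big-endian two's complement, as in the table above."""
    return struct.pack(">i", value)


def ordered_int32(value):
    """Flip the sign bit so unsigned byte-wise comparison matches numeric order."""
    return struct.pack(">I", (value & 0xFFFFFFFF) ^ 0x80000000)


def decode_ordered_int32(data):
    """Inverse of ordered_int32."""
    unsigned = struct.unpack(">I", data)[0] ^ 0x80000000
    return unsigned - (1 << 32) if unsigned >= (1 << 31) else unsigned


values = [-2, -1, 0, 1, 2]
print(sorted(values, key=raw_int32))      # [0, 1, 2, -2, -1]
print(sorted(values, key=ordered_int32))  # [-2, -1, 0, 1, 2]
```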
HSpark Supported Data Types (subset of Spark data types)
 Use as few bytes as possible
 Example: key columns (Date, String, Boolean) -> row key
 Solution 1: column length (3) + offset 1 + column 1 length + offset 2 + column 2 length + offset 3 + column 3 length + data 1 (date) + data 2 (string) + data 3 (boolean)
 Solution 2: data 1 (4 bytes) + data 2 (string) + 00H + data 3 (1 byte)
 Create the byte array as fast as possible (reuse the memory block, etc.)
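
Solution 2 can be sketched as below. The layout (4-byte date, string, 00H delimiter, 1-byte boolean) follows the slide; storing the date as sign-flipped days since 1970-01-01 and rejecting strings that contain 00H are assumptions made for this example, and the function names are hypothetical.

```python
import struct
from datetime import date, timedelta

EPOCH = date(1970, 1, 1)


def encode_row_key(day, text, flag):
    """Row key = date (4 bytes) + string + 00H + boolean (1 byte)."""
    days = (day - EPOCH).days
    date_part = struct.pack(">I", (days & 0xFFFFFFFF) ^ 0x80000000)  # order-preserving
    text_part = text.encode("utf-8")
    if b"\x00" in text_part:
        raise ValueError("string key columns must not contain the 00H delimiter")
    return date_part + text_part + b"\x00" + (b"\x01" if flag else b"\x00")


def decode_row_key(key):
    """Inverse of encode_row_key."""
    unsigned = struct.unpack(">I", key[:4])[0] ^ 0x80000000
    days = unsigned - (1 << 32) if unsigned >= (1 << 31) else unsigned
    end = key.index(b"\x00", 4)  # delimiter that terminates the variable-length string
    return EPOCH + timedelta(days=days), key[4:end].decode("utf-8"), key[end + 1] == 1


key = encode_row_key(date(2018, 2, 20), "Tom", True)
print(len(key))             # 9 bytes: 4 + 3 + 1 + 1
print(decode_row_key(key))  # (datetime.date(2018, 2, 20), 'Tom', True)
```

Only the variable-length column needs a delimiter, so the key carries one byte of overhead instead of the count, offsets and lengths of Solution 1, and byte order still follows (date, string, boolean) order.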
Converting the predicates into HBase domain
 Logical predicates have already been optimized by Spark
 Predicates are divided into two groups: those HBase can handle, and the rest
 For predicates HBase can handle
   "Not" is pushed down until it is eliminated
   Predicates are reduced based on HBase regions
   Key columns (row keys) and non-key columns (column families) are handled with different filters
 Scan the data set with the filters
 Construct the final result from the scan
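
Pushing "Not" down until it disappears can be sketched with De Morgan's laws on a small expression tree. The tuple representation and the function `eliminate_not` are hypothetical and only illustrate the idea; they are not HSpark's internal predicate classes.

```python
# Comparison operators and their negations.
NEGATED = {"=": "!=", "!=": "=", "<": ">=", ">=": "<", ">": "<=", "<=": ">"}


def eliminate_not(expr, negate=False):
    """Rewrite a predicate tree so that it contains no "not" nodes.

    Nodes are tuples: ("not", e), ("and", a, b), ("or", a, b),
    or a leaf ("cmp", column, operator, value).
    """
    kind = expr[0]
    if kind == "not":
        return eliminate_not(expr[1], not negate)
    if kind in ("and", "or"):
        if negate:  # De Morgan: not (a and b) == (not a) or (not b)
            kind = "or" if kind == "and" else "and"
        return (kind, eliminate_not(expr[1], negate), eliminate_not(expr[2], negate))
    _, column, op, value = expr
    return ("cmp", column, NEGATED[op] if negate else op, value)


expr = ("not", ("and", ("cmp", "age", "<", 18),
                       ("not", ("cmp", "country", "=", "US"))))
print(eliminate_not(expr))
# ('or', ('cmp', 'age', '>=', 18), ('cmp', 'country', '=', 'US'))
```

Once only comparisons, "and" and "or" remain, each comparison on a key column can be turned into a row-key range and the rest into column filters.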
HSpark syntax
 create database
 create table
 insert data
 load data
 query
 drop table
 drop database
 …
 The Spark SQL parser parses each statement; HBase-specific information is added as properties
 All Spark SQL queries are supported as-is
Enabling quick tests
 HSpark SQL shell
   Try out HSpark in the shell
   Integrates with Spark job submission
 Python shell
HSpark statistics
 LOC: 9341 (main), 5902 (test)
 The version number matches the Spark version number
   Currently 2.2.0
 JDK 8, Scala 2.11.8
 Spark 2.2.0
 Source code is available on GitHub
 Current developers: Yan Zhou and me
 The README explains how to set up the environment and run the tests
 Contributors and testers (any improvements) are welcome
HSpark performance
 Queries (TPC-DS, 10M records)
   1-key range
    select count(1) from store_sales where (ss_item_sk = 99 and ss_ticket_number > 1000)
   2-key range
    select count(1) from store_sales where (ss_item_sk = 99 and ss_ticket_number > 1000) or (ss_item_sk = 5000 and ss_ticket_number < 20000)
   3-key range
    select count(1) from store_sales where (ss_item_sk = 99 and ss_ticket_number > 1000) or (ss_item_sk = 5000 and ss_ticket_number < 20000) or (ss_item_sk = 28000 and ss_ticket_number <= 10000)
   Aggregate on the 2nd key
    select count(1) from store_sales group by ss_ticket_number
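HSpark performance (TPC-DS, 10M records)
[Bar chart "Queries": categories 1-key, 2-key, 3-key, Aggregation; series Phoenix and HSpark; data labels in extraction order: 0.03, 4.29, 4.44, 79 and 0.18, 0.22, 0.27, 37]
[Bar chart "Bulk load": categories No presplit, 6 presplits; series Phoenix and HSpark; data labels in extraction order: 1093, 762, 557, 185]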
HSpark To-Dos
 More performance benchmarking (ongoing)
 Support for more data types: Decimal, Array, Map, etc.
 HBase coprocessor support
 More tests, documentation and code improvements
 Update the entry on the Spark Packages website
Future Plans
 Support of other NOSQL/KV stores
 Graph DB support
 Transaction Support
HSpark Demo (Video)
 Links:
   IBM code pattern
    https://developer.ibm.com/code/patterns/use-spark-sql-to-access-nosql-hbase-tables/
   YouTube video
    https://www.youtube.com/watch?v=E1GPJMn0qF0
    https://www.youtube.com/watch?v=f4PvL6E1LOo
 Use HSpark shell
   Create the table
   Import data
   Queries

