Apache HBase Workshop

HBase Workshop
Moisieienko Valerii
Big Data Morning@Lohika

Agenda
1. What is Apache HBase?
2. HBase data model
3. CRUD operations
4. HBase architecture
5. HBase schema design
6. Java API

Apache HBase is
• Open source project built on top of Apache Hadoop
• NoSQL database
• Distributed, scalable datastore
• Column-family datastore

Use cases
Time Series Data
• Sensor, System metrics, Events, Log files
• User Activity
• High-volume, high-velocity writes
Information Exchange
• Email, Chat, Inbox
• High-volume, high-velocity reads and writes
Enterprise Application Backend
• Online Catalog
• Search Index
• Pre-Computed View
• High Volume, Velocity Reads

Data model overview
Component                  Description
Table                      Data is organized into tables
RowKey                     Data is stored in rows; rows are identified by RowKeys
Region                     Rows are grouped into Regions
Column Family              Columns are grouped into families
Column Qualifier (Column)  Identifies the column
Cell                       Combination of the row key, column family, column, and timestamp; contains the value
Version                    Values within a cell are versioned by version number → timestamp

Data model: Rows
RowKey
contacts accounts …
mobile email skype UAH USD …
084ab67e VAL VAL
2333bbac VAL VAL
342bbecc VAL
4345235b VAL
565c4f8f VAL VAL VAL
675555ab VAL VAL VAL VAL VAL
9745c563 VAL VAL
a89d3211 VAL VAL VAL VAL
f091e589 VAL VAL VAL

Data model: Rows order
Rows are sorted in lexicographical order
+bill
04523
10942
53205
_tim
andy
josh
steve
will
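The ordering above can be reproduced with a byte-by-byte comparison. The following is a minimal sketch using only the Java standard library; the class name RowOrderSketch and its methods are made up for illustration and are not part of the HBase API. It also shows why Row10 sorts before Row2.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper: sorts row keys the way a byte-ordered store would.
final class RowOrderSketch {
    // Compares two keys byte by byte, treating each byte as unsigned (0..255).
    static final Comparator<byte[]> BYTE_ORDER = (a, b) -> {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a[i] & 0xff) - (b[i] & 0xff);
            if (diff != 0) {
                return diff;
            }
        }
        return a.length - b.length; // a key that is a prefix of another sorts first
    };

    static List<String> sortKeys(List<String> keys) {
        List<byte[]> raw = new ArrayList<>();
        for (String key : keys) {
            raw.add(key.getBytes(StandardCharsets.UTF_8));
        }
        raw.sort(BYTE_ORDER);
        List<String> sorted = new ArrayList<>();
        for (byte[] bytes : raw) {
            sorted.add(new String(bytes, StandardCharsets.UTF_8));
        }
        return sorted;
    }

    public static void main(String[] args) {
        // Keys from the slide, in shuffled order.
        System.out.println(sortKeys(Arrays.asList(
                "will", "andy", "10942", "_tim", "+bill", "steve", "04523", "josh", "53205")));
        // Numeric suffixes are compared as text, not as numbers.
        System.out.println(sortKeys(Arrays.asList("Row2", "Row10", "Row1")));
    }
}
```

Because keys are compared as text, numeric key parts are usually zero-padded when numeric order matters.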

Data model: Regions
RowKey
contacts accounts …
mobile email skype UAH USD …
084ab67e VAL VAL
2333bbac VAL VAL
… VAL
4345235b VAL
… VAL VAL VAL
9745c563 VAL VAL
… VAL VAL VAL VAL
f091e589 VAL VAL VAL
RowKey ranges → Regions (R1, R2, R3)
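Mapping a RowKey to its region amounts to finding the region whose start key is the greatest one not larger than the RowKey. The sketch below models that lookup with a sorted map. The class RegionLookupSketch and the region boundaries ("", "4", "9") are assumptions chosen for illustration; they are not taken from the slides or from the HBase API.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical model: each region is identified by the first RowKey it covers.
final class RegionLookupSketch {
    // Region start key -> region name, kept sorted by start key.
    // String order matches byte order here because the keys are plain ASCII.
    private final TreeMap<String, String> regionsByStartKey = new TreeMap<>();

    void addRegion(String startKey, String regionName) {
        regionsByStartKey.put(startKey, regionName);
    }

    // Returns the region whose start key is the greatest one <= rowKey.
    String regionFor(String rowKey) {
        Map.Entry<String, String> entry = regionsByStartKey.floorEntry(rowKey);
        return entry.getValue();
    }

    // Assumed boundaries: R1 = ["", "4"), R2 = ["4", "9"), R3 = ["9", end).
    static RegionLookupSketch example() {
        RegionLookupSketch regions = new RegionLookupSketch();
        regions.addRegion("", "R1"); // the first region starts at the empty key
        regions.addRegion("4", "R2");
        regions.addRegion("9", "R3");
        return regions;
    }

    public static void main(String[] args) {
        RegionLookupSketch regions = example();
        for (String key : new String[] {"084ab67e", "4345235b", "f091e589"}) {
            System.out.println(key + " -> " + regions.regionFor(key));
        }
    }
}
```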

Data model: Column Family
RowKey
contacts accounts
mobile email skype UAH USD
084ab67e VAL VAL
2333bbac VAL VAL
342bbecc VAL
4345235b VAL
565c4f8f VAL VAL VAL
9745c563 VAL VAL

Data model: Column Family
• Column Families are part of the table schema and are defined at table creation
• Columns are grouped into column families
• Each Column Family is stored in its own HFiles in HDFS
• Data is grouped into Column Families by a common attribute

Data model: Columns
RowKey
contacts accounts
mobile email skype UAH USD
084ab67e 977685798 user123@gmail.com user123 2875 10
… … … … … …

Data model: Cells
Key                                                        Value
RowKey    Column Family  Column Qualifier  Version
084ab67e  contacts       mobile            1454767653075   977685798

Data model: Cells
• Data is stored in KeyValue format
• The value of each cell is addressed by its complete coordinates: RowKey, Column Family, Column Qualifier, Version
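The cell coordinates can be pictured as a composite sort key. The sketch below is a simplified standard-library model, not the HBase KeyValue class: the name CellKeySketch is hypothetical. It orders keys by RowKey, Column Family and Column Qualifier ascending, and by timestamp descending so that the newest version of a cell comes first.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical model of a cell's coordinates; values are omitted for brevity.
final class CellKeySketch implements Comparable<CellKeySketch> {
    final String row;
    final String family;
    final String qualifier;
    final long timestamp;

    CellKeySketch(String row, String family, String qualifier, long timestamp) {
        this.row = row;
        this.family = family;
        this.qualifier = qualifier;
        this.timestamp = timestamp;
    }

    @Override
    public int compareTo(CellKeySketch other) {
        int c = row.compareTo(other.row);
        if (c != 0) return c;
        c = family.compareTo(other.family);
        if (c != 0) return c;
        c = qualifier.compareTo(other.qualifier);
        if (c != 0) return c;
        return Long.compare(other.timestamp, timestamp); // newer versions first
    }

    @Override
    public String toString() {
        return row + "/" + family + ":" + qualifier + "@" + timestamp;
    }

    // Returns the keys in sorted order, rendered as strings.
    static List<String> sortedDescriptions(List<CellKeySketch> keys) {
        List<CellKeySketch> copy = new ArrayList<>(keys);
        Collections.sort(copy);
        List<String> out = new ArrayList<>();
        for (CellKeySketch key : copy) {
            out.add(key.toString());
        }
        return out;
    }

    public static void main(String[] args) {
        // The timestamps 1 and 2 are placeholders, not real epoch values.
        System.out.println(sortedDescriptions(Arrays.asList(
                new CellKeySketch("084ab67e", "contacts", "mobile", 1L),
                new CellKeySketch("084ab67e", "contacts", "mobile", 2L),
                new CellKeySketch("084ab67e", "contacts", "email", 1L),
                new CellKeySketch("084ab67e", "accounts", "USD", 1L))));
    }
}
```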

Data model: Versions
[Figure: a grid of rows Row1, Row10, Row2 against columns CF1:colA, CF1:colB, CF1:colC; each cell holds a stack of one to three versioned values (val1, val2, val3)]
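A cell that keeps several versions can be modelled as a small map from timestamp to value, trimmed to a maximum number of versions. The sketch below is a standard-library approximation with a hypothetical class name, VersionedCellSketch. It trims eagerly on every write for simplicity, which is an assumption of this model and not a description of when HBase physically removes old versions.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.TreeMap;

// Hypothetical model of one cell that retains at most maxVersions values.
final class VersionedCellSketch {
    // Timestamp -> value, newest timestamp first.
    private final TreeMap<Long, String> versions =
            new TreeMap<>(Comparator.<Long>reverseOrder());
    private final int maxVersions;

    VersionedCellSketch(int maxVersions) {
        this.maxVersions = maxVersions;
    }

    void put(long timestamp, String value) {
        versions.put(timestamp, value);
        while (versions.size() > maxVersions) {
            versions.pollLastEntry(); // drop the oldest version
        }
    }

    // Value with the highest timestamp, or null if the cell is empty.
    String latest() {
        return versions.isEmpty() ? null : versions.firstEntry().getValue();
    }

    // Value written at exactly this timestamp, or null if it is not retained.
    String at(long timestamp) {
        return versions.get(timestamp);
    }

    // All retained values, newest first.
    List<String> all() {
        return new ArrayList<>(versions.values());
    }

    public static void main(String[] args) {
        VersionedCellSketch cell = new VersionedCellSketch(2);
        cell.put(1L, "val1");
        cell.put(2L, "val2");
        cell.put(3L, "val3");
        System.out.println(cell.latest());
        System.out.println(cell.all());
    }
}
```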

Create table
create 'user_accounts',
{NAME=>'contacts',VERSIONS=>1},
{NAME=>'accounts'}
• Default VERSIONS = 1 since HBase 0.96
• Default VERSIONS = 3 before HBase 0.96

Insert/Update
put 'user_accounts', 'user3455', 'contacts:mobile', '977685798'
put 'user_accounts', 'user3455', 'contacts:email', 'user@mail.com', 2
There is no update command: a put to an existing cell writes a new version of it.

Read
get 'user_accounts', 'user3455'
get 'user_accounts', 'user3455', 'contacts:mobile'
get 'user_accounts', 'user3455', {COLUMN => 'contacts:email', TIMESTAMP => 2}
scan 'user_accounts'
scan 'user_accounts', {STARTROW => 'a', STOPROW => 'u'}

Delete
delete 'user_accounts',
'user3455','contacts:mobile'
delete 'user_accounts',
'user3455','contacts:mobile',
1459690212356
deleteall 'user_accounts', 'user3455'

Useful commands
list
describe 'user_accounts'
truncate 'user_accounts'
disable 'user_accounts'
alter 'user_accounts',
{NAME=>'contacts',VERSIONS=>2},
{NAME=>'spends'}
enable 'user_accounts'

Data write and fault tolerance
• Every write is first recorded in the write-ahead log (WAL)
• The data is then written to the in-memory memstore
• When the memstore is full, its contents are flushed to disk as an HFile
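The three steps above can be sketched as a toy write path. This is a standard-library model under stated assumptions: the class WritePathSketch is hypothetical, the flush threshold counts entries (a real memstore is flushed based on its size in bytes), and "HFiles" are just immutable sorted maps held in memory.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Hypothetical model of the write path: WAL, then memstore, then flush to an "HFile".
final class WritePathSketch {
    final List<String> wal = new ArrayList<>();                      // append-only log
    final TreeMap<String, String> memstore = new TreeMap<>();        // sorted in-memory buffer
    final List<TreeMap<String, String>> hfiles = new ArrayList<>();  // flushed, sorted snapshots
    private final int flushThreshold;                                // assumed: number of entries

    WritePathSketch(int flushThreshold) {
        this.flushThreshold = flushThreshold;
    }

    void put(String key, String value) {
        wal.add(key + "=" + value);   // 1. record the write so it can be replayed after a crash
        memstore.put(key, value);     // 2. apply it in memory
        if (memstore.size() >= flushThreshold) {
            flush();                  // 3. memstore is full: write it out
        }
    }

    void flush() {
        hfiles.add(new TreeMap<>(memstore)); // snapshot of the sorted memstore
        memstore.clear();
    }

    // Reads check the memstore first, then flushed files from newest to oldest.
    String get(String key) {
        if (memstore.containsKey(key)) {
            return memstore.get(key);
        }
        for (int i = hfiles.size() - 1; i >= 0; i--) {
            String value = hfiles.get(i).get(key);
            if (value != null) {
                return value;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        WritePathSketch store = new WritePathSketch(2);
        store.put("a", "1");
        store.put("b", "2"); // triggers a flush
        store.put("a", "3"); // newer value stays in the memstore
        System.out.println(store.get("a") + " " + store.get("b") + " files=" + store.hfiles.size());
    }
}
```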

Region split
When region size > hbase.hregion.max.filesize → split

Web console
Default address: master_host:60010 (master_host:16010 since HBase 1.0)
Shows:
• Live and dead region servers
• Region request count per second
• Tables and region sizes
• Current compactions
• Current memory state

Elements of Schema Design
HBase schema design is query-based
1. Column family determination
2. RowKey design
3. Column usage
4. Cell version usage
5. Column family attributes: Compression, Time To Live, Min/Max Versions, In-Memory

Column Families determination
• Data that is accessed together should be stored together!
• A large number of column families may degrade performance. Optimal: ≤ 3
• Compression may improve read performance and reduce stored data size, but it can slow down writes

RowKey design
• Do not use sequential keys like timestamp
• Use hash for effective key distribution
• Use composite keys for effective scans
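One way to combine these three rules is a composite key with a short hash prefix. The sketch below is illustrative only: the class RowKeySketch, the 4-character MD5 prefix, the "-" separator and the reversed timestamp are all assumptions of this example, not a layout prescribed by the slides.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical key builder: <hash prefix>-<userId>-<reversed timestamp>.
final class RowKeySketch {
    // First two bytes of MD5(userId) as 4 hex characters; spreads users across the key space.
    static String hashPrefix(String userId) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] digest = md5.digest(userId.getBytes(StandardCharsets.UTF_8));
            return String.format("%02x%02x", digest[0] & 0xff, digest[1] & 0xff);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 is not available", e);
        }
    }

    // The timestamp is reversed and zero-padded so that newer events sort first as text.
    static String rowKey(String userId, long timestampMillis) {
        long reversed = Long.MAX_VALUE - timestampMillis;
        return hashPrefix(userId) + "-" + userId + "-" + String.format("%019d", reversed);
    }

    // Common prefix of all keys of one user; usable as the start of a scan range.
    static String scanPrefix(String userId) {
        return hashPrefix(userId) + "-" + userId + "-";
    }

    public static void main(String[] args) {
        System.out.println(rowKey("user3455", 1459690212356L));
        System.out.println(scanPrefix("user3455"));
    }
}
```

All events of one user share the same prefix, so they stay adjacent and can be read with a single range scan, while different users are scattered by the hash.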

Columns and Versions usage
Tall-Narrow Table Flat-Wide Table

Tall-Narrow Vs. Flat-Wide Tables
Tall-Narrow provides better query granularity
• Finer-grained RowKey
• Works well with Get
Flat-Wide supports built-in row atomicity
• More values in a single row
• Works well for updating multiple values
• Works well for getting multiple associated values
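The difference between the two layouts can be sketched with sorted maps. The example below stores the same messages both ways; the class TableShapeSketch, the user and message identifiers, and the "-" separator are hypothetical and serve only to show where the message id ends up (in the column or in the RowKey).

```java
import java.util.SortedMap;
import java.util.TreeMap;

// Hypothetical model storing the same data in a flat-wide and a tall-narrow layout.
final class TableShapeSketch {
    // Flat-wide: one row per user, one column per message.
    final TreeMap<String, TreeMap<String, String>> flatWide = new TreeMap<>();
    // Tall-narrow: one row per message, RowKey = userId + "-" + messageId.
    final TreeMap<String, String> tallNarrow = new TreeMap<>();

    void store(String userId, String messageId, String body) {
        flatWide.computeIfAbsent(userId, u -> new TreeMap<>()).put(messageId, body);
        tallNarrow.put(userId + "-" + messageId, body);
    }

    // Tall-narrow: a single message is a direct lookup by RowKey (a Get).
    String getOneTall(String userId, String messageId) {
        return tallNarrow.get(userId + "-" + messageId);
    }

    // Tall-narrow: all messages of a user need a range scan over the key prefix.
    // '.' is the character right after '-', so it bounds the prefix range.
    SortedMap<String, String> scanUserTall(String userId) {
        return tallNarrow.subMap(userId + "-", userId + ".");
    }

    // Flat-wide: all messages of a user live in one row.
    SortedMap<String, String> getRowWide(String userId) {
        return flatWide.get(userId);
    }

    public static void main(String[] args) {
        TableShapeSketch tables = new TableShapeSketch();
        tables.store("user1", "msg1", "hello");
        tables.store("user1", "msg2", "world");
        tables.store("user10", "msg1", "other");
        System.out.println(tables.getOneTall("user1", "msg2"));
        System.out.println(tables.scanUserTall("user1"));
        System.out.println(tables.getRowWide("user1"));
    }
}
```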

Column Families properties
Compression
• LZO
• GZIP
• SNAPPY
Time To Live (TTL)
• Keep data for a given time and delete it once the TTL has passed
Versioning
• Keeping fewer versions means less data in scans. The default is now 1
• Combine MIN_VERSIONS with TTL to keep some versions even when they are older than the TTL
In-Memory setting
• A hint that the server should keep the data in cache. Not guaranteed
• Use for small, high-access column families
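How TTL and MIN_VERSIONS interact can be sketched as a retention rule: a version survives if it is younger than the TTL, or if it is among the newest MIN_VERSIONS versions. The code below is a simplified standard-library model with a hypothetical class name, RetentionSketch; the exact boundary handling (a version exactly as old as the TTL is kept) is an assumption of this sketch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical retention rule combining TTL with a minimum number of versions.
final class RetentionSketch {
    // Returns the timestamps that would be retained, newest first.
    static List<Long> retained(List<Long> timestamps, long now, long ttlMillis, int minVersions) {
        List<Long> newestFirst = new ArrayList<>(timestamps);
        newestFirst.sort(Comparator.<Long>reverseOrder());
        List<Long> kept = new ArrayList<>();
        for (int i = 0; i < newestFirst.size(); i++) {
            long age = now - newestFirst.get(i);
            boolean withinTtl = age <= ttlMillis;   // assumed: boundary is inclusive
            boolean protectedByMin = i < minVersions;
            if (withinTtl || protectedByMin) {
                kept.add(newestFirst.get(i));
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Long> timestamps = Arrays.asList(100L, 200L, 900L);
        // TTL only: versions older than 500 ms are dropped.
        System.out.println(retained(timestamps, 1000L, 500L, 0));
        // TTL plus MIN_VERSIONS = 2: the two newest versions survive regardless of age.
        System.out.println(retained(timestamps, 1000L, 500L, 2));
    }
}
```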

API: All the things
• New Java API since HBase 1.0
• Table Interface for Data Operations: Put, Get, Scan,
Increment, Delete
• Admin Interface for DDL operations: Create Table,
Alter Table, Enable/Disable

Performance: Client reads
• Specify as many key components as possible
• Specifying the Column Family reduces disk I/O
• Specifying the Column and Version reduces network traffic
• Set startRow and endRow for Scans where possible
• Use caching with Scans

Performance: Client writes
• Use batches to reduce RPC calls and improve performance
• Use a write buffer for non-critical data. BufferedMutator was introduced in HBase API 1.0
• Durability.ASYNC_WAL may be a good balance between performance and reliability
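The effect of batching on the number of RPC calls can be shown with a small counter model. This sketch does not use the HBase client: the class WriteBufferSketch is hypothetical, and it flushes after a fixed number of buffered rows, whereas a real client-side write buffer is typically bounded by size in bytes.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical client-side write buffer that counts simulated RPC calls.
final class WriteBufferSketch {
    private final List<String> buffer = new ArrayList<>();
    private final int maxBuffered; // assumed: flush after this many rows
    int rpcCount = 0;              // number of simulated round trips
    int rowsSent = 0;              // total rows delivered to the server

    WriteBufferSketch(int maxBuffered) {
        this.maxBuffered = maxBuffered;
    }

    // Buffers one row and sends the whole buffer once it is full.
    void mutate(String row) {
        buffer.add(row);
        if (buffer.size() >= maxBuffered) {
            flush();
        }
    }

    // Sends everything buffered so far in a single simulated RPC.
    void flush() {
        if (buffer.isEmpty()) {
            return;
        }
        rpcCount++;
        rowsSent += buffer.size();
        buffer.clear();
    }

    public static void main(String[] args) {
        WriteBufferSketch buffered = new WriteBufferSketch(3);
        for (int i = 0; i < 7; i++) {
            buffered.mutate("row" + i);
        }
        buffered.flush(); // send the remainder before closing
        System.out.println(buffered.rowsSent + " rows in " + buffered.rpcCount + " RPCs");
    }
}
```

Rows that are still in the buffer are lost if the client crashes before a flush, which is why the slide recommends buffering only for non-critical data.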

How to start?
• MapR Sandbox:
https://www.mapr.com/products/mapr-sandbox-hadoop/download
• Cloudera Sandbox:
http://www.cloudera.com/downloads/quickstart_vms/5-5.html

Thank you
Write me → valeramoiseenko@gmail.com
