HBase Workshop
Moisieienko Valerii
Big Data Morning@Lohika
Agenda
1. What is Apache HBase?
2. HBase data model
3. CRUD operations
4. HBase architecture
5. HBase schema design
6. Java API
What is Apache HBase?
Apache HBase is
• Open source project built on top of Apache
Hadoop
• NoSQL database
• Distributed, scalable datastore
• Column-family datastore
Use cases
Time Series Data
• Sensor, System metrics, Events, Log files
• User Activity
• High Volume, High Velocity Writes
Information Exchange
• Email, Chat, Inbox
• High Volume, High Velocity Reads and Writes
Enterprise Application Backend
• Online Catalog
• Search Index
• Pre-Computed View
• High Volume, High Velocity Reads
HBase data model
Data model overview
Component                  Description
Table                      Data organized into tables
RowKey                     Data stored in rows; rows identified by RowKeys
Region                     Rows are grouped into Regions
Column Family              Columns grouped into families
Column Qualifier (Column)  Identifies the column
Cell                       Combination of the row key, column family, column, and timestamp; contains the value
Version                    Values within a cell are versioned by version number (a timestamp)
Data model: Rows
RowKey
contacts accounts …
mobile email skype UAH USD …
084ab67e VAL VAL
2333bbac VAL VAL
342bbecc VAL
4345235b VAL
565c4f8f VAL VAL VAL
675555ab VAL VAL VAL VAL VAL
9745c563 VAL VAL
a89d3211 VAL VAL VAL VAL
f091e589 VAL VAL VAL
Data model: Rows order
Rows are sorted in lexicographical order
+bill
04523
10942
53205
_tim
andy
josh
steve
will
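The ordering above can be reproduced with a byte-wise comparison. This is a minimal sketch in plain Java with no HBase dependency; the class and method names are illustrative and not part of the HBase API. It compares keys as unsigned bytes, which is how HBase orders row keys.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Illustrative helper, not an HBase class.
class RowKeyOrder {
    // Compare two keys byte by byte, treating each byte as unsigned.
    static final Comparator<String> BYTE_ORDER = (a, b) -> {
        byte[] x = a.getBytes(StandardCharsets.UTF_8);
        byte[] y = b.getBytes(StandardCharsets.UTF_8);
        int n = Math.min(x.length, y.length);
        for (int i = 0; i < n; i++) {
            int diff = (x[i] & 0xff) - (y[i] & 0xff);
            if (diff != 0) return diff;
        }
        return x.length - y.length; // a key that is a prefix of another sorts first
    };

    // Returns a sorted copy; the input list is left untouched.
    static List<String> sorted(List<String> keys) {
        List<String> copy = new ArrayList<>(keys);
        copy.sort(BYTE_ORDER);
        return copy;
    }

    public static void main(String[] args) {
        System.out.println(sorted(Arrays.asList("will", "andy", "_tim", "10942", "+bill")));
        System.out.println(sorted(Arrays.asList("Row2", "Row10", "Row1")));
    }
}
```

Numbers inside keys are compared as text, so Row10 sorts before Row2. Numeric key parts are therefore usually zero-padded.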
Data model: Regions
RowKey
contacts accounts …
mobile email skype UAH USD …
084ab67e VAL VAL
2333bbac VAL VAL
… VAL
4345235b VAL
… VAL VAL VAL
675555ab VAL VAL VAL VAL VAL
9745c563 VAL VAL
… VAL VAL VAL VAL
f091e589 VAL VAL VAL
RowKey ranges → Regions
R1
R2
R3
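The mapping from RowKey ranges to Regions can be sketched as a sorted map from each region's start key to the region. The class name and the split points "4" and "9" are assumptions chosen to match the three groups in the table above; a real client finds regions through the cluster's metadata rather than a local map.

```java
import java.util.TreeMap;

// Illustrative model of a region map, not an HBase class.
class RegionMapSketch {
    // Region start key -> region name. "" marks the start of the first region.
    private final TreeMap<String, String> startKeys = new TreeMap<>();

    void addRegion(String startKey, String name) {
        startKeys.put(startKey, name);
    }

    // A row belongs to the region with the greatest start key <= rowKey.
    String regionFor(String rowKey) {
        return startKeys.floorEntry(rowKey).getValue();
    }

    // Hypothetical layout: R1 = ["", "4"), R2 = ["4", "9"), R3 = ["9", end).
    static RegionMapSketch example() {
        RegionMapSketch map = new RegionMapSketch();
        map.addRegion("", "R1");
        map.addRegion("4", "R2");
        map.addRegion("9", "R3");
        return map;
    }

    public static void main(String[] args) {
        System.out.println(example().regionFor("675555ab"));
    }
}
```

Because rows are sorted, each region covers one contiguous key range, and a single lookup on the start keys is enough to route a request.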
Data model: Column Family
RowKey
contacts accounts
mobile email skype UAH USD
084ab67e VAL VAL
2333bbac VAL VAL
342bbecc VAL
4345235b VAL
565c4f8f VAL VAL VAL
675555ab VAL VAL VAL VAL VAL
9745c563 VAL VAL
Data model: Column Family
• Column Families are part of the table schema and
are defined at table creation
• Columns are grouped into column families
• Column Families are stored in separate HFiles on
HDFS
• Data is grouped into Column Families by a common
attribute
Data model: Columns
RowKey
contacts accounts
mobile email skype UAH USD
084ab67e 977685798 user123@gmail.com user123 2875 10
… … … … … …
Data model: Cells
|---------------------------- Key ----------------------------|  Value
RowKey    Column Family  Column Qualifier  Version
084ab67e  contacts       mobile            1454767653075          977685798
Data model: Cells
• Data is stored in KeyValue format
• Value for each cell is specified by complete
coordinates: RowKey, Column Family, Column
Qualifier, Version
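The four coordinates can be pictured as nested sorted maps. The sketch below is an in-memory model only, assuming string keys and values for readability (HBase stores byte arrays); the class and method names are illustrative.

```java
import java.util.Comparator;
import java.util.TreeMap;

// Illustrative in-memory model of cell coordinates, not an HBase class.
class CellModelSketch {
    // row key -> "family:qualifier" -> version (timestamp, newest first) -> value
    private final TreeMap<String, TreeMap<String, TreeMap<Long, String>>> rows = new TreeMap<>();

    void put(String row, String family, String qualifier, long version, String value) {
        rows.computeIfAbsent(row, r -> new TreeMap<>())
            .computeIfAbsent(family + ":" + qualifier,
                    c -> new TreeMap<>(Comparator.<Long>reverseOrder()))
            .put(version, value);
    }

    // Returns the newest version of the cell, or null if the cell does not exist.
    String getLatest(String row, String family, String qualifier) {
        TreeMap<Long, String> versions = versionsOf(row, family, qualifier);
        return versions == null ? null : versions.firstEntry().getValue();
    }

    // Returns the value stored at exactly this version, or null.
    String get(String row, String family, String qualifier, long version) {
        TreeMap<Long, String> versions = versionsOf(row, family, qualifier);
        return versions == null ? null : versions.get(version);
    }

    private TreeMap<Long, String> versionsOf(String row, String family, String qualifier) {
        TreeMap<String, TreeMap<Long, String>> columns = rows.get(row);
        return columns == null ? null : columns.get(family + ":" + qualifier);
    }

    public static void main(String[] args) {
        CellModelSketch store = new CellModelSketch();
        store.put("084ab67e", "contacts", "mobile", 1454767653075L, "977685798");
        System.out.println(store.getLatest("084ab67e", "contacts", "mobile"));
    }
}
```

A read that omits the version returns the newest value, while a read with all four coordinates addresses exactly one stored value.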
Data model: Versions
[Figure: a table with rows Row1, Row10, Row2 and columns CF1:colA, CF1:colB, CF1:colC; each cell holds a stack of versioned values (val1, val2, val3).]
CRUD Operations
Create table
create 'user_accounts',
{NAME=>'contacts',VERSIONS=>1},
{NAME=>'accounts'}
• Default VERSIONS = 1 since HBase 0.96
• Default VERSIONS = 3 before HBase 0.96
Insert/Update
put 'user_accounts', 'user3455', 'contacts:mobile', '977685798'
put 'user_accounts', 'user3455', 'contacts:email', 'user@mail.com', 2
There is no update command: put the cell again, and the new value is stored as a newer version.
Read
get 'user_accounts', 'user3455'
get 'user_accounts', 'user3455',
'contacts:mobile'
get 'user_accounts', 'user3455', {COLUMN
=> 'contacts:email', TIMESTAMP => 2}
scan 'user_accounts'
scan 'user_accounts',
{STARTROW=>'a',STOPROW=>'u'}
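The range semantics of the last scan can be sketched with a sorted map: the start row is inclusive and the stop row is exclusive. The class name and sample rows are assumptions for illustration; this models the ordering only and does not call HBase.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Illustrative model of a range scan over sorted row keys, not an HBase class.
class ScanRangeSketch {
    // Returns the row keys in [startRow, stopRow): start inclusive, stop exclusive.
    static List<String> scan(TreeMap<String, String> table, String startRow, String stopRow) {
        return new ArrayList<>(table.subMap(startRow, true, stopRow, false).keySet());
    }

    // Hypothetical table content.
    static TreeMap<String, String> sampleTable() {
        TreeMap<String, String> table = new TreeMap<>();
        for (String key : new String[] {"04523", "andy", "josh", "steve", "user3455", "will"}) {
            table.put(key, "VAL");
        }
        return table;
    }

    public static void main(String[] args) {
        System.out.println(scan(sampleTable(), "a", "u"));
    }
}
```

With these sample rows the scan returns andy, josh, and steve. The row user3455 is not returned because it sorts after the stop row 'u'.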
Delete
delete 'user_accounts',
'user3455','contacts:mobile'
delete 'user_accounts',
'user3455','contacts:mobile',
1459690212356
deleteall 'user_accounts', 'user3455'
Useful commands
list
describe 'user_accounts'
truncate 'user_accounts'
disable 'user_accounts'
alter 'user_accounts',
{NAME=>'contacts',VERSIONS=>2},
{NAME=>'spends'}
enable 'user_accounts'
HBase Architecture
Components
Regions
Master
ZooKeeper
Data write
Data write and fault tolerance
• Each write is first recorded in the write-ahead log (WAL)
• The write is then applied to the memstore
• When the memstore is full, its data is flushed to disk as an
HFile
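The write path above can be sketched in a few lines. This is a simplified model, not HBase code: the class name is illustrative, the flush threshold counts entries instead of bytes, and the "HFiles" are in-memory sorted maps.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Illustrative model of the WAL -> memstore -> HFile path, not an HBase class.
class WritePathSketch {
    final List<String> wal = new ArrayList<>();                     // append-only log used for recovery
    TreeMap<String, String> memstore = new TreeMap<>();             // sorted in-memory buffer
    final List<TreeMap<String, String>> hfiles = new ArrayList<>(); // flushed, sorted, never modified
    private final int flushThreshold;                               // entries; a stand-in for a byte limit

    WritePathSketch(int flushThreshold) {
        this.flushThreshold = flushThreshold;
    }

    void put(String key, String value) {
        wal.add(key + "=" + value);                      // 1. record the edit in the WAL
        memstore.put(key, value);                        // 2. apply it to the memstore
        if (memstore.size() >= flushThreshold) flush();  // 3. flush when the memstore is full
    }

    private void flush() {
        hfiles.add(memstore);          // the sorted snapshot becomes a new "HFile"
        memstore = new TreeMap<>();    // start with an empty memstore
    }

    // Reads check the memstore first, then the files from newest to oldest.
    String get(String key) {
        if (memstore.containsKey(key)) return memstore.get(key);
        for (int i = hfiles.size() - 1; i >= 0; i--) {
            String value = hfiles.get(i).get(key);
            if (value != null) return value;
        }
        return null;
    }
}
```

Since every edit reaches the WAL before the memstore, edits that were only in memory when a server failed can be replayed from the log. Each flush adds one more file to read from, which is the reason compactions exist.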
Minor compaction
Major compaction
Region split
When region size > hbase.hregion.max.filesize -> split
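A split can be sketched as cutting a sorted key range in two at a middle key. The sketch below assumes size is counted in rows and the split point is the middle row key; HBase compares store file sizes in bytes and chooses the split point from its store files, so treat this as a model of the idea only. All names are illustrative.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Illustrative model of a region split, not an HBase class.
class RegionSplitSketch {
    // Returns the region unchanged if it fits, otherwise two daughter regions.
    static List<TreeMap<String, String>> maybeSplit(TreeMap<String, String> region, int maxSize) {
        List<TreeMap<String, String>> result = new ArrayList<>();
        if (region.size() <= maxSize) {
            result.add(region);
            return result;
        }
        // Use the middle key as the split point (an assumption of this sketch).
        String splitKey = new ArrayList<>(region.keySet()).get(region.size() / 2);
        result.add(new TreeMap<>(region.headMap(splitKey, false))); // [start, splitKey)
        result.add(new TreeMap<>(region.tailMap(splitKey, true)));  // [splitKey, end)
        return result;
    }
}
```

The two daughters cover adjacent key ranges, so the sorted order of the table is preserved and each daughter can be moved to a different server.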
Region load balancing
Web console
Default address: master_host:60010 (master_host:16010 in Apache HBase 1.0 and later)
Shows:
• Live and dead region servers
• Region request count per second
• Tables and region sizes
• Current compactions
• Current memory state
HBase Schema Design
Elements of Schema Design
HBase schema design is QUERY based
1. Column families determination
2. RowKey design
3. Columns usage
4. Cell versions usage
5. Column family attributes: Compression, TimeToLive,
Min/Max Versions, In-Memory
Column Families determination
• Data that is accessed together should be stored
together!
• A large number of column families may degrade
performance. Optimal: ≤ 3
• Compression may improve read performance
and reduce stored data size, but can slow down
writes
RowKey design
• Do not use sequential keys such as timestamps: consecutive writes all land in one region
• Hash the key to distribute rows evenly across regions
• Use composite keys for efficient scans
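The hashing and composite-key advice can be combined in one key. The sketch below is one possible scheme under stated assumptions: an 8-character MD5 prefix of the user id, followed by a reversed, zero-padded timestamp. The class, method names, and key layout are illustrative and not prescribed by HBase.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Illustrative row key helpers, not an HBase class.
class RowKeySketch {
    // First 8 hex characters of the MD5 of the natural key.
    static String hashedKey(String naturalKey) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5")
                    .digest(naturalKey.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (int i = 0; i < 4; i++) {
                hex.append(String.format("%02x", digest[i] & 0xff));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // MD5 is required on every Java platform
        }
    }

    // Hashed user id, then a reversed timestamp so that newer events sort first.
    static String compositeKey(String userId, long timestampMillis) {
        return hashedKey(userId) + "_" + String.format("%019d", Long.MAX_VALUE - timestampMillis);
    }

    public static void main(String[] args) {
        System.out.println(compositeKey("user3455", 1459690212356L));
    }
}
```

The hash spreads different users across the key space, while all rows of one user share a prefix and stay contiguous, so a scan over that prefix returns the user's events newest first. The trade-off is that range scans across users in their natural order are no longer possible.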
Columns and Versions usage
Tall-Narrow Table Flat-Wide Table
Tall-Narrow Vs. Flat-Wide Tables
Tall-Narrow provides better query granularity
• Finer grained RowKey
• Works well with Get
Flat-Wide supports built-in row atomicity
• More values in a single row
• Works well to update multiple values
• Works well to get multiple associated values
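The two layouts can be compared by storing the same data both ways. This sketch uses hypothetical message data and a hypothetical "userId-messageId" key format; it models only the shape of the tables.

```java
import java.util.TreeMap;

// Illustrative layouts only; the data and key format are assumptions.
class TableShapeSketch {
    // Each input entry is {userId, messageId, body}.

    // Tall-narrow: one row per message, row key "userId-messageId", one value per row.
    static TreeMap<String, String> tallNarrow(String[][] messages) {
        TreeMap<String, String> rows = new TreeMap<>();
        for (String[] m : messages) {
            rows.put(m[0] + "-" + m[1], m[2]);
        }
        return rows;
    }

    // Flat-wide: one row per user, one column per message.
    static TreeMap<String, TreeMap<String, String>> flatWide(String[][] messages) {
        TreeMap<String, TreeMap<String, String>> rows = new TreeMap<>();
        for (String[] m : messages) {
            rows.computeIfAbsent(m[0], k -> new TreeMap<>()).put(m[1], m[2]);
        }
        return rows;
    }
}
```

In the tall-narrow layout a single message is addressed directly by its row key. In the flat-wide layout all messages of a user sit in one row, so a change that touches several of them is applied atomically.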
Column Families properties
Compression
• LZO
• GZIP
• SNAPPY
Time To Live (TTL)
• Keep data for a set time; it is deleted once the TTL has passed
Versioning
• Keeping fewer versions means less data in scans. The default is now 1
• Combine MIN_VERSIONS with TTL to retain the newest versions even after the TTL has passed
In-Memory setting
• Suggests that the server keep the data in cache. Not guaranteed
• Use for small, frequently accessed column families
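How TTL, MIN_VERSIONS, and the maximum number of versions interact can be sketched as a retention rule over the versions of one cell. This is a simplified model with illustrative names, assuming timestamps in milliseconds; it is not the code HBase runs during compaction.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative retention rule for the versions of a single cell, not an HBase class.
class RetentionSketch {
    // timestampsNewestFirst: version timestamps of one cell, newest first.
    static List<Long> retained(List<Long> timestampsNewestFirst, long now, long ttlMillis,
                               int minVersions, int maxVersions) {
        List<Long> kept = new ArrayList<>();
        // Versions beyond maxVersions are never kept.
        for (int i = 0; i < timestampsNewestFirst.size() && i < maxVersions; i++) {
            long ts = timestampsNewestFirst.get(i);
            boolean expired = now - ts > ttlMillis;
            // An expired version survives only if it is among the newest minVersions.
            if (!expired || i < minVersions) {
                kept.add(ts);
            }
        }
        return kept;
    }
}
```

With MIN_VERSIONS set to 0, a cell whose versions have all expired disappears completely. With MIN_VERSIONS set to 1, its newest value stays readable regardless of age.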
HBase Java API
API: All the things
• New Java API since HBase 1.0
• Table Interface for Data Operations: Put, Get, Scan,
Increment, Delete
• Admin Interface for DDL operations: Create Table,
Alter Table, Enable/Disable
Client
Let’s see the code
Performance: Client reads
• Specify as many key components as possible
• Specifying the Column Family reduces disk I/O
• Specifying the Column and Version reduces network
traffic
• Specify startRow and endRow for Scans where
possible
• Use caching with Scans
Performance: Client writes
• Use batches to reduce RPC calls and improve
performance
• Use a write buffer for non-critical data. BufferedMutator
was introduced in HBase API 1.0
• Durability.ASYNC_WAL may be a good balance
between performance and reliability
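The effect of batching and write buffering on RPC count can be sketched without a cluster. The class below is a stand-in that only counts simulated round trips; its name and fields are illustrative, and it buffers by number of mutations, whereas the real client buffer is sized in bytes.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative client-side write buffer, not the HBase BufferedMutator.
class WriteBufferSketch {
    private final int batchSize;
    private final List<String> buffer = new ArrayList<>();
    int rpcCalls = 0;       // simulated round trips to the server
    int mutationsSent = 0;  // mutations that have left the client

    WriteBufferSketch(int batchSize) {
        this.batchSize = batchSize;
    }

    // Buffers the mutation and sends a batch once the buffer is full.
    void mutate(String mutation) {
        buffer.add(mutation);
        if (buffer.size() >= batchSize) flush();
    }

    // Sends everything buffered in one simulated RPC; call it before closing.
    void flush() {
        if (buffer.isEmpty()) return;
        rpcCalls++;
        mutationsSent += buffer.size();
        buffer.clear();
    }
}
```

Sending 250 mutations with a batch size of 100 takes three round trips instead of 250. Mutations still in the buffer are lost if the client fails before the final flush, which is why buffering suits non-critical data.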
The last few words
How to start?
• MapR Sandbox:
https://www.mapr.com/products/mapr-sandbox-hadoop/download
• Cloudera Sandbox:
http://www.cloudera.com/downloads/quickstart_vms/5-5.html
Thank you
Write me → valeramoiseenko@gmail.com
