HBase
What is HBase?
1. HBase is an open-source, sorted-map data store built on top of Hadoop.
2. It is column-oriented and horizontally scalable.
3. It is based on Google's Bigtable. It has a set of tables which keep data in key-value format.
4. HBase is well suited for sparse data sets, which are very common in big data use cases.
5. HBase provides APIs enabling development in practically any programming language.
6. It is a part of the Hadoop ecosystem that provides random, real-time read/write access to data in the Hadoop File System.
Why HBase?
• An RDBMS becomes exponentially slower as the data grows large
• It expects data to be highly structured, i.e. able to fit into a well-defined schema
• Any change in schema might require downtime
• For sparse datasets, maintaining NULL values adds too much overhead
Features of HBase
• Horizontally scalable: capacity grows by adding nodes to the cluster, and you can add any number of columns to a table at any time.
• Automatic failover: automatic failover allows data handling to switch automatically to a standby system in the event of a failure.
• Integration with the MapReduce framework: commands and Java code can internally use MapReduce to do their work, and HBase is built over the Hadoop Distributed File System.
• HBase is a sparse, distributed, persistent, multidimensional sorted map, indexed by row key, column key, and timestamp.
• It is often referred to as a key-value store, a column-family-oriented database, or a store of versioned maps of maps (sketched below).
• Fundamentally, it is a platform for storing and retrieving data with random access.
• It doesn't care about datatypes (you can store an integer in one row and a string in another for the same column).
• It doesn't enforce relationships within your data.
• It is designed to run on a cluster of computers built from commodity hardware.
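The "versioned maps of maps" description can be made concrete with a small sketch. The nesting below mirrors the shape of the logical model (row key → column family → column qualifier → timestamp → value), which is also the shape the Java client exposes through Result.getMap(); the table contents and names here are hypothetical.

```java
import java.util.NavigableMap;
import java.util.TreeMap;

// A minimal sketch of HBase's logical model: a sorted map of maps.
// Row key -> column family -> column qualifier -> timestamp -> value.
public class LogicalModelSketch {
    public static void main(String[] args) {
        NavigableMap<String, NavigableMap<String, NavigableMap<String, NavigableMap<Long, String>>>>
                table = new TreeMap<>();

        // Insert one cell: row "row1", family "personal", qualifier "name",
        // version (timestamp) 1L, value "alice". All names are hypothetical.
        table.computeIfAbsent("row1", r -> new TreeMap<>())
             .computeIfAbsent("personal", f -> new TreeMap<>())
             .computeIfAbsent("name", q -> new TreeMap<>())
             .put(1L, "alice");

        // A later version of the same cell; HBase keeps both, sorted by timestamp.
        table.get("row1").get("personal").get("name").put(2L, "alice b.");

        // Sparse by construction: rows simply omit columns they don't have.
        System.out.println(table);
    }
}
```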
HBase Architecture: Use Cases, Components & Data Model
• HBase architecture consists mainly of five components:
• HMaster
• HRegionServer
• HRegions
• ZooKeeper
• HDFS
HMaster
• HMaster is the implementation of the Master server in the HBase architecture.
• It acts as a monitoring agent for all Region Server instances present in the cluster and as an interface for all metadata changes.
• In a distributed cluster environment, the Master runs on the NameNode, and it runs several background threads.
• It plays a vital role in terms of performance and in maintaining the nodes in the cluster.
• HMaster provides administrative functions and distributes services to the different region servers.
• HMaster assigns regions to region servers.
• HMaster controls load balancing and failover to handle the load over the nodes present in the cluster.
• When a client wants to change a schema or any metadata, HMaster takes responsibility for these operations.
• Some of the methods exposed by the HMaster interface are primarily metadata-oriented methods:
• Table (createTable, removeTable, enable, disable)
• ColumnFamily (addColumn, modifyColumn)
• Region (move, assign)
• The client communicates bi-directionally with both HMaster and ZooKeeper. For read and write operations, it contacts the HRegion servers directly. HMaster assigns regions to region servers and, in turn, checks the health status of the region servers.
• The architecture contains multiple region servers. The HLog present in each region server stores all the log files.
HBase Region Servers
• When an HBase Region Server receives a write or read request from the client, it assigns the request to the specific region where the actual column family resides. The client can contact HRegion servers directly; HMaster's permission is not required for the client to communicate with them. The client requires HMaster's help only when operations related to metadata and schema changes are needed.
• HRegionServer is the Region Server implementation. It is responsible for serving and managing the regions, or data, present in the distributed cluster. The region servers run on the Data Nodes present in the Hadoop cluster.
• HMaster communicates with multiple HRegion servers, each of which performs the following functions:
• Hosting and managing regions
• Splitting regions automatically
• Handling read and write requests
• Communicating with the client directly
HBase Regions
• HRegions are the basic building elements of an HBase cluster; they hold the distributed portions of tables and are comprised of column families.
• A region contains multiple stores, one for each column family. Each store consists of mainly two components: the Memstore and the HFile.
ZooKeeper
• HBase ZooKeeper is a centralized monitoring server which maintains configuration information and provides distributed synchronization. Distributed synchronization gives the distributed applications running across the cluster access to coordination services between nodes.
• If a client wants to communicate with regions, the client has to approach ZooKeeper first.
• ZooKeeper is an open-source project, and it provides many important services.
Services provided by ZooKeeper
• Maintains configuration information
• Provides distributed synchronization
• Establishes client communication with region servers
• Provides ephemeral nodes, which represent the different region servers
• Lets master servers use these ephemeral nodes to discover available servers in the cluster
• Tracks server failures and network partitions
• The Master and the HBase slave nodes (region servers) register themselves with ZooKeeper. The client needs access to the ZooKeeper (ZK) quorum configuration to connect with the master and region servers (see the sketch below).
• During a failure of nodes present in the HBase cluster, the ZK quorum will trigger error messages and start to repair the failed nodes.
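As a hedged illustration of that quorum configuration, the snippet below sets the ZooKeeper quorum programmatically before opening a client connection (HBase 1.x-style API). In practice these properties usually come from hbase-site.xml; the host names and the table name are hypothetical.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Table;

public class ZkQuorumExample {
    public static void main(String[] args) throws Exception {
        // Start from the defaults (and hbase-site.xml, if on the classpath).
        Configuration conf = HBaseConfiguration.create();

        // Tell the client where the ZooKeeper quorum lives; the client asks
        // ZooKeeper first, then talks to the master and region servers.
        conf.set("hbase.zookeeper.quorum", "zk1.example.com,zk2.example.com,zk3.example.com");
        conf.set("hbase.zookeeper.property.clientPort", "2181");

        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("mytable"))) {
            System.out.println("Connected to: " + table.getName());
        }
    }
}
```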
HDFS
• HDFS, the Hadoop Distributed File System, as the name implies provides a distributed environment for storage, and it is a file system designed to run on commodity hardware.
• It stores each file in multiple blocks and, to maintain fault tolerance, the blocks are replicated across the Hadoop cluster.
HBase Data Model
• The HBase Data Model is a set of components consisting of tables, rows, column families, cells, columns, and versions. HBase tables contain column families and rows, with one element defined as the primary key. A column in an HBase table represents an attribute of the stored objects.
• The HBase Data Model consists of the following elements:
• A set of tables
• Each table with column families and rows
• Each table must have an element defined as the primary key
• The row key acts as the primary key in HBase
• Any access to HBase tables uses this primary key
• Each column present in HBase denotes an attribute corresponding to an object
Storage Mechanism in HBase
• HBase is a column-oriented database, and data is stored in tables. The tables are sorted by row key; each row key maps to the collection of column families present in the table.
• The column families present in the schema hold key-value pairs. Looking closely, each column family can have multiple columns. The column values are stored on disk, and each cell of the table has its own metadata, such as a timestamp and other information.
• In HBase, the following key terms describe the table schema:
• Table: a collection of rows.
• Row: a collection of column families.
• Column Family: a collection of columns.
• Column: a collection of key-value pairs.
• Namespace: a logical grouping of tables.
• Cell: a {row, column, version} tuple that exactly specifies a cell in HBase.
HBase Read and Write Operation
• Step 1) A client that wants to write data first communicates with the region server, and then with the region.
• Step 2) The region contacts the Memstore associated with the column family for storage.
• Step 3) Data is first stored in the Memstore, where it is kept sorted by row key; after that, it flushes into an HFile. The Memstore is held in the region server's main memory, while HFiles are written into HDFS.
• Step 4) A client that wants to read data contacts the regions.
• Step 5) The client can access the Memstore directly and request data from it.
• Step 6) The client approaches the HFiles to get the data; the data are fetched and retrieved by the client. (A client-side sketch of this write/read flow follows.)
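A minimal client-side sketch of this flow, assuming an existing table named 'employee' with a column family 'personal' (both hypothetical) and the HBase 1.x-style client API: the Put travels to the owning region server's Memstore, and the Get is answered from the Memstore and/or HFiles without the client needing to know which.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class ReadWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("employee"))) {

            // Write: the Put goes to the region server owning this row;
            // internally it lands in the Memstore, later flushing to HFiles.
            Put put = new Put(Bytes.toBytes("row1"));
            put.addColumn(Bytes.toBytes("personal"), Bytes.toBytes("name"), Bytes.toBytes("alice"));
            table.put(put);

            // Read: the region server answers from the Memstore and/or HFiles.
            Get get = new Get(Bytes.toBytes("row1"));
            Result result = table.get(get);
            byte[] value = result.getValue(Bytes.toBytes("personal"), Bytes.toBytes("name"));
            System.out.println("name = " + Bytes.toString(value));
        }
    }
}
```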
• The Memstore holds in-memory modifications to the store. The hierarchy of objects in HBase regions is shown, from top to bottom, in the table below.
Table: the HBase table present in the HBase cluster
Region: the HRegions for the presented tables
Store: one store per column family, for each region of the table
Memstore:
• one Memstore for each store, for each region of the table
• it sorts data before flushing into HFiles
• write and read performance increase because of this sorting
StoreFile: the StoreFiles for each store, for each region of the table
Block: the blocks present inside the StoreFiles
HBase Clients
• The HBase shell
• Kundera – the object mapper
• The REST client
• The Thrift client
• The Hadoop ecosystem client
HBase Shell
• HBase contains a shell with which you can communicate with HBase.
• HBase uses the Hadoop File System to store its data. It has a master server and region servers.
• Data is stored in the form of regions (partitions of tables). These regions are split up and stored in region servers.
• The master server manages these region servers, and all these tasks take place on HDFS.
• Given below are some of the commands supported by the HBase shell.
Data Definition Language
• These are the commands that operate on the tables in HBase.
• create - Creates a table.
• list - Lists all the tables in HBase.
• disable - Disables a table.
• is_disabled - Verifies whether a table is disabled.
• enable - Enables a table.
• is_enabled - Verifies whether a table is enabled.
• describe - Provides the description of a table.
• alter - Alters a table.
• exists - Verifies whether a table exists.
• drop - Drops a table from HBase.
• drop_all - Drops the tables matching the regex given in the command.
• Java Admin API - In addition to the shell commands above, Java provides an Admin API to achieve these DDL functionalities through programming. HBaseAdmin (in the org.apache.hadoop.hbase.client package) and HTableDescriptor (in org.apache.hadoop.hbase) are the two important classes that provide DDL functionality (a sketch follows).
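A minimal sketch of the Admin API route, using the classic HBaseAdmin and HTableDescriptor classes named above (deprecated in newer releases in favor of an Admin obtained from a Connection); the table and column-family names are hypothetical.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.HBaseAdmin;

public class DdlExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        // Classic admin entry point (newer code uses Connection.getAdmin()).
        HBaseAdmin admin = new HBaseAdmin(conf);

        // create: equivalent of the shell's  create 'employee', 'personal'
        HTableDescriptor desc = new HTableDescriptor(TableName.valueOf("employee"));
        desc.addFamily(new HColumnDescriptor("personal"));
        admin.createTable(desc);

        // exists: equivalent of the shell's  exists 'employee'
        System.out.println("exists? " + admin.tableExists("employee"));

        // disable + drop: a table must be disabled before it can be dropped.
        admin.disableTable("employee");
        admin.deleteTable("employee");
        admin.close();
    }
}
```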
Data Manipulation Language
• put - Puts a cell value at a specified column in a specified row in a particular table.
• get - Fetches the contents of a row or a cell.
• delete - Deletes a cell value in a table.
• deleteall - Deletes all the cells in a given row.
• scan - Scans and returns the table data.
• count - Counts and returns the number of rows in a table.
• truncate - Disables, drops, and recreates a specified table.
• Java client API - In addition to the shell commands above, Java provides a client API to achieve these DML functionalities, CRUD (Create, Retrieve, Update, Delete) operations, and more through programming, under the org.apache.hadoop.hbase.client package. HTable, Put, and Get are the important classes in this package (a sketch follows).
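A minimal sketch of the shell's scan and deleteall counterparts in the Java client API, under the same assumptions as before (a hypothetical 'employee' table with a 'personal' column family); HTable is the classic entry point, while newer code obtains a Table from a Connection.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class DmlExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        // Classic API; newer code uses ConnectionFactory/Table instead.
        HTable table = new HTable(conf, "employee");

        // scan: equivalent of the shell's  scan 'employee'
        Scan scan = new Scan();
        scan.addColumn(Bytes.toBytes("personal"), Bytes.toBytes("name"));
        try (ResultScanner scanner = table.getScanner(scan)) {
            for (Result row : scanner) {
                System.out.println(Bytes.toString(row.getRow()) + " -> "
                        + Bytes.toString(row.getValue(Bytes.toBytes("personal"),
                                                      Bytes.toBytes("name"))));
            }
        }

        // delete: equivalent of the shell's  deleteall 'employee', 'row1'
        table.delete(new Delete(Bytes.toBytes("row1")));
        table.close();
    }
}
```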
Kundera - Object Mapper
• To start using HBase in your Java application with minimal learning, you can use a popular open-source API named Kundera.
• Kundera is a polyglot object mapper for NoSQL as well as RDBMS data stores. It is a single high-level Java API that supports NoSQL data stores.
• The idea behind Kundera is to make working with NoSQL databases drop-dead simple and fun. Kundera provides the following qualities (a usage sketch follows the list of supported stores):
• A robust querying system
• Easy object/relation mapping
• Support for second-level caching and event-based data handling
• Optimized data store persistence
• Connection pooling and Lucene-based indexing
• Kundera currently supports the following data stores:
• Cassandra
• MongoDB
• HBase
• Redis
• OracleNoSQL
• Neo4j
• CouchDB
• RethinkDB
• Kudu
• Relational databases
• Apache Spark
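Since Kundera implements the JPA interfaces, a mapped entity plus an EntityManager is the typical usage pattern. The sketch below is an assumption-laden illustration: the entity, table name, and persistence unit "hbase_pu" (which would be configured for HBase in META-INF/persistence.xml) are all hypothetical.

```java
import javax.persistence.Column;
import javax.persistence.Entity;
import javax.persistence.EntityManager;
import javax.persistence.EntityManagerFactory;
import javax.persistence.Id;
import javax.persistence.Persistence;
import javax.persistence.Table;

// A JPA entity; Kundera maps it to an HBase table.
@Entity
@Table(name = "users")   // hypothetical table name
class User {
    @Id
    private String userId;          // becomes the HBase row key
    @Column(name = "name")
    private String name;

    public User() {}
    public User(String userId, String name) { this.userId = userId; this.name = name; }
}

public class KunderaSketch {
    public static void main(String[] args) {
        // "hbase_pu" is a hypothetical persistence unit configured for HBase
        // in META-INF/persistence.xml.
        EntityManagerFactory emf = Persistence.createEntityManagerFactory("hbase_pu");
        EntityManager em = emf.createEntityManager();

        em.persist(new User("u1", "alice"));   // write a row
        User u = em.find(User.class, "u1");    // read it back by key
        System.out.println(u != null ? "found u1" : "missing");

        em.close();
        emf.close();
    }
}
```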
REST Client
• The Java API provides the most functionality, but many people want to use HBase without Java.
• There are two main approaches for doing that. One is the Thrift interface, which is the faster and more lightweight of the two options. The other way to access HBase is the REST interface, which uses HTTP verbs to perform an action, giving developers a wide choice of languages and programs to use.
• HBase REST Basics
• For either Thrift or REST to work, another HBase daemon needs to be running to handle these requests. These daemons can be installed from the hbase-thrift and hbase-rest packages. In the cluster, the Thrift and REST daemons are usually placed on hosts that don't run other services such as DataNodes or RegionServers, to keep the load down and responsiveness high for REST interactions.
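As a hedged sketch of the REST interface in action: assuming the REST daemon is running on localhost at its common default port 8080, and a hypothetical table 'users' with a row 'u1', an HTTP GET on /<table>/<row> returns the row's cells (with keys and values base64-encoded in the JSON representation).

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class RestClientSketch {
    public static void main(String[] args) throws Exception {
        // GET /<table>/<row> against the HBase REST daemon; host, port,
        // table, and row here are assumptions for illustration.
        URL url = new URL("http://localhost:8080/users/u1");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("GET");
        conn.setRequestProperty("Accept", "application/json");

        System.out.println("HTTP " + conn.getResponseCode());
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream()))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line); // JSON with base64-encoded keys/values
            }
        }
        conn.disconnect();
    }
}
```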
Thrift
• Apache Thrift is a software framework for scalable, cross-language services development. It combines a software stack with a code generation engine to build services that work efficiently and seamlessly between C++, Java, Python, PHP, Ruby, Erlang, Perl, Haskell, C#, Cocoa, JavaScript, Node.js, Smalltalk, OCaml, Delphi, and other languages.
HBASE EXAMPLES
• HBase is used whenever there is a need for write-heavy applications.
• HBase is used whenever we need to provide fast random access to available data.
• Companies such as Facebook, Twitter, Yahoo, and Adobe use HBase internally.
• Example domains:
• Medical
• Sports
• Web
• E-commerce
• Banking industry
• Telecom industry
• Oil and petroleum industry
• Example deployments:
• HBase at Facebook
• Pinterest
• HBase at Longtail Video
• AADHAAR
• Twitter
• Meetup