Not your dad's h base new

Not your Dad’s Old HBase
Gilad Moscovitch - Senior Consultant UXC PS
@moscovig
Yaniv Rodenski - Principal Consultant UXC PS
@YRodenski

Agenda
Our use cases
Introduction to Apache
Phoenix
The first use case -
retrospective
Managing a large scale
Graph with TitanDB
The second use case -
retrospective

The Cable Company
Our story starts with a
cable company that
grew:
Over a decade ago,
bought an ISP
Bought a mobile
network
Started new ventures
such as VOD and VoIP

Our Dataset
Billions of records (PB scale)
Countless number of formats:
Multiple systems
Network equipment
Devices
Dynamic data model
New devices are introduced frequently
(on average every two weeks)
New demands are introduced even
more frequently

The Cable Guys:
Gilad Moscovitch
Engineering Manager
Yaniv Rodenski
Architect in the CTO team

Our Starting Point:
Devices
Systems of
Records
ETL via
ODI
Oracle Exadata

Challenges
The Oracle Data Warehouse and ODI could
not handle the load
ETL devs could not handle the load; the ETL
team became a bottleneck
Not all data types arrived at the warehouse
We had to prioritise due to a lack of ETL devs
Incompatibility with the existing data model
Changes to the data model would take an
average of a month
Even when data was loaded, analysts were
not aware of the new tables, and we ended up
with an unusable schema

More Challenges
New data models that are not a
good fit for SQL databases:
Sparse data
Geospatial data
Full text
Graph
Need to ask harder questions
that require heavy processing:
Machine learning

Breaking Out
The new data platform
was Hadoop based
Using CDH (at that
time the most
advanced option)
Trying to reuse existing
components of the
platform as much as
possible

Challenge #1: Early Data
Access
Giving analysts, BI
developers and
business access to
raw data
For this use case we
reviewed a few tools,
including Apache
Phoenix

Apache Phoenix - SQL on
HBase
Apache Phoenix is a relational database layer over HBase with a
difference:
Table metadata is stored in an HBase table and versioned,
snapshot queries over prior versions will automatically use the
correct schema
Secondary indexes
Dynamic columns with schema on read
Views
Indexed
Updatable

Challenge #1: Results
In addition to Phoenix we also looked at Hive and Impala
Spark SQL, Presto and Drill were not considered due to immaturity
Impala was chosen
Schema on read was important
Hive on CDH doesn’t support Tez
Apache Phoenix was overkill and better suited to be a database rather than a
warehouse

Challenge #2: Family
Time
Clients are never represented
by a single entity:
Households
Business
Clients have multiple devices
generating data:
Home and mobile phones
IP addresses for devices
DVRs

Titan - A Distributed Graph
Titan is a scalable graph database
Optimized for storing and querying graphs
Runs on top of:
Cassandra
HBase
DynamoDB
BerkeleyDB
Support for geo, numeric range, and full-text search via:
ElasticSearch
Solr
Supports Gremlin - a graph querying DSL via
TinkerPop Gremlin over HTTP
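
To make the household problem concrete, here is a minimal in-memory property-graph sketch in plain Python. It does not use Titan or Gremlin, and the vertex labels, edge labels (member_of, uses) and identifiers are hypothetical. It only illustrates the kind of traversal a graph store is built for: start from one device, walk to the client, then to the household, then out to every device in that household.

```python
from collections import defaultdict


class Graph:
    """Tiny directed, labelled graph kept entirely in memory."""

    def __init__(self):
        self.labels = {}                     # vertex id -> vertex label
        self.out_edges = defaultdict(list)   # vertex id -> [(edge label, target)]
        self.in_edges = defaultdict(list)    # vertex id -> [(edge label, source)]

    def add_vertex(self, vid, label):
        self.labels[vid] = label

    def add_edge(self, src, label, dst):
        self.out_edges[src].append((label, dst))
        self.in_edges[dst].append((label, src))

    def out(self, vids, label):
        """Follow outgoing edges with the given label from a set of vertices."""
        return {d for v in vids for (l, d) in self.out_edges[v] if l == label}

    def in_(self, vids, label):
        """Follow incoming edges with the given label into a set of vertices."""
        return {s for v in vids for (l, s) in self.in_edges[v] if l == label}


def devices_in_same_household(g, device_id):
    clients = g.in_({device_id}, "uses")        # who uses this device
    households = g.out(clients, "member_of")    # the households they belong to
    members = g.in_(households, "member_of")    # everyone in those households
    return g.out(members, "uses")               # every device those people use


g = Graph()
for vid, label in [
    ("hh1", "household"),
    ("alice", "client"),
    ("bob", "client"),
    ("mobile-1", "device"),
    ("dvr-1", "device"),
    ("ip-10.0.0.7", "device"),
]:
    g.add_vertex(vid, label)

g.add_edge("alice", "member_of", "hh1")
g.add_edge("bob", "member_of", "hh1")
g.add_edge("alice", "uses", "mobile-1")
g.add_edge("bob", "uses", "dvr-1")
g.add_edge("bob", "uses", "ip-10.0.0.7")

if __name__ == "__main__":
    print(sorted(devices_in_same_household(g, "mobile-1")))
```

In a relational model the same question needs a chain of self-joins whose depth is fixed in the query; in a graph it is a walk along edges.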

Challenge #2: Testing Stage
HBase vs Cassandra benchmark + sanity check
Simulation of 1 billion vertices
Sanity check: OK
Not much difference in loading time and querying time on both stores
HBase chosen because of the existing infrastructure
Retrospective: 1 billion vertices on an empty graph didn’t really simulate anything

Challenge #2: POC Stage
Initializing an untuned HBase cluster on all 24 nodes of the existing cluster
Hosted side by side with MapReduce and Impala
Developing an initial ontology for the largest data source together with a developer
from the client application team
Developing a MapReduce job for loading hundreds of GB a day according to the
ontology

POC Performance
Input data was stored in hourly directories, so at first we scheduled the
MapReduce job for each hour.
Each hour of data took about 40 minutes to process and load.
Later on we scheduled the MapReduce job for a whole day at a time. Loading a
whole day took about half a day.
Map-Reduce jobs create new challenges - they hold lots of reducers for a long time, not fun to re
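
The figures on this slide imply the following back-of-envelope comparison. It assumes that "about half a day" means roughly 12 hours and that the hourly jobs ran one at a time; both are assumptions made for the arithmetic, not statements from the talk.

```python
MINUTES_PER_HOURLY_BATCH = 40   # from the talk: one hour of input took ~40 minutes
HOURS_PER_DAY = 24
DAILY_JOB_HOURS = 12            # assumption: "about half a day" taken as 12 hours

# Total cluster time spent loading one day of data, hour by hour.
hourly_total_hours = HOURS_PER_DAY * MINUTES_PER_HOURLY_BATCH / 60

# Fraction of wall-clock time the loader is busy in each mode.
hourly_busy_fraction = MINUTES_PER_HOURLY_BATCH / 60
daily_busy_fraction = DAILY_JOB_HOURS / HOURS_PER_DAY

if __name__ == "__main__":
    print(f"hourly batches: {hourly_total_hours:.1f} h of loading per day "
          f"({hourly_busy_fraction:.0%} busy)")
    print(f"daily batch:    {DAILY_JOB_HOURS:.1f} h of loading per day "
          f"({daily_busy_fraction:.0%} busy)")
    print(f"saved per day:  {hourly_total_hours - DAILY_JOB_HOURS:.1f} h")
```

Under these assumptions the daily job does less total work, but it occupies its reducers for one long stretch instead of 24 short ones, which is the trade-off the slide points at.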

Performance Tuning
HBase didn't handle the load; the symptoms included:
HBase write-blocking compactions
Retired region servers
Tuning performed:
Region split size - split after 11 GB
Memstore flush size tuning
GC Tuning
Decreasing the Java heap size from 32 to 16
Daily major compaction for the graph table
Retrospective: We had to statically partition the nodes into two separate clusters:
One for HBase, and one for everything else

Today
The main graph ingests:
~1.7 billion edges
~1.7 billion vertices
The main graph size is 20TB
20 region servers
Rebuilding the graph on average every 3 months for a new ontology
New data sources are added within a day by one (awesome) developer
Using a web based UI tool for graph exploration
Retrospective: Titan on HBase works pretty well for those sizes
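
A quick sanity check on these numbers, as a sketch: it assumes the 20 TB figure is the stored size of the graph table, that the 11 GB split size from the tuning slide still applies, and that binary units are meant. None of these is confirmed by the talk, and because regions are rarely full the region count is only a lower bound.

```python
import math

TB = 1024 ** 4
GB = 1024 ** 3

graph_bytes = 20 * TB        # from the talk: main graph size
region_servers = 20          # from the talk
max_region_bytes = 11 * GB   # from the tuning slide: split after 11 GB

# Average data held per region server.
per_server_tb = graph_bytes / region_servers / TB

# Fewest regions that could hold the table if every region were full.
min_regions = math.ceil(graph_bytes / max_region_bytes)
regions_per_server = min_regions / region_servers

if __name__ == "__main__":
    print(f"data per region server: {per_server_tb:.1f} TB")
    print(f"regions (lower bound):  {min_regions}")
    print(f"regions per server:     {regions_per_server:.1f}")
```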

Summary
HBase is a versatile datastore
Apache Phoenix modernises HBase with a semi-relational
SQL layer
Titan provides powerful graph capabilities
Never be naive about Big Data tools, they will bite you,
badly

Next month:
Karel Alfonso
Apache Flink
Ned Shawa
Apache NiFi
