Not your Dad’s Old HBase
Gilad Moscovitch - Senior Consultant UXC PS
@moscovig
Yaniv Rodenski - Principal Consultant UXC PS
@YRodenski
Agenda
Our use cases
Introduction to Apache
Phoenix
The first use case -
retrospective
Managing a large scale
Graph with TitanDB
The second use case -
retrospective
The Cable Company
Our story starts with a
cable company that
grew:
Over a decade ago,
bought an ISP
Bought a mobile
network
Started new ventures
such as VOD and VoIP
Our Dataset
Billions of records (PB scale)
Countless number of formats:
Multiple systems
Network equipment
Devices
Dynamic data model
New devices are introduced frequently
(on average every two weeks)
New demands are introduced even
more frequently
The Cable Guys:
Gilad Moscovitch
Engineering Manager
Yaniv Rodenski
Architect in the CTO team
Our Starting Point:
Devices
Systems of
Records
ETL via
ODI
Oracle Exadata
Challenges
The Oracle Data Warehouse and ODI could
not handle the load
ETL devs could not handle the load, the ETL
team became a bottleneck
Not all data types arrive at the warehouse
We had to prioritise due to lack of ETL devs
Incompatibility with the existing data model
Changes to the data model would take an
average of a month
Even when data was loaded, analysts were
not aware of the new tables, and we ended up
with an unusable schema
More Challenges
New data models that are not a
good fit for SQL databases:
Sparse data
Geospatial data
Full text
Graph
Need to ask harder questions
that require heavy processing:
Machine learning
Breaking Out
The new data platform
was Hadoop based
Using CDH (at that
time the most
advanced option)
Trying to reuse existing
components of the
platform as much as
possible
Challenge #1: Early Data
Access
Giving analysts, BI
developers and
business access to
raw data
For this use case we
reviewed a few tools,
including Apache
Phoenix
Apache Phoenix - SQL on
HBase
Apache Phoenix is a relational database layer over HBase with a
difference:
Table metadata is stored in an HBase table and versioned,
so snapshot queries over prior versions automatically use the
correct schema
Secondary indexes
Dynamic columns with schema on read
Views
Indexed
Updatable
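To make the feature list above concrete, here is a minimal sketch of what the corresponding Phoenix SQL might look like. The statements are held as plain Python strings so the sketch runs without a cluster. The table, column, index and view names are hypothetical, as is the `dynamic_select` helper; none of them come from the talk.

```python
# Illustrative Phoenix SQL kept as strings; nothing here connects to HBase.
# All table, column, index and view names are hypothetical.

CREATE_TABLE = """
CREATE TABLE IF NOT EXISTS device_events (
    event_id    VARCHAR PRIMARY KEY,
    device_id   VARCHAR,
    device_type VARCHAR,
    event_time  TIMESTAMP
)"""

# Secondary index on a non-key column.
CREATE_INDEX = "CREATE INDEX IF NOT EXISTS idx_device ON device_events (device_id)"

# A view with a simple equality filter, the shape Phoenix can treat as updatable.
CREATE_VIEW = (
    "CREATE VIEW IF NOT EXISTS dvr_events AS "
    "SELECT * FROM device_events WHERE device_type = 'DVR'"
)


def dynamic_select(table, fixed_cols, dynamic_cols):
    """Build a SELECT that declares dynamic columns at query time (schema on read).

    dynamic_cols maps column name -> SQL type. The columns are not part of the
    table definition; they are declared in parentheses after the table name.
    Names are interpolated directly, so only use trusted identifiers here.
    """
    declared = ", ".join(f"{name} {sql_type}" for name, sql_type in dynamic_cols.items())
    selected = ", ".join(list(fixed_cols) + list(dynamic_cols))
    return f"SELECT {selected} FROM {table} ({declared})"


query = dynamic_select("device_events", ["event_id"], {"firmware": "VARCHAR"})
print(query)
# prints: SELECT event_id, firmware FROM device_events (firmware VARCHAR)
```

The dynamic-column form matters for a dataset like ours, where new device types appear every couple of weeks: a new attribute can be written and queried without an `ALTER TABLE` first.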
Demo - Apache Phoenix
Challenge #1: Results
In addition to Phoenix we also looked at Hive and Impala
Spark SQL, Presto and Drill were not considered due to immaturity
Impala was chosen
Schema on read was important
Hive on CDH doesn’t support Tez
Apache Phoenix was overkill and better suited to be a database rather than a
warehouse
Challenge #2: Family
Time
Clients are never represented
by a single entity:
Households
Business
Clients have multiple devices
generating data:
Home and mobile phones
IP addresses for devices
DVRs
Titan - A Distributed Graph
Titan is a scalable graph database
Optimized for storing and querying graphs
Runs on top of:
Cassandra
HBase
DynamoDB
BerkeleyDB
Support for geo, numeric range, and full-text search via:
Elasticsearch
Solr
Supports Gremlin - a graph querying DSL via
TinkerPop Gremlin over HTTP
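To show the kind of question a graph makes easy, here is a minimal in-memory sketch in plain Python. It is not Titan and not Gremlin; the `TinyGraph` class, the `owns` edge label and the vertex identifiers are all made up for illustration. It models households that own devices and answers "which other devices belong to the same household as this one?", which in Gremlin would be a short traversal stepping in and then out along an edge label.

```python
from collections import defaultdict


class TinyGraph:
    """A toy directed, labelled graph (hypothetical; stands in for a real graph store)."""

    def __init__(self):
        self.out_edges = defaultdict(list)  # vertex -> [(label, target)]
        self.in_edges = defaultdict(list)   # vertex -> [(label, source)]

    def add_edge(self, source, label, target):
        self.out_edges[source].append((label, target))
        self.in_edges[target].append((label, source))

    def out(self, vertex, label):
        """Vertices reached by following outgoing edges with this label."""
        return [t for l, t in self.out_edges[vertex] if l == label]

    def in_(self, vertex, label):
        """Vertices that point at this vertex with this label."""
        return [s for l, s in self.in_edges[vertex] if l == label]


def sibling_devices(graph, device):
    """Step in to the owning household(s), then out to their other devices."""
    siblings = []
    for household in graph.in_(device, "owns"):
        for other in graph.out(household, "owns"):
            if other != device and other not in siblings:
                siblings.append(other)
    return siblings


g = TinyGraph()
g.add_edge("household:1", "owns", "mobile:555-0100")
g.add_edge("household:1", "owns", "dvr:A1")
g.add_edge("household:1", "owns", "ip:10.0.0.7")
g.add_edge("household:2", "owns", "dvr:B2")

print(sibling_devices(g, "dvr:A1"))
# prints: ['mobile:555-0100', 'ip:10.0.0.7']
```

In a relational model the same question needs a self-join per hop; in a graph it stays a two-step walk however many device types are added.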
Demo - Clash of the Titan
Challenge #2: Testing Stage
HBase vs Cassandra benchmark + sanity check
Simulation of 1 billion vertices
Sanity check - OK
Not much difference in loading time or querying time between the two stores
HBase chosen because of the existing infrastructure
Retrospective: 1 billion vertices on an empty graph didn’t really simulate anything
Challenge #2: POC Stage
Initializing an untuned HBase cluster on all 24 nodes of the existing cluster
Hosted side by side with MapReduce and Impala
Developing an initial ontology for the largest data source together with a developer
from the client application team
Developing MapReduce jobs for loading hundreds of GB a day according to the
ontology
POC Performance
Input data was stored in hourly directories, so at first we scheduled the
MapReduce job for each hour.
Each hour of data took about 40 minutes to process and load.
Later on we scheduled the MapReduce job for a whole day at a time. Loading a
whole day took about half a day.
MapReduce jobs create new challenges - hold lots of reducers for a long time, not fun to re
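A quick back-of-the-envelope comparison of the two schedules, using only the figures quoted on this slide. The one assumption is that "about half a day" is taken as 12 hours.

```python
MINUTES_PER_HOURLY_JOB = 40   # from the slide: one hour of data took ~40 minutes
HOURS_PER_DAY = 24
DAILY_JOB_HOURS = 12          # assumption: "about half a day" read as 12 hours

# Total loading time per day under each schedule.
hourly_total_hours = MINUTES_PER_HOURLY_JOB * HOURS_PER_DAY / 60

# Fraction of the day the cluster spends loading.
hourly_busy_fraction = MINUTES_PER_HOURLY_JOB / 60
daily_busy_fraction = DAILY_JOB_HOURS / HOURS_PER_DAY

print(f"hourly schedule: {hourly_total_hours:.0f} h of loading per day "
      f"({hourly_busy_fraction:.0%} busy)")
print(f"daily schedule:  {DAILY_JOB_HOURS} h of loading per day "
      f"({daily_busy_fraction:.0%} busy)")
# prints:
# hourly schedule: 16 h of loading per day (67% busy)
# daily schedule:  12 h of loading per day (50% busy)
```

Under that reading the daily batch saves roughly four hours of loading per day, at the price of one long job that occupies its reducers for the whole run.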
Performance Tuning
HBase didn't handle the load; the symptoms included:
HBase write-blocking compactions
Retired region servers
Tuning performed:
Region split size - split after 11 GB
Memstore flush size tuning
GC Tuning
Java heap size decreased from 32 to 16
Daily major compaction for the graph table
Retrospective: We had to statically partition to two different clusters:
One for HBase, and one for everything else
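As a rough sketch, two of the tuning items above can be expressed as `hbase-site.xml` properties. The property names are standard HBase settings, but the mapping is an assumption: the slide does not say how the daily major compaction was triggered (a time-based interval, as here, or an external scheduled job), and the interval setting applies cluster-wide rather than to one table. Memstore flush size (`hbase.hregion.memstore.flush.size`) and heap size are left out because the slide gives no value or unit for them.

```python
import xml.etree.ElementTree as ET

GIB = 1024 ** 3
MS_PER_DAY = 24 * 60 * 60 * 1000

# Values taken from the slide; the choice of properties is an assumption.
settings = {
    # Split a region once its store files grow past 11 GB.
    "hbase.hregion.max.filesize": 11 * GIB,
    # Interval between time-based major compactions, in milliseconds (one day).
    "hbase.hregion.majorcompaction": MS_PER_DAY,
}


def render_hbase_site(props):
    """Render a dict of properties as an hbase-site.xml style document."""
    root = ET.Element("configuration")
    for name, value in props.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = str(value)
    return ET.tostring(root, encoding="unicode")


print(render_hbase_site(settings))
```

Larger regions mean fewer splits during heavy loads, and a predictable compaction schedule moves the expensive rewrite out of the ingest window.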
Today
The main graph ingests:
~1.7 billion edges
~1.7 billion vertices
The main graph size is 20TB
20 region servers
Rebuilding the graph on average every 3 months for new ontology
New data sources are added within a day by one (awesome) developer
Using a web based UI tool for graph exploration
Retrospective: Titan on HBase works pretty well for those sizes
Summary
HBase is a versatile datastore
Apache Phoenix modernises HBase with a semi-relational
SQL layer
Titan provides powerful graph capabilities
Never be naive about Big Data tools, they will bite you,
badly
Next month:
Karel Alfonso - Apache Flink
Ned Shawa - Apache NiFi
