Cloudera Operational DB
(powered by Apache HBase and
Apache Phoenix)
Beyond the Tyranny of the Schema
December 2019
Timothy Spann
© 2019 Cloudera, Inc. All rights reserved. 2
Welcome to Future of Data - Princeton
@PaasDev
https://www.meetup.com/futureofdata-princeton/
From Big Data to AI to Streaming to Containers to
Cloud to Analytics to Cloud Storage to Fast Data to
Machine Learning to Microservices to ...
Who Am I? Timothy Spann
Data in Motion Field Engineer
@PaasDev
DZone Zone Leader and Big Data MVB;
Princeton NJ Future of Data Meetup;
ex-Pivotal Field Engineer;
Author of Apache Kafka RefCard
https://github.com/tspannhw
https://www.datainmotion.dev/
This Meetup Made Possible Thanks To:
Paul Vidal from https://www.meetup.com/futureofdata-philadelphia/ for CDP HBase
Environment and Cloud Magic
Josh Elser and Josiah Goodson for OpDB Slides and HBase Guidance
Milind Pandit from https://www.meetup.com/TechnologySolutionsHub
Mehul Shah from https://www.meetup.com/TechnologySolutionsHub
Vijay Garg from https://pga.fund/
Madhavi from https://www.nuwaysolutions.com/
https://www.slideshare.net/bunkertor/tracking-crime-as-it-occurs-with-apache-phoenix-apache-hbase-and-apache-nifi
WHAT HAVE Apache HBase 2.0 & Apache Phoenix ENABLED?
• Operationalizing ML / AI to revolutionize
healthcare, public utilities, etc
• Serving real-time content at webscale
• Empowering big data analytics for operational
and offline uses
• Acting as a resilient store of record
CLOUDERA OPERATIONAL DB (powered by Apache HBase & Apache Phoenix)
An operational DB is a DBMS used to manage dynamic, changing data in real time and to enable
applications that drive the business
APACHE HBASE FAST FACTS
• Largest database: 14 petabytes
• Best-known app: Siri
• Fastest ingestion: 20M events/s
• Users: 750+
HBASE ARCHITECTURE
Orchestration layer
HMaster
• DDL operations
• Region assignment
• Recovery orchestration
ZooKeeper
• Heartbeat
• Server state
Data plane
Region Server
• Services client reads & writes
• Maximizes in-memory operations for low latency
DataNode
• Provides data resiliency
Region
• Regions are table segments
• Read and write paths are in the data plane
[Diagram: each region holds rows (R) whose values (Val) sit in columns (Col) grouped into column families (ColFam)]
SCHEMA-LESS DATA MODEL
• Column families defined at time of table creation
• Columns created as required (at time of data insertion)
• No limits to number of columns
• Tables can grow in two dimensions – columns and rows
• Compression & encoding applied at column family level
• No declaration of data types (i.e., a column can contain multiple data types)
          Column Family       Column Family
          Column    Column    Column    Column
RowKey    Cell      Cell      Cell      Cell
RowKey    Cell      Cell      Cell      Cell
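The data model above can be illustrated with a small in-memory sketch. This is a toy model only and does not use the HBase client API; the class name `ToyWideTable` and its methods are hypothetical. It assumes column families are fixed when the table is created, while column qualifiers appear on first write and every cell holds untyped bytes.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

/** Toy in-memory model of a wide-column table. Hypothetical; NOT the HBase API. */
class ToyWideTable {
    // Column families are fixed at "table creation" time.
    private final Set<String> families;
    // rowKey -> "family:qualifier" -> raw bytes; rows stay sorted by key.
    private final TreeMap<String, TreeMap<String, byte[]>> rows = new TreeMap<>();

    ToyWideTable(String... columnFamilies) {
        this.families = new HashSet<>(Arrays.asList(columnFamilies));
    }

    /** Writes a cell; the column (qualifier) is created implicitly on first use. */
    void put(String rowKey, String family, String qualifier, byte[] value) {
        if (!families.contains(family)) {
            throw new IllegalArgumentException("Unknown column family: " + family);
        }
        rows.computeIfAbsent(rowKey, k -> new TreeMap<>())
            .put(family + ":" + qualifier, value);
    }

    /** Returns the raw bytes of a cell, or null if the row or column is absent. */
    byte[] get(String rowKey, String family, String qualifier) {
        Map<String, byte[]> row = rows.get(rowKey);
        return row == null ? null : row.get(family + ":" + qualifier);
    }

    /** Number of columns actually present in one row. */
    int columnCount(String rowKey) {
        Map<String, byte[]> row = rows.get(rowKey);
        return row == null ? 0 : row.size();
    }

    public static void main(String[] args) {
        ToyWideTable table = new ToyWideTable("info", "metrics");
        // Same column, different "types": the table only ever sees bytes.
        table.put("row1", "info", "greeting", "hello".getBytes(StandardCharsets.UTF_8));
        table.put("row2", "info", "greeting", ByteBuffer.allocate(8).putLong(42L).array());
        // A new column appears only on the row that writes it.
        table.put("row2", "metrics", "clicks", ByteBuffer.allocate(4).putInt(7).array());
        System.out.println("row1 columns: " + table.columnCount("row1"));
        System.out.println("row2 columns: " + table.columnCount("row2"));
    }
}
```

Rows in this sketch can carry different sets of columns, which is the sense in which the table grows in two dimensions.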
HIGHLY AVAILABLE OUT OF THE BOX (<1 MINUTE RECOVERY)
Region Server 1, Region Server 2, Region Server 3
HDFS (3 copies of data)
What happens when a region server crashes
1. The region server crashes
2. Reads and writes time out for regions on the impacted region server
3. Regions are redistributed to other region servers
4. The WAL is replayed on the other region servers
5. Reads & writes to the impacted regions can continue
Typical recovery period < 1 minute (for impacted regions only)
No manual intervention
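The recovery steps above can be sketched as a toy simulation. This is not HBase code; `ToyCluster`, `WalEntry` and `failover` are hypothetical names. It assumes the WAL survives the crash because it lives on replicated storage, while each region's in-memory state is lost and rebuilt by replay on the surviving server.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy simulation of region failover with WAL replay. Hypothetical; NOT HBase code. */
class ToyCluster {
    /** One durable log record: region, key and value of an edit. */
    static final class WalEntry {
        final String region;
        final String key;
        final String value;
        WalEntry(String region, String key, String value) {
            this.region = region;
            this.key = key;
            this.value = value;
        }
    }

    // Durable log: stands in for the WAL on replicated storage.
    private final List<WalEntry> wal = new ArrayList<>();
    // Which server currently hosts each region.
    private final Map<String, String> regionToServer = new HashMap<>();
    // In-memory edits per region: lost when the hosting server crashes.
    private final Map<String, Map<String, String>> memstores = new HashMap<>();

    void assign(String region, String server) {
        regionToServer.put(region, server);
        memstores.putIfAbsent(region, new HashMap<>());
    }

    /** Every write is logged first, then applied in memory. */
    void write(String region, String key, String value) {
        wal.add(new WalEntry(region, key, value));
        memstores.get(region).put(key, value);
    }

    String read(String region, String key) {
        return memstores.get(region).get(key);
    }

    String serverOf(String region) {
        return regionToServer.get(region);
    }

    /** Steps 1 to 5: drop in-memory state, reassign regions, replay the WAL. */
    void failover(String crashedServer, String survivor) {
        for (Map.Entry<String, String> e : regionToServer.entrySet()) {
            if (!e.getValue().equals(crashedServer)) {
                continue; // regions on healthy servers are not impacted
            }
            String region = e.getKey();
            Map<String, String> rebuilt = new HashMap<>(); // in-memory edits are gone
            e.setValue(survivor);                          // redistribute the region
            for (WalEntry entry : wal) {                   // replay logged edits in order
                if (entry.region.equals(region)) {
                    rebuilt.put(entry.key, entry.value);
                }
            }
            memstores.put(region, rebuilt);                // region serves traffic again
        }
    }

    public static void main(String[] args) {
        ToyCluster cluster = new ToyCluster();
        cluster.assign("region-a", "rs1");
        cluster.assign("region-b", "rs2");
        cluster.write("region-a", "k1", "v1");
        cluster.failover("rs1", "rs2");
        System.out.println(cluster.serverOf("region-a") + " serves k1=" + cluster.read("region-a", "k1"));
    }
}
```

Only regions hosted on the crashed server are touched, which mirrors the note that recovery affects impacted regions only.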
SECURITY MODEL
Authentication
• Kerberos
Role-based access control
• Permissions & scope enable flexible role-based access control
• Scope: Global, Namespace, Table, Column Family, Cell
DB security & encryption
• Transparent encryption of data on the wire and data on disk (HFile for data at rest, secure WAL for data in motion within HBase)
• Logging & auditability: configurable & fixed event
What Phoenix adds to HBase
HBase: flexible, scale-out, NoSQL database
Put put = new Put(Bytes.toBytes(rowKey));
put.addColumn(COLUMN_FAMILY_NAME, COLUMN_NAME, Bytes.toBytes(GREETINGS));
table.put(put);
Pros:
• Maximally flexible & customizable
Cons:
• SQL only for data remediation
• Unfamiliar to SQL developers
• Requires non-traditional data architecture
Phoenix: RDBMS-like, scale-out database
PreparedStatement stmt = conn.prepareStatement("UPSERT INTO TABLE_NAME VALUES (?, ?)");
stmt.setString(1, rowKey);
stmt.setString(2, GREETINGS);
stmt.executeUpdate();
conn.commit(); // Phoenix connections do not auto-commit by default
Pros:
• Programmatic ANSI SQL support
• RDBMS-like data architecture
• Auto-applies performance best practices
• Can co-exist with HBase apps
Cons:
• Reduced flexibility vis-à-vis vanilla HBase
• Phoenix-specific data format means you can't use HBase APIs directly
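As a slightly fuller sketch of the Phoenix side, the following uses only the standard `java.sql` interfaces. The table name `GREETINGS_TABLE` and the columns `ID` and `GREETING` are hypothetical, and the `Connection` is assumed to come from `DriverManager.getConnection` with a `jdbc:phoenix:` URL and the Phoenix client JAR on the classpath. `main` only prints the generated SQL, so it runs without a cluster.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Sketch of a parameterized Phoenix UPSERT over plain JDBC. Names are hypothetical. */
class PhoenixUpsertSketch {

    /** Builds e.g. "UPSERT INTO T (A, B) VALUES (?, ?)". */
    static String upsertSql(String table, List<String> columns) {
        String cols = String.join(", ", columns);
        String marks = String.join(", ", Collections.nCopies(columns.size(), "?"));
        return "UPSERT INTO " + table + " (" + cols + ") VALUES (" + marks + ")";
    }

    /** Writes one row; the caller supplies an open Phoenix JDBC connection. */
    static void upsertGreeting(Connection conn, String rowKey, String greeting)
            throws SQLException {
        String sql = upsertSql("GREETINGS_TABLE", Arrays.asList("ID", "GREETING"));
        try (PreparedStatement stmt = conn.prepareStatement(sql)) {
            stmt.setString(1, rowKey);
            stmt.setString(2, greeting);
            stmt.executeUpdate();
        }
        // Phoenix connections do not auto-commit by default, so commit explicitly.
        conn.commit();
    }

    public static void main(String[] args) {
        // No cluster is contacted here; this only shows the statement text.
        System.out.println(upsertSql("GREETINGS_TABLE", Arrays.asList("ID", "GREETING")));
    }
}
```

Binding values through `?` placeholders avoids building SQL by string concatenation, which is the usual JDBC practice and also applies to Phoenix.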
Key Phoenix capabilities
• ANSI SQL including joins
• Flexible Schemas / Dynamic Columns
• Secondary Indexes
• Aggregation pushdowns
• Cross-language client support
• Query logging
• Security through Ranger (supports RBAC, ABAC,
etc)
• JDBC/ODBC connectivity for operational reporting
• Plugs in to any JDBC/ODBC-compatible BI tool
to enable self-service analytics and insight
Phoenix
Applications
ANSI SQL 92 Support
Supported today
• Standard SQL Data Types
• SELECT, UPSERT, DELETE
• JOINs: Inner and Outer
• Subqueries
• Secondary Indexes
• GROUP BY, ORDER BY, HAVING
• AVG, COUNT, MIN, MAX, SUM
• Primary Keys, Constraints
• CASE, COALESCE
• VIEWs
• Flexible Schema
• UNION ALL
Roadmap
• UNION
• Windowing Functions
• Transactions
• Cross Joins
• Authorization
• Replication Management
• Column Constraints and Defaults
• UDFs
CLOUD OPTIMIZED : HBASE backed by both HDFS and S3
Cloudera provides HBase backed by Amazon’s S3
● Cloudera Data Platform (CDP) provides an out-of-the-box solution that allows Apache HBase
deployments to use Amazon Simple Storage Service (S3) as its main persistence layer for saving
table data
● Amazon's Simple Storage Service (S3) is an eventually consistent object store, whereas HBase requires a
consistent and atomic filesystem, so HBase cannot use S3 directly. Let's look at the topology.
CLOUD OPTIMIZED : Cloudera HBASE backed by both HDFS and S3
When you launch an Operational Database (HBase) cluster on CDP, HBase StoreFiles (the backing
files for HBase tables) are stored in S3, while HBase write-ahead logs (WALs) are stored in an
HDFS instance that runs alongside HBase as usual.
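The split layout described above can be sketched as two settings. `hbase.rootdir` and `hbase.wal.dir` are standard HBase properties and `s3a://` is the Hadoop S3 connector scheme, but whether CDP sets exactly these keys is an assumption here. The bucket name, NameNode address and paths are hypothetical, and in a real deployment these values would live in `hbase-site.xml` rather than in a Java map.

```java
import java.net.URI;
import java.util.LinkedHashMap;
import java.util.Map;

/** Sketch of StoreFiles-on-S3 plus WAL-on-HDFS settings. Values are hypothetical. */
class SplitStorageConfigSketch {

    /** Returns the two properties that point table data and the WAL at different stores. */
    static Map<String, String> splitStorage(String s3Bucket, String hdfsNameNode) {
        Map<String, String> conf = new LinkedHashMap<>();
        // Table data (StoreFiles) on the object store.
        conf.put("hbase.rootdir", "s3a://" + s3Bucket + "/hbase");
        // Write-ahead logs on a consistent filesystem next to the cluster.
        conf.put("hbase.wal.dir", "hdfs://" + hdfsNameNode + "/hbase-wal");
        return conf;
    }

    /** Extracts the filesystem scheme (for example "s3a" or "hdfs") from a URI string. */
    static String scheme(String uri) {
        return URI.create(uri).getScheme();
    }

    public static void main(String[] args) {
        Map<String, String> conf = splitStorage("example-bucket", "namenode:8020");
        for (Map.Entry<String, String> e : conf.entrySet()) {
            System.out.println(e.getKey() + " = " + e.getValue() + " [" + scheme(e.getValue()) + "]");
        }
    }
}
```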
CLOUD OPTIMIZED : HBASE backed by both HDFS and S3
● Configuring HBase to use S3 for its StoreFiles has many benefits to our users.
● One such benefit is that users can decouple their storage and compute.
● If there are times in which no access to HBase is necessary, HBase can be cleanly
shut down and all compute resources reclaimed to eliminate any cost of compute.
● When HBase access is needed again, the HBase cluster can be recreated, pointing
to the same data in S3. Upon startup, HBase can re-initialize itself solely from the
data in S3.
WHERE IS APACHE HBASE TODAY
• Large ecosystem (NiFi, Spark, Hive, Impala, Solr, Ranger, Atlas, etc.)
• Supports NoSQL, SQL, geospatial, graph, time series, key-value and other modes [1]
[1] In conjunction with other open source projects built on top of HBase
Cloudera and Apache HBase
● The upstream community is large and very active, with contributions from developers at
Cloudera, Microsoft, Amazon, Alibaba, Apple, Salesforce, Xiaomi and others.
● Cloudera is a very active contributor to upstream HBase and Apache Phoenix.
○ Currently > 8 PMC members and > 2 committers.
● CDP is based on the latest HBase v2 and Phoenix v5.
New features in HBase 2+
● Operational simplicity
○ Assignment Manager V2 (using Procedure Framework 2)
○ Offline compaction tool (runs outside RegionServers to avoid I/O thrashing)
○ Replication: namespace-level, serial, and for bulk loads
● Performance
○ Off-heap cache improvements (uses DirectByteBuffers to manage buckets outside
the JVM heap, removing GC impact for better read performance)
● Space Quotas (to support multi tenancy)
● S3 support
● Spark 2 integration
● Async Client
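The off-heap cache idea mentioned under Performance can be illustrated with the JDK's direct buffers. This is a toy sketch, not HBase's cache implementation; `ToyOffHeapBucket` and its fixed-size slot layout are hypothetical. It assumes cached blocks are copied into a single direct `ByteBuffer`, so the bytes live outside the garbage-collected heap.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

/** Toy fixed-slot cache backed by a direct buffer. Hypothetical; NOT HBase's cache. */
class ToyOffHeapBucket {
    private final ByteBuffer buffer; // direct buffer: memory is outside the JVM heap
    private final int slotSize;
    private final int[] lengths;     // bytes actually stored in each slot

    ToyOffHeapBucket(int slots, int slotSize) {
        this.buffer = ByteBuffer.allocateDirect(slots * slotSize);
        this.slotSize = slotSize;
        this.lengths = new int[slots];
    }

    /** Copies a block into the given slot of the off-heap buffer. */
    void put(int slot, byte[] block) {
        if (block.length > slotSize) {
            throw new IllegalArgumentException("Block larger than slot size");
        }
        ByteBuffer view = buffer.duplicate(); // own position, same underlying memory
        view.position(slot * slotSize);
        view.put(block);
        lengths[slot] = block.length;
    }

    /** Copies the block stored in the given slot back onto the heap. */
    byte[] get(int slot) {
        byte[] out = new byte[lengths[slot]];
        ByteBuffer view = buffer.duplicate();
        view.position(slot * slotSize);
        view.get(out);
        return out;
    }

    boolean isOffHeap() {
        return buffer.isDirect();
    }

    public static void main(String[] args) {
        ToyOffHeapBucket bucket = new ToyOffHeapBucket(4, 64);
        bucket.put(1, "cached block".getBytes(StandardCharsets.UTF_8));
        System.out.println(new String(bucket.get(1), StandardCharsets.UTF_8));
        System.out.println("off-heap: " + bucket.isOffHeap());
    }
}
```

Because the cached bytes are not ordinary heap objects, the garbage collector does not have to trace or move them, which is the effect the slide attributes to the off-heap cache.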
• Provides familiar & easy interface for
developers
• Advanced multi-tenancy capabilities
• Support near 100% availability for mission
critical applications & many traditional
transactional apps
• Scale to billions of rows and millions of
columns
• Easily combine data sources that use a wide
variety of different structures and schemas
Storage for business apps that require big-data
Ingest Store Primary Use
Query &
Remediate
NO(T ONLY)SQL PHOENIX
None of this command line mess:
HUE FOR SQL & DATA BROWSING FOR REMEDIATION
SQL interface using Impala or Hive for query processing
• Supports SQL-based insert, update and delete queries for data in HBase
GUI-based data browser
• Supports search, insert, update, delete and DDL for HBase
Visualization
HBase CRUD
ENABLING DATA-DRIVEN APPS
Fast
• Real time model serving w/ <5ms latency
• Limitless concurrency (>100M updates/sec)
Easy
• Stream and bulk ingest & processing
• Process automation
• Consolidate multiple databases
• Schema flexibility
• SQL & NoSQL interface
Scalable
• Multi-petabyte scale
• Unlimited Tenants
Highly-available
• Automatic recovery from server failure
• Advanced replication & synchronization
topologies
• Multiple backup methodologies
Multi-tenant capable
• Resource isolation
• Throttling & Quotas
Secure
• Role based access control
• Fine-grained authorizations (e.g., tenant, table,
column family, cell)
NiFi Integration with HBase
• PutHBaseRecord
• PutHBaseJSON
• PutHBaseCell
• FetchHBaseRow
• GetHBase
• ScanHBase
• DeleteHBaseRow
• DeleteHBaseCells
https://www.datainmotion.dev/2019/11/exploring-apache-nifi-110-parameters.html
ELT/ETL Lookup Services
• HBase_1_1_2_ListLookupService
• HBase_2_RecordLookupService
• HBase_2_ClientService
• HBase_2_ClientMapCacheService
https://community.cloudera.com/t5/Community-Articles/Reading-OpenData-JSON-and-Storing-into-Phoenix-Tables/ta-p/247323
https://github.com/tspannhw/HBase2
SPRING BOOT APPLICATION TO PHOENIX
https://community.cloudera.com/t5/Community-Articles/Creating-a-Spring-Boot-Java-8-Microservice-To-Read-Apache/ta-p/247379
https://github.com/tspannhw/phillycrime-springboot-phoenix
THANK YOU

More Related Content

What's hot (20)

PDF
Emerging trends in data analytics
Wei-Chiu Chuang
 
PPTX
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
DataWorks Summit
 
PPTX
Accelerating Big Data Insights
DataWorks Summit
 
PPTX
Hive LLAP: A High Performance, Cost-effective Alternative to Traditional MPP ...
DataWorks Summit
 
PPTX
Faster Batch Processing with Cloudera 5.7: Hive-on-Spark is ready for production
Cloudera, Inc.
 
PPTX
Accelerate Your Big Data Analytics Efforts with SAS and Hadoop
DataWorks Summit
 
PPTX
Lightning Fast Analytics with Hive LLAP and Druid
DataWorks Summit
 
PDF
Secured (Kerberos-based) Spark Notebook for Data Science: Spark Summit East t...
Spark Summit
 
PPTX
Securing Hadoop in an Enterprise Context
DataWorks Summit/Hadoop Summit
 
PPTX
Insight into Hyperconverged Infrastructure
HTS Hosting
 
PPTX
Multi-Tenant Operations with Cloudera 5.7 & BT
Cloudera, Inc.
 
PPTX
Apache Ignite vs Alluxio: Memory Speed Big Data Analytics
DataWorks Summit
 
PPTX
Built-In Security for the Cloud
DataWorks Summit
 
PPTX
Hadoop and Friends as Key Enabler of the IoE - Continental's Dynamic eHorizon
DataWorks Summit/Hadoop Summit
 
PPTX
Enabling Modern Application Architecture using Data.gov open government data
DataWorks Summit
 
PDF
Spark + Flashblade: Spark Summit East talk by Brian Gold
Spark Summit
 
PDF
Securing Data in Hybrid on-premise and Cloud Environments Using Apache Ranger
DataWorks Summit
 
PPTX
Logical Data Warehouse: How to Build a Virtualized Data Services Layer
DataWorks Summit
 
PPTX
Cloudy with a chance of Hadoop - real world considerations
DataWorks Summit
 
PDF
Apache Spark and Apache Ignite: Where Fast Data Meets IoT
Denis Magda
 
Emerging trends in data analytics
Wei-Chiu Chuang
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
DataWorks Summit
 
Accelerating Big Data Insights
DataWorks Summit
 
Hive LLAP: A High Performance, Cost-effective Alternative to Traditional MPP ...
DataWorks Summit
 
Faster Batch Processing with Cloudera 5.7: Hive-on-Spark is ready for production
Cloudera, Inc.
 
Accelerate Your Big Data Analytics Efforts with SAS and Hadoop
DataWorks Summit
 
Lightning Fast Analytics with Hive LLAP and Druid
DataWorks Summit
 
Secured (Kerberos-based) Spark Notebook for Data Science: Spark Summit East t...
Spark Summit
 
Securing Hadoop in an Enterprise Context
DataWorks Summit/Hadoop Summit
 
Insight into Hyperconverged Infrastructure
HTS Hosting
 
Multi-Tenant Operations with Cloudera 5.7 & BT
Cloudera, Inc.
 
Apache Ignite vs Alluxio: Memory Speed Big Data Analytics
DataWorks Summit
 
Built-In Security for the Cloud
DataWorks Summit
 
Hadoop and Friends as Key Enabler of the IoE - Continental's Dynamic eHorizon
DataWorks Summit/Hadoop Summit
 
Enabling Modern Application Architecture using Data.gov open government data
DataWorks Summit
 
Spark + Flashblade: Spark Summit East talk by Brian Gold
Spark Summit
 
Securing Data in Hybrid on-premise and Cloud Environments Using Apache Ranger
DataWorks Summit
 
Logical Data Warehouse: How to Build a Virtualized Data Services Layer
DataWorks Summit
 
Cloudy with a chance of Hadoop - real world considerations
DataWorks Summit
 
Apache Spark and Apache Ignite: Where Fast Data Meets IoT
Denis Magda
 

Similar to Cloudera Operational DB (Apache HBase & Apache Phoenix) (20)

PDF
Introduction to HBase - NoSqlNow2015
Apekshit Sharma
 
PDF
Hadoop and HBase in the Real World
Cloudera, Inc.
 
PDF
Architectural Evolution Starting from Hadoop
SpagoWorld
 
PPTX
Hadoop and h base in the real world
Joey Echeverria
 
PDF
Introduction to HBase
Apekshit Sharma
 
PDF
Impala: Real-time Queries in Hadoop
Cloudera, Inc.
 
PPTX
Cloudera Impala - Las Vegas Big Data Meetup Nov 5th 2014
cdmaxime
 
PPTX
The Future of Hbase
Salesforce Engineering
 
PPTX
Twitter with hadoop for oow
Gwen (Chen) Shapira
 
PDF
Application Architectures with Hadoop - Big Data TechCon SF 2014
hadooparchbook
 
PDF
Storage Engine Considerations for Your Apache Spark Applications with Mladen ...
Spark Summit
 
PPTX
Keynote: The Future of Apache HBase
HBaseCon
 
PDF
Cloudera Enabling Native Integration of NoSQL HBase with Cloud Providers.pdf
wchevreuil
 
PPTX
Hbasepreso 111116185419-phpapp02
Gokuldas Pillai
 
PDF
SQL Engines for Hadoop - The case for Impala
markgrover
 
PPTX
HBase: Just the Basics
HBaseCon
 
PPTX
HBaseCon 2014-Just the Basics
Jesse Anderson
 
PDF
Hbase: an introduction
Jean-Baptiste Poullet
 
PDF
HBase ArcheTypes
Matteo Bertozzi
 
PDF
Apache Hive: From MapReduce to Enterprise-grade Big Data Warehousing
c-bslim
 
Introduction to HBase - NoSqlNow2015
Apekshit Sharma
 
Hadoop and HBase in the Real World
Cloudera, Inc.
 
Architectural Evolution Starting from Hadoop
SpagoWorld
 
Hadoop and h base in the real world
Joey Echeverria
 
Introduction to HBase
Apekshit Sharma
 
Impala: Real-time Queries in Hadoop
Cloudera, Inc.
 
Cloudera Impala - Las Vegas Big Data Meetup Nov 5th 2014
cdmaxime
 
The Future of Hbase
Salesforce Engineering
 
Twitter with hadoop for oow
Gwen (Chen) Shapira
 
Application Architectures with Hadoop - Big Data TechCon SF 2014
hadooparchbook
 
Storage Engine Considerations for Your Apache Spark Applications with Mladen ...
Spark Summit
 
Keynote: The Future of Apache HBase
HBaseCon
 
Cloudera Enabling Native Integration of NoSQL HBase with Cloud Providers.pdf
wchevreuil
 
Hbasepreso 111116185419-phpapp02
Gokuldas Pillai
 
SQL Engines for Hadoop - The case for Impala
markgrover
 
HBase: Just the Basics
HBaseCon
 
HBaseCon 2014-Just the Basics
Jesse Anderson
 
Hbase: an introduction
Jean-Baptiste Poullet
 
HBase ArcheTypes
Matteo Bertozzi
 
Apache Hive: From MapReduce to Enterprise-grade Big Data Warehousing
c-bslim
 
Ad

More from Timothy Spann (20)

PDF
14May2025_TSPANN_FromAirQualityUnstructuredData.pdf
Timothy Spann
 
PDF
Streaming AI Pipelines with Apache NiFi and Snowflake NYC 2025
Timothy Spann
 
PDF
2025-03-03-Philly-AAAI-GoodData-Build Secure RAG Apps With Open LLM
Timothy Spann
 
PDF
Conf42_IoT_Dec2024_Building IoT Applications With Open Source
Timothy Spann
 
PDF
2024 Dec 05 - PyData Global - Tutorial Its In The Air Tonight
Timothy Spann
 
PDF
2024Nov20-BigDataEU-RealTimeAIWithOpenSource
Timothy Spann
 
PDF
TSPANN-2024-Nov-CloudX-Adding Generative AI to Real-Time Streaming Pipelines
Timothy Spann
 
PDF
2024-Nov-BuildStuff-Adding Generative AI to Real-Time Streaming Pipelines
Timothy Spann
 
PDF
14 November 2024 - Conf 42 - Prompt Engineering - Codeless Generative AI Pipe...
Timothy Spann
 
PDF
2024 Nov 05 - Linux Foundation TAC TALK With Milvus
Timothy Spann
 
PPTX
tspann06-NOV-2024_AI-Alliance_NYC_ intro to Data Prep Kit and Open Source RAG
Timothy Spann
 
PDF
tspann08-Nov-2024_PyDataNYC_Unstructured Data Processing with a Raspberry Pi ...
Timothy Spann
 
PDF
2024-10-28 All Things Open - Advanced Retrieval Augmented Generation (RAG) Te...
Timothy Spann
 
PDF
10-25-2024_BITS_NYC_Unstructured Data and LLM_ What, Why and How
Timothy Spann
 
PDF
2024-OCT-23 NYC Meetup - Unstructured Data Meetup - Unstructured Halloween
Timothy Spann
 
PDF
DBTA Round Table with Zilliz and Airbyte - Unstructured Data Engineering
Timothy Spann
 
PDF
17-October-2024 NYC AI Camp - Step-by-Step RAG 101
Timothy Spann
 
PDF
11-OCT-2024_AI_101_CryptoOracle_UnstructuredData
Timothy Spann
 
PDF
2024-10-04 - Grace Hopper Celebration Open Source Day - Stefan
Timothy Spann
 
PDF
01-Oct-2024_PES-VectorDatabasesAndAI.pdf
Timothy Spann
 
14May2025_TSPANN_FromAirQualityUnstructuredData.pdf
Timothy Spann
 
Streaming AI Pipelines with Apache NiFi and Snowflake NYC 2025
Timothy Spann
 
2025-03-03-Philly-AAAI-GoodData-Build Secure RAG Apps With Open LLM
Timothy Spann
 
Conf42_IoT_Dec2024_Building IoT Applications With Open Source
Timothy Spann
 
2024 Dec 05 - PyData Global - Tutorial Its In The Air Tonight
Timothy Spann
 
2024Nov20-BigDataEU-RealTimeAIWithOpenSource
Timothy Spann
 
TSPANN-2024-Nov-CloudX-Adding Generative AI to Real-Time Streaming Pipelines
Timothy Spann
 
2024-Nov-BuildStuff-Adding Generative AI to Real-Time Streaming Pipelines
Timothy Spann
 
14 November 2024 - Conf 42 - Prompt Engineering - Codeless Generative AI Pipe...
Timothy Spann
 
2024 Nov 05 - Linux Foundation TAC TALK With Milvus
Timothy Spann
 
tspann06-NOV-2024_AI-Alliance_NYC_ intro to Data Prep Kit and Open Source RAG
Timothy Spann
 
tspann08-Nov-2024_PyDataNYC_Unstructured Data Processing with a Raspberry Pi ...
Timothy Spann
 
2024-10-28 All Things Open - Advanced Retrieval Augmented Generation (RAG) Te...
Timothy Spann
 
10-25-2024_BITS_NYC_Unstructured Data and LLM_ What, Why and How
Timothy Spann
 
2024-OCT-23 NYC Meetup - Unstructured Data Meetup - Unstructured Halloween
Timothy Spann
 
DBTA Round Table with Zilliz and Airbyte - Unstructured Data Engineering
Timothy Spann
 
17-October-2024 NYC AI Camp - Step-by-Step RAG 101
Timothy Spann
 
11-OCT-2024_AI_101_CryptoOracle_UnstructuredData
Timothy Spann
 
2024-10-04 - Grace Hopper Celebration Open Source Day - Stefan
Timothy Spann
 
01-Oct-2024_PES-VectorDatabasesAndAI.pdf
Timothy Spann
 
Ad

Recently uploaded (20)

PPTX
Advanced_NLP_with_Transformers_PPT_final 50.pptx
Shiwani Gupta
 
PDF
apidays Helsinki & North 2025 - APIs in the healthcare sector: hospitals inte...
apidays
 
PPTX
apidays Helsinki & North 2025 - Vero APIs - Experiences of API development in...
apidays
 
PPTX
Usage of Power BI for Pharmaceutical Data analysis.pptx
Anisha Herala
 
PPT
Data base management system Transactions.ppt
gandhamcharan2006
 
PDF
R Cookbook - Processing and Manipulating Geological spatial data with R.pdf
OtnielSimopiaref2
 
PPTX
Module-5-Measures-of-Central-Tendency-Grouped-Data-1.pptx
lacsonjhoma0407
 
PDF
apidays Helsinki & North 2025 - Monetizing AI APIs: The New API Economy, Alla...
apidays
 
PPTX
Climate Action.pptx action plan for climate
justfortalabat
 
PDF
How to Connect Your On-Premises Site to AWS Using Site-to-Site VPN.pdf
Tamanna
 
PPTX
apidays Munich 2025 - Building an AWS Serverless Application with Terraform, ...
apidays
 
PDF
Web Scraping with Google Gemini 2.0 .pdf
Tamanna
 
PPTX
This PowerPoint presentation titled "Data Visualization: Turning Data into In...
HemaDivyaKantamaneni
 
PDF
Avatar for apidays apidays PRO June 07, 2025 0 5 apidays Helsinki & North 2...
apidays
 
PPTX
apidays Munich 2025 - Building Telco-Aware Apps with Open Gateway APIs, Subhr...
apidays
 
PPTX
recruitment Presentation.pptxhdhshhshshhehh
devraj40467
 
PDF
Performance Report Sample (Draft7).pdf
AmgadMaher5
 
PPTX
TSM_08_0811111111111111111111111111111111111111111111111
csomonasteriomoscow
 
PDF
apidays Helsinki & North 2025 - How (not) to run a Graphql Stewardship Group,...
apidays
 
PPTX
Human-Action-Recognition-Understanding-Behavior.pptx
nreddyjanga
 
Advanced_NLP_with_Transformers_PPT_final 50.pptx
Shiwani Gupta
 
apidays Helsinki & North 2025 - APIs in the healthcare sector: hospitals inte...
apidays
 
apidays Helsinki & North 2025 - Vero APIs - Experiences of API development in...
apidays
 
Usage of Power BI for Pharmaceutical Data analysis.pptx
Anisha Herala
 
Data base management system Transactions.ppt
gandhamcharan2006
 
R Cookbook - Processing and Manipulating Geological spatial data with R.pdf
OtnielSimopiaref2
 
Module-5-Measures-of-Central-Tendency-Grouped-Data-1.pptx
lacsonjhoma0407
 
apidays Helsinki & North 2025 - Monetizing AI APIs: The New API Economy, Alla...
apidays
 
Climate Action.pptx action plan for climate
justfortalabat
 
How to Connect Your On-Premises Site to AWS Using Site-to-Site VPN.pdf
Tamanna
 
apidays Munich 2025 - Building an AWS Serverless Application with Terraform, ...
apidays
 
Web Scraping with Google Gemini 2.0 .pdf
Tamanna
 
This PowerPoint presentation titled "Data Visualization: Turning Data into In...
HemaDivyaKantamaneni
 
Avatar for apidays apidays PRO June 07, 2025 0 5 apidays Helsinki & North 2...
apidays
 
apidays Munich 2025 - Building Telco-Aware Apps with Open Gateway APIs, Subhr...
apidays
 
recruitment Presentation.pptxhdhshhshshhehh
devraj40467
 
Performance Report Sample (Draft7).pdf
AmgadMaher5
 
TSM_08_0811111111111111111111111111111111111111111111111
csomonasteriomoscow
 
apidays Helsinki & North 2025 - How (not) to run a Graphql Stewardship Group,...
apidays
 
Human-Action-Recognition-Understanding-Behavior.pptx
nreddyjanga
 

Cloudera Operational DB (Apache HBase & Apache Phoenix)

  • 1. Cloudera Operational DB (powered by Apache HBase and Apache Phoenix) Beyond the Tyranny of the Schema December 2019 Timothy Spann
  • 2. © 2019 Cloudera, Inc. All rights reserved. 2 Welcome to Future of Data - Princeton @PaasDev https://www.meetup.com/futureofdata-princeton/ From Big Data to AI to Streaming to Containers to Cloud to Analytics to Cloud Storage to Fast Data to Machine Learning to Microservices to ...
  • 3. © 2019 Cloudera, Inc. All rights reserved. 3 Who Am I? Timothy Spann Data in Motion Field Engineer @PaasDev DZone Zone Leader and Big Data MVB; Princeton NJ Future of Data Meetup; ex-Pivotal Field Engineer; Author of Apache Kafka RefCard https://github.com/tspannhw https://www.datainmotion.dev/
  • 4. © 2019 Cloudera, Inc. All rights reserved. 4 This Meetup Made Possible Thanks To: Paul Vidal from https://www.meetup.com/futureofdata-philadelphia/ for CDP HBase Environment and Cloud Magic Josh Elser and Josiah Goodson for OpDB Slides and HBase Guidance Milind Pandit from https://www.meetup.com/TechnologySolutionsHub Mehul Shah from https://www.meetup.com/TechnologySolutionsHub Vijay Garg from https://pga.fund/ Madhavi from https://www.nuwaysolutions.com/ https://www.slideshare.net/bunkertor/tracking-crime-as-it-occurs-with-apache-phoenix-apache-hbase-and-apache-nifi
  • 5. © 2019 Cloudera, Inc. All rights reserved. 5 WHAT HAVE Apache HBase 2.0 & Apache Phoenix ENABLED? • Operationalizing ML / AI to revolutionize healthcare, public utilities, etc • Serving real-time content at webscale • Empowering big data analytics for operational and offline uses • Acting as a resilient store of record CLOUDERA OPERATIONAL DB (powered by Apache HBase & Apache Phoenix) Operational DB is DBMS used to manage dynamic and changing data in real time and enable applications that drive the business
  • 6. 6© Cloudera, Inc. All rights reserved. APACHE HBASE FAST FACTS Largest Database 14 Petabytes Best Known App Siri Fastest Ingestion 20M Events/s Users 750+
  • 7. 7© Cloudera, Inc. All rights reserved. HBASE ARCHITECTURE HMaster Orchestration layer ZooKeeper Region Server DataNode Data plane ColFam ColFam Col Col Col Col R Val Val Val Val R Val Val Val Val ColFam ColFam Col Col Col Col R Val Val Val Val R Val Val Val Val Region ColFam ColFam Col Col Col Col R Val Val Val Val R Val Val Val Val • Regions are table segments • Read and write path are in the data plane • DDL operations • Region assignment • Recovery orchestration • Heartbeat • Server state • Services client reads & writes • Maximizes in-memory operations for low-latency operations • Provides data resiliency
  • 8. 8© Cloudera, Inc. All rights reserved. SCHEMA-LESS DATA MODEL • Column families defined at time of table creation • Columns created as required (at time of data insertion) • No limits to number of columns • Tables can grow in two dimensions – columns and rows • Compression & encoding applied at column family level • No declaration of data types (i.e., a column can contain multiple data types) Column Family Column Family Column Column Column Column RowKey Cell Cell Cell Cell RowKey Cell Cell Cell Cell
  • 9. © 2019 Cloudera, Inc. All rights reserved. 9 HIGHLY AVAILABLE OUT OF THE BOX (<1 MINUTE RECOVERY) Region Server 1 HDFS (3 copies of data) Region Server 2 Region Server 3 What happens when a region crashes 1. Region server crashes 2. Writes and reads time-out for regions in impacted region server 3. Regions are redistributed to other region servers 4. WAL is replayed in other region servers 5. Reads & writes are able to continue to impacted regions Typical recovery period < 1 minute (for impacted regions only) No manual intervention
  • 10. 10© Cloudera, Inc. All rights reserved. SECURITY MODEL Authentication • Kerberos Role Based Access control • Permissions & Scope enable flexible role based access control • Scope: Global, Namespace, Table, Column Family, Cell DB security & encryption • Transparent encryption of data on the wire and data on disk (HFile for data at rest, secure WAL for data in motion within HBase) • Logging & auditability: configurable & fixed event
  • 11. 11 stmt.executeUpdate(“UPSERT INTO TABLE_NAME VALUES(rowKey, GREETINGS) "); stmt.execute(); Phoenix What Phoenix adds to HBase Pros: Cons: • Maximally flexible & customizable • SQL only for data remediation • Unfamiliar to SQL developers • Requires non-traditional data architecture • Programmatic ANSI SQL support • RDBMS-like data architecture • Auto-applies performance best practices • Can co-exist with HBase apps • Reduced flexibility vis-à-vis vanilla HBase • Phoenix specific data format means you can’t use HBase APIs directly Put put = new Put(Bytes.toBytes(rowKey)); put.addColumn(COLUMN_FAMILY_NAME, COLUMN_NAME, Bytes.toBytes(GREETINGS)); table.put(put); HBase RDBMS-like, scale-out databaseFlexible, scale-out, no-sql database
  • 12. 12 Key Phoenix capabilities • ANSI SQL including joins • Flexible Schemas / Dynamic Columns • Secondary Indexes • Aggregation pushdowns • Cross-language client support • Query logging • Security through Ranger (supports RBAC, ABAC, etc) • JDBC/ODBC connectivity for operational reporting • Plugs in to any JDBC/ODBC-compatible BI tool to enable self-service analytics and insight Phoenix Applications
  • 13. 13 ANSI SQL 92 Support Supported today Roadmap Standard SQL Data Types UNION SELECT, UPSERT, DELETE Windowing Functions JOINs: Inner and Outer Transactions Subqueries Cross Joins Secondary Indexes Authorization GROUP BY, ORDER BY, HAVING Replication Management AVG, COUNT, MIN, MAX, SUM Column Constraints and Defaults Primary Keys, Constraints UDFs CASE, COALESCE VIEWs Flexible Schema UNION ALL
  • 14. © 2019 Cloudera, Inc. All rights reserved. 14 CLOUD OPTIMIZED : HBASE backed by both HDFS and S3 Cloudera provides HBase backed by Amazon’s S3 ● Cloudera Data Platform (CDP) provides an out-of-the-box solution that allows Apache HBase deployments to use Amazon Simple Storage Service (S3) as its main persistence layer for saving table data ● Amazon’s Simple Storage Service (S3) is an eventually consistent object store, and HBase requires a consistent and atomic filesystem which means that it cannot directly use S3. Let's look at the topology.
  • 15. © 2019 Cloudera, Inc. All rights reserved. 15 CLOUD OPTIMIZED : Cloudera HBASE backed by both HDFS and S3 Cloudera with CDP has built a solution where when you launch an Operational Database (HBase) cluster on CDP, HBase StoreFiles (the backing files for HBase tables) are stored in S3 and HBase write-ahead-logs (WAL) are stored in an HDFS instance run alongside HBase per usual.
  • 16. CLOUD OPTIMIZED: HBase backed by both HDFS and S3
    ● Configuring HBase to use S3 for its StoreFiles has many benefits for users.
    ● One such benefit is that users can decouple storage from compute.
    ● When no access to HBase is needed, HBase can be cleanly shut down and all compute resources reclaimed, eliminating compute cost.
    ● When HBase access is needed again, the HBase cluster can be recreated, pointing to the same data in S3. Upon startup, HBase can re-initialize itself solely from the data in S3.
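The split described in these slides (StoreFiles in S3, write-ahead logs in HDFS) maps onto two standard HBase settings, `hbase.rootdir` and `hbase.wal.dir`. The sketch below only assembles those two properties as a map. The bucket name, the HDFS nameservice, the paths and the class `HBaseOnS3ConfigSketch` are made-up placeholders, and this is not taken from CDP's actual provisioning, which may set additional options.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Sketch (hypothetical class): the two storage locations behind an S3-backed HBase. */
final class HBaseOnS3ConfigSketch {

    /**
     * Returns the settings that separate table data from write-ahead logs.
     * Both arguments are placeholders chosen by the caller.
     */
    static Map<String, String> storageSettings(String s3Bucket, String hdfsNameservice) {
        Map<String, String> conf = new LinkedHashMap<>();
        // StoreFiles (table data) live under the root directory, here on S3 via the s3a connector.
        conf.put("hbase.rootdir", "s3a://" + s3Bucket + "/hbase");
        // Write-ahead logs stay on HDFS, which provides the consistency HBase needs for WALs.
        conf.put("hbase.wal.dir", "hdfs://" + hdfsNameservice + "/hbase-wal");
        return conf;
    }

    public static void main(String[] args) {
        storageSettings("example-bucket", "example-ns")
            .forEach((key, value) -> System.out.println(key + " = " + value));
    }
}
```

Because only the WAL location is tied to the cluster's HDFS, a cluster that has been shut down cleanly can be recreated against the same `hbase.rootdir`, which is the decoupling of storage and compute the slide describes.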
  • 17. WHERE IS APACHE HBASE TODAY
    • Large ecosystem (NiFi, Spark, Hive, Impala, Solr, Ranger, Atlas, etc.)
    • Supports NoSQL, SQL, geospatial, graph, time-series, key-value and other modes (in conjunction with other open source projects built on top of HBase)
  • 18. Cloudera and Apache HBase
    ● The upstream community is large and very active, with contributions from developers at Cloudera, Microsoft, Amazon, Alibaba, Apple, Salesforce, Xiaomi and others.
    ● Cloudera is a very active contributor to upstream HBase and Apache Phoenix.
      ○ Currently > 8 PMC members and > 2 committers.
    ● CDP is based on the latest HBase 2 and Phoenix 5.
  • 19. New features in HBase 2+
    ● Operational simplicity
      ○ Assignment Manager V2 (using Procedure Framework 2)
      ○ Offline compaction tool (runs outside the RegionServers to avoid I/O thrashing)
      ○ Replication: namespace-level, serial, and for bulk loads
    ● Performance
      ○ Off-heap cache improvements (uses DirectByteBuffers to manage buckets outside the JVM heap, removing GC impact and improving read performance)
    ● Space Quotas (to support multi-tenancy)
    ● S3 support
    ● Spark 2 integration
    ● Async Client
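The off-heap cache item above rests on one JDK feature: direct byte buffers, whose contents live outside the garbage-collected heap. The toy below shows that idea with fixed-size slots carved out of a single direct buffer. It is not HBase's BucketCache: the class `OffHeapBucketSketch`, the slot layout and the absence of eviction are simplifications invented for this illustration.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

/** Toy off-heap cache (hypothetical): fixed-size slots inside one direct buffer, no eviction. */
final class OffHeapBucketSketch {
    private final ByteBuffer arena;   // allocated outside the JVM heap
    private final int slotSize;
    private final int slotCount;
    private final Map<String, int[]> index = new HashMap<>(); // key -> {slot, length}
    private int nextSlot = 0;

    OffHeapBucketSketch(int slotCount, int slotSize) {
        this.slotCount = slotCount;
        this.slotSize = slotSize;
        this.arena = ByteBuffer.allocateDirect(slotCount * slotSize);
    }

    boolean isOffHeap() {
        return arena.isDirect();
    }

    /** Copies the block off-heap; returns false if it is too large, a duplicate, or the cache is full. */
    boolean put(String key, byte[] block) {
        if (block.length > slotSize || nextSlot >= slotCount || index.containsKey(key)) {
            return false;
        }
        ByteBuffer view = arena.duplicate(); // independent position, same memory
        view.position(nextSlot * slotSize);
        view.put(block);
        index.put(key, new int[] {nextSlot, block.length});
        nextSlot++;
        return true;
    }

    /** Copies the cached bytes back on-heap, or returns null on a miss. */
    byte[] get(String key) {
        int[] location = index.get(key);
        if (location == null) {
            return null;
        }
        byte[] out = new byte[location[1]];
        ByteBuffer view = arena.duplicate();
        view.position(location[0] * slotSize);
        view.get(out);
        return out;
    }

    public static void main(String[] args) {
        OffHeapBucketSketch cache = new OffHeapBucketSketch(4, 64);
        cache.put("row-1", "GREETINGS".getBytes(StandardCharsets.UTF_8));
        System.out.println(new String(cache.get("row-1"), StandardCharsets.UTF_8));
    }
}
```

Only the small index map sits on the heap; the cached bytes are never traversed by the garbage collector, which is the effect the slide attributes to the off-heap cache.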
  • 20. Storage for business apps that require big-data
    • Provides familiar & easy interface for developers
    • Advanced multi-tenancy capabilities
    • Support near 100% availability for mission critical applications & many traditional transactional apps
    • Scale to billions of rows and millions of columns
    • Easily combine data sources that use a wide variety of different structures and schemas
    Diagram labels: Ingest; Store; Primary Use; Query & Remediate; NO(T ONLY)SQL; PHOENIX
  • 23. None of this command line mess:
  • 24. HUE FOR SQL & DATA BROWSING FOR REMEDIATION
    • SQL interface using Impala or Hive for query processing: supports SQL-based insert, update and delete queries for data in HBase
    • GUI-based data browser: supports search, insert, update, delete and DDL for HBase
  • 25. Visualization
  • 26. HBase CRUD
  • 27. ENABLING DATA-DRIVEN APPS
    Fast: • Real-time model serving w/ <5ms latency • Limitless concurrency (>100M updates/sec)
    Easy: • Stream and bulk ingest & processing • Process automation • Consolidate multiple databases • Schema flexibility • SQL & NoSQL interface
    Scalable: • Multi-petabyte scale • Unlimited tenants
    Highly available: • Automatic recovery from server failure • Advanced replication & synchronization topologies • Multiple backup methodologies
    Multi-tenant capable: • Resource isolation • Throttling & quotas
    Secure: • Role-based access control • Fine-grained authorizations (e.g., tenant, table, column family, cell)
  • 28. NiFi Integration with HBase
    • PutHBaseRecord • PutHBaseJSON • PutHBaseCell • FetchHBaseRow • GetHBase • ScanHBase • DeleteHBaseRow • DeleteHBaseCells
    https://www.datainmotion.dev/2019/11/exploring-apache-nifi-110-parameters.html
    ELT/ETL Lookup Services
    • HBase_1_1_2_ListLookupService • HBase_2_RecordLookupService • HBase_2_ClientService • HBase_2_ClientMapCacheService
    https://community.cloudera.com/t5/Community-Articles/Reading-OpenData-JSON-and-Storing-into-Phoenix-Tables/ta-p/247323
  • 46. https://github.com/tspannhw/HBase2
  • 48. SPRING BOOT APPLICATION TO PHOENIX
    https://community.cloudera.com/t5/Community-Articles/Creating-a-Spring-Boot-Java-8-Microservice-To-Read-Apache/ta-p/247379
    https://github.com/tspannhw/phillycrime-springboot-phoenix
  • 50. THANK YOU