HBASE – THE SCALABLE
DATA STORE
An Introduction to HBase
JAX UK, October 2012

Lars George
Director EMEA Services
About Me

•  Director EMEA Services @ Cloudera
    •  Consulting on Hadoop projects (everywhere)
•  Apache Committer
    •  HBase and Whirr
•  O’Reilly Author
    •  HBase – The Definitive Guide
      •  Now in Japanese!

•  Contact
    •  lars@cloudera.com   (The Japanese edition is also out!)
    •  @larsgeorge
Agenda

•  Introduction to HBase
•  HBase Architecture
•  MapReduce with HBase
•  Advanced Techniques
•  Current Project Status
INTRODUCTION TO HBASE
Why Hadoop/HBase?

•  Datasets are constantly growing and intake rates are soaring
    •  Yahoo! has 140PB+ and 42k+ machines
    •  Facebook adds 500TB+ per day, 100PB+ raw data, on
       tens of thousands of machines
    •  Are you “throwing” data away today?
•  Traditional databases are expensive to scale and
   inherently difficult to distribute
•  Commodity hardware is cheap and powerful
   •  $1000 buys you 4-8 cores/4GB/1TB
   •  600GB 15k RPM SAS nearly $500
•  Need for random access and batch processing
    •  Hadoop only supports batch/streaming
History of Hadoop/HBase

•  Google solved its scalability problems
    •  “The Google File System” published October 2003
      •  Hadoop DFS
   •  “MapReduce: Simplified Data Processing on Large
     Clusters” published December 2004
      •  Hadoop MapReduce
   •  “BigTable: A Distributed Storage System for
     Structured Data” published November 2006
      •  HBase
Hadoop Introduction

•  Two main components
    •  Hadoop Distributed File System (HDFS)
       •  A scalable, fault-tolerant, high performance distributed file
         system capable of running on commodity hardware
   •  Hadoop MapReduce
       •  Software framework for distributed computation

•  Significant adoption
    •  Used in production in hundreds of organizations
    •  Primary contributors: Yahoo!, Facebook, Cloudera
HDFS: Hadoop Distributed File System

•  Reliably store petabytes of replicated data across
 thousands of nodes
   •  Data divided into 64MB blocks, each block replicated
     three times
•  Master/Slave architecture
    •  Master NameNode contains block locations
    •  Slave DataNode manages block on local file system
•  Built on commodity hardware
    •  No 15k RPM disks or RAID required (nor wanted!)
MapReduce

•  Distributed programming model to reliably
 process petabytes of data by exploiting data locality
   •  Built-in bindings for Java and C++ (via Hadoop Pipes)
   •  Can be used with any language via Hadoop
     Streaming
•  Inspired by map and reduce functions in
 functional programming

 Input → Map() → Copy/Sort → Reduce() → Output
Hadoop…

•  … is designed to store and stream extremely large
   datasets in batch
•  … is not intended for realtime querying
•  … does not support random access
•  … does not handle billions of small files well
   •  Files smaller than the default block size of 64MB
   •  Keeps “inodes” in memory on master
•  … does not favor structured data over
 unstructured or complex data

              That is why we have HBase!
Why HBase and not …?

•  Question: Why HBase and not <put-your-favorite-
   nosql-solution-here>?
•  What else is there?
   •    Key/value stores
   •    Document-oriented stores
   •    Column-oriented stores
   •    Graph-oriented stores
•  Features to ask for
    •  In memory or persistent?
    •  Strict or eventual consistency?
    •  Distributed or single machine (or afterthought)?
    •  Designed for read and/or write speeds?
    •  How does it scale? (if that is what you need)
What is HBase?

•  Distributed
•  Column-Oriented
•  Multi-Dimensional
•  High-Availability (CAP anyone?)
•  High-Performance
•  Storage System

                       Project Goals
   Billions of Rows * Millions of Columns * Thousands of
                            Versions
    Petabytes across thousands of commodity servers
HBase is not…

•  An SQL Database
    •  No joins, no query engine, no types, no SQL
    •  Transactions and secondary indexes exist only as
       immature add-ons
•  A drop-in replacement for your RDBMS
•  You must be OK with RDBMS anti-schema
    •  Denormalized data
    •  Wide and sparsely populated tables
    •  Just say “no” to your inner DBA


               Keyword: Impedance Match
HBase Tables

•  Tables are sorted by the Row Key in
   lexicographical order
•  Table schema only defines its Column Families
  •  Each family consists of any number of Columns
  •  Each column consists of any number of Versions
  •  Columns only exist when inserted, NULLs are free
  •  Columns within a family are sorted and stored
     together
  •  Everything except table names is byte[]


(Table, Row, Family:Column, Timestamp) → Value
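
As a quick sketch of this coordinate model (the table handle, family,
and column names are illustrative, and "ts" is a placeholder
timestamp), a client addresses one cell version like so:

   HTable table = new HTable(HBaseConfiguration.create(), "test");
   Get get = new Get(Bytes.toBytes("row1"));
   get.addColumn(Bytes.toBytes("family"), Bytes.toBytes("column"));
   get.setTimeStamp(ts);    // pick one version; omit for the latest
   Result result = table.get(get);
   byte[] value = result.getValue(Bytes.toBytes("family"),
       Bytes.toBytes("column"));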
Column Family vs. Column

•  Use only a few column families
    •  Each family adds store files that must stay open per
       region, plus class overhead per family
•  Best used when logical separation between data
   and meta columns
•  Sorting per family can be used to convey
   application logic or access pattern
HBase Architecture

•  Table is made up of any number of regions
•  Region is specified by its startKey and endKey
    •  Empty table: (Table, NULL, NULL)
    •  Two-region table: (Table, NULL, “com.cloudera.www”)
       and (Table, “com.cloudera.www”, NULL)
•  Each region may live on a different node and is
 made up of several HDFS files and blocks, each
 of which is replicated by Hadoop
HBase Architecture (cont.)

•  Two types of HBase nodes:
        Master and RegionServer
•  Special tables -ROOT- and .META. store schema
   information and region locations
•  Master server responsible for RegionServer
   monitoring as well as assignment and load
   balancing of regions
•  Uses ZooKeeper as its distributed coordination
   service
  •  Manages Master election and server availability
Web Crawl Example

•  Canonical use-case for BigTable
•  Store web crawl data
    •  Table webtable with family content and meta
    •  Row key is the reversed URL, with Columns
      •  content:data stores the raw crawled data
      •  meta:language stores the HTTP language header
      •  meta:type stores the HTTP content-type header
   •  While processing raw data for hyperlinks and images,
     add families links and images
      •  links:<rurl> column for each hyperlink
      •  images:<rurl> column for each image
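
As an illustrative sketch (the webtable handle and the page bytes
rawPage are placeholders), storing one crawled page could look like:

   byte[] row = Bytes.toBytes("com.cloudera.www");   // reversed URL
   Put put = new Put(row);
   put.add(Bytes.toBytes("content"), Bytes.toBytes("data"), rawPage);
   put.add(Bytes.toBytes("meta"), Bytes.toBytes("language"),
       Bytes.toBytes("en"));
   put.add(Bytes.toBytes("meta"), Bytes.toBytes("type"),
       Bytes.toBytes("text/html"));
   webtable.put(put);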
HBase Clients

•  Native Java Client/API
•  Non-Java Clients
    •  REST server
    •  Avro server
    •  Thrift server
    •  Jython, Scala, Groovy DSL
•  TableInputFormat/TableOutputFormat for
 MapReduce
   •  HBase as MapReduce source and/or target
•  HBase Shell
    •  JRuby shell adding get, put, scan and admin calls
Java API

•  CRUD
    •  get: retrieve an entire or partial row (R)
    •  put: create and update a row (CU)
    •  delete: delete a cell, column, columns, or row (D)


      Result get(Get get) throws IOException;

      void put(Put put) throws IOException;

      void delete(Delete delete) throws IOException;
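
A minimal round-trip through these calls might look as follows; the
table name "test" and family "cf" are assumptions for the sketch
(classes come from org.apache.hadoop.hbase.client and
org.apache.hadoop.hbase.util.Bytes):

   Configuration conf = HBaseConfiguration.create();
   HTable table = new HTable(conf, "test");

   Put put = new Put(Bytes.toBytes("row1"));                    // C/U
   put.add(Bytes.toBytes("cf"), Bytes.toBytes("col1"),
       Bytes.toBytes("value1"));
   table.put(put);

   Result result = table.get(new Get(Bytes.toBytes("row1")));   // R
   byte[] value = result.getValue(Bytes.toBytes("cf"),
       Bytes.toBytes("col1"));

   table.delete(new Delete(Bytes.toBytes("row1")));             // D
   table.close();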
Java API (cont.)

•  CRUD+SI
    •  scan:      Scan any number of rows (S)
    •  increment: Increment a column value (I)




ResultScanner getScanner(Scan scan) throws IOException;

Result increment(Increment increment) throws IOException;
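
A sketch of both calls, reusing the table handle and family from the
CRUD example (row keys and the counter column are illustrative):

   Scan scan = new Scan(Bytes.toBytes("row100"), Bytes.toBytes("row200"));
   scan.addFamily(Bytes.toBytes("cf"));
   ResultScanner scanner = table.getScanner(scan);
   try {
     for (Result res : scanner) {
       // rows arrive sorted by row key
     }
   } finally {
     scanner.close();
   }

   Increment increment = new Increment(Bytes.toBytes("row1"));
   increment.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("hits"), 1L);
   table.increment(increment);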
Java API (cont.)

•  CRUD+SI+CAS
    •  Atomic compare-and-swap (CAS)


•  Combined get, check, and put operation
•  Helps to overcome lack of full transactions
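
On HTable this surfaces as checkAndPut() and checkAndDelete(); a
sketch, with cell names assumed for illustration:

   byte[] row = Bytes.toBytes("row1");
   byte[] cf  = Bytes.toBytes("cf");
   byte[] col = Bytes.toBytes("status");

   Put put = new Put(row);
   put.add(cf, col, Bytes.toBytes("processed"));

   // the Put is applied only if the cell still holds "pending"
   boolean applied = table.checkAndPut(row, cf, col,
       Bytes.toBytes("pending"), put);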
Batch Operations

•  Support Get, Put, and Delete
•  Reduce network round-trips
•  If possible, batch operations to the server to gain
 better overall throughput

    void batch(List<Row> actions, Object[] results)
      throws IOException, InterruptedException;

    Object[] batch(List<Row> actions)
      throws IOException, InterruptedException;
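
A sketch of a mixed batch (row keys illustrative); each slot of the
results array receives the Result or exception for its action:

   List<Row> actions = new ArrayList<Row>();
   Put put = new Put(Bytes.toBytes("row2"));
   put.add(Bytes.toBytes("cf"), Bytes.toBytes("col1"),
       Bytes.toBytes("v"));
   actions.add(put);
   actions.add(new Get(Bytes.toBytes("row1")));
   actions.add(new Delete(Bytes.toBytes("row3")));

   Object[] results = new Object[actions.size()];
   table.batch(actions, results);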
Filters

•  Can be used with Get and Scan operations
•  Server side hinting
•  Reduce data transferred to client
•  Filters are no guarantee of fast scans
    •  Still full table scan in worst-case scenario
    •  Might have to implement your own
•  Filters can hint next row key
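
For example, a PrefixFilter lets the servers skip all rows whose key
does not start with a given prefix (the prefix here is illustrative):

   Scan scan = new Scan();
   scan.setFilter(new PrefixFilter(Bytes.toBytes("com.cloudera.")));
   ResultScanner scanner = table.getScanner(scan);
   for (Result res : scanner) {
     // only rows with keys starting in "com.cloudera." come back
   }
   scanner.close();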
HBase Extensions

•  Hive, Pig, Cascading
    •  Hadoop-targeted MapReduce tools with HBase
       integration
•  Sqoop
    •  Read and write to HBase for further processing in
       Hadoop
•  HBase Explorer, Nutch, Heritrix
•  SpringData
•  Toad
History of HBase
•  November 2006
     •  Google releases paper on BigTable
•  February 2007
     •  Initial HBase prototype created as Hadoop contrib
•  October 2007
     •  First “useable” HBase (Hadoop 0.15.0)
•  January 2008
     •  Hadoop becomes TLP, HBase becomes subproject
•  October 2008
     •  HBase 0.18.1 released
•  January 2009
     •  HBase 0.19.0
•  September 2009
     •  HBase 0.20.0 released (Performance Release)
•  May 2010
     •  HBase becomes TLP
•  June 2010
     •  HBase 0.89.20100621, first developer release
•  May 2011
     •  HBase 0.90.3 release
HBase Users

•  Adobe
•  eBay
•  Facebook
•  Mozilla (Socorro)
•  Trend Micro (Advanced Threat Research)
•  Twitter
•  Yahoo!
•  …
HBASE ARCHITECTURE
HBase Architecture

•  Based on Log-Structured Merge-Trees (LSM-Trees)
•  Inserts are done in write-ahead log first
•  Data is stored in memory and flushed to disk on
   regular intervals or based on size
•  Small flushes are merged in the background to keep
   number of files small
•  Reads check the in-memory stores first and the disk-based
   files second
•  Deletes are handled with “tombstone” markers
•  Atomicity on row level no matter how many columns
   •  Keeps the locking model simple
Write Ahead Log
MAPREDUCE WITH HBASE
MapReduce with HBase

•  Framework to use HBase as source and/or sink for
   MapReduce jobs
•  Thin layer over native Java API
•  Provides a helper class to make job setup easier

   TableMapReduceUtil.initTableMapperJob(
      "test", scan, MyMapper.class,
      ImmutableBytesWritable.class,
      Result.class, job);


   TableMapReduceUtil.initTableReducerJob(
      "table", MyReducer.class, job);
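
A matching mapper is a TableMapper; the sketch below simply forwards
each row, leaving any transformation to the reducer (the class body
is illustrative, not from the slides):

   static class MyMapper
       extends TableMapper<ImmutableBytesWritable, Result> {
     @Override
     protected void map(ImmutableBytesWritable rowKey, Result columns,
         Context context) throws IOException, InterruptedException {
       // pass the row through unchanged
       context.write(rowKey, columns);
     }
   }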
MapReduce with HBase (cont.)

•  Special use case with regard to Hadoop
•  Tables are sorted and have unique keys
    •  Often we do not need a Reducer phase
    •  Combiner not needed
•  Need to make sure load is distributed properly by
   randomizing keys (or use bulk import)
•  Partial or full table scans possible
•  Scans are very efficient as they make use of block
   caches
   •  But make sure you do not create too much cache churn;
     better yet, switch caching off for full table scans
•  Can use filters to limit rows being processed
TableInputFormat

•  Transforms an HBase table into a source for
   MapReduce jobs
•  Internally uses a TableRecordReader which
   wraps a Scan instance
   •  Supports restarts to handle temporary issues
•  Splits table by region boundaries and stores
 current region locality
TableOutputFormat

•  Allows using an HBase table as the output target
•  Put and Delete support from mapper or reducer
   class
•  Uses TableOutputCommitter to write data
•  Disables auto-commit on table to make use of
   client side write buffer
•  Handles final flush in close()
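
A reducer writing into the output table extends TableReducer; the
sketch below copies one column of the grouped rows (the types, family,
and column names are assumptions):

   static class MyReducer extends TableReducer<ImmutableBytesWritable,
       Result, ImmutableBytesWritable> {
     @Override
     protected void reduce(ImmutableBytesWritable rowKey,
         Iterable<Result> rows, Context context)
         throws IOException, InterruptedException {
       Put put = new Put(rowKey.get());
       for (Result row : rows) {
         byte[] v = row.getValue(Bytes.toBytes("cf"),
             Bytes.toBytes("col1"));
         if (v != null) {
           put.add(Bytes.toBytes("cf"), Bytes.toBytes("col1"), v);
         }
       }
       context.write(rowKey, put);
     }
   }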
HFileOutputFormat

•  Used to bulk load data into HBase
•  Bypasses normal API and generates low-level
   store files
•  Prepares files for final bulk insert
•  Needs special handling of sort order and
   partitioning
•  Only supports one column family (for now)
•  Can load bulk updates into existing tables
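
Most of the wiring is done by one helper call; a sketch with an
assumed table name:

   Job job = new Job(conf, "prepare-bulk-load");
   HTable table = new HTable(conf, "test");
   // sets a total-order partitioner and reducer so the produced
   // HFiles line up with the table's current region boundaries
   HFileOutputFormat.configureIncrementalLoad(job, table);
   // once the job completes, move the HFiles into the table with
   // the LoadIncrementalHFiles ("completebulkload") tool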
MapReduce Helper

•  TableMapReduceUtil
•  IdentityTableMapper
     •  Passes on key and value, where value is a Result
        instance and key is set to value.getRow()
•  IdentityTableReducer
     •  Stores values into HBase, must be Put or Delete
        instances
•  HRegionPartitioner
    •  Not set by default; use it to control partitioning at
       the Hadoop level
Custom MapReduce over Tables

•  No requirement to use provided framework
•  Can read from or write to one or many tables in
   mapper and reducer
•  Can split on arbitrary boundaries, not just regions
•  Make sure to use write buffer in OutputFormat to
   get best performance (do not forget to call
   flushCommits() at the end!)
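
With the HTable-based client this amounts to the following sketch
(the buffer size is an example value):

   table.setAutoFlush(false);       // turn on the client write buffer
   table.setWriteBufferSize(12 * 1024 * 1024);   // e.g. 12MB
   // ... many table.put(...) calls in map()/reduce() ...
   table.flushCommits();   // in cleanup()/close(), send what is left
   table.close();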
ADVANCED TECHNIQUES
Advanced Techniques

•  Key/Table Design
•  DDI
•  Salting
•  Hashing vs. Sequential Keys
•  ColumnFamily vs. Column
•  Using BloomFilter
•  Data Locality
•  checkAndPut() and checkAndDelete()
•  Coprocessors
Coprocessors

•  New addition to feature set
•  Based on talk by Jeff Dean at LADIS 2009
    •  Run arbitrary code on each region in RegionServer
    •  High level call interface for clients
       •  Calls are addressed to rows or ranges of rows while
          Coprocessors client library resolves locations
       •  Calls spanning multiple rows are automatically split
   •  Provides model for distributed services
       •  Automatic scaling, load balancing, request routing
Coprocessors in HBase

•  Use for efficient computational parallelism
•  Secondary indexing (HBASE-2038)
•  Column Aggregates (HBASE-1512)
    •  SQL-like sum(), avg(), max(), min(), etc.
•  Access control (HBASE-3025, HBASE-3045)
    •  Provide basic access control
•  Table Metacolumns
•  New filtering
    •  predicate pushdown
•  Table/Region access statistics
•  HLog extensions (HBASE-3257)
Coprocessor and RegionObserver

•  The Coprocessor interface defines these hooks
    •  preOpen, postOpen: Called before and after the
       region is reported as online to the master
    •  preFlush, postFlush: Called before and after the
       memstore is flushed into a new store file
    •  preCompact, postCompact: Called before and after
       compaction
    •  preSplit, postSplit: Called before and after the region is split
    •  preClose, postClose: Called before and after the
       region is reported as closed to the master
Coprocessor and RegionObserver

•  The RegionObserver interface defines these hooks:
    •  preGet, postGet: Called before and after a client makes a Get
       request
    •  preExists, postExists: Called before and after the client tests for
       existence using a Get
    •  prePut, postPut: Called before and after the client stores a value
    •  preDelete, postDelete: Called before and after the client deletes a
       value
    •  preScannerOpen, postScannerOpen: Called before and after the
       client opens a new scanner
    •  preScannerNext, postScannerNext: Called before and after the
       client asks for the next row on a scanner
    •  preScannerClose, postScannerClose: Called before and after the
       client closes a scanner
    •  preCheckAndPut, postCheckAndPut: Called before and after the
       client calls checkAndPut()
    •  preCheckAndDelete, postCheckAndDelete: Called before and after
       the client calls checkAndDelete()
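
A minimal observer extends BaseRegionObserver and overrides only the
hooks it needs; the class below is an illustrative sketch (using the
0.92-era prePut() signature), not from the slides:

   public class AuditObserver extends BaseRegionObserver {
     @Override
     public void prePut(ObserverContext<RegionCoprocessorEnvironment> ctx,
         Put put, WALEdit edit, boolean writeToWAL) throws IOException {
       // inspect, augment, or reject the Put before it is applied
     }
   }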
PROJECT STATUS
Current Project Status

•  HBase 0.90.x “Advanced Concepts”
    •  Master Rewrite – More ZooKeeper
    •  Intra Row Scanning
    •  Further optimizations on algorithms and data
       structures
           CDH3
•  HBase 0.92.x “Coprocessors”
    •  Multi-DC Replication
    •  Discretionary Access Control
    •  Coprocessors
           CDH4
Current Project Status (cont.)

•  HBase 0.94.x “Performance Release”
    •  Read CRC Improvements
    •  Seek Optimizations
    •  WAL Compression
    •  Prefix Compression (aka Block Encoding)
    •  Atomic Append
    •  Atomic put+delete
    •  Multi Increment and Multi Append
    •  Per-region (i.e. local) Multi-Row Transactions
    •  WALPlayer

         CDH4.x    (soon)
Current Project Status (cont.)

•  HBase 0.96.x “The Singularity”
    •  Protobuf RPC
      •  Rolling Upgrades
      •  Multiversion Access
    •  Metrics V2
    •  Preview Technologies
      •  Snapshots
      •  PrefixTrie Block Encoding



        CDH5 ?
Questions?