HBase Coprocessors
  Deploy shared functionality
    directly on the cluster

        O’Reilly Webcast
       November 4th, 2011
About Me
• Solutions Architect @ Cloudera
• Apache HBase & Whirr Committer
• Author of
      HBase – The Definitive Guide
• Working with HBase since end
  of 2007
• Organizer of the Munich OpenHUG
• Speaker at Conferences (Fosdem, Hadoop World)
Overview
• Coprocessors were added to Bigtable
  – Mentioned during LADIS 2009 talk
• Run user code within each region of a table
  – Code splits and moves with the region
• Defines high level call interface for clients
• Calls addressed to rows or ranges of rows
• Implicit automatic scaling, load balancing, and
  request routing
Example Use Cases
• Bigtable uses Coprocessors
  –   Scalable metadata management
  –   Distributed language model for machine translation
  –   Distributed query processing for full-text index
  –   Regular expression search in code repository
• MapReduce jobs over HBase are often map-only
  jobs
  – Row keys are already sorted and distinct
  ➜ Could be replaced by Coprocessors
HBase Coprocessors
• Inspired by Google’s Coprocessors
   – Not much information available, but general idea is
     understood
• Define various types of server-side code extensions
   –   Associated with table using a table property
   –   Attribute is a path to JAR file
   –   JAR is loaded when region is opened
   –   Blends new functionality with existing
• Can be chained with Priorities and Load Order

➜ Allows for dynamic RPC extensions
Coprocessor Classes and Interfaces
• The Coprocessor Interface
  – All user code must implement this interface
• The CoprocessorEnvironment Interface
  – Retains state across invocations
  – Predefined classes
• The CoprocessorHost Class
  – Ties state and user code together
  – Predefined classes
Coprocessor Priority
• System or User

/** Highest installation priority */
static final int PRIORITY_HIGHEST = 0;
/** High (system) installation priority */
static final int PRIORITY_SYSTEM = Integer.MAX_VALUE / 4;
/** Default installation prio for user coprocessors */
static final int PRIORITY_USER = Integer.MAX_VALUE / 2;
/** Lowest installation priority */
static final int PRIORITY_LOWEST = Integer.MAX_VALUE;
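The following is a minimal sketch in plain Java, without HBase dependencies, of how priority and load order could combine into an execution order. Only the four constants come from the slide; `LoadedCoprocessor`, `PriorityOrderSketch`, and `executionOrder` are hypothetical names, and the rule "smaller value runs first, load order breaks ties" is an assumption of the sketch.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical stand-in for a loaded coprocessor; not an HBase class.
class LoadedCoprocessor {
    final String name;
    final int priority;
    final int seq; // load order, assigned when the coprocessor is registered

    LoadedCoprocessor(String name, int priority, int seq) {
        this.name = name;
        this.priority = priority;
        this.seq = seq;
    }
}

class PriorityOrderSketch {
    // Values copied from the constants shown on the slide.
    static final int PRIORITY_HIGHEST = 0;
    static final int PRIORITY_SYSTEM = Integer.MAX_VALUE / 4;
    static final int PRIORITY_USER = Integer.MAX_VALUE / 2;
    static final int PRIORITY_LOWEST = Integer.MAX_VALUE;

    // Assumption: a smaller priority value runs earlier, load order breaks ties.
    static List<String> executionOrder(List<LoadedCoprocessor> loaded) {
        List<LoadedCoprocessor> sorted = new ArrayList<>(loaded);
        sorted.sort(Comparator
            .comparingInt((LoadedCoprocessor c) -> c.priority)
            .thenComparingInt(c -> c.seq));
        List<String> names = new ArrayList<>();
        for (LoadedCoprocessor c : sorted) {
            names.add(c.name);
        }
        return names;
    }

    public static void main(String[] args) {
        List<LoadedCoprocessor> loaded = new ArrayList<>();
        loaded.add(new LoadedCoprocessor("UserIndexer", PRIORITY_USER, 0));
        loaded.add(new LoadedCoprocessor("AccessControl", PRIORITY_SYSTEM, 1));
        loaded.add(new LoadedCoprocessor("UserAudit", PRIORITY_USER, 2));
        // The system-priority coprocessor moves ahead of both user ones.
        System.out.println(executionOrder(loaded));
    }
}
```

Under these assumptions a system-priority coprocessor always runs before user-priority ones, regardless of the order in which they were loaded.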
Coprocessor Environment
• Available Methods
Coprocessor Host
• Maintains all Coprocessor instances and their
  environments (state)
• Concrete Classes
  – MasterCoprocessorHost
  – RegionCoprocessorHost
  – WALCoprocessorHost
• Subclasses provide access to specialized
  Environment implementations
Control Flow
Coprocessor Interface
• Base for all other types of Coprocessors
• start() and stop() methods for lifecycle
  management
• State as defined in the interface:
Observer Classes
• Comparable to database triggers
  – Callback functions/hooks for every explicit API
    method, but also all important internal calls
• Concrete Implementations
  – MasterObserver
     • Hooks into HMaster API
  – RegionObserver
     • Hooks into Region related operations
  – WALObserver
     • Hooks into write-ahead log operations
Region Observers
• Can mediate (veto) actions
  – Used by the security policy extensions
  – Priority allows mediators to run first
• Hooks into all CRUD+S API calls and more
  – get(), put(), delete(), scan(), increment(),…
  – checkAndPut(), checkAndDelete(),…
  – flush(), compact(), split(),…
• Pre/Post Hooks for every call
• Can be used to build secondary indexes, filters
Endpoint Classes
• Define a dynamic RPC protocol, used between
  client and region server
• Executes arbitrary code, loaded in region server
  – Future development will add code weaving/inspection
    to deny any malicious code
• Steps to add your own methods
  – Define and implement your own protocol
  – Implement endpoint coprocessor
  – Call HTable’s coprocessorExec() or coprocessorProxy()
Coprocessor Loading
• There are two ways: dynamic or static
   – Static: use configuration files and table schema
   – Dynamic: not available (yet)
• For static loading from configuration:
   – Order is important (defines the execution order)
   – Special property key for each host type
   – Region related classes are loaded for all regions and
     tables
   – Priority is always System
   – JAR must be on class path
Loading from Configuration
• Example:
  <property>
   <name>hbase.coprocessor.region.classes</name>
   <value>coprocessor.RegionObserverExample, 
    coprocessor.AnotherCoprocessor</value>
  </property>
  <property>
   <name>hbase.coprocessor.master.classes</name>
   <value>coprocessor.MasterObserverExample</value>
  </property>
  <property>
   <name>hbase.coprocessor.wal.classes</name>
   <value>coprocessor.WALObserverExample, 
    bar.foo.MyWALObserver</value>
  </property>
Coprocessor Loading (cont.)
• For static loading from table schema:
  – Definition per table
  – For all regions of the table
  – Only region related classes, not WAL or Master
  – Added to HTableDescriptor, when table is created
    or altered
  – Allows setting the priority and JAR path
  COPROCESSOR$<num>
   <path-to-jar>|<classname>|<priority>
Loading from Table Schema
• Example:
'COPROCESSOR$1' => 
 'hdfs://localhost:8020/users/leon/test.jar| 
  coprocessor.Test|10'

'COPROCESSOR$2' => 
 '/Users/laura/test2.jar| 
  coprocessor.AnotherTest|1000'
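To make the attribute format concrete, here is a small plain-Java sketch that splits a `<path-to-jar>|<classname>|<priority>` value into its fields. `CoprocessorSpec` and `parse` are hypothetical names for illustration, not HBase classes, and the sketch handles only the three-field form shown on the slides.

```java
// Hypothetical value holder for one COPROCESSOR$<num> attribute; not an HBase class.
class CoprocessorSpec {
    final String jarPath;
    final String className;
    final int priority;

    CoprocessorSpec(String jarPath, String className, int priority) {
        this.jarPath = jarPath;
        this.className = className;
        this.priority = priority;
    }

    // Splits "<path-to-jar>|<classname>|<priority>" into its three fields.
    // Whitespace around each field is trimmed, since the slide example wraps lines.
    static CoprocessorSpec parse(String value) {
        String[] parts = value.split("\\|");
        if (parts.length != 3) {
            throw new IllegalArgumentException(
                "Expected <path-to-jar>|<classname>|<priority> but got: " + value);
        }
        return new CoprocessorSpec(
            parts[0].trim(), parts[1].trim(), Integer.parseInt(parts[2].trim()));
    }

    public static void main(String[] args) {
        CoprocessorSpec spec = parse(
            "hdfs://localhost:8020/users/leon/test.jar|coprocessor.Test|10");
        System.out.println(spec.jarPath);
        System.out.println(spec.className);
        System.out.println(spec.priority);
    }
}
```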
Example: Add Coprocessor
public static void main(String[] args) throws IOException {
  Configuration conf = HBaseConfiguration.create();
  FileSystem fs = FileSystem.get(conf);
  Path path = new Path(fs.getUri() + Path.SEPARATOR +
   "test.jar");
  HTableDescriptor htd = new HTableDescriptor("testtable");
  htd.addFamily(new HColumnDescriptor("colfam1"));
  htd.setValue("COPROCESSOR$1", path.toString() +
   "|" + RegionObserverExample.class.getCanonicalName() +
   "|" + Coprocessor.PRIORITY_USER);
  HBaseAdmin admin = new HBaseAdmin(conf);
  admin.createTable(htd);
  System.out.println(admin.getTableDescriptor(
   Bytes.toBytes("testtable")));
}
Example Output
{NAME => 'testtable', COPROCESSOR$1 =>
'file:/test.jar|coprocessor.RegionObserverExample|1073741823',
FAMILIES => [{NAME => 'colfam1', BLOOMFILTER => 'NONE',
REPLICATION_SCOPE => '0', COMPRESSION => 'NONE',
VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536',
IN_MEMORY => 'false', BLOCKCACHE => 'true'}]}
Region Observers
• Handles all region related events
• Hooks for two classes of operations:
  – Lifecycle changes
  – Client API Calls
• All client API calls have a pre/post hook
  – Can be used to grant access on preGet()
  – Can be used to update secondary indexes on
    postPut()
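The pre/post hook idea can be sketched without HBase as follows. This is a plain-Java model under stated assumptions: `PutGetObserver`, `ObservedStore`, and `IndexingObserver` are hypothetical names, the "region" is an in-memory map, and the access rule (only the user "admin" may read) is invented for the example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical observer contract modeled on the pre/post hook idea;
// this is not the HBase RegionObserver interface.
interface PutGetObserver {
    // Returning false vetoes the read.
    boolean preGet(String user, String row);

    // Called after the value has been stored.
    void postPut(String row, String value);
}

// Hypothetical in-memory "region" that calls the hooks around its operations.
class ObservedStore {
    private final Map<String, String> rows = new TreeMap<>();
    private final List<PutGetObserver> observers = new ArrayList<>();

    void register(PutGetObserver observer) {
        observers.add(observer);
    }

    void put(String row, String value) {
        rows.put(row, value);
        for (PutGetObserver o : observers) {
            o.postPut(row, value);
        }
    }

    String get(String user, String row) {
        for (PutGetObserver o : observers) {
            if (!o.preGet(user, row)) {
                throw new SecurityException("Access denied for " + user);
            }
        }
        return rows.get(row);
    }
}

// Keeps a value -> rows secondary index and only lets "admin" read.
// Simplification: overwriting a row does not remove its old index entry.
class IndexingObserver implements PutGetObserver {
    final Map<String, List<String>> index = new HashMap<>();

    @Override
    public boolean preGet(String user, String row) {
        return "admin".equals(user);
    }

    @Override
    public void postPut(String row, String value) {
        index.computeIfAbsent(value, k -> new ArrayList<>()).add(row);
    }
}
```

The store itself knows nothing about indexing or access control; both behaviors arrive through the registered observer, which is the property that makes the trigger comparison apt.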
Handling Region Lifecycle Events



• Hook into pending open, open, and pending close
  state changes
• Called implicitly by the framework
   – preOpen(), postOpen(),…
• Used to piggyback or fail the process, e.g.
   – Cache warm up after a region opens
   – Suppress region splitting, compactions, flushes
Region Environment
Special Hook Parameter
public interface RegionObserver extends Coprocessor {

 /**
  * Called before the region is reported as open to the master.
  * @param c the environment provided by the region server
  */
 void preOpen(final
   ObserverContext<RegionCoprocessorEnvironment> c);

 /**
  * Called after the region is reported as open to the master.
  * @param c the environment provided by the region server
  */
 void postOpen(final
   ObserverContext<RegionCoprocessorEnvironment> c);
ObserverContext
Chain of Command
• The complete() and bypass() methods in particular
  let you change the processing chain
  – complete() ends the chain at the current
    coprocessor
  – bypass() completes the pre/post chain but uses
    the last value returned by the coprocessors,
    possibly not calling the actual API method (for
    pre-hooks)
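A simplified plain-Java model of this chain is sketched below. `SketchContext`, `PreHook`, and `ChainSketch` are hypothetical names, and the loop reflects one reading of the description above: complete() stops later coprocessors from running, while bypass() lets the chain continue but skips the default action.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified context; the real ObserverContext carries more state.
class SketchContext {
    private boolean bypass;
    private boolean complete;

    void bypass() { bypass = true; }
    void complete() { complete = true; }
    boolean shouldBypass() { return bypass; }
    boolean shouldComplete() { return complete; }
}

// Hypothetical pre-hook; each hook appends to the trace so the order is visible.
interface PreHook {
    void run(SketchContext ctx, List<String> trace);
}

class ChainSketch {
    // Runs the pre-hooks in order, then the default action unless a hook
    // asked to bypass it. A hook that calls complete() ends the chain early.
    static List<String> execute(List<PreHook> hooks) {
        List<String> trace = new ArrayList<>();
        boolean bypass = false;
        for (PreHook hook : hooks) {
            SketchContext ctx = new SketchContext();
            hook.run(ctx, trace);
            bypass |= ctx.shouldBypass();
            if (ctx.shouldComplete()) {
                break;
            }
        }
        if (!bypass) {
            trace.add("default action");
        }
        return trace;
    }

    public static void main(String[] args) {
        List<PreHook> hooks = new ArrayList<>();
        hooks.add((ctx, trace) -> { trace.add("first"); ctx.complete(); });
        hooks.add((ctx, trace) -> trace.add("second"));
        // The second hook never runs, but the default action still does.
        System.out.println(execute(hooks));
    }
}
```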
Example: Pre-Hook Complete


@Override
public void preSplit(ObserverContext
     <RegionCoprocessorEnvironment> e) {
  e.complete();
}
Master Observer
• Handles all HMaster related events
  – DDL type calls, e.g. create table, add column
  – Region management calls, e.g. move, assign
• Pre/post hooks with Context
• Specialized environment provided
Master Environment
Master Services (cont.)
• Very powerful features
  – Access the AssignmentManager to modify plans
  – Access the MasterFileSystem to create or access
    resources on HDFS
  – Access the ServerManager to get the list of known
    servers
  – Use the ExecutorService to run system-wide
    background processes
• Be careful (for now)!
Example: Master Post Hook
public class MasterObserverExample
  extends BaseMasterObserver {
  @Override public void postCreateTable(
    ObserverContext<MasterCoprocessorEnvironment> env,
    HRegionInfo[] regions, boolean sync)
    throws IOException {
    String tableName =
      regions[0].getTableDesc().getNameAsString();
    MasterServices services =
      env.getEnvironment().getMasterServices();
    MasterFileSystem masterFileSystem =
     services.getMasterFileSystem();
    FileSystem fileSystem = masterFileSystem.getFileSystem();
    Path blobPath = new Path(tableName + "-blobs");
    fileSystem.mkdirs(blobPath);
  }
}
Example Output
hbase(main):001:0> create 'testtable',
  'colfam1'
0 row(s) in 0.4300 seconds

$ bin/hadoop dfs -ls
  Found 1 items
  drwxr-xr-x - larsgeorge supergroup 0 ...
  /user/larsgeorge/testtable-blobs
Endpoints
• Dynamic RPC extends server-side functionality
  – Useful for MapReduce like implementations
  – Handles the Map part server-side, Reduce needs
    to be done client side
• Based on CoprocessorProtocol interface
• Routing to regions is based on either single
  row keys, or row key ranges
  – Call is sent, no matter if row exists or not since
    region start and end keys are coarse grained
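The routing and the map/reduce split can be illustrated with a plain-Java sketch. `EndpointRoutingSketch` and its methods are hypothetical; regions are modeled as a sorted map from start key to the rows they hold, keys are compared as Java strings rather than byte arrays, and the end row is treated as inclusive for simplicity.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class EndpointRoutingSketch {
    // Region start key -> rows held by that region.
    // Assumption: the first region always starts at the empty key.
    final TreeMap<String, List<String>> regions = new TreeMap<>();

    void addRegion(String startKey, List<String> rows) {
        regions.put(startKey, rows);
    }

    // A row key routes to the region with the greatest start key <= the key,
    // whether or not that row actually exists.
    String regionFor(String rowKey) {
        return regions.floorKey(rowKey);
    }

    // "Map" step: every region touched by the range reports its own count.
    // Each region counts all of its rows, not only those inside the range.
    Map<String, Long> countPerRegion(String startRow, String endRow) {
        Map<String, Long> results = new TreeMap<>();
        String first = regionFor(startRow);
        for (Map.Entry<String, List<String>> e
                : regions.tailMap(first, true).entrySet()) {
            if (e.getKey().compareTo(endRow) > 0) {
                break;
            }
            results.put(e.getKey(), (long) e.getValue().size());
        }
        return results;
    }

    // "Reduce" step: the client adds up the per-region results.
    static long total(Map<String, Long> perRegion) {
        long sum = 0;
        for (long count : perRegion.values()) {
            sum += count;
        }
        return sum;
    }
}
```

With one region starting at the empty key holding two rows and a second region starting at `row3` holding three, a call addressed to `row4` lands on the second region and a full-range call totals five on the client.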
Custom Endpoint Implementation
• Involves two steps:
  – Extend the CoprocessorProtocol interface
     • Defines the actual protocol
  – Extend the BaseEndpointCoprocessor
     • Provides the server-side code and the dynamic RPC
       method
Example: Row Count Protocol
public interface RowCountProtocol
  extends CoprocessorProtocol {
  long getRowCount()
    throws IOException;
  long getRowCount(Filter filter)
    throws IOException;
  long getKeyValueCount()
    throws IOException;
}
Example: Endpoint for Row Count
public class RowCountEndpoint
extends BaseEndpointCoprocessor
implements RowCountProtocol {

 private long getCount(Filter filter,
  boolean countKeyValues) throws IOException {
   Scan scan = new Scan();
  scan.setMaxVersions(1);
  if (filter != null) {
    scan.setFilter(filter);
  }
Example: Endpoint for Row Count
RegionCoprocessorEnvironment environment =
  (RegionCoprocessorEnvironment)
  getEnvironment();
// use an internal scanner to perform
// scanning.
InternalScanner scanner =
  environment.getRegion().getScanner(scan);
// long matches the return type and avoids
// overflow on very large regions
long result = 0;
Example: Endpoint for Row Count
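    try {
      List<KeyValue> curVals =
        new ArrayList<KeyValue>();
      boolean hasMore = false;
      do {
        curVals.clear();
        // next() returns true while more rows remain
        hasMore = scanner.next(curVals);
        // skip the empty batch an empty region returns
        if (!curVals.isEmpty()) {
          result += countKeyValues ? curVals.size() : 1;
        }
      } while (hasMore);
    } finally {
      scanner.close();
    }
    return result;
}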
Example: Endpoint for Row Count
    @Override
    public long getRowCount() throws IOException {
      return getRowCount(new FirstKeyOnlyFilter());
    }

    @Override
    public long getRowCount(Filter filter) throws IOException {
     return getCount(filter, false);
    }

    @Override
    public long getKeyValueCount() throws IOException {
      return getCount(null, true);
    }
}
Endpoint Invocation
• There are two ways to invoke the call
  – By Proxy, using HTable.coprocessorProxy()
     • Uses a delayed model, i.e. the call is sent when the proxied
       method is invoked
  – By Exec, using HTable.coprocessorExec()
     • The call is sent in parallel to all regions and the results are
       collected immediately
• The Batch.Call class is used by coprocessorExec()
  to wrap the calls per region
• The optional Batch.Callback can be used to react
  upon completion of the remote call
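The delayed model of the proxy approach can be sketched with the JDK's own dynamic proxies. This is an illustration only: `CounterProtocol` and `DelayedProxySketch` are hypothetical, the "remote" side is a local object, and nothing here claims to mirror how HTable builds its proxy internally.

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Proxy;
import java.util.ArrayList;
import java.util.List;

// Hypothetical protocol for the sketch; stands in for an interface
// such as RowCountProtocol.
interface CounterProtocol {
    long getRowCount();
}

class DelayedProxySketch {
    // Records every simulated remote call so the timing is visible.
    static final List<String> sentCalls = new ArrayList<>();

    // Returns a client-side stub. Nothing is "sent" when the stub is created;
    // the call goes out only when a protocol method is invoked on it.
    static CounterProtocol proxyFor(String row, CounterProtocol serverSide) {
        InvocationHandler handler = (proxy, method, args) -> {
            sentCalls.add(method.getName() + " -> region of " + row);
            return method.invoke(serverSide, args);
        };
        return (CounterProtocol) Proxy.newProxyInstance(
            CounterProtocol.class.getClassLoader(),
            new Class<?>[] { CounterProtocol.class },
            handler);
    }

    public static void main(String[] args) {
        CounterProtocol stub = proxyFor("row4", () -> 3L);
        System.out.println("Calls after creating the stub: " + sentCalls.size());
        long rows = stub.getRowCount();
        System.out.println("Calls after invoking it: " + sentCalls.size());
        System.out.println("Region Row Count: " + rows);
    }
}
```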
Exec vs. Proxy
Example: Invocation by Exec
public static void main(String[] args) throws IOException {
 Configuration conf = HBaseConfiguration.create();
 HTable table = new HTable(conf, "testtable");
 try {
   Map<byte[], Long> results =
    table.coprocessorExec(RowCountProtocol.class, null, null,
    new Batch.Call<RowCountProtocol, Long>() {
      @Override
      public Long call(RowCountProtocol counter)
      throws IOException {
        return counter.getRowCount();
      }
    });
Example: Invocation by Exec
      long total = 0;
      for (Map.Entry<byte[], Long> entry :
           results.entrySet()) {
        total += entry.getValue().longValue();
        System.out.println("Region: " +
          Bytes.toString(entry.getKey()) +
          ", Count: " + entry.getValue());
      }
      System.out.println("Total Count: " + total);
    } catch (Throwable throwable) {
       throwable.printStackTrace();
    }
}
Example Output
Region:
  testtable,,1303417572005.51f9e2251c...cb
  cb0c66858f., Count: 2
Region: testtable,row3,
  1303417572005.7f3df4dcba...dbc99fce5d
  87., Count: 3
Total Count: 5
Batch Convenience
• The Batch.forMethod() helps to quickly map a
  protocol function into a Batch.Call
• Useful for single method calls to the servers
• Uses the Java reflection API to retrieve the
  named method
• Saves you from implementing the anonymous
  inline class
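How a reflection-based helper of this kind can replace the anonymous class is sketched below in plain Java. `KeyValueCounter`, `RegionCall`, and `ForMethodSketch` are hypothetical stand-ins, the helper supports only no-argument methods, and it is not the Batch.forMethod() implementation.

```java
import java.lang.reflect.Method;

// Hypothetical protocol with a single no-argument method.
interface KeyValueCounter {
    long getKeyValueCount();
}

// Hypothetical per-region call wrapper, playing the role of Batch.Call.
interface RegionCall<T, R> {
    R call(T instance);
}

class ForMethodSketch {
    // Looks up a no-argument method by name and wraps it in a RegionCall,
    // so the caller does not have to write an anonymous class.
    static <T, R> RegionCall<T, R> forMethod(Class<T> protocol, String methodName) {
        final Method method;
        try {
            method = protocol.getMethod(methodName);
        } catch (NoSuchMethodException e) {
            throw new IllegalArgumentException("No such method: " + methodName, e);
        }
        return new RegionCall<T, R>() {
            @Override
            @SuppressWarnings("unchecked")
            public R call(T instance) {
                try {
                    // The result type is not checked until the caller uses it.
                    return (R) method.invoke(instance);
                } catch (ReflectiveOperationException e) {
                    throw new IllegalStateException(e);
                }
            }
        };
    }

    public static void main(String[] args) {
        RegionCall<KeyValueCounter, Long> call =
            forMethod(KeyValueCounter.class, "getKeyValueCount");
        KeyValueCounter region = () -> 7L; // stands in for one region's endpoint
        System.out.println(call.call(region));
    }
}
```

The trade-off is the usual one with reflection: a misspelled method name or a wrong result type is only detected at run time, where the anonymous class would fail at compile time.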
Batch Convenience
Batch.Call<RowCountProtocol, Long> call =
 Batch.forMethod(
   RowCountProtocol.class,
   "getKeyValueCount");
Map<byte[], Long> results =
 table.coprocessorExec(
   RowCountProtocol.class,
   null, null, call);
Call Multiple Endpoints
• Sometimes you need to call more than one
  endpoint in a single roundtrip call to the
  servers
• This requires an anonymous inline class, since
  Batch.forMethod cannot handle this
Call Multiple Endpoints
Map<byte[], Pair<Long, Long>>
results = table.coprocessorExec(
 RowCountProtocol.class, null, null,
 new Batch.Call<RowCountProtocol,
   Pair<Long, Long>>() {
   public Pair<Long, Long> call(
     RowCountProtocol counter)
   throws IOException {
      return new Pair<Long, Long>(
     counter.getRowCount(),
     counter.getKeyValueCount());
   }
 });
Example: Invocation by Proxy
RowCountProtocol protocol =
  table.coprocessorProxy(
    RowCountProtocol.class,
    Bytes.toBytes("row4"));
long rowsInRegion =
  protocol.getRowCount();
  System.out.println(
    "Region Row Count: " +
    rowsInRegion);
Questions?

• Contact:
  Email: lars@cloudera.com
  Twitter: @larsgeorge

• Talk at Hadoop World, November 8th & 9th
Special Offer for Webcast Attendees
Visit http://oreilly.com to purchase your copy of
HBase: The Definitive Guide and enter code 4CAST
to save 40% off the print book & 50% off the ebook

Visit http://oreilly.com/webcasts to view upcoming webcasts and online events.

Nov. 4, 2011 o reilly webcast-hbase- lars george

  • 1. HBase Coprocessors Deploy shared functionality directly on the cluster O’Reilly Webcast November 4th, 2011
  • 2. About Me • Solutions Architect @ Cloudera • Apache HBase & Whirr Committer • Author of HBase – The Definitive Guide • Working with HBase since end of 2007 • Organizer of the Munich OpenHUG • Speaker at Conferences (Fosdem, Hadoop World)
  • 3. Overview • Coprocessors were added to Bigtable – Mentioned during LADIS 2009 talk • Runs user code within each region of a table – Code split and moves with region • Defines high level call interface for clients • Calls addressed to rows or ranges of rows • Implicit automatic scaling, load balancing, and request routing
  • 4. Examples Use-Cases • Bigtable uses Coprocessors – Scalable metadata management – Distributed language model for machine translation – Distributed query processing for full-text index – Regular expression search in code repository • MapReduce jobs over HBase are often map-only jobs – Row keys are already sorted and distinct ➜ Could be replaced by Coprocessors
  • 5. HBase Coprocessors • Inspired by Google’s Coprocessors – Not much information available, but general idea is understood • Define various types of server-side code extensions – Associated with table using a table property – Attribute is a path to JAR file – JAR is loaded when region is opened – Blends new functionality with existing • Can be chained with Priorities and Load Order ➜ Allows for dynamic RPC extensions
  • 6. Coprocessor Classes and Interfaces • The Coprocessor Interface – All user code must inherit from this class • The CoprocessorEnvironment Interface – Retains state across invocations – Predefined classes • The CoprocessorHost Interface – Ties state and user code together – Predefined classes
  • 7. Coprocessor Priority • System or User /** Highest installation priority */ static final int PRIORITY_HIGHEST = 0; /** High (system) installation priority */ static final int PRIORITY_SYSTEM = Integer.MAX_VALUE / 4; /** Default installation prio for user coprocessors */ static final int PRIORITY_USER = Integer.MAX_VALUE / 2; /** Lowest installation priority */ static final int PRIORITY_LOWEST = Integer.MAX_VALUE;
  • 9. Coprocessor Host • Maintains all Coprocessor instances and their environments (state) • Concrete Classes – MasterCoprocessorHost – RegionCoprocessorHost – WALCoprocessorHost • Subclasses provide access to specialized Environment implementations
  • 11. Coprocessor Interface • Base for all other types of Coprocessors • start() and stop() methods for lifecycle management • State as defined in the interface:
  • 12. Observer Classes • Comparable to database triggers – Callback functions/hooks for every explicit API method, but also all important internal calls • Concrete Implementations – MasterObserver • Hooks into HMaster API – RegionObserver • Hooks into Region related operations – WALObserver • Hooks into write-ahead log operations
  • 13. Region Observers • Can mediate (veto) actions – Used by the security policy extensions – Priority allows mediators to run first • Hooks into all CRUD+S API calls and more – get(), put(), delete(), scan(), increment(),… – checkAndPut(), checkAndDelete(),… – flush(), compact(), split(),… • Pre/Post Hooks for every call • Can be used to build secondary indexes, filters
  • 14. Endpoint Classes • Define a dynamic RPC protocol, used between client and region server • Executes arbitrary code, loaded in region server – Future development will add code weaving/inspection to deny any malicious code • Steps to add your own methods – Define and implement your own protocol – Implement endpoint coprocessor – Call HTable’s coprocessorExec() or coprocessorProxy()
  • 15. Coprocessor Loading • There are two ways: dynamic or static – Static: use configuration files and table schema – Dynamic: not available (yet) • For static loading from configuration: – Order is important (defines the execution order) – Special property key for each host type – Region related classes are loaded for all regions and tables – Priority is always System – JAR must be on class path
  • 16. Loading from Configuration • Example: <property> <name>hbase.coprocessor.region.classes</name> <value>coprocessor.RegionObserverExample, coprocessor.AnotherCoprocessor</value> </property> <property> <name>hbase.coprocessor.master.classes</name> <value>coprocessor.MasterObserverExample</value> </property> <property> <name>hbase.coprocessor.wal.classes</name> <value>coprocessor.WALObserverExample, bar.foo.MyWALObserver</value> </property>
  • 17. Coprocessor Loading (cont.) • For static loading from table schema: – Definition per table – For all regions of the table – Only region related classes, not WAL or Master – Added to HTableDescriptor, when table is created or altered – Allows to set the priority and JAR path COPROCESSOR$<num> <path-to-jar>|<classname>|<priority>
  • 18. Loading from Table Schema • Example: 'COPROCESSOR$1' => 'hdfs://localhost:8020/users/leon/test.jar| coprocessor.Test|10' 'COPROCESSOR$2' => '/Users/laura/test2.jar| coprocessor.AnotherTest|1000'
  • 19. Example: Add Coprocessor public static void main(String[] args) throws IOException { Configuration conf = HBaseConfiguration.create(); FileSystem fs = FileSystem.get(conf); Path path = new Path(fs.getUri() + Path.SEPARATOR + "test.jar"); HTableDescriptor htd = new HTableDescriptor("testtable"); htd.addFamily(new HColumnDescriptor("colfam1")); htd.setValue("COPROCESSOR$1", path.toString() + "|" + RegionObserverExample.class.getCanonicalName() + "|" + Coprocessor.PRIORITY_USER); HBaseAdmin admin = new HBaseAdmin(conf); admin.createTable(htd); System.out.println(admin.getTableDescriptor( Bytes.toBytes("testtable"))); }
  • 20. Example Output {NAME => 'testtable', COPROCESSOR$1 => 'file:/test.jar|coprocessor.RegionObserverExample|1073741823', FAMILIES => [{NAME => 'colfam1', BLOOMFILTER => 'NONE', REPLICATION_SCOPE => '0', COMPRESSION => 'NONE', VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}]}
  • 21. Region Observers
      • Handles all region related events
      • Hooks for two classes of operations:
          – Lifecycle changes
          – Client API calls
      • All client API calls have a pre/post hook
          – Can be used to grant access on preGet()
          – Can be used to update secondary indexes on postPut()
  • 22. Handling Region Lifecycle Events
      • Hook into pending open, open, and pending close state changes
      • Called implicitly by the framework
          – preOpen(), postOpen(), …
      • Used to piggyback or fail the process, e.g.
          – Cache warm up after a region opens
          – Suppress region splitting, compactions, flushes
  • 24. Special Hook Parameter
      public interface RegionObserver extends Coprocessor {

        /**
         * Called before the region is reported as open to the master.
         * @param c the environment provided by the region server
         */
        void preOpen(final ObserverContext<RegionCoprocessorEnvironment> c);

        /**
         * Called after the region is reported as open to the master.
         * @param c the environment provided by the region server
         */
        void postOpen(final ObserverContext<RegionCoprocessorEnvironment> c);

        // ...
      }
  • 26. Chain of Command
      • The complete() and bypass() methods in particular allow changing the processing chain
          – complete() ends the chain at the current coprocessor
          – bypass() completes the pre/post chain but uses the last value returned by the coprocessors, possibly without calling the actual API method (for pre-hooks)
  • 27. Example: Pre-Hook Complete
      @Override
      public void preSplit(ObserverContext<RegionCoprocessorEnvironment> e) {
        e.complete();
      }
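The difference between complete() and bypass() can be illustrated with a small stand-alone simulation. This is a sketch under simplifying assumptions, not HBase's implementation: `HookContext`, `PreHook`, and `HookChainDemo` are hypothetical names, and a single context object is shared by the whole chain.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical stand-in for ObserverContext: carries the two control flags. */
final class HookContext {
    private boolean bypass;
    private boolean complete;

    void bypass() { bypass = true; }
    void complete() { complete = true; }
    boolean shouldBypass() { return bypass; }
    boolean shouldComplete() { return complete; }
}

/** Hypothetical pre-hook; the log records what actually ran. */
interface PreHook {
    void preOperation(HookContext ctx, List<String> log);
}

class HookChainDemo {
    /** Runs the pre-hooks in order, then the default operation unless bypassed. */
    static List<String> run(List<PreHook> chain) {
        List<String> log = new ArrayList<String>();
        HookContext ctx = new HookContext();
        for (PreHook hook : chain) {
            hook.preOperation(ctx, log);
            if (ctx.shouldComplete()) {
                break; // complete(): later coprocessors are not invoked
            }
        }
        if (!ctx.shouldBypass()) {
            log.add("default-operation"); // bypass(): the default step is skipped
        }
        return log;
    }

    public static void main(String[] args) {
        PreHook completing = (ctx, log) -> { log.add("A"); ctx.complete(); };
        PreHook bypassing = (ctx, log) -> { log.add("A"); ctx.bypass(); };
        PreHook plain = (ctx, log) -> log.add("B");
        System.out.println(run(Arrays.asList(completing, plain)));
        System.out.println(run(Arrays.asList(bypassing, plain)));
    }
}
```

In this model complete() only shortens the chain, while bypass() lets the chain finish but suppresses the default operation.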
  • 28. Master Observer
      • Handles all HMaster related events
          – DDL type calls, e.g. create table, add column
          – Region management calls, e.g. move, assign
      • Pre/post hooks with Context
      • Specialized environment provided
  • 30. Master Services (cont.)
      • Very powerful features
          – Access the AssignmentManager to modify plans
          – Access the MasterFileSystem to create or access resources on HDFS
          – Access the ServerManager to get the list of known servers
          – Use the ExecutorService to run system-wide background processes
      • Be careful (for now)!
  • 31. Example: Master Post Hook
      public class MasterObserverExample extends BaseMasterObserver {
        @Override
        public void postCreateTable(
            ObserverContext<MasterCoprocessorEnvironment> env,
            HRegionInfo[] regions, boolean sync) throws IOException {
          String tableName = regions[0].getTableDesc().getNameAsString();
          MasterServices services = env.getEnvironment().getMasterServices();
          MasterFileSystem masterFileSystem = services.getMasterFileSystem();
          FileSystem fileSystem = masterFileSystem.getFileSystem();
          Path blobPath = new Path(tableName + "-blobs");
          fileSystem.mkdirs(blobPath);
        }
      }
  • 32. Example Output
      hbase(main):001:0> create 'testtable', 'colfam1'
      0 row(s) in 0.4300 seconds

      $ bin/hadoop dfs -ls
      Found 1 items
      drwxr-xr-x - larsgeorge supergroup 0 ... /user/larsgeorge/testtable-blobs
  • 33. Endpoints
      • Dynamic RPC extends server-side functionality
          – Useful for MapReduce-like implementations
          – Handles the Map part server-side; the Reduce part needs to be done client-side
      • Based on the CoprocessorProtocol interface
      • Routing to regions is based on either single row keys or row key ranges
          – The call is sent whether or not the row exists, since region start and end keys are coarse-grained
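Routing by row key can be pictured as a lookup in a sorted map of region start keys. The sketch below is an illustration only, not HBase client code: the region names and the `row7` boundary are invented, and keys are compared as plain strings. It shows why a call for a row that does not exist still reaches a region.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

class RegionRoutingDemo {
    /** Region start key mapped to a made-up region name; "" starts the first region. */
    static TreeMap<String, String> sampleRegions() {
        TreeMap<String, String> regions = new TreeMap<String, String>();
        regions.put("", "region-1");
        regions.put("row3", "region-2");
        regions.put("row7", "region-3"); // hypothetical third region
        return regions;
    }

    /** A single row key goes to the region with the greatest start key <= row. */
    static String regionForRow(TreeMap<String, String> regions, String row) {
        return regions.floorEntry(row).getValue();
    }

    /** A key range (startRow <= endRow) goes to every region overlapping it. */
    static List<String> regionsForRange(TreeMap<String, String> regions,
                                        String startRow, String endRow) {
        String firstStartKey = regions.floorKey(startRow);
        return new ArrayList<String>(
            regions.subMap(firstStartKey, true, endRow, true).values());
    }

    public static void main(String[] args) {
        TreeMap<String, String> regions = sampleRegions();
        System.out.println(regionForRow(regions, "row4"));
        System.out.println(regionForRow(regions, "row4-does-not-exist"));
        System.out.println(regionsForRange(regions, "row1", "row5"));
    }
}
```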
  • 34. Custom Endpoint Implementation
      • Involves two steps:
          – Extend the CoprocessorProtocol interface
              • Defines the actual protocol
          – Extend the BaseEndpointCoprocessor
              • Provides the server-side code and the dynamic RPC method
  • 35. Example: Row Count Protocol
      public interface RowCountProtocol extends CoprocessorProtocol {
        long getRowCount() throws IOException;
        long getRowCount(Filter filter) throws IOException;
        long getKeyValueCount() throws IOException;
      }
  • 36. Example: Endpoint for Row Count
      public class RowCountEndpoint extends BaseEndpointCoprocessor
          implements RowCountProtocol {

        private long getCount(Filter filter, boolean countKeyValues)
            throws IOException {
          Scan scan = new Scan();
          scan.setMaxVersions(1);
          if (filter != null) {
            scan.setFilter(filter);
          }
  • 37. Example: Endpoint for Row Count
          RegionCoprocessorEnvironment environment =
            (RegionCoprocessorEnvironment) getEnvironment();
          // use an internal scanner to perform scanning.
          InternalScanner scanner = environment.getRegion().getScanner(scan);
          // long rather than int, to match the method's return type
          long result = 0;
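  • 38. Example: Endpoint for Row Count
          try {
            List<KeyValue> curVals = new ArrayList<KeyValue>();
            // next() returns true while more rows remain after the current one
            boolean hasMore = false;
            do {
              curVals.clear();
              hasMore = scanner.next(curVals);
              // skip empty batches so an empty region is not counted as one row
              if (!curVals.isEmpty()) {
                result += countKeyValues ? curVals.size() : 1;
              }
            } while (hasMore);
          } finally {
            scanner.close();
          }
          return result;
        }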
  • 39. Example: Endpoint for Row Count
        @Override
        public long getRowCount() throws IOException {
          return getRowCount(new FirstKeyOnlyFilter());
        }

        @Override
        public long getRowCount(Filter filter) throws IOException {
          return getCount(filter, false);
        }

        @Override
        public long getKeyValueCount() throws IOException {
          return getCount(null, true);
        }
      }
  • 40. Endpoint Invocation
      • There are two ways to invoke the call
          – By Proxy, using HTable.coprocessorProxy()
              • Uses a delayed model, i.e. the call is sent when the proxied method is invoked
          – By Exec, using HTable.coprocessorExec()
              • The call is sent in parallel to all regions and the results are collected immediately
      • The Batch.Call class is used by coprocessorExec() to wrap the calls per region
      • The optional Batch.Callback can be used to react upon completion of the remote call
  • 42. Example: Invocation by Exec
      public static void main(String[] args) throws IOException {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "testtable");
        try {
          Map<byte[], Long> results =
            table.coprocessorExec(RowCountProtocol.class, null, null,
              new Batch.Call<RowCountProtocol, Long>() {
                @Override
                public Long call(RowCountProtocol counter) throws IOException {
                  return counter.getRowCount();
                }
              });
  • 43. Example: Invocation by Exec
          long total = 0;
          for (Map.Entry<byte[], Long> entry : results.entrySet()) {
            total += entry.getValue().longValue();
            System.out.println("Region: " + Bytes.toString(entry.getKey()) +
              ", Count: " + entry.getValue());
          }
          System.out.println("Total Count: " + total);
        } catch (Throwable throwable) {
          throwable.printStackTrace();
        }
      }
  • 44. Example Output
      Region: testtable,,1303417572005.51f9e2251c...cbcb0c66858f., Count: 2
      Region: testtable,row3,1303417572005.7f3df4dcba...dbc99fce5d87., Count: 3
      Total Count: 5
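The Exec model (map on the servers, reduce on the client) can be mimicked without a cluster. The following sketch uses a thread pool in place of the region servers. It is an analogy under stated assumptions, not how HTable is implemented: `FanOutDemo`, `countPerRegion`, and the nested `Callback` are hypothetical names, and each "region" is just a list of row keys.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class FanOutDemo {
    /** Hypothetical callback, playing the role Batch.Callback has in the slides. */
    interface Callback {
        void update(String region, long value);
    }

    /** "Map" step: one task per region, run in parallel; results keyed by region. */
    static Map<String, Long> countPerRegion(Map<String, List<String>> rowsByRegion,
                                            Callback callback) {
        ExecutorService pool =
            Executors.newFixedThreadPool(Math.max(1, rowsByRegion.size()));
        try {
            Map<String, Future<Long>> pending = new TreeMap<String, Future<Long>>();
            for (Map.Entry<String, List<String>> region : rowsByRegion.entrySet()) {
                final List<String> rows = region.getValue();
                pending.put(region.getKey(), pool.submit(new Callable<Long>() {
                    @Override
                    public Long call() {
                        return Long.valueOf(rows.size());
                    }
                }));
            }
            Map<String, Long> results = new TreeMap<String, Long>();
            for (Map.Entry<String, Future<Long>> entry : pending.entrySet()) {
                long count = entry.getValue().get();
                results.put(entry.getKey(), count);
                if (callback != null) {
                    callback.update(entry.getKey(), count);
                }
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }

    /** "Reduce" step, done on the client: sum the per-region partial counts. */
    static long total(Map<String, Long> results) {
        long sum = 0;
        for (long value : results.values()) {
            sum += value;
        }
        return sum;
    }

    public static void main(String[] args) {
        Map<String, List<String>> rowsByRegion = new TreeMap<String, List<String>>();
        rowsByRegion.put("region-1", Arrays.asList("row1", "row2"));
        rowsByRegion.put("region-2", Arrays.asList("row3", "row4", "row5"));
        Map<String, Long> results = countPerRegion(rowsByRegion, null);
        System.out.println(results + ", total " + total(results));
    }
}
```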
  • 45. Batch Convenience
      • Batch.forMethod() helps to quickly map a protocol function into a Batch.Call
      • Useful for single method calls to the servers
      • Uses the Java reflection API to retrieve the named method
      • Saves you from implementing the anonymous inline class
  • 46. Batch Convenience
      Batch.Call<RowCountProtocol, Long> call = Batch.forMethod(
        RowCountProtocol.class, "getKeyValueCount");
      Map<byte[], Long> results = table.coprocessorExec(
        RowCountProtocol.class, null, null, call);
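How a reflection-based helper can turn a method name into a call object is sketched below. This is not the source of Batch.forMethod(). `RemoteCall`, `KeyValueCounter`, and this `forMethod` are hypothetical stand-ins, limited to no-argument methods, and they wrap checked exceptions in unchecked ones to keep the example short.

```java
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Method;

/** Hypothetical stand-in for Batch.Call: one invocation against a protocol instance. */
interface RemoteCall<T, R> {
    R call(T instance);
}

/** Hypothetical protocol with a single no-argument method. */
interface KeyValueCounter {
    long getKeyValueCount();
}

class ReflectiveCallDemo {
    /** Looks the method up once by name and wraps it in a reusable call object. */
    static <T, R> RemoteCall<T, R> forMethod(Class<T> protocol, String methodName) {
        final Method method;
        try {
            method = protocol.getMethod(methodName);
        } catch (NoSuchMethodException e) {
            throw new IllegalArgumentException("No such method: " + methodName, e);
        }
        return new RemoteCall<T, R>() {
            @Override
            @SuppressWarnings("unchecked")
            public R call(T instance) {
                try {
                    // Primitive results such as long come back boxed (Long).
                    return (R) method.invoke(instance);
                } catch (IllegalAccessException | InvocationTargetException e) {
                    throw new IllegalStateException("Call failed: " + method.getName(), e);
                }
            }
        };
    }

    public static void main(String[] args) {
        RemoteCall<KeyValueCounter, Long> call =
            forMethod(KeyValueCounter.class, "getKeyValueCount");
        KeyValueCounter local = new KeyValueCounter() {
            @Override
            public long getKeyValueCount() {
                return 5L;
            }
        };
        System.out.println(call.call(local));
    }
}
```

The trade-off is the usual one for reflection: less boilerplate, but a misspelled method name is only detected at run time.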
  • 47. Call Multiple Endpoints
      • Sometimes you need to call more than one endpoint in a single roundtrip call to the servers
      • This requires an anonymous inline class, since Batch.forMethod cannot handle this
  • 48. Call Multiple Endpoints
      Map<byte[], Pair<Long, Long>> results = table.coprocessorExec(
        RowCountProtocol.class, null, null,
        new Batch.Call<RowCountProtocol, Pair<Long, Long>>() {
          public Pair<Long, Long> call(RowCountProtocol counter)
              throws IOException {
            return new Pair<Long, Long>(
              counter.getRowCount(), counter.getKeyValueCount());
          }
        });
  • 49. Example: Invocation by Proxy
      RowCountProtocol protocol = table.coprocessorProxy(
        RowCountProtocol.class, Bytes.toBytes("row4"));
      long rowsInRegion = protocol.getRowCount();
      System.out.println("Region Row Count: " + rowsInRegion);
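The delayed model of the proxy approach can be demonstrated with the JDK's java.lang.reflect.Proxy. The sketch below is only an analogy for the behaviour described on the invocation slide. `RowCounter`, `proxyFor`, and the canned result are invented for illustration, and the "call" is recorded in a list instead of being sent to a region server.

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Method;
import java.lang.reflect.Proxy;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical protocol interface. */
interface RowCounter {
    long getRowCount();
}

class DelayedProxyDemo {
    /**
     * Creates a proxy bound to a row key. Nothing is "sent" here; the handler
     * only runs when a method is invoked on the returned proxy. The handler is
     * deliberately simplistic and answers every method with the same value.
     */
    @SuppressWarnings("unchecked")
    static <T> T proxyFor(Class<T> protocol, final String row,
                          final List<String> sentCalls) {
        InvocationHandler handler = new InvocationHandler() {
            @Override
            public Object invoke(Object proxy, Method method, Object[] args) {
                sentCalls.add(method.getName() + "@" + row); // stands in for the RPC
                return Long.valueOf(3L); // canned per-region result
            }
        };
        return (T) Proxy.newProxyInstance(
            protocol.getClassLoader(), new Class<?>[] { protocol }, handler);
    }

    public static void main(String[] args) {
        List<String> sentCalls = new ArrayList<String>();
        RowCounter counter = proxyFor(RowCounter.class, "row4", sentCalls);
        System.out.println("Calls after creating the proxy: " + sentCalls.size());
        long rows = counter.getRowCount();
        System.out.println("Calls after getRowCount(): " + sentCalls + ", result " + rows);
    }
}
```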
  • 50. Questions?
      • Contact:
          Email: [email protected]
          Twitter: @larsgeorge
      • Talk at Hadoop World, November 8th & 9th
  • 51. Special Offer for Webcast Attendees
      Visit http://oreilly.com to purchase your copy of HBase: The Definitive Guide and enter code 4CAST to save 40% off the print book and 50% off the ebook.
      Visit http://oreilly.com/webcasts to view upcoming webcasts and online events.