Flink: Joining HBase Dimension Table Data with Async I/O



License: CC 4.0 BY-SA

Tags: #flink

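This article describes the benefits of using Flink's async I/O to join HBase dimension table data, including higher throughput and lower latency, and how a caching strategy can optimize it further. It also explains the underlying principles, covering async I/O and the Guava CacheBuilder caching mechanism, and provides source code analysis and related example code.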


I. Advantages of Joining HBase Dimension Table Data with Async I/O

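To avoid frequent synchronous lookups against an external dimension table in a stream processing job, Flink provides async I/O, which lets the job interact with external systems concurrently and so reduces the throughput loss and latency caused by network round trips. To further cut down on calls to the external system, it is advisable to keep recently used dimension data in an in-process cache, following the LRU (least recently used) idea. A caching mechanism widely used in industry is the CacheBuilder provided by the Guava library.
The overall design is: use async I/O to load HBase dimension data into the cache. When joining against the dimension table, look in the cache first; on a miss, query the HBase table and then load the result into the cache.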
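The cache-first lookup described above can be sketched with the JDK alone. This is a minimal illustration under stated assumptions, not the article's implementation: `DimLookup` and its `store` function are hypothetical names, a `LinkedHashMap` in access order stands in for Guava's CacheBuilder, and a `CompletableFuture` stands in for the asynchronous HBase call.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

/** Cache-first dimension lookup: try the LRU cache, fall back to an async store query. */
class DimLookup {
    private final Map<String, String> lru;
    // Hypothetical async lookup against the external store (stand-in for HBase)
    private final Function<String, CompletableFuture<String>> store;

    DimLookup(int capacity, Function<String, CompletableFuture<String>> store) {
        this.store = store;
        // accessOrder=true makes iteration order least-recently-used first
        this.lru = new LinkedHashMap<String, String>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, String> eldest) {
                return size() > capacity; // evict once the cache exceeds its capacity
            }
        };
    }

    /** Returns the cached value or null; a hit also marks the entry as recently used. */
    synchronized String cached(String key) {
        return lru.get(key);
    }

    private synchronized void remember(String key, String value) {
        lru.put(key, value);
    }

    /** Cache hit: completed future. Cache miss: query the store, then fill the cache. */
    CompletableFuture<String> lookup(String key) {
        String hit = cached(key);
        if (hit != null) {
            return CompletableFuture.completedFuture(hit);
        }
        return store.apply(key).thenApply(value -> {
            if (value != null) {
                remember(key, value);
            }
            return value;
        });
    }
}

class DimLookupDemo {
    public static void main(String[] args) {
        AtomicInteger storeCalls = new AtomicInteger();
        DimLookup dims = new DimLookup(2, key -> CompletableFuture.supplyAsync(() -> {
            storeCalls.incrementAndGet();
            return "dim-" + key;
        }));
        System.out.println(dims.lookup("u1").join());            // dim-u1 (from the store)
        System.out.println(dims.lookup("u1").join());            // dim-u1 (from the cache)
        System.out.println("store calls: " + storeCalls.get());  // store calls: 1
    }
}
```

Two concurrent misses for the same key would both reach the store in this sketch; a production cache would usually deduplicate in-flight loads.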

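1. Advantages

This approach keeps a large volume of dimension data from exhausting memory, and it also makes it possible to join against several dimensions.

2. Disadvantages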

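1. It needs an asynchronous client. The standard blocking HBase client cannot be used as-is because it interacts synchronously, so this article uses the asynchronous client asynchbase. If the storage system you query has no asynchronous client, you can simulate asynchronous requests with a thread pool of your own.
2. Because a cache is involved, updates to dimension data become visible only after some delay.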
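The thread-pool fallback mentioned in point 1 might look like the sketch below. It assumes nothing beyond the JDK: `AsyncAdapter` and `slowGet` are hypothetical names, and `slowGet` merely imitates a blocking client call.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Function;

/** Wraps a blocking lookup so callers receive a future instead of waiting on the call. */
class AsyncAdapter<K, V> implements AutoCloseable {
    private final ExecutorService pool;
    private final Function<K, V> blockingLookup;

    AsyncAdapter(int threads, Function<K, V> blockingLookup) {
        this.pool = Executors.newFixedThreadPool(threads);
        this.blockingLookup = blockingLookup;
    }

    /** Runs the blocking call on a pool thread; the caller is not blocked. */
    CompletableFuture<V> lookupAsync(K key) {
        return CompletableFuture.supplyAsync(() -> blockingLookup.apply(key), pool);
    }

    @Override
    public void close() {
        pool.shutdown();
    }
}

class AsyncAdapterDemo {
    // Hypothetical stand-in for a synchronous client call with network latency
    static String slowGet(String rowKey) {
        try {
            Thread.sleep(100);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return "row:" + rowKey;
    }

    public static void main(String[] args) {
        try (AsyncAdapter<String, String> client = new AsyncAdapter<>(4, AsyncAdapterDemo::slowGet)) {
            // Both requests are in flight at the same time
            CompletableFuture<String> first = client.lookupAsync("k1");
            CompletableFuture<String> second = client.lookupAsync("k2");
            System.out.println(first.join() + " " + second.join()); // row:k1 row:k2
        }
    }
}
```

The pool size caps how many lookups run concurrently, so it should be chosen with the external system's capacity in mind.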

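3. Use Cases

This approach fits best when the dimension data set is very large and some delay in picking up dimension updates is acceptable, or when the dimension table itself is rarely updated.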
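One way to keep that update delay bounded is to expire cache entries a fixed time after they are written; Guava's CacheBuilder offers `expireAfterWrite` for this. The sketch below shows the idea with the JDK only. `TtlCache` is a hypothetical class, and the clock is injected so the behaviour is easy to check.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.LongSupplier;

/** Cache whose entries expire a fixed time after being written, which bounds staleness. */
class TtlCache<K, V> {
    private static final class Entry<T> {
        final T value;
        final long writtenAtMillis;

        Entry(T value, long writtenAtMillis) {
            this.value = value;
            this.writtenAtMillis = writtenAtMillis;
        }
    }

    private final Map<K, Entry<V>> entries = new HashMap<>();
    private final long ttlMillis;
    private final LongSupplier clock; // injected time source, in milliseconds

    TtlCache(long ttlMillis, LongSupplier clock) {
        this.ttlMillis = ttlMillis;
        this.clock = clock;
    }

    synchronized void put(K key, V value) {
        entries.put(key, new Entry<>(value, clock.getAsLong()));
    }

    /** Returns the cached value, or null if it is absent or at least ttlMillis old. */
    synchronized V getIfFresh(K key) {
        Entry<V> entry = entries.get(key);
        if (entry == null) {
            return null;
        }
        if (clock.getAsLong() - entry.writtenAtMillis >= ttlMillis) {
            entries.remove(key); // expired: the caller should reload from the store
            return null;
        }
        return entry.value;
    }
}

class TtlCacheDemo {
    public static void main(String[] args) {
        long[] now = {0L}; // fake clock for the demo
        TtlCache<String, String> cache = new TtlCache<>(60_000L, () -> now[0]);
        cache.put("u1", "dim-u1");
        now[0] = 59_999L;
        System.out.println(cache.getIfFresh("u1")); // dim-u1
        now[0] = 60_000L;
        System.out.println(cache.getIfFresh("u1")); // null
    }
}
```

With a TTL of this kind, a changed dimension row is picked up at most one TTL after the stale copy was cached.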

II. How It Works

1. Async I/O

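[Figure: synchronous vs. asynchronous interaction with external storage]
The figure above, taken from the official Flink documentation, compares how a stream processing engine interacts with external storage synchronously (left) and asynchronously (right).
With synchronous interaction, the task sends one request and then blocks until the storage system returns the result, which clearly slows processing down. When I first started with Flink, I liked to open a connection to the external store in open() of a RichFunction and then use that connection directly in map() or filter() to fetch the data I needed. That is a typical synchronous interaction.
With asynchronous interaction, many queries are in flight at once and each result can be used as soon as it arrives, so stream processing and querying effectively proceed independently. The async I/O operator can also preserve the order of results when that is required.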
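In Flink the two emission modes are chosen with `AsyncDataStream.unorderedWait` and `AsyncDataStream.orderedWait`. The sketch below is not Flink's implementation; it is a plain-JDK illustration of the difference, with `EmitOrderDemo` as a hypothetical name and manually completed futures standing in for query responses.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.CompletableFuture;

class EmitOrderDemo {
    /** Unordered: each result is emitted the moment its own request completes. */
    static List<String> unordered(List<CompletableFuture<String>> pending) {
        List<String> out = Collections.synchronizedList(new ArrayList<>());
        for (CompletableFuture<String> request : pending) {
            request.thenAccept(out::add);
        }
        return out;
    }

    /** Ordered: a result is emitted only after every earlier request has been emitted. */
    static List<String> ordered(List<CompletableFuture<String>> pending) {
        List<String> out = Collections.synchronizedList(new ArrayList<>());
        CompletableFuture<Void> emittedSoFar = CompletableFuture.completedFuture(null);
        for (CompletableFuture<String> request : pending) {
            // Runs only when the previous emission is done AND this request has completed
            emittedSoFar = emittedSoFar.thenAcceptBoth(request, (ignored, value) -> out.add(value));
        }
        return out;
    }

    public static void main(String[] args) {
        CompletableFuture<String> r1 = new CompletableFuture<>();
        CompletableFuture<String> r2 = new CompletableFuture<>();
        CompletableFuture<String> r3 = new CompletableFuture<>();
        List<CompletableFuture<String>> pending = Arrays.asList(r1, r2, r3);

        List<String> unorderedOut = unordered(pending);
        List<String> orderedOut = ordered(pending);

        // Responses arrive out of order: third request first, then first, then second
        r3.complete("c");
        r1.complete("a");
        r2.complete("b");

        System.out.println(unorderedOut); // [c, a, b]
        System.out.println(orderedOut);   // [a, b, c]
    }
}
```

Ordered emission holds a finished result back until its predecessors are out, which is why the unordered mode generally gives lower latency.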

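2. Caching Mechanism

The caching mechanism used here is the CacheBuilder provided by the Guava library.

III. Source Code Walkthrough

I. CacheBuilder Cache

II. HBase Async Client

HBase async client official site
Be sure to read the Javadoc carefully; it explains usage in detail.
HBase async client source code (Git repository)
The source analysis below uses v1.8.2.

To use the async client, add this dependency: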

        <dependency>
            <groupId>org.hbase</groupId>
            <artifactId>asynchbase</artifactId>
            <version>1.8.2</version>
        </dependency>

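A complete piece of code that talks to HBase asynchronously builds on the following pieces.

1. HBaseClient

HBaseClient source location

Since only get is used so far, only the source of the two get methods is listed. Both fetch data from HBase.
The difference is that the first issues a single GetRequest, so it reads from one dimension table per call, while the second takes a list of GetRequests and can therefore read from several dimension tables in one batch.
In production, however, I keep several dimension tables in one HBase table, with a different column family for each.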

  /**
   * Retrieves data from HBase.
   * @param request The {@code get} request.
   * @return A deferred list of key-values that matched the get request.
   */
  public Deferred<ArrayList<KeyValue>> get(final GetRequest request) {
    num_gets.increment();
    return sendRpcToRegion(request).addCallbacks(got, Callback.PASSTHROUGH);
  }

  /**
   * Method to issue multiple get requests to HBase in a batch. This can avoid
   * bottlenecks in region clients and improve response time.
   * @param requests A list of one or more GetRequests.
   * @return A deferred grouping of result or exceptions. Note that this API may
   * return a DeferredGroupException if one or more calls failed.
   * @since 1.8
   */
  public Deferred<List<GetResultOrException>> get(final List<GetRequest> requests) {
    return Deferred.groupInOrder(multiGet(requests))
        .addCallback(
            new Callback<List<GetResultOrException>, ArrayList<GetResultOrException>>() {
              public List<GetResultOrException> call(ArrayList<GetResultOrException> results) {
                return results;
              }
            }
        );
  }

Constructor:

  /**
   * Constructor.
   * @param quorum_spec The specification of the quorum, e.g.
   * {@code "host1,host2,host3"}.
   *                    (First argument: the ZooKeeper quorum addresses.)
   * @param base_path The base path under which is the znode for the
   * -ROOT- region.
   *                  (Second argument: the parent znode path in ZooKeeper, not a port.)
   */
  public HBaseClient(final String quorum_spec, final String base_path) {
    this(quorum_spec, base_path, defaultChannelFactory(new Config()));
  }

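2. GetRequest

GetRequest source location

A GetRequest describes how to fetch data from HBase: it simply specifies the row key, column family, and column.
The constructors are the main point of interest. They show that you can specify the row key, column family, or column as the business case requires.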

  /**
   * Constructor.
   * <strong>These byte arrays will NOT be copied.</strong>
   * @param table The non-empty name of the table to use.
   * @param key The row key to get in that table.
   */
  public GetRequest(final byte[] table, final byte[] key) {
    super(table, key);
    this.bufferable = false; //don't buffer get request
  }

  /**
   * Constructor.
   * @param table The non-empty name of the table to use.
   * @param key The row key to get in that table.
   * <strong>This byte array will NOT be copied.</strong>
   */
  public GetRequest(final String table, final byte[] key) {
    this(table.getBytes(), key);
  }

  /**
   * Constructor.
   * @param table The non-empty name of the table to use.
   * @param key The row key to get in that table.
   */
  public GetRequest(final String table, final String key) {
    this(table.getBytes(), key.getBytes());
  }

  /**
   * Constructor.
   * <strong>These byte arrays will NOT be copied.</strong>
   * @param table The non-empty name of the table to use.
   * @param key The row key to get in that table.
   * @param family The column family.
   * @since 1.5
   */
  public GetRequest(final byte[] table,
                    final byte[] key,
                    final byte[] family) {
    super(table, key);
    this.family(family);
    this.bufferable = false; //don't buffer get request
  }

  /**
   * Constructor.
   * @param table The non-empty name of the table to use.
   * @param key The row key to get in that table.
   * @param family The column family.
   * @since 1.5
   */
  public GetRequest(final String table,
                    final String key,
                    final String family