DataStream API: Event Time (Generating Watermarks)

曹木青芸

Last modified: 2023-01-27 22:54:16

License: CC 4.0 BY-SA

Column: Flink official documentation translation - DataStream API. Tags: java, programming languages

First published: 2022-09-02 17:27:04

Article link: https://blog.csdn.net/weixin_48813624/article/details/126622978

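This article walks through event-time handling in Apache Flink: how watermarks are generated, the available watermark strategies and how to use them, and how to deal with idle sources. It focuses on the roles of TimestampAssigner and WatermarkGenerator and on how to write periodic and punctuated watermark generators. It also covers watermark alignment, which addresses sources that progress at different speeds, and the older AssignerWithPeriodicWatermarks and AssignerWithPunctuatedWatermarks interfaces.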

Generating Watermarks

In this section you will learn about the APIs that Flink provides for working with event time timestamps and watermarks. For an introduction to event time, processing time, and ingestion time, please refer to the introduction to event time.

Introduction to Watermark Strategies

In order to work with event time, Flink needs to know the events’ timestamps, meaning each element in the stream needs to have its event timestamp assigned. This is usually done by accessing/extracting the timestamp from some field in the element by using a TimestampAssigner.

Timestamp assignment goes hand-in-hand with generating watermarks, which tell the system about progress in event time. You can configure this by specifying a WatermarkGenerator.

The Flink API expects a WatermarkStrategy that contains both a TimestampAssigner and WatermarkGenerator. A number of common strategies are available out of the box as static methods on WatermarkStrategy, but users can also build their own strategies when required.

Here is the interface for completeness’ sake:

public interface WatermarkStrategy<T>
        extends TimestampAssignerSupplier<T>, WatermarkGeneratorSupplier<T> {

    /**
     * Instantiates a {@link TimestampAssigner} for assigning timestamps according to this
     * strategy.
     */
    @Override
    TimestampAssigner<T> createTimestampAssigner(TimestampAssignerSupplier.Context context);

    /**
     * Instantiates a WatermarkGenerator that generates watermarks according to this strategy.
     */
    @Override
    WatermarkGenerator<T> createWatermarkGenerator(WatermarkGeneratorSupplier.Context context);
}

As mentioned, you usually don’t implement this interface yourself but use the static helper methods on WatermarkStrategy for common watermark strategies or to bundle together a custom TimestampAssigner with a WatermarkGenerator. For example, to use bounded-out-of-orderness watermarks and a lambda function as a timestamp assigner you use this:

WatermarkStrategy
        .<Tuple2<Long, String>>forBoundedOutOfOrderness(Duration.ofSeconds(20))
        .withTimestampAssigner((event, timestamp) -> event.f0);

Specifying a TimestampAssigner is optional and in most cases you don’t actually want to specify one. For example, when using Kafka or Kinesis you would get timestamps directly from the Kafka/Kinesis records.

We will look at the WatermarkGenerator interface later in Writing WatermarkGenerators.

Attention: Both timestamps and watermarks are specified as milliseconds since the Java epoch of 1970-01-01T00:00:00Z.
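
As a small illustration of this unit, the following standalone sketch converts ISO-8601 instants to epoch milliseconds using java.time. The class and method names are hypothetical and not part of Flink; a value of this kind is what a TimestampAssigner would be expected to return.

```java
final class EpochMillisSketch {

    /** Parses an ISO-8601 instant and returns milliseconds since 1970-01-01T00:00:00Z. */
    static long toEpochMillis(String isoInstant) {
        return java.time.Instant.parse(isoInstant).toEpochMilli();
    }

    public static void main(String[] args) {
        System.out.println(toEpochMillis("1970-01-01T00:00:00Z")); // prints 0
        System.out.println(toEpochMillis("1970-01-01T00:00:20Z")); // prints 20000
    }
}
```

If the source data carries timestamps in seconds rather than milliseconds, they would need to be multiplied by 1000 before being handed to Flink.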

Using Watermark Strategies

There are two places in Flink applications where a WatermarkStrategy can be used: 1) directly on sources and 2) after a non-source operation.

The first option is preferable, because it allows sources to exploit knowledge about shards/partitions/splits in the watermarking logic. Sources can usually then track watermarks at a finer level and the overall watermark produced by a source will be more accurate. Specifying a WatermarkStrategy directly on the source usually means you have to use a source-specific interface. Refer to Watermark Strategies and the Kafka Connector for how this works on a Kafka Connector and for more details about how per-partition watermarking works there.

The second option (setting a WatermarkStrategy after arbitrary operations) should only be used if you cannot set a strategy directly on the source:

final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

DataStream<MyEvent> stream = env.readFile(
        myFormat, myFilePath, FileProcessingMode.PROCESS_CONTINUOUSLY, 100,
        FilePathFilter.createDefaultFilter(), typeInfo);

DataStream<MyEvent> withTimestampsAndWatermarks = stream
        .filter( event -> event.severity() == WARNING )
        .assignTimestampsAndWatermarks(<watermark strategy>);

withTimestampsAndWatermarks
        .keyBy( (event) -> event.getGroup() )
        .window(TumblingEventTimeWindows.of(Time.seconds(10)))
        .reduce( (a, b) -> a.add(b) )
        .addSink(...);

Using a WatermarkStrategy this way takes a stream and produces a new stream with timestamped elements and watermarks. If the original stream already had timestamps and/or watermarks, the timestamp assigner overwrites them.

Dealing With Idle Sources

If one of the input splits/partitions/shards does not carry events for a while, the WatermarkGenerator does not get any new information on which to base a watermark. We call this an idle input or an idle source. This is a problem because some of your other partitions may still carry events. In that case, the watermark will be held back, because it is computed as the minimum over all the different parallel watermarks.

To deal with this, you can use a WatermarkStrategy that will detect idleness and mark an input as idle. WatermarkStrategy provides a convenience helper for this:

WatermarkStrategy
        .<Tuple2<Long, String>>forBoundedOutOfOrderness(Duration.ofSeconds(20))
        .withIdleness(Duration.ofMinutes(1));
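
To make the effect of idleness concrete, here is a minimal standalone model (plain Java, not Flink code) of how an overall watermark can be derived as the minimum over parallel inputs, with idle inputs excluded. The class name, the method name, and the choice to return Long.MIN_VALUE when every input is idle are assumptions made for this sketch.

```java
final class CombinedWatermarkSketch {

    /**
     * Returns the minimum watermark over all inputs that are not marked idle.
     * Assumption of this sketch: if every input is idle, Long.MIN_VALUE is returned.
     */
    static long combine(long[] watermarks, boolean[] idle) {
        long min = Long.MAX_VALUE;
        boolean anyActive = false;
        for (int i = 0; i < watermarks.length; i++) {
            if (!idle[i]) {
                min = Math.min(min, watermarks[i]);
                anyActive = true;
            }
        }
        return anyActive ? min : Long.MIN_VALUE;
    }

    public static void main(String[] args) {
        long[] watermarks = {10_000L, 2_000L, 9_500L}; // input 1 stopped receiving events
        // Without idleness detection the stalled input holds everything back.
        System.out.println(combine(watermarks, new boolean[] {false, false, false})); // prints 2000
        // Once input 1 is marked idle, the remaining inputs determine the watermark.
        System.out.println(combine(watermarks, new boolean[] {false, true, false})); // prints 9500
    }
}
```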

Watermark Alignment (Beta)

In the previous section we discussed a situation in which splits/partitions/shards or sources are idle and can stall increasing watermarks. On the other side of the spectrum, a split/partition/shard or source may process records very fast and in turn increase its watermark relatively faster than the others. This is not a problem in itself. However, for downstream operators that use watermarks to emit data, it can become one.

In this case, contrary to idle sources, the watermark of such a downstream operator (like windowed joins on aggregations) can progress. However, such an operator might need to buffer an excessive amount of data coming from the fast inputs, as the minimal watermark from all of its inputs is held back by the lagging input. All records emitted by the fast input will hence have to be buffered in the downstream operator’s state, which can lead to uncontrollable growth of that state.

In order to address the issue, you can enable watermark alignment, which will make sure no sources/splits/shards/partitions increase their watermarks too far ahead of the rest. You can enable alignment for every source separately:

WatermarkStrategy
        .<Tuple2<Long, String>>forBoundedOutOfOrderness(Duration.ofSeconds(20))
        .withWatermarkAlignment("alignment-group-1", Duration.ofSeconds(20), Duration.ofSeconds(1));

Note: You can enable watermark alignment only for FLIP-27 sources. It does not work for legacy sources or if applied after the source via DataStream#assignTimestampsAndWatermarks.

When enabling the alignment, you need to tell Flink which group the source belongs to. You do that by providing a label (e.g. alignment-group-1) which binds together all sources that share it. Moreover, you have to specify the maximal drift from the current minimal watermark across all sources belonging to that group. The third parameter describes how often the current maximal watermark should be updated. The downside of frequent updates is that there will be more RPC messages travelling between the TaskManagers (TMs) and the JobManager (JM).

In order to achieve the alignment, Flink will pause consuming from the source/task that generated a watermark too far into the future. In the meantime it will continue reading records from other sources/tasks, which can move the combined watermark forward and in that way unblock the faster one.
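
The pause decision described above can be modelled in a few lines. This is a simplified standalone sketch, not Flink's implementation: the names are hypothetical, and it assumes a source is paused exactly when its watermark exceeds the group's minimum watermark plus the configured maximal drift.

```java
final class WatermarkAlignmentSketch {

    /** Minimum watermark across all sources in one alignment group (assumes at least one source). */
    static long groupMin(long[] sourceWatermarks) {
        long min = Long.MAX_VALUE;
        for (long wm : sourceWatermarks) {
            min = Math.min(min, wm);
        }
        return min;
    }

    /** A source is paused when it is more than maxDriftMillis ahead of the group's minimum. */
    static boolean shouldPause(long sourceWatermark, long groupMinWatermark, long maxDriftMillis) {
        return sourceWatermark > groupMinWatermark + maxDriftMillis;
    }

    public static void main(String[] args) {
        long[] group = {100_000L, 125_000L, 115_000L};
        long min = groupMin(group);   // 100000
        long maxDrift = 20_000L;      // corresponds to Duration.ofSeconds(20)
        System.out.println(shouldPause(125_000L, min, maxDrift)); // prints true
        System.out.println(shouldPause(115_000L, min, maxDrift)); // prints false
    }
}
```

In this model the paused source becomes eligible again as soon as the slowest source advances far enough that the difference falls back within the drift.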

Note: As of 1.15, Flink supports aligning across tasks of the same source and/or different sources. It does not support aligning splits/partitions/shards in the same task.

If, for example, two Kafka partitions that produce watermarks at a different pace are assigned to the same task, the watermark might not behave as expected. Fortunately, in the worst case it should not perform worse than without alignment.

Given the limitation above, we suggest applying watermark alignment in two situations:
1. You have two different sources (e.g. Kafka and File) that produce watermarks at different speeds.
2. You run your source with parallelism equal to the number of splits/shards/partitions, which results in every subtask being assigned a single unit of work.

Writing WatermarkGenerators

A TimestampAssigner is a simple function that extracts a field from an event, so we don’t need to look at it in detail. A WatermarkGenerator, on the other hand, is a bit more complicated to write and we will look at how you can do that in the next two sections. This is the WatermarkGenerator interface:

/**
 * The {@code WatermarkGenerator} generates watermarks either based on events or
 * periodically (in a fixed interval).
 * 
 * <p><b>Note:</b> This WatermarkGenerator subsumes the previous distinction between the
 * {@code AssignerWithPunctuatedWatermarks} and the {@code AssignerWithPeriodicWatermarks}.
 */ 
@Public
public interface WatermarkGenerator<T> {

    /**
     * Called for every event, allows the watermark generator to examine 
     * and remember the event timestamps, or to emit a watermark based on
     * the event itself.
     */
    void onEvent(T event, long eventTimestamp, WatermarkOutput output);

    /**
     * Called periodically, and might emit a new watermark, or not.
     * 
     * <p>The interval in which this method is called and Watermarks 
     * are generated depends on {@link ExecutionConfig#getAutoWatermarkInterval()}.
     */
    void onPeriodicEmit(WatermarkOutput output);
}

There are two different styles of watermark generation: periodic and punctuated.

A periodic generator usually observes the incoming events via onEvent() and then emits a watermark when the framework calls onPeriodicEmit().

A punctuated generator will look at events in onEvent() and wait for special marker events or punctuations that carry watermark information in the stream. When it sees one of these events it emits a watermark immediately. Usually, punctuated generators don’t emit a watermark from onPeriodicEmit().

We will look at how to implement generators for each style next.

Writing a Periodic WatermarkGenerator

A periodic generator observes stream events and generates watermarks periodically (possibly depending on the stream elements, or purely based on processing time).

The interval (every n milliseconds) in which the watermark will be generated is defined via ExecutionConfig.setAutoWatermarkInterval(…). The generator’s onPeriodicEmit() method will be called each time, and a new watermark will be emitted if the watermark passed to the WatermarkOutput is larger than the previous watermark.

Here we show two simple examples of watermark generators that use periodic watermark generation. Note that Flink ships with BoundedOutOfOrdernessWatermarks, which is a WatermarkGenerator that works similarly to the BoundedOutOfOrdernessGenerator shown below. You can read about using that here.

/**
 * This generator generates watermarks assuming that elements arrive out of order,
 * but only to a certain degree. The latest elements for a certain timestamp t will arrive
 * at most n milliseconds after the earliest elements for timestamp t.
 */
public class BoundedOutOfOrdernessGenerator implements WatermarkGenerator<MyEvent> {

    private final long maxOutOfOrderness = 3500; // 3.5 seconds

    private long currentMaxTimestamp;

    @Override
    public void onEvent(MyEvent event, long eventTimestamp, WatermarkOutput output) {
        currentMaxTimestamp = Math.max(currentMaxTimestamp, eventTimestamp);
    }

    @Override
    public void onPeriodicEmit(WatermarkOutput output) {
        // emit the watermark as current highest timestamp minus the out-of-orderness bound
        output.emitWatermark(new Watermark(currentMaxTimestamp - maxOutOfOrderness - 1));
    }

}

/**
 * This generator generates watermarks that are lagging behind processing time 
 * by a fixed amount. It assumes that elements arrive in Flink after a bounded delay.
 */
public class TimeLagWatermarkGenerator implements WatermarkGenerator<MyEvent> {

    private final long maxTimeLag = 5000; // 5 seconds

    @Override
    public void onEvent(MyEvent event, long eventTimestamp, WatermarkOutput output) {
        // don't need to do anything because we work on processing time
    }

    @Override
    public void onPeriodicEmit(WatermarkOutput output) {
        output.emitWatermark(new Watermark(System.currentTimeMillis() - maxTimeLag));
    }
}
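
To see how the arithmetic in BoundedOutOfOrdernessGenerator above plays out on concrete input, the following standalone sketch replays a short sequence of event timestamps and records the watermark value that the formula currentMaxTimestamp - maxOutOfOrderness - 1 would yield after each event. It is plain Java without Flink types; the class and method names are hypothetical, and it assumes non-negative timestamps with the maximum starting at 0, as the field default does in the class above.

```java
final class BoundedOutOfOrdernessTrace {

    /** Returns, for each event in arrival order, the watermark value after that event was observed. */
    static long[] watermarksAfterEach(long[] eventTimestamps, long maxOutOfOrderness) {
        long[] result = new long[eventTimestamps.length];
        long currentMaxTimestamp = 0L; // same starting point as the uninitialized field above
        for (int i = 0; i < eventTimestamps.length; i++) {
            currentMaxTimestamp = Math.max(currentMaxTimestamp, eventTimestamps[i]);
            result[i] = currentMaxTimestamp - maxOutOfOrderness - 1;
        }
        return result;
    }

    public static void main(String[] args) {
        long[] timestamps = {10_000L, 12_000L, 11_000L, 15_000L}; // 11000 arrives out of order
        for (long wm : watermarksAfterEach(timestamps, 3_500L)) {
            System.out.println(wm); // prints 6499, 8499, 8499, 11499
        }
    }
}
```

The out-of-order event with timestamp 11000 does not move the watermark, and it is still ahead of the watermark 8499, so it would not be considered late.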

Writing a Punctuated WatermarkGenerator

A punctuated watermark generator will observe the stream of events and emit a watermark whenever it sees a special element that carries watermark information.

This is how you can implement a punctuated generator that emits a watermark whenever an event indicates that it carries a certain marker:

public class PunctuatedAssigner implements WatermarkGenerator<MyEvent> {

    @Override
    public void onEvent(MyEvent event, long eventTimestamp, WatermarkOutput output) {
        if (event.hasWatermarkMarker()) {
            output.emitWatermark(new Watermark(event.getWatermarkTimestamp()));
        }
    }

    @Override
    public void onPeriodicEmit(WatermarkOutput output) {
        // don't need to do anything because we emit in reaction to events above
    }
}

Note: It is possible to generate a watermark on every single event. However, because each watermark causes some computation downstream, an excessive number of watermarks degrades performance.
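
One way to keep the number of watermarks down is to emit only on marker events whose timestamp actually advances the watermark. The standalone sketch below (plain Java, hypothetical names, not a Flink API) counts how many watermarks such a policy would emit for a given sequence of events.

```java
final class PunctuatedEmissionSketch {

    /**
     * Walks events in arrival order and counts the watermarks emitted when only
     * marker events with a timestamp above the last emitted watermark trigger an emission.
     */
    static int countEmissions(long[] timestamps, boolean[] isMarker) {
        long lastEmitted = Long.MIN_VALUE;
        int emitted = 0;
        for (int i = 0; i < timestamps.length; i++) {
            if (isMarker[i] && timestamps[i] > lastEmitted) {
                lastEmitted = timestamps[i];
                emitted++;
            }
        }
        return emitted;
    }

    public static void main(String[] args) {
        long[] timestamps = {1_000L, 3_000L, 2_000L, 2_500L, 4_000L};
        boolean[] isMarker = {false, true, false, true, true};
        // Markers at 3000 and 4000 advance the watermark; the marker at 2500 does not.
        System.out.println(countEmissions(timestamps, isMarker)); // prints 2
    }
}
```

Here five events produce two watermarks instead of five, which is the kind of reduction the note above is asking for.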

Watermark Strategies and the Kafka Connector

When using Apache Kafka as a data source, each Kafka partition may have a simple event time pattern (ascending timestamps or bounded out-of-orderness). However, when consuming streams from Kafka, multiple partitions often get consumed in parallel, interleaving the events from the partitions and destroying the per-partition patterns (this is inherent in how Kafka’s consumer clients work).

In that case, you can use Flink’s Kafka-partition-aware watermark generation. Using that feature, watermarks are generated inside the Kafka consumer, per Kafka partition, and the per-partition watermarks are merged in the same way as watermarks are merged on stream shuffles.

For example, if event timestamps are strictly ascending per Kafka partition, generating per-partition watermarks with the ascending timestamps watermark generator will result in perfect overall watermarks. Note that we don’t provide a TimestampAssigner in the example; the timestamps of the Kafka records themselves will be used instead.

The illustrations below show how to use the per-Kafka-partition watermark generation, and how watermarks propagate through the streaming dataflow in that case.

FlinkKafkaConsumer<MyType> kafkaSource = new FlinkKafkaConsumer<>("myTopic", schema, props);
kafkaSource.assignTimestampsAndWatermarks(
        WatermarkStrategy
                .forBoundedOutOfOrderness(Duration.ofSeconds(20)));

DataStream<MyType> stream = env.addSource(kafkaSource);

Figure: Generating watermarks with awareness of Kafka partitions
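
Why per-partition generation matters can be shown with a small standalone model. The sketch below is plain Java with hypothetical names, not connector code. It assumes an ascending-timestamps rule of "maximum timestamp seen minus 1" and compares a single generator over the interleaved stream with per-partition generators merged by taking the minimum.

```java
final class PerPartitionWatermarkSketch {

    /** Watermark of one generator that sees the interleaved stream: max timestamp - 1. */
    static long singleGeneratorWatermark(long[] timestamps) {
        long max = Long.MIN_VALUE;
        for (long ts : timestamps) {
            max = Math.max(max, ts);
        }
        return max == Long.MIN_VALUE ? Long.MIN_VALUE : max - 1;
    }

    /** Per-partition watermark (max - 1 per partition), merged via minimum; assumes numPartitions >= 1. */
    static long perPartitionWatermark(int[] partitions, long[] timestamps, int numPartitions) {
        long[] maxPerPartition = new long[numPartitions];
        java.util.Arrays.fill(maxPerPartition, Long.MIN_VALUE);
        for (int i = 0; i < timestamps.length; i++) {
            int p = partitions[i];
            maxPerPartition[p] = Math.max(maxPerPartition[p], timestamps[i]);
        }
        long merged = Long.MAX_VALUE;
        for (long max : maxPerPartition) {
            long wm = (max == Long.MIN_VALUE) ? Long.MIN_VALUE : max - 1; // guard against underflow
            merged = Math.min(merged, wm);
        }
        return merged;
    }

    public static void main(String[] args) {
        // Each partition is ascending on its own, but the interleaved stream is not.
        int[] partitions = {0, 0, 1, 0, 1};
        long[] timestamps = {1_000L, 2_000L, 500L, 3_000L, 600L};
        System.out.println(singleGeneratorWatermark(timestamps));             // prints 2999
        System.out.println(perPartitionWatermark(partitions, timestamps, 2)); // prints 599
    }
}
```

With the single generator, the next event from partition 1 (for instance at timestamp 700) would already be behind the watermark 2999, whereas the merged per-partition watermark 599 still leaves room for it.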

How Operators Process Watermarks

As a general rule, operators are required to completely process a given watermark before forwarding it downstream. For example, WindowOperator will first evaluate all windows that should be fired, and only after producing all of the output triggered by the watermark will the watermark itself be sent downstream. In other words, all elements produced due to occurrence of a watermark will be emitted before the watermark.

The same rule applies to TwoInputStreamOperator. However, in this case the current watermark of the operator is defined as the minimum of both of its inputs.
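
The two-input rule can be sketched as a small state machine. The class below is a standalone model in plain Java, not the Flink operator: its method names only echo processWatermark1 and processWatermark2, and the boolean return value (whether the combined watermark advanced and would be forwarded) is an assumption made for illustration.

```java
final class TwoInputWatermarkSketch {

    private long input1Watermark = Long.MIN_VALUE;
    private long input2Watermark = Long.MIN_VALUE;
    private long combinedWatermark = Long.MIN_VALUE;

    /** Records a watermark from input 1; returns true if the combined watermark advanced. */
    boolean processWatermark1(long watermark) {
        input1Watermark = watermark;
        return advance();
    }

    /** Records a watermark from input 2; returns true if the combined watermark advanced. */
    boolean processWatermark2(long watermark) {
        input2Watermark = watermark;
        return advance();
    }

    long currentWatermark() {
        return combinedWatermark;
    }

    // The operator's watermark is the minimum of both inputs and never moves backwards.
    private boolean advance() {
        long newMin = Math.min(input1Watermark, input2Watermark);
        if (newMin > combinedWatermark) {
            combinedWatermark = newMin;
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        TwoInputWatermarkSketch op = new TwoInputWatermarkSketch();
        System.out.println(op.processWatermark1(5_000L)); // prints false: input 2 has not reported yet
        System.out.println(op.processWatermark2(3_000L)); // prints true: combined watermark is now 3000
        System.out.println(op.processWatermark2(7_000L)); // prints true: combined watermark is now 5000
        System.out.println(op.currentWatermark());        // prints 5000
    }
}
```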

The details of this behavior are defined by the implementations of the OneInputStreamOperator#processWatermark, TwoInputStreamOperator#processWatermark1 and TwoInputStreamOperator#processWatermark2 methods.

The Deprecated AssignerWithPeriodicWatermarks and AssignerWithPunctuatedWatermarks

Prior to introducing the current abstraction of WatermarkStrategy, TimestampAssigner, and WatermarkGenerator, Flink used AssignerWithPeriodicWatermarks and AssignerWithPunctuatedWatermarks. You will still see them in the API but it is recommended to use the new interfaces because they offer a clearer separation of concerns and also unify periodic and punctuated styles of watermark generation.