Using Non-Primitive Types in Flink Broadcast State

This article covers how to use broadcast state in Flink stateful stream processing when basic types are not enough: how to declare and use a HashMap as the broadcast state value. A worked example consumes two Kafka topics, broadcasts one of them, and joins the other stream against it to enrich its records.


Background

Stateful programming comes up constantly in Flink, and broadcast state is one of its common forms. Broadcast state is key-value state, and in day-to-day use the keys and values are usually basic types (String, Boolean, Byte, Short, Int, Long, Float, Double, Char, Date, Void, BigInteger, BigDecimal, Instant, and so on). Taking both K and V as String, the descriptor is declared as follows:

MapStateDescriptor<String, String> mapStateDescriptor = new MapStateDescriptor<String, String>("testMapState", BasicTypeInfo.STRING_TYPE_INFO, BasicTypeInfo.STRING_TYPE_INFO);
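As a side note, the same declaration can be written more compactly with the Types helper class (org.apache.flink.api.common.typeinfo.Types) and the diamond operator; Types.STRING is just a shorthand for BasicTypeInfo.STRING_TYPE_INFO:

MapStateDescriptor<String, String> mapStateDescriptor = new MapStateDescriptor<>("testMapState", Types.STRING, Types.STRING);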

In this project, basic types could no longer cover the business scenario. After some digging, it turned out that other types, such as HashMap, can also be used in broadcast state; when declaring the descriptor, only the type declaration needs adjusting:

MapStateDescriptor<String, HashMap> mapMapStateDescriptor = new MapStateDescriptor<String, HashMap>("testMapMapState", BasicTypeInfo.STRING_TYPE_INFO, TypeInformation.of(new TypeHint<HashMap>() {
    @Override
    public TypeInformation<HashMap> getTypeInfo() {
        return super.getTypeInfo();
    }
}));

Of course, the override above only delegates to the parent class's method, so it can be dropped entirely; simplified:

MapStateDescriptor<String, HashMap> mapMapStateDescriptor = new MapStateDescriptor<String, HashMap>("testMapMapState", BasicTypeInfo.STRING_TYPE_INFO, TypeInformation.of(new TypeHint<HashMap>() {}));
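One caveat: with a raw HashMap hint, Flink cannot extract the map's element types and falls back to its generic (Kryo) serializer for the value. If the key and value types are fixed, the built-in Types.MAP factory yields a dedicated map serializer instead. A variant of the declaration above, not used in the example that follows (typedDescriptor is an illustrative name; it needs imports for java.util.Map and org.apache.flink.api.common.typeinfo.Types):

MapStateDescriptor<String, Map<String, String>> typedDescriptor = new MapStateDescriptor<>(
        "testMapMapState",
        Types.STRING,
        Types.MAP(Types.STRING, Types.STRING)); // a proper map type instead of a generic Kryo type

The example below sticks with the raw HashMap form, since that is what the project used.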

Reference: Apache Flink 1.12 Documentation: The Broadcast State Pattern

Example

The following walks through a concrete example of using HashMap as the broadcast state value.

A Flink DataStream job consumes two Kafka topics, forming two streams with the following record formats:

topic1:{"name":"zhangsan","province":"anhui","city":"hefei"}

topic2:{"province":"anhui","city":"hefei","address":"rongchuang"}

topic1 -> stream1, topic2 -> stream2

The topic2 records are broadcast; each topic1 record is matched against the broadcast state to look up its address (the kind field on topic2 records, "add" or "delete", drives how the state is updated). The lookup logic is not meant to be rigorous, only to exercise the feature. Note also that Flink gives no ordering guarantee between the two streams, so a topic1 record that arrives before its matching broadcast record will simply find no match.
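With the two sample records above (broadcast record arriving first), the job should emit roughly:

{"name":"zhangsan","province":"anhui","city":"hefei","address":"rongchuang"}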

The complete implementation follows:

package flinkbroadcasttest;

import flinkbroadcasttest.processfunction.FlinkBroadcastTestProcess;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.api.common.state.MapStateDescriptor;
import org.apache.flink.api.common.typeinfo.BasicTypeInfo;
import org.apache.flink.api.common.typeinfo.TypeHint;
import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.streaming.api.datastream.BroadcastConnectedStream;
import org.apache.flink.streaming.api.datastream.BroadcastStream;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;

import java.util.HashMap;
import java.util.Properties;

public class FlinkBroadcastTest {
    public static void main(String[] args) throws Exception {
        final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // disable operator chaining globally
        env.disableOperatorChaining();

        String brokers = "kafka-log1.test.xl.com:9092,kafka-log2.test.xl.com:9092,kafka-log3.test.xl.com:9092";
        String topic1 = "0000-topic1";
        String topic2 = "0000-topic2";
        String groupId = "demo";

        Properties props = new Properties();
        props.setProperty("bootstrap.servers", brokers);
        props.setProperty("group.id", groupId);
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("auto.offset.reset", "earliest");
        props.put("max.poll.records", 1000);
        props.put("session.timeout.ms", 90000);
        props.put("request.timeout.ms", 120000);
        props.put("enable.auto.commit", true);
        props.put("auto.commit.interval.ms", 100);

        // consumer for topic1: the main (non-broadcast) stream
        FlinkKafkaConsumer<String> consumer1 = new FlinkKafkaConsumer<String>(topic1, new SimpleStringSchema(), props);
        consumer1.setCommitOffsetsOnCheckpoints(true);
        DataStream<String> data1KafkaDataDS = env.addSource(consumer1);

        // consumer for topic2: the stream to be broadcast
        FlinkKafkaConsumer<String> consumer2 = new FlinkKafkaConsumer<String>(topic2, new SimpleStringSchema(), props);
        consumer2.setCommitOffsetsOnCheckpoints(true);
        DataStream<String> data2KafkaDataDS = env.addSource(consumer2);
        
        // descriptor for the broadcast state: province -> (city -> address), value typed as a raw HashMap
        MapStateDescriptor<String, HashMap> mapMapStateDescriptor = new MapStateDescriptor<String, HashMap>("testMapMapState", BasicTypeInfo.STRING_TYPE_INFO, TypeInformation.of(new TypeHint<HashMap>() {}));
        // broadcast topic2's stream, connect topic1's stream to it, and process both sides
        BroadcastStream<String> broadcast = data2KafkaDataDS.broadcast(mapMapStateDescriptor);
        BroadcastConnectedStream<String, String> connect = data1KafkaDataDS.connect(broadcast);
        DataStream<String> result = connect.process(new FlinkBroadcastTestProcess());
        result.print();

        env.execute("FlinkBroadcastTest");
    }
}
The BroadcastProcessFunction implementation:

package flinkbroadcasttest.processfunction;

import com.alibaba.fastjson2.JSON;
import com.alibaba.fastjson2.JSONObject;
import org.apache.flink.api.common.state.BroadcastState;
import org.apache.flink.api.common.state.MapStateDescriptor;
import org.apache.flink.api.common.state.ReadOnlyBroadcastState;
import org.apache.flink.api.common.typeinfo.BasicTypeInfo;
import org.apache.flink.api.common.typeinfo.TypeHint;
import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.streaming.api.functions.co.BroadcastProcessFunction;
import org.apache.flink.util.Collector;

import java.util.HashMap;

public class FlinkBroadcastTestProcess extends BroadcastProcessFunction<String, String, String> {

    // must match (name and types) the descriptor the stream was broadcast with
    MapStateDescriptor<String, HashMap> mapMapStateDescriptor = new MapStateDescriptor<String, HashMap>("testMapMapState", BasicTypeInfo.STRING_TYPE_INFO, TypeInformation.of(new TypeHint<HashMap>() {}));

    @Override
    public void processElement(String value, ReadOnlyContext ctx, Collector<String> out) throws Exception {
        try {
            // read-only view of the broadcast state on the non-broadcast side
            ReadOnlyBroadcastState<String, HashMap> broadcastState = ctx.getBroadcastState(mapMapStateDescriptor);
            JSONObject obj = JSON.parseObject(value);
            String name = obj.getString("name");
            String province = obj.getString("province");
            String city = obj.getString("city");
            HashMap hashMap = broadcastState.get(province);
            if (hashMap != null && hashMap.containsKey(city)) {
                String address = hashMap.get(city).toString();
                System.out.println(address);
                JSONObject object = new JSONObject();
                object.put("name", name);
                object.put("province", province);
                object.put("city", city);
                object.put("address", address);
                out.collect(object.toString());
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    @Override
    public void processBroadcastElement(String value, Context ctx, Collector<String> out) throws Exception {
        try {
            // writable broadcast state: province -> (city -> address)
            BroadcastState<String, HashMap> broadcastState = ctx.getBroadcastState(mapMapStateDescriptor);
            JSONObject obj = JSON.parseObject(value);
            String province = obj.getString("province");
            String city = obj.getString("city");
            String address = obj.getString("address");
            String kind = obj.getString("kind");
            HashMap hashMap = broadcastState.get(province);
            if (kind.equals("delete")) {
                if (hashMap != null && hashMap.containsKey(city)) {
                    hashMap.remove(city);
                    broadcastState.put(province, hashMap);
                }
            } else if (kind.equals("add")) {
                if (hashMap == null) {
                    hashMap = new HashMap();
                }
                hashMap.put(city, address);
                broadcastState.put(province, hashMap);
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
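For a quick functional test, the two sample records can be published with the plain Kafka producer client. A minimal driver sketch, assuming the brokers and topic names from the example are reachable (imports: java.util.Properties, org.apache.kafka.clients.producer.KafkaProducer, org.apache.kafka.clients.producer.ProducerRecord); the broadcast record is sent first so the state is populated before the lookup:

Properties p = new Properties();
p.put("bootstrap.servers", "kafka-log1.test.xl.com:9092");
p.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
p.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
try (KafkaProducer<String, String> producer = new KafkaProducer<>(p)) {
    // broadcast side first: adds hefei -> rongchuang under province anhui
    producer.send(new ProducerRecord<>("0000-topic2",
            "{\"province\":\"anhui\",\"city\":\"hefei\",\"address\":\"rongchuang\",\"kind\":\"add\"}"));
    // main stream record: should come out enriched with the address
    producer.send(new ProducerRecord<>("0000-topic1",
            "{\"name\":\"zhangsan\",\"province\":\"anhui\",\"city\":\"hefei\"}"));
}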

Wrap-up: creating and using a broadcast stream

The point of a broadcast stream in Flink is to deliver rarely changing data (configuration, static dictionaries, and so on) to all parallel task instances, so that every instance can read the latest broadcast state locally. The basic steps:

1. Prepare the broadcast data. This is typically a collection, a file, or an external source such as the Kafka topic used above.

2. Create the broadcast state descriptor. In Flink this is a MapStateDescriptor, which takes a name plus the key and value types:

```java
MapStateDescriptor<String, String> broadcastStateDescriptor =
        new MapStateDescriptor<>("broadcast-state", Types.STRING, Types.STRING);
```

3. Build the broadcast stream by calling broadcast(...) on the source stream with that descriptor:

```java
DataStream<String> sourceStream = ...; // the stream carrying the data to broadcast
BroadcastStream<String> broadcastStream = sourceStream.broadcast(broadcastStateDescriptor);
```

4. Connect the broadcast stream with the main stream via connect, and handle both sides in a BroadcastProcessFunction: processElement receives main-stream records, processBroadcastElement receives broadcast records:

```java
DataStream<String> mainStream = ...; // the non-broadcast stream
BroadcastConnectedStream<String, String> connected = mainStream.connect(broadcastStream);
DataStream<String> resultStream = connected.process(new MyBroadcastProcessFunction());
```

5. Launch the job through the execution environment:

```java
env.execute("broadcast-stream-example");
```

At runtime every parallel instance keeps its own local copy of the broadcast state, and that state can be updated on the fly as new broadcast records arrive, without redeploying the job.