The DataStream and DataSet APIs are used in the same way; each starts from its own execution environment:
// DataStream
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
// DataSet
ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
Flink provides a large number of ready-made operators for DataStream (a small sketch follows this list):
- Map: takes one element and returns one element; cleansing and transformation can be done in between.
- FlatMap: takes one element and can return 0, 1, or many elements.
- Filter: evaluates each incoming element and keeps only those that match the condition.
- KeyBy: groups the stream by the given key; elements with the same key go to the same partition.
  KeyBy has two typical usages:
  (1) DataStream.keyBy("someKey") uses the someKey field of the object as the grouping key.
  (2) DataStream.keyBy(0) uses the first element of a Tuple as the grouping key.
- Reduce: aggregates the data by combining the current element with the value returned by the previous Reduce call, and returns a new value.
- Aggregations: sum(), min(), max(), etc.
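The sketch below shows how several of these operators chain together with the same Flink 1.7 Java API used in the rest of this post. The socket source on localhost:9999 and the word-count logic are assumptions made purely for illustration; any DataStream<String> could take their place.
import org.apache.flink.api.common.functions.FilterFunction;
import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.common.functions.ReduceFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.util.Collector;

public class OperatorOverviewDemo {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // assumed source for illustration: one line of text per socket message
        DataStream<String> text = env.socketTextStream("localhost", 9999);

        DataStream<Tuple2<String, Integer>> counts = text
                // FlatMap: one input line can produce 0, 1 or many (word, 1) tuples
                .flatMap(new FlatMapFunction<String, Tuple2<String, Integer>>() {
                    public void flatMap(String line, Collector<Tuple2<String, Integer>> out) {
                        for (String word : line.split("\\s+")) {
                            out.collect(new Tuple2<String, Integer>(word, 1));
                        }
                    }
                })
                // Filter: keep only non-empty words
                .filter(new FilterFunction<Tuple2<String, Integer>>() {
                    public boolean filter(Tuple2<String, Integer> t) {
                        return t.f0.length() > 0;
                    }
                })
                // KeyBy: group by the first tuple field (the word)
                .keyBy(0)
                // Reduce: combine the current element with the previous reduce result
                .reduce(new ReduceFunction<Tuple2<String, Integer>>() {
                    public Tuple2<String, Integer> reduce(Tuple2<String, Integer> a, Tuple2<String, Integer> b) {
                        return new Tuple2<String, Integer>(a.f0, a.f1 + b.f1);
                    }
                });

        counts.print();
        env.execute("OperatorOverviewDemo");
    }
}
Map is used the same way as FlatMap, except the MapFunction returns exactly one element per input.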
Dependencies
<!-- Flink core -->
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-java</artifactId>
    <version>1.7.2</version>
</dependency>
<!-- Flink streaming -->
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-streaming-java_2.12</artifactId>
    <version>1.7.2</version>
    <scope>provided</scope>
</dependency>
Example 1:
Approach:
Create a custom source (stream) that produces a new element every second
----- Implement the SourceFunction<> interface. Tip: the generic types of SourceFunction and SourceContext must be specified explicitly, otherwise an InvalidTypesException is thrown.
Create a custom partitioner that splits the stream into odd and even partitions
----- Implement the Partitioner<> interface.
Custom source
import org.apache.flink.streaming.api.functions.source.SourceFunction;

/**
 * SourceFunction<Long>: the Long type parameter
 * is the type of the elements this source emits.
 */
public class MySource implements SourceFunction<Long> {

    private long count = 1;
    // volatile so the flag set by cancel() from another thread is visible in run()
    private volatile boolean isRunning = true;

    /**
     * run() produces the data.
     * @param ctx source context used to emit elements
     * @throws Exception
     */
    public void run(SourceContext<Long> ctx) throws Exception {
        while (isRunning) {
            // collect() emits the produced element downstream
            ctx.collect(count);
            count++;
            Thread.sleep(1000);
        }
    }

    /**
     * Called when the source is cancelled; stops data production.
     */
    public void cancel() {
        isRunning = false;
    }
}
Custom partitioner
import org.apache.flink.api.common.functions.Partitioner;

/**
 * Partitioner<Long>: the Long type parameter
 * is the type of the key being partitioned on.
 */
public class MyPartitioner implements Partitioner<Long> {

    public int partition(Long key, int numPartitions) {
        System.out.println("There are " + numPartitions + " partitions in total");
        if (key % 2 == 0) {
            return 0;   // even keys go to partition 0
        } else {
            return 1;   // odd keys go to partition 1
        }
    }
}
Driver code
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.tuple.Tuple1;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class SelfPartitionDemo {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // parallelism 2: two parallel subtasks (threads) execute each operator
        env.setParallelism(2);

        // custom source MySource
        DataStreamSource<Long> data = env.addSource(new MySource());

        // MapFunction<Long, Tuple1<Long>>: input type is Long, output type is Tuple1<Long>
        // wraps each Long into a Tuple1<Long>
        SingleOutputStreamOperator<Tuple1<Long>> mapped = data.map(new MapFunction<Long, Tuple1<Long>>() {
            public Tuple1<Long> map(Long value) throws Exception {
                System.out.println("map1 thread ID: " + Thread.currentThread().getId() + ", value: " + value);
                return new Tuple1<Long>(value);
            }
        });

        // Tuple1<Long> has a single field, so only key index 0 is valid;
        // for a Tuple2 the key index could be 0 or 1 as needed
        DataStream<Tuple1<Long>> partitioned = mapped.partitionCustom(new MyPartitioner(), 0);

        // unwrap Tuple1<Long> back to Long
        SingleOutputStreamOperator<Long> result = partitioned.map(new MapFunction<Tuple1<Long>, Long>() {
            public Long map(Tuple1<Long> value) throws Exception {
                System.out.println("map2 thread ID: " + Thread.currentThread().getId() + ", value: " + value);
                return value.getField(0);
            }
        });

        result.print().setParallelism(1);
        env.execute("selfpartitionDemo");
    }
}
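When this runs, the map2 log lines should show all even values handled by one subtask and all odd values by the other (partitions 0 and 1 from MyPartitioner), while map1 alternates between the two subtasks; the exact thread IDs vary from run to run.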
Example 2:
import org.apache.flink.api.common.functions.ReduceFunction;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.operators.AggregateOperator;
import org.apache.flink.api.java.operators.DataSource;
import org.apache.flink.api.java.operators.ReduceOperator;
import org.apache.flink.api.java.operators.UnsortedGrouping;
import org.apache.flink.api.java.tuple.Tuple2;

import java.util.ArrayList;

public class DataSetTransformation {

    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        ArrayList<Tuple2<String, Integer>> listData = new ArrayList<Tuple2<String, Integer>>();
        listData.add(new Tuple2<String, Integer>("java", 1));
        listData.add(new Tuple2<String, Integer>("java", 1));
        listData.add(new Tuple2<String, Integer>("scala", 1));

        DataSource<Tuple2<String, Integer>> data = env.fromCollection(listData);

        // group by the first tuple field (the word)
        UnsortedGrouping<Tuple2<String, Integer>> grouped = data.groupBy(0);

        // sum() aggregation on the second tuple field (the count)
        AggregateOperator<Tuple2<String, Integer>> sum = grouped.sum(1);

        // reduce aggregation:
        // value1 is the initial element or the result of the previous reduce call
        // value2 is the next element to be combined with value1
        ReduceOperator<Tuple2<String, Integer>> reduced = grouped.reduce(new ReduceFunction<Tuple2<String, Integer>>() {
            public Tuple2<String, Integer> reduce(Tuple2<String, Integer> value1, Tuple2<String, Integer> value2) throws Exception {
                return new Tuple2<String, Integer>(value1.f0, value1.f1 + value2.f1);
            }
        });

        reduced.print();
        // sum.print();
    }
}
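For this input, reduced.print() and sum.print() should produce the same per-key totals, (java,2) and (scala,1); the order in which the two tuples are printed may vary.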