Environment used in this post:
Flink: 1.19 (I tried to stick to APIs that are not deprecated in 2.0)
Java: 17
Neo4j: 5.12.0 (the Java driver dependency is 5.19.0; the 5.12.0 driver has a bug that makes some classes show up red in the IDE)
Python: 3.10
There is remarkably little PyFlink material to be found online. If you are a PyFlink developer, feel free to message me so we can swap contacts and learn from each other.
Anyone who has read the PyFlink source code will have noticed that PyFlink implements almost no logic of its own: 99% of the code uses py4j's java_gateway to spin up a Java gateway that proxies reflective calls into JVM-side Java code.
So reading the PyFlink source, every file looks the same:
It is all wrappers around Java code. Even when debugging, the JVM process seems to get cut off (my guess), so execution never reaches the code you want to debug before an error is thrown, and that error has nothing to do with the one you get from a direct run.
Implementing a custom sink entirely in Python is close to impossible, unless you are prepared to reimplement a sizable chunk of the Flink framework in Python.
So I really cannot recommend PyFlink!!! Even if you are very fluent in Python, you will not figure PyFlink out without knowing Java: you can make simple API calls, but when something breaks you have no idea where to start, since much of the time you cannot even debug. And even if you get past creating the env, the call stack hits the Java wrappers within a couple of frames and you end up reading the Java source anyway. If you know Java that well, why use PyFlink at all? Unless, like me, your boss demands it, or you want to challenge your weak spot.
Now, the main event:
That's right: step one is implementing a Java version of the custom sink (I know, hilarious).
Below is my Neo4j sink; for now it only implements insert and update logic, with no deletes.
Milesian111/flink-neo4j-sink: my own Neo4j sink, continuously being improved.
legacy branch: implements RichSinkFunction on the old sink architecture, with two-phase commit, exactly-once, and so on.
v-0.1 branch: implements Sink on the new architecture (without SupportsCommitter or SupportsWriterState), so at-least-once semantics only.
v-0.2 branch: adds SupportsCommitter and SupportsWriterState on top of v-0.1; I hit some problems and have not had time to fix them, so do not use it.
for-pyflink branch: same approach as v-0.1, but with some redundant code so that PyFlink can work with it.
The old and new Flink sink architectures:
Old architecture:
- Extend SinkFunction/RichSinkFunction
- The operator call is addSink(new YourSink(param1, param2, ..., yourInterfaceImplement)), where yourInterfaceImplement is usually a lambda passed in as the interface implementation
- The old architecture was moved to the legacy module in 2.0
New architecture:
- v1: implement the Sink/StatefulSink interfaces; already removed in 2.0
- v2: implement the Sink interface (+SupportsWriterState +SupportsCommitter)
- The operator call is sinkTo(new YourSink()); the logic that used to live in SinkFunction() now has to be implemented up front with a MapFunction()
The new architecture has been around since 1.11, but plenty of connectors are still written against the old one.
I started with the legacy approach; if you want to try it, RMQSink is a good reference. The JDBC sink also looks like old architecture from the PyFlink side, but the latest jdbc-connector repo on GitHub no longer contains an old-architecture implementation, and I could not be bothered to dig it up.
Pitfall, fix, another pitfall... until I got stuck on this error, which I could not beat no matter how hard I tried:
Unable to make field private static final java.lang.reflect.Method jdk.proxy3.$Proxy22.m0 accessible: module jdk.proxy3 does not "opens jdk.proxy3" to unnamed module
This comes from the module system introduced in JDK 9: reflective access to a module's non-exported internals is forbidden unless the module is explicitly opened to the caller (here, the unnamed module).
The standard advice online is that one JVM option fixes it: --add-opens=jdk.proxy3/jdk.proxy3=ALL-UNNAMED
It took me ages to find where PyFlink lets you add a JVM option; there is no exposed way at all, so I had to patch the source and add it in pyflink_gateway_server.py:
But it didn't work, damn it! A whole day, still unsolved. In theory the new architecture should hit the same problem, but somehow it doesn't; the only difference is that extending SinkFunction requires an extra interface to factor out the SinkFunction logic, so I can only guess that py4j's wrapping of Java interfaces triggers some strange bug. If anyone knows the fix, please do contact me!!!
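For completeness, here is the gist of that (unsuccessful) patch. This is an illustrative sketch only: the function name and command layout are hypothetical stand-ins, not PyFlink's actual internals in pyflink_gateway_server.py. The idea is simply to splice the --add-opens flag into the JVM launch command before the main class:

```python
# Hypothetical sketch: splice an --add-opens flag into a java launch
# command. PyFlink's real pyflink_gateway_server.py builds its command
# differently; only the idea (flag must precede the main class) carries over.
def add_jvm_opens(java_cmd):
    flag = "--add-opens=jdk.proxy3/jdk.proxy3=ALL-UNNAMED"
    if flag in java_cmd:
        return java_cmd
    # JVM options must come before the main class (or -jar),
    # so insert the flag right after the java executable itself.
    return [java_cmd[0], flag] + java_cmd[1:]

cmd = ["java", "-cp", "flink.jar",
       "org.apache.flink.client.python.PythonGatewayServer"]
patched = add_jvm_opens(cmd)
```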
Only afterwards did I try the new-architecture sink, i.e. the v-0.1 branch above.
Java implementation steps:
CypherStatement: a POJO wrapping the element data; use a map operator to turn each record into a CypherStatement before calling the sink operator.
import java.util.Map;
public class CypherStatement {
private final String query;
private final Map<String, Object> parameters;
public CypherStatement(String query, Map<String, Object> parameters) {
this.query = query;
this.parameters = parameters;
}
// Getters
public String getQuery() {
return query;
}
public Map<String, Object> getParameters() {
return parameters;
}
}
Neo4jSink implements the Sink interface and is the entry point:
import org.apache.flink.api.connector.sink2.Sink;
import org.apache.flink.api.connector.sink2.SinkWriter;
import java.io.IOException;
public class Neo4jSink implements Sink<CypherStatement> {
private final String uri;
private final String user;
private final String password;
private final int batchSize;
public Neo4jSink(String uri, String user, String password, int batchSize) {
this.uri = uri;
this.user = user;
this.password = password;
this.batchSize = batchSize;
}
@Override
public SinkWriter<CypherStatement> createWriter(InitContext initContext) throws IOException {
return new Neo4jSinkWriter(uri, user, password, batchSize);
}
}
Neo4jSinkWriter implements the SinkWriter interface, providing these methods:
write: per-element preparation, e.g. the batching logic
flush: connect to the database and write the data
close: close the connection
import org.apache.flink.api.connector.sink2.SinkWriter;
import org.neo4j.driver.*;
import java.io.IOException;
import java.io.Serializable;
import java.util.ArrayList;
import java.util.List;
public class Neo4jSinkWriter implements SinkWriter<CypherStatement>, Serializable {
private transient Driver driver;
private transient Session session;
private transient Transaction transaction;
private final List<CypherStatement> batch = new ArrayList<>();
private final String uri;
private final String user;
private final String password;
private final int batchSize;
public Neo4jSinkWriter(String uri, String user, String password, int batchSize) {
this.uri = uri;
this.user = user;
this.password = password;
this.batchSize = batchSize;
}
@Override
public void write(CypherStatement element, Context context) throws IOException, InterruptedException {
batch.add(element);
if (batch.size() >= batchSize) { // batch writes for better throughput
flush(true);
}
}
@Override
public void flush(boolean endOfInput) throws IOException, InterruptedException {
if (batch.isEmpty()) {
return;
}
if (driver == null) { // open the driver lazily and keep it for later flushes
driver = GraphDatabase.driver(uri, AuthTokens.basic(user, password));
}
session = driver.session();
transaction = session.beginTransaction();
try {
for (CypherStatement stmt : batch) {
transaction.run(stmt.getQuery(), stmt.getParameters());
}
transaction.commit(); // commit the current transaction
} catch (Exception e) {
transaction.rollback(); // roll back on failure
throw new RuntimeException("Neo4j write failed", e);
} finally {
batch.clear();
transaction.close();
session.close(); // the driver itself is closed in close()
}
}
@Override
public void close() throws Exception {
if (driver != null) {
driver.close();
}
}
}
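The write()/flush() pair above follows a simple pattern: buffer elements until the buffer reaches batchSize, then flush the whole batch at once. A plain-Python sketch of that pattern (class and callback names are mine, for illustration only):

```python
# Minimal sketch of the buffer-then-flush batching used by the sink writer.
# flush_fn stands in for "open a transaction and run the statements".
class BatchBuffer:
    def __init__(self, batch_size, flush_fn):
        self.batch_size = batch_size
        self.flush_fn = flush_fn
        self.batch = []

    def write(self, element):
        self.batch.append(element)
        if len(self.batch) >= self.batch_size:  # same check as write() above
            self.flush()

    def flush(self):
        if self.batch:
            self.flush_fn(list(self.batch))
            self.batch.clear()

flushed = []
buf = BatchBuffer(2, flushed.append)
for stmt in ["stmt1", "stmt2", "stmt3"]:
    buf.write(stmt)
# "stmt1"/"stmt2" were flushed as one batch; "stmt3" is still buffered
```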
Note that I'm using version 5.19.0 of the Neo4j driver; the API differs in 4.x and below.
But now a problem. On the PyFlink side you can wrap the POJO class CypherStatement (I deleted that code and can't be bothered to rewrite it; the py4j website shows how), then convert records into CypherStatement in a MapFunction and sink them downstream.
Reality, however, bites:
PyFlink starts a JVM when the env is created and uses py4j's java_gateway to proxy calls into Java. But PyFlink's transform operators, such as map, run the MapFunction in a separate Python process. The java_gateway holds a global thread lock and cannot be serialized, so the Python process cannot receive any JVM object; as a result the map function cannot use the Java POJO at all (it fails with a CypherStatement class-not-found error).
P.S. It later occurred to me that switching the Python execution mode (python.execution-mode) to thread mode might get around this. I have not tried it; interested readers are welcome to.
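For reference, thread mode is a documented Flink option (whether it actually fixes this particular problem is unverified, as noted above): the python.execution-mode key controls whether Python functions run in a separate process or inside the JVM.

```yaml
# flink-conf.yaml — run Python functions in the JVM thread instead of a
# separate Python process (the default is "process")
python.execution-mode: thread
```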
So I tried dropping the POJO and packaging the data as a Java Map<String, Object> instead. The PyFlink developers built in some automatic conversions, e.g. Python list/dict to Java array/Map, so on the Python side I only needed a MapFunction turning each record into a dict; when the datastream calls the sink, the dict is automatically converted into a Java Map.
But the automatic conversion cannot express fine-grained value types. In Java you can declare Map<String, Object>, while PyFlink's TypeInformation has no such free-form type, which means every value must be of the same type. I ran into exactly this with the ES sink before the new year and wrote an article about it at the time; the loop is now closed, folks!
PyFlink/Flink DataStream API: a workaround for being unable to write composite data types to ES
Back then I sidestepped the problem by switching to the SQL API; now it can be solved with a bit of custom development!
So keep thinking: is there a type that both PyFlink and the Java API understand, and that supports multi-level nesting? Of course there is: the Row type!
Time to change the code. On the Java side:
Neo4jSink: only the input type needs to change.
package hirson.sink.neo4j;
import org.apache.flink.api.connector.sink2.*;
import org.apache.flink.types.Row;
import java.io.IOException;
public class Neo4jSink implements Sink<Row> {
private final String uri;
private final String user;
private final String password;
private final int batchSize;
public Neo4jSink(String uri, String user, String password, int batchSize) {
this.uri = uri;
this.user = user;
this.password = password;
this.batchSize = batchSize;
}
@Override
public SinkWriter<Row> createWriter(InitContext initContext) throws IOException {
return new Neo4jSinkWriter(uri, user, password, batchSize);
}
}
Neo4jSinkWriter: the input type changes, and the data-handling logic needs adjusting to match:
package hirson.sink.neo4j;
import org.apache.flink.api.connector.sink2.SinkWriter;
import org.apache.flink.types.Row;
import org.neo4j.driver.*;
import java.io.IOException;
import java.io.Serializable;
import java.util.*;
public class Neo4jSinkWriter implements SinkWriter<Row>, Serializable {
private transient Driver driver;
private transient Session session;
private transient Transaction transaction;
private final List<Row> batch = new ArrayList<>();
private final String uri;
private final String user;
private final String password;
private final int batchSize;
public Neo4jSinkWriter(String uri, String user, String password, int batchSize) {
this.uri = uri;
this.user = user;
this.password = password;
this.batchSize = batchSize;
}
@Override
public void write(Row element, Context context) throws IOException, InterruptedException {
// data format validation could go here
batch.add(element);
if (batch.size() >= batchSize) {
flush(true);
}
}
@Override
public void flush(boolean endOfInput) throws IOException, InterruptedException {
if (batch.isEmpty()) {
return;
}
if (driver == null) { // open the driver lazily and keep it for later flushes
driver = GraphDatabase.driver(uri, AuthTokens.basic(user, password));
}
session = driver.session();
transaction = session.beginTransaction();
try {
for (Row stmt : batch) {
String query = (String) stmt.getField(0);
Map<String, Object> parameters = convertRowToMap(stmt);
transaction.run(query, parameters);
}
transaction.commit(); // commit the current transaction
} catch (Exception e) {
transaction.rollback(); // roll back on failure
throw new RuntimeException("Neo4j write failed", e);
} finally {
batch.clear();
transaction.close();
session.close(); // the driver itself is closed in close()
}
}
private Map<String, Object> convertRowToMap(Row stmt) {
Map<String, Object> params = new HashMap<>();
Row paramsKeyRow = (Row)stmt.getField(1);
Row paramsValueRow = (Row)stmt.getField(2);
for (int pos = 0; pos < paramsKeyRow.getArity(); pos++) {
String key =(String) paramsKeyRow.getField(pos);
Object value = paramsValueRow.getField(pos);
params.put(key, value);
}
return params;
}
@Override
public void close() throws Exception {
if (driver != null) {
driver.close();
}
}
}
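To make the Row layout concrete: field 0 carries the Cypher query, field 1 a Row of parameter names, field 2 a Row of parameter values, and convertRowToMap pairs fields 1 and 2 by position. Here is the same logic in plain Python, with tuples standing in for Flink Rows (illustrative only):

```python
# Plain-Python equivalent of convertRowToMap: pair the "key row" with the
# "value row" by position to rebuild the Cypher parameter map.
def convert_row_to_map(stmt):
    query, keys, values = stmt
    return query, dict(zip(keys, values))

stmt = (
    "MERGE (u:User {id: $id}) SET u.name = $name",  # field 0: query
    ("id", "name"),                                 # field 1: parameter names
    ("1001", "Kevin"),                              # field 2: parameter values
)
query, params = convert_row_to_map(stmt)
```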
I also added a Neo4jSinkBuilder to match the Python style:
package hirson.sink.neo4j;
import java.io.Serializable;
public class Neo4jSinkBuilder implements Serializable {
private String uri;
private String user;
private String password;
private int batchSize;
public Neo4jSinkBuilder setUri(String uri) {
this.uri = uri;
return this;
}
public Neo4jSinkBuilder setUser(String user) {
this.user = user;
return this;
}
public Neo4jSinkBuilder setPassword(String password) {
this.password = password;
return this;
}
public Neo4jSinkBuilder setBatchSize(int batchSize) {
this.batchSize = batchSize;
return this;
}
public Neo4jSink build() {
return new Neo4jSink(uri, user, password, batchSize);
}
}
That's the Java side done; here is the Python side as well:
# -*- coding: utf-8 -*-
'''
@File :Neo4jSink.py
@Author :Hirson(Zhang.Hechuan)
@Date :2025/3/10 16:06
'''
from pyflink.datastream.connectors import Sink
from pyflink.java_gateway import get_gateway
__all__ = [
'Neo4jSink',
'Neo4jSinkBuilder'
]
# Neo4jSink class wrapping the Java sink
class Neo4jSink(Sink):
def __init__(self, j_neo4j_sink):
super(Neo4jSink, self).__init__(j_neo4j_sink)
@staticmethod
def builder() -> 'Neo4jSinkBuilder':
"""Returns a builder to construct a Neo4jSink."""
return Neo4jSinkBuilder()
# Neo4jSinkBuilder for configuring the sink
class Neo4jSinkBuilder:
def __init__(self):
self._j_builder = get_gateway().jvm.hirson.sink.neo4j.Neo4jSinkBuilder()
def set_uri(self, uri: str) -> 'Neo4jSinkBuilder':
"""Sets the Neo4j connection URI."""
self._j_builder.setUri(uri)
return self
def set_user(self, user: str) -> 'Neo4jSinkBuilder':
"""Sets the Neo4j username."""
self._j_builder.setUser(user)
return self
def set_password(self, password: str) -> 'Neo4jSinkBuilder':
"""Sets the Neo4j password."""
self._j_builder.setPassword(password)
return self
def set_batch_size(self, batch_size: int) -> 'Neo4jSinkBuilder':
"""Sets the batch size for writing to Neo4j."""
self._j_builder.setBatchSize(batch_size)
return self
def build(self) -> 'Neo4jSink':
"""Builds the Neo4jSink."""
j_neo4j_sink = self._j_builder.build()
return Neo4jSink(j_neo4j_sink)
And a usage example:
# -*- coding: utf-8 -*-
'''
@File :PyflinkNeo4jTest.py
@Author :Hirson(Zhang.Hechuan)
@Date :2025/3/7 15:43
'''
from pyflink.common import Row
from pyflink.common.typeinfo import Types
from pyflink.datastream import StreamExecutionEnvironment
from PyflinkNeo4jSinkDemo3 import Neo4jSink
# build the Cypher statement as a nested Row
def build_cypher_statement(row):
return Row(
"MERGE (u:User {id: $id}) SET u.name = $name, u.age = $age",
Row("id", "name", "age"),
Row(row.id, row.name, row.age)
)
if __name__ == '__main__':
# create the execution environment
env = StreamExecutionEnvironment.get_execution_environment()
env.set_parallelism(1)
# add the Java jar (adjust to your actual path)
env.add_jars('file:///D:/Hirson/PythonProjects/demo/jars/neo4jsink-1.0-SNAPSHOT.jar')
# create a test data stream
sample_data = [
Row(id='1001', name="Kevin", age=30),
Row(id='1002', name="Bob", age=25),
Row(id='1003', name="Eric", age=35)
]
type_info = Types.ROW_NAMED(["id", "name", "age"], [Types.STRING(), Types.STRING(), Types.INT()])
output_type = Types.ROW([Types.STRING(),
Types.ROW([Types.STRING(), Types.STRING(), Types.STRING()]),
Types.ROW([Types.STRING(), Types.STRING(), Types.INT()])
])
ds = env.from_collection(sample_data, type_info).map(build_cypher_statement, output_type)
# create and configure the Neo4j sink
neo4j_sink = Neo4jSink.builder()\
.set_uri("bolt://localhost:7687")\
.set_user("neo4j")\
.set_password("12345678")\
.set_batch_size(100)\
.build()
# add the sink to the data stream
ds.sink_to(neo4j_sink)
# run the job
env.execute("Neo4j Sink Demo")
And finally it works. It is still rough, but at least the pipeline runs end to end; exactly-once semantics are not implemented yet, and that can wait.
I must have hit a hundred pitfalls, big and small. For example, you may wonder why the test data uses Types.ROW_NAMED while the mapped output uses Types.ROW. The reason is that on the Java side the sink reads Row fields with row.getField(pos), i.e. by position rather than by name, and a Flink Row is designed to be either position-based or name-based: once you give it field names, you can no longer access fields by position. Hence my design of a Row nesting two inner Rows, with one inner Row recording the field names. Painful.
There are plenty of similar issues; feel free to comment or message me with questions. This thing drained a lot of my energy, and I will need some serious slacking off to recover.