Test case: Spark Structured Streaming writing to Elasticsearch with dynamic index names

ES official documentation: https://www.elastic.co/guide/en/elasticsearch/hadoop/current/spark.html#spark-sql-streaming

Input source: CSV files. Sample data:
liwei,20,中国,2019-05-14
liwei,10,中国,2019-06-15
zhangsan,20,中国,2019-05-16
zhangsan,10,中国,2019-06-17
Output sink: Elasticsearch

The output index is configured dynamically as "structured.es.example.{name}.{date|yyyy-MM}/_doc"

Official documentation on the multi-resource write format: https://www.elastic.co/guide/en/elasticsearch/hadoop/current/configuration.html#cfg-multi-writes-format

The resulting Elasticsearch indices are:
structured.es.example.zhangsan.2019-06
structured.es.example.zhangsan.2019-05
structured.es.example.liwei.2019-06
structured.es.example.liwei.2019-05
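How each row maps to an index name can be illustrated with a small pattern resolver. This is only a sketch of the `{field}` / `{field|format}` substitution described in the es-hadoop docs, not the library's actual implementation; `IndexPatternSketch` and `resolve` are hypothetical names:

```scala
import java.time.LocalDate
import java.time.format.DateTimeFormatter

object IndexPatternSketch {

  // Replace "{field}" and "{field|dateFormat}" placeholders with values
  // from a row. es-hadoop performs this substitution per document.
  def resolve(pattern: String, row: Map[String, String]): String =
    "\\{([^}|]+)(?:\\|([^}]+))?\\}".r.replaceAllIn(pattern, m => {
      val value = row(m.group(1))
      Option(m.group(2)) match {
        // A format suffix means the field holds a `yyyy-MM-dd` date to reformat
        case Some(fmt) => LocalDate.parse(value).format(DateTimeFormatter.ofPattern(fmt))
        case None      => value
      }
    })

  def main(args: Array[String]): Unit = {
    val row = Map("name" -> "liwei", "date" -> "2019-05-14")
    println(resolve("structured.es.example.{name}.{date|yyyy-MM}", row))
    // structured.es.example.liwei.2019-05
  }
}
```

Applied to the four sample rows above, this yields exactly the four target indices listed.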
Example code:
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.OutputMode
import org.apache.spark.sql.types.StructType

/**
 * Test case: Spark Structured Streaming writing to Elasticsearch with dynamic index names.
 *
 * @author wei.Li by 2019-05-24
 */
object StructuredEs {

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .config(
        new SparkConf()
          .setAppName("StructuredEs")
          .setMaster("local")
          .set("es.nodes", "ES-IP:9201")
          .set("es.nodes.wan.only", "true")
      )
      .getOrCreate()

    // Schema for the test data. Note: the `date` column uses the format `yyyy-MM-dd`.
    val userSchema = new StructType()
      .add("name", "string")
      .add("age", "integer")
      .add("address", "string")
      .add("date", "string")

    spark
      .readStream
      .option("sep", ",")
      .schema(userSchema)
      .csv("/data/csv/") // directory containing the CSV files
      .writeStream
      .format("es")
      .outputMode(OutputMode.Append())
      .option("checkpointLocation", "file:/data/job/spark/checkpointLocation/example/StructuredEs")
      .start("structured.es.example.{name}.{date|yyyy-MM}") // target index pattern; ES 7+ needs no type
      .awaitTermination()
  }
}
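Once the job has processed the sample files, one way to confirm that the dynamic indices were created is Elasticsearch's `_cat/indices` API with a wildcard (assuming the same cluster address, ES-IP:9201, used in the config above):

```shell
# List all indices produced by the dynamic pattern; `?v` adds column headers.
curl "http://ES-IP:9201/_cat/indices/structured.es.example.*?v"
```

With the sample data, the listing should show the four indices enumerated earlier.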