8 Hudi Configuration
Perform the configuration on the Hadoop master node.
8.1 Configure Environment Variables
The steps below use software that has already been downloaded and compiled.
Place the compiled files in the designated directory:
cp -r hudi-0.8.0_yl /usr/app/
Add the environment variables:
# vim /etc/profile
Append the following:
export HUDI_UTILITIES_BUNDLE=/usr/app/hudi-0.8.0/docker/hoodie/hadoop/hive_base/target/hoodie-utilities.jar
export HUDI_SPARK_BUNDLE=/usr/app/hudi-0.8.0/docker/hoodie/hadoop/hive_base/target/hoodie-spark-bundle.jar
Also add the following to the environment variables configured earlier:
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export SPARK_CONF_DIR=$SPARK_HOME/conf
export HIVE_CONF_DIR=$HIVE_HOME/conf
Make the changes take effect:
source /etc/profile
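Once /etc/profile has been sourced, it is worth confirming that the two bundle variables actually point at existing jars before running any jobs. A minimal sketch (check_jar is a hypothetical helper, not part of the Hudi distribution):

```shell
# Hypothetical helper: report whether an environment variable names an existing jar.
check_jar() {
  local name="$1" path="$2"
  if [ -z "$path" ]; then
    echo "$name is not set -- did you run 'source /etc/profile'?"
  elif [ ! -f "$path" ]; then
    echo "$name points at a missing file: $path"
  else
    echo "$name OK: $path"
  fi
}

check_jar HUDI_UTILITIES_BUNDLE "$HUDI_UTILITIES_BUNDLE"
check_jar HUDI_SPARK_BUNDLE "$HUDI_SPARK_BUNDLE"
```

If either line reports a missing file, re-check the paths exported above against your actual build output.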
8.2 Script Configuration
Make the following configurations in the member folder:
8.2.1 kafka-source.properties
Purpose: carry data changes from Kafka into Hadoop.
Places to change (path: /software/member/config):
- The entry-point parameter for connecting to the Kafka cluster: bootstrap.servers=10.20.3.75:9092
- The source schema of the DeltaStreamer stream (Kafka): /jk/cmj/member-app/member/config/schema.avsc
- The target schema of the DeltaStreamer stream (Hadoop): /jk/cmj/member-app/member/config/schema.avsc
- hoodie.deltastreamer.source.kafka.topic: the Kafka topic that DeltaStreamer consumes
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
include=base.properties
# Key fields, for kafka example
hoodie.datasource.write.recordkey.field=id
hoodie.datasource.write.partitionpath.field=create_date
# Schema provider props (change to absolute path based on your installation)
hoodie.deltastreamer.schemaprovider.source.schema.file=/jk/cmj/member-app/member/config/schema.avsc
hoodie.deltastreamer.schemaprovider.target.schema.file=/jk/cmj/member-app/member/config/schema.avsc
# Kafka Source
hoodie.deltastreamer.source.kafka.topic=debe.cdc_zy.test_cdc.output
#Kafka props
#client.id=client-id-cmj
bootstrap.servers=10.20.3.75:9092
auto.offset.reset=earliest
# hive sync
#hoodie.datasource.hive_sync.table=tb_member_mor
#hoodie.datasource.hive_sync.username=root
#hoodie.datasource.hive_sync.password=hive
#hoodie.datasource.hive_sync.jdbcurl=jdbc:hive2://czm-hdp-1:10000
#hoodie.datasource.hive_sync.partition_fields=create_date
#hoodie.datasource.write.table.type=MERGE_ON_READ
hoodie.compact.inline=true
#hoodie.datasource.hive_sync.partition_extractor_class=org.apache.hudi.keygen.SimpleKeyGenerator
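Before submitting a job against this file, a quick check that the required keys are present and uncommented can save a failed Spark run. A sketch (the helper name and the key list are assumptions based on the file above):

```shell
# Hypothetical check: verify the DeltaStreamer properties file defines the keys
# the job needs; a key that is missing or commented out is reported.
required_keys() {
  local props="$1" key missing=0
  for key in bootstrap.servers \
             hoodie.deltastreamer.source.kafka.topic \
             hoodie.datasource.write.recordkey.field \
             hoodie.datasource.write.partitionpath.field \
             hoodie.deltastreamer.schemaprovider.source.schema.file \
             hoodie.deltastreamer.schemaprovider.target.schema.file; do
    if ! grep -q "^${key}=" "$props"; then
      echo "missing or commented out: $key"
      missing=1
    fi
  done
  return $missing
}
```

Usage: `required_keys /jk/cmj/member-app/member/config/kafka-source.properties` returns nonzero and lists any absent keys.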
8.2.2 schema.avsc
Purpose: describe the structure of the table whose changes are captured.
This file (path: /software/member) records the field schema of the data table. Adjust it to match your own table (the one configured in Debezium); it can also be generated with the Java Spark demo. Add an extra create_date field to the data, which serves as hoodie.datasource.write.partitionpath.field:
{
"type": "record",
"name": "TestCdc",
"namespace": "some.namespace",
"db-schema-name": "cdc_zy",
"db-table-name": "test_cdc",
"fields": [
{ "name": "id", "type": [ "null", "string" ], "default": null },
{ "name": "name", "type": [ "null", "string" ], "default": null },
{ "name": "create_time", "type": [ "null", "string" ], "default": null, "doc": "record creation time" },
{ "name": "create_date", "type": ["null", "string"] },
{ "name": "update_time", "type": [ "null", "string" ], "default": null, "doc": "last update time" }
]
}
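Since the schema file is plain JSON, a malformed edit is easy to catch before DeltaStreamer does. A sketch, assuming python3 is available on the node (this validates JSON syntax only, not Avro semantics):

```shell
# Hypothetical check: the schema must parse as JSON and must contain the
# create_date field used as the partition path above.
validate_schema() {
  local schema="$1"
  python3 -m json.tool "$schema" > /dev/null 2>&1 || { echo "invalid JSON: $schema"; return 1; }
  grep -q '"name": *"create_date"' "$schema" || { echo "create_date field missing"; return 1; }
  echo "schema OK"
}
```

Usage: `validate_schema /software/member/config/schema.avsc` (path taken from this section).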
8.2.3 sync-hive.sh
Purpose: sync the Hudi table schema into Hive.
The file (path: /software/member) contents are below; places to change:
- The Hudi path: /usr/app/hudi-0.8.0
- The database name to record into: --database default
- The table name to record into: --table test_cdc_mor
/usr/app/hudi-0.8.0/hudi-sync/hudi-hive-sync/run_sync_tool.sh \
--jdbc-url jdbc:hive2://127.0.0.1:10000 \
--user root \
--pass hive \
--partitioned-by create_date \
--base-path /user/hive/warehouse/test_cdc_cow \
--database default \
--table tb_cdc_cow
/usr/app/hudi-0.8.0/hudi-sync/hudi-hive-sync/run_sync_tool.sh \
--jdbc-url jdbc:hive2://127.0.0.1:10000 \
--user root \
--pass hive \
--partitioned-by create_date \
--base-path /user/hive/warehouse/test_cdc_mor \
--database default \
--table test_cdc_mor
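The two invocations above differ only in the base path and table name, so they can be folded into one helper. A sketch (RUN_SYNC_TOOL is an assumed override point; set it to echo for a dry run that just prints the arguments):

```shell
# Sketch: one parameterized wrapper around run_sync_tool.sh instead of two
# near-identical invocations. Defaults match the paths used in this section.
RUN_SYNC_TOOL="${RUN_SYNC_TOOL:-/usr/app/hudi-0.8.0/hudi-sync/hudi-hive-sync/run_sync_tool.sh}"

sync_table() {
  local table="$1" base_path="$2"
  $RUN_SYNC_TOOL \
    --jdbc-url jdbc:hive2://127.0.0.1:10000 \
    --user root \
    --pass hive \
    --partitioned-by create_date \
    --base-path "$base_path" \
    --database default \
    --table "$table"
}
```

Usage: `sync_table tb_cdc_cow /user/hive/warehouse/test_cdc_cow` and `sync_table test_cdc_mor /user/hive/warehouse/test_cdc_mor` reproduce the two calls above.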
8.2.4 sync-config.sh
Purpose: create the HDFS folders for the DeltaStreamer stream and upload the config (table schema).
The file (path: /software/member) contents are below; places to change:
- All of the file paths
hadoop fs -mkdir -p /jk/cmj/member-app/member
hadoop fs -rm -R /jk/cmj/member-app/member/config
hadoop fs -put -l config /jk/cmj/member-app/member/config
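The same three commands can be wrapped so the base path is a parameter. A sketch (HADOOP_CMD is an assumed override; set it to "echo hadoop" to dry-run without a cluster):

```shell
# Sketch of the sync-config.sh steps as one function.
HADOOP_CMD="${HADOOP_CMD:-hadoop}"

sync_config() {
  local base="$1"                        # e.g. /jk/cmj/member-app/member
  $HADOOP_CMD fs -mkdir -p "$base"
  $HADOOP_CMD fs -rm -R "$base/config"   # fails harmlessly on the first run (no set -e)
  $HADOOP_CMD fs -put -l config "$base/config"
}
```

Usage: `sync_config /jk/cmj/member-app/member`, run from the directory containing the local config folder.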
8.2.5 ingest-mor.sh
Purpose: submit the DeltaStreamer job that ingests the database changes into Hudi on Hadoop.
The file (path: /software/member) contents are below; places to change:
- target-base-path
- The Kafka configuration (same as in kafka-source.properties): --props /jk/cmj/member-app/member/config/kafka-source.properties
- The target table (same table name as in sync-hive.sh): --target-table test_cdc_mor
#--checkpoint hadoop.debe_test.tb_member.output,0:0 \
#--disable-compaction
spark-submit \
--class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer $HUDI_UTILITIES_BUNDLE \
--table-type MERGE_ON_READ \
--source-class org.apache.hudi.utilities.sources.JsonKafkaSource \
--source-ordering-field update_time \
--target-base-path /user/hive/warehouse/test_cdc_mor \
--target-table test_cdc_mor \
--props /jk/cmj/member-app/member/config/kafka-source.properties \
--schemaprovider-class org.apache.hudi.utilities.schema.FilebasedSchemaProvider
# --continuous > /dev/null 2>&1 &
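To reuse the same submission for other tables, the command can be parameterized. A sketch (SPARK_SUBMIT is an assumed override; setting it to echo prints the assembled command instead of launching Spark):

```shell
# Sketch: the spark-submit above wrapped in a function so the target table and
# base path are parameters; everything else matches this section's script.
SPARK_SUBMIT="${SPARK_SUBMIT:-spark-submit}"

ingest_mor() {
  local table="$1" base_path="$2"
  $SPARK_SUBMIT \
    --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer "$HUDI_UTILITIES_BUNDLE" \
    --table-type MERGE_ON_READ \
    --source-class org.apache.hudi.utilities.sources.JsonKafkaSource \
    --source-ordering-field update_time \
    --target-base-path "$base_path" \
    --target-table "$table" \
    --props /jk/cmj/member-app/member/config/kafka-source.properties \
    --schemaprovider-class org.apache.hudi.utilities.schema.FilebasedSchemaProvider
}
```

Usage: `ingest_mor test_cdc_mor /user/hive/warehouse/test_cdc_mor` reproduces the command above.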
8.3 Testing Data Ingestion into Hudi
8.3.1 One-off Ingestion into Hudi
The procedure is as follows:
- 1. Start the Java programs (Eureka, the service registry, and the Hudi Kafka demo, which transforms the Kafka topic so DeltaStreamer can consume it) to sync database changes into Kafka in real time;
  - In the Hudi Kafka demo's application.yml, change the Kafka server address and the source and target Kafka topics
- 2. Make some data changes in the database table configured earlier for Debezium in Confluent;
- 3. Run the scripts in the member folder:
  - First run sync-config.sh to create the Hadoop folders;
  - Then run ingest-mor.sh to load the data from Kafka into Hadoop;
  - Finally run sync-hive.sh to sync the table schema;
- 4. Start Kafka Tool and, under Clusters ---> Topics, inspect the table's schema changes in debe (the server configured for Debezium) and the data changes in debe_cdc_zy_test_cdc.output (set in the Java app's application.yml);
- 5. On the server with Hive installed (10.20.3.72), start Hive and query the tables in Hadoop with SQL statements.
8.3.2 Continuous Ingestion into Hudi
- 1. In config/kafka-source.properties in the member folder, comment out hoodie.compact.inline=true.
- 2. Change the contents of ingest-mor.sh to the following (to keep a log, replace /dev/null with a log file path):
spark-submit \
--class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer $HUDI_UTILITIES_BUNDLE \
--table-type MERGE_ON_READ \
--source-class org.apache.hudi.utilities.sources.JsonKafkaSource \
--source-ordering-field update_time \
--target-base-path /user/hive/warehouse/test_hudi_mor \
--target-table test_hudi_mor \
--props /jk/cmj/member-app/member/config/kafka-source.properties \
--schemaprovider-class org.apache.hudi.utilities.schema.FilebasedSchemaProvider \
--continuous >/dev/null 2>&1 &
- Add the following lines to hdfs-site.xml (on all 3 nodes):
<property>
<name>dfs.client.block.write.replace-datanode-on-failure.policy</name>
<value>NEVER</value>
</property>
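A quick way to confirm the property landed on each node is to grep the config file. A sketch (the helper name and the config path argument are assumptions):

```shell
# Hypothetical check: is the replace-datanode-on-failure policy present and
# set to NEVER in the given hdfs-site.xml?
check_hdfs_policy() {
  local conf="$1"   # e.g. $HADOOP_HOME/etc/hadoop/hdfs-site.xml
  grep -q 'dfs.client.block.write.replace-datanode-on-failure.policy' "$conf" &&
    grep -q '<value>NEVER</value>' "$conf" &&
    echo "policy set in $conf"
}
```

Run it on each of the 3 nodes after editing the file.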
- Delete the existing files under Hadoop:
hadoop fs -rm -r /user/hive
- If Hadoop is restarted, also delete the folders under the paths configured in hdfs-site.xml, e.g. /home/hadoop/datanode/data/current.