8 Hudi Configuration
Perform the configuration on the Hadoop master node.
8.1 Configure Environment Variables
The steps below use software that has already been downloaded and compiled.
Place the compiled files in the designated directory:
cp -r hudi-0.8.0_yl /usr/app/
Add the environment variables:
# vim /etc/profile
Append the following:
export HUDI_UTILITIES_BUNDLE=/usr/app/hudi-0.8.0/docker/hoodie/hadoop/hive_base/target/hoodie-utilities.jar
export HUDI_SPARK_BUNDLE=/usr/app/hudi-0.8.0/docker/hoodie/hadoop/hive_base/target/hoodie-spark-bundle.jar
Also add the following to the environment variables configured earlier:
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export SPARK_CONF_DIR=$SPARK_HOME/conf
export HIVE_CONF_DIR=$HIVE_HOME/conf
Make the changes take effect:
source /etc/profile
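Once /etc/profile has been sourced, it is worth confirming that the two bundle variables actually point at existing jars before running any jobs. A minimal sketch (check_jar is a hypothetical helper, not part of the Hudi distribution):

```shell
# Hypothetical helper: report whether an environment variable names an existing jar.
check_jar() {
  local name="$1" path="$2"
  if [ -z "$path" ]; then
    echo "$name is not set -- did you run 'source /etc/profile'?"
  elif [ ! -f "$path" ]; then
    echo "$name points at a missing file: $path"
  else
    echo "$name OK: $path"
  fi
}

check_jar HUDI_UTILITIES_BUNDLE "$HUDI_UTILITIES_BUNDLE"
check_jar HUDI_SPARK_BUNDLE "$HUDI_SPARK_BUNDLE"
```

If either line reports a missing file, re-check the paths exported above against your actual build output.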
8.2 Script Configuration
Make the following configurations in the member folder:
8.2.1 kafka-source.properties
Purpose: carry data changes from Kafka into Hadoop.
Places to change (path: /software/member/config):
- The entry-point parameter for connecting to the Kafka cluster: bootstrap.servers=10.20.3.75:9092
- The source schema of the DeltaStreamer stream (Kafka): /jk/cmj/member-app/member/config/schema.avsc
- The target schema of the DeltaStreamer stream (Hadoop): /jk/cmj/member-app/member/config/schema.avsc
- hoodie.deltastreamer.source.kafka.topic: the Kafka topic that DeltaStreamer consumes
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
include=base.properties
# Key fields, for kafka example
hoodie.datasource.write.recordkey.field=id
hoodie.datasource.write.partitionpath.field=create_date
# Schema provider props (change to absolute path based on your installation)
hoodie.deltastreamer.schemaprovider.source.schema.file=/jk/cmj/member-app/member/config/schema.avsc
hoodie.deltastreamer.schemaprovider.target.schema.file=/jk/cmj/member-app/member/config/schema.avsc
# Kafka Source
hoodie.deltastreamer.source.kafka.topic=debe.cdc_zy.test_cdc.output
#Kafka props
#client.id=client-id-cmj
bootstrap.servers=10.20.3.75:9092
auto.offset.reset=earliest
# hive sync
#hoodie.datasource.hive_sync.table=tb_member_mor
#hoodie.datasource.hive_sync.username=root
#hoodie.datasource.hive_sync.password=hive
#hoodie.datasource.hive_sync.jdbcurl=jdbc:hive2://czm-hdp-1:10000
#hoodie.datasource.hive_sync.partition_fields=create_date
#hoodie.datasource.write.table.type=MERGE_ON_READ
hoodie.compact.inline=true
#hoodie.datasource.hive_sync.partition_extractor_class=org.apache.hudi.keygen.SimpleKeyGenerator
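Before submitting a job against this file, a quick check that the required keys are present and uncommented can save a failed Spark run. A sketch (the helper name and the key list are assumptions based on the file above):

```shell
# Hypothetical check: verify the DeltaStreamer properties file defines the keys
# the job needs; a key that is missing or commented out is reported.
required_keys() {
  local props="$1" key missing=0
  for key in bootstrap.servers \
             hoodie.deltastreamer.source.kafka.topic \
             hoodie.datasource.write.recordkey.field \
             hoodie.datasource.write.partitionpath.field \
             hoodie.deltastreamer.schemaprovider.source.schema.file \
             hoodie.deltastreamer.schemaprovider.target.schema.file; do
    if ! grep -q "^${key}=" "$props"; then
      echo "missing or commented out: $key"
      missing=1
    fi
  done
  return $missing
}
```

Usage: `required_keys /jk/cmj/member-app/member/config/kafka-source.properties` returns nonzero and lists any absent keys.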
8.2.2 schema.avsc
Purpose: describe the structure of the table whose changes are captured.
This file (path: /software/member) records the field schema of the data table. Adjust it to match your own table (the one configured in Debezium); it can also be generated with the Java Spark demo. Add an extra create_date field to the data, which serves as hoodie.datasource.write.partitionpath.field:
{
"type": "record",
"name": "TestCdc",
"namespace": "some.namespace",
"db-schema-name": "cdc_zy",
"db-table-name": "test_cdc",
"fields": [
{ "name": "id", "type": [ "null", "string" ], "default": null },
{ "name": "name", "type": [ "null", "string" ], "default": null },
{ "name": "create_time", "type": [ "null", "string" ], "default": null, "doc": "record creation time" },
{ "name": "create_date", "type": ["null", "string"] },
{ "name": "update_time", "type": [ "null", "string" ], "default": null, "doc": "last update time" }
]
}
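Since the schema file is plain JSON, a malformed edit is easy to catch before DeltaStreamer does. A sketch, assuming python3 is available on the node (this validates JSON syntax only, not Avro semantics):

```shell
# Hypothetical check: the schema must parse as JSON and must contain the
# create_date field used as the partition path above.
validate_schema() {
  local schema="$1"
  python3 -m json.tool "$schema" > /dev/null 2>&1 || { echo "invalid JSON: $schema"; return 1; }
  grep -q '"name": *"create_date"' "$schema" || { echo "create_date field missing"; return 1; }
  echo "schema OK"
}
```

Usage: `validate_schema /software/member/config/schema.avsc` (path taken from this section).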
8.2.3 sync-hive.sh
Purpose: sync the Hudi table schema into Hive.
The file (path: /software/member) contents are below; places to change:
- The Hudi path: /usr/app/hudi-0.8.0
- The database name to record into: --database default
- The table name to record into: --table test_cdc_mor
/usr/app/hudi-0.8.0/hudi-sync/hudi-hive-sync/run_sync_tool.sh \
--jdbc-url jdbc:hive2://127.0.0.1:10000 \
--user root \
--pass hive \
--partitioned-by create_date \
--base-path /user/hive/warehouse/test_cdc_cow \
--database default \
--table tb_cdc_cow
/usr/app/hudi-0.8.0/hudi-sync/hudi-hive-sync/run_sync_tool.sh \
--jdbc-url jdbc:hive2://127.0.0.1:10000 \
--user root \
--pass hive \
--partitioned-by create_date \
--base-path /user/hive/warehouse/test_cdc_mor \
--database default \
--table test_cdc_mor
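The two invocations above differ only in the base path and table name, so they can be folded into one helper. A sketch (RUN_SYNC_TOOL is an assumed override point; set it to echo for a dry run that just prints the arguments):

```shell
# Sketch: one parameterized wrapper around run_sync_tool.sh instead of two
# near-identical invocations. Defaults match the paths used in this section.
RUN_SYNC_TOOL="${RUN_SYNC_TOOL:-/usr/app/hudi-0.8.0/hudi-sync/hudi-hive-sync/run_sync_tool.sh}"

sync_table() {
  local table="$1" base_path="$2"
  $RUN_SYNC_TOOL \
    --jdbc-url jdbc:hive2://127.0.0.1:10000 \
    --user root \
    --pass hive \
    --partitioned-by create_date \
    --base-path "$base_path" \
    --database default \
    --table "$table"
}
```

Usage: `sync_table tb_cdc_cow /user/hive/warehouse/test_cdc_cow` and `sync_table test_cdc_mor /user/hive/warehouse/test_cdc_mor` reproduce the two calls above.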
8.2.4 sync-config.sh
Purpose: create the HDFS folders for the DeltaStreamer stream and upload the config (table schema).
The file (path: /software/member) contents are below; places to change:
- All of the file paths
hadoop fs -mkdir -p /jk/cmj/member-app/member
hadoop fs -rm -R /jk/cmj/member-app/member/config
hadoop fs -put -l config /jk/cmj/member-app/member/config
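The same three commands can be wrapped so the base path is a parameter. A sketch (HADOOP_CMD is an assumed override; set it to "echo hadoop" to dry-run without a cluster):

```shell
# Sketch of the sync-config.sh steps as one function.
HADOOP_CMD="${HADOOP_CMD:-hadoop}"

sync_config() {
  local base="$1"                        # e.g. /jk/cmj/member-app/member
  $HADOOP_CMD fs -mkdir -p "$base"
  $HADOOP_CMD fs -rm -R "$base/config"   # fails harmlessly on the first run (no set -e)
  $HADOOP_CMD fs -put -l config "$base/config"
}
```

Usage: `sync_config /jk/cmj/member-app/member`, run from the directory containing the local config folder.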
8.2.5 ingest-mor.sh
Purpose: submit the DeltaStreamer job that ingests the database changes into Hudi on Hadoop.
The file (path: /software/member) contents are below; places to change:
- target-base-path
- The Kafka configuration (same as in kafka-source.properties): --props /jk/cmj/member-app/member/config/kafka-source.properties
- The target table (same table name as in sync-hive.sh): --target-table test_cdc_mor
#--checkpoint hadoop.debe_test.tb_member.output,0:0 \
#--disable-compaction
spark-submit \
--class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer $HUDI_UTILITIES_BUNDLE \
--table-type MERGE_ON_READ \
--source-class org.apache.hudi.utilities.sources.JsonKafkaSource \
--source-ordering-field update_time \
--target-base-path /user/hive/warehouse/test_cdc_mor \
--target-table test_cdc_mor \
--props /jk/cmj/member-app/member/config/kafka-source.properties \
--schemaprovider-class org.apache.hudi.utilities.schema.FilebasedSchemaProvider
# --continuous > /dev/null 2>&1 &
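To reuse the same submission for other tables, the command can be parameterized. A sketch (SPARK_SUBMIT is an assumed override; setting it to echo prints the assembled command instead of launching Spark):

```shell
# Sketch: the spark-submit above wrapped in a function so the target table and
# base path are parameters; everything else matches this section's script.
SPARK_SUBMIT="${SPARK_SUBMIT:-spark-submit}"

ingest_mor() {
  local table="$1" base_path="$2"
  $SPARK_SUBMIT \
    --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer "$HUDI_UTILITIES_BUNDLE" \
    --table-type MERGE_ON_READ \
    --source-class org.apache.hudi.utilities.sources.JsonKafkaSource \
    --source-ordering-field update_time \
    --target-base-path "$base_path" \
    --target-table "$table" \
    --props /jk/cmj/member-app/member/config/kafka-source.properties \
    --schemaprovider-class org.apache.hudi.utilities.schema.FilebasedSchemaProvider
}
```

Usage: `ingest_mor test_cdc_mor /user/hive/warehouse/test_cdc_mor` reproduces the command above.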
8.3 Testing Data Ingestion into Hudi
8.3.1 One-off Ingestion into Hudi
The procedure is as follows:
- 1. Start the Java programs (Eureka, the service registry, and the Hudi Kafka demo, which transforms the Kafka topic so DeltaStreamer can consume it) to sync database changes into Kafka in real time;
  - In the Hudi Kafka demo's application.yml, change the Kafka server address and the source and target Kafka topics
- 2. Make some data changes in the database table configured earlier for Debezium in Confluent;
- 3. Run the scripts in the member folder:
  - First run sync-config.sh to create the Hadoop folders;
  - Then run ingest-mor.sh to load the data from Kafka into Hadoop;
  - Finally run sync-hive.sh to sync the table schema;
- 4. Start Kafka Tool and, under Clusters ---> Topics, inspect the table's schema changes in debe (the server configured for Debezium) and the data changes in debe_cdc_zy_test_cdc.output (set in the Java app's application.yml);
- 5. On the server with Hive installed (10.20.3.72), start Hive and query the tables in Hadoop with SQL statements.
8.3.2 Continuous Ingestion into Hudi
- 1. In config/kafka-source.properties in the member folder, comment out hoodie.compact.inline=true.
- 2. Change the contents of ingest-mor.sh to the following (to keep a log, replace /dev/null with a log file path):
spark-submit \
--class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer $HUDI_UTILITIES_BUNDLE \
--table-type MERGE_ON_READ \
--source-class org.apache.hudi.utilities.sources.JsonKafkaSource \
--source-ordering-field update_time \
--target-base-path /user/hive/warehouse/test_hudi_mor \
--target-table test_hudi_mor \
--props /jk/cmj/member-app/member/config/kafka-source.properties \
--schemaprovider-class org.apache.hudi.utilities.schema.FilebasedSchemaProvider \
--continuous >/dev/null 2>&1 &
- Add the following lines to hdfs-site.xml (on all 3 nodes):
<property>
<name>dfs.client.block.write.replace-datanode-on-failure.policy</name>
<value>NEVER</value>
</property>
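A quick way to confirm the property landed on each node is to grep the config file. A sketch (the helper name and the config path argument are assumptions):

```shell
# Hypothetical check: is the replace-datanode-on-failure policy present and
# set to NEVER in the given hdfs-site.xml?
check_hdfs_policy() {
  local conf="$1"   # e.g. $HADOOP_HOME/etc/hadoop/hdfs-site.xml
  grep -q 'dfs.client.block.write.replace-datanode-on-failure.policy' "$conf" &&
    grep -q '<value>NEVER</value>' "$conf" &&
    echo "policy set in $conf"
}
```

Run it on each of the 3 nodes after editing the file.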
- Delete the existing files under Hadoop:
hadoop fs -rm -r /user/hive
- If Hadoop is restarted, also delete the folders under the paths configured in hdfs-site.xml, e.g. /home/hadoop/datanode/data/current.