Apache Iceberg, From Getting Started to Giving Up (2): Parsing Iceberg Files

狄杰丶

License: CC 4.0 BY-SA

Category: Apache Iceberg. Tags: big data, data lake, lakehouse, Apache Iceberg, data warehouse

Article link: https://blog.csdn.net/weixin_47482194/article/details/115676338

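Background

In the previous chapter we covered how to combine Flink with Iceberg, demonstrated some common operations, and closed with a fairly complete demo.
That was mostly about usage and did not go far into the internals. But since this series is titled "From Getting Started to Giving Up", we obviously have to dig deeper into Iceberg. Otherwise, how would we ever reach the giving-up part? 😂
So today we will parse the metadata files that Flink with Iceberg writes to HDFS.

Before we start, a little preparation.

First, download avro-tools, which we will use to inspect the metadata files.

Then download all the metadata files of the table from the previous chapter: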

hdfs dfs -get /user/hive/warehouse/iceberg_db.db/iceberg_kafka_test/metadata

Brief Overview

Below are the official documentation's definitions of the metadata files, each followed by my own reading of it.

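Snapshot
A snapshot is the state of a table at some time.
In other words, it is what the table looked like at one moment. Snapshots are recorded in ${TABLE_PATH}/metadata/XXX.metadata.json (strictly speaking this is the table metadata file, and each snapshot is one entry in its snapshots array).

Each snapshot lists all of the data files that make up the table’s contents at the time of the snapshot. Data files are stored across multiple manifest files, and the manifests for a snapshot are listed in a single manifest list file.
My reading: a snapshot accounts for every data file that belonged to the table at that moment. The data files are tracked by several manifest files, and the manifests belonging to one snapshot are collected in a single manifest list.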

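Manifest list
Corresponds to the ${TABLE_PATH}/metadata/snap-XXX.avro files.
A manifest list is a metadata file that lists the manifests that make up a table snapshot.
My reading: a manifest list is a metadata file that lists the manifests (not the data files themselves) that make up a snapshot.

Each manifest file in the manifest list is stored with information about its contents, like partition value ranges, used to speed up metadata operations.
My reading: every manifest file recorded in the manifest list comes with information about what it contains, such as partition value ranges, which speeds up metadata operations (it becomes easier to locate the data files).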
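
To make the "partition value ranges" idea concrete, here is a minimal sketch of how a reader could skip whole manifests using only the manifest list. It is an illustration under stated assumptions, not Iceberg's actual implementation: the class `ManifestEntry`, its fields `partition_lower` / `partition_upper`, the function `prune_manifests`, and the file names are all hypothetical simplifications.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ManifestEntry:
    """One row of a manifest list (simplified, hypothetical field names)."""
    manifest_path: str
    partition_lower: str  # smallest partition value covered by this manifest
    partition_upper: str  # largest partition value covered by this manifest


def prune_manifests(entries: List[ManifestEntry], wanted: str) -> List[ManifestEntry]:
    """Keep only the manifests whose partition range could contain `wanted`."""
    return [e for e in entries if e.partition_lower <= wanted <= e.partition_upper]


# Hypothetical manifest list with three manifests, partitioned by day.
manifest_list = [
    ManifestEntry("m0.avro", "2021-04-01", "2021-04-03"),
    ManifestEntry("m1.avro", "2021-04-04", "2021-04-06"),
    ManifestEntry("m2.avro", "2021-04-07", "2021-04-09"),
]

kept = prune_manifests(manifest_list, "2021-04-05")
print([e.manifest_path for e in kept])  # ['m1.avro']
```

Only one of the three manifests has to be opened for this query; the other two are ruled out without reading them, which is the speed-up the definition refers to.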

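Manifest file
A manifest file is a metadata file that lists a subset of data files that make up a snapshot.
My reading: a manifest file is a metadata file that lists a subset of the data files making up a snapshot.

Each data file in a manifest is stored with a partition tuple, column-level stats, and summary information used to prune splits during scan planning.
My reading: every data file recorded in a manifest comes with its partition tuple, column-level statistics, and summary information, which are used to optimize the scan (files that cannot match are filtered out).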
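
With the three definitions in hand, the downloaded metadata directory can be sorted by file type. The sketch below is a rough helper, not part of Iceberg: `classify_metadata_file` and `group_by_kind` are hypothetical names, the sample file names are made up, and treating every remaining .avro file as a manifest file is an assumption based on the naming patterns given above.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def classify_metadata_file(name: str) -> str:
    """Guess the role of a file in the metadata directory from its name."""
    if name.endswith(".metadata.json"):
        return "table metadata"   # holds the snapshot entries
    if name.startswith("snap-") and name.endswith(".avro"):
        return "manifest list"    # one per snapshot
    if name.endswith(".avro"):
        return "manifest file"    # assumption: any other .avro file
    return "other"


def group_by_kind(names: Iterable[str]) -> Dict[str, List[str]]:
    """Group file names by the role guessed above."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for name in names:
        groups[classify_metadata_file(name)].append(name)
    return dict(groups)


# Hypothetical file names, for illustration only.
sample_names = [
    "00000-aaaa.metadata.json",
    "snap-111-1-bbbb.avro",
    "cccc-m0.avro",
    "notes.txt",
]
groups = group_by_kind(sample_names)
for kind, files in groups.items():
    print(kind, files)
```

Against the real download, the same helper could be fed `os.listdir("metadata")` instead of the sample list.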

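Some readers may not follow this yet. That's fine, let's move on to the next section.

A Deeper Look

Take the following SQL as an example:
select * from iceberg_catalog.iceberg_db.iceberg_kafka_test /*+ OPTIONS('streaming'='true', 'monitor-interval'='1s', 'start-snapshot-id'='3821550127947089987') */ ;
and let's walk through what role each of the three file types plays during the query.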

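First, the Hive Catalog iceberg_catalog is used to fetch the information for iceberg_db.iceberg_kafka_test, similar to what we get from running desc formatted iceberg_db.iceberg_kafka_test.

The main purpose is to read the value of metadata_location.
That value is the path of the latest metadata file (the current XXX.metadata.json, which holds the latest snapshot). Open the corresponding file in the local copy we downloaded.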
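
The lookup inside that file can be sketched in a few lines of Python. This is a minimal illustration, assuming the JSON layout of the table metadata file (`current-snapshot-id` at the top level and a `snapshots` array of entries keyed by `snapshot-id`); the helper names `load_table_metadata` and `find_current_snapshot` are hypothetical, and `sample_metadata` is a trimmed-down stand-in that reuses IDs from this article's table rather than the full file.

```python
import json
from typing import Optional


def load_table_metadata(path: str) -> dict:
    """Read a downloaded XXX.metadata.json file into a dict."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def find_current_snapshot(metadata: dict) -> Optional[dict]:
    """Return the snapshot entry whose id equals current-snapshot-id, if any."""
    current_id = metadata.get("current-snapshot-id")
    for snapshot in metadata.get("snapshots", []):
        if snapshot.get("snapshot-id") == current_id:
            return snapshot
    return None


# Trimmed-down stand-in for the real file; most fields are omitted.
sample_metadata = {
    "current-snapshot-id": 7125985432047681528,
    "snapshots": [
        {"snapshot-id": 5099196027648958107},
        {
            "snapshot-id": 7125985432047681528,
            "parent-snapshot-id": 5099196027648958107,
            "timestamp-ms": 1617953528158,
        },
    ],
}

current = find_current_snapshot(sample_metadata)
print(current["snapshot-id"], current["parent-snapshot-id"])
```

For the real file, `find_current_snapshot(load_table_metadata(local_path))` does the same thing, where `local_path` is wherever the file named by metadata_location was saved.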

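For an explanation of each field in the file, see [1]; covering them all here would take too long.
We can see that the current snapshot ID is 7125985432047681528. Searching the file for this ID turns up the following entry: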

    {
      "snapshot-id": 7125985432047681528,
      "parent-snapshot-id": 5099196027648958107,
      "timestamp-ms": 1617953528158,
      "summary": {
        "operation": "append",
        "flink.job-id": "f4e22ca9fe284270e6eed311bbdb2acb",
        "flink.max-committed-checkpoint-id": "114",