GlusterFS: Self-Heal Source Code Analysis

liuhong1123

License: CC 4.0 BY-SA

Category: GlusterFS file system research

Article link: https://blog.csdn.net/liuhong1123/article/details/8118305

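This article analyzes the GlusterFS self-heal mechanism in detail, covering the repair process, a reading of the source code, an analysis of the algorithms, and the scenarios that trigger repair. Data repair is either full or differential (diff), and self-heal covers file content, metadata, and directory entries. The repair flow consists of metadata comparison, source selection, and the repair operation itself; it is triggered mainly during opendir and lookup operations.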

1. Overview

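GlusterFS self-heal is event-triggered. The items it repairs are file content (data), metadata, and directory entries (entry). File content can be repaired in two ways: a full repair of the whole file (full) or a differential repair (diff). In either case the file is not copied in one pass; it is copied block by block. Roughly, the file is read to the client from a node that holds a complete copy and is then written to the node whose copy is incomplete.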

Figure: data self-heal diagram

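Repair of file content, metadata, and directory entries goes through three main steps:

1. Check the metadata returned by the query and decide whether a repair needs to be triggered;

2. Apply a set of rules to select one brick, out of all the bricks in the logical volume, as the source;

3. Using the source as the template, repair the other nodes.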
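Step 1 can be illustrated with a small sketch. It is a simplified model and not the AFR implementation: each brick's lookup reply is represented as a plain dictionary, the field names (`type`, `size`, `mode`, `uid`, `gid`, `xattrs`) and the function name `decide_self_heal` are assumptions made for the example, and a reply of `None` is assumed to mean that the brick has no such entry. Only three of the flags that AFR keeps (metadata_self_heal, data_self_heal, missing_entry_self_heal) are computed.

```python
def decide_self_heal(replies):
    """Return the self-heal flags implied by per-brick lookup replies.

    replies: one dict per brick, or None when the brick has no such entry
    (hypothetical model, see the assumptions in the text above).
    """
    flags = {
        "metadata_self_heal": False,
        "data_self_heal": False,
        "missing_entry_self_heal": False,
    }
    present = [r for r in replies if r is not None]

    # The entry exists on some bricks but not on all of them.
    if present and len(present) < len(replies):
        flags["missing_entry_self_heal"] = True

    # With fewer than two copies there is nothing to compare.
    if len(present) < 2:
        return flags

    first = present[0]
    for other in present[1:]:
        # Any difference in ownership, permissions or xattrs marks metadata.
        for key in ("mode", "uid", "gid", "xattrs"):
            if other[key] != first[key]:
                flags["metadata_self_heal"] = True
        # For regular files, differing sizes mark the data for repair.
        if first["type"] == "file" and other["size"] != first["size"]:
            flags["data_self_heal"] = True
    return flags


good = {"type": "file", "size": 4142, "mode": 0o755, "uid": 0, "gid": 0, "xattrs": {}}
stale = dict(good, size=0)
print(decide_self_heal([good, stale]))
```

In this model a 0-byte replica next to a 4142-byte replica raises only the data flag, because the ownership and permission fields already agree.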

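As mentioned above, GlusterFS self-heal is event-driven, and there are two kinds of events. First, entering a directory triggers a data integrity check and, where needed, self-heal. Second, a lookup is performed before a file is read or written, and that lookup triggers self-heal. Both paths end up calling afr_self_heal to start the repair.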

Figure: self-heal during file lookup

2. Functional Testing

Scenario: create a replicated volume with only two bricks, test5 and test6.

Extended attributes of the test5 directory:

>>> xattr.listxattr("test5")

(u'security.selinux', u'trusted.gfid', u'trusted.glusterfs.test', u'trusted.afr.v1-client-0', u'trusted.afr.v1-client-1')

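The first time a directory is entered, the opendir call triggers self-heal: all entries under that directory are repaired, both the entries themselves and their metadata.

Log: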

0-v1-replicate-0: background meta-data entry self-heal completed on /liu/test

Extended attributes of liu (a directory):

>>> xattr.listxattr("liu")

(u'security.selinux', u'trusted.gfid', u'trusted.afr.v1-client-0', u'trusted.afr.v1-client-1')

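The files under that directory are created as well and given basic metadata, but they are 0 bytes long and much of their other metadata has not been repaired yet.

Metadata before repair: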

>>> xattr.listxattr("missing")

(u'security.selinux', u'trusted.gfid')  # the trusted.gfid extended attribute is added at mknod time

Metadata after repair:

>>> xattr.listxattr("missing")

(u'security.selinux', u'trusted.gfid', u'trusted.afr.v1-client-0', u'trusted.afr.v1-client-1')

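Log after repair:

0-v1-replicate-0: background meta-data data self-heal completed on /missing

Note: as the "meta-data data self-heal" wording in the log line above shows, healing a file repairs both its data and its metadata.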

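Lookup triggers repair in the following cases:

- If the target is a directory, only the entries under that directory are repaired;

- If the target is a file, that file is repaired.

Whenever an entry or a file is repaired, its metadata, both attr and xattr, is repaired too.

For example, the files THANKS and py-compile and the directories 1-9 had been deleted from test6. After repair, their metadata matches that of test5: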

Under test6:

-rwxr-xr-x. 1 root root 0 Dec 31 16:42 py-compile (0 bytes: the file has only been created; its content has not been fully repaired yet)

-rw-r--r--. 1 root root 0 Dec 30 14:37 THANKS

drwxr-xr-x. 2 root root 4096 Jan  4 15:45 1

Under test5:

-rwxr-xr-x. 1 root root 4142 Dec 31 16:42 py-compile

-rw-r--r--. 1 root root 90 Dec 30 14:37 THANKS

drwxr-xr-x. 3 root root 4096 Jan  4 15:45 1

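During repair, the field values of the iatt structure are frequently obtained through lookup or fstat operations. They are held in an object named buf, whose contents look like this: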

$9 = {ia_ino = 1, ia_gfid = '\000' <repeats 15 times>, "\001", ia_dev = 2049, ia_type = IA_IFDIR, ia_prot = {suid = 0 '\000', sgid = 0 '\000', sticky = 0 '\000', owner = {read = 1 '\001', write = 1 '\001', exec = 1 '\001'}, group = {read = 1 '\001', write = 0 '\000', exec = 1 '\001'}, other = {read = 1 '\001', write = 0 '\000', exec = 1 '\001'}}, ia_nlink = 2, ia_uid = 0, ia_gid = 0, ia_rdev = 0, ia_size = 4096, ia_blksize = 4096, ia_blocks = 16, ia_atime = 1326072998, ia_atime_nsec = 76224397, ia_mtime = 1326072014, ia_mtime_nsec = 464975995, ia_ctime = 1326072259, ia_ctime_nsec = 344098899}

Extended attributes are generally organized in a dict structure:

$12 = {is_static = 0 '\000', hash_size = 1, count = 2, refcount = 0, members = 0x7fd6fc000950, members_list = 0x7fd6fc000a60, extra_free = 0x0, extra_stdfree = 0x0, lock = 1}

The contents of each key/value pair:

(gdb) p *xattr->members[0]

$25 = {hash_next = 0x7fd6fc0009c0, prev = 0x0, next = 0x7fd6fc0009c0, value = 0x7fd6fc000a30,

key = 0x7fd6fc000c20 "trusted.afr.v1-client-1"}

The data inside struct _data:

(gdb) p *xattr->members[0]->value

$26 = {is_static = 0 '\000', is_const = 0 '\000', is_stdalloc = 0 '\000', len = 12, vec = 0x0, data = 0x7fd6fc000a10 "", refcount = 1, lock = 1}
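The 12-byte length (len = 12) of the trusted.afr.v1-client-1 value is consistent with the AFR changelog layout of three 32-bit counters in network byte order, recording pending data, metadata, and entry operations in that order. A minimal sketch for decoding such a value, assuming that layout; the function name `decode_afr_changelog` and the sample bytes are made up for illustration:

```python
import struct


def decode_afr_changelog(value: bytes) -> dict:
    """Split a 12-byte trusted.afr.* value into its three pending counters."""
    if len(value) != 12:
        raise ValueError("expected 12 bytes, got %d" % len(value))
    # ">III" reads three unsigned 32-bit integers in big-endian order.
    data, metadata, entry = struct.unpack(">III", value)
    return {"data": data, "metadata": metadata, "entry": entry}


# Hypothetical value: one pending data operation, two pending entry operations.
example = bytes.fromhex("000000010000000000000002")
print(decode_afr_changelog(example))
```

An all-zero value means the brick holding the attribute records nothing pending against the brick named in the key.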

When a directory is read, each returned entry contains the following data:

$14 = { {list = {next = 0x1d3b3f0, prev = 0x1d3d240}, {next = 0x1d3b3f0, prev = 0x1d3d240}}, d_ino = 140273059024932,

d_off = 140273139813794, d_len = 333631280, d_type = 32767, d_stat = {ia_ino = 1,

ia_gfid = "0\317\342\023\377\177\000\000\030\274\206\342\223\177\000", ia_dev = 11, ia_type = IA_IFREG, ia_prot = {

suid = 0 '\000', sgid = 0 '\000', sticky = 0 '\000', owner = {read = 0 '\000', write = 0 '\000', exec = 0 '\000'}, group = {

read = 0 '\000', write = 0 '\000', exec = 0 '\000'}, other = {read = 0 '\000', write = 0 '\000', exec = 0 '\000'}},

ia_nlink = 0, ia_uid = 0, ia_gid = 0, ia_rdev = 0, ia_size = 0, ia_blksize = 0, ia_blocks = 0, ia_atime = 3791377312,

ia_atime_nsec = 32659, ia_mtime = 0, ia_mtime_nsec = 0, ia_ctime = 0, ia_ctime_nsec = 0},

d_name = 0x7fff13e2ce50 "\360\263\323\001"}

The entry list parameter:

(gdb) p entries->list

$9 = {next = 0x1d3b3f0, prev = 0x1d3d240}

3. Source Code Analysis

3.1. Repair Process Analysis

The repair process can be roughly divided into the following steps:

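1. Decide whether a repair is needed:

This step compares the metadata returned by each brick, mainly the file type, file size, file permissions, and extended attributes, to see whether the corresponding values agree. If they do not, the need for repair is recorded in the following flags:

metadata_self_heal; // repair metadata, including file attributes and extended attributes

data_self_heal; // repair file content; triggered when the replica sizes differ

entry_self_heal; // repair directory entries

gfid_self_heal;

missing_entry_self_heal;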

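2. Mark the source nodes according to the transaction type. The sources are computed as follows:

1) For a metadata repair in which all nodes are innocent, the replica with the smallest ia_uid is chosen as the source (why the smallest ia_uid is chosen remains an open question);

2) If a wise replica exists and is not accused by any other replica, the node is marked as wise. If every wise replica conflicts with another one, no usable wise replica remains: this is a split-brain, no source can be selected, and an EIO error is returned. If there are wise replicas without conflicts, all of them can serve as sources;

3) If no wise node exists, the fool that accuses the largest number of other nodes is chosen as the source;

4) If none of the three rules above yields a source, the bricks that are currently running are picked out of all the bricks, and all of them can serve as sources;

5) The first of the sources is chosen as the template.

3. Using the source as the template, start the automatic repair.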
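The source selection rules in step 2 can be sketched as follows. This is an illustration under stated assumptions and not the AFR code: `pending[i][j]` is assumed to hold the count that brick i records against brick j; a brick whose row is all zeros is treated as innocent, one that accuses itself as a fool, and one that accuses only others as wise; "accused" in rule 2 is interpreted as "accused by another wise replica". The function names and parameters are hypothetical.

```python
import errno


def classify(pending, i):
    """Classify brick i as 'innocent', 'fool' or 'wise' from its own row."""
    row = pending[i]
    if not any(row):
        return "innocent"
    if row[i] != 0:
        return "fool"
    return "wise"


def select_sources(pending, up, uids=None, metadata_heal=False):
    """Return the list of source brick indexes; the first one is the template."""
    n = len(pending)
    kinds = [classify(pending, i) for i in range(n)]

    # Rule 1: metadata heal with only innocent bricks, pick the smallest ia_uid.
    if metadata_heal and uids is not None and all(k == "innocent" for k in kinds):
        return [min(range(n), key=lambda i: uids[i])]

    # Rule 2: wise bricks that no other wise brick accuses are sources.
    wise = [i for i in range(n) if kinds[i] == "wise"]
    if wise:
        sources = [i for i in wise
                   if not any(pending[j][i] for j in wise if j != i)]
        if not sources:
            raise OSError(errno.EIO, "split-brain: all wise replicas conflict")
        return sources

    # Rule 3: no wise brick, take the fool that accuses the most other bricks.
    fools = [i for i in range(n) if kinds[i] == "fool"]
    if fools:
        def accusations(i):
            return sum(1 for j in range(n) if j != i and pending[i][j])
        return [max(fools, key=accusations)]

    # Rule 4: fall back to every brick that is currently running.
    return [i for i in range(n) if up[i]]


# Brick 0 records three pending operations against brick 1.
print(select_sources([[0, 3], [0, 0]], up=[True, True]))
```

With two bricks that each accuse the other, the same function raises an OSError carrying EIO, which corresponds to the split-brain case in rule 2.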

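3.2. Read Subvolume Selection

1. First, use the source selection procedure to pick a set of sources and store them in the source array;

2. Get prev_read_child = local->read_child_index;

3. Get config_read_child = priv->read_child;

4. If prev_read_child is in the source array, use prev_read_child as read_child;

5. Otherwise, if config_read_child is in the source array, use config_read_child as read_child;

6. If neither prev_read_child nor config_read_child is in the source array, choose the first node in success_child that is in the source array as read_child;

7. Record read_child in ctx, where reads and other operations can use it.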
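The selection order in steps 4 to 6 can be written as a short function. This is a sketch under assumptions: the source array is modeled as a list of booleans indexed by child, the function name `select_read_child` is hypothetical, and returning -1 when no candidate qualifies is a choice made for the example.

```python
def select_read_child(sources, prev_read_child, config_read_child, success_children):
    """Pick the child to read from, following the priority described above.

    sources: list of booleans, sources[i] is True when child i is a source.
    Returns the chosen child index, or -1 if no child qualifies (assumption).
    """
    def is_source(idx):
        return 0 <= idx < len(sources) and sources[idx]

    # Step 4: keep the previously used read child when it is still a source.
    if is_source(prev_read_child):
        return prev_read_child
    # Step 5: otherwise prefer the configured read child.
    if is_source(config_read_child):
        return config_read_child
    # Step 6: otherwise take the first successful child that is a source.
    for child in success_children:
        if is_source(child):
            return child
    return -1


# Child 1 is the only source, so neither child 0 nor an unset (-1) config is used.
print(select_read_child([False, True], 0, -1, [0, 1]))
```

Keeping the previous read child first avoids switching replicas between reads when the earlier choice is still valid.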

3.3. Algorithm Analysis

Official description:

a) "Full" algorithm – this algorithm copies the entire file data in order to heal the out-of-sync copy. This algorithm is used when a file has to be created from scratch on a server.

b) "Diff" algorithm – this algorithm compares blocks present on both servers and copies only those blocks that are different from the correct copy to the out-of-sync copy. This algorithm is used when files have to be re-built partially.

The "Diff" algorithm is especially beneficial for situations such as running VM images, where self-heal of a recovering replicated copy of the image will occur much faster because only the changed blocks need to be synchronized.

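For file content repair, GlusterFS provides two algorithms, full and diff. The system default is full; the algorithm type can be changed in the volume configuration file.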
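The difference between the two algorithms can be shown on in-memory byte strings. This is only a sketch of the idea in the official description, not the GlusterFS implementation: the tiny block size, the use of SHA-256 as the block checksum, and the function names `heal_full` and `heal_diff` are all assumptions chosen for illustration. Each function returns the healed copy together with the number of blocks it had to write.

```python
import hashlib

BLOCK_SIZE = 4  # deliberately tiny so the example is easy to follow


def heal_full(source: bytes, block_size: int = BLOCK_SIZE):
    """Rebuild the sink from scratch, copying every block of the source."""
    sink = bytearray()
    written = 0
    for off in range(0, len(source), block_size):
        sink += source[off:off + block_size]
        written += 1
    return bytes(sink), written


def heal_diff(source: bytes, sink: bytes, block_size: int = BLOCK_SIZE):
    """Copy only the blocks whose checksum differs from the source."""
    # Bring the sink to the source's length first (truncate or zero-extend).
    healed = bytearray(sink[:len(source)])
    healed.extend(b"\0" * (len(source) - len(healed)))
    written = 0
    for off in range(0, len(source), block_size):
        src_block = source[off:off + block_size]
        dst_block = bytes(healed[off:off + block_size])
        if hashlib.sha256(src_block).digest() != hashlib.sha256(dst_block).digest():
            healed[off:off + block_size] = src_block
            written += 1
    return bytes(healed), written


good_copy = b"aaaabbbbcccc"
stale_copy = b"aaaaXXXXcccc"
print(heal_full(good_copy)[1], heal_diff(good_copy, stale_copy)[1])
```

For a stale copy that differs in a single block, the diff variant writes one block where the full variant writes all three, which is the benefit the official description points out for large files such as VM images.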

In the algorithm code, the program first selects one of the algorithms:

struct afr_sh_algorithm afr_self_heal_algorithms[] = {

{.name = "full", .fn = afr_sh_algo_full},

{.name = "diff", .fn = afr_sh_algo_diff},