When clients hit connect timeouts and command timeouts, work through the following causes; this article focuses on cause 4:
1. Network problems
2. Slow queries
3. Oversized values
4. AOF rewrite
Investigating the Redis server
Redis configuration:
auto-aof-rewrite-percentage: 100%
no-appendfsync-on-rewrite: no
appendfsync: everysec
1. The Redis log reads:
20949:M 03 Jul 12:28:02.956 * Starting automatic rewriting of AOF on 100% growth
20949:M 03 Jul 12:28:03.080 * Background append only file rewriting started by pid 6394
20949:M 03 Jul 12:30:06.777 * Background AOF buffer size: 80 MB
20949:M 03 Jul 12:30:14.046 * Asynchronous AOF fsync is taking too long (disk is busy?). Writing the AOF buffer without waiting for fsync to complete, this may slow down Redis.
20949:M 03 Jul 12:30:15.408 * Background AOF buffer size: 180 MB
20949:M 03 Jul 12:30:24.953 * Background AOF buffer size: 280 MB
20949:M 03 Jul 12:30:41.336 * AOF rewrite child asks to stop sending diffs.
6394:C 03 Jul 12:30:41.336 * Parent agreed to stop sending diffs. Finalizing AOF...
6394:C 03 Jul 12:30:41.336 * Concatenating 97.17 MB of AOF diff received from parent.
6394:C 03 Jul 12:30:46.735 * SYNC append only file rewrite performed
6394:C 03 Jul 12:30:46.819 * AOF rewrite: 542 MB of memory used by copy-on-write
20949:M 03 Jul 12:30:46.958 * Background AOF rewrite terminated with success
20949:M 03 Jul 12:30:59.909 * Residual parent diff successfully flushed to the rewritten AOF (243.96 MB)
20949:M 03 Jul 12:30:59.911 * Background AOF rewrite finished successfully
20949:M 03 Jul 12:31:00.047 * Asynchronous AOF fsync is taking too long (disk is busy?). Writing the AOF buffer without waiting for fsync to complete, this may slow down Redis.
The log shows that:
1. Redis started an automatic AOF rewrite.
2. The AOF fsync took too long to complete because the disk was busy.
3. Redis then wrote the AOF buffer without waiting for the fsync to finish, which is exactly the situation in which it can end up blocking other client commands.
First, remember that Redis handles commands on a single thread. With AOF persistence enabled, Redis calls write(2) after processing each write event to append the change to the AOF buffer and file; if that write(2) blocks, Redis cannot serve any other command. On Linux, if a write(2) is issued against a file on which an fdatasync(2) is currently flushing buffers to the physical disk, the write(2) is blocked until the sync finishes, and the whole single-threaded Redis is blocked with it.
So when disk I/O is busy (an AOF rewrite, or an RDB save flushing to disk), the fdatasync(2) of the AOF buffer takes longer, the next write(2) queues behind it, and Redis stalls on all other commands.
Redis therefore tweaks its flushing strategy slightly: when it sees an fdatasync(2) already in flight, it skips the write(2) and keeps the new data in its own buffer, avoiding the block. If this situation persists for more than two seconds, however, it bites the bullet and calls write(2) anyway, even though Redis may block. That is when the fatal log line is printed: "Asynchronous AOF fsync is taking too long (disk is busy?). Writing the AOF buffer without waiting for fsync to complete, this may slow down Redis." and aof_delayed_fsync (the count of times an fsync(2) has blocked write(2) and thus blocked Redis) is incremented. The most precise statement about possible data loss with appendfsync set to everysec is therefore: if an fdatasync(2) has been running for a long time, an unexpected Redis shutdown can lose at most two seconds of data from the AOF file; if fdatasync(2) is completing normally, an unexpected Redis shutdown loses nothing, and only an operating-system crash can lose up to one second of data.
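The aof_delayed_fsync counter mentioned above is exposed in the output of `INFO persistence`. A minimal sketch of watching it, parsing raw INFO text rather than using a client library (the sample text below is illustrative, not captured from this incident):

```python
# Parse the "# Persistence" section of `redis-cli INFO persistence` output
# and read aof_delayed_fsync -- the counter Redis bumps every time an
# in-flight fsync forced it to write the AOF buffer without waiting
# (the "Asynchronous AOF fsync is taking too long" log line above).
# SAMPLE_INFO is illustrative text, not captured from the incident.

SAMPLE_INFO = """\
# Persistence
aof_enabled:1
aof_rewrite_in_progress:0
aof_last_bgrewrite_status:ok
aof_delayed_fsync:3
"""

def parse_info(text):
    """Turn the key:value lines of INFO output into a dict of strings."""
    stats = {}
    for line in text.splitlines():
        if line and not line.startswith("#") and ":" in line:
            key, _, value = line.partition(":")
            stats[key] = value
    return stats

stats = parse_info(SAMPLE_INFO)
delayed = int(stats["aof_delayed_fsync"])
if delayed > 0:
    print(f"fsync(2) has blocked write(2) {delayed} time(s) since startup")
```

A rising aof_delayed_fsync between two samples is a direct signal that the disk cannot keep up with the per-second fsync cadence.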
2. Remedies:
1. Tune the kernel parameter vm.dirty_bytes (once this many bytes of page cache are dirty, the kernel starts writing them back, so dirty data is flushed in small increments instead of one large burst per second):
echo "vm.dirty_bytes=4194304" >> /etc/sysctl.conf
sysctl -p
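Before changing vm.dirty_bytes it is worth recording the current writeback thresholds. A small sketch that reads them from /proc (Linux-specific paths; on other systems the helper simply returns the supplied default):

```python
# Read the kernel's current writeback thresholds before changing them.
# /proc/sys/vm/* paths are Linux-specific.
from pathlib import Path

def read_vm_setting(name, default=None):
    """Return the integer value of /proc/sys/vm/<name>, or `default` if unreadable."""
    try:
        return int((Path("/proc/sys/vm") / name).read_text().strip())
    except (OSError, ValueError):
        return default

# dirty_bytes / dirty_ratio govern when writers are forced into
# synchronous writeback; the dirty_background_* pair governs when
# background writeback kicks in.
for name in ("dirty_bytes", "dirty_ratio",
             "dirty_background_bytes", "dirty_background_ratio"):
    print(name, "=", read_vm_setting(name))
```

Note that dirty_bytes and dirty_ratio are mutually exclusive: setting one zeroes the other, so check both before and after applying the sysctl change.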
2. Disable RDB or AOF.
3. Disable RDB and adjust the Redis parameters:
appendfsync: always (write the AOF buffer and fsync(2) it to the file after every event)
appendfsync: everysec (write the AOF buffer to the file after every event; fsync(2) once per second)
appendfsync: no (write the AOF buffer to the file after every event; leave fsync(2) timing to the operating system)
no-appendfsync-on-rewrite: yes (skip fsync(2) of the AOF buffer while an AOF rewrite is running; up to about 30 seconds of data can be lost)
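Remedy 3 corresponds to a redis.conf along these lines (a sketch with illustrative values, not the configuration from this incident):

```
# disable RDB snapshots
save ""

# keep AOF, fsync once per second
appendonly yes
appendfsync everysec

# while BGREWRITEAOF runs, skip fsync(2) of the main AOF so that write(2)
# never queues behind an in-flight fdatasync(2); trade-off: up to ~30 s
# of writes can be lost if the server crashes during the rewrite
no-appendfsync-on-rewrite yes
```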
4. Disable both RDB and AOF on the master and run persistence only on the replicas.
Note: the master must not restart automatically after a crash; an empty master coming back up would replicate its empty dataset and wipe out all data on the replicas.
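Remedy 4 splits persistence between master and replica. A sketch of the two configurations (hypothetical host name; adjust to the real topology):

```
# master redis.conf -- no persistence at all
save ""
appendonly no

# replica redis.conf -- persistence runs here instead
replicaof master-host 6379   # use "slaveof" on Redis < 5
appendonly yes
appendfsync everysec
```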
A good write-up of a related incident: https://blog.csdn.net/crisschan/article/details/51514087