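1. Symptom Description
All production data centers currently run Calico in IPIP mode with a full BGP mesh (every node establishes a BGP session directly with every other node to exchange routes). When an ingress node in the k8s cluster goes down, the whole process from host failure to full recovery takes 4 minutes, far longer than the normal recovery time after a node failure.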
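To give a rough sense of scale for the full-mesh design described above: with n nodes, each node peers with n - 1 others, for n(n - 1)/2 sessions in total. The sketch below only does that arithmetic; the node counts are illustrative assumptions, not figures from the production environment.

```python
def sessions_per_node(node_count: int) -> int:
    """BGP sessions each node maintains in a full mesh."""
    if node_count < 0:
        raise ValueError("node_count must be non-negative")
    return max(node_count - 1, 0)


def full_mesh_sessions(node_count: int) -> int:
    """Total BGP sessions in a full mesh: n * (n - 1) / 2."""
    if node_count < 0:
        raise ValueError("node_count must be non-negative")
    return node_count * (node_count - 1) // 2


if __name__ == "__main__":
    # Illustrative cluster sizes (assumed, not measured).
    for n in (10, 50, 100, 500):
        print(f"nodes={n} per_node={sessions_per_node(n)} total={full_mesh_sessions(n)}")
```

Because the per-node session count grows linearly with cluster size, a restarted node has more peers to re-establish as the cluster grows.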
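2. Component Details
Felix:
Runs as a daemon on every node in the cluster and provides four key capabilities: interface management, route programming, ACL programming, and state reporting.
BIRD:
Gets routes from Felix and distributes them to the BGP peers on the network for inter-host routing. Like Felix, it runs on every node in the cluster. It provides two key capabilities: route distribution and BGP route reflector configuration.
Confd:
An open-source, lightweight configuration management tool. It watches the Calico datastore for changes to the BGP configuration and global defaults, dynamically regenerates the BIRD configuration files when the data changes, and triggers BIRD to reload them.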
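The watch, render, reload cycle described for Confd can be sketched roughly as follows. This is a simplified simulation, not Calico's actual code or templates: `render_bird_peers`, `ConfigSyncer`, the peer addresses other than 172.20.30.75, and the AS number 64512 are all assumptions made for illustration, and the generated text only resembles a BIRD peer section.

```python
import hashlib


def render_bird_peers(local_ip, as_number, peer_ips):
    """Render a simplified BIRD-style peer section (illustrative only)."""
    blocks = []
    for ip in sorted(peer_ips):
        if ip == local_ip:
            continue  # a node does not peer with itself
        name = "Mesh_" + ip.replace(".", "_")
        blocks.append(
            f"protocol bgp {name} {{\n"
            f"  local as {as_number};\n"
            f"  neighbor {ip} as {as_number};\n"
            f"}}"
        )
    return "\n".join(blocks) + "\n"


class ConfigSyncer:
    """Hypothetical stand-in for confd: re-render on change, reload only if the output differs."""

    def __init__(self, reload_callback):
        self._last_digest = None
        self._reload = reload_callback

    def on_datastore_update(self, local_ip, as_number, peer_ips):
        text = render_bird_peers(local_ip, as_number, peer_ips)
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest == self._last_digest:
            return False  # rendered config unchanged, no reload needed
        self._last_digest = digest
        self._reload(text)  # in a real system this would write the file and signal BIRD
        return True


if __name__ == "__main__":
    syncer = ConfigSyncer(lambda cfg: print("reload with:\n" + cfg))
    # Hypothetical peers; only 172.20.30.75 appears in the log in this write-up.
    syncer.on_datastore_update("172.20.30.75", 64512, ["172.20.30.75", "172.20.30.76"])
```

Comparing a digest of the rendered output, rather than the raw input, means a reordered but otherwise identical peer list does not trigger a reload.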
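3. Troubleshooting Process
The investigation covered three sources: host logs, ingress logs, and Calico logs.
The 4-minute delay was ultimately traced mainly to the BIRD component being slow to create network routes after the Calico service starts.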
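One way to locate a slow stage in a startup log like the one in this section is to compute the gaps between consecutive timestamped lines and look at the largest ones. The sketch below assumes the `YYYY-MM-DD HH:MM:SS.mmm [LEVEL][pid] ...` line format seen in the Calico log here; lines without a timestamp (such as the `bird:` messages) are skipped. The function names are hypothetical helpers, not part of any Calico tooling.

```python
import re
from datetime import datetime

# Matches lines such as:
# 2024-11-08 02:31:21.625 [INFO][9] startup.go 256: Early log level set to info
LINE_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}) \[(\w+)\]\[(\d+)\] (.*)$"
)


def parse_timestamped(lines):
    """Return (timestamp, message) pairs for lines that carry a timestamp."""
    entries = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match:
            ts = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S.%f")
            entries.append((ts, match.group(4)))
    return entries


def largest_gaps(lines, top=3):
    """Return the `top` largest gaps as (seconds, message_before, message_after)."""
    entries = parse_timestamped(lines)
    gaps = [
        ((ts - prev_ts).total_seconds(), prev_msg, msg)
        for (prev_ts, prev_msg), (ts, msg) in zip(entries, entries[1:])
    ]
    gaps.sort(key=lambda gap: gap[0], reverse=True)
    return gaps[:top]


if __name__ == "__main__":
    # Sample lines copied from the startup log in this section.
    sample = [
        '2024-11-08 02:31:21.761 [INFO][17] allocate_ipip_addr.go 91: IPIP tunnel address is still valid IP="192.168.86.128"',
        "Calico node started successfully",
        "2024-11-08 02:31:22.861 [INFO][40] config.go 105: Skipping confd config file.",
        "2024-11-08 02:31:22.866 [INFO][40] watchersyncer.go 89: Start called",
    ]
    for seconds, before, after in largest_gaps(sample):
        print(f"{seconds:.3f}s between '{before}' and '{after}'")
```

Run against the complete log rather than a short excerpt, the same approach would point at the lines between which most of the startup time elapses.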
Calico startup log showing the delay:
Threshold time for bird readiness check: 30s
2024-11-08 02:31:21.625 [INFO][9] startup.go 256: Early log level set to info
2024-11-08 02:31:21.626 [INFO][9] startup.go 274: Using stored node name from /var/lib/calico/nodename
2024-11-08 02:31:21.626 [INFO][9] startup.go 284: Determined node name: yonsuite-test-172-20-30-75
2024-11-08 02:31:21.655 [INFO][9] startup.go 97: Skipping datastore connection test
2024-11-08 02:31:21.657 [INFO][9] startup.go 509: IPv4 address 172.20.30.75 discovered on interface eth0
2024-11-08 02:31:21.657 [INFO][9] startup.go 647: No AS number configured on node resource, using global value
2024-11-08 02:31:21.657 [INFO][9] startup.go 149: Setting NetworkUnavailable to False
2024-11-08 02:31:21.664 [ERROR][9] startup.go 152: Unable to set NetworkUnavailable to False error=nodes "yonsuite-test-172-20-30-75" not found
2024-11-08 02:31:21.670 [INFO][9] startup.go 530: FELIX_IPV6SUPPORT is false through environment variable
2024-11-08 02:31:21.675 [INFO][9] startup.go 181: Using node name: yonsuite-test-172-20-30-75
Threshold time for bird readiness check: 30s
2024-11-08 02:31:21.761 [INFO][17] allocate_ipip_addr.go 91: IPIP tunnel address is still valid IP="192.168.86.128"
Calico node started successfully
bird: Unable to open configuration file /etc/calico/confd/config/bird.cfg: No such file or directory
bird: Unable to open configuration file /etc/calico/confd/config/bird6.cfg: No such file or directory
Threshold time for bird readiness check: 30s
2024-11-08 02:31:22.861 [INFO][40] config.go 105: Skipping confd config file.
2024-11-08 02:31:22.861 [INFO][40] run.go 17: Starting calico-confd
2024-11-08 02:31:22.866 [INFO][40] watchersyncer.go 89: Start called
2024-11-08 02:31:22.866 [INFO][40] client.go 183: CALICO_ADVERTISE_CLUSTER_IPS not specified, no cluster ips will be advertised
2024-11-08 02:31:22.950 [INFO][40] client.go 319: RouteGenerator has indicated it is in sync
2024-11-08 02:31:22.950 [INFO][40