Calico BGP Communication Analysis
BGP network model
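- Compared with the IPIP network, the biggest difference in the BGP network is that the tunnel device tunl0 is not involved. Pod-to-pod traffic is routed directly from the source host to the destination host, which is the next hop and is resolved via ARP, so the tunl0 step is skipped.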
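BGP has two modes:
- Full mesh mode (node-to-node mesh): every BGP speaker must establish a BGP session with every other BGP speaker. The total is N(N-1)/2 sessions, which grows on the order of N^2 and consumes a large number of connections as the cluster grows. The official guidance is not to use this mode for clusters of more than about 100 nodes.
- Route reflector mode (RR): one or more BGP speakers are designated as route reflectors that peer with the other speakers in the network. Each speaker only needs a BGP session with the route reflectors to learn the routes of the whole network. In Calico, RR mode can be implemented with a global peer.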
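To make the difference between the two modes concrete, here is a minimal sketch that counts BGP sessions. The function names and the reflector layout (every non-reflector node peers with every reflector, and the reflectors peer with each other) are simplifying assumptions for illustration, not something Calico prescribes.

```python
def full_mesh_sessions(n: int) -> int:
    """Sessions when every speaker peers with every other speaker: N(N-1)/2."""
    return n * (n - 1) // 2


def route_reflector_sessions(n: int, reflectors: int = 1) -> int:
    """Sessions in an assumed layout where each client peers with every
    reflector and the reflectors form a small mesh among themselves."""
    clients = n - reflectors
    return clients * reflectors + reflectors * (reflectors - 1) // 2


for nodes in (10, 100, 500):
    print(nodes, full_mesh_sessions(nodes), route_reflector_sessions(nodes, reflectors=2))
```

With two reflectors the session count grows linearly with the number of nodes, while the full mesh grows quadratically.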
How to enable BGP
# IP-in-IP mode is controlled by the CALICO_IPV4POOL_IPIP environment variable: Always enables IPIP, Never disables it so that plain BGP routing is used
- name: CALICO_IPV4POOL_IPIP
  value: "Never"
Test container YAML
Host | IP |
---|---|
k8s-master-1 | 192.168.0.11/24 |
k8s-node-1 | 192.168.0.12/24 |
apiVersion: v1
kind: Service
metadata:
  name: busybox
  namespace: devops
spec:
  selector:
    app: busybox
  type: NodePort
  ports:
    - name: http
      port: 8888
      protocol: TCP
      targetPort: 80
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: busybox
  namespace: devops
spec:
  replicas: 2
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app: busybox
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  template:
    metadata:
      name: busybox
      labels:
        app: busybox
    spec:
      affinity: # keep the two busybox replicas on different nodes
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - topologyKey: kubernetes.io/hostname
              labelSelector:
                matchExpressions:
                  - key: app
                    operator: In
                    values:
                      - busybox
      restartPolicy: Always
      containers:
        - command: ["/bin/sh","-c","mkdir -p /var/lib/www && httpd -f -v -p 80 -h /var/lib/www"]
          name: busybox
          image: docker.io/library/busybox:latest
          imagePullPolicy: IfNotPresent
          ports:
            - name: http
              containerPort: 80
BGP cross-host analysis
- Check the Pod information
╰─ kubectl get pods -n devops -o custom-columns=NAME:.metadata.name,IP:.status.podIP,HOST:.spec.nodeName
NAME IP HOST
busybox-77649b9c55-fv298 172.16.196.1 k8s-master-1
busybox-77649b9c55-s7zfv 172.16.109.65 k8s-node-1
jenkins-56b6774bb6-d8v8b 172.16.109.66 k8s-node-1
- Enter the busybox-77649b9c55-fv298 container on k8s-master-1 and check its addresses and routes
/ # ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: tunl0@NONE: <NOARP> mtu 1480 qdisc noop qlen 1000
link/ipip 0.0.0.0 brd 0.0.0.0
4: eth0@if5: <BROADCAST,MULTICAST,UP,LOWER_UP,M-DOWN> mtu 1500 qdisc noqueue
link/ether 0e:1c:2c:f6:2a:f9 brd ff:ff:ff:ff:ff:ff
inet 172.16.196.1/32 scope global eth0
valid_lft forever preferred_lft forever
inet6 fe80::c1c:2cff:fef6:2af9/64 scope link
valid_lft forever preferred_lft forever
/ # route -n
Kernel IP routing table
Destination Gateway Genmask Flags Metric Ref Use Iface
0.0.0.0 169.254.1.1 0.0.0.0 UG 0 0 0 eth0
169.254.1.1 0.0.0.0 255.255.255.255 UH 0 0 0 eth0
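From the output above, the busybox-77649b9c55-fv298 container on k8s-master-1 has a default gateway of 169.254.1.1, yet no interface anywhere in the network carries that address.
- The routing table shows 169.254.1.1 as the container's default gateway, but no NIC owns this IP. When a packet's destination is not the local host, the routing table is consulted. Once the gateway is found, the sender first obtains the gateway's MAC address with an ARP broadcast and then sets the destination MAC of the outgoing frame to the gateway's MAC. The gateway's IP address never appears in any packet header. In other words, nobody cares what this IP address actually is; it only has to resolve to a MAC address, so something has to answer the ARP request.
- In a Kubernetes Calico network, when the destination is outside the container's own network, the container first sends an ARP broadcast for 169.254.1.1. The host end of the veth pair answers on behalf of the gateway with its own MAC address (proxy ARP, effectively ARP spoofing), and all subsequent traffic goes through this veth pair. This keeps ARP broadcasts from spreading, and proxy ARP also makes cross-network access possible.
- Look at the MAC address: it appears to be hard-coded by Calico, and it still answers ARP. Normally the kernel sends an ARP request asking the whole layer 2 network who owns 169.254.1.1, and the owner replies with its MAC address. Here the situation is awkward: neither the container nor the host owns this IP, and even the host-side interface calixxxxx has only the meaningless MAC address ee:ee:ee:ee:ee:ee.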
- Neighbor entry inside the container and NIC information on the k8s-master-1 host
/ # ip neigh
169.254.1.1 dev eth0 lladdr ee:ee:ee:ee:ee:ee used 0/0/0 probes 1 STALE
# NIC information on k8s-master-1 itself
[root@k8s-master-1 ~]# ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: ens160: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
link/ether 00:0c:29:b1:02:f0 brd ff:ff:ff:ff:ff:ff
altname enp2s0
inet 192.168.0.11/24 brd 192.168.0.255 scope global noprefixroute ens160
valid_lft forever preferred_lft forever
inet6 fe80::20c:29ff:feb1:2f0/64 scope link noprefixroute
valid_lft forever preferred_lft forever
3: tunl0@NONE: <NOARP> mtu 1480 qdisc noop state DOWN group default qlen 1000
link/ipip 0.0.0.0 brd 0.0.0.0
4: kube-ipvs0: <BROADCAST,NOARP> mtu 1500 qdisc noop state DOWN group default
link/ether c2:25:66:ef:00:10 brd ff:ff:ff:ff:ff:ff
inet 10.96.0.1/32 scope global kube-ipvs0
valid_lft forever preferred_lft forever
inet 10.96.6.244/32 scope global kube-ipvs0
valid_lft forever preferred_lft forever
inet 10.96.0.10/32 scope global kube-ipvs0
valid_lft forever preferred_lft forever
inet 10.96.35.201/32 scope global kube-ipvs0
valid_lft forever preferred_lft forever
5: cali1dadcdd5b31@if4: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default # host-side peer of the busybox container's eth0
link/ether ee:ee:ee:ee:ee:ee brd ff:ff:ff:ff:ff:ff link-netns cni-989b68fd-d10b-b11b-781e-18feb8b85b12
inet6 fe80::ecee:eeff:feee:eeee/64 scope link
valid_lft forever preferred_lft forever
8: cali42cd276b2be@if4: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default # belongs to another container (coredns)
link/ether ee:ee:ee:ee:ee:ee brd ff:ff:ff:ff:ff:ff link-netns cni-1328bbde-1a3c-a60c-9c3d-f4e1c2fbb3cd
inet6 fe80::ecee:eeff:feee:eeee/64 scope link
valid_lft forever preferred_lft forever
# Check the proxy ARP setting of the cali1dadcdd5b31 interface
[root@k8s-master-1 ~]# cat /proc/sys/net/ipv4/conf/cali1dadcdd5b31/proxy_arp
1
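- Through the veth pair the ARP request reaches the peer calixxx interface. Because proxy ARP is enabled on calixxx, it answers every ARP request, so all of the container's packets are sent to calixxx, that is, into the host network stack, which then uses the host routing table to forward them to the next hop. This can be checked with cat /proc/sys/net/ipv4/conf/calixxx/proxy_arp, which prints 1 for every cali interface.
- Calico steers all workload traffic to the special gateway 169.254.1.1 and thereby onto the host's calixxx device, so that what would otherwise be a mix of layer 2 and layer 3 traffic is all forwarded at layer 3.
- ARP is answered by proxy ARP on the host, so ARP broadcasts are confined to the host. This prevents broadcast storms and avoids ARP table bloat.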
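The reason every packet ends up at 169.254.1.1 is that the pod's eth0 carries a /32 address, so no other address is on-link. The following sketch models only that decision; `next_hop` is a hypothetical helper, and the /26 case is a made-up contrast, not a configuration Calico uses.

```python
import ipaddress


def next_hop(dst: str, iface_cidr: str, default_gw: str) -> str:
    """Return the IP whose MAC address the sender has to resolve via ARP."""
    on_link = ipaddress.ip_address(dst) in ipaddress.ip_interface(iface_cidr).network
    return dst if on_link else default_gw


# eth0 in the pod is 172.16.196.1/32, so nothing but the pod itself is on-link
print(next_hop("172.16.109.65", "172.16.196.1/32", "169.254.1.1"))  # pod on another node
print(next_hop("172.16.196.2", "172.16.196.1/32", "169.254.1.1"))   # pod on the same node
# Hypothetical contrast: with a /26 on eth0 the pod would ARP for its neighbour directly
print(next_hop("172.16.196.2", "172.16.196.1/26", "169.254.1.1"))
```

In both real cases the pod resolves the gateway, and the proxy ARP reply from the cali interface supplies ee:ee:ee:ee:ee:ee.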
busybox-77649b9c55-fv298 on k8s-master-1 pings busybox-77649b9c55-s7zfv on k8s-node-1
# Neighbor (ARP) entries in the busybox container on k8s-master-1 (empty)
/ # ip neigh show
# busybox on k8s-master-1 pings busybox on k8s-node-1
/ # ping -c 1 172.16.109.65
PING 172.16.109.65 (172.16.109.65): 56 data bytes
64 bytes from 172.16.109.65: seq=0 ttl=62 time=1.603 ms
--- 172.16.109.65 ping statistics ---
1 packets transmitted, 1 packets received, 0% packet loss
round-trip min/avg/max = 1.603/1.603/1.603 ms
# Check the ARP table
/ # arp -n
? (169.254.1.1) at ee:ee:ee:ee:ee:ee [ether] on eth0
# Check the current interface addresses of busybox on k8s-master-1
/ # ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: tunl0@NONE: <NOARP> mtu 1480 qdisc noop qlen 1000
link/ipip 0.0.0.0 brd 0.0.0.0
4: eth0@if7: <BROADCAST,MULTICAST,UP,LOWER_UP,M-DOWN> mtu 1480 qdisc noqueue
link/ether 86:9c:03:9e:db:9f brd ff:ff:ff:ff:ff:ff
inet 172.16.196.1/32 scope global eth0
valid_lft forever preferred_lft forever
inet6 fe80::849c:3ff:fe9e:db9f/64 scope link
valid_lft forever preferred_lft forever
# Check the routing table on k8s-master-1
[root@k8s-master-1 ~]# route -n
Kernel IP routing table
Destination Gateway Genmask Flags Metric Ref Use Iface
0.0.0.0 192.168.0.2 0.0.0.0 UG 100 0 0 ens160
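172.16.109.64 192.168.0.12 255.255.255.192 UG 0 0 0 ens160 # route to the pod block of the other node
172.16.196.0 0.0.0.0 255.255.255.192 U 0 0 0 * # blackhole route: traffic to addresses in the local pod block that have no more specific route is dropped
172.16.196.1 0.0.0.0 255.255.255.255 UH 0 0 0 cali1dadcdd5b31 # host-side peer of the busybox container
172.16.196.2 0.0.0.0 255.255.255.255 UH 0 0 0 cali42cd276b2be # another container (coredns); each local pod gets its own route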
192.168.0.0 0.0.0.0 255.255.255.0 U 100 0 0 ens160
# Check the routes in detail
[root@k8s-master-1 ~]# ip route show
default via 192.168.0.2 dev ens160 proto static metric 100
172.16.109.64/26 via 192.168.0.12 dev ens160 proto bird
blackhole 172.16.196.0/26 proto bird
172.16.196.1 dev cali1dadcdd5b31 scope link
172.16.196.2 dev cali42cd276b2be scope link
192.168.0.0/24 dev ens160 proto kernel scope link src 192.168.0.11 metric 100
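The routing table above is evaluated by longest-prefix match. As a rough sketch (the `ROUTES` list is copied by hand from the `ip route show` output, and `lookup` is a hypothetical helper, not a kernel API), the decision for three destinations looks like this:

```python
import ipaddress

# (prefix, next hop, device) taken from `ip route show` on k8s-master-1
ROUTES = [
    ("0.0.0.0/0", "192.168.0.2", "ens160"),
    ("172.16.109.64/26", "192.168.0.12", "ens160"),
    ("172.16.196.0/26", None, "blackhole"),
    ("172.16.196.1/32", None, "cali1dadcdd5b31"),
    ("172.16.196.2/32", None, "cali42cd276b2be"),
    ("192.168.0.0/24", None, "ens160"),
]


def lookup(dst: str):
    """Return the most specific route (longest prefix) that contains dst."""
    addr = ipaddress.ip_address(dst)
    matches = [r for r in ROUTES if addr in ipaddress.ip_network(r[0])]
    return max(matches, key=lambda r: ipaddress.ip_network(r[0]).prefixlen)


for dst in ("172.16.109.65", "172.16.196.1", "172.16.196.9"):
    print(dst, "->", lookup(dst))
```

A pod on k8s-node-1 matches the /26 route via 192.168.0.12, a local pod matches its /32 route on the cali device, and an unassigned address in the local block falls into the blackhole route.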
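The overall packet flow when busybox-77649b9c55-fv298 on k8s-master-1 pings busybox-77649b9c55-s7zfv on k8s-node-1 is as follows:
- 172.16.109.65 is not in the same network segment as the local 172.16.196.1, so the destination MAC has to be that of the gateway 169.254.1.1. The ARP request for the gateway travels over the veth pair, eth0 (container) -> cali1dadcdd5b31 (host). The host-side cali1dadcdd5b31 has proxy ARP enabled (ARP spoofing) and returns the MAC address ee:ee:ee:ee:ee:ee to the container.
- With the MAC address known, the container builds the packet: src: 172.16.196.1, dst: 172.16.109.65, src_mac: 86:9c:03:9e:db:9f, dst_mac: ee:ee:ee:ee:ee:ee. The container's route lookup hits the default gateway route, so the packet is sent out of eth0 and, because eth0 is one end of a veth pair, arrives at the host's cali1dadcdd5b31 interface.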
[root@k8s-master-1 ~]# tcpdump -i cali1dadcdd5b31 icmp -e -Nnnvl
dropped privs to tcpdump
tcpdump: listening on cali1dadcdd5b31, link-type EN10MB (Ethernet), snapshot length 262144 bytes
16:55:45.042609 0e:1c:2c:f6:2a:f9 > ee:ee:ee:ee:ee:ee, ethertype IPv4 (0x0800), length 98: (tos 0x0, ttl 64, id 5428, offset 0, flags [DF], proto ICMP (1), length 84)
172.16.196.1 > 172.16.109.65: ICMP echo request, id 17, seq 10, length 64
16:55:45.043076 ee:ee:ee:ee:ee:ee > 0e:1c:2c:f6:2a:f9, ethertype IPv4 (0x0800), length 98: (tos 0x0, ttl 62, id 60905, offset 0, flags [none], proto ICMP (1), length 84)
172.16.109.65 > 172.16.196.1: ICMP echo reply, id 17, seq 10, length 64
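- After the packet arrives at cali1dadcdd5b31 on k8s-master-1, a route lookup is performed and matches the pod block route 172.16.109.64 192.168.0.12 255.255.255.192 UG 0 0 0 ens160 (every pod on k8s-node-1 hits this route). tunl0 is not used in BGP mode, so the packet is forwarded without encapsulation and only gets a new Ethernet header: src: 172.16.196.1 (pod IP), dst: 172.16.109.65 (pod IP), src_mac: 00:0c:29:b1:02:f0 (ens160 physical NIC of k8s-master-1), dst_mac: 00:0c:29:90:fa:e2 (ens160 physical NIC of k8s-node-1). The source and destination IPs are pod IPs, while the source and destination MACs are those of ens160 on k8s-master-1 and k8s-node-1. This shows that when k8s-master-1 routes the packet, it uses an ARP request to obtain the MAC of k8s-node-1 and puts it in the link-layer header. This requires k8s-master-1 and k8s-node-1 to be in the same layer 2 network; otherwise k8s-master-1 cannot obtain the MAC address of k8s-node-1 and the frame cannot be built. Once the frame is built, k8s-master-1 sends it out of ens160.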
[root@k8s-master-1 ~]# tcpdump -i ens160 icmp -Nnnvle
dropped privs to tcpdump
tcpdump: listening on ens160, link-type EN10MB (Ethernet), snapshot length 262144 bytes
17:11:07.511766 00:0c:29:b1:02:f0 > 00:0c:29:90:fa:e2, ethertype IPv4 (0x0800), length 98: (tos 0x0, ttl 63, id 51277, offset 0, flags [DF], proto ICMP (1), length 84)
172.16.196.1 > 172.16.109.65: ICMP echo request, id 17, seq 932, length 64
17:11:07.512366 00:0c:29:90:fa:e2 > 00:0c:29:b1:02:f0, ethertype IPv4 (0x0800), length 98: (tos 0x0, ttl 63, id 42499, offset 0, flags [none], proto ICMP (1), length 84)
172.16.109.65 > 172.16.196.1: ICMP echo reply, id 17, seq 932, length 64
17:11:08.512052 00:0c:29:b1:02:f0 > 00:0c:29:90:fa:e2, ethertype IPv4 (0x0800), length 98: (tos 0x0, ttl 63, id 51369, offset 0, flags [DF], proto ICMP (1), length 84)
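- After leaving ens160 on k8s-master-1, the frame reaches the layer 2 switch (a virtual switch in a VM environment). Since the destination MAC is that of k8s-node-1, the switch delivers the frame to k8s-node-1.
- A capture on ens160 of k8s-node-1 shows the same frame that k8s-master-1 sent out of its ens160.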
[root@k8s-node-1 ~]# tcpdump -i ens160 icmp -Nnnvle
dropped privs to tcpdump
tcpdump: listening on ens160, link-type EN10MB (Ethernet), snapshot length 262144 bytes
17:40:05.566173 00:0c:29:b1:02:f0 > 00:0c:29:90:fa:e2, ethertype IPv4 (0x0800), length 98: (tos 0x0, ttl 63, id 3808, offset 0, flags [DF], proto ICMP (1), length 84)
172.16.196.1 > 172.16.109.65: ICMP echo request, id 19, seq 1314, length 64
17:40:05.566306 00:0c:29:90:fa:e2 > 00:0c:29:b1:02:f0, ethertype IPv4 (0x0800), length 98: (tos 0x0, ttl 63, id 62570, offset 0, flags [none], proto ICMP (1), length 84)
172.16.109.65 > 172.16.196.1: ICMP echo reply, id 19, seq 1314, length 64
- On receiving the packet, k8s-node-1 matches the route 172.16.109.65 0.0.0.0 255.255.255.255 UH 0 0 0 cali4e329df4a89 and builds the frame: src: 172.16.196.1 (pod IP), dst: 172.16.109.65 (pod IP), src_mac: ee:ee:ee:ee:ee:ee (the cali4e329df4a89 interface, the host-side peer of the busybox eth0 on k8s-node-1), dst_mac: 72:19:6b:9b:bf:e2 (eth0 of the busybox container on k8s-node-1).
# NIC information on k8s-node-1
[root@k8s-node-1 ~]# ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: ens160: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
link/ether 00:0c:29:90:fa:e2 brd ff:ff:ff:ff:ff:ff
altname enp2s0
inet 192.168.0.12/24 brd 192.168.0.255 scope global noprefixroute ens160
valid_lft forever preferred_lft forever
inet6 fe80::20c:29ff:fe90:fae2/64 scope link noprefixroute
valid_lft forever preferred_lft forever
3: tunl0@NONE: <NOARP> mtu 1480 qdisc noop state DOWN group default qlen 1000
link/ipip 0.0.0.0 brd 0.0.0.0
4: kube-ipvs0: <BROADCAST,NOARP> mtu 1500 qdisc noop state DOWN group default
link/ether 42:2d:65:bc:7a:b1 brd ff:ff:ff:ff:ff:ff
inet 10.96.0.1/32 scope global kube-ipvs0
valid_lft forever preferred_lft forever
inet 10.96.6.244/32 scope global kube-ipvs0
valid_lft forever preferred_lft forever
inet 10.96.0.10/32 scope global kube-ipvs0
valid_lft forever preferred_lft forever
inet 10.96.35.201/32 scope global kube-ipvs0
valid_lft forever preferred_lft forever
5: cali4e329df4a89@if4: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default # host-side interface of the busybox container
link/ether ee:ee:ee:ee:ee:ee brd ff:ff:ff:ff:ff:ff link-netns cni-d7eaf0a1-1638-338d-5309-dbd5ec632608
inet6 fe80::ecee:eeff:feee:eeee/64 scope link
valid_lft forever preferred_lft forever
8: cali523840de229@if4: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default # host-side interface of the other container
link/ether ee:ee:ee:ee:ee:ee brd ff:ff:ff:ff:ff:ff link-netns cni-bd969f87-9e94-599c-119a-f8b8f3efc9c6
inet6 fe80::ecee:eeff:feee:eeee/64 scope link
valid_lft forever preferred_lft forever
# Check the routing table
[root@k8s-node-1 ~]# route -n
Kernel IP routing table
Destination Gateway Genmask Flags Metric Ref Use Iface
0.0.0.0 192.168.0.2 0.0.0.0 UG 100 0 0 ens160
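172.16.109.64 0.0.0.0 255.255.255.192 U 0 0 0 * # blackhole route: traffic to addresses in the local pod block that have no more specific route is dropped
172.16.109.65 0.0.0.0 255.255.255.255 UH 0 0 0 cali4e329df4a89 # host-side peer of the busybox container
172.16.109.66 0.0.0.0 255.255.255.255 UH 0 0 0 cali523840de229 # another container; each local pod gets its own route
172.16.196.0 192.168.0.11 255.255.255.192 UG 0 0 0 ens160 # route to the pod block of the other node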
192.168.0.0 0.0.0.0 255.255.255.0 U 100 0 0 ens160
# Check the MAC addresses with a capture on the cali interface
[root@k8s-node-1 ~]# tcpdump -i cali4e329df4a89 -Nnnvle icmp
dropped privs to tcpdump
tcpdump: listening on cali4e329df4a89, link-type EN10MB (Ethernet), snapshot length 262144 bytes
17:47:31.196897 ee:ee:ee:ee:ee:ee > 72:19:6b:9b:bf:e2, ethertype IPv4 (0x0800), length 98: (tos 0x0, ttl 62, id 37592, offset 0, flags [DF], proto ICMP (1), length 84)
172.16.196.1 > 172.16.109.65: ICMP echo request, id 29, seq 0, length 64
17:47:31.196950 72:19:6b:9b:bf:e2 > ee:ee:ee:ee:ee:ee, ethertype IPv4 (0x0800), length 98: (tos 0x0, ttl 64, id 29435, offset 0, flags [none], proto ICMP (1), length 84)
172.16.109.65 > 172.16.196.1: ICMP echo reply, id 29, seq 0, length 64
# Check the ARP table
[root@k8s-node-1 ~]# arp -an
? (192.168.0.1) at 5e:e9:1e:fb:06:64 [ether] on ens160
? (172.16.109.66) at aa:2d:89:c6:49:dd [ether] PERM on cali523840de229
? (192.168.0.11) at 00:0c:29:b1:02:f0 [ether] on ens160
? (172.16.109.65) at 72:19:6b:9b:bf:e2 [ether] PERM on cali4e329df4a89
? (192.168.0.2) at 00:50:56:e0:1b:11 [ether] on ens160
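The captures above show the same IP packet with a different Ethernet header at each hop. The sketch below replays that with a small `Frame` dataclass and a hypothetical `route_hop` helper; the MAC addresses are the ones visible in the tcpdump output, and the model assumes each host simply rewrites both MACs and decrements the TTL.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Frame:
    src_mac: str
    dst_mac: str
    src_ip: str
    dst_ip: str
    ttl: int


def route_hop(frame: Frame, out_src_mac: str, out_dst_mac: str) -> Frame:
    """A routing hop keeps the IP addresses, rewrites both MACs, decrements the TTL."""
    return replace(frame, src_mac=out_src_mac, dst_mac=out_dst_mac, ttl=frame.ttl - 1)


POD_A_MAC = "0e:1c:2c:f6:2a:f9"   # source MAC seen in the capture on cali1dadcdd5b31
CALI_MAC = "ee:ee:ee:ee:ee:ee"    # every cali interface uses this MAC
MASTER_MAC = "00:0c:29:b1:02:f0"  # ens160 on k8s-master-1
NODE_MAC = "00:0c:29:90:fa:e2"    # ens160 on k8s-node-1
POD_B_MAC = "72:19:6b:9b:bf:e2"   # eth0 of busybox on k8s-node-1

leaving_pod = Frame(POD_A_MAC, CALI_MAC, "172.16.196.1", "172.16.109.65", ttl=64)
on_the_wire = route_hop(leaving_pod, MASTER_MAC, NODE_MAC)   # routed by k8s-master-1
entering_pod = route_hop(on_the_wire, CALI_MAC, POD_B_MAC)   # routed by k8s-node-1

for frame in (leaving_pod, on_the_wire, entering_pod):
    print(frame)
```

The TTL values 64, 63 and 62 line up with the three captures, and the pod IPs never change because nothing is encapsulated or NATed along this path.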
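Based on the analysis above, the communication flow is:
- busybox (k8s-master-1) -> calixxx -> ens160 (k8s-master-1) <-> ens160 (k8s-node-1) -> calixxx -> busybox (k8s-node-1)
- The packet is sent to the k8s-node-1 host according to the next hop in the routing table of the k8s-master-1 host.
- BGP mode as deployed here (no encapsulation, with the peer node's IP as next hop) requires the nodes to be in the same layer 2 network. A cross-node pod-to-pod packet on the node's physical NIC has the form src_ip: pod_ip1, dst_ip: pod_ip2, src_mac: node1_ensxx_mac, dst_mac: node2_ensxx_mac. If the two nodes are not in the same layer 2 network, they cannot resolve each other's MAC addresses and the frame cannot be built.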
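A quick way to reason about the layer 2 requirement is to check whether the route's next hop falls inside the subnet configured on the node's own NIC, since only then can its MAC be resolved with a plain ARP request. This is a sketch under the assumption that sharing an IP subnet is a usable proxy for sharing a layer 2 segment; `next_hop_is_on_link` and the 10.0.1.12 address are hypothetical.

```python
import ipaddress


def next_hop_is_on_link(local_iface_cidr: str, next_hop: str) -> bool:
    """True when next_hop lies in the subnet configured on the local NIC."""
    network = ipaddress.ip_interface(local_iface_cidr).network
    return ipaddress.ip_address(next_hop) in network


# The two nodes from this article share 192.168.0.0/24
print(next_hop_is_on_link("192.168.0.11/24", "192.168.0.12"))
# Hypothetical node in a different subnet: ARP cannot resolve it directly
print(next_hop_is_on_link("192.168.0.11/24", "10.0.1.12"))
```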
BGP same-host analysis
- Check the Pod information and the host routing information
# Check how the pods are distributed
╰─ kubectl get pods -n devops -o custom-columns=NAME:.metadata.name,IP:.status.podIP,HOST:.spec.nodeName
NAME IP HOST
busybox-77649b9c55-fv298 172.16.196.1 k8s-master-1
busybox-77649b9c55-s7zfv 172.16.109.65 k8s-node-1
jenkins-56b6774bb6-d8v8b 172.16.109.66 k8s-node-1
# Check the routing table on k8s-node-1
[root@k8s-node-1 ~]# route -n
Kernel IP routing table
Destination Gateway Genmask Flags Metric Ref Use Iface
0.0.0.0 192.168.0.2 0.0.0.0 UG 100 0 0 ens160
172.16.109.64 0.0.0.0 255.255.255.192 U 0 0 0 *
172.16.109.65 0.0.0.0 255.255.255.255 UH 0 0 0 cali4e329df4a89 # busybox container
172.16.109.66 0.0.0.0 255.255.255.255 UH 0 0 0 cali523840de229 # the other container
172.16.196.0 192.168.0.11 255.255.255.192 UG 0 0 0 ens160
192.168.0.0 0.0.0.0 255.255.255.0 U 100 0 0 ens160
# Check the MAC information
[root@k8s-node-1 ~]# arp -an
? (192.168.0.1) at 5e:e9:1e:fb:06:64 [ether] on ens160
? (172.16.109.66) at aa:2d:89:c6:49:dd [ether] PERM on cali523840de229 # the other container
? (192.168.0.11) at 00:0c:29:b1:02:f0 [ether] on ens160
? (172.16.109.65) at 72:19:6b:9b:bf:e2 [ether] PERM on cali4e329df4a89 # busybox container
? (192.168.0.2) at 00:50:56:e0:1b:11 [ether] on ens160
# Check the NIC information
[root@k8s-node-1 ~]# ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: ens160: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
link/ether 00:0c:29:90:fa:e2 brd ff:ff:ff:ff:ff:ff
altname enp2s0
inet 192.168.0.12/24 brd 192.168.0.255 scope global noprefixroute ens160
valid_lft forever preferred_lft forever
inet6 fe80::20c:29ff:fe90:fae2/64 scope link noprefixroute
valid_lft forever preferred_lft forever
3: tunl0@NONE: <NOARP> mtu 1480 qdisc noop state DOWN group default qlen 1000
link/ipip 0.0.0.0 brd 0.0.0.0
4: kube-ipvs0: <BROADCAST,NOARP> mtu 1500 qdisc noop state DOWN group default
link/ether 42:2d:65:bc:7a:b1 brd ff:ff:ff:ff:ff:ff
inet 10.96.0.1/32 scope global kube-ipvs0
valid_lft forever preferred_lft forever
inet 10.96.6.244/32 scope global kube-ipvs0
valid_lft forever preferred_lft forever
inet 10.96.0.10/32 scope global kube-ipvs0
valid_lft forever preferred_lft forever
inet 10.96.35.201/32 scope global kube-ipvs0
valid_lft forever preferred_lft forever
5: cali4e329df4a89@if4: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default # busybox container
link/ether ee:ee:ee:ee:ee:ee brd ff:ff:ff:ff:ff:ff link-netns cni-d7eaf0a1-1638-338d-5309-dbd5ec632608
inet6 fe80::ecee:eeff:feee:eeee/64 scope link
valid_lft forever preferred_lft forever
8: cali523840de229@if4: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default # the other container
link/ether ee:ee:ee:ee:ee:ee brd ff:ff:ff:ff:ff:ff link-netns cni-bd969f87-9e94-599c-119a-f8b8f3efc9c6
inet6 fe80::ecee:eeff:feee:eeee/64 scope link
valid_lft forever preferred_lft forever
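When the busybox container on k8s-node-1 pings the other container, the packet is forwarded roughly as follows:
- 172.16.109.65 and 172.16.109.66 belong to the same address block, but the busybox routing table has no route for the 172.16.109.64 block, so the container resolves the MAC address of the default gateway 169.254.1.1. Because proxy ARP is enabled on cali4e329df4a89, the peer of the busybox eth0, the reply is ee:ee:ee:ee:ee:ee. The container then builds the packet: src_addr: 172.16.109.65, dst_addr: 172.16.109.66, src_mac: 72:19:6b:9b:bf:e2, dst_mac: ee:ee:ee:ee:ee:ee. This packet is delivered to cali4e329df4a89 on the host.
# busybox on k8s-node-1 pings the other container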
/ # ping -c 1 172.16.109.66
PING 172.16.109.66 (172.16.109.66): 56 data bytes
64 bytes from 172.16.109.66: seq=0 ttl=63 time=0.206 ms
--- 172.16.109.66 ping statistics ---
1 packets transmitted, 1 packets received, 0% packet loss
round-trip min/avg/max = 0.206/0.206/0.206 ms
[root@k8s-node-1 ~]# tcpdump -i cali4e329df4a89 -Nnnvle
dropped privs to tcpdump
tcpdump: listening on cali4e329df4a89, link-type EN10MB (Ethernet), snapshot length 262144 bytes
18:17:06.642985 72:19:6b:9b:bf:e2 > ee:ee:ee:ee:ee:ee, ethertype IPv4 (0x0800), length 98: (tos 0x0, ttl 64, id 17087, offset 0, flags [DF], proto ICMP (1), length 84)
172.16.109.65 > 172.16.109.66: ICMP echo request, id 37, seq 0, length 64
18:17:06.643095 ee:ee:ee:ee:ee:ee > 72:19:6b:9b:bf:e2, ethertype IPv4 (0x0800), length 98: (tos 0x0, ttl 63, id 8840, offset 0, flags [none], proto ICMP (1), length 84)
172.16.109.66 > 172.16.109.65: ICMP echo reply, id 37, seq 0, length 64
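- When the packet arrives at cali4e329df4a89 on the host, a route lookup matches 172.16.109.66 0.0.0.0 255.255.255.255 UH 0 0 0 cali523840de229, so the packet is forwarded to the local cali523840de229 interface (because it is one end of a veth pair, the packet is handed straight to the container). The frame is rebuilt as: src_addr: 172.16.109.65, dst_addr: 172.16.109.66, src_mac: ee:ee:ee:ee:ee:ee, dst_mac: aa:2d:89:c6:49:dd.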
[root@k8s-node-1 ~]# tcpdump -i cali523840de229 -Nnnvle icmp
dropped privs to tcpdump
tcpdump: listening on cali523840de229, link-type EN10MB (Ethernet), snapshot length 262144 bytes
18:19:07.901457 ee:ee:ee:ee:ee:ee > aa:2d:89:c6:49:dd, ethertype IPv4 (0x0800), length 98: (tos 0x0, ttl 63, id 24224, offset 0, flags [DF], proto ICMP (1), length 84)
172.16.109.65 > 172.16.109.66: ICMP echo request, id 38, seq 0, length 64
18:19:07.901483 aa:2d:89:c6:49:dd > ee:ee:ee:ee:ee:ee, ethertype IPv4 (0x0800), length 98: (tos 0x0, ttl 64, id 14344, offset 0, flags [none], proto ICMP (1), length 84)
172.16.109.66 > 172.16.109.65: ICMP echo reply, id 38, seq 0, length 64
- The return path mirrors the outbound path, so it is not analyzed here.