-
Bug
-
Resolution: Can't Do
-
Normal
-
None
-
4.12
-
No
-
SDN Sprint 247, SDN Sprint 248
-
2
-
False
-
Description of problem:
Nodes are placed in Ready status even if ovs-configuration service fails to start br-ex after a reboot
Version-Release number of selected component (if applicable):
4.12 (4.12.44 and nightlies)
How reproducible:
Always when ovs-configuration fails to start br-ex
Steps to Reproduce:
1. Prepare a MachineConfig manifest that configures dual-stack through DHCP for the LACP bond0 (br-ex) and bond0.vlanY (secondary bridge br-ex1).
2. Deploy OCP 4.12 via IPI with the latest nightly GA on a baremetal cluster with OVN-K, and include the manifest as day 1 configuration.
3. After the cluster is ready, perform an operation that reboots the worker nodes, for example applying a Performance Profile.
4. After the reboot, check the ovnkube-node pods. The pods on worker nodes with ovs-configuration issues remain in a crashing state, and their logs show more details of the failure.
5. Check the ovs-configuration service logs on the workers with ovnkube-node issues. When the service fails to bring up br-ex, the node is still placed in Ready status.
Actual results:
Nodes are placed in Ready status even if ovs-configuration service fails to start br-ex after a reboot
Expected results:
Nodes should not be placed in Ready status if ovs-configuration service fails to start br-ex
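The condition behind the expected result can be checked by hand on an affected node. This is a minimal sketch, assuming a Linux host where network interfaces appear under /sys/class/net. The function name `link_exists` and the `SYSFS_NET_DIR` override are hypothetical, introduced here only so the check can be exercised without a real node. It is not how the kubelet or OVN-K decides readiness.

```shell
#!/bin/sh
# Hypothetical helper: succeeds when a network interface with the given
# name exists. SYSFS_NET_DIR is an assumption added for testing; it
# defaults to the standard Linux location for network interfaces.
link_exists() {
    [ -e "${SYSFS_NET_DIR:-/sys/class/net}/$1" ]
}

# Example: report whether br-ex is present on this host.
if link_exists br-ex; then
    echo "br-ex present"
else
    echo "br-ex missing"
fi
```

On a worker where configure-ovs.sh gave up, this would be expected to print "br-ex missing" while the node still reports Ready.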
Additional info:
- We do not yet understand why br-ex fails to start, but we have observed this since OCP 4.12.41.
- As a side effect of this bug, workloads are scheduled on the affected nodes because they are marked as Ready, and the pods then fail to start.
- We have mainly seen this failure in deployments with br-ex and a secondary bridge br-ex1.
More details below:
1. All nodes are Ready after a reboot from rolling out a Performance Profile
$ omc get nodes
NAME       STATUS   ROLES                  AGE   VERSION
master-0   Ready    control-plane,master   2h    v1.25.14+bcb9a60
master-1   Ready    control-plane,master   2h    v1.25.14+bcb9a60
master-2   Ready    control-plane,master   2h    v1.25.14+bcb9a60
worker-0   Ready    worker                 1h    v1.25.14+bcb9a60
worker-1   Ready    worker                 1h    v1.25.14+bcb9a60
worker-2   Ready    worker                 1h    v1.25.14+bcb9a60
worker-3   Ready    worker                 2h    v1.25.14+bcb9a60
2. However, some ovnkube-node pods are restarting in a loop, and their logs show that br-ex does not exist
$ omc -n openshift-ovn-kubernetes get pods -o wide
NAME                   READY   STATUS             RESTARTS   AGE   IP              NODE       NOMINATED NODE   READINESS GATES
ovnkube-master-9nzjw   6/6     Running            0          2h    192.168.62.23   master-2   <none>           <none>
ovnkube-master-fdkcv   6/6     Running            0          2h    192.168.62.21   master-0   <none>           <none>
ovnkube-master-thvfd   6/6     Running            0          2h    192.168.62.22   master-1   <none>           <none>
ovnkube-node-6p6gd     4/5     CrashLoopBackOff   16         1h    192.168.62.25   worker-1   <none>           <none>
ovnkube-node-hf28x     5/5     Running            5          1h    192.168.62.26   worker-2   <none>           <none>
ovnkube-node-hw4s8     5/5     Running            5          2h    192.168.62.27   worker-3   <none>           <none>
ovnkube-node-m7m5f     4/5     CrashLoopBackOff   12         1h    192.168.62.24   worker-0   <none>           <none>
ovnkube-node-q6skg     5/5     Running            0          2h    192.168.62.22   master-1   <none>           <none>
ovnkube-node-wxrqp     5/5     Running            0          2h    192.168.62.21   master-0   <none>           <none>
ovnkube-node-zhtxn     5/5     Running            0          2h    192.168.62.23   master-2   <none>           <none>

$ omc -n openshift-ovn-kubernetes logs ovnkube-node-6p6gd -c ovnkube-node | tail -1
2023-11-19T16:21:48.788745330Z F1119 16:21:48.788732 54036 ovnkube.go:133] error looking up gw interface: "br-ex", error: Link not found

$ omc -n openshift-ovn-kubernetes logs ovnkube-node-m7m5f -c ovnkube-node | tail -1
2023-11-19T16:22:18.973329972Z F1119 16:22:18.973312 31951 ovnkube.go:133] error looking up gw interface: "br-ex", error: Link not found
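To pick out the affected workers from a pod listing like the one in step 2, a small filter can help. This is a sketch only: the function name `crashlooping_ovnkube_nodes` is hypothetical, and the column positions (1 = NAME, 3 = STATUS, 7 = NODE) are assumed from the `-o wide` output in this report, where RESTARTS is a single plain number.

```shell
#!/bin/sh
# Hypothetical helper: reads `get pods -o wide` output on stdin and prints
# "<pod> <node>" for every ovnkube-node pod whose STATUS column is
# CrashLoopBackOff. Assumes whitespace-separated columns with
# NAME in field 1, STATUS in field 3, and NODE in field 7.
crashlooping_ovnkube_nodes() {
    awk 'NR > 1 && $1 ~ /^ovnkube-node-/ && $3 == "CrashLoopBackOff" { print $1, $7 }'
}
```

Usage, assuming the same must-gather context as above: `omc -n openshift-ovn-kubernetes get pods -o wide | crashlooping_ovnkube_nodes`. If the RESTARTS column contains extra words (for example a "(3m ago)" suffix), the field numbers would need adjusting.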
3. The journal logs show that ovs-configuration.service failed on its most recent start attempt (the earlier successful start is from cluster installation). The failure is caused by exceeding the retry limit while bringing up br-ex
$ grep ovs-configuration.service: ../journal-worker-0.log
Nov 19 14:32:30 worker-0 systemd[1]: ovs-configuration.service: Succeeded.
Nov 19 14:32:30 worker-0 systemd[1]: ovs-configuration.service: Consumed 4.498s CPU time
Nov 19 16:11:38 worker-0 systemd[1]: ovs-configuration.service: Main process exited, code=exited, status=3/NOTIMPLEMENTED
Nov 19 16:11:38 worker-0 systemd[1]: ovs-configuration.service: Failed with result 'exit-code'. <-- This is the start attempt after the reboot from PAO
Nov 19 16:11:38 worker-0 systemd[1]: ovs-configuration.service: Consumed 5.019s CPU time

$ grep ovs-configuration.service: ../journal-worker-1.log
Nov 19 14:32:03 worker-1 systemd[1]: ovs-configuration.service: Succeeded.
Nov 19 14:32:03 worker-1 systemd[1]: ovs-configuration.service: Consumed 4.151s CPU time
Nov 19 15:50:06 worker-1 systemd[1]: ovs-configuration.service: Main process exited, code=exited, status=3/NOTIMPLEMENTED
Nov 19 15:50:06 worker-1 systemd[1]: ovs-configuration.service: Failed with result 'exit-code'. <-- This is the start attempt after the reboot from PAO
Nov 19 15:50:06 worker-1 systemd[1]: ovs-configuration.service: Consumed 5.059s CPU time

$ grep configure-ovs journal-worker-0.log | less
...
Nov 19 16:10:15 worker-0 configure-ovs.sh[4413]: ERROR: Cannot bring up connection br-ex after 10 attempts
Nov 19 16:10:15 worker-0 configure-ovs.sh[4413]: + return 3
Nov 19 16:10:15 worker-0 configure-ovs.sh[4413]: + handle_exit
Nov 19 16:10:15 worker-0 configure-ovs.sh[4413]: + e=3
Nov 19 16:10:15 worker-0 configure-ovs.sh[4413]: + '[' 3 -eq 0 ']'
Nov 19 16:10:15 worker-0 configure-ovs.sh[4413]: + echo 'ERROR: configure-ovs exited with error: 3'
Nov 19 16:10:15 worker-0 configure-ovs.sh[4413]: ERROR: configure-ovs exited with error: 3

$ grep configure-ovs journal-worker-1.log | less
...
Nov 19 15:48:36 worker-1 configure-ovs.sh[4602]: ERROR: Cannot bring up connection br-ex after 10 attempts
Nov 19 15:48:36 worker-1 configure-ovs.sh[4602]: + return 3
Nov 19 15:48:36 worker-1 configure-ovs.sh[4602]: + handle_exit
Nov 19 15:48:36 worker-1 configure-ovs.sh[4602]: + e=3
Nov 19 15:48:36 worker-1 configure-ovs.sh[4602]: + '[' 3 -eq 0 ']'
Nov 19 15:48:36 worker-1 configure-ovs.sh[4602]: + echo 'ERROR: configure-ovs exited with error: 3'
Nov 19 15:48:36 worker-1 configure-ovs.sh[4602]: ERROR: configure-ovs exited with error: 3
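The journal inspection in step 3 can be reduced to a single verdict per node. The following is a rough sketch, not part of any OpenShift tooling: the function name `ovs_configuration_last_result` is hypothetical, and it assumes a saved journal file containing systemd result lines in the same format as the excerpts above.

```shell
#!/bin/sh
# Hypothetical helper: prints "succeeded", "failed", or "unknown" according
# to the last ovs-configuration.service result line in a saved journal file.
# Later lines override earlier ones, so a failure after a reboot wins over
# the successful start recorded at install time.
ovs_configuration_last_result() {
    awk '
        /ovs-configuration\.service: Succeeded\./        { result = "succeeded" }
        /ovs-configuration\.service: Failed with result/ { result = "failed" }
        END { print (result == "" ? "unknown" : result) }
    ' "$1"
}
```

Usage: `ovs_configuration_last_result ../journal-worker-0.log`. For the excerpts in this report, the last matching line on both worker-0 and worker-1 is the "Failed with result 'exit-code'" entry.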