Type: Bug
Resolution: Can't Do
Priority: Normal
Fix Version/s: None
Affects Version/s: 4.12
Component/s: Networking / ovn-kubernetes
Labels:
- SDN-Bug-Backlog-Reduction-Lack-Of-Team-Cycles

Regression: No
Sprint: SDN Sprint 247, SDN Sprint 248
sprint_count: 2
Blocked: False
Blocked Reason: None

SFDC Cases Counter:
SFDC Cases Open:
SFDC Cases Links:

Description of problem:

Nodes are placed in Ready status even if the ovs-configuration service fails to bring up br-ex after a reboot.

Version-Release number of selected component (if applicable):

4.12 (4.12.44 and nightlies)

How reproducible:

Always when ovs-configuration fails to start br-ex

Steps to Reproduce:

1. Prepare a MachineConfig manifest that configures dual-stack through DHCP for an LACP bond0 (br-ex) and for bond0.vlanY (secondary bridge br-ex1).
2. Deploy OCP 4.12 via IPI with the latest nightly GA on a bare-metal cluster with OVN-K, and include the manifest as day 1.
3. After the cluster is ready, perform an operation that reboots the worker nodes, for example applying a Performance Profile.
4. After the reboot, check the ovnkube-node pods. Those on worker nodes with ovs-configuration issues remain in CrashLoopBackOff status, and their logs show more details of the failure.
5. Check the ovs-configuration service logs on the workers with ovnkube-node issues. Even when the service fails to bring up br-ex, the node is still placed in Ready status.

Actual results:

Nodes are placed in Ready status even if the ovs-configuration service fails to bring up br-ex after a reboot.

Expected results:

Nodes should not be placed in Ready status if the ovs-configuration service fails to bring up br-ex.

Additional info:

 - We do not yet understand why br-ex fails to start, but we have observed this since OCP 4.12.41.
 - A side effect of this bug is that workloads are scheduled on the affected nodes because they are marked as Ready, and the pods then fail to start.
 - We have mainly seen this failure in deployments with br-ex and a secondary bridge br-ex1.

More details below:
1. All nodes are Ready after a reboot from rolling out a Performance Profile

$ omc get nodes
NAME       STATUS   ROLES                  AGE   VERSION
master-0   Ready    control-plane,master   2h    v1.25.14+bcb9a60
master-1   Ready    control-plane,master   2h    v1.25.14+bcb9a60
master-2   Ready    control-plane,master   2h    v1.25.14+bcb9a60
worker-0   Ready    worker                 1h    v1.25.14+bcb9a60
worker-1   Ready    worker                 1h    v1.25.14+bcb9a60
worker-2   Ready    worker                 1h    v1.25.14+bcb9a60
worker-3   Ready    worker                 2h    v1.25.14+bcb9a60

2. However, some ovnkube-node pods are restarting in a loop, and their logs show that br-ex does not exist

$ omc -n openshift-ovn-kubernetes get pods -o wide
NAME                   READY   STATUS             RESTARTS   AGE   IP              NODE       NOMINATED NODE   READINESS GATES
ovnkube-master-9nzjw   6/6     Running            0          2h    192.168.62.23   master-2   <none>           <none>
ovnkube-master-fdkcv   6/6     Running            0          2h    192.168.62.21   master-0   <none>           <none>
ovnkube-master-thvfd   6/6     Running            0          2h    192.168.62.22   master-1   <none>           <none>
ovnkube-node-6p6gd     4/5     CrashLoopBackOff   16         1h    192.168.62.25   worker-1   <none>           <none>
ovnkube-node-hf28x     5/5     Running            5          1h    192.168.62.26   worker-2   <none>           <none>
ovnkube-node-hw4s8     5/5     Running            5          2h    192.168.62.27   worker-3   <none>           <none>
ovnkube-node-m7m5f     4/5     CrashLoopBackOff   12         1h    192.168.62.24   worker-0   <none>           <none>
ovnkube-node-q6skg     5/5     Running            0          2h    192.168.62.22   master-1   <none>           <none>
ovnkube-node-wxrqp     5/5     Running            0          2h    192.168.62.21   master-0   <none>           <none>
ovnkube-node-zhtxn     5/5     Running            0          2h    192.168.62.23   master-2   <none>           <none>

$ omc -n openshift-ovn-kubernetes logs ovnkube-node-6p6gd -c ovnkube-node | tail -1
2023-11-19T16:21:48.788745330Z F1119 16:21:48.788732   54036 ovnkube.go:133] error looking up gw interface: "br-ex", error: Link not found

$ omc -n openshift-ovn-kubernetes logs ovnkube-node-m7m5f -c ovnkube-node | tail -1
2023-11-19T16:22:18.973329972Z F1119 16:22:18.973312   31951 ovnkube.go:133] error looking up gw interface: "br-ex", error: Link not found
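
To pick out the affected workers from the pod listing above without reading it row by row, a small filter can help. This is only a sketch: `unready_ovnkube_nodes` is a hypothetical helper name, not part of omc or oc, and it assumes the `-o wide` layout shown above, where NOMINATED NODE and READINESS GATES are each a single token so that NODE is the third field from the end.

```shell
# Hypothetical helper (not part of omc or oc): reads a pod listing in the
# "-o wide" layout on stdin and prints the node of every ovnkube-node pod
# whose READY count is below its container total.
unready_ovnkube_nodes() {
  awk '$1 ~ /^ovnkube-node-/ {
    split($2, ready, "/")            # READY column, e.g. "4/5"
    if (ready[1] != ready[2])
      print $(NF - 2)                # assumption: NODE is third from the end
  }'
}
```

Possible usage: `omc -n openshift-ovn-kubernetes get pods -o wide | unready_ovnkube_nodes`. For the listing above this would print worker-1 and worker-0.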

3. The journal logs show that ovs-configuration.service failed on its most recent start attempt (the earlier successful start dates from cluster installation). The failure points to the retry limit being exceeded while bringing br-ex up

$ grep ovs-configuration.service: ../journal-worker-0.log
Nov 19 14:32:30 worker-0 systemd[1]: ovs-configuration.service: Succeeded.
Nov 19 14:32:30 worker-0 systemd[1]: ovs-configuration.service: Consumed 4.498s CPU time
Nov 19 16:11:38 worker-0 systemd[1]: ovs-configuration.service: Main process exited, code=exited, status=3/NOTIMPLEMENTED
Nov 19 16:11:38 worker-0 systemd[1]: ovs-configuration.service: Failed with result 'exit-code'.  <-- This is the start attempt after the reboot from PAO
Nov 19 16:11:38 worker-0 systemd[1]: ovs-configuration.service: Consumed 5.019s CPU time

$ grep ovs-configuration.service: ../journal-worker-1.log
Nov 19 14:32:03 worker-1 systemd[1]: ovs-configuration.service: Succeeded.
Nov 19 14:32:03 worker-1 systemd[1]: ovs-configuration.service: Consumed 4.151s CPU time
Nov 19 15:50:06 worker-1 systemd[1]: ovs-configuration.service: Main process exited, code=exited, status=3/NOTIMPLEMENTED
Nov 19 15:50:06 worker-1 systemd[1]: ovs-configuration.service: Failed with result 'exit-code'. <-- This is the start attempt after the reboot from PAO
Nov 19 15:50:06 worker-1 systemd[1]: ovs-configuration.service: Consumed 5.059s CPU time

$ grep configure-ovs journal-worker-0.log | less
...
Nov 19 16:10:15 worker-0 configure-ovs.sh[4413]: ERROR: Cannot bring up connection br-ex after 10 attempts
Nov 19 16:10:15 worker-0 configure-ovs.sh[4413]: + return 3
Nov 19 16:10:15 worker-0 configure-ovs.sh[4413]: + handle_exit
Nov 19 16:10:15 worker-0 configure-ovs.sh[4413]: + e=3
Nov 19 16:10:15 worker-0 configure-ovs.sh[4413]: + '[' 3 -eq 0 ']'
Nov 19 16:10:15 worker-0 configure-ovs.sh[4413]: + echo 'ERROR: configure-ovs exited with error: 3'
Nov 19 16:10:15 worker-0 configure-ovs.sh[4413]: ERROR: configure-ovs exited with error: 3

$ grep configure-ovs journal-worker-1.log | less
...
Nov 19 15:48:36 worker-1 configure-ovs.sh[4602]: ERROR: Cannot bring up connection br-ex after 10 attempts
Nov 19 15:48:36 worker-1 configure-ovs.sh[4602]: + return 3
Nov 19 15:48:36 worker-1 configure-ovs.sh[4602]: + handle_exit
Nov 19 15:48:36 worker-1 configure-ovs.sh[4602]: + e=3
Nov 19 15:48:36 worker-1 configure-ovs.sh[4602]: + '[' 3 -eq 0 ']'
Nov 19 15:48:36 worker-1 configure-ovs.sh[4602]: + echo 'ERROR: configure-ovs exited with error: 3'
Nov 19 15:48:36 worker-1 configure-ovs.sh[4602]: ERROR: configure-ovs exited with error: 3
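
The journal check above can be reduced to a single answer per node. The following is a minimal sketch, assuming a saved journal file like the ones grepped above and only the two systemd result messages that appear in this report ("Succeeded." and "Failed with result ..."). `last_ovs_configuration_result` is a hypothetical helper name introduced here for illustration.

```shell
# Hypothetical helper: prints "Succeeded" or "Failed" for the most recent
# ovs-configuration.service result recorded in a saved journal file ($1).
# Prints nothing if neither message is present in the file.
last_ovs_configuration_result() {
  grep -E 'ovs-configuration\.service: (Succeeded|Failed)' "$1" |
    tail -n 1 |
    sed -E 's/.*ovs-configuration\.service: (Succeeded|Failed).*/\1/'
}
```

Possible usage: `last_ovs_configuration_result journal-worker-0.log`. A result of "Failed" on a node that still reports Ready matches the behavior described in this bug.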

Assignee: Jaime Caamaño Ruiz
Reporter: Manuel Rodriguez
QA Contact: Zhanqi Zhao
Need Info From: Manuel Rodriguez
Votes: 0
Watchers: 8
Created: 2023/11/20 9:13 PM
Updated: 2024/01/26 11:11 AM
Resolved: 2024/01/25 6:34 PM
