OpenShift Bugs / OCPBUGS-23469

Nodes are placed in Ready status even if ovs-configuration service fails to start br-ex after a reboot

    • SDN Sprint 247, SDN Sprint 248

      Description of problem:

      Nodes are placed in Ready status even if ovs-configuration service fails to start br-ex after a reboot
      

      Version-Release number of selected component (if applicable):

      4.12 (4.12.44 and nightlies)
      

      How reproducible:

      Always when ovs-configuration fails to start br-ex
      

      Steps to Reproduce:

      1. Prepare a MachineConfig manifest that configures dual-stack through DHCP on the LACP bond0 (br-ex) and on bond0.vlanY (secondary bridge br-ex1).
      2. Deploy OCP 4.12 via IPI using the latest nightly GA on a bare-metal cluster with OVN-K, and include the manifest as a day-1 customization.
      3. After the cluster is ready, perform an operation that reboots the worker nodes, for example applying a Performance Profile.
      4. After the reboot, check the ovnkube-node pods. Those on worker nodes with ovs-configuration issues remain in CrashLoopBackOff, and their logs show more details of the failure.
      5. Check the ovs-configuration service logs on the workers with ovnkube-node issues. Even though the service failed to bring up br-ex, the node is still placed in Ready status.
      

      Actual results:

      Nodes are placed in Ready status even if ovs-configuration service fails to start br-ex after a reboot
      

      Expected results:

      Nodes should not be placed in Ready status if ovs-configuration service fails to start br-ex
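
The condition described above can be expressed as a check run on the node itself. The following is a minimal sketch, not how the kubelet or OVN-Kubernetes actually decide readiness: the function name `br_ex_ready` is hypothetical, and it assumes that `systemctl` and `ip` are available on the node and that the gateway bridge is named br-ex as in this report.

```shell
# Hypothetical helper: returns 0 only if ovs-configuration.service did not
# fail and the br-ex link exists on this node.
br_ex_ready() {
    # `systemctl is-failed` exits 0 when the unit is in the failed state.
    if systemctl is-failed --quiet ovs-configuration.service; then
        echo "ovs-configuration.service is in failed state" >&2
        return 1
    fi
    # `ip link show <dev>` exits non-zero when the link does not exist.
    if ! ip link show br-ex >/dev/null 2>&1; then
        echo "br-ex link not found" >&2
        return 1
    fi
    return 0
}
```

A non-zero return from `br_ex_ready` could be used, for example, as the trigger for manually cordoning the node so that no workloads are scheduled on it.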
      

      Additional info:

       - We do not yet understand why br-ex fails to start, but we have observed this since OCP 4.12.41.
       - A side effect of this bug is that workloads are scheduled onto the affected nodes because they are marked as Ready, and the pods then fail to start.
       - We have mainly seen this failure in deployments with br-ex and a secondary bridge br-ex1.
      

      More details below:
      1. All nodes are Ready after the reboot triggered by rolling out a Performance Profile

      $ omc get nodes
      NAME       STATUS   ROLES                  AGE   VERSION
      master-0   Ready    control-plane,master   2h    v1.25.14+bcb9a60
      master-1   Ready    control-plane,master   2h    v1.25.14+bcb9a60
      master-2   Ready    control-plane,master   2h    v1.25.14+bcb9a60
      worker-0   Ready    worker                 1h    v1.25.14+bcb9a60
      worker-1   Ready    worker                 1h    v1.25.14+bcb9a60
      worker-2   Ready    worker                 1h    v1.25.14+bcb9a60
      worker-3   Ready    worker                 2h    v1.25.14+bcb9a60
      

      2. However, some ovnkube-node pods restart in a loop, and their logs show that br-ex does not exist

      $ omc -n openshift-ovn-kubernetes get pods -o wide
      NAME                   READY   STATUS             RESTARTS   AGE   IP              NODE       NOMINATED NODE   READINESS GATES
      ovnkube-master-9nzjw   6/6     Running            0          2h    192.168.62.23   master-2   <none>           <none>
      ovnkube-master-fdkcv   6/6     Running            0          2h    192.168.62.21   master-0   <none>           <none>
      ovnkube-master-thvfd   6/6     Running            0          2h    192.168.62.22   master-1   <none>           <none>
      ovnkube-node-6p6gd     4/5     CrashLoopBackOff   16         1h    192.168.62.25   worker-1   <none>           <none>
      ovnkube-node-hf28x     5/5     Running            5          1h    192.168.62.26   worker-2   <none>           <none>
      ovnkube-node-hw4s8     5/5     Running            5          2h    192.168.62.27   worker-3   <none>           <none>
      ovnkube-node-m7m5f     4/5     CrashLoopBackOff   12         1h    192.168.62.24   worker-0   <none>           <none>
      ovnkube-node-q6skg     5/5     Running            0          2h    192.168.62.22   master-1   <none>           <none>
      ovnkube-node-wxrqp     5/5     Running            0          2h    192.168.62.21   master-0   <none>           <none>
      ovnkube-node-zhtxn     5/5     Running            0          2h    192.168.62.23   master-2   <none>           <none>
      
      $ omc -n openshift-ovn-kubernetes logs ovnkube-node-6p6gd -c ovnkube-node | tail -1
      2023-11-19T16:21:48.788745330Z F1119 16:21:48.788732   54036 ovnkube.go:133] error looking up gw interface: "br-ex", error: Link not found
      
      $ omc -n openshift-ovn-kubernetes logs ovnkube-node-m7m5f -c ovnkube-node | tail -1
      2023-11-19T16:22:18.973329972Z F1119 16:22:18.973312   31951 ovnkube.go:133] error looking up gw interface: "br-ex", error: Link not found
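
To pick out the affected workers from the pod listing without reading it by eye, a small filter can help. This is a sketch under the assumption that the input has the same column layout as the `get pods -o wide` output shown above (NAME, READY, STATUS, RESTARTS, AGE, IP, NODE, with RESTARTS as a single token); the function name `crashlooping_ovnkube_nodes` is hypothetical.

```shell
# Hypothetical helper: read `get pods -o wide` output on stdin and print the
# node name of every ovnkube-node pod whose STATUS is CrashLoopBackOff.
# Assumes NODE is the 7th whitespace-separated column, as in the listing above.
crashlooping_ovnkube_nodes() {
    awk '$1 ~ /^ovnkube-node-/ && $3 == "CrashLoopBackOff" { print $7 }'
}
```

With the listing above, `omc -n openshift-ovn-kubernetes get pods -o wide | crashlooping_ovnkube_nodes` would print worker-1 and worker-0. Output formats that add extra tokens to the RESTARTS column would shift the NODE column and need a different field number.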
      

      3. The journal logs show that ovs-configuration.service failed on its most recent start attempt (the earlier successful start was at cluster installation). The failure points to the retries to bring br-ex up being exhausted; systemd reports the script's exit code 3 under its generic label NOTIMPLEMENTED.

      $ grep ovs-configuration.service: ../journal-worker-0.log
      Nov 19 14:32:30 worker-0 systemd[1]: ovs-configuration.service: Succeeded.
      Nov 19 14:32:30 worker-0 systemd[1]: ovs-configuration.service: Consumed 4.498s CPU time
      Nov 19 16:11:38 worker-0 systemd[1]: ovs-configuration.service: Main process exited, code=exited, status=3/NOTIMPLEMENTED
      Nov 19 16:11:38 worker-0 systemd[1]: ovs-configuration.service: Failed with result 'exit-code'.  <-- This is the start attempt after the reboot from PAO
      Nov 19 16:11:38 worker-0 systemd[1]: ovs-configuration.service: Consumed 5.019s CPU time
      
      $ grep ovs-configuration.service: ../journal-worker-1.log
      Nov 19 14:32:03 worker-1 systemd[1]: ovs-configuration.service: Succeeded.
      Nov 19 14:32:03 worker-1 systemd[1]: ovs-configuration.service: Consumed 4.151s CPU time
      Nov 19 15:50:06 worker-1 systemd[1]: ovs-configuration.service: Main process exited, code=exited, status=3/NOTIMPLEMENTED
      Nov 19 15:50:06 worker-1 systemd[1]: ovs-configuration.service: Failed with result 'exit-code'. <-- This is the start attempt after the reboot from PAO
      Nov 19 15:50:06 worker-1 systemd[1]: ovs-configuration.service: Consumed 5.059s CPU time
      
      $ grep configure-ovs journal-worker-0.log | less
      ...
      Nov 19 16:10:15 worker-0 configure-ovs.sh[4413]: ERROR: Cannot bring up connection br-ex after 10 attempts
      Nov 19 16:10:15 worker-0 configure-ovs.sh[4413]: + return 3
      Nov 19 16:10:15 worker-0 configure-ovs.sh[4413]: + handle_exit
      Nov 19 16:10:15 worker-0 configure-ovs.sh[4413]: + e=3
      Nov 19 16:10:15 worker-0 configure-ovs.sh[4413]: + '[' 3 -eq 0 ']'
      Nov 19 16:10:15 worker-0 configure-ovs.sh[4413]: + echo 'ERROR: configure-ovs exited with error: 3'
      Nov 19 16:10:15 worker-0 configure-ovs.sh[4413]: ERROR: configure-ovs exited with error: 3
      
      $ grep configure-ovs journal-worker-1.log | less
      ...
      Nov 19 15:48:36 worker-1 configure-ovs.sh[4602]: ERROR: Cannot bring up connection br-ex after 10 attempts
      Nov 19 15:48:36 worker-1 configure-ovs.sh[4602]: + return 3
      Nov 19 15:48:36 worker-1 configure-ovs.sh[4602]: + handle_exit
      Nov 19 15:48:36 worker-1 configure-ovs.sh[4602]: + e=3
      Nov 19 15:48:36 worker-1 configure-ovs.sh[4602]: + '[' 3 -eq 0 ']'
      Nov 19 15:48:36 worker-1 configure-ovs.sh[4602]: + echo 'ERROR: configure-ovs exited with error: 3'
      Nov 19 15:48:36 worker-1 configure-ovs.sh[4602]: ERROR: configure-ovs exited with error: 3
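
The same check can be scripted against a saved journal file. The sketch below assumes the systemd message strings shown above ("Succeeded." and "Failed with result"); other systemd versions may word the success message differently. The function name `last_ovs_configuration_result` is hypothetical.

```shell
# Hypothetical helper: print the outcome of the most recent
# ovs-configuration.service run recorded in a saved journal file ($1).
# Prints "succeeded", "failed", or "unknown" if neither message is present.
last_ovs_configuration_result() {
    awk '
        /ovs-configuration\.service: Succeeded\./        { result = "succeeded" }
        /ovs-configuration\.service: Failed with result/ { result = "failed" }
        END { print (result == "" ? "unknown" : result) }
    ' "$1"
}
```

For the worker-0 journal excerpt above, `last_ovs_configuration_result ../journal-worker-0.log` would report failed, because the failure at 16:11:38 is logged after the successful run at 14:32:30.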
      

            jcaamano@redhat.com Jaime Caamaño Ruiz
            rhn-gps-manrodri Manuel Rodriguez
            Zhanqi Zhao Zhanqi Zhao
            Manuel Rodriguez