OpenShift Bugs / OCPBUGS-4806

[SNO OVN-K] Connection to the node lost after loading OOT drivers.


    Description

      Description of problem:

      Connections to the cluster are lost almost every time an out-of-tree (OOT) driver is installed. The state persists until the host server is manually rebooted.

      I will share the link to the sosreport collected while the issue was present. The following are the observations:

      1. The kubelet service is inactive.

      * kubelet.service - Kubernetes Kubelet
         Loaded: loaded (/etc/systemd/system/kubelet.service; enabled; vendor preset: disabled)
        Drop-In: /etc/systemd/system/kubelet.service.d
                 `-10-mco-default-madv.conf, 20-logging.conf, 20-nodenet.conf
         Active: inactive (dead)

       

      2. NetworkManager-wait-online.service is in the failed state.

      * NetworkManager-wait-online.service - Network Manager Wait Online
         Loaded: loaded (/usr/lib/systemd/system/NetworkManager-wait-online.service; enabled; vendor preset: disabled)
         Active: failed (Result: exit-code) since Mon 2022-12-12 14:47:17 UTC; 33min ago
           Docs: man:nm-online(1)
        Process: 3111 ExecStart=/usr/bin/nm-online -s -q (code=exited, status=1/FAILURE)
       Main PID: 3111 (code=exited, status=1/FAILURE)
            CPU: 347ms
      Dec 12 14:46:17 master0.a1202-7-11-u10-s3-oe20rannic-sno.lab.neat.nsn-rdnet.net systemd[1]: Starting Network Manager Wait Online...
      Dec 12 14:47:17 master0.a1202-7-11-u10-s3-oe20rannic-sno.lab.neat.nsn-rdnet.net systemd[1]: NetworkManager-wait-online.service: Main process exited, code=exited, status=1/FAILURE
      Dec 12 14:47:17 master0.a1202-7-11-u10-s3-oe20rannic-sno.lab.neat.nsn-rdnet.net systemd[1]: NetworkManager-wait-online.service: Failed with result 'exit-code'.
      Dec 12 14:47:17 master0.a1202-7-11-u10-s3-oe20rannic-sno.lab.neat.nsn-rdnet.net systemd[1]: Failed to start Network Manager Wait Online.
      Dec 12 14:47:17 master0.a1202-7-11-u10-s3-oe20rannic-sno.lab.neat.nsn-rdnet.net systemd[1]: NetworkManager-wait-online.service: Consumed 347ms CPU time
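`nm-online -s -q` waits for NetworkManager startup to complete and exits 1 when the network is not up within its timeout (here apparently 60 s, given the gap between "Starting" and the failure above). If the NIC takes longer to reappear after the OOT driver is reloaded, a systemd drop-in extending the timeout can help confirm that theory. This is an illustrative diagnostic only, not a confirmed fix; the drop-in path and value are assumptions:

```ini
# /etc/systemd/system/NetworkManager-wait-online.service.d/30-timeout.conf
# Illustrative drop-in: extend the nm-online timeout to see whether the link
# eventually comes back after the driver reload. Not a confirmed fix.
[Service]
ExecStart=
ExecStart=/usr/bin/nm-online -s -q --timeout=300
```

The empty `ExecStart=` line is required to clear the unit's original command before overriding it.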

      3. ovs-configuration.service is in the failed state.

       

      * ovs-configuration.service - Configures OVS with proper host networking configuration
         Loaded: loaded (/etc/systemd/system/ovs-configuration.service; enabled; vendor preset: disabled)
         Active: failed (Result: exit-code) since Mon 2022-12-12 14:49:41 UTC; 30min ago
        Process: 3632 ExecStart=/usr/local/bin/configure-ovs.sh OVNKubernetes (code=exited, status=1/FAILURE)
       Main PID: 3632 (code=exited, status=1/FAILURE)
            CPU: 5.391s
      Dec 12 14:49:41 master0.a1202-7-11-u10-s3-oe20rannic-sno.lab.neat.nsn-rdnet.net configure-ovs.sh[3632]:     link/ether b4:96:91:c0:85:53 brd ff:ff:ff:ff:ff:ff promiscuity 0 minmtu 68 maxmtu 9702 numtxqueues 64 numrxqueues 64 gso_max_size 65536 gso_max_segs 65535
      Dec 12 14:49:41 master0.a1202-7-11-u10-s3-oe20rannic-sno.lab.neat.nsn-rdnet.net systemd[1]: ovs-configuration.service: Failed with result 'exit-code'.
      Dec 12 14:49:41 master0.a1202-7-11-u10-s3-oe20rannic-sno.lab.neat.nsn-rdnet.net systemd[1]: Failed to start Configures OVS with proper host networking configuration.
      Dec 12 14:49:41 master0.a1202-7-11-u10-s3-oe20rannic-sno.lab.neat.nsn-rdnet.net systemd[1]: ovs-configuration.service: Consumed 5.391s CPU time
      Dec 12 14:49:42 master0.a1202-7-11-u10-s3-oe20rannic-sno.lab.neat.nsn-rdnet.net configure-ovs.sh[3632]: + ip route show
      Dec 12 14:49:42 master0.a1202-7-11-u10-s3-oe20rannic-sno.lab.neat.nsn-rdnet.net configure-ovs.sh[3632]: 10.88.0.0/16 dev cni-podman0 proto kernel scope link src 10.88.0.1 linkdown
      Dec 12 14:49:42 master0.a1202-7-11-u10-s3-oe20rannic-sno.lab.neat.nsn-rdnet.net configure-ovs.sh[3632]: + ip -6 route show
      Dec 12 14:49:42 master0.a1202-7-11-u10-s3-oe20rannic-sno.lab.neat.nsn-rdnet.net configure-ovs.sh[3632]: ::1 dev lo proto kernel metric 256 pref medium
      Dec 12 14:49:42 master0.a1202-7-11-u10-s3-oe20rannic-sno.lab.neat.nsn-rdnet.net configure-ovs.sh[3632]: fe80::/64 dev cni-podman0 proto kernel metric 256 linkdown pref medium
      Dec 12 14:49:42 master0.a1202-7-11-u10-s3-oe20rannic-sno.lab.neat.nsn-rdnet.net configure-ovs.sh[3632]: + exit 1
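The `ip route show` output above lists only the linkdown `cni-podman0` route and no default route, which is consistent with configure-ovs.sh exiting 1: broadly, the script needs an interface carrying the default route to build br-ex. A minimal sketch of that check, run against the route table captured above (the parsing below is illustrative, not the script's actual code):

```shell
# Route table captured from the failing node: no "default" entry, only a
# linkdown podman bridge route.
routes='10.88.0.0/16 dev cni-podman0 proto kernel scope link src 10.88.0.1 linkdown'

# Pick out the device of the default route, as one would from `ip route show`.
iface=$(printf '%s\n' "$routes" | awk '$1 == "default" {print $5; exit}')

if [ -z "$iface" ]; then
  # This is the situation on the failing node.
  echo "no default route found; nothing to attach br-ex to"
else
  echo "default route via $iface"
fi
```

On a healthy node the same check would print the uplink device (e.g. the NIC the OOT driver serves), which suggests the driver reload left the interface without its route.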

      Version-Release number of selected component (if applicable):

      4.10.32

      How reproducible:

      Always

      Steps to Reproduce:

      1. Apply the MC that loads the drivers (the MC is available in the must-gather).
      2. During the MCP-triggered reboot, the connection to the node is lost.
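The actual MC is the one in the must-gather; purely for illustration, an MC of this general shape loads a kernel module at boot via modules-load.d (the role, MC name, and module name below are hypothetical placeholders):

```yaml
# Hypothetical example only; the real MC is attached in the must-gather.
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: master
  name: 99-master-load-oot-driver    # hypothetical name
spec:
  config:
    ignition:
      version: 3.2.0
    storage:
      files:
        - path: /etc/modules-load.d/oot-driver.conf
          mode: 420
          contents:
            # "my_oot_driver" is a placeholder module name
            source: data:,my_oot_driver
```

Applying any MC rolls the change out through the MachineConfigPool, which reboots the (single) node; on a SNO cluster that reboot is where the connection is lost.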

      Actual results:

      Connection to the node is lost after the mcp reboot.

      Expected results:

      Connection to the node should not be lost after the mcp reboot.

      Additional info:

      Workaround:

      - Manually rebooting the node restores the connections.

      People

        bbennett@redhat.com Ben Bennett
        rhn-support-akumawat Akshit Kumawat
        Anurag Saxena Anurag Saxena