OCPBUGS-33211
(OCP 4.10) - OVNkube master SBDB memory expansion + subsequent loss of leader election


    • RCA only; workaround is to rebuild the DBs. SE for 4.10 until 06/14

      Description of problem:
       * On or about March 22nd 15:00, ovnkube-master pod resource consumption in the sbdb container increased unexpectedly and very rapidly, from around 14GB to around 25GB of memory. The node lost the SBDB leader process, and the cluster remained unstable until the nodes were restarted.
       * Seeking assistance with understanding the origin of the memory expansion and how to avoid it (a sketch of inspection commands follows below).
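      For triage, a minimal sketch of commands for inspecting SBDB memory and Raft status from the ovnkube-master pods; the pod label, container name, and socket/DB paths are assumptions based on a typical OCP 4.10 OVN-Kubernetes deployment and should be adjusted to the cluster:
      {code:bash}
      # Locate the ovnkube-master pods (label is an assumption; adjust if it differs)
      oc -n openshift-ovn-kubernetes get pods -l app=ovnkube-master -o wide

      # Internal memory report from the SBDB ovsdb-server
      # (ctl socket path assumed to be /var/run/ovn/ovnsb_db.ctl in the sbdb container)
      oc -n openshift-ovn-kubernetes exec <ovnkube-master-pod> -c sbdb -- \
        ovn-appctl -t /var/run/ovn/ovnsb_db.ctl memory/show

      # Raft membership/leadership for the southbound database
      oc -n openshift-ovn-kubernetes exec <ovnkube-master-pod> -c sbdb -- \
        ovn-appctl -t /var/run/ovn/ovnsb_db.ctl cluster/status OVN_Southbound

      # On-disk SBDB size (path assumed; a very large file can indicate missed compaction)
      oc -n openshift-ovn-kubernetes exec <ovnkube-master-pod> -c sbdb -- \
        du -h /etc/ovn/ovnsb_db.db
      {code}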

       

      Version-Release number of selected component (if applicable):
       - OCP 4.10.55

      How reproducible:
       - One time

      Steps to Reproduce:

      Unknown - the issue occurred with little/no change on the cluster; memory ballooned on the SBDB container and leader election was lost, forcing a DB rebuild to restore and stabilize the cluster (a sketch of the rebuild is included below).
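      For context, a rough sketch of the rebuild-style workaround used to restore stability; the file paths and container names here are assumptions, and the supported Red Hat procedure should be followed in practice:
      {code:bash}
      # Remove the OVN NB/SB database files on an affected master so they are rebuilt
      # (paths assumed; in OCP they are typically host-mounted into the pod at /etc/ovn)
      oc -n openshift-ovn-kubernetes exec <ovnkube-master-pod> -c northd -- \
        rm -f /etc/ovn/ovnnb_db.db /etc/ovn/ovnsb_db.db

      # Restart the pod so the databases are recreated and re-synced from the cluster
      oc -n openshift-ovn-kubernetes delete pod <ovnkube-master-pod>
      {code}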

      Actual results:
       - cluster instability/unexpected downtime

      Expected results:
       - Cluster stability; memory expansion should be attributable to some identifiable process or event. Looking for assistance in identifying the cause.

      Additional info:

      Top MEM-using processes:
      {code:java}
      USER PID %CPU %MEM VSZ-MiB RSS-MiB TTY STAT START TIME COMMAND
      root 1946340 4.7 44.2 28687 28502 ? - Feb12 3003:06 ovsdb-server -vconsole:info -vfile:off
      root 1820836 29.3 9.5 7808 6148 ? - Mar24 1417:21 kube-apiserver --openshift-config=/etc/kubernetes/static-pod-resources/c>
      root 1948344 15.2 3.9 3598 2535 ? - Feb12 9549:34 kube-controller-manager --openshift-config=/etc/kubernetes/static-pod-re>
      root 5479 27.9 3.4 13061 2221 ? - 2023 42769:02 etcd --logger=zap --log-level=info
      contain+ 1940935 4.4 1.3 5039 876 ? - Feb12 2789:29 /bin/olm --namespace openshift-operator-lifecycle-manager
      nfsnobo+ 1941677 1.6 1.1 4381 751 ? - Feb12 1037:45 /usr/bin/cluster-network-operator start --listen=0.0.0.0:9104
      root 1188367 61.8 1.0 3902 690 ? - Mar25 1816:12 openshift-apiserver start --config=/var/run/configmaps/config/config.yam>
      root 3090 24.0 1.0 858 674 ? - 2023 36834:46 ovn-northd --no-chdir -vconsole:info
      1000510+ 1940876 0.9 1.0 4492 646 ? - Feb12 603:30 service-ca-operator controller -v=2
      contain+ 23622 2.2 0.9 11923 587 ? - 2023 3450:56 /bin/opm registry serve
      {code}
      High container CPU/MEM processes (from crictl stats | grep GB):
      {code:java}
      CONTAINER CPU % MEM DISK INODES NAME
      0fb0a93a97298 0.30 70.98MB 1.45GB 72 ## download-server
      1fb02e4408c99 19.59 2.655GB 249.9kB 35 ## kube-controller-manager
      4b2d057964cb2 32.40 2.279GB 8.192kB 18 ## etcd
      522bbc47bf0dd 1.27 1.027GB 28.67kB 32 ## network-operator
      77f72218121db 49.11 29.94GB 12.29kB 32 ## sbdb <--- expanded unexpectedly
      c7743184a5bc5 0.80 1.144GB 8.192kB 26 ## olm-operator
      da52e3918bed2 0.79 1.092GB 28.67kB 33 ## service-ca-controller
      e3dd0c1f5af2e 27.61 6.387GB 254kB 35 ## kube-apiserver
      {code}

       * template details in first update below
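      For reference, the figures above can be gathered on the affected node roughly as follows (a sketch; the ps output above appears post-processed into MiB, and the crictl output was filtered per the heading):
      {code:bash}
      # Host-level: top memory consumers (ps reports VSZ/RSS in KiB, not MiB)
      ps aux --sort=-%mem | head -n 12

      # Container-level: containers consuming GB-scale memory
      sudo crictl stats | grep GB

      # Map the high-memory container back to its name, e.g. the sbdb container
      sudo crictl ps --name sbdb
      {code}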

       

            bbennett@redhat.com Ben Bennett
            rhn-support-wrussell Will Russell
            Anurag Saxena Anurag Saxena