OpenShift Bugs / OCPBUGS-17034

clusterresourceoverrides.admission.autoscaling.openshift.io can trigger pod creation and update failures if the respective clusterresourceoverride pods restart



      Description of problem:

OpenShift Container Platform 4 clusters running the clusterresourceoverride-operator on nodes experiencing nodeCondition=[DiskPressure] will see frequent errors when pods are created or updated, because the respective clusterresourceoverride pod cannot be reached (it is being evicted).
      
Below is an easy way to trigger the condition without creating an actual diskPressure condition, by simply simulating constant restarts of the clusterresourceoverride pods.
      
      $ cat /tmp/pod.yaml 
      apiVersion: v1
      kind: Pod
      metadata:
        name: hostname
      spec:
        containers:
        - name: hostname
          image: quay.io/rhn_support_sreber/hostname:latest
          resources:
            requests:
              memory: "128Mi"
              cpu: "250m"
            limits:
              memory: "256Mi"
              cpu: "500m"
      
      $ oc get ns project-1 -o yaml
      apiVersion: v1
      kind: Namespace
      metadata:
        annotations:
          openshift.io/description: ""
          openshift.io/display-name: ""
          openshift.io/requester: kube:admin
          openshift.io/sa.scc.mcs: s0:c26,c20
          openshift.io/sa.scc.supplemental-groups: 1000690000/10000
          openshift.io/sa.scc.uid-range: 1000690000/10000
        creationTimestamp: "2023-07-28T11:06:57Z"
        labels:
          clusterresourceoverrides.admission.autoscaling.openshift.io/enabled: "true"
          kubernetes.io/metadata.name: project-1
          pod-security.kubernetes.io/audit: restricted
          pod-security.kubernetes.io/audit-version: v1.24
          pod-security.kubernetes.io/warn: restricted
          pod-security.kubernetes.io/warn-version: v1.24
        name: project-1
        resourceVersion: "558511"
        uid: 75cc7d03-3f6d-4c9e-a5f5-a7a2435a89f1
      spec:
        finalizers:
        - kubernetes
      status:
        phase: Active
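      
      Note that the ClusterResourceOverride admission webhook only mutates pods in namespaces carrying the enabled label shown above, which can be applied with:
      
      $ oc label namespace project-1 clusterresourceoverrides.admission.autoscaling.openshift.io/enabled=true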
      
      $ oc get pod -n resourceoverride
      NAME                                                READY   STATUS    RESTARTS   AGE
      clusterresourceoverride-2rb6p                       1/1     Running   0          8m23s
      clusterresourceoverride-4kd2k                       1/1     Running   0          4m58s
      clusterresourceoverride-bqbjh                       1/1     Running   0          5s
      clusterresourceoverride-operator-595cc699cf-bp6t5   1/1     Running   0          142m
      
      $ while true; do POD=`oc get pod -n resourceoverride | grep -v clusterresourceoverride-operator | grep -v "^NAME" | tail -1 | awk '{print $1}'`; oc delete pod $POD -n resourceoverride; sleep 3; done
      pod "clusterresourceoverride-4xwz7" deleted
      pod "clusterresourceoverride-vx7g5" deleted
      pod "clusterresourceoverride-7f7f6" deleted
      pod "clusterresourceoverride-59fwc" deleted
      pod "clusterresourceoverride-mfk6z" deleted
      pod "clusterresourceoverride-zs77p" deleted
      pod "clusterresourceoverride-wdzwv" deleted
      pod "clusterresourceoverride-btf79" deleted
      pod "clusterresourceoverride-jvrjx" deleted
      pod "clusterresourceoverride-wmlpn" deleted
      pod "clusterresourceoverride-q4zjt" deleted
      pod "clusterresourceoverride-wr285" deleted
      pod "clusterresourceoverride-z9hsn" deleted
      pod "clusterresourceoverride-hfwcg" deleted
      pod "clusterresourceoverride-5dnzk" deleted
      pod "clusterresourceoverride-9cdtn" deleted
      pod "clusterresourceoverride-k2cdv" deleted
      pod "clusterresourceoverride-9qtpq" deleted
      pod "clusterresourceoverride-tb2qk" deleted
      pod "clusterresourceoverride-bqbjh" deleted
      
      $ while true; do oc delete pod -n project-1 hostname; oc apply -f /tmp/pod.yaml -n project-1; sleep 1; done
      [...]
      pod "hostname" deleted
      Error from server (InternalError): error when creating "/tmp/pod.yaml": Internal error occurred: failed calling webhook "clusterresourceoverrides.admission.autoscaling.openshift.io": failed to call webhook: Post "https://localhost:9400/apis/admission.autoscaling.openshift.io/v1/clusterresourceoverrides?timeout=5s": dial tcp [::1]:9400: connect: connection refused
      Error from server (NotFound): pods "hostname" not found
      Error from server (InternalError): error when creating "/tmp/pod.yaml": Internal error occurred: failed calling webhook "clusterresourceoverrides.admission.autoscaling.openshift.io": failed to call webhook: Post "https://localhost:9400/apis/admission.autoscaling.openshift.io/v1/clusterresourceoverrides?timeout=5s": dial tcp [::1]:9400: connect: connection refused
      Error from server (NotFound): pods "hostname" not found
      pod/hostname created
      pod "hostname" deleted
      pod/hostname created
      pod "hostname" deleted
      
So the clusterresourceoverride pods should either be assigned a priorityClass that prevents them from being evicted, or, potentially even better, the webhook should call a Service instead of localhost, so that requests are only routed to an endpoint that is considered healthy and running.
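      
      For illustration, below is a minimal sketch of both suggestions. The DaemonSet, Service, and webhook configuration object names are assumptions derived from the pod and namespace names above, not verified against the operator's actual manifests; the webhook name and port 9400 are taken from the error message.
      
      # Option 1: assign a high priority class so the webhook pods are among
      # the last candidates for node-pressure eviction (snippet of the
      # DaemonSet pod template).
      apiVersion: apps/v1
      kind: DaemonSet
      metadata:
        name: clusterresourceoverride   # assumed name
        namespace: resourceoverride
      spec:
        template:
          spec:
            priorityClassName: system-node-critical
            # (rest of the pod template unchanged)
      
      # Option 2: point the webhook at a Service instead of localhost so the
      # API server only contacts endpoints that pass their readiness checks
      # (snippet of the MutatingWebhookConfiguration).
      apiVersion: admissionregistration.k8s.io/v1
      kind: MutatingWebhookConfiguration
      metadata:
        name: clusterresourceoverrides.admission.autoscaling.openshift.io   # assumed name
      webhooks:
      - name: clusterresourceoverrides.admission.autoscaling.openshift.io
        clientConfig:
          # instead of:
          #   url: https://localhost:9400/apis/admission.autoscaling.openshift.io/v1/clusterresourceoverrides
          service:
            namespace: resourceoverride
            name: clusterresourceoverride   # assumed Service name
            path: /apis/admission.autoscaling.openshift.io/v1/clusterresourceoverrides
            port: 9400
          # caBundle must match the serving certificate of the Service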
      
      

      Version-Release number of selected component (if applicable):

      OpenShift Container Platform 4.11 and OpenShift Container Platform 4.13
      

      How reproducible:

      Always
      

      Steps to Reproduce:

1. Install OpenShift Container Platform 4
      2. Install the ClusterResourceOverride operator according to https://docs.openshift.com/container-platform/4.13/nodes/clusters/nodes-cluster-overcommit.html (a sample custom resource is sketched after this list)
      3. Either create diskPressure on one of the OpenShift Container Platform 4 control-plane nodes, or simply delete the clusterresourceoverride pods repeatedly to simulate a diskPressure situation (as shown in the description)
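      
      For step 2, enabling the override requires creating a ClusterResourceOverride custom resource named "cluster"; the percentage values below are the example values from the linked documentation:
      
      apiVersion: operator.autoscaling.openshift.io/v1
      kind: ClusterResourceOverride
      metadata:
        name: cluster
      spec:
        podResourceOverride:
          spec:
            memoryRequestToLimitPercent: 50
            cpuRequestToLimitPercent: 25
            limitCPUToMemoryPercent: 200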
      

      Actual results:

      $ while true; do oc delete pod -n project-1 hostname; oc apply -f /tmp/pod.yaml -n project-1; sleep 1; done
      [...]
      pod "hostname" deleted
      Error from server (InternalError): error when creating "/tmp/pod.yaml": Internal error occurred: failed calling webhook "clusterresourceoverrides.admission.autoscaling.openshift.io": failed to call webhook: Post "https://localhost:9400/apis/admission.autoscaling.openshift.io/v1/clusterresourceoverrides?timeout=5s": dial tcp [::1]:9400: connect: connection refused
      Error from server (NotFound): pods "hostname" not found
      Error from server (InternalError): error when creating "/tmp/pod.yaml": Internal error occurred: failed calling webhook "clusterresourceoverrides.admission.autoscaling.openshift.io": failed to call webhook: Post "https://localhost:9400/apis/admission.autoscaling.openshift.io/v1/clusterresourceoverrides?timeout=5s": dial tcp [::1]:9400: connect: connection refused
      Error from server (NotFound): pods "hostname" not found
      

      Expected results:

Pods are created and updated successfully as long as at least one of the three clusterresourceoverride pods is running, regardless of which OpenShift Container Platform 4 control-plane node it runs on.
      

      Additional info:

      
      

      Assignee: Joel Smith (joelsmith.redhat)
      Reporter: Simon Reber (rhn-support-sreber)
      QA Contact: Sunil Choudhary