OCPBUGS-33185

assisted installer leaving broken debug pod after deployment complete


      Description of problem:

      
      The environment is a hub-cluster deployment of a disconnected SNO spoke, with the Telco RAN DU profile applied to the spoke.
      
      The assisted installer leaves a broken debug pod on the spoke after deployment completes. This breaks test automation that expects all pods to be healthy before and during tests.
      
      This seems to happen only with ACM 2.10.2+; it does not occur when ACM 2.9 is installed on the hub cluster.
      
      The pod pulls registry.redhat.io/rhel8/support-tools, which is not reachable in a disconnected environment (we mirror images to a local registry before deployment).
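      
      Note that the pod pulls by tag (no digest), and ICSP/IDMS mirror rules only apply to digest-based pulls, so the pull fails even though the repository is mirrored. A minimal sketch of a tag-based mirror that should cover this pull, assuming a hypothetical local registry at registry.local:5000 (substitute the real mirror host):
      
      # support-tools-itms.yaml (hypothetical file name)
      apiVersion: config.openshift.io/v1
      kind: ImageTagMirrorSet
      metadata:
        name: support-tools-mirror
      spec:
        imageTagMirrors:
        - source: registry.redhat.io/rhel8/support-tools
          mirrors:
          - registry.local:5000/rhel8/support-tools
      
      $ oc apply -f support-tools-itms.yaml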
      
      assisted-installer                                 helix55labengrdu2redhatcom-debug-8cqw8                            0/1     ImagePullBackOff   0              5h39m
      
      $ oc logs -n assisted-installer helix55labengrdu2redhatcom-debug-8cqw8 --all-containers 
      Error from server (BadRequest): container "container-00" in pod "helix55labengrdu2redhatcom-debug-8cqw8" is waiting to start: trying and failing to pull image
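      
      The same failure shows up in the pod's events, which is a quick way to confirm the pull error on a reproduction:
      
      $ oc get events -n assisted-installer --field-selector involvedObject.name=helix55labengrdu2redhatcom-debug-8cqw8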
      
      $ oc get pod -n assisted-installer helix55labengrdu2redhatcom-debug-8cqw8 -o yaml
      apiVersion: v1
      kind: Pod
      metadata:
        annotations:
          debug.openshift.io/source-container: container-00
          debug.openshift.io/source-resource: /v1, Resource=nodes/helix55.lab.eng.rdu2.redhat.com
          openshift.io/scc: privileged
        creationTimestamp: "2024-05-01T19:44:23Z"
        name: helix55labengrdu2redhatcom-debug-8cqw8
        namespace: assisted-installer
        resourceVersion: "61063"
        uid: 658c2671-8b31-49ba-a25c-1149380d25cc
      spec:
        containers:
        - command:
          - chroot
          - /host
          - last
          - reboot
          env:
          - name: TMOUT
            value: "900"
          image: registry.redhat.io/rhel8/support-tools
          imagePullPolicy: Always
          name: container-00
          resources: {}
          securityContext:
            privileged: true
            runAsUser: 0
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
          volumeMounts:
          - mountPath: /host
            name: host
          - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
            name: kube-api-access-nglvx
            readOnly: true
        dnsPolicy: ClusterFirst
        enableServiceLinks: true
        hostIPC: true
        hostNetwork: true
        hostPID: true
        nodeName: helix55.lab.eng.rdu2.redhat.com
        preemptionPolicy: PreemptLowerPriority
        priority: 1000000000
        priorityClassName: openshift-user-critical
        restartPolicy: Never
        schedulerName: default-scheduler
        securityContext: {}
        serviceAccount: default
        serviceAccountName: default
        terminationGracePeriodSeconds: 30
        tolerations:
        - effect: NoExecute
          key: node.kubernetes.io/not-ready
          operator: Exists
          tolerationSeconds: 300
        - effect: NoExecute
          key: node.kubernetes.io/unreachable
          operator: Exists
          tolerationSeconds: 300
        volumes:
        - hostPath:
            path: /
            type: Directory
          name: host
        - name: kube-api-access-nglvx
          projected:
            defaultMode: 420
            sources:
            - serviceAccountToken:
                expirationSeconds: 3607
                path: token
            - configMap:
                items:
                - key: ca.crt
                  path: ca.crt
                name: kube-root-ca.crt
            - downwardAPI:
                items:
                - fieldRef:
                    apiVersion: v1
                    fieldPath: metadata.namespace
                  path: namespace
            - configMap:
                items:
                - key: service-ca.crt
                  path: service-ca.crt
                name: openshift-service-ca.crt
      status:
        conditions:
        - lastProbeTime: null
          lastTransitionTime: "2024-05-01T20:14:38Z"
          status: "True"
          type: PodReadyToStartContainers
        - lastProbeTime: null
          lastTransitionTime: "2024-05-01T19:44:23Z"
          status: "True"
          type: Initialized
        - lastProbeTime: null
          lastTransitionTime: "2024-05-01T19:44:23Z"
          message: 'containers with unready status: [container-00]'
          reason: ContainersNotReady
          status: "False"
          type: Ready
        - lastProbeTime: null
          lastTransitionTime: "2024-05-01T19:44:23Z"
          message: 'containers with unready status: [container-00]'
          reason: ContainersNotReady
          status: "False"
          type: ContainersReady
        - lastProbeTime: null
          lastTransitionTime: "2024-05-01T19:44:23Z"
          status: "True"
          type: PodScheduled
        containerStatuses:
        - image: registry.redhat.io/rhel8/support-tools
          imageID: ""
          lastState: {}
          name: container-00
          ready: false
          restartCount: 0
          started: false
          state:
            waiting:
              message: Back-off pulling image "registry.redhat.io/rhel8/support-tools"
              reason: ImagePullBackOff
        hostIP: 2620:52:0:800::1ff3
        hostIPs:
        - ip: 2620:52:0:800::1ff3
        phase: Pending
        podIP: 2620:52:0:800::1ff3
        podIPs:
        - ip: 2620:52:0:800::1ff3
        qosClass: BestEffort
        startTime: "2024-05-01T19:44:23Z"
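      
      The debug.openshift.io/source-resource annotation and the chroot /host last reboot command suggest this pod is left over from an oc debug run against the node, presumably something along these lines (the exact flags used by the installer are an assumption):
      
      $ oc debug node/helix55.lab.eng.rdu2.redhat.com --to-namespace assisted-installer -- chroot /host last reboot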
      
      
          

      Version-Release number of selected component (if applicable):

      The OCP version does not appear to matter; tested with:
      
      HUB:
      $ oc get clusterversions.config.openshift.io 
      NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
      version   4.15.11   True        False         33h     Cluster version is 4.15.11
      
      advanced-cluster-management.v2.10.3
      multicluster-engine.v2.5.3
      openshift-gitops-operator.v1.12.0
      packageserver
      topology-aware-lifecycle-manager.v4.16.0
      ztp-site-generator:v4.16.0-23
      
      SNO SPOKE:
      $ oc get clusterversions.config.openshift.io 
      NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
      version   4.16.0-0.nightly-2024-04-30-053518   True        False         5h29m   Cluster version is 4.16.0-0.nightly-2024-04-30-053518
      
      cluster-logging.v5.9.1
      local-storage-operator.v4.16.0-202404302047
      packageserver
      ptp-operator.v4.16.0-202404292211
      sriov-fec.v2.8.0
      sriov-network-operator.v4.16.0-202404292211
      
      
          

      How reproducible:

      Always
          

      Steps to Reproduce:

          1. Deploy SNO spoke with Telco RAN DU profile
          2. Observe the pod state and logs on the spoke after deployment (see the check below)
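      
      A quick check for step 2, run against the spoke:
      
      $ oc get pods -n assisted-installer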
          

      Actual results:

      The debug pod is present in the assisted-installer namespace and stuck in an error state (ImagePullBackOff).
          

      Expected results:

      The debug pod either pulls its image correctly or is not created during deployment in the first place. Either way, it should be cleaned up from the node after deployment is complete.
          

      Additional info:

      
      The workaround is to delete the pod manually and rerun the tests, which is not practical in CI (see the sketch below).
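      
      A sketch of that manual cleanup, matching leftover debug pods by the -debug- name pattern seen above (the pattern is an assumption based on this pod, not a stable API):
      
      $ oc get pods -n assisted-installer -o name | grep -- '-debug-' | xargs -r oc delete -n assisted-installer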
      
      Sosreports and must-gathers will be attached in a comment.
          
