AI-44 - Bug
Resolution: Unresolved
Priority: Undefined
Affects Versions: 4.13.z, 4.12.z, 4.14.z, 4.15.z, 4.16.0
Severity: Moderate
Description of problem:
Environment is a hub cluster deployment of a disconnected SNO spoke, with the Telco RAN DU profile applied to the spoke. The assisted installer leaves a broken pod on the spoke after deployment. This affects test automation that expects all pods to be healthy before/during tests. This seems to happen only with ACM 2.10.2+; I don't see it when ACM 2.9 is installed on the hub cluster.

The pod pulls from registry.redhat.io/rhel8/support-tools, which is not available in a disconnected environment (we mirror to a local registry before deployment):

assisted-installer   helix55labengrdu2redhatcom-debug-8cqw8   0/1   ImagePullBackOff   0   5h39m

$ oc logs -n assisted-installer helix55labengrdu2redhatcom-debug-8cqw8 --all-containers
Error from server (BadRequest): container "container-00" in pod "helix55labengrdu2redhatcom-debug-8cqw8" is waiting to start: trying and failing to pull image

$ oc get pod -n assisted-installer helix55labengrdu2redhatcom-debug-8cqw8 -o yaml
apiVersion: v1
kind: Pod
metadata:
  annotations:
    debug.openshift.io/source-container: container-00
    debug.openshift.io/source-resource: /v1, Resource=nodes/helix55.lab.eng.rdu2.redhat.com
    openshift.io/scc: privileged
  creationTimestamp: "2024-05-01T19:44:23Z"
  name: helix55labengrdu2redhatcom-debug-8cqw8
  namespace: assisted-installer
  resourceVersion: "61063"
  uid: 658c2671-8b31-49ba-a25c-1149380d25cc
spec:
  containers:
  - command:
    - chroot
    - /host
    - last
    - reboot
    env:
    - name: TMOUT
      value: "900"
    image: registry.redhat.io/rhel8/support-tools
    imagePullPolicy: Always
    name: container-00
    resources: {}
    securityContext:
      privileged: true
      runAsUser: 0
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /host
      name: host
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-nglvx
      readOnly: true
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  hostIPC: true
  hostNetwork: true
  hostPID: true
  nodeName: helix55.lab.eng.rdu2.redhat.com
  preemptionPolicy: PreemptLowerPriority
  priority: 1000000000
  priorityClassName: openshift-user-critical
  restartPolicy: Never
  schedulerName: default-scheduler
  securityContext: {}
  serviceAccount: default
  serviceAccountName: default
  terminationGracePeriodSeconds: 30
  tolerations:
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 300
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 300
  volumes:
  - hostPath:
      path: /
      type: Directory
    name: host
  - name: kube-api-access-nglvx
    projected:
      defaultMode: 420
      sources:
      - serviceAccountToken:
          expirationSeconds: 3607
          path: token
      - configMap:
          items:
          - key: ca.crt
            path: ca.crt
          name: kube-root-ca.crt
      - downwardAPI:
          items:
          - fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
            path: namespace
      - configMap:
          items:
          - key: service-ca.crt
            path: service-ca.crt
          name: openshift-service-ca.crt
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2024-05-01T20:14:38Z"
    status: "True"
    type: PodReadyToStartContainers
  - lastProbeTime: null
    lastTransitionTime: "2024-05-01T19:44:23Z"
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2024-05-01T19:44:23Z"
    message: 'containers with unready status: [container-00]'
    reason: ContainersNotReady
    status: "False"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: "2024-05-01T19:44:23Z"
    message: 'containers with unready status: [container-00]'
    reason: ContainersNotReady
    status: "False"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: "2024-05-01T19:44:23Z"
    status: "True"
    type: PodScheduled
  containerStatuses:
  - image: registry.redhat.io/rhel8/support-tools
    imageID: ""
    lastState: {}
    name: container-00
    ready: false
    restartCount: 0
    started: false
    state:
      waiting:
        message: Back-off pulling image "registry.redhat.io/rhel8/support-tools"
        reason: ImagePullBackOff
  hostIP: 2620:52:0:800::1ff3
  hostIPs:
  - ip: 2620:52:0:800::1ff3
  phase: Pending
  podIP: 2620:52:0:800::1ff3
  podIPs:
  - ip: 2620:52:0:800::1ff3
  qosClass: BestEffort
  startTime: "2024-05-01T19:44:23Z"
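For context, the debug.openshift.io/* annotations and the "chroot /host last reboot" command suggest this pod is the leftover of a node debug invocation, roughly equivalent to the sketch below; which component issues it, and why it ends up pointing at registry.redhat.io/rhel8/support-tools, is an assumption on my part:

# sketch of the kind of call that produces such a pod (not the exact caller)
$ oc debug node/helix55.lab.eng.rdu2.redhat.com -- chroot /host last reboot
# in a disconnected cluster the debug image would have to resolve to the local mirror, e.g.
$ oc debug --image=<local-mirror>/rhel8/support-tools node/helix55.lab.eng.rdu2.redhat.com -- chroot /host last reboot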
Version-Release number of selected component (if applicable):
OCP version doesn't matter, tested with:

HUB:
$ oc get clusterversions.config.openshift.io
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.15.11   True        False         33h     Cluster version is 4.15.11

advanced-cluster-management.v2.10.3
multicluster-engine.v2.5.3
openshift-gitops-operator.v1.12.0
packageserver
topology-aware-lifecycle-manager.v4.16.0
ztp-site-generator:v4.16.0-23

SNO SPOKE:
$ oc get clusterversions.config.openshift.io
NAME      VERSION                               AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.16.0-0.nightly-2024-04-30-053518    True        False         5h29m   Cluster version is 4.16.0-0.nightly-2024-04-30-053518

cluster-logging.v5.9.1
local-storage-operator.v4.16.0-202404302047
packageserver
ptp-operator.v4.16.0-202404292211
sriov-fec.v2.8.0
sriov-network-operator.v4.16.0-202404292211
How reproducible:
Always
Steps to Reproduce:
1. Deploy SNO spoke with Telco RAN DU profile.
2. Observe pod state and logs after deployment (one possible check is sketched below).
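For step 2, a sketch of the kind of check our automation effectively performs (any equivalent pod health check works):

# list pods that are neither Running nor Succeeded; the leftover debug pod shows up here
$ oc get pods -A --field-selector=status.phase!=Running,status.phase!=Succeeded
# or look at the assisted-installer namespace directly
$ oc get pods -n assisted-installer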
Actual results:
A debug pod is present in the assisted-installer namespace and is stuck in an error state (ImagePullBackOff).
Expected results:
The debug pod either pulls its image correctly, or is not created during deployment in the first place. Either way, it should be cleaned up from the node after deployment is complete.
Additional info:
The workaround is to delete the pod manually and rerun the tests, which is not practical in CI. sosreports and must-gathers will be attached in a comment.
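A minimal sketch of that manual workaround (the pod name is the one from this deployment; matching on the -debug- suffix is our own convention, not something guaranteed by the product):

# delete the specific leftover pod
$ oc delete pod -n assisted-installer helix55labengrdu2redhatcom-debug-8cqw8
# or clean up any leftover debug pods in the namespace
$ oc get pods -n assisted-installer -o name | grep -- -debug- | xargs -r oc delete -n assisted-installer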