OCPBUGS-33185

assisted installer leaving broken debug pod after deployment complete


      Description of problem:

      
      The environment is a hub-cluster deployment of a disconnected SNO spoke, with the Telco RAN DU profile applied to the spoke.
      
      The assisted installer leaves a broken debug pod on the spoke after deployment completes. This breaks test automation that expects all pods to be healthy before and during tests.
      
      This seems to happen only with ACM 2.10.2+; it does not occur when ACM 2.9 is installed on the hub cluster.
      
      The pod pulls registry.redhat.io/rhel8/support-tools, which is not reachable in a disconnected environment (we mirror images to a local registry before deployment).
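      
      Note that the pod pulls by tag (no digest), and ICSP/IDMS mirror rules only apply to digest-based pulls, so the pull fails even though the repository is mirrored. A minimal sketch of a tag-based mirror that should cover this pull, assuming a hypothetical local registry at registry.local:5000 (substitute the real mirror host):
      
      # support-tools-itms.yaml (hypothetical file name)
      apiVersion: config.openshift.io/v1
      kind: ImageTagMirrorSet
      metadata:
        name: support-tools-mirror
      spec:
        imageTagMirrors:
        - source: registry.redhat.io/rhel8/support-tools
          mirrors:
          - registry.local:5000/rhel8/support-tools
      
      $ oc apply -f support-tools-itms.yaml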
      
      assisted-installer                                 helix55labengrdu2redhatcom-debug-8cqw8                            0/1     ImagePullBackOff   0              5h39m
      
      $ oc logs -n assisted-installer helix55labengrdu2redhatcom-debug-8cqw8 --all-containers 
      Error from server (BadRequest): container "container-00" in pod "helix55labengrdu2redhatcom-debug-8cqw8" is waiting to start: trying and failing to pull image
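      
      The same failure shows up in the pod's events, which is a quick way to confirm the pull error on a reproduction:
      
      $ oc get events -n assisted-installer --field-selector involvedObject.name=helix55labengrdu2redhatcom-debug-8cqw8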
      
      $ oc get pod -n assisted-installer helix55labengrdu2redhatcom-debug-8cqw8 -o yaml
      apiVersion: v1
      kind: Pod
      metadata:
        annotations:
          debug.openshift.io/source-container: container-00
          debug.openshift.io/source-resource: /v1, Resource=nodes/helix55.lab.eng.rdu2.redhat.com
          openshift.io/scc: privileged
        creationTimestamp: "2024-05-01T19:44:23Z"
        name: helix55labengrdu2redhatcom-debug-8cqw8
        namespace: assisted-installer
        resourceVersion: "61063"
        uid: 658c2671-8b31-49ba-a25c-1149380d25cc
      spec:
        containers:
        - command:
          - chroot
          - /host
          - last
          - reboot
          env:
          - name: TMOUT
            value: "900"
          image: registry.redhat.io/rhel8/support-tools
          imagePullPolicy: Always
          name: container-00
          resources: {}
          securityContext:
            privileged: true
            runAsUser: 0
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
          volumeMounts:
          - mountPath: /host
            name: host
          - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
            name: kube-api-access-nglvx
            readOnly: true
        dnsPolicy: ClusterFirst
        enableServiceLinks: true
        hostIPC: true
        hostNetwork: true
        hostPID: true
        nodeName: helix55.lab.eng.rdu2.redhat.com
        preemptionPolicy: PreemptLowerPriority
        priority: 1000000000
        priorityClassName: openshift-user-critical
        restartPolicy: Never
        schedulerName: default-scheduler
        securityContext: {}
        serviceAccount: default
        serviceAccountName: default
        terminationGracePeriodSeconds: 30
        tolerations:
        - effect: NoExecute
          key: node.kubernetes.io/not-ready
          operator: Exists
          tolerationSeconds: 300
        - effect: NoExecute
          key: node.kubernetes.io/unreachable
          operator: Exists
          tolerationSeconds: 300
        volumes:
        - hostPath:
            path: /
            type: Directory
          name: host
        - name: kube-api-access-nglvx
          projected:
            defaultMode: 420
            sources:
            - serviceAccountToken:
                expirationSeconds: 3607
                path: token
            - configMap:
                items:
                - key: ca.crt
                  path: ca.crt
                name: kube-root-ca.crt
            - downwardAPI:
                items:
                - fieldRef:
                    apiVersion: v1
                    fieldPath: metadata.namespace
                  path: namespace
            - configMap:
                items:
                - key: service-ca.crt
                  path: service-ca.crt
                name: openshift-service-ca.crt
      status:
        conditions:
        - lastProbeTime: null
          lastTransitionTime: "2024-05-01T20:14:38Z"
          status: "True"
          type: PodReadyToStartContainers
        - lastProbeTime: null
          lastTransitionTime: "2024-05-01T19:44:23Z"
          status: "True"
          type: Initialized
        - lastProbeTime: null
          lastTransitionTime: "2024-05-01T19:44:23Z"
          message: 'containers with unready status: [container-00]'
          reason: ContainersNotReady
          status: "False"
          type: Ready
        - lastProbeTime: null
          lastTransitionTime: "2024-05-01T19:44:23Z"
          message: 'containers with unready status: [container-00]'
          reason: ContainersNotReady
          status: "False"
          type: ContainersReady
        - lastProbeTime: null
          lastTransitionTime: "2024-05-01T19:44:23Z"
          status: "True"
          type: PodScheduled
        containerStatuses:
        - image: registry.redhat.io/rhel8/support-tools
          imageID: ""
          lastState: {}
          name: container-00
          ready: false
          restartCount: 0
          started: false
          state:
            waiting:
              message: Back-off pulling image "registry.redhat.io/rhel8/support-tools"
              reason: ImagePullBackOff
        hostIP: 2620:52:0:800::1ff3
        hostIPs:
        - ip: 2620:52:0:800::1ff3
        phase: Pending
        podIP: 2620:52:0:800::1ff3
        podIPs:
        - ip: 2620:52:0:800::1ff3
        qosClass: BestEffort
        startTime: "2024-05-01T19:44:23Z"
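      
      The debug.openshift.io/source-resource annotation and the chroot /host last reboot command suggest this pod is left over from an oc debug run against the node, presumably something along these lines (the exact flags used by the installer are an assumption):
      
      $ oc debug node/helix55.lab.eng.rdu2.redhat.com --to-namespace assisted-installer -- chroot /host last reboot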
      
      
          

      Version-Release number of selected component (if applicable):

      The OCP version does not appear to matter; tested with:
      
      HUB:
      $ oc get clusterversions.config.openshift.io 
      NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
      version   4.15.11   True        False         33h     Cluster version is 4.15.11
      
      advanced-cluster-management.v2.10.3
      multicluster-engine.v2.5.3
      openshift-gitops-operator.v1.12.0
      packageserver
      topology-aware-lifecycle-manager.v4.16.0
      ztp-site-generator:v4.16.0-23
      
      SNO SPOKE:
      $ oc get clusterversions.config.openshift.io 
      NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
      version   4.16.0-0.nightly-2024-04-30-053518   True        False         5h29m   Cluster version is 4.16.0-0.nightly-2024-04-30-053518
      
      cluster-logging.v5.9.1
      local-storage-operator.v4.16.0-202404302047
      packageserver
      ptp-operator.v4.16.0-202404292211
      sriov-fec.v2.8.0
      sriov-network-operator.v4.16.0-202404292211
      
      
          

      How reproducible:

      Always
          

      Steps to Reproduce:

          1. Deploy SNO spoke with Telco RAN DU profile
          2. Observe the pod state and logs on the spoke after deployment (see the check below)
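      
      A quick check for step 2, run against the spoke:
      
      $ oc get pods -n assisted-installer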
          

      Actual results:

      The debug pod is present in the assisted-installer namespace and stuck in an error state (ImagePullBackOff).
          

      Expected results:

      The debug pod either pulls its image correctly or is not created during deployment in the first place. Either way, it should be cleaned up from the node after deployment is complete.
          

      Additional info:

      
      The workaround is to delete the pod manually and rerun the tests, which is not practical in CI (see the sketch below).
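      
      A sketch of that manual cleanup, matching leftover debug pods by the -debug- name pattern seen above (the pattern is an assumption based on this pod, not a stable API):
      
      $ oc get pods -n assisted-installer -o name | grep -- '-debug-' | xargs -r oc delete -n assisted-installer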
      
      Sosreports and must-gathers will be attached in a comment.
          
