  1. Red Hat OpenShift AI Engineering
  2. RHOAIENG-4002

OOM error in the kserve storage-initializer init container

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Critical
    • RHOAI_2.7.0
    • kserve
    • Testable

      After an enhancement of our detection mechanism in the model loading perf&scale testing, we're observing that the storage-initializer init container of InferenceServices fails because it is running out of memory:

                message: |
                  INFO:root:Initializing, args: src_uri [s3://psap-hf-models/flan-t5-small/flan-t5-small] dest_path[ [/mnt/models]
                  INFO:root:Copying contents of s3://psap-hf-models/flan-t5-small/flan-t5-small to local
                  INFO:botocore.credentials:Found credentials in environment variables.
                  INFO:root:Downloaded object flan-t5-small/flan-t5-small/.gitattributes to /mnt/models/.gitattributes
                  INFO:root:Downloaded object flan-t5-small/flan-t5-small/README.md to /mnt/models/README.md
                  INFO:root:Downloaded object flan-t5-small/flan-t5-small/config.json to /mnt/models/config.json
                  INFO:root:Downloaded object flan-t5-small/flan-t5-small/flax_model.msgpack to /mnt/models/flax_model.msgpack
                  INFO:root:Downloaded object flan-t5-small/flan-t5-small/generation_config.json to /mnt/models/generation_config.json
                  INFO:root:Downloaded object flan-t5-small/flan-t5-small/model.safetensors to /mnt/models/model.safetensors
                reason: OOMKilled
                startedAt: "2024-03-01T16:54:02Z"
      

      This InferenceService was configured to download a small model, not even a large one.
      (link to the CI test where the failure happened)
      This error might easily go unnoticed, as the restart count of the init container isn't displayed by the usual oc commands.

      Here is another example, where the automated test failed because it detected that the initContainer was restarted (once), although the download succeeded on the second try:

          - containerID: cri-o://4558b4dd710052ee24f7a1dc94ce5f5f6de049eefbe9d0ba349ec5a3e00dc79d
            image: quay.io/modh/kserve-storage-initializer@sha256:330af2d517b17dbf0cab31beba13cdbe7d6f4b9457114dea8f8485a011e3b138
            imageID: quay.io/modh/kserve-storage-initializer@sha256:330af2d517b17dbf0cab31beba13cdbe7d6f4b9457114dea8f8485a011e3b138
            lastState: {}
            name: storage-initializer
            ready: true
            restartCount: 1
            state:
              terminated:
                containerID: cri-o://4558b4dd710052ee24f7a1dc94ce5f5f6de049eefbe9d0ba349ec5a3e00dc79d
                exitCode: 0
                finishedAt: "2024-03-01T21:17:46Z"
                reason: Completed
                startedAt: "2024-03-01T21:17:34Z"
      

      (link to this CI run)
      The issue with this behavior is that the model may be downloaded from S3 multiple times, potentially indefinitely if the model always triggers an OOM error.
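
      For reference, the kind of check described above can be sketched in a few lines of Python. This is only an illustration, not the actual CI detection code: the function name find_restarted_init_containers is hypothetical, and the input is assumed to be the JSON document of a pod (for instance the output of "oc get pod <name> -o json"). The fields it reads (status.initContainerStatuses, restartCount, state/lastState.terminated.reason) are the standard Kubernetes pod status fields shown in the excerpts above.

```python
import json


def find_restarted_init_containers(pod: dict) -> list[dict]:
    """Return one record per init container that restarted or was OOM-killed.

    `pod` is assumed to be a pod object parsed from JSON.
    """
    findings = []
    for status in pod.get("status", {}).get("initContainerStatuses", []):
        restarts = status.get("restartCount", 0)

        # The OOM kill may be recorded in the current state (container still
        # failing) or in the previous state (container already restarted).
        reasons = set()
        for key in ("state", "lastState"):
            terminated = (status.get(key) or {}).get("terminated") or {}
            if terminated.get("reason"):
                reasons.add(terminated["reason"])

        oom_killed = "OOMKilled" in reasons
        if restarts > 0 or oom_killed:
            findings.append({
                "name": status.get("name"),
                "restartCount": restarts,
                "oomKilled": oom_killed,
            })
    return findings


if __name__ == "__main__":
    # Sample mirroring the second excerpt: one restart, then a clean exit.
    sample_pod = {
        "status": {
            "initContainerStatuses": [{
                "name": "storage-initializer",
                "restartCount": 1,
                "lastState": {},
                "state": {"terminated": {"exitCode": 0, "reason": "Completed"}},
            }]
        }
    }
    print(json.dumps(find_restarted_init_containers(sample_pod), indent=2))
```

      Note that in the second excerpt lastState is empty, so the OOM reason itself is no longer visible there; under that assumption only the restart count reveals that something went wrong, which is why the sketch flags either condition.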

            Assignee: Unassigned
            Reporter: Kevin Pouget (kpouget2)