Bug
Resolution: Unresolved
Critical
None
RHOAI_2.7.0
False
False
No
No
Testable
After an enhancement of our detection mechanism in the model-loading perf & scale testing, we're observing that the storage-initializer init container of InferenceServices fails because it is running out of memory:
message: |
  INFO:root:Initializing, args: src_uri [s3://psap-hf-models/flan-t5-small/flan-t5-small] dest_path[ [/mnt/models]
  INFO:root:Copying contents of s3://psap-hf-models/flan-t5-small/flan-t5-small to local
  INFO:botocore.credentials:Found credentials in environment variables.
  INFO:root:Downloaded object flan-t5-small/flan-t5-small/.gitattributes to /mnt/models/.gitattributes
  INFO:root:Downloaded object flan-t5-small/flan-t5-small/README.md to /mnt/models/README.md
  INFO:root:Downloaded object flan-t5-small/flan-t5-small/config.json to /mnt/models/config.json
  INFO:root:Downloaded object flan-t5-small/flan-t5-small/flax_model.msgpack to /mnt/models/flax_model.msgpack
  INFO:root:Downloaded object flan-t5-small/flan-t5-small/generation_config.json to /mnt/models/generation_config.json
  INFO:root:Downloaded object flan-t5-small/flan-t5-small/model.safetensors to /mnt/models/model.safetensors
reason: OOMKilled
startedAt: "2024-03-01T16:54:02Z"
This InferenceService was configured to download a small model, not even a large one.
(link to the CI test where the failure happened)
This error can easily go unnoticed, as the restart count of the init container isn't displayed by the usual oc commands.
Here is another example, where the automated test failed because it detected that the initContainer had been restarted (once), but on the second try the download succeeded:
- containerID: cri-o://4558b4dd710052ee24f7a1dc94ce5f5f6de049eefbe9d0ba349ec5a3e00dc79d
  image: quay.io/modh/kserve-storage-initializer@sha256:330af2d517b17dbf0cab31beba13cdbe7d6f4b9457114dea8f8485a011e3b138
  imageID: quay.io/modh/kserve-storage-initializer@sha256:330af2d517b17dbf0cab31beba13cdbe7d6f4b9457114dea8f8485a011e3b138
  lastState: {}
  name: storage-initializer
  ready: true
  restartCount: 1
  state:
    terminated:
      containerID: cri-o://4558b4dd710052ee24f7a1dc94ce5f5f6de049eefbe9d0ba349ec5a3e00dc79d
      exitCode: 0
      finishedAt: "2024-03-01T21:17:46Z"
      reason: Completed
      startedAt: "2024-03-01T21:17:34Z"
(link to this CI run)
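Since the restart count only surfaces in the Pod's `status.initContainerStatuses` field, a test can detect this condition by parsing the Pod JSON directly. Below is a minimal sketch (not the actual CI detection code; the pod dict is a hypothetical example shaped like the status above):

```python
# Sketch: flag restarted or OOMKilled init containers from a Pod status
# dict (as returned by `oc get pod -o json`). Hypothetical example data.
def flag_init_container_restarts(pod):
    """Yield (name, restartCount, lastTerminationReason) for suspicious init containers."""
    for status in pod.get("status", {}).get("initContainerStatuses", []):
        restarts = status.get("restartCount", 0)
        last = status.get("lastState", {}).get("terminated", {})
        if restarts > 0 or last.get("reason") == "OOMKilled":
            yield status["name"], restarts, last.get("reason")

# Hypothetical pod mirroring the status block above: the container
# eventually Completed, but restartCount reveals the earlier failure.
pod = {
    "status": {
        "initContainerStatuses": [
            {
                "name": "storage-initializer",
                "restartCount": 1,
                "lastState": {},
                "state": {"terminated": {"exitCode": 0, "reason": "Completed"}},
            }
        ]
    }
}

print(list(flag_init_container_restarts(pod)))
# → [('storage-initializer', 1, None)]
```

Note that `lastState` can be empty (`{}`) even when `restartCount` is nonzero, so the check must look at the counter, not only at the last termination reason.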
The issue with this behavior is that the model gets downloaded from S3 multiple times, potentially continuously if the download always triggers an OOM error.
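A possible mitigation, sketched below under the assumption that RHOAI follows the upstream KServe convention of configuring storage-initializer resources through the `storageInitializer` entry of the `inferenceservice-config` ConfigMap (the exact ConfigMap name, namespace, and field values should be verified against the deployed RHOAI version), is to raise the init container's memory limit so the download no longer hits the OOM threshold:

```yaml
# Sketch only: storageInitializer entry of KServe's inferenceservice-config
# ConfigMap. Field names follow upstream KServe defaults; the 2Gi value is
# an illustrative assumption, not a validated recommendation.
apiVersion: v1
kind: ConfigMap
metadata:
  name: inferenceservice-config
data:
  storageInitializer: |-
    {
      "memoryRequest": "100Mi",
      "memoryLimit": "2Gi",
      "cpuRequest": "100m",
      "cpuLimit": "1"
    }
```

Even with a higher limit, the detection above remains useful: a silent restart loop of the storage-initializer would otherwise only show up as repeated S3 traffic.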