RHODS-6171

odh-model-controller OOM killed on sandbox cluster for large numbers of users and namespaces


    • Fix Version: 1.21.0-z
    • Sprint: ML Serving Sprint 1.22, ML Serving Sprint 1.23, ML Serving Sprint 1.24
    • Priority: High

      Description of problem:

      On clusters with very high resources, we observed that the odh-model-controller pod is being OOM killed.
      The performance test was the default toolchain-e2e setup test for the sandbox, which creates a large number of users and namespaces.
      https://github.com/codeready-toolchain/toolchain-e2e/tree/master/setup
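
      To run the test locally, the setup tool referenced above first has to be checked out (a minimal sketch; the exact branch and Go toolchain requirements of the repository are not covered here):

        # Fetch the toolchain-e2e repository that provides the setup tool used in the steps below
        git clone https://github.com/codeready-toolchain/toolchain-e2e.git
        cd toolchain-e2e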

      Prerequisites (if any, like setup, operators/versions):

      Steps to Reproduce

      1. Create a cluster with m5.12xlarge master nodes
      2. Install RHODS
      3. Run the tests:
        go run setup/main.go --users 2000 --default 2000 --custom 0 --username "user${RANDOM_NAME}" --workloads redhat-ods-operator:rhods-operator --workloads redhat-ods-applications:rhods-dashboard --workloads redhat-ods-applications:notebook-controller-deployment --workloads redhat-ods-applications:odh-notebook-controller-manager --workloads redhat-ods-applications:modelmesh-controller --workloads redhat-ods-applications:etcd --workloads redhat-ods-applications:odh-model-controller --workloads redhat-ods-monitoring:blackbox-exporter --workloads redhat-ods-monitoring:rhods-prometheus-operator --workloads redhat-ods-monitoring:prometheus
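
      While the test is running, the OOM kill can be confirmed from the controller pod's container status. A sketch follows; the app=odh-model-controller label selector is an assumption and may differ in the installed release:

        # Watch the controller pod restart as it gets OOM killed
        oc get pods -n redhat-ods-applications -l app=odh-model-controller -w

        # After a restart, the previous container state should report "OOMKilled"
        oc describe pod -n redhat-ods-applications -l app=odh-model-controller | grep -A 5 "Last State"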

      Actual results:

      Average odh-model-controller CPU Usage: 0.0012
      Max odh-model-controller CPU Usage: 0.0025
      Average odh-model-controller Memory Usage: 21.42 MB
      Max odh-model-controller Memory Usage: 46.12 MB

      Reproducibility (Always/Intermittent/Only Once):

      Always

      Build Details:

      RHODS 1.20.0-14 ('brew.registry.redhat.io/rh-osbs/iib:395124')

      Workaround:

      Additional info:

      Pod logs:

      I1214 09:11:46.994085 1 request.go:601] Waited for 1.035737371s due to client-side throttling, not priority and fairness, request: GET:https://172.30.0.1:443/apis/route.openshift.io/v1?timeout=32s
      1.6710091098061335e+09 INFO controller-runtime.metrics Metrics server is starting to listen {"addr": ":8080"}
      1.6710091098064e+09 INFO setup starting manager
      1.6710091098065906e+09 INFO Starting server {"path": "/metrics", "kind": "metrics", "addr": "[::]:8080"}
      1.6710091098065984e+09 INFO Starting server {"kind": "health probe", "addr": "[::]:8081"}
      I1214 09:11:49.806644 1 leaderelection.go:248] attempting to acquire leader lease redhat-ods-applications/odh-model-controller...
      I1214 09:12:06.712832 1 leaderelection.go:258] successfully acquired lease redhat-ods-applications/odh-model-controller
      1.6710091267129855e+09 INFO Starting EventSource {"controller": "inferenceservice", "controllerGroup": "serving.kserve.io", "controllerKind": "InferenceService", "source": "kind source: *v1beta1.InferenceService"}
      1.671009126713024e+09 INFO Starting EventSource {"controller": "inferenceservice", "controllerGroup": "serving.kserve.io", "controllerKind": "InferenceService", "source": "kind source: *v1alpha1.ServingRuntime"}
      1.67100912671303e+09 INFO Starting EventSource {"controller": "inferenceservice", "controllerGroup": "serving.kserve.io", "controllerKind": "InferenceService", "source": "kind source: *v1.Namespace"}
      1.6710091267129595e+09 DEBUG events Normal {"object": {"kind":"Lease","namespace":"redhat-ods-applications","name":"odh-model-controller","uid":"ceaa6dab-6c60-43f7-8850-a6c417dd7c4a","apiVersion":"coordination.k8s.io/v1","resourceVersion":"609836"}, "reason": "LeaderElection", "message": "odh-model-controller-5cc9dbb6cb-n2slb_3b64dfaa-8c41-4e37-9aba-41b5392b0f62 became leader"}
      1.671009126713036e+09 INFO Starting EventSource {"controller": "inferenceservice", "controllerGroup": "serving.kserve.io", "controllerKind": "InferenceService", "source": "kind source: *v1.Route"}
      1.6710091267130482e+09 INFO Starting EventSource {"controller": "inferenceservice", "controllerGroup": "serving.kserve.io", "controllerKind": "InferenceService", "source": "kind source: *v1.ServiceAccount"}
      1.6710091267130663e+09 INFO Starting EventSource {"controller": "inferenceservice", "controllerGroup": "serving.kserve.io", "controllerKind": "InferenceService", "source": "kind source: *v1.Service"}
      1.6710091267130752e+09 INFO Starting EventSource {"controller": "inferenceservice", "controllerGroup": "serving.kserve.io", "controllerKind": "InferenceService", "source": "kind source: *v1.Secret"}
      1.671009126713082e+09 INFO Starting EventSource {"controller": "inferenceservice", "controllerGroup": "serving.kserve.io", "controllerKind": "InferenceService", "source": "kind source: *v1.ClusterRoleBinding"}
      1.6710091267130685e+09 INFO Starting EventSource {"controller": "secret", "controllerGroup": "", "controllerKind": "Secret", "source": "kind source: *v1.Secret"}
      1.671009126713098e+09 INFO Starting Controller {"controller": "secret", "controllerGroup": "", "controllerKind": "Secret"}
      1.6710091267130897e+09 INFO Starting EventSource {"controller": "inferenceservice", "controllerGroup": "serving.kserve.io", "controllerKind": "InferenceService", "source": "kind source: *v1alpha1.ServingRuntime"}
      1.6710091267131064e+09 INFO Starting Controller {"controller": "inferenceservice", "controllerGroup": "serving.kserve.io", "controllerKind": "InferenceService"}
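
      The log shows the controller starting EventSources for cluster-scoped kinds (Namespace, Secret, Service, Route, ServiceAccount, ClusterRoleBinding), which typically means the controller-runtime cache holds every object of those kinds in the cluster. With ~2000 provisioned users, the number of cached objects can be estimated directly (a rough sketch using plain oc commands):

        # Count the objects the controller's informers must cache cluster-wide
        oc get namespaces --no-headers | wc -l
        oc get secrets --all-namespaces --no-headers | wc -l
        oc get services --all-namespaces --no-headers | wc -l
        oc get routes --all-namespaces --no-headers | wc -l
        oc get serviceaccounts --all-namespaces --no-headers | wc -l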
      

      Overall report:

      :popcorn: provisioning users...
        user signups (800/2000) [==========================>-----------------------------------------]  40%
         idler setup (798/2000) [=========================>------------------------------------------]  40%
      setup users with default template (186/2000) [====>---------------------------------------------------------------]   9%
      metrics error: metrics value could not be retrieved for query odh-model-controller CPU Usage
      Average Cluster CPU Utilisation: 4.82 %
      Max Cluster CPU Utilisation: 6.14 %
      Average Cluster Memory Utilisation: 6.49 %
      Max Cluster Memory Utilisation: 6.79 %
      Average Node Memory Usage: 9.42 %
      Max Node Memory Usage: 9.95 %
      Average etcd Instance Memory Usage: 852.13 MB
      Max etcd Instance Memory Usage: 1206.66 MB
      Average olm-operator CPU Usage: 0.1373
      Max olm-operator CPU Usage: 0.2340
      Average olm-operator Memory Usage: 499.24 MB
      Max olm-operator Memory Usage: 552.52 MB
      Average openshift-kube-apiserver Memory Usage: 20088.71 MB
      Max openshift-kube-apiserver Memory Usage: 21178.97 MB
      Average apiserver CPU Usage: 0.8006
      Max apiserver CPU Usage: 1.2799
      Average apiserver Memory Usage: 677.55 MB
      Max apiserver Memory Usage: 760.98 MB
      Average host-operator-controller-manager CPU Usage: 0.0310
      Max host-operator-controller-manager CPU Usage: 0.0442
      Average host-operator-controller-manager Memory Usage: 122.26 MB
      Max host-operator-controller-manager Memory Usage: 140.29 MB
      Average member-operator-controller-manager CPU Usage: 0.0571
      Max member-operator-controller-manager CPU Usage: 0.1064
      Average member-operator-controller-manager Memory Usage: 322.09 MB
      Max member-operator-controller-manager Memory Usage: 396.97 MB
      Average rhods-operator CPU Usage: 0.0669
      Max rhods-operator CPU Usage: 0.0950
      Average rhods-operator Memory Usage: 278.03 MB
      Max rhods-operator Memory Usage: 294.56 MB
      Average rhods-dashboard CPU Usage: 0.0033
      Max rhods-dashboard CPU Usage: 0.0041
      Average rhods-dashboard Memory Usage: 140.12 MB
      Max rhods-dashboard Memory Usage: 144.39 MB
      Average notebook-controller-deployment CPU Usage: 0.0023
      Max notebook-controller-deployment CPU Usage: 0.0030
      Average notebook-controller-deployment Memory Usage: 70.65 MB
      Max notebook-controller-deployment Memory Usage: 75.94 MB
      Average odh-notebook-controller-manager CPU Usage: 0.0048
      Max odh-notebook-controller-manager CPU Usage: 0.0069
      Average odh-notebook-controller-manager Memory Usage: 160.38 MB
      Max odh-notebook-controller-manager Memory Usage: 239.06 MB
      Average modelmesh-controller CPU Usage: 0.0024
      Max modelmesh-controller CPU Usage: 0.0029
      Average modelmesh-controller Memory Usage: 100.38 MB
      Max modelmesh-controller Memory Usage: 128.35 MB
      Average etcd CPU Usage: 0.0034
      Max etcd CPU Usage: 0.0038
      Average etcd Memory Usage: 23.05 MB
      Max etcd Memory Usage: 23.54 MB
      Average odh-model-controller CPU Usage: 0.0012
      Max odh-model-controller CPU Usage: 0.0025
      Average odh-model-controller Memory Usage: 21.42 MB
      Max odh-model-controller Memory Usage: 46.12 MB
      Average blackbox-exporter CPU Usage: 0.0029
      Max blackbox-exporter CPU Usage: 0.0039
      Average blackbox-exporter Memory Usage: 48.92 MB
      Max blackbox-exporter Memory Usage: 57.42 MB
      Average rhods-prometheus-operator CPU Usage: 0.0021
      Max rhods-prometheus-operator CPU Usage: 0.0029
      Average rhods-prometheus-operator Memory Usage: 39.61 MB
      Max rhods-prometheus-operator Memory Usage: 45.39 MB
      Average prometheus CPU Usage: 0.0095
      Max prometheus CPU Usage: 0.0110
      Average prometheus Memory Usage: 193.33 MB
      Max prometheus Memory Usage: 217.10 MB
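
      The "metrics error" line above indicates that the odh-model-controller metrics could not be scraped for part of the run, which is consistent with the pod being OOM killed and restarted. Container memory can also be sampled live while users are being provisioned (a sketch; resource metrics must be available on the cluster):

        # Sample per-container memory of the controller during the test
        oc adm top pod -n redhat-ods-applications --containers | grep odh-model-controller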
      

            aasthana@redhat.com Anish Asthana
            takumar@redhat.com Tarun Kumar
            Votes: 0
            Watchers: 5
