Red Hat OpenShift AI Engineering / RHOAIENG-4925

KServe may fail to reconcile sometimes

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Normal
    • RHOAI_2.8.0
    • kserve, Platform
    • Testable

      I observed KServe failing to reconcile after installing today's 2.8.1 test build (which is effectively the same as 2.8.0 in this case), with the following error:

      DataScienceCluster resource reconciled with component errors: 2 errors occurred:

      • context deadline exceeded
      • context deadline exceeded
      Conditions
      kserveReady   Unknown    Mar 27, 2024, 5:17 PM    ReconcileInit    Component is enabled
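
      For anyone triaging a similar report, the condition above can also be checked programmatically. The following is a minimal sketch, assuming the DataScienceCluster status exposes its conditions in the usual Kubernetes shape (`status.conditions` with `type`, `status`, `reason`, `message`). The `sample` dict is a hypothetical, trimmed-down stand-in for real cluster output, and the `exampleReady` condition in it is invented purely for contrast.

```python
from typing import Dict, List


def not_ready_conditions(dsc: Dict) -> List[Dict]:
    """Return the status conditions whose status is anything other than "True"."""
    conditions = dsc.get("status", {}).get("conditions", [])
    return [c for c in conditions if c.get("status") != "True"]


# Hypothetical sample: only the kserveReady entry mirrors the report above.
sample = {
    "status": {
        "conditions": [
            {
                "type": "kserveReady",
                "status": "Unknown",
                "reason": "ReconcileInit",
                "message": "Component is enabled",
            },
            # Invented condition, present only to show that "True" is filtered out.
            {"type": "exampleReady", "status": "True"},
        ]
    }
}

for cond in not_ready_conditions(sample):
    print(cond["type"], cond["status"], cond.get("reason"), cond.get("message"))
```

      With the sample above, only the `kserveReady` entry is reported, because its status is `Unknown` rather than `True`.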
      
      rhods-operator log excerpt
      2024-03-27T16:16:14Z INFO features resource created {"feature": "serverless-serving-gateways", "namespace": "knative-serving", "resource": "operator.knative.dev/v1beta1, Kind=KnativeServing"}
      2024-03-27T16:16:15Z INFO features applying manifest {"feature": "serverless-serving-gateways", "feature": "serverless-serving-gateways", "name": "istio-ingress-gateway.tmpl", "path": "templates/serverless/serving-istio-gateways/istio-ingress-gateway.yaml"}
      2024-03-27T16:16:15Z INFO features Creating resource {"feature": "serverless-serving-gateways", "name": "knative-ingress-gateway"}
      2024-03-27T16:16:15Z INFO features Object already exists... {"feature": "serverless-serving-gateways"}
      2024-03-27T16:16:15Z INFO features applying manifest {"feature": "serverless-serving-gateways", "feature": "serverless-serving-gateways", "name": "istio-local-gateway.yaml", "path": "templates/serverless/serving-istio-gateways/istio-local-gateway.yaml"}
      2024-03-27T16:16:15Z INFO features Creating resource {"feature": "serverless-serving-gateways", "name": "knative-local-gateway"}
      2024-03-27T16:16:15Z INFO features Object already exists... {"feature": "serverless-serving-gateways"}
      2024-03-27T16:16:15Z INFO features applying manifest {"feature": "serverless-serving-gateways", "feature": "serverless-serving-gateways", "name": "local-gateway-svc.tmpl", "path": "templates/serverless/serving-istio-gateways/local-gateway-svc.yaml"}
      2024-03-27T16:16:15Z INFO features Creating resource {"feature": "serverless-serving-gateways", "name": "knative-local-gateway"}
      2024-03-27T16:16:15Z INFO features Object already exists... {"feature": "serverless-serving-gateways"}
      2024-03-27T16:16:15Z ERROR controllers.DataScienceCluster failed to reconcile kserve on DataScienceCluster {"instance.Name": "default-dsc", "error": "2 errors occurred:\n\t* context deadline exceeded\n\t* context deadline exceeded\n\n"}
      github.com/opendatahub-io/opendatahub-operator/v2/controllers/datasciencecluster.(*DataScienceClusterReconciler).reportError
      /workspace/controllers/datasciencecluster/datasciencecluster_controller.go:335
      github.com/opendatahub-io/opendatahub-operator/v2/controllers/datasciencecluster.(*DataScienceClusterReconciler).reconcileSubComponent
      /workspace/controllers/datasciencecluster/datasciencecluster_controller.go:299
      github.com/opendatahub-io/opendatahub-operator/v2/controllers/datasciencecluster.(*DataScienceClusterReconciler).Reconcile
      /workspace/controllers/datasciencecluster/datasciencecluster_controller.go:235
      sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile
      /remote-source/operator/deps/gomod/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.6/pkg/internal/controller/controller.go:122
      sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
      /remote-source/operator/deps/gomod/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.6/pkg/internal/controller/controller.go:323
      sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
      /remote-source/operator/deps/gomod/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.6/pkg/internal/controller/controller.go:274
      sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
      /remote-source/operator/deps/gomod/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.6/pkg/internal/controller/controller.go:235
      2024-03-27T16:16:15Z DEBUG events failed to reconcile kserve on DataScienceCluster for instance default-dsc {"type": "Warning", "object": {"kind":"DataScienceCluster","name":"default-dsc","uid":"8c143040-3601-4c4b-b200-618fd09c908f","apiVersion":"datasciencecluster.opendatahub.io/v1","resourceVersion":"6781547"}, "reason": "DataScienceClusterReconcileError"}
      Updating manifests : /opt/manifests/kueue/rhoai
      # Warning: 'vars' is deprecated. Please use 'replacements' instead. [EXPERIMENTAL] Run 'kustomize edit fix' to update your Kustomization automatically.
      Updating manifests : /opt/manifests/codeflare/default
      # Warning: 'bases' is deprecated. Please use 'resources' instead. Run 'kustomize edit fix' to update your Kustomization automatically.
      # Warning: 'vars' is deprecated. Please use 'replacements' instead. [EXPERIMENTAL] Run 'kustomize edit fix' to update your Kustomization automatically.
      2024/03/27 16:16:16 well-defined vars that were never replaced: namespace
      Updating manifests : /opt/manifests/ray/openshift
      # Warning: 'vars' is deprecated. Please use 'replacements' instead. [EXPERIMENTAL] Run 'kustomize edit fix' to update your Kustomization automatically.
      # Warning: 'bases' is deprecated. Please use 'resources' instead. Run 'kustomize edit fix' to update your Kustomization automatically.
      Updating manifests : /opt/manifests/trustyai-service-operator/base
      # Warning: 'vars' is deprecated. Please use 'replacements' instead. [EXPERIMENTAL] Run 'kustomize edit fix' to update your Kustomization automatically.
      2024/03/27 16:16:21 well-defined vars that were never replaced: oauthProxyImage,trustyaiServiceImage
      2024-03-27T16:16:21Z INFO controllers.DataScienceCluster DataScienceCluster Deployment Incomplete.
      2024-03-27T16:16:21Z ERROR Reconciler error {"controller": "datasciencecluster", "controllerGroup": "datasciencecluster.opendatahub.io", "controllerKind": "DataScienceCluster", "DataScienceCluster": {"name":"default-dsc"}, "namespace": "", "name": "default-dsc", "reconcileID": "29ab2c5d-904e-494d-b815-bb855e28eb76", "error": "2 errors occurred:\n\t* context deadline exceeded\n\t* context deadline exceeded\n\n"}
      sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
      /remote-source/operator/deps/gomod/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.6/pkg/internal/controller/controller.go:329
      sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
      /remote-source/operator/deps/gomod/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.6/pkg/internal/controller/controller.go:274
      sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
      /remote-source/operator/deps/gomod/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.6/pkg/internal/controller/controller.go:235
      2024-03-27T16:16:21Z DEBUG events DataScienceCluster instance default-dsc created, but have some failures in component 2 errors occurred:
      * context deadline exceeded
      * context deadline exceeded
      {"type": "Normal", "object": {"kind":"DataScienceCluster","name":"default-dsc","uid":"8c143040-3601-4c4b-b200-618fd09c908f","apiVersion":"datasciencecluster.opendatahub.io/v1","resourceVersion":"6791960"}, "reason": "DataScienceClusterComponentFailures"}
      2024-03-27T16:16:55Z INFO controllers.DataScienceCluster Reconciling DataScienceCluster resources {"Request.Name": "default-dsc"}
      Updating manifests : /opt/manifests/dashboard/crd
      Updating manifests : /opt/manifests/dashboard/overlays/rhoai
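
      To pull just the failures out of an excerpt like the one above, a small parser may help. This is a sketch that assumes every structured line follows the pattern seen in this log (an RFC 3339 timestamp, an upper-case level, free text, and an optional trailing JSON object). Stack-trace and kustomize lines do not match and are skipped. The function names are illustrative and are not part of the operator.

```python
import json
import re
from typing import Dict, Iterable, List, Optional

# Matches lines such as: 2024-03-27T16:16:15Z ERROR <text> {<json>}
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}Z)\s+(?P<level>[A-Z]+)\s+(?P<rest>.*)$"
)


def parse_line(line: str) -> Optional[Dict]:
    """Split one log line into timestamp, level, text, and JSON fields."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None  # stack frames, kustomize warnings, continuation lines
    rest = match.group("rest")
    fields: Dict = {}
    brace = rest.find("{")
    if brace != -1:
        try:
            fields = json.loads(rest[brace:])
            rest = rest[:brace].rstrip()
        except json.JSONDecodeError:
            pass  # the brace was part of the message, not a JSON payload
    return {
        "ts": match.group("ts"),
        "level": match.group("level"),
        "text": rest,
        "fields": fields,
    }


def error_entries(lines: Iterable[str]) -> List[Dict]:
    """Return the parsed entries whose level is ERROR."""
    parsed = (parse_line(line) for line in lines)
    return [entry for entry in parsed if entry and entry["level"] == "ERROR"]


# Two lines copied from the excerpt above.
sample_lines = [
    r'2024-03-27T16:16:15Z INFO features Object already exists... {"feature": "serverless-serving-gateways"}',
    r'2024-03-27T16:16:15Z ERROR controllers.DataScienceCluster failed to reconcile kserve on DataScienceCluster {"instance.Name": "default-dsc", "error": "2 errors occurred:\n\t* context deadline exceeded\n\t* context deadline exceeded\n\n"}',
]

for entry in error_entries(sample_lines):
    print(entry["ts"], entry["text"])
    print(entry["fields"].get("error", "").strip())
```

      The text between the level and the JSON payload is kept as one string, because some lines carry a logger name (`controllers.DataScienceCluster`) and others do not (`Reconciler error`).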
      

      I saw this error after a fresh install. Once I set kserve to Removed, waited until the operator had fully reconciled, and then set it back to Managed, the operator reconciled without errors. I haven't been able to reproduce the failure since.
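
      The Removed/Managed toggle described above could be scripted roughly as follows. This is a sketch, not a verified procedure: it only builds the JSON merge-patch bodies for `spec.components.kserve.managementState` and prints `oc patch` commands. The resource name `datasciencecluster` and the instance name `default-dsc` are assumptions taken from the log excerpt, and the helper names are hypothetical.

```python
import json
import shlex
from typing import List


def kserve_state_patch(state: str) -> str:
    """Build a JSON merge patch that sets only kserve's managementState."""
    if state not in ("Managed", "Removed"):
        raise ValueError("unexpected managementState: {!r}".format(state))
    patch = {"spec": {"components": {"kserve": {"managementState": state}}}}
    return json.dumps(patch)


def toggle_commands(dsc_name: str = "default-dsc") -> List[str]:
    """Return the two commands for the toggle: Removed first, then Managed."""
    return [
        "oc patch datasciencecluster {} --type=merge -p {}".format(
            shlex.quote(dsc_name), shlex.quote(kserve_state_patch(state))
        )
        for state in ("Removed", "Managed")
    ]


for command in toggle_commands():
    print(command)
```

      The second command should only be run after the operator has finished reconciling the first change. A merge patch is used so that the other `kserve` fields, such as the `serving` block, are left untouched.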

      DSC spec
      spec:
        components:
          codeflare:
            managementState: Removed
          kserve:
            managementState: Managed
            serving:
              ingressGateway:
                certificate:
                  type: SelfSigned
              managementState: Managed
              name: knative-serving
          trustyai:
            managementState: Removed
          ray:
            managementState: Removed
          kueue:
            managementState: Removed
          workbenches:
            managementState: Managed
          dashboard:
            managementState: Managed
          modelmeshserving:
            managementState: Managed
          datasciencepipelines:
            managementState: Managed
      
      DSCI spec
      spec:
        applicationsNamespace: redhat-ods-applications
        monitoring:
          managementState: Managed
          namespace: redhat-ods-monitoring
        serviceMesh:
          controlPlane:
            metricsCollection: Istio
            name: data-science-smcp
            namespace: istio-system
          managementState: Managed
        trustedCABundle:
          customCABundle: ''
          managementState: Managed
      

      I'm not sure under precisely which circumstances this happens, so I'm setting this to normal priority. Let's see whether I or somebody else hits it again.

            Assignee: Unassigned
            Reporter: Jan Stourac (jstourac@redhat.com)
            RHOAI Model Server and Serving Metrics