Bug
Resolution: Unresolved
Normal
RHOAI_2.8.0
Testable
I observed KServe fail to reconcile after installing today's 2.8.1 test build (which is effectively the same as 2.8.0 in this case), with the following error:
DataScienceCluster resource reconciled with component errors: 2 errors occurred:
- context deadline exceeded
- context deadline exceeded
Conditions
kserveReady Unknown Mar 27, 2024, 5:17 PM ReconcileInit Component is enabled
rhods-operator log excerpt
2024-03-27T16:16:14Z INFO features resource created {"feature": "serverless-serving-gateways", "namespace": "knative-serving", "resource": "operator.knative.dev/v1beta1, Kind=KnativeServing"}
2024-03-27T16:16:15Z INFO features applying manifest {"feature": "serverless-serving-gateways", "feature": "serverless-serving-gateways", "name": "istio-ingress-gateway.tmpl", "path": "templates/serverless/serving-istio-gateways/istio-ingress-gateway.yaml"}
2024-03-27T16:16:15Z INFO features Creating resource {"feature": "serverless-serving-gateways", "name": "knative-ingress-gateway"}
2024-03-27T16:16:15Z INFO features Object already exists... {"feature": "serverless-serving-gateways"}
2024-03-27T16:16:15Z INFO features applying manifest {"feature": "serverless-serving-gateways", "feature": "serverless-serving-gateways", "name": "istio-local-gateway.yaml", "path": "templates/serverless/serving-istio-gateways/istio-local-gateway.yaml"}
2024-03-27T16:16:15Z INFO features Creating resource {"feature": "serverless-serving-gateways", "name": "knative-local-gateway"}
2024-03-27T16:16:15Z INFO features Object already exists... {"feature": "serverless-serving-gateways"}
2024-03-27T16:16:15Z INFO features applying manifest {"feature": "serverless-serving-gateways", "feature": "serverless-serving-gateways", "name": "local-gateway-svc.tmpl", "path": "templates/serverless/serving-istio-gateways/local-gateway-svc.yaml"}
2024-03-27T16:16:15Z INFO features Creating resource {"feature": "serverless-serving-gateways", "name": "knative-local-gateway"}
2024-03-27T16:16:15Z INFO features Object already exists... {"feature": "serverless-serving-gateways"}
2024-03-27T16:16:15Z ERROR controllers.DataScienceCluster failed to reconcile kserve on DataScienceCluster {"instance.Name": "default-dsc", "error": "2 errors occurred:\n\t* context deadline exceeded\n\t* context deadline exceeded\n\n"}
github.com/opendatahub-io/opendatahub-operator/v2/controllers/datasciencecluster.(*DataScienceClusterReconciler).reportError
    /workspace/controllers/datasciencecluster/datasciencecluster_controller.go:335
github.com/opendatahub-io/opendatahub-operator/v2/controllers/datasciencecluster.(*DataScienceClusterReconciler).reconcileSubComponent
    /workspace/controllers/datasciencecluster/datasciencecluster_controller.go:299
github.com/opendatahub-io/opendatahub-operator/v2/controllers/datasciencecluster.(*DataScienceClusterReconciler).Reconcile
    /workspace/controllers/datasciencecluster/datasciencecluster_controller.go:235
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile
    /remote-source/operator/deps/gomod/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.6/pkg/internal/controller/controller.go:122
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
    /remote-source/operator/deps/gomod/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.6/pkg/internal/controller/controller.go:323
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
    /remote-source/operator/deps/gomod/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.6/pkg/internal/controller/controller.go:274
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
    /remote-source/operator/deps/gomod/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.6/pkg/internal/controller/controller.go:235
2024-03-27T16:16:15Z DEBUG events failed to reconcile kserve on DataScienceCluster for instance default-dsc {"type": "Warning", "object": {"kind":"DataScienceCluster","name":"default-dsc","uid":"8c143040-3601-4c4b-b200-618fd09c908f","apiVersion":"datasciencecluster.opendatahub.io/v1","resourceVersion":"6781547"}, "reason": "DataScienceClusterReconcileError"}
Updating manifests : /opt/manifests/kueue/rhoai
# Warning: 'vars' is deprecated. Please use 'replacements' instead. [EXPERIMENTAL] Run 'kustomize edit fix' to update your Kustomization automatically.
Updating manifests : /opt/manifests/codeflare/default
# Warning: 'bases' is deprecated. Please use 'resources' instead. Run 'kustomize edit fix' to update your Kustomization automatically.
# Warning: 'vars' is deprecated. Please use 'replacements' instead. [EXPERIMENTAL] Run 'kustomize edit fix' to update your Kustomization automatically.
2024/03/27 16:16:16 well-defined vars that were never replaced: namespace
Updating manifests : /opt/manifests/ray/openshift
# Warning: 'vars' is deprecated. Please use 'replacements' instead. [EXPERIMENTAL] Run 'kustomize edit fix' to update your Kustomization automatically.
# Warning: 'bases' is deprecated. Please use 'resources' instead. Run 'kustomize edit fix' to update your Kustomization automatically.
Updating manifests : /opt/manifests/trustyai-service-operator/base
# Warning: 'vars' is deprecated. Please use 'replacements' instead. [EXPERIMENTAL] Run 'kustomize edit fix' to update your Kustomization automatically.
2024/03/27 16:16:21 well-defined vars that were never replaced: oauthProxyImage,trustyaiServiceImage
2024-03-27T16:16:21Z INFO controllers.DataScienceCluster DataScienceCluster Deployment Incomplete.
2024-03-27T16:16:21Z ERROR Reconciler error {"controller": "datasciencecluster", "controllerGroup": "datasciencecluster.opendatahub.io", "controllerKind": "DataScienceCluster", "DataScienceCluster": {"name":"default-dsc"}, "namespace": "", "name": "default-dsc", "reconcileID": "29ab2c5d-904e-494d-b815-bb855e28eb76", "error": "2 errors occurred:\n\t* context deadline exceeded\n\t* context deadline exceeded\n\n"}
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
    /remote-source/operator/deps/gomod/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.6/pkg/internal/controller/controller.go:329
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
    /remote-source/operator/deps/gomod/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.6/pkg/internal/controller/controller.go:274
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
    /remote-source/operator/deps/gomod/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.6/pkg/internal/controller/controller.go:235
2024-03-27T16:16:21Z DEBUG events DataScienceCluster instance default-dsc created, but have some failures in component 2 errors occurred: * context deadline exceeded * context deadline exceeded {"type": "Normal", "object": {"kind":"DataScienceCluster","name":"default-dsc","uid":"8c143040-3601-4c4b-b200-618fd09c908f","apiVersion":"datasciencecluster.opendatahub.io/v1","resourceVersion":"6791960"}, "reason": "DataScienceClusterComponentFailures"}
2024-03-27T16:16:55Z INFO controllers.DataScienceCluster Reconciling DataScienceCluster resources {"Request.Name": "default-dsc"}
Updating manifests : /opt/manifests/dashboard/crd
Updating manifests : /opt/manifests/dashboard/overlays/rhoai
I saw this error after a fresh install. Once I set kserve to Removed, waited until the operator had fully reconciled, and then set it back to Managed, the operator reconciled without errors. I haven't been able to reproduce the failure since.
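For anyone who hits the same state, the Removed/Managed toggle described above could be scripted roughly as follows. This is only a sketch, not an official procedure: it assumes the `oc` CLI is logged in to the cluster and that the DataScienceCluster is named `default-dsc` as in the log excerpt. The helper names `kserve_patch` and `patch_command` are made up for illustration. The script only prints the two commands; run the first, wait for the operator to finish reconciling, then run the second.

```python
import json
import shlex

# Assumption: the DSC instance name, taken from the operator log in this report.
DSC_NAME = "default-dsc"


def kserve_patch(state: str) -> str:
    """Return a JSON merge-patch body setting spec.components.kserve.managementState."""
    if state not in ("Managed", "Removed"):
        raise ValueError(f"unsupported managementState: {state!r}")
    return json.dumps(
        {"spec": {"components": {"kserve": {"managementState": state}}}}
    )


def patch_command(state: str, name: str = DSC_NAME) -> list[str]:
    """Build the `oc patch` argument list for the given state; nothing is executed."""
    return [
        "oc", "patch", "datasciencecluster", name,
        "--type=merge", "-p", kserve_patch(state),
    ]


if __name__ == "__main__":
    # Dry run: print the commands in the order they should be applied.
    # Wait for the operator to reconcile fully between the two.
    for state in ("Removed", "Managed"):
        print(" ".join(shlex.quote(arg) for arg in patch_command(state)))
```

Printing rather than executing keeps the wait between the two patches a manual step, since this report does not establish how long the operator needs to settle.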
DSC spec
spec:
  components:
    codeflare:
      managementState: Removed
    kserve:
      managementState: Managed
      serving:
        ingressGateway:
          certificate:
            type: SelfSigned
        managementState: Managed
        name: knative-serving
    trustyai:
      managementState: Removed
    ray:
      managementState: Removed
    kueue:
      managementState: Removed
    workbenches:
      managementState: Managed
    dashboard:
      managementState: Managed
    modelmeshserving:
      managementState: Managed
    datasciencepipelines:
      managementState: Managed
DSCI spec
spec:
  applicationsNamespace: redhat-ods-applications
  monitoring:
    managementState: Managed
    namespace: redhat-ods-monitoring
  serviceMesh:
    controlPlane:
      metricsCollection: Istio
      name: data-science-smcp
      namespace: istio-system
    managementState: Managed
  trustedCABundle:
    customCABundle: ''
    managementState: Managed
I'm not sure under exactly which circumstances this happens, so I'm setting this to normal priority. Let's see whether I or somebody else hits it again.