-
Bug
-
Resolution: Unresolved
-
Major
-
RHOAI_2.8.0, RHOAI_2.8.1, RHOAI_2.9.0
-
1
-
False
-
-
False
-
Release Notes
-
No
-
-
Known Issue
-
Done
-
No
-
-
-
RHOAI DSP Sprint 1, RHOAI DSP Sprint 2
-
Testable
Today, while working with rhn-gpte-dtorresf in a customer environment, we ran into the exact same symptoms as those described in this previous Jira: https://issues.redhat.com/browse/RHOAIENG-2099
In the customer's disconnected RHOAI 2.8.0 environment, the pipeline server would not get created.
We observed that the mariadb pod would get started ok, fairly fast too, but no other pipeline pod would get created.
When we looked at the events of the DSPA, we could see a successful Object Storage check, but a failed database check.
Now, we have seen this happen intermittently in the past (https://issues.redhat.com/browse/RHOAIENG-2099), but in those cases, 99% of the time, deleting and re-creating the pipeline server would fix everything on the second attempt.
Not in this scenario. Deleting and re-creating got us the exact same result.
The only thing that helped here was to disable the Health Check for the database.
(set spec.database.disableHealthCheck = true)
Doing so immediately fixed the issue, got the other pods to start and the DS pipeline server to work.
We tried deleting and re-creating 3 times in a row and the behavior was always the same.
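For reference, the workaround above can be written down as a JSON merge patch. This is only a sketch: the field path spec.database.disableHealthCheck is the one we set, while the function name is made up for illustration, and how the patch gets applied (for example through `oc patch` with `--type=merge`) is an assumption that depends on the DSPA resource name in the project.

```python
import json


def disable_health_check_patch() -> str:
    """Build a JSON merge patch that sets spec.database.disableHealthCheck to true.

    The function name is hypothetical; only the field path comes from this report.
    """
    patch = {"spec": {"database": {"disableHealthCheck": True}}}
    return json.dumps(patch)


if __name__ == "__main__":
    # Prints the patch body so it can be pasted into whatever tool applies it.
    print(disable_health_check_patch())
```

The printed string is the patch body only; the DSPA name and namespace still have to be supplied by whoever applies it.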
The customer will likely soon be opening a case about this, at which time we'll link to this Jira.
I'd like to understand a bit better the nature of the health check for the database, and what could be the cause of this.
So far, my only theory is that the health check is a network connection, initiated from another namespace, and that it can't reach the mariadb service inside the project, potentially because the customer has network policies that are more restrictive than what we usually see. (Just a theory; by all means, poke holes in it.)
If there is a manual equivalent to the health check, it would be worth testing it in the customer's environment. The MariaDB pod was fully running (although we never checked its logs).
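Until we know what the health check actually does, a rough manual stand-in could be a plain TCP reachability test against the mariadb service. The sketch below is an assumption, not the operator's real check: it only verifies that a TCP connection can be opened, it assumes the MariaDB default port 3306, and the host name in the usage line is a placeholder. To exercise the network policy theory, it would need to run from a pod in the namespace the check is suspected to originate from.

```python
import socket
import sys


def can_connect(host: str, port: int = 3306, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port opens within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


if __name__ == "__main__":
    # Placeholder host; pass the real service DNS name as the first argument.
    host = sys.argv[1] if len(sys.argv) > 1 else "mariadb.example.svc"
    print(f"{host}:3306 reachable: {can_connect(host)}")
```

A result of False from the other namespace, combined with True from inside the project, would be consistent with the network policy theory; True in both places would point elsewhere, for example at authentication or TLS.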
- split to
-
RHOAIENG-7124 Move db/obj store health checks to readiness check
- New
- links to
-
RHBA-2024:130576 RHOAI 2.10.0 - Red Hat OpenShift AI