Status: Resolved
Affects Version/s: None
Fix Version/s: None
Per the k8s docs, the number of retries before a probe is treated as failed can be configured in the probe config provided to k8s. In our case that is set in the livenessProbe/readinessProbe config section of the application template, which ultimately configures k8s to call our livenessProbe.sh and readinessProbe.sh.
Further, those docs indicate that by default a probe must not take longer than 1 sec to execute, otherwise it is considered failed. That timeout can be raised (timeoutSeconds), but again the templates would need to set it.
Per the bug report at  it seems k8s is not properly enforcing the timeout, but that could change at any time, so we should work to ensure our probes do not start failing if OpenShift moves to a k8s release with this fixed.
The retry and timeout issues are related because one reason our probes might take a long time to complete is that they currently attempt to do retries internally.
1) The scripts in the os-eap-probes module check for COUNT and SLEEP args to the script (which would be set in the application template livenessProbe/readinessProbe config section) and default to 30 and 5 respectively. That means that in case of failure the internal retries run far longer than 1 sec (with the defaults, on the order of 30 x 5 = 150 secs), so once the issue at  is fixed the retries will no longer be meaningful.
So, templates should use periodSeconds and failureThreshold to configure retries, and should set the "COUNT" arg to the scripts (first arg) to 1, disabling internal retry.
At some point the default value of COUNT in the scripts could be changed to 1. Care needs to be taken with this though as that would change the behavior of images that don't include the updated k8s settings.
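As a rough sketch of the template change described above, the probe sections could look something like the following. The script path /opt/eap/bin/ and all numeric values are assumptions for illustration only, not taken from the actual templates; periodSeconds, timeoutSeconds and failureThreshold are the standard k8s probe fields.

```yaml
# Sketch only: script path and numeric values are assumptions, adjust per product template.
livenessProbe:
  exec:
    command:
      - /bin/bash
      - -c
      # First arg is COUNT; 1 disables the script's internal retry.
      - /opt/eap/bin/livenessProbe.sh 1
  periodSeconds: 10      # k8s runs the probe every 10 secs (k8s default)
  failureThreshold: 3    # k8s treats the probe as failed after 3 consecutive failures (k8s default)
  timeoutSeconds: 10     # must exceed the script's run time; the k8s default is 1
readinessProbe:
  exec:
    command:
      - /bin/bash
      - -c
      - /opt/eap/bin/readinessProbe.sh 1
  periodSeconds: 10
  failureThreshold: 3
  timeoutSeconds: 5
```

With settings like these, retrying is handled entirely by k8s (periodSeconds x failureThreshold), and each individual script run stays short enough to fit inside timeoutSeconds once the timeout is actually enforced.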
2) Also, livenessProbe.sh sleeps for 5 secs before beginning the probe, which on its own exceeds the default 1 sec timeout.
If whatever that sleep was meant to address is still a concern, we need to find a different solution.
This will probably need subtasks or something, so different product teams can adjust their own templates.