Details

Type: Enhancement
Resolution: Done
Priority: Major
Fix Version/s: None
Affects Version/s: None
Component/s: Common, EAP7, EAP_CD
Labels:
- CD17
- dev-approved

Target Release: EAP CD 17.0.GA
Git Pull Request:
- https://github.com/jboss-container-images/jboss-eap-modules/pull/116
- https://github.com/jboss-container-images/jboss-eap-7-openshift-image/pull/246

Description

Per the k8s docs [1], the number of times a probe is retried before it is treated as failed can be configured in the probe config provided to k8s. In our case that is set in the livenessProbe/readinessProbe config section of the application template, which ultimately configures k8s to call our livenessProbe.sh and readinessProbe.sh.

Further, those docs indicate that by default a probe must complete within 1 sec (timeoutSeconds), otherwise it is considered failed. That timeout can be set to a higher value, but again the templates would need to set it.

Per the bug report at [2], it seems k8s is not properly enforcing the timeout. That could change at any time, so we should ensure our probes do not start failing if OpenShift moves to a k8s release in which this is fixed.

The retry and timeout issues are related because one reason our probes might take a long time to complete is that they currently attempt to do retries internally.

1) The scripts in the os-eap-probes module check for COUNT and SLEEP args to the script (which would be set in the application template livenessProbe/readinessProbe config section) and default to 30 and 5 respectively. That means that when an attempt fails, the internal retries push the probe's run time well past 1 sec, so once the issue at [2] is fixed k8s will time the probe out before the retries can complete and they will no longer be meaningful.
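
To make the scale of the problem concrete, here is a rough sketch of the time a probe could spend sleeping between internal retries. The function name is hypothetical, and it assumes the script sleeps once between consecutive attempts; the real loop in the os-eap-probes scripts may differ, and the time taken by each attempt is not counted.

```shell
#!/bin/sh
# Hypothetical helper, not part of the os-eap-probes module.
# Estimates seconds spent sleeping when every attempt fails, assuming
# one sleep between consecutive attempts (an assumption about the loop).
worst_case_sleep() {
    count=${1:-30}       # COUNT: first script arg, defaults to 30
    sleep_secs=${2:-5}   # SLEEP: second script arg, defaults to 5
    echo $(( (count - 1) * sleep_secs ))
}

echo "defaults: $(worst_case_sleep)s"    # COUNT=30, SLEEP=5
echo "COUNT=1:  $(worst_case_sleep 1)s"  # internal retry disabled
```

Under these assumptions the defaults leave the probe sleeping for minutes, far beyond the 1 sec default timeout, while COUNT=1 removes the retry sleeps entirely.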

So, templates should use periodSeconds and failureThreshold to configure retries, and should set the "COUNT" arg to the scripts (first arg) to 1, disabling internal retry.
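
As an illustration of what that could look like in a template, here is a minimal fragment. The script paths and the numeric values are assumptions chosen for the example, not copied from the actual templates; only the field names (periodSeconds, failureThreshold) and the COUNT=1 argument come from the discussion above.

```yaml
# Illustrative fragment only; paths and values are assumptions.
livenessProbe:
  exec:
    command:
      - /bin/bash
      - -c
      - /opt/eap/bin/livenessProbe.sh 1    # first arg is COUNT; 1 disables internal retry
  periodSeconds: 10       # how often k8s runs the probe
  failureThreshold: 3     # consecutive failures before k8s restarts the container
readinessProbe:
  exec:
    command:
      - /bin/bash
      - -c
      - /opt/eap/bin/readinessProbe.sh 1   # first arg is COUNT; 1 disables internal retry
  periodSeconds: 10       # how often k8s runs the probe
  failureThreshold: 3     # consecutive failures before the pod is marked not ready
```

With this shape, k8s owns the retry policy: each probe run makes a single attempt, and the product of periodSeconds and failureThreshold determines roughly how long a failing container is tolerated.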

At some point the default value of COUNT in the scripts could be changed to 1. Care needs to be taken with this though as that would change the behavior of images that don't include the updated k8s settings.

2) Also, livenessProbe.sh sleeps for 5 secs before beginning the probe.

# Sleep for 5 seconds to avoid launching readiness and liveness probes
# at the same time
sleep 5

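If launching both probes at the same time is still a concern, we need to find a different solution, since this sleep alone already exceeds the default 1 sec timeout.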

This will probably need subtasks or something, so different product teams can adjust their own templates.

[1] https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-probes/#configure-probes
[2] https://github.com/kubernetes/kubernetes/issues/26895

Issue Links

clones

CLOUD-3245 [7.2.x] Allow kubernetes to control probe retries; avoid probes taking longer than kubernetes timeout settings

is incorporated by

CLOUD-3308 EAP CD 17.0 Release

Closed

People

Assignee: Ken Wills

Reporter: Brian Stansberry

Votes: 0

Watchers: 2

Dates

Created: 2019/06/04 12:40 PM

Updated: 2024/02/08 3:05 PM

Resolved: 2019/06/04 10:54 PM