
Type: Task
Resolution: Unresolved
Priority: Undefined
Fix Version/s: MCE 2.6.0
Affects Version/s: ACM 2.11.0
Component/s: Documentation, HyperShift
Labels:
- doc-ack
- hypershift-docs

Blocked:
False
Blocked Reason:
None
Ready:
False
Regression:
No
Intelligence Requested:
Market:

SFDC Cases Links:
SFDC Cases Counter:
SFDC Cases Open:

Create an informative issue (See each section, incomplete templates/issues won't be triaged)

Using the current documentation as a model, please complete the issue template.

Note: Doc team updates the current version and the two previous versions (n-2). For earlier versions, we will address only high-priority, customer-reported issues for releases in support.

Prerequisite: Start with what we have

Always look at the current documentation to describe the change that is needed. Use the source or portal link for Step 4:

- Use the Customer Portal: https://access.redhat.com/documentation/en-us/red_hat_advanced_cluster_management_for_kubernetes

- Use the GitHub link to find the staged docs in the repository: https://github.com/stolostron/rhacm-docs

Describe the changes in the doc and link to your dev story

Provide info for the following steps:

1. - [x] Mandatory Add the required version to the Fix version/s field.

2. - [x] Mandatory Choose the type of documentation change.

- [x] New topic in an existing section or new section: Perhaps in a new topic after this one https://access.redhat.com/documentation/en-us/red_hat_advanced_cluster_management_for_kubernetes/2.10/html/clusters/cluster_mce_overview#enable-node-auto-scaling-hosted-cluster

Or, if a section already describes machine health checks or managed cluster node recovery/replacement, add the topic there.

- [ ] Update to an existing topic

3. - [x] Mandatory for GA content:

- [x] Add steps and/or other important conceptual information here: (See content below)

- [x] Add Required access level for the user to complete the task here: Same as creating hosted control planes

- [x] Add verification at the end of the task, how does the user verify success (a command to run or a result to see?): (See content below)

- [x] Add link to dev story here: OCPSTRAT-1123 & MGMT-17492

—

Content

Introductory sub-section

Title: Auto-repair Bare Metal Managed Cluster Nodes

Hosted control planes on the Agent platform can use Machine Health Checks (link: https://access.redhat.com/documentation/en-us/openshift_container_platform/4.15/html/machine_management/deploying-machine-health-checks) to replace unhealthy managed cluster nodes.

With a Machine Health Check object in place, a managed cluster node is replaced automatically when it is considered unhealthy.

Enabling Machine Health Checks sub-section

Machine Health Checks can be created by editing the NodePool.

Steps to enable Machine Health Checks:

1. Ensure that `spec.nodeDrainTimeout` on your NodePool is greater than 0s. To verify, run the following command:

   oc get nodepool -n <hosted_cluster_namespace> <nodepool_name> -o yaml | grep nodeDrainTimeout

   Expected output (any value greater than 0s is acceptable):

   nodeDrainTimeout: 30s

2. If the value is not greater than 0s, run the following command, setting the timeout to a value greater than 0s:

   oc patch nodepool -n <hosted_cluster_namespace> <nodepool_name> -p '{"spec":{"nodeDrainTimeout": "30m"}}' --type=merge

3. Enable Machine Health Checks by setting `spec.management.autoRepair` in the NodePool to true:

   oc patch nodepool -n <hosted_cluster_namespace> <nodepool_name> -p '{"spec": {"management": {"autoRepair":true}}}' --type=merge

4. Verify by running the following command:

   oc get nodepool -n <hosted_cluster_namespace> <nodepool_name> -o yaml | grep autoRepair

   Expected output:

   autoRepair: true
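
For anyone scripting the enable steps, a minimal shell sketch follows. The function names (`is_positive_duration`, `drain_timeout_patch`, `autorepair_patch`, `enable_autorepair`) are hypothetical helpers, not part of `oc` or HyperShift, and the zero check is a simplification that treats any non-negative duration containing a digit from 1 to 9 as greater than 0s. The default timeout of 30m is taken from the example command in the steps and is only an illustration.

```shell
# Hypothetical helpers; names are illustrative and not part of oc or HyperShift.

# Succeeds when a duration string such as "30s" or "1m0s" is greater than zero.
# Simplification: any digit 1-9 in a non-negative value counts as positive.
is_positive_duration() {
  case "$1" in
    -*) return 1 ;;
    *[1-9]*) return 0 ;;
    *) return 1 ;;
  esac
}

# Prints the merge patch for spec.nodeDrainTimeout; the duration is a JSON string.
drain_timeout_patch() {
  printf '{"spec":{"nodeDrainTimeout":"%s"}}\n' "$1"
}

# Prints the merge patch for spec.management.autoRepair ("true" or "false").
autorepair_patch() {
  case "$1" in
    true|false) printf '{"spec":{"management":{"autoRepair":%s}}}\n' "$1" ;;
    *) echo "expected true or false, got: $1" >&2; return 1 ;;
  esac
}

# Runs the enable steps against a NodePool. Needs a logged-in oc session.
# Usage: enable_autorepair <hosted_cluster_namespace> <nodepool_name> [timeout]
enable_autorepair() {
  current=$(oc get nodepool -n "$1" "$2" -o jsonpath='{.spec.nodeDrainTimeout}') || return 1
  # Only patch the drain timeout when it is unset or not greater than zero.
  if ! is_positive_duration "$current"; then
    oc patch nodepool -n "$1" "$2" --type=merge -p "$(drain_timeout_patch "${3:-30m}")" || return 1
  fi
  oc patch nodepool -n "$1" "$2" --type=merge -p "$(autorepair_patch true)"
}
```

Generating the patches with `printf` keeps the JSON on one line and keeps the duration quoted, which `--type=merge` needs in order to parse the payload.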

Additional notes:

- Ideally, additional host machines (Agents) are available and ready to be installed if the managed cluster nodes are unhealthy.
- The Machine Health Check object created through this process is not configurable and is set with the following specifications:
  - Does not replace nodes until there are at least 2 nodes that have been unhealthy for at least 8 minutes
  - A node is considered unhealthy when the spoke cluster Node condition shows:
    - Ready is "False" or "Unknown"
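
To see which nodes currently match the unhealthy definition above, something like the following shell sketch may help. Both function names are hypothetical, and the sketch assumes that `KUBECONFIG` points at the hosted (spoke) cluster. It only reports the Ready condition; it does not apply the 8-minute or 2-node thresholds.

```shell
# Hypothetical helper: reads "<node> <Ready status>" lines on stdin and
# prints the names of nodes whose Ready status is False or Unknown.
unhealthy_nodes() {
  awk '$2 == "False" || $2 == "Unknown" { print $1 }'
}

# Prints one "<node> <Ready status>" line per node in the current cluster.
# Assumes KUBECONFIG points at the hosted (spoke) cluster.
list_ready_status() {
  oc get nodes -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.status.conditions[?(@.type=="Ready")].status}{"\n"}{end}'
}

# Example: list_ready_status | unhealthy_nodes
```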

Disabling Machine Health Checks sub-section

Steps to disable Machine Health Checks:

1. Disable Machine Health Checks by setting `spec.management.autoRepair` in the NodePool to false:

   oc patch nodepool -n <hosted_cluster_namespace> <nodepool_name> -p '{"spec": {"management": {"autoRepair":false}}}' --type=merge

2. Verify by running the following command:

   oc get nodepool -n <hosted_cluster_namespace> <nodepool_name> -o yaml | grep autoRepair

   Expected output:

   autoRepair: false
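
The verification step can also be scripted so that it fails when the value is not the one you expect. The sketch below is one possible approach; `autorepair_is` is a hypothetical helper name, and it only inspects the YAML text passed on standard input.

```shell
# Hypothetical helper: succeeds when the NodePool YAML on stdin contains
# an "autoRepair:" line whose value equals $1 ("true" or "false").
autorepair_is() {
  grep -Eq "^[[:space:]]*autoRepair:[[:space:]]*$1[[:space:]]*\$"
}

# Example, after disabling:
#   oc get nodepool -n <hosted_cluster_namespace> <nodepool_name> -o yaml | autorepair_is false
```

The exit status is 0 on a match and non-zero otherwise, so the check can gate later steps in a script.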
