  1. Red Hat Advanced Cluster Management
  2. ACM-11595

[Doc] Hosted Control Planes (Agent) Auto-repair Bare Metal Managed Cluster Nodes



      Create an informative issue (See each section, incomplete templates/issues won't be triaged)

      Using the current documentation as a model, please complete the issue template. 

      Note: Doc team updates the current version and the two previous versions (n-2). For earlier versions, we will address only high-priority, customer-reported issues for releases in support.

      Prerequisite: Start with what we have

      Always look at the current documentation to describe the change that is needed. Use the source or portal link for Step 4:

       - Use the Customer Portal: https://access.redhat.com/documentation/en-us/red_hat_advanced_cluster_management_for_kubernetes

       - Use the GitHub link to find the staged docs in the repository: https://github.com/stolostron/rhacm-docs 

      Describe the changes in the doc and link to your dev story

      Provide info for the following steps:

      1. - [x] Mandatory Add the required version to the Fix version/s field.

      2. - [x] Mandatory Choose the type of documentation change.

            - [x] New topic in an existing section or new section: Perhaps in a new topic after this one https://access.redhat.com/documentation/en-us/red_hat_advanced_cluster_management_for_kubernetes/2.10/html/clusters/cluster_mce_overview#enable-node-auto-scaling-hosted-cluster

      Alternatively, add it to an existing section if one already describes machine health checks or managed cluster node recovery/replacement.

            - [ ] Update to an existing topic

      3. - [x] Mandatory for GA content:
                  
             - [x] Add steps and/or other important conceptual information here: (See content below)

                  
             - [x] Add Required access level for the user to complete the task here: Same as creating hosted control planes 
             

             - [x] Add verification at the end of the task, how does the user verify success (a command to run or a result to see?): (See content below)
           
           
             - [x] Add link to dev story here: OCPSTRAT-1123 &  MGMT-17492 

      Content

      Introductory sub-section

      Title: Auto-repair Bare Metal Managed Cluster Nodes

      Hosted control planes on the Agent platform can use Machine Health Checks (link: https://access.redhat.com/documentation/en-us/openshift_container_platform/4.15/html/machine_management/deploying-machine-health-checks) to replace unhealthy managed cluster nodes.

      With a Machine Health Check object in place, a managed cluster node is replaced automatically when it is considered unhealthy.

      Enabling Machine Health Checks sub-section

      A Machine Health Check is created when you set `spec.management.autoRepair` to true on the NodePool.

      Steps to enable Machine Health Checks:

      1. Ensure that `spec.nodeDrainTimeout` on your NodePool is greater than 0s.
        1. To verify, run the following command:
          1. oc get nodepool -n <hosted_cluster_namespace> <nodepool_name> -o yaml | grep nodeDrainTimeout
        2. Expected output (example):
          1. nodeDrainTimeout: 30s
        3. If the value is not greater than 0s, run the following command with a duration greater than 0s, for example 30m:
          1. oc patch nodepool -n <hosted_cluster_namespace> <nodepool_name> -p '{"spec":{"nodeDrainTimeout": "30m"}}' --type=merge
      2. Enable Machine Health Checks by setting `spec.management.autoRepair` in the NodePool to true using the following command:
        1. oc patch nodepool -n <hosted_cluster_namespace> <nodepool_name> -p '{"spec": {"management": {"autoRepair":true}}}' --type=merge
        2. Verify by running the following command:
          1. oc get nodepool -n <hosted_cluster_namespace> <nodepool_name> -o yaml | grep autoRepair
        3. Expected output:
          1. autoRepair: true
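
The enabling steps above could be wrapped in a small script. The following is a minimal sketch, not an official procedure: the helper names (drain_timeout_patch, auto_repair_patch, needs_drain_timeout, enable_auto_repair) and the 30m default are assumptions made for illustration, and only the `oc get` and `oc patch` calls come from the steps above. It reads the fields with `-o jsonpath` instead of `grep`, and treats only an empty value, "0", or "0s" as "not greater than 0s", which is a simplification.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical, not part of oc or HyperShift.

# Print the merge-patch body for spec.nodeDrainTimeout.
# The duration is emitted as a quoted JSON string, for example "30m".
drain_timeout_patch() {
  printf '{"spec":{"nodeDrainTimeout":"%s"}}' "$1"
}

# Print the merge-patch body for spec.management.autoRepair (true or false).
auto_repair_patch() {
  printf '{"spec":{"management":{"autoRepair":%s}}}' "$1"
}

# Succeed when the current timeout is missing or zero and needs to be set.
# Simplification: only the empty string, "0", and "0s" count as zero.
needs_drain_timeout() {
  case "$1" in
    ''|0|0s) return 0 ;;
    *) return 1 ;;
  esac
}

# Usage: enable_auto_repair <hosted_cluster_namespace> <nodepool_name> [timeout]
enable_auto_repair() {
  current=$(oc get nodepool -n "$1" "$2" -o jsonpath='{.spec.nodeDrainTimeout}') || return 1
  if needs_drain_timeout "$current"; then
    oc patch nodepool -n "$1" "$2" --type=merge -p "$(drain_timeout_patch "${3:-30m}")" || return 1
  fi
  oc patch nodepool -n "$1" "$2" --type=merge -p "$(auto_repair_patch true)" || return 1
  # Print the resulting value; "true" is expected after the patch.
  oc get nodepool -n "$1" "$2" -o jsonpath='{.spec.management.autoRepair}'
}
```

For example, `enable_auto_repair <hosted_cluster_namespace> <nodepool_name>` would patch the drain timeout only when it is unset or zero, then enable auto-repair. Building the patch bodies in separate functions keeps the JSON quoting in one place, which matters because the duration must be a quoted string.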

      Additional notes:

      • Ideally, additional host machines (Agents) are available and ready to be installed in case managed cluster nodes become unhealthy.
      • The Machine Health Check object created through this process is not configurable. It is set with the following specifications:
        • Nodes are not replaced until at least 2 nodes have been unhealthy for at least 8 minutes.
        • A node is considered unhealthy when the spoke cluster Node condition shows:
          • Ready is False or Unknown
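
To see which nodes currently match the unhealthy definition above, the Ready condition can be read from the Node objects of the hosted (spoke) cluster. The following is a minimal sketch under stated assumptions: the helper names (filter_not_ready, list_not_ready_nodes) are hypothetical, and it assumes you have a kubeconfig file for the hosted cluster. It only lists nodes whose Ready condition is not True; it does not evaluate how long a node has been in that state.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical, not part of oc or HyperShift.

# Read "<node-name><TAB><Ready status>" lines on stdin and print the names of
# nodes whose Ready status is anything other than True (False, Unknown, or
# empty when the condition is missing).
filter_not_ready() {
  awk -F '\t' '$2 != "True" { print $1 }'
}

# Usage: list_not_ready_nodes <path_to_hosted_cluster_kubeconfig>
list_not_ready_nodes() {
  oc get nodes --kubeconfig "$1" \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.conditions[?(@.type=="Ready")].status}{"\n"}{end}' \
    | filter_not_ready
}
```

Splitting the filter from the `oc` call means the filter can be checked against sample input without access to a cluster.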

      Disabling Machine Health Checks sub-section

      Steps to disable Machine Health Checks:

      1. Disable Machine Health Checks by setting `spec.management.autoRepair` in the NodePool to false using the following command:
        1. oc patch nodepool -n <hosted_cluster_namespace> <nodepool_name> -p '{"spec": {"management": {"autoRepair":false}}}' --type=merge
        2. Verify by running the following command:
          1. oc get nodepool -n <hosted_cluster_namespace> <nodepool_name> -o yaml | grep autoRepair
        3. Expected output:
          1. autoRepair: false
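
The verification step can also be scripted so that it fails when the value is not the one you expect. The following is a minimal sketch: the helper names (auto_repair_matches, check_auto_repair) are hypothetical, and treating an empty (unset) field as false is an assumption rather than documented behavior.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical, not part of oc or HyperShift.

# Usage: auto_repair_matches <expected> <observed>
# Succeeds when the observed value equals the expected one.
# Assumption: an empty observed value (field not set) is treated as false.
auto_repair_matches() {
  actual="${2:-false}"
  [ "$actual" = "$1" ]
}

# Usage: check_auto_repair <hosted_cluster_namespace> <nodepool_name> <true|false>
check_auto_repair() {
  observed=$(oc get nodepool -n "$1" "$2" -o jsonpath='{.spec.management.autoRepair}') || return 1
  if auto_repair_matches "$3" "$observed"; then
    echo "autoRepair is $3"
  else
    echo "autoRepair is '${observed}', expected $3" >&2
    return 1
  fi
}
```

After disabling, `check_auto_repair <hosted_cluster_namespace> <nodepool_name> false` would return a non-zero exit status if auto-repair is still enabled.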

            sdudhgao@redhat.com Servesha Dudhgaonkar
            cchun@redhat.com Crystal Chun
            David Huynh David Huynh