OpenShift Container Platform (OCP) Strategy / OCPSTRAT-692

Provide Manual Non-Disruptive Recovery Method for Etcd Minority Failure

      Background

      Etcd cluster failures generally fall into two categories: minority failure and majority failure. A majority failure, in which etcd quorum is lost, causes an API server outage and can be resolved by restoring the etcd cluster from a backed-up snapshot. A minority failure, in which quorum is maintained, should not require any disruption to API server requests to resolve, because the etcd cluster can still process reads and writes.
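
      The distinction between the two categories comes down to majority arithmetic: a cluster of n voting members needs floor(n/2) + 1 of them healthy to keep quorum. The following is a minimal sketch of that rule, not part of any OpenShift or etcd tooling; the function names `quorum_size` and `classify_failure` are hypothetical and chosen only for illustration.

```python
def quorum_size(member_count: int) -> int:
    """Smallest number of members that forms a strict majority."""
    if member_count < 1:
        raise ValueError("member_count must be at least 1")
    return member_count // 2 + 1


def classify_failure(member_count: int, failed_count: int) -> str:
    """Label a failure as 'none', 'minority', or 'majority'.

    'minority' means the healthy members still form a quorum;
    'majority' means quorum is lost.
    """
    if not 0 <= failed_count <= member_count:
        raise ValueError("failed_count must be between 0 and member_count")
    if failed_count == 0:
        return "none"
    healthy = member_count - failed_count
    return "minority" if healthy >= quorum_size(member_count) else "majority"
```

      For a three-member cluster, one failed member is a minority failure and two failed members is a majority failure.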

      Feature Overview

      Provide non-disruptive recovery steps for etcd minority failure scenarios, enhancing the stability of our platform and preventing data loss and service disruptions.

      Goals

      1. Ensure that we have viable and safe manual recovery methods for etcd minority failure that do not require API server disruption (v4.14).
      2. Automate the etcd minority failure recovery methods (v4.15 onwards).

      Requirements

      1. The manual recovery must be possible when etcd is backed by local storage.
      2. The manual recovery must be non-disruptive, with API read/write operations continuing to work as long as the existing etcd cluster maintains quorum.
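
      Requirement 2 can be read as an invariant: quorum must hold after every membership change made during recovery. The sketch below is a deliberately simplified model under stated assumptions (a three-member cluster with one failed member, and hypothetical step names `remove_failed`, `add_unstarted`, and `start_new`); it is not the documented recovery procedure. It illustrates why general etcd guidance favors removing the failed member before adding its replacement. A real etcd cluster may reject a quorum-breaking change outright or use learner members, which this model ignores.

```python
def quorum_size(member_count: int) -> int:
    """Smallest number of members that forms a strict majority."""
    return member_count // 2 + 1


def quorum_trace(steps, members=3, healthy=2):
    """Apply membership steps in order.

    Returns a list of (step, members, healthy, has_quorum) tuples,
    one per step, so each intermediate state can be inspected.
    """
    trace = []
    for step in steps:
        if step == "remove_failed":
            members -= 1   # failed member leaves; healthy count is unchanged
        elif step == "add_unstarted":
            members += 1   # replacement is registered but not yet serving
        elif step == "start_new":
            healthy += 1   # replacement has caught up and is healthy
        else:
            raise ValueError(f"unknown step: {step}")
        trace.append((step, members, healthy, healthy >= quorum_size(members)))
    return trace


# Hypothetical orderings used only to compare intermediate states.
remove_first = quorum_trace(["remove_failed", "add_unstarted", "start_new"])
add_first = quorum_trace(["add_unstarted", "start_new", "remove_failed"])
```

      In this model, `remove_first` keeps quorum at every step, while `add_first` passes through a four-member state with only two healthy members, which is below the quorum of three.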

      Use Cases

      1. A single etcd instance, backed by local storage, loses access to its node and there's no available snapshot for recovery. In this case, recovery should be possible using the remaining etcd instances which still have quorum.
      2. A single etcd instance's PVC data might be lost, but a snapshot of that data exists. Recovery of the PVC from a snapshot and rescheduling of the etcd instance should be possible.
      3. A single etcd instance backed by local storage needs to be moved to another management node. This process should be possible by backing up the existing PVC to a snapshot on distributed storage and then restoring that data to local storage on another management node.
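
      The three use cases differ mainly in whether the member's data is lost, whether a snapshot exists, and whether the move is planned. A minimal sketch of that decision, assuming a hypothetical `MemberState` record and hypothetical path labels (none of these names come from OpenShift tooling):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemberState:
    data_lost: bool             # the member's local storage or PVC data is gone
    snapshot_available: bool    # a snapshot of the member's volume exists
    planned_move: bool = False  # operator wants to relocate a healthy member


def recovery_path(state: MemberState) -> str:
    """Map a single member's state to the use case that applies."""
    if state.planned_move:
        # Use case 3: back up the PVC to distributed storage, restore elsewhere.
        return "snapshot-then-restore-on-new-node"
    if state.data_lost and state.snapshot_available:
        # Use case 2: restore the PVC from its snapshot and reschedule.
        return "restore-pvc-from-snapshot"
    if state.data_lost:
        # Use case 1: no snapshot, rebuild from the members that hold quorum.
        return "recover-from-remaining-members"
    return "no-recovery-needed"
```

      All three paths assume the remaining etcd members still hold quorum; once quorum is lost the scenario is a majority failure, which is out of scope here.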

      Out of Scope

      1. Recovery steps for etcd majority failure scenarios.
      2. Recovery process for etcd instances not backed by local storage.

      Customer Considerations

      Given the central role that etcd plays in the operation of the clusters, disruptions can have significant impacts on customers. Ensuring a smooth recovery process will help minimize downtime and data loss.

      Documentation Considerations

      Documentation should be created detailing the manual process of non-disruptive recovery for etcd minority failure scenarios. It should cover the different use cases, potential challenges, and recovery steps.

            Antoni Segura Puimedon (asegurap1@redhat.com)
            Adel Zaalouk (azaalouk)
            Cesar Wong, David Vossel
            Jie Zhao
            Laura Hinson
            Dave Mulford