Type: Feature
Resolution: Done
Fix Version/s: None
Affects Version/s: None
Component/s: Hosted Control Planes
Labels:
- ga_readiness
- self-managed

Blocked: False
Blocked Reason: None
Ready: False
Hierarchy Progress Bar: 0% To Do, 0% In Progress, 100% Done
Target Version: openshift-4.14
RICE Score: 0
Risk Score: 0

Discussion Needed:

Program Call

SFDC Cases Links:
SFDC Cases Counter:
SFDC Cases Open:

PX Priority Data:
PX Review Complete:

Intelligence Requested:
Market:

Background

Etcd cluster failures generally fall into two categories, minority failure and majority failure. The latter, where etcd quorum is lost, results in an API server outage, and these scenarios can be resolved by restoring the etcd cluster from a backed-up snapshot. However, minority failures, where quorum is maintained, should not require disruption to API server requests to resolve because the etcd cluster can still process reads and writes.
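The distinction between the two categories comes down to simple majority arithmetic: a cluster of n members keeps quorum while at least floor(n/2) + 1 members are healthy. The following is a minimal sketch of that rule only; the function names `quorum_size` and `classify_failure` are illustrative and not part of etcd or HyperShift.

```python
def quorum_size(members: int) -> int:
    """Smallest number of healthy members that still forms a majority."""
    if members < 1:
        raise ValueError("a cluster needs at least one member")
    return members // 2 + 1


def classify_failure(members: int, failed: int) -> str:
    """Classify a failure as 'healthy', 'minority', or 'majority'.

    'minority' means quorum is kept, so reads and writes can continue.
    'majority' means quorum is lost and a snapshot restore is needed.
    """
    if not 0 <= failed <= members:
        raise ValueError("failed must be between 0 and members")
    if failed == 0:
        return "healthy"
    healthy = members - failed
    return "minority" if healthy >= quorum_size(members) else "majority"


if __name__ == "__main__":
    # A three-member cluster tolerates one failed member.
    for failed in range(4):
        print(failed, classify_failure(3, failed))
```

For a three-member cluster, one failed member is a minority failure and two or more is a majority failure.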

Feature Overview

Provide non-disruptive recovery steps for etcd minority failure scenarios, enhancing the stability of our platform and preventing data loss and service disruptions.

Goals

Ensure that we have viable and safe manual recovery methods for etcd minority failure that do not require API server disruption (v4.14).
Automate the etcd minority failure recovery methods (v4.15 onwards).

Requirements

The manual recovery must be possible when etcd is backed by local storage.
The manual recovery must be non-disruptive, with API read/write operations continuing to work as long as the existing etcd cluster maintains quorum.

Use Cases

A single etcd instance, backed by local storage, loses access to its node and there's no available snapshot for recovery. In this case, recovery should be possible using the remaining etcd instances which still have quorum.
A single etcd instance's PVC data might be lost, but a snapshot of that data exists. Recovery of the PVC from a snapshot and rescheduling of the etcd instance should be possible.
A single etcd instance backed by local storage needs to be moved to another management node. This process should be possible by backing up the existing PVC to a snapshot on distributed storage and then restoring that data to local storage on another management node.
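The three use cases above can be summarized as a lookup from scenario to a high-level recovery outline. This is only a sketch of the descriptions in this issue: the `Scenario` names, the `recovery_outline` helper, and the step wording are hypothetical and do not correspond to real commands or APIs. The quorum check reflects the requirement that the remaining members must still form a majority.

```python
from enum import Enum


class Scenario(Enum):
    # Hypothetical labels for the three use cases listed in this issue.
    NODE_LOST_NO_SNAPSHOT = "node_lost_no_snapshot"
    PVC_LOST_WITH_SNAPSHOT = "pvc_lost_with_snapshot"
    PLANNED_NODE_MOVE = "planned_node_move"


# High-level outlines paraphrased from the use cases; not actual commands.
RECOVERY_OUTLINE = {
    Scenario.NODE_LOST_NO_SNAPSHOT: [
        "recover the instance from the remaining etcd members that hold quorum",
    ],
    Scenario.PVC_LOST_WITH_SNAPSHOT: [
        "restore the PVC from the existing snapshot",
        "reschedule the etcd instance",
    ],
    Scenario.PLANNED_NODE_MOVE: [
        "back up the existing PVC to a snapshot on distributed storage",
        "restore that data to local storage on another management node",
    ],
}


def recovery_outline(scenario: Scenario, healthy: int, members: int) -> list[str]:
    """Return the outline for a scenario, refusing when quorum is lost."""
    if healthy < members // 2 + 1:
        # Majority failure is listed as out of scope for this feature.
        raise ValueError("quorum lost: majority failure is out of scope")
    return RECOVERY_OUTLINE[scenario]


if __name__ == "__main__":
    for step in recovery_outline(Scenario.PVC_LOST_WITH_SNAPSHOT, healthy=2, members=3):
        print("-", step)
```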

Out of Scope

Recovery steps for etcd majority failure scenarios.
Recovery process for etcd instances not backed by local storage.

Customer Considerations

Given the central role that etcd plays in the operation of the clusters, disruptions can have significant impacts on customers. Ensuring a smooth recovery process will help minimize downtime and data loss.

Documentation Considerations

Documentation should be created detailing the manual process of non-disruptive recovery for etcd minority failure scenarios. It should cover the different use cases, potential challenges, and recovery steps.

relates to

HOSTEDCP-1070 Control Plane Pod supported persistent storage backends

Closed

links to

Etcd minority failure recovery for self-managed HCP


Assignee: Antoni Segura Puimedon
Reporter: Adel Zaalouk
Contributors: Cesar Wong, David Vossel
QA Contact: Jie Zhao
Doc Contact: Laura Hinson
Product Experience Engineering Contact: Dave Mulford
Votes: 0
Watchers: 5
Created: 2023/07/07 10:43 PM
Updated: 2024/03/14 3:08 PM
Resolved: 2023/10/03 1:29 PM
