RHEL-23082

Avoid "shutdown" node attribute persisting after shutdown

    • Bug
    • Resolution: Unresolved
    • Critical
    • rhel-9.5
    • rhel-9.2.0, rhel-9.3.0
    • pacemaker
    • None
    • Major
    • sst_high_availability
    • ssg_filesystems_storage_and_HA
    • 17
    • 19
    • 8
    • Dev ack
    • False
    • Red Hat Enterprise Linux
    • Approved Blocker
    • Bug Fix
      Cause (the user action or circumstances that trigger the bug): Previously, when a node left the cluster, the Pacemaker controller would clear its transient node attributes from the CIB, while the attribute manager would clear them from its database.
      Consequence (what the user experience is when the bug occurs): Timing issues could cause a variety of problems when one component had finished clearing attributes but the other had not. Commonly, this would present as a node immediately leaving the cluster after rejoining, due to its shutdown attribute still being present from its last shutdown. Also, if the controller crashed on a node, that node's transient node attributes would be wrongly erased from the CIB.
      Fix (what has changed to fix the bug; do not include overly technical details): The attribute manager now handles clearing transient node attributes from both its database and the CIB.
      Result (what happens now that the patch is applied): These timing issues are no longer possible.
    • All
    • All
    • 2.1.7

      What were you trying to do that didn't work?

      I tried to remove the stonith devices and stop the cluster so that I could set up sbd.
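
      (Not reached because of this bug, but for context, the intended follow-up would have been roughly the sketch below. It is illustrative only; `pcs stonith sbd enable` accepts watchdog/device options that depend on the environment and pcs version, so none are assumed here.)

        # with the cluster stopped on all nodes, enable sbd, then start the cluster again
        pcs stonith sbd enable
        pcs cluster start --all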

      Please provide the package NVR for which the bug is seen:

      since pacemaker-2.1.6-7.el9.x86_64

      How reproducible:

      Sometimes, 50% chance

      Steps to reproduce

      1.  Set up a two-node cluster.
      2.  Check which node is the DC.
      3.  On the DC node, remove the stonith devices and stop the cluster (see the sketch after this list):

        pcs stonith delete fence-virt-252; pcs stonith delete fence-virt-253; pcs cluster stop --all
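
      A minimal script form of these steps, assuming the virt-252/virt-253 node names used in this report; the DC can be read from the "Current DC:" line of `pcs status`:

        # find out which node is currently the DC
        pcs status | grep "Current DC"

        # on that DC node, remove both stonith devices, then stop the whole cluster
        pcs stonith delete fence-virt-252
        pcs stonith delete fence-virt-253
        pcs cluster stop --all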

      Expected results

      Stonith devices are deleted and the cluster stops.

      Actual results

      The cluster gets stuck while stopping:

      [root@virt-253 ~]# pcs stonith delete fence-virt-252; pcs stonith delete fence-virt-253; pcs cluster stop --all
      Attempting to stop: fence-virt-252... Stopped
      Attempting to stop: fence-virt-253... Stopped
      virt-252: Stopping Cluster (pacemaker)...
      
      [root@virt-253 ~]# pcs status --full
      Cluster name: STSRHTS14392
      
      WARNINGS:
      No stonith devices and stonith-enabled is not false
      
      Cluster Summary:
        * Stack: corosync (Pacemaker daemons are shutting down)
        * Current DC: virt-253 (2) (version 2.1.6-9.el9-6fdc9deea29) - MIXED-VERSION partition with quorum
        * Last updated: Fri Oct 13 13:16:22 2023 on virt-253
        * Last change:  Fri Oct 13 13:15:18 2023 by root via cibadmin on virt-252
        * 2 nodes configured
        * 0 resource instances configured
      
      Node List:
        * Node virt-252 (1): pending, feature set <3.15.1
        * Node virt-253 (2): online, feature set 3.17.4
      
      Full List of Resources:
        * No resources
      
      Migration Summary:
      
      Tickets:
      
      PCSD Status:
        virt-252: Online
        virt-253: Online
      
      Daemon Status:
        corosync: active/enabled
        pacemaker: inactive/enabled
        pcsd: active/enabled
      
      

      After about 15 minutes of being stuck (the `cluster-recheck-interval`, I assume), the cluster finally stops.
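
      A couple of diagnostic checks that may help confirm this, sketched with standard Pacemaker command-line tools and the node names from this report:

        # query the cluster-recheck-interval property (Pacemaker's default is 15 minutes)
        crm_attribute --type crm_config --name cluster-recheck-interval --query

        # check whether a stale transient "shutdown" attribute is still recorded for virt-252
        attrd_updater --query --name shutdown --node virt-252

        # the same attribute would also show up in the CIB status section
        cibadmin --query --scope status | grep 'name="shutdown"'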

      I created a crm_report from the incident and attached it. The cluster got stuck on the stop action around Oct 13 13:15.

      [^cluster-froze-when-stop.tar.bz2]
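
      For reference, a report covering that window could be regenerated with something like the following (the output name is just an example):

        crm_report --from "2023-10-13 13:00:00" --to "2023-10-13 13:30:00" cluster-froze-when-stop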

            kgaillot@redhat.com Kenneth Gaillot
            rhn-support-msmazova Marketa Smazova