RHEL-23082

Avoid "shutdown" node attribute persisting after shutdown

    • Bug
    • Resolution: Unresolved
    • Critical
    • rhel-9.5
    • rhel-9.2.0, rhel-9.3.0
    • pacemaker
    • None
    • Major
    • sst_high_availability
    • ssg_filesystems_storage_and_HA
    • 17
    • 19
    • 8
    • Dev ack
    • False
    • Red Hat Enterprise Linux
    • Approved Blocker
    • Bug Fix
      Cause (the user action or circumstances that trigger the bug): Previously, when a node left the cluster, the Pacemaker controller would clear its transient node attributes from the CIB, while the attribute manager would clear them from its database.
      Consequence (what the user experience is when the bug occurs): Timing issues could cause a variety of problems when one component had finished clearing attributes but the other had not. Commonly, this would present as a node immediately leaving the cluster after rejoining, due to its shutdown attribute still being present from its last shutdown. Also, if the controller crashed on a node, that node's transient node attributes would be wrongly erased from the CIB.
      Fix (what has changed to fix the bug; do not include overly technical details): The attribute manager now handles clearing transient node attributes from both its database and the CIB.
      Result (what happens now that the patch is applied): These timing issues are no longer possible.
    • All
    • All
    • 2.1.7

      What were you trying to do that didn't work?

      I tried to remove the stonith devices and stop the cluster so that I could set up sbd.
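
      (Not reached because of this bug, but for context, the intended follow-up would have been roughly the sketch below. It is illustrative only; `pcs stonith sbd enable` accepts watchdog/device options that depend on the environment and pcs version, so none are assumed here.)

        # with the cluster stopped on all nodes, enable sbd, then start the cluster again
        pcs stonith sbd enable
        pcs cluster start --all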

      Please provide the package NVR for which the bug is seen:

      since pacemaker-2.1.6-7.el9.x86_64

      How reproducible:

      Sometimes, 50% chance

      Steps to reproduce

      1.  Set up a two-node cluster.
      2.  Check which node is the DC.
      3.  On the DC node, remove the stonith devices and stop the cluster (see the sketch after this list):

        pcs stonith delete fence-virt-252; pcs stonith delete fence-virt-253; pcs cluster stop --all
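
      A minimal script form of these steps, assuming the virt-252/virt-253 node names used in this report; the DC can be read from the "Current DC:" line of `pcs status`:

        # find out which node is currently the DC
        pcs status | grep "Current DC"

        # on that DC node, remove both stonith devices, then stop the whole cluster
        pcs stonith delete fence-virt-252
        pcs stonith delete fence-virt-253
        pcs cluster stop --all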

      Expected results

      Stonith devices are deleted and the cluster stops.

      Actual results

      The cluster gets stuck while stopping:

      [root@virt-253 ~]# pcs stonith delete fence-virt-252; pcs stonith delete fence-virt-253; pcs cluster stop --all
      Attempting to stop: fence-virt-252... Stopped
      Attempting to stop: fence-virt-253... Stopped
      virt-252: Stopping Cluster (pacemaker)...
      
      [root@virt-253 ~]# pcs status --full
      Cluster name: STSRHTS14392
      
      WARNINGS:
      No stonith devices and stonith-enabled is not false
      
      Cluster Summary:
        * Stack: corosync (Pacemaker daemons are shutting down)
        * Current DC: virt-253 (2) (version 2.1.6-9.el9-6fdc9deea29) - MIXED-VERSION partition with quorum
        * Last updated: Fri Oct 13 13:16:22 2023 on virt-253
        * Last change:  Fri Oct 13 13:15:18 2023 by root via cibadmin on virt-252
        * 2 nodes configured
        * 0 resource instances configured
      
      Node List:
        * Node virt-252 (1): pending, feature set <3.15.1
        * Node virt-253 (2): online, feature set 3.17.4
      
      Full List of Resources:
        * No resources
      
      Migration Summary:
      
      Tickets:
      
      PCSD Status:
        virt-252: Online
        virt-253: Online
      
      Daemon Status:
        corosync: active/enabled
        pacemaker: inactive/enabled
        pcsd: active/enabled
      
      

      After about 15 minutes of being stuck (the `cluster-recheck-interval`, I assume), the cluster finally stops.
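
      A couple of diagnostic checks that may help confirm this, sketched with standard Pacemaker command-line tools and the node names from this report:

        # query the cluster-recheck-interval property (Pacemaker's default is 15 minutes)
        crm_attribute --type crm_config --name cluster-recheck-interval --query

        # check whether a stale transient "shutdown" attribute is still recorded for virt-252
        attrd_updater --query --name shutdown --node virt-252

        # the same attribute would also show up in the CIB status section
        cibadmin --query --scope status | grep 'name="shutdown"'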

      I created a crm_report from the incident and attached it. The cluster got stuck on the stop action around Oct 13 13:15.

      [^cluster-froze-when-stop.tar.bz2]
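
      For reference, a report covering that window could be regenerated with something like the following (the output name is just an example):

        crm_report --from "2023-10-13 13:00:00" --to "2023-10-13 13:30:00" cluster-froze-when-stop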

            kgaillot@redhat.com Kenneth Gaillot
            rhn-support-msmazova Marketa Smazova