RHEL-23399

Remote connection doesn't restart after migration failure

    • Bug
    • Resolution: Unresolved
    • Major
    • rhel-9.5
    • rhel-9.4
    • pacemaker
    • None
    • Normal
    • sst_high_availability
    • ssg_filesystems_storage_and_HA
    • 17
    • 19
    • 8
    • Dev ack
    • False
    • Red Hat Enterprise Linux
    • Bug Fix
      Cause (the user action or circumstances that trigger the bug): An active Pacemaker Remote connection is eligible to move to a preferred node, but the connection fails when the cluster attempts the live migration (for example, because the preferred node is misconfigured).

      Consequence (what the user experience is when the bug occurs): The Pacemaker Remote node is fenced and stays stopped.

      Fix (what has changed to fix the bug; do not include overly technical details):

      Result (what happens now that the patch is applied):
    • All

      Description of problem: When a remote node connection resource with a non-zero reconnect-interval attribute is configured to prefer a certain cluster node and that node becomes unable to host the resource, the cluster attempts to migrate the connection resource back to its preferred node once the reconnect-interval expires. If the preferred node still cannot start the connection resource, the migration fails and the remote node is fenced.

      If migration of a remote node fails, but the connection can be maintained at the old location, no fencing should be scheduled. 
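      The requested rule can be written down as a small decision function. This is only an illustrative model of the behavior asked for in this report, not Pacemaker scheduler code; the names MigrationResult and expected_actions are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MigrationResult:
    """Hypothetical summary of one migration attempt of the remote connection."""
    migrate_failed: bool        # migration to the preferred node failed
    source_connection_ok: bool  # the old node can still reach the remote node


def expected_actions(result: MigrationResult) -> dict:
    """Return the outcome this report asks for (not current Pacemaker behavior)."""
    fence: Optional[bool]
    if not result.migrate_failed:
        fence, location = False, "preferred node"
    elif result.source_connection_ok:
        # Requested behavior: keep the working connection and do not fence.
        fence, location = False, "original node"
    else:
        # No node can reach the remote node. The report does not cover this
        # case, so the sketch leaves the fencing decision undefined (None).
        fence, location = None, "none"
    return {"fence_remote": fence, "connection": location}
```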

      Moreover, the remote connection resource currently remains stopped after the migration failure until the next cluster-recheck-interval, which might be a scheduler bug. Once the failure is recorded, the next scheduler run should schedule all appropriate actions.

      Version-Release number of selected component (if applicable): pacemaker-2.0.3-4.el8

      Steps to Reproduce:

      1. Configure a 2-node cluster with a third Pacemaker Remote node
      2. Set "reconnect-interval=120s" and "meta migration-threshold=1" attributes on the remote connection resource
      3. Make sure cluster-recheck-interval is much longer than the above reconnect-interval (e.g., 15min, the default)
      4. Create a location constraint for the remote connection resource to prefer running on a particular cluster node
      5. Verify the remote connection resource is started on its preferred node
      6. Block the network connection between the preferred node and the remote node (e.g., block the outgoing connection in the preferred node's firewall)
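
      Steps 2 through 6 might be scripted roughly as follows. This is a sketch under assumptions, not a verified reproducer: the resource name, node name, and IP address are placeholders, the attribute spelling "reconnect-interval" is copied from this report (the spelling the resource agent accepts can be checked with "pcs resource describe ocf:pacemaker:remote"), and the iptables rule is just one possible way to block traffic to the default Pacemaker Remote TCP port 3121. Step 1 (building the cluster and adding the remote node) is assumed to be done already. The function only assembles the commands; it does not run them.

```python
import shlex


def reproduction_commands(resource, preferred_node, remote_ip, port=3121):
    """Build argument lists for steps 2-6; all arguments are placeholders."""
    return [
        # Step 2: attribute spelling as written in the report (unverified).
        ["pcs", "resource", "update", resource,
         "reconnect-interval=120s", "meta", "migration-threshold=1"],
        # Step 3: keep the recheck interval much longer than 120s.
        ["pcs", "property", "set", "cluster-recheck-interval=15min"],
        # Step 4: location preference for one cluster node.
        ["pcs", "constraint", "location", resource, "prefers", preferred_node],
        # Step 5: inspect where the connection resource is running.
        ["pcs", "status"],
        # Step 6: run on the preferred node to drop outgoing traffic to the
        # remote node (assumes the default Pacemaker Remote port).
        ["iptables", "-A", "OUTPUT", "-d", remote_ip,
         "-p", "tcp", "--dport", str(port), "-j", "DROP"],
    ]


if __name__ == "__main__":
    for cmd in reproduction_commands("remote1", "node1", "192.0.2.10"):
        print(shlex.join(cmd))
```

      Printing the commands first makes it easy to review them before running anything on a live cluster.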

      Actual results: The remote connection resource first moves to the less-desirable cluster node. After the reconnect-interval (2m) it tries to migrate back to its preferred cluster node and fails. As a result, the remote node is fenced and the remote connection resource remains stopped until the next cluster-recheck-interval fires.
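
      The timing can be made concrete with a short calculation. The sketch below assumes, for illustration only, that the connection moves to the other node immediately after the network is blocked and that the recheck timer starts counting at the failed migration; the function name failure_timeline is hypothetical.

```python
def failure_timeline(reconnect_interval_s=120, recheck_interval_s=900):
    """Approximate event times (seconds after the network is blocked)."""
    t_block = 0
    # The cluster tries to migrate back once the reconnect-interval expires.
    t_migration_attempt = t_block + reconnect_interval_s
    # Assumption: the resource stays stopped until the recheck timer, counted
    # from the failed migration, triggers the next scheduler run.
    t_latest_restart = t_migration_attempt + recheck_interval_s
    return {
        "migration_attempt_s": t_migration_attempt,
        "latest_restart_s": t_latest_restart,
        "stopped_for_up_to_s": t_latest_restart - t_migration_attempt,
    }


if __name__ == "__main__":
    print(failure_timeline())
```

      With the values from the steps above (120s and the 15min default), the connection could stay stopped for up to about 900 seconds after the failed migration under these assumptions.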

      Expected results: A migration failure should not be fatal. The remote node should never be fenced or marked unclean (i.e., service should continue uninterrupted).

            rhn-support-clumens Christopher Lumens
            kgaillot@redhat.com Kenneth Gaillot
            Kenneth Gaillot
            Cluster QE