Type: Bug
Resolution: Unresolved
Priority: Major
Fix Version/s: rhel-9.5
Affects Version/s: rhel-9.4
Component/s: pacemaker
Labels: None
Severity: Normal
Pool Team: sst_high_availability
Sub-System Group: ssg_filesystems_storage_and_HA
Dev Target Milestone: 17
Internal Target Milestone: 19
Story Points: 8
ACKs Check: Dev ack
Blocked: False
Blocked Reason: None
Products: Red Hat Enterprise Linux
Release Note Type: Bug Fix
Release Note Text:

Cause (the user action or circumstances that trigger the bug): An active Pacemaker Remote connection is eligible to move to a preferred node, but the connection fails when the cluster attempts a live migration (for example, because the preferred node is misconfigured).

Consequence (what the user experience is when the bug occurs): The Pacemaker Remote node is fenced and stays stopped.

Fix (what has changed to fix the bug; do not include overly technical details):

Result (what happens now that the patch is applied):


Experience:
Architecture: All
SFDC Cases Counter:
SFDC Cases Open:
SFDC Cases Links:

Description of problem: When a remote node connection resource with a non-zero reconnect-interval attribute is configured to prefer a particular cluster node, and that preferred node becomes unable to host the connection resource, the cluster attempts to migrate the connection resource back to its preferred node once the reconnect-interval expires. If the preferred node is still unable to start the connection resource, the migration operation fails and the remote node gets fenced.

If migration of a remote node fails, but the connection can be maintained at the old location, no fencing should be scheduled.

Moreover, the remote connection resource currently remains stopped after the migration failure until the next cluster-recheck-interval, which might be a scheduler bug. Once the failure is recorded, the next scheduler run should schedule all the actions that are needed.

Version-Release number of selected component (if applicable): pacemaker-2.0.3-4.el8

Steps to Reproduce:

1. Configure a 2-node cluster with a third Pacemaker Remote node.
2. Set the "reconnect-interval=120s" and "meta migration-threshold=1" attributes on the remote connection resource.
3. Make sure cluster-recheck-interval is much longer than the above reconnect-interval (e.g. 15min, the default).
4. Create a location constraint so that the remote connection resource prefers running on a particular cluster node.
5. Verify that the remote connection resource is started on its preferred node.
6. Block the network connection between the preferred node and the remote node (e.g. block the outgoing connection in the preferred node's firewall).
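
The steps above can be sketched as a dry run that only builds the commands and prints them, so they can be reviewed before anything touches a cluster. This is a rough illustration, not a verified reproducer. The resource and node names (remote1, node1), the address 192.0.2.10, and the location score of 100 are hypothetical placeholders. The agent parameter name is passed as an argument: the steps write it as reconnect-interval, whereas the ocf:pacemaker:remote agent is believed to spell it reconnect_interval, so check the installed agent's metadata first. Port 3121 is the default Pacemaker Remote TCP port.

```python
import shlex


def reproduction_commands(remote="remote1", preferred="node1",
                          remote_addr="192.0.2.10", score="100",
                          reconnect_param="reconnect_interval"):
    """Return the reproduction steps as shell command strings (dry run only).

    All default names, the address, and the score are hypothetical
    placeholders; nothing is executed here.
    """
    steps = [
        # Step 1: add the Pacemaker Remote node to an existing 2-node cluster.
        ["pcs", "cluster", "node", "add-remote", remote, remote_addr],
        # Step 2: set the reconnect interval and the migration threshold.
        ["pcs", "resource", "update", remote,
         f"{reconnect_param}=120s", "meta", "migration-threshold=1"],
        # Step 3: keep the recheck interval much longer than 120s.
        ["pcs", "property", "set", "cluster-recheck-interval=15min"],
        # Step 4: make the connection resource prefer one cluster node.
        ["pcs", "constraint", "location", remote, "prefers",
         f"{preferred}={score}"],
        # Step 5: check where the connection resource is running.
        ["pcs", "status", "resources"],
        # Step 6 (run on the preferred node): drop outgoing traffic to the
        # remote node's Pacemaker Remote port.
        ["iptables", "-A", "OUTPUT", "-d", remote_addr,
         "-p", "tcp", "--dport", "3121", "-j", "DROP"],
    ]
    return [shlex.join(step) for step in steps]


if __name__ == "__main__":
    for number, command in enumerate(reproduction_commands(), start=1):
        print(f"{number}. {command}")
```

Printing the commands rather than running them keeps the sketch safe to try on any machine. Step 6 is the only command meant for the preferred node rather than for any cluster node.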

Actual results: The remote connection resource first moves to the less-desirable cluster node. After the reconnect-interval (2m) it tries, unsuccessfully, to migrate back to its preferred cluster node. As a result, the remote node gets fenced and the remote connection resource remains stopped until the next cluster-recheck-interval fires.

Expected results: A migration operation failure should not be fatal. The remote node should never be fenced or marked unclean (i.e. it should keep providing uninterrupted service).

Links to: ClusterLabs T214

Assignee: Christopher Lumens
Reporter: Kenneth Gaillot
Developer: Kenneth Gaillot
QA Contact: Cluster QE
Votes: 0
Watchers: 3
Created: 2024/01/30 8:30 PM
Updated: 2024/05/21 2:29 PM
Dev Target end: 2024/06/24
Target end: 2024/07/08
