Loading...

Details

Type: Bug
Resolution: Done
Priority: Major
Fix Version/s: 9.1.5.Final, 9.2.0.CR1
Affects Version/s: 9.1.3.Final, 9.2.0.Beta1
Component/s: Core
Labels:
- testsuite_stability

Git Pull Request:
https://github.com/infinispan/infinispan/pull/5617, https://github.com/infinispan/infinispan/pull/5637

Description

After the coordinator changes, PreferAvailabilityStrategy first broadcasts a cache topology with the currentCH of the "maximum" topology. In the 2nd step it broadcasts a topology that removes all the topology members no longer in the cluster, and in the 3rd step it queues a rebalance with the remaining members.

If the cluster had only 2 nodes, A (the coordinator) and B, and B had not finished joining the cache, the maximum topology has A as the only member. That means step 2 tries to remove all members, and in the process removes the cache topology from ClusterCacheStatus. When step 3 tries to rebalance with B as the only member, it re-initializes ClusterCacheStatus with topology id 1, and because LocalTopologyManager already has a higher topology id it will never confirm the rebalance.

This sometimes happens in CacheManagerTest.testRestartReusingConfiguration. Like most other tests, it waits for the cache to finish joining before killing a node. But it only waits for the test cache, not for the CONFIG cache (which has awaitInitialTransfer(false)). Also, most of the time A either finishes the rebalance or re-initializes ClusterCacheStatus and sends a topology update with B as the only member before leaving. The test only fails if B doesn't receive or ignores one or more topology updates.

10:37:50,674 INFO  (remote-thread-Test-NodeA-p2265-t6:[]) [CLUSTER] ISPN000310: Starting cluster-wide rebalance for cache org.infinispan.CONFIG, topology CacheTopology{id=2, rebalanceId=2, currentCH=ReplicatedConsistentHash{ns = 256, owners = (1)[Test-NodeA-37820: 256]}, pendingCH=ReplicatedConsistentHash{ns = 256, owners = (2)[Test-NodeA-37820: 134, Test-NodeB-59687: 122]}, unionCH=null, phase=READ_OLD_WRITE_ALL, actualMembers=[Test-NodeA-37820, Test-NodeB-59687], persistentUUIDs=[d56ec014-ebb3-4be9-9ce2-91c2982ccb73, 96c95d15-440a-4dc7-915d-5d36ac4257bb]}
10:37:51,037 DEBUG (remote-thread-Test-NodeA-p2265-t6:[]) [ClusterTopologyManagerImpl] Updating cluster-wide current topology for cache org.infinispan.CONFIG, topology = CacheTopology{id=3, rebalanceId=2, currentCH=ReplicatedConsistentHash{ns = 256, owners = (1)[Test-NodeA-37820: 256]}, pendingCH=ReplicatedConsistentHash{ns = 256, owners = (2)[Test-NodeA-37820: 134, Test-NodeB-59687: 122]}, unionCH=null, phase=READ_ALL_WRITE_ALL, actualMembers=[Test-NodeA-37820, Test-NodeB-59687], persistentUUIDs=[d56ec014-ebb3-4be9-9ce2-91c2982ccb73, 96c95d15-440a-4dc7-915d-5d36ac4257bb]}, availability mode = AVAILABLE
10:37:51,097 DEBUG (remote-thread-Test-NodeA-p2265-t5:[]) [ClusterTopologyManagerImpl] Updating cluster-wide current topology for cache org.infinispan.CONFIG, topology = CacheTopology{id=4, rebalanceId=2, currentCH=ReplicatedConsistentHash{ns = 256, owners = (1)[Test-NodeA-37820: 256]}, pendingCH=ReplicatedConsistentHash{ns = 256, owners = (2)[Test-NodeA-37820: 134, Test-NodeB-59687: 122]}, unionCH=null, phase=READ_NEW_WRITE_ALL, actualMembers=[Test-NodeA-37820, Test-NodeB-59687], persistentUUIDs=[d56ec014-ebb3-4be9-9ce2-91c2982ccb73, 96c95d15-440a-4dc7-915d-5d36ac4257bb]}, availability mode = AVAILABLE
10:37:51,203 DEBUG (testng-Test:[]) [ClusterTopologyManagerImpl] Updating cluster-wide current topology for cache org.infinispan.CONFIG, topology = CacheTopology{id=5, rebalanceId=2, currentCH=ReplicatedConsistentHash{ns = 256, owners = (1)[Test-NodeB-59687: 256]}, pendingCH=null, unionCH=null, phase=NO_REBALANCE, actualMembers=[Test-NodeB-59687], persistentUUIDs=[96c95d15-440a-4dc7-915d-5d36ac4257bb]}, availability mode = AVAILABLE
10:37:51,207 INFO  (jgroups-7,Test-NodeB-59687:[]) [CLUSTER] ISPN000094: Received new cluster view for channel ISPN: [Test-NodeB-59687|2] (1) [Test-NodeB-59687]
*** Here topology updates are ignored
10:37:51,340 DEBUG (transport-thread-Test-NodeB-p2311-t5:[Topology-org.infinispan.CONFIG]) [LocalTopologyManagerImpl] Ignoring topology 4 for cache org.infinispan.CONFIG from old coordinator Test-NodeA-37820
10:37:51,340 DEBUG (transport-thread-Test-NodeB-p2311-t5:[Topology-org.infinispan.CONFIG]) [LocalTopologyManagerImpl] Ignoring topology 5 for cache org.infinispan.CONFIG from old coordinator Test-NodeA-37820
10:37:51,340 DEBUG (transport-thread-Test-NodeB-p2311-t6:[Merge-2]) [ClusterCacheStatus] Recovered 1 partition(s) for cache org.infinispan.CONFIG: [CacheTopology{id=3, rebalanceId=2, currentCH=ReplicatedConsistentHash{ns = 256, owners = (1)[Test-NodeA-37820: 256]}, pendingCH=ReplicatedConsistentHash{ns = 256, owners = (2)[Test-NodeA-37820: 134, Test-NodeB-59687: 122]}, unionCH=null, phase=READ_ALL_WRITE_ALL, actualMembers=[Test-NodeA-37820, Test-NodeB-59687], persistentUUIDs=[d56ec014-ebb3-4be9-9ce2-91c2982ccb73, 96c95d15-440a-4dc7-915d-5d36ac4257bb]}]
10:37:51,340 DEBUG (transport-thread-Test-NodeB-p2311-t6:[Merge-2]) [ClusterCacheStatus] Updating topologies after merge for cache org.infinispan.CONFIG, current topology = CacheTopology{id=4, rebalanceId=3, currentCH=ReplicatedConsistentHash{ns = 256, owners = (1)[Test-NodeA-37820: 256]}, pendingCH=null, unionCH=null, phase=NO_REBALANCE, actualMembers=[Test-NodeA-37820, Test-NodeB-59687], persistentUUIDs=[d56ec014-ebb3-4be9-9ce2-91c2982ccb73, 96c95d15-440a-4dc7-915d-5d36ac4257bb]}, stable topology = CacheTopology{id=1, rebalanceId=1, currentCH=ReplicatedConsistentHash{ns = 256, owners = (1)[Test-NodeA-37820: 256]}, pendingCH=null, unionCH=null, phase=NO_REBALANCE, actualMembers=[Test-NodeA-37820], persistentUUIDs=[d56ec014-ebb3-4be9-9ce2-91c2982ccb73]}, availability mode = null, resolveConflicts = false
10:37:51,340 DEBUG (transport-thread-Test-NodeB-p2311-t6:[Merge-2]) [ClusterTopologyManagerImpl] Updating cluster-wide current topology for cache org.infinispan.CONFIG, topology = CacheTopology{id=4, rebalanceId=3, currentCH=ReplicatedConsistentHash{ns = 256, owners = (1)[Test-NodeA-37820: 256]}, pendingCH=null, unionCH=null, phase=NO_REBALANCE, actualMembers=[Test-NodeA-37820, Test-NodeB-59687], persistentUUIDs=[d56ec014-ebb3-4be9-9ce2-91c2982ccb73, 96c95d15-440a-4dc7-915d-5d36ac4257bb]}, availability mode = null
10:37:51,340 DEBUG (transport-thread-Test-NodeB-p2311-t6:[Merge-2]) [ClusterTopologyManagerImpl] Updating cluster-wide stable topology for cache org.infinispan.CONFIG, topology = CacheTopology{id=1, rebalanceId=1, currentCH=ReplicatedConsistentHash{ns = 256, owners = (1)[Test-NodeA-37820: 256]}, pendingCH=null, unionCH=null, phase=NO_REBALANCE, actualMembers=[Test-NodeA-37820], persistentUUIDs=[d56ec014-ebb3-4be9-9ce2-91c2982ccb73]}
10:37:51,340 FATAL (transport-thread-Test-NodeB-p2311-t6:[Merge-2]) [CLUSTER] [Context=org.infinispan.CONFIG]ISPN000313: Lost data because of abrupt leavers [Test-NodeA-37820]
10:37:51,340 DEBUG (transport-thread-Test-NodeB-p2311-t6:[Merge-2]) [ClusterCacheStatus] Queueing rebalance for cache org.infinispan.CONFIG with members [Test-NodeB-59687]
10:37:51,341 DEBUG (transport-thread-Test-NodeB-p2311-t6:[Topology-org.infinispan.CONFIG]) [LocalTopologyManagerImpl] Updating local topology for cache org.infinispan.CONFIG: CacheTopology{id=4, rebalanceId=3, currentCH=ReplicatedConsistentHash{ns = 256, owners = (1)[Test-NodeA-37820: 256]}, pendingCH=null, unionCH=null, phase=NO_REBALANCE, actualMembers=[Test-NodeA-37820, Test-NodeB-59687], persistentUUIDs=[d56ec014-ebb3-4be9-9ce2-91c2982ccb73, 96c95d15-440a-4dc7-915d-5d36ac4257bb]}
*** The topology is re-initialized, without sending topology update
10:37:51,378 DEBUG (transport-thread-Test-NodeB-p2311-t1:[Merge-2]) [ClusterCacheStatus] Queueing rebalance for cache ___defaultcache with members [Test-NodeB-59687]
10:37:51,547 INFO  (jgroups-7,Test-NodeB-59687:[]) [CLUSTER] ISPN000094: Received new cluster view for channel ISPN: [Test-NodeB-59687|3] (2) [Test-NodeB-59687, Test-NodeA-12100]
10:37:51,962 DEBUG (testng-Test:[]) [LocalTopologyManagerImpl] Node Test-NodeA-12100 joining cache org.infinispan.CONFIG
10:37:51,964 DEBUG (remote-thread-Test-NodeB-p2309-t6:[]) [ClusterCacheStatus] Queueing rebalance for cache org.infinispan.CONFIG with members [Test-NodeB-59687, Test-NodeA-12100]
*** Rebalance start is sent with wrong topology id
10:37:51,964 INFO  (remote-thread-Test-NodeB-p2309-t6:[]) [CLUSTER] ISPN000310: Starting cluster-wide rebalance for cache org.infinispan.CONFIG, topology CacheTopology{id=2, rebalanceId=2, currentCH=ReplicatedConsistentHash{ns = 256, owners = (1)[Test-NodeB-59687: 256]}, pendingCH=ReplicatedConsistentHash{ns = 256, owners = (2)[Test-NodeB-59687: 129, Test-NodeA-12100: 127]}, unionCH=null, phase=READ_OLD_WRITE_ALL, actualMembers=[Test-NodeB-59687, Test-NodeA-12100], persistentUUIDs=[96c95d15-440a-4dc7-915d-5d36ac4257bb, 538b5324-cda9-49df-9786-7c6d6458332e]}
10:37:51,965 DEBUG (transport-thread-Test-NodeB-p2311-t4:[Topology-org.infinispan.CONFIG]) [LocalTopologyManagerImpl] Ignoring old rebalance for cache org.infinispan.CONFIG, current topology is 4: CacheTopology{id=2, rebalanceId=2, currentCH=ReplicatedConsistentHash{ns = 256, owners = (1)[Test-NodeB-59687: 256]}, pendingCH=ReplicatedConsistentHash{ns = 256, owners = (2)[Test-NodeB-59687: 129, Test-NodeA-12100: 127]}, unionCH=null, phase=READ_OLD_WRITE_ALL, actualMembers=[Test-NodeB-59687, Test-NodeA-12100], persistentUUIDs=[96c95d15-440a-4dc7-915d-5d36ac4257bb, 538b5324-cda9-49df-9786-7c6d6458332e]}

Attachments

Issue Links

is duplicated by

ISPN-8596 CacheManagerTest#testCacheManagerRestartReusingConfigurations failure

Closed

Coordinator crash in 2-node cluster can lead to invalid cache topology

Details

Description

Attachments

Issue Links

Activity

People

Dates