Uploaded image for project: 'Infinispan'
  1. Infinispan
  2. ISPN-8587

Coordinator crash in 2-node cluster can lead to invalid cache topology

This issue belongs to an archived project. You can view it, but you can't modify it. Learn more

      After the coordinator changes, PreferAvailabilityStrategy first broadcasts a cache topology with the currentCH of the "maximum" topology. In the 2nd step it broadcasts a topology that removes all the topology members no longer in the cluster, and in the 3rd step it queues a rebalance with the remaining members.

      If the cluster had only 2 nodes, A (the coordinator) and B, and B had not finished joining the cache, the maximum topology has A as the only member. That means step 2 tries to remove all members, and in the process removes the cache topology from ClusterCacheStatus. When step 3 tries to rebalance with B as the only member, it re-initializes ClusterCacheStatus with topology id 1, and because LocalTopologyManager already has a higher topology id it will never confirm the rebalance.

      This sometimes happens in CacheManagerTest.testRestartReusingConfiguration. Like most other tests, it waits for the cache to finish joining before killing a node. But it only waits for the test cache, not for the CONFIG cache (which has awaitInitialTransfer(false)). Also, most of the time A either finishes the rebalance or re-initializes ClusterCacheStatus and sends a topology update with B as the only member before leaving. The test only fails if B doesn't receive or ignores one or more topology updates.

      10:37:50,674 INFO  (remote-thread-Test-NodeA-p2265-t6:[]) [CLUSTER] ISPN000310: Starting cluster-wide rebalance for cache org.infinispan.CONFIG, topology CacheTopology{id=2, rebalanceId=2, currentCH=ReplicatedConsistentHash{ns = 256, owners = (1)[Test-NodeA-37820: 256]}, pendingCH=ReplicatedConsistentHash{ns = 256, owners = (2)[Test-NodeA-37820: 134, Test-NodeB-59687: 122]}, unionCH=null, phase=READ_OLD_WRITE_ALL, actualMembers=[Test-NodeA-37820, Test-NodeB-59687], persistentUUIDs=[d56ec014-ebb3-4be9-9ce2-91c2982ccb73, 96c95d15-440a-4dc7-915d-5d36ac4257bb]}
      10:37:51,037 DEBUG (remote-thread-Test-NodeA-p2265-t6:[]) [ClusterTopologyManagerImpl] Updating cluster-wide current topology for cache org.infinispan.CONFIG, topology = CacheTopology{id=3, rebalanceId=2, currentCH=ReplicatedConsistentHash{ns = 256, owners = (1)[Test-NodeA-37820: 256]}, pendingCH=ReplicatedConsistentHash{ns = 256, owners = (2)[Test-NodeA-37820: 134, Test-NodeB-59687: 122]}, unionCH=null, phase=READ_ALL_WRITE_ALL, actualMembers=[Test-NodeA-37820, Test-NodeB-59687], persistentUUIDs=[d56ec014-ebb3-4be9-9ce2-91c2982ccb73, 96c95d15-440a-4dc7-915d-5d36ac4257bb]}, availability mode = AVAILABLE
      10:37:51,097 DEBUG (remote-thread-Test-NodeA-p2265-t5:[]) [ClusterTopologyManagerImpl] Updating cluster-wide current topology for cache org.infinispan.CONFIG, topology = CacheTopology{id=4, rebalanceId=2, currentCH=ReplicatedConsistentHash{ns = 256, owners = (1)[Test-NodeA-37820: 256]}, pendingCH=ReplicatedConsistentHash{ns = 256, owners = (2)[Test-NodeA-37820: 134, Test-NodeB-59687: 122]}, unionCH=null, phase=READ_NEW_WRITE_ALL, actualMembers=[Test-NodeA-37820, Test-NodeB-59687], persistentUUIDs=[d56ec014-ebb3-4be9-9ce2-91c2982ccb73, 96c95d15-440a-4dc7-915d-5d36ac4257bb]}, availability mode = AVAILABLE
      10:37:51,203 DEBUG (testng-Test:[]) [ClusterTopologyManagerImpl] Updating cluster-wide current topology for cache org.infinispan.CONFIG, topology = CacheTopology{id=5, rebalanceId=2, currentCH=ReplicatedConsistentHash{ns = 256, owners = (1)[Test-NodeB-59687: 256]}, pendingCH=null, unionCH=null, phase=NO_REBALANCE, actualMembers=[Test-NodeB-59687], persistentUUIDs=[96c95d15-440a-4dc7-915d-5d36ac4257bb]}, availability mode = AVAILABLE
      10:37:51,207 INFO  (jgroups-7,Test-NodeB-59687:[]) [CLUSTER] ISPN000094: Received new cluster view for channel ISPN: [Test-NodeB-59687|2] (1) [Test-NodeB-59687]
      *** Here topology updates are ignored
      10:37:51,340 DEBUG (transport-thread-Test-NodeB-p2311-t5:[Topology-org.infinispan.CONFIG]) [LocalTopologyManagerImpl] Ignoring topology 4 for cache org.infinispan.CONFIG from old coordinator Test-NodeA-37820
      10:37:51,340 DEBUG (transport-thread-Test-NodeB-p2311-t5:[Topology-org.infinispan.CONFIG]) [LocalTopologyManagerImpl] Ignoring topology 5 for cache org.infinispan.CONFIG from old coordinator Test-NodeA-37820
      10:37:51,340 DEBUG (transport-thread-Test-NodeB-p2311-t6:[Merge-2]) [ClusterCacheStatus] Recovered 1 partition(s) for cache org.infinispan.CONFIG: [CacheTopology{id=3, rebalanceId=2, currentCH=ReplicatedConsistentHash{ns = 256, owners = (1)[Test-NodeA-37820: 256]}, pendingCH=ReplicatedConsistentHash{ns = 256, owners = (2)[Test-NodeA-37820: 134, Test-NodeB-59687: 122]}, unionCH=null, phase=READ_ALL_WRITE_ALL, actualMembers=[Test-NodeA-37820, Test-NodeB-59687], persistentUUIDs=[d56ec014-ebb3-4be9-9ce2-91c2982ccb73, 96c95d15-440a-4dc7-915d-5d36ac4257bb]}]
      10:37:51,340 DEBUG (transport-thread-Test-NodeB-p2311-t6:[Merge-2]) [ClusterCacheStatus] Updating topologies after merge for cache org.infinispan.CONFIG, current topology = CacheTopology{id=4, rebalanceId=3, currentCH=ReplicatedConsistentHash{ns = 256, owners = (1)[Test-NodeA-37820: 256]}, pendingCH=null, unionCH=null, phase=NO_REBALANCE, actualMembers=[Test-NodeA-37820, Test-NodeB-59687], persistentUUIDs=[d56ec014-ebb3-4be9-9ce2-91c2982ccb73, 96c95d15-440a-4dc7-915d-5d36ac4257bb]}, stable topology = CacheTopology{id=1, rebalanceId=1, currentCH=ReplicatedConsistentHash{ns = 256, owners = (1)[Test-NodeA-37820: 256]}, pendingCH=null, unionCH=null, phase=NO_REBALANCE, actualMembers=[Test-NodeA-37820], persistentUUIDs=[d56ec014-ebb3-4be9-9ce2-91c2982ccb73]}, availability mode = null, resolveConflicts = false
      10:37:51,340 DEBUG (transport-thread-Test-NodeB-p2311-t6:[Merge-2]) [ClusterTopologyManagerImpl] Updating cluster-wide current topology for cache org.infinispan.CONFIG, topology = CacheTopology{id=4, rebalanceId=3, currentCH=ReplicatedConsistentHash{ns = 256, owners = (1)[Test-NodeA-37820: 256]}, pendingCH=null, unionCH=null, phase=NO_REBALANCE, actualMembers=[Test-NodeA-37820, Test-NodeB-59687], persistentUUIDs=[d56ec014-ebb3-4be9-9ce2-91c2982ccb73, 96c95d15-440a-4dc7-915d-5d36ac4257bb]}, availability mode = null
      10:37:51,340 DEBUG (transport-thread-Test-NodeB-p2311-t6:[Merge-2]) [ClusterTopologyManagerImpl] Updating cluster-wide stable topology for cache org.infinispan.CONFIG, topology = CacheTopology{id=1, rebalanceId=1, currentCH=ReplicatedConsistentHash{ns = 256, owners = (1)[Test-NodeA-37820: 256]}, pendingCH=null, unionCH=null, phase=NO_REBALANCE, actualMembers=[Test-NodeA-37820], persistentUUIDs=[d56ec014-ebb3-4be9-9ce2-91c2982ccb73]}
      10:37:51,340 FATAL (transport-thread-Test-NodeB-p2311-t6:[Merge-2]) [CLUSTER] [Context=org.infinispan.CONFIG]ISPN000313: Lost data because of abrupt leavers [Test-NodeA-37820]
      10:37:51,340 DEBUG (transport-thread-Test-NodeB-p2311-t6:[Merge-2]) [ClusterCacheStatus] Queueing rebalance for cache org.infinispan.CONFIG with members [Test-NodeB-59687]
      10:37:51,341 DEBUG (transport-thread-Test-NodeB-p2311-t6:[Topology-org.infinispan.CONFIG]) [LocalTopologyManagerImpl] Updating local topology for cache org.infinispan.CONFIG: CacheTopology{id=4, rebalanceId=3, currentCH=ReplicatedConsistentHash{ns = 256, owners = (1)[Test-NodeA-37820: 256]}, pendingCH=null, unionCH=null, phase=NO_REBALANCE, actualMembers=[Test-NodeA-37820, Test-NodeB-59687], persistentUUIDs=[d56ec014-ebb3-4be9-9ce2-91c2982ccb73, 96c95d15-440a-4dc7-915d-5d36ac4257bb]}
      *** The topology is re-initialized, without sending topology update
      10:37:51,378 DEBUG (transport-thread-Test-NodeB-p2311-t1:[Merge-2]) [ClusterCacheStatus] Queueing rebalance for cache ___defaultcache with members [Test-NodeB-59687]
      10:37:51,547 INFO  (jgroups-7,Test-NodeB-59687:[]) [CLUSTER] ISPN000094: Received new cluster view for channel ISPN: [Test-NodeB-59687|3] (2) [Test-NodeB-59687, Test-NodeA-12100]
      10:37:51,962 DEBUG (testng-Test:[]) [LocalTopologyManagerImpl] Node Test-NodeA-12100 joining cache org.infinispan.CONFIG
      10:37:51,964 DEBUG (remote-thread-Test-NodeB-p2309-t6:[]) [ClusterCacheStatus] Queueing rebalance for cache org.infinispan.CONFIG with members [Test-NodeB-59687, Test-NodeA-12100]
      *** Rebalance start is sent with wrong topology id
      10:37:51,964 INFO  (remote-thread-Test-NodeB-p2309-t6:[]) [CLUSTER] ISPN000310: Starting cluster-wide rebalance for cache org.infinispan.CONFIG, topology CacheTopology{id=2, rebalanceId=2, currentCH=ReplicatedConsistentHash{ns = 256, owners = (1)[Test-NodeB-59687: 256]}, pendingCH=ReplicatedConsistentHash{ns = 256, owners = (2)[Test-NodeB-59687: 129, Test-NodeA-12100: 127]}, unionCH=null, phase=READ_OLD_WRITE_ALL, actualMembers=[Test-NodeB-59687, Test-NodeA-12100], persistentUUIDs=[96c95d15-440a-4dc7-915d-5d36ac4257bb, 538b5324-cda9-49df-9786-7c6d6458332e]}
      10:37:51,965 DEBUG (transport-thread-Test-NodeB-p2311-t4:[Topology-org.infinispan.CONFIG]) [LocalTopologyManagerImpl] Ignoring old rebalance for cache org.infinispan.CONFIG, current topology is 4: CacheTopology{id=2, rebalanceId=2, currentCH=ReplicatedConsistentHash{ns = 256, owners = (1)[Test-NodeB-59687: 256]}, pendingCH=ReplicatedConsistentHash{ns = 256, owners = (2)[Test-NodeB-59687: 129, Test-NodeA-12100: 127]}, unionCH=null, phase=READ_OLD_WRITE_ALL, actualMembers=[Test-NodeB-59687, Test-NodeA-12100], persistentUUIDs=[96c95d15-440a-4dc7-915d-5d36ac4257bb, 538b5324-cda9-49df-9786-7c6d6458332e]}
      

              dberinde@redhat.com Dan Berindei (Inactive)
              dberinde@redhat.com Dan Berindei (Inactive)
              Archiver:
              rhn-support-adongare Amol Dongare

                Created:
                Updated:
                Resolved:
                Archived: