Infinispan / ISPN-1814

CacheViewsManagerImpl enters an infinite loop if a joining node is killed before installing the initial view


      Description

      When a node leaves the cluster gracefully, it is automatically removed from the set of joiners and from the next cache view. If, however, it leaves without sending a CacheViewControlCommand{REQUEST_LEAVE} (for example, because it was killed), its departure is handled properly only if the node was part of the last committed view.

      This is visible in the attached log (a simplified version of https://issues.jboss.org/secure/attachment/12350962/org.jboss.as.test.clustering.unmanaged.singleton.SingletonTestCase-output.txt, from ISPN-1806)

      The test repeatedly kills a node (node-udp-1) and starts it up again. For some reason JGroups did not detect that the node had been killed, so when the node was restarted we received a 3-node view that listed node-udp-1 twice:

      20:22:49,552 INFO  [org.infinispan.remoting.transport.jgroups.JGroupsTransport] (Incoming-13,null) ISPN000094: Received new cluster view: [node-udp-0/cluster|4] [node-udp-0/cluster, node-udp-1/cluster, node-udp-1/cluster]
      

      CacheViewsManagerImpl tried to install the new view, but the killed node was of course not responding. I believe we got an exception only when the test timed out and stopped the new node-udp-1:

      20:23:20,170 ERROR [org.infinispan.cacheviews.CacheViewsManagerImpl] (CacheViewInstaller-1,node-udp-0/cluster) ISPN000172: Failed to prepare view CacheView{viewId=6, members=[node-udp-0/cluster, node-udp-1/cluster]} for cache  default, rolling back to view CacheView{viewId=5, members=[node-udp-0/cluster]}: java.util.concurrent.ExecutionException: org.infinispan.remoting.transport.jgroups.SuspectException: Suspected member: node-udp-1/cluster
      

      However, because of the bug in CacheViewsManagerImpl, we kept trying to install a cache view with 3 nodes:

      20:23:20,226 ERROR [org.infinispan.cacheviews.CacheViewsManagerImpl] (CacheViewInstaller-1,node-udp-0/cluster) ISPN000172: Failed to prepare view CacheView{viewId=8, members=[node-udp-0/cluster, node-udp-1/cluster, node-udp-1/cluster]} for cache  default, rolling back to view CacheView{viewId=7, members=[node-udp-0/cluster]}: java.util.concurrent.ExecutionException: org.infinispan.remoting.transport.jgroups.SuspectException: One or more nodes have left the cluster while replicating command CacheViewControlCommand{cache=default, type=PREPARE_VIEW, sender=node-udp-0/cluster, newViewId=8, newMembers=[node-udp-0/cluster, node-udp-1/cluster, node-udp-1/cluster], oldViewId=7, oldMembers=[node-udp-0/cluster]}
      

      The test couldn't really stop because it was blocked waiting for a transaction commit, and the commit command was waiting for the cache view installation to end.

          Activity

          Dan Berindei added a comment -

          Fixed by removing the joiners that are no longer members of the cluster from the pending changes.
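
The idea behind the fix can be sketched roughly as follows. This is only an illustration, not the actual CacheViewsManagerImpl code: the class PendingChangesSketch, its methods, and the identifiers node-udp-1-old and node-udp-1-new (standing in for the two incarnations of node-udp-1, which share a logical name in the log) are all hypothetical. The sketch assumes that whenever a new cluster view arrives, any joiner that is not in that view is dropped before the next cache view is computed, so a joiner that died before its first view was committed cannot keep reappearing in PREPARE_VIEW attempts.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for the pending membership changes of one cache.
// Not the real Infinispan class; names and structure are assumptions.
class PendingChangesSketch {
    private final Set<String> committedMembers = new LinkedHashSet<>();
    private final Set<String> joiners = new LinkedHashSet<>();

    PendingChangesSketch(Collection<String> committedMembers) {
        this.committedMembers.addAll(committedMembers);
    }

    // A node asked to join but is not yet part of a committed view.
    void requestJoin(String node) {
        joiners.add(node);
    }

    // The step described in the comment above: forget joiners that are
    // no longer members of the cluster.
    void removeDeadJoiners(Collection<String> clusterMembers) {
        joiners.retainAll(clusterMembers);
    }

    // Members of the next cache view: surviving committed members plus joiners.
    List<String> nextViewMembers(Collection<String> clusterMembers) {
        List<String> next = new ArrayList<>();
        for (String member : committedMembers) {
            if (clusterMembers.contains(member)) {
                next.add(member);
            }
        }
        for (String joiner : joiners) {
            if (!next.contains(joiner)) {
                next.add(joiner);
            }
        }
        return next;
    }

    public static void main(String[] args) {
        PendingChangesSketch pending =
                new PendingChangesSketch(Arrays.asList("node-udp-0"));
        pending.requestJoin("node-udp-1-old"); // killed before its view was committed
        pending.requestJoin("node-udp-1-new"); // the restarted instance

        // Cluster view after the killed instance has been removed from it.
        List<String> clusterView = Arrays.asList("node-udp-0", "node-udp-1-new");

        pending.removeDeadJoiners(clusterView);
        System.out.println(pending.nextViewMembers(clusterView));
    }
}
```

Without the removeDeadJoiners call, nextViewMembers would keep returning the dead joiner, which mirrors the repeated failed attempts to prepare a 3-member view shown in the description.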

          Dan Berindei added a comment -

          Small correction: The JGroups protocol stack had FD_SOCK enabled, but for some reason it did not suspect the killed member. Only FD suspected it, 30 seconds later, and VERIFY_SUSPECT removed it from the cluster view. The second node-udp-1 was still alive at that point.

          However, the first node-udp-1 wasn't part of a committed view either: oldMembers=[node-udp-0/cluster]. So the CacheViewsManagerImpl bug occurred because of the first node-udp-1, not because of the second one.


            People

            • Assignee: Dan Berindei
            • Reporter: Dan Berindei
