Infinispan / ISPN-6399

Timeout updating the JGroups view after killing one node


Details

    Description

      GMS can sometimes delay the processing of a join/leave request because of JGRP-2028.

      Joiners retry automatically after GMS.join_timeout, so it's not that bad. Leavers, however, don't resend their leave requests, so the delay can be worse.

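      Normally, the FD/FD_ALL/FD_SOCK protocols would wake up the ViewHandler thread. But we remove the FD* protocols from the stack in most of our tests, unless the test uses DISCARD. That means a leave request can sit unprocessed until another node leaves. In the log below, NodeB's LEAVE_REQ is delivered on NodeA at 16:35:56, the test times out at 16:36:07 with NodeB still in every view, and the view without NodeB is only installed at 16:37:07, right after NodeD's LEAVE_REQ arrives: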

      16:35:56,247 TRACE (testng-ClusterListenerDistAddListenerTest:) [GMS] NodeB-8309: sending LEAVE request to NodeA-45395
      16:35:56,268 TRACE (OOB-1,NodeA-45395:) [TCP_NIO2] NodeA-45395: received [dst: NodeA-45395, src: NodeB-8309 (3 headers), size=0 bytes, flags=OOB], headers are GMS: GmsHeader[LEAVE_REQ]: mbr=NodeB-8309, UNICAST3: DATA, seqno=22, TP: [cluster_name=ISPN]
      16:35:56,268 TRACE (OOB-1,NodeA-45395:) [UNICAST3] NodeA-45395: delivering NodeB-8309#22
      
      16:36:07,263 ERROR (testng-ClusterListenerDistAddListenerTest:) [UnitTestTestNGListener] Test testMemberJoinsAndRetrievesClusterListenersButMainListenerNodeDiesBeforeInstalled(org.infinispan.notifications.cachelistener.cluster.ClusterListenerDistAddListenerTest) failed.
      org.infinispan.util.concurrent.TimeoutException: Timed out before caches had complete views.  Expected 3 members in each view.  Views are as follows: [[NodeA-45395|3] (4) [NodeA-45395, NodeB-8309, NodeC-53222, NodeD-55165], [NodeA-45395|3] (4) [NodeA-45395, NodeB-8309, NodeC-53222, NodeD-55165], [NodeA-45395|3] (4) [NodeA-45395, NodeB-8309, NodeC-53222, NodeD-55165]]
      
      16:37:07,341 TRACE (testng-ClusterListenerDistAddListenerTest:) [GMS] NodeD-55165: sending LEAVE request to NodeA-45395
      16:37:07,361 TRACE (OOB-4,NodeA-45395:) [TCP_NIO2] NodeA-45395: received [dst: NodeA-45395, src: NodeD-55165 (3 headers), size=0 bytes, flags=OOB], headers are GMS: GmsHeader[LEAVE_REQ]: mbr=NodeD-55165, UNICAST3: DATA, seqno=21, TP: [cluster_name=ISPN]
      16:37:07,361 TRACE (OOB-4,NodeA-45395:) [UNICAST3] NodeA-45395: delivering NodeD-55165#21
      16:37:07,361 TRACE (ViewHandler,NodeA-45395:) [GMS] NodeA-45395: joiners=[], suspected=[], leaving=[NodeB-8309], new view: [NodeA-45395|4] (3) [NodeA-45395, NodeC-53222, NodeD-55165]
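
      To put a number on the delay, the two timestamps in the trace can be subtracted: the delivery of NodeB's LEAVE_REQ and the view installed by the ViewHandler. This is just a small sketch using java.time; the class and method names are made up for illustration, and the timestamps are copied from the log above.

```java
import java.time.Duration;
import java.time.LocalTime;
import java.time.format.DateTimeFormatter;

// Hypothetical helper, only for reading the trace above; not part of Infinispan or JGroups.
final class LeaveDelay {
    // Time of day as printed in the trace: a comma separates seconds from milliseconds.
    private static final DateTimeFormatter LOG_TIME = DateTimeFormatter.ofPattern("HH:mm:ss,SSS");

    // Elapsed time between two log timestamps taken on the same day.
    static Duration between(String start, String end) {
        return Duration.between(LocalTime.parse(start, LOG_TIME), LocalTime.parse(end, LOG_TIME));
    }

    public static void main(String[] args) {
        // LEAVE_REQ from NodeB delivered on NodeA -> view without NodeB installed.
        Duration delay = between("16:35:56,268", "16:37:07,361");
        System.out.println("Leave request waited " + delay.toMillis() + " ms");
    }
}
```

      The result is a little over 71 seconds, far longer than the roughly 11 seconds the test waited before failing.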
      

      FD_ALL is pretty cheap: it just sends a message every second, without opening any new sockets. So I think we should enable it by default, and only enable FD_SOCK with TransportFlags.withFD(true).
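
      A rough sketch of the proposed rule, assuming the test framework filters a list of protocol names when it builds the stack. The class, the method and the sample stack below are hypothetical and only illustrate the idea; this is not the actual test-framework code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of the proposal; names are assumptions, not Infinispan code.
final class FdStackSketch {
    // Keeps FD_ALL unconditionally and keeps FD_SOCK only when the test asked for failure detection.
    static List<String> protocolsFor(List<String> stack, boolean withFD) {
        List<String> result = new ArrayList<>();
        for (String protocol : stack) {
            if (protocol.equals("FD_SOCK") && !withFD) {
                continue; // FD_SOCK opens extra sockets, so it stays opt-in
            }
            result.add(protocol);
        }
        return result;
    }

    public static void main(String[] args) {
        // Sample stack for illustration only; a real stack has more protocols.
        List<String> stack = Arrays.asList("TCP_NIO2", "FD_SOCK", "FD_ALL", "UNICAST3", "GMS");
        System.out.println("default:      " + protocolsFor(stack, false));
        System.out.println("withFD(true): " + protocolsFor(stack, true));
    }
}
```

      With this rule every test stack would still have FD_ALL to wake up the ViewHandler, and only tests that ask for it would pay for FD_SOCK.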

            People

              dberinde@redhat.com Dan Berindei (Inactive)