JGroups / JGRP-2030

GMS: view_ack_collection_timeout delay when last 2 members leave concurrently


Details

    • Type: Bug
    • Resolution: Won't Do
    • Priority: Minor
    • Fix Version/s: 3.6.12
    • Affects Version/s: 3.6.8
    • Component/s: None

    Description

      When the coordinator (NodeE) leaves, it tries to install a new view on behalf of the new coordinator (NodeG, the last member).

      21:33:26,844 TRACE (ViewHandler,InitialClusterSizeTest-NodeE-42422:) [GMS] InitialClusterSizeTest-NodeE-42422: mcasting view [InitialClusterSizeTest-NodeG-30521|3] (1) [InitialClusterSizeTest-NodeG-30521] (1 mbrs)
      21:33:26,844 TRACE (ViewHandler,InitialClusterSizeTest-NodeE-42422:) [TCP_NIO2] InitialClusterSizeTest-NodeE-42422: sending msg to null, src=InitialClusterSizeTest-NodeE-42422, headers are GMS: GmsHeader[VIEW], NAKACK2: [MSG, seqno=1], TP: [cluster_name=ISPN]
      

      The message is actually sent later by the bundler, but NodeG is also sending its LEAVE_REQ message at the same time. Both nodes try to create a connection to each other, and only NodeG succeeds:

      21:33:26,844 TRACE (ForkThread-2,InitialClusterSizeTest:) [TCP_NIO2] InitialClusterSizeTest-NodeG-30521: sending msg to InitialClusterSizeTest-NodeE-42422, src=InitialClusterSizeTest-NodeG-30521, headers are GMS: GmsHeader[LEAVE_REQ]: mbr=InitialClusterSizeTest-NodeG-30521, UNICAST3: DATA, seqno=1, conn_id=1, first, TP: [cluster_name=ISPN]
      
      21:33:26,865 TRACE (Timer-2,InitialClusterSizeTest-NodeG-30521:) [TCP_NIO2] InitialClusterSizeTest-NodeG-30521: sending 1 msgs (83 bytes (0.27% of max_bundle_size) to 1 dests(s): [ISPN:InitialClusterSizeTest-NodeE-42422]
      21:33:26,865 TRACE (Timer-2,InitialClusterSizeTest-NodeE-42422:) [TCP_NIO2] InitialClusterSizeTest-NodeE-42422: sending 1 msgs (91 bytes (0.29% of max_bundle_size) to 1 dests(s): [ISPN]
      21:33:26,865 TRACE (Timer-2,InitialClusterSizeTest-NodeG-30521:) [TCP_NIO2] dest=127.0.0.1:7900 (86 bytes)
      21:33:26,865 TRACE (Timer-2,InitialClusterSizeTest-NodeE-42422:) [TCP_NIO2] dest=127.0.0.1:7920 (94 bytes)
      21:33:26,865 TRACE (Timer-2,InitialClusterSizeTest-NodeE-42422:) [TCP_NIO2] 127.0.0.1:7900: connecting to 127.0.0.1:7920
      21:33:26,865 TRACE (Timer-2,InitialClusterSizeTest-NodeG-30521:) [TCP_NIO2] 127.0.0.1:7920: connecting to 127.0.0.1:7900
      21:33:26,866 TRACE (NioConnection.Reader [null],InitialClusterSizeTest-NodeG-30521:) [TCP_NIO2] 127.0.0.1:7920: rejected connection from 127.0.0.1:7900  (connection existed and my address won as it's higher)
      21:33:26,867 TRACE (OOB-1,InitialClusterSizeTest-NodeE-42422:) [TCP_NIO2] InitialClusterSizeTest-NodeE-42422: received [dst: InitialClusterSizeTest-NodeE-42422, src: InitialClusterSizeTest-NodeG-30521 (3 headers), size=0 bytes, flags=OOB], headers are GMS: GmsHeader[LEAVE_REQ]: mbr=InitialClusterSizeTest-NodeG-30521, UNICAST3: DATA, seqno=1, conn_id=1, first, TP: [cluster_name=ISPN]
      

      I'm guessing NodeE would need a STABLE round in order to retransmit the VIEW message, but I'm not sure whether the stable round would work, since it has already (partially?) installed the new view with NodeG as the only member. However, I think it should be possible for NodeE to remove NodeG from its AckCollector once it receives its LEAVE_REQ, and stop blocking.
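      A minimal sketch of that idea, using a simplified stand-in rather than the real GMS/AckCollector classes (names and structure here are illustrative only): a member that has already sent LEAVE_REQ is dropped from the set of members the coordinator is still waiting on, so the wait returns instead of blocking until view_ack_collection_timeout.

      import java.util.Collection;
      import java.util.HashSet;
      import java.util.Set;

      // Hypothetical, simplified ack collector; illustrates the proposed change,
      // not the actual org.jgroups.util.AckCollector implementation.
      public class SimpleAckCollector {
          private final Set<Object> missing = new HashSet<>();

          public synchronized void reset(Collection<?> members) {
              missing.clear();
              missing.addAll(members);
          }

          // Called when a view ack is received from a member.
          public synchronized void ack(Object member) {
              if (missing.remove(member) && missing.isEmpty())
                  notifyAll();
          }

          // Proposed: called when a LEAVE_REQ is received from a member that has
          // not acked the view yet (NodeG in this report), so the coordinator
          // stops waiting for it.
          public synchronized void memberLeft(Object member) {
              ack(member);
          }

          // Returns true if all acks arrived, false if the timeout elapsed
          // (the view_ack_collection_timeout delay described above).
          public synchronized boolean waitForAllAcks(long timeout) throws InterruptedException {
              long deadline = System.currentTimeMillis() + timeout;
              while (!missing.isEmpty()) {
                  long remaining = deadline - System.currentTimeMillis();
                  if (remaining <= 0)
                      return false;
                  wait(remaining);
              }
              return true;
          }
      }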

      This is a minor annoyance in a few of the Infinispan tests - most of them shut down the nodes serially, so they don't see this delay.
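      For reference, a rough reproducer along those lines (not one of the actual Infinispan tests; the config file and cluster name are placeholders) closes the last two channels concurrently instead of serially:

      import org.jgroups.JChannel;

      public class ConcurrentLeaveRepro {
          public static void main(String[] args) throws Exception {
              // Placeholder configuration; any TCP-based stack should do.
              JChannel coord = new JChannel("tcp-nio.xml"); // plays the NodeE role
              JChannel last  = new JChannel("tcp-nio.xml"); // plays the NodeG role
              coord.connect("ISPN");
              last.connect("ISPN");

              // Closing both channels at the same time may trigger the race
              // described above: the coordinator mcasts the new view while the
              // other member sends its LEAVE_REQ, and both sides try to open a
              // connection to each other concurrently.
              Thread t1 = new Thread(coord::close);
              Thread t2 = new Thread(last::close);
              t1.start();
              t2.start();
              t1.join();
              t2.join();
              // When the race occurs, coord.close() blocks for roughly
              // view_ack_collection_timeout.
          }
      }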

      The question is whether the concurrent connection setup can have an impact on other messages as well - e.g. during startup, when there aren't many messages being sent around to trigger retransmission. Could the node that failed to open its connection retry immediately on the connection opened by the other node?
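      For context, the rejection in the trace above follows a "higher address wins" rule when both peers connect to each other at the same time. A standalone illustration of that tie-break (not the actual TCP_NIO2 code; the ordering used here is only an assumption):

      import java.net.InetSocketAddress;
      import java.util.Comparator;

      public class ConnectTieBreak {

          // Hypothetical ordering (IP string, then port) used only for illustration.
          static final Comparator<InetSocketAddress> ORDER =
              Comparator.comparing((InetSocketAddress a) -> a.getAddress().getHostAddress())
                        .thenComparingInt(InetSocketAddress::getPort);

          // True if the local peer should reject the incoming duplicate connection
          // because its own address is higher and a connection already exists.
          static boolean rejectDuplicate(InetSocketAddress local, InetSocketAddress remote) {
              return ORDER.compare(local, remote) > 0;
          }

          public static void main(String[] args) {
              InetSocketAddress nodeG = new InetSocketAddress("127.0.0.1", 7920);
              InetSocketAddress nodeE = new InetSocketAddress("127.0.0.1", 7900);
              // Matches the trace: 127.0.0.1:7920 rejects the connection from 127.0.0.1:7900.
              System.out.println(rejectDuplicate(nodeG, nodeE)); // true
          }
      }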

    People

      Assignee: Bela Ban
      Reporter: Dan Berindei (Inactive)
