  1. JGroups
  2. JGRP-2380

Sometimes cluster members are not discovered when using TCPGOSSIP

Details

    • Bug
    • Resolution: Cannot Reproduce
    • Minor
    • 4.1.6
    • 4.0.19
    • None

    Description

      Sometimes a new member cannot join an existing cluster when TCPGOSSIP is used with the use_nio property set to true. In that case the new member creates its own cluster containing only itself. After some time the MERGE3 protocol merges the two clusters into one, but if the min_interval/max_interval values are large, this may take a while.

      For some reason, the first attempt at initial discovery always ends because join_timeout expires. In this case only a few members are discovered, and none of them is a coordinator.
      If we are lucky, GMS prints the following log message: "I (WO-KIT-967-28892) am not the first of the nodes, waiting for another client to become coordinator" and makes a second attempt to join the cluster, which takes only a few milliseconds and succeeds (see logs_success.txt). In the failure case, GMS prints "I (WO-KIT-967-14786) am the first of the nodes, will become coordinator" and creates a new cluster with only one member (see logs_failure.txt).

      Expected behavior: the first attempt at initial discovery should not fail due to the timeout, and it should be as fast as the second attempt.

      Workaround: set use_nio to false (or simply remove it from the stack configuration).
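
      As a rough illustration of the workaround, a stack fragment might look like the sketch below. This is not the attached jgroups.xml: the GossipRouter address, the MERGE3 intervals, and the join_timeout value are placeholder assumptions, and the transport and remaining protocols are omitted. It shows use_nio switched off on TCPGOSSIP, and MERGE3 intervals kept short so that an accidentally created singleton cluster is merged back sooner.

      ```xml
      <config xmlns="urn:org:jgroups">
          <!-- transport and other protocols omitted -->

          <!-- Workaround: use_nio="false" (or drop the attribute entirely).
               initial_hosts is a placeholder GossipRouter address. -->
          <TCPGOSSIP initial_hosts="gossip-host[12001]"
                     use_nio="false"/>

          <!-- Illustrative values in milliseconds; smaller intervals
               shorten the time until split clusters are merged. -->
          <MERGE3 min_interval="1000"
                  max_interval="10000"/>

          <!-- join_timeout bounds how long the first discovery attempt waits;
               the value here is an assumption, not taken from the report. -->
          <pbcast.GMS join_timeout="3000"/>
      </config>
      ```

      If use_nio is set to false on the members, the router they connect to presumably has to accept blocking connections as well.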

      Attachments

        1. jgroups.xml (1.0 kB)
        2. logs_failure.txt (26 kB)
        3. logs_success.txt (3.46 MB)

          People

            Bela Ban (rhn-engineering-bban)
            Pavlo Fedyna (pavlo_fedyna, inactive)