JGroups / JGRP-39

A TCP stack does not correctly detect failure (pulled cable) for certain TCPPING configurations

Details

    • Bug
    • Resolution: Done
    • Major
    • 2.2.8
    • 2.2.9
    • None

    Description

      Physical hosts "A" (192.168.1.1, coordinator) and "B" (192.168.1.2) run JGroups processes configured with TCP/TCPPING stacks.

      "A" stack configuration:

      TCP(bind_addr=192.168.1.1;start_port=11800;loopback=true):
      TCPPING(initial_hosts=192.168.1.2[11800];port_range=3;timeout=3500;num_initial_members=3;up_thread=true;down_thread=true):
      MERGE2(min_interval=5000;max_interval=10000):
      FD(shun=true;timeout=1500;max_tries=3;up_thread=true;down_thread=true):
      VERIFY_SUSPECT(timeout=1500;down_thread=false;up_thread=false):
      pbcast.NAKACK(down_thread=true;up_thread=true;gc_lag=100;retransmit_timeout=3000):
      pbcast.STABLE(desired_avg_gossip=20000;down_thread=false;up_thread=false):
      pbcast.GMS(join_timeout=5000;join_retry_timeout=2000;shun=false;print_local_addr=false;down_thread=true;up_thread=true)

      "B" stack configuration:

      TCP(bind_addr=192.168.1.2;start_port=11800;loopback=true):
      TCPPING(initial_hosts=192.168.1.1[11800];port_range=3;timeout=3500;num_initial_members=3;up_thread=true;down_thread=true):
      MERGE2(min_interval=5000;max_interval=10000):
      FD(shun=true;timeout=1500;max_tries=3;up_thread=true;down_thread=true):
      VERIFY_SUSPECT(timeout=1500;down_thread=false;up_thread=false):
      pbcast.NAKACK(down_thread=true;up_thread=true;gc_lag=100;retransmit_timeout=3000):
      pbcast.STABLE(desired_avg_gossip=20000;down_thread=false;up_thread=false):
      pbcast.GMS(join_timeout=5000;join_retry_timeout=2000;shun=false;print_local_addr=false;down_thread=true;up_thread=true)

      If I pull the cable on B, the B stack immediately and correctly identifies A as suspect and installs a new view containing only itself.

      However, A does not recognize B as suspect and nondeterministically emits various info and warning messages. The view (A, B) incorrectly stays "valid" for a long time; sometimes it is replaced by (A), sometimes not.

      I tracked the cause down to A's TCPPING configuration and the TCP queue. If A's TCPPING is configured with port_range=1, the problem goes away and the new view is installed in the A stack immediately. It seems that if the TCP queue holds messages other than the SUSPECT message generated by FD, they interfere and the SUSPECT message gets stuck in the queue, with nondeterministic results.
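
      To make the difference between port_range=3 and port_range=1 concrete, here is a minimal sketch of how the initial_hosts entry might expand into probe targets. This is not the TCPPING source: the class ProbeTargets and the method expand are hypothetical, and the range semantics (the listed port plus the next port_range - 1 ports) are an assumption about this JGroups version.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of JGroups: illustrates which host:port
// pairs a TCPPING-style discovery would probe for a given port_range.
class ProbeTargets {

    // Expands entries such as "192.168.1.2[11800]" into "host:port" strings.
    // ASSUMPTION: the range starts at the listed port and covers
    // portRange consecutive ports (start .. start + portRange - 1).
    static List<String> expand(String initialHosts, int portRange) {
        List<String> targets = new ArrayList<>();
        for (String entry : initialHosts.split(",")) {
            entry = entry.trim();
            if (entry.isEmpty()) {
                continue;
            }
            int open = entry.indexOf('[');
            int close = entry.indexOf(']');
            if (open < 0 || close < open) {
                throw new IllegalArgumentException("expected host[port], got: " + entry);
            }
            String host = entry.substring(0, open);
            int startPort = Integer.parseInt(entry.substring(open + 1, close));
            for (int port = startPort; port < startPort + portRange; port++) {
                targets.add(host + ":" + port);
            }
        }
        return targets;
    }

    public static void main(String[] args) {
        // A's configuration as reported: three candidate ports on B.
        System.out.println(expand("192.168.1.2[11800]", 3));
        // The configuration that makes the problem go away: one port on B.
        System.out.println(expand("192.168.1.2[11800]", 1));
    }
}
```

      Under that assumption, A with port_range=3 addresses three ports on B while only 11800 is known to have a listener, so more outbound traffic to the unreachable host can end up queued alongside the SUSPECT message than with port_range=1.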

          People

            ovidiu.feodorov_jira Ovidiu Feodorov (Inactive)