Details
-
Bug
-
Resolution: Done
-
Major
-
2.2.9
-
None
Description
Physical hosts "A" (192.168.1.1, coordinator) and "B" (192.168.1.2) run JGroups processes configured with TCP/TCPPING stacks.
"A" stack configuration:
TCP(bind_addr=192.168.1.1;start_port=11800;loopback=true):
TCPPING(initial_hosts=192.168.1.2[11800];port_range=3;timeout=3500;num_initial_members=3;up_thread=true;down_thread=true):
MERGE2(min_interval=5000;max_interval=10000):
FD(shun=true;timeout=1500;max_tries=3;up_thread=true;down_thread=true):
VERIFY_SUSPECT(timeout=1500;down_thread=false;up_thread=false):
pbcast.NAKACK(down_thread=true;up_thread=true;gc_lag=100;retransmit_timeout=3000):
pbcast.STABLE(desired_avg_gossip=20000;down_thread=false;up_thread=false):
pbcast.GMS(join_timeout=5000;join_retry_timeout=2000;shun=false;print_local_addr=false;down_thread=true;up_thread=true)
"B" stack configuration:
TCP(bind_addr=192.168.1.2;start_port=11800;loopback=true):
TCPPING(initial_hosts=192.168.1.1[11800];port_range=3;timeout=3500;num_initial_members=3;up_thread=true;down_thread=true):
MERGE2(min_interval=5000;max_interval=10000):
FD(shun=true;timeout=1500;max_tries=3;up_thread=true;down_thread=true):
VERIFY_SUSPECT(timeout=1500;down_thread=false;up_thread=false):
pbcast.NAKACK(down_thread=true;up_thread=true;gc_lag=100;retransmit_timeout=3000):
pbcast.STABLE(desired_avg_gossip=20000;down_thread=false;up_thread=false):
pbcast.GMS(join_timeout=5000;join_retry_timeout=2000;shun=false;print_local_addr=false;down_thread=true;up_thread=true)
If I pull the cable under B, the B stack immediately and correctly indentifies A as suspect and installs a new view containing itself only.
However, A does not recognizes B as suspect and undeterministically spews out various info and warning messages. The view (A, B) stays incorrectly "valid" for a long time; sometimes gets replaced by (A), sometimes not.
I tracked down the cause of the problem down to the A TCPPING configuration and TCP queue . If A's TCPPING is configured with a port_range=1, the problem goes away and the new view immediately installs into the A stack. It seems that if there are messages in the TCP queue except the SUSPECT message generated by FD, they mess up things and the SUSPECT message gets stuck in the queue, with undeterministic results.