Details
- Type: Bug
- Resolution: Won't Do
- Priority: Major
- Affects Version: 3.4.2
Description
Hi,
We're seeing a problem which appears to be caused by the TransferQueueBundler thread being blocked while it fails to connect to an unavailable member.
The setup we have is a cluster split across two sites: say, members 0 through 4 in site A and members 5 through 9 in site B. Initially the cluster is complete: everyone has the same view. The case that we're testing is: what happens when connectivity is lost between the sites? NB we're using TCP transport.
Obviously the expected result is that we'd get two sub-clusters, one in each site. But this doesn't always happen. Instead we sometimes see some members become singletons (that is, with only themselves in view).
What seems to be happening is something like this:
- When the cross-site link is cut, members in site A suspect members in site B (and vice versa).
- So within each site there's a broadcast of SUSPECT messages.
- Each of the members in site A then tries to VERIFY_SUSPECT each of the members in site B.
- Each such attempt blocks the TransferQueueBundler for two seconds (TCP's default sock_conn_timeout), because no member in the other site can be contacted.
- But that delays all outgoing messages, not only messages to the 'other' site.
- If there are enough members in the 'other' site, the cumulative delay is easily large enough that HEARTBEAT (and then VERIFY_SUSPECT) messages start timing out between members in the same site.
- At this point, members that ought to be able to see one another start to report that they cannot.
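The back-of-envelope arithmetic for this failure mode can be sketched as follows (the fdBudgetMs value is an illustrative assumption, not our exact FD configuration; the other numbers come from our setup):

```scala
// Illustrative model of the bundler stall; fdBudgetMs is an assumed
// failure-detection budget, not taken from our actual stack config.
val sockConnTimeoutMs  = 2000  // TCP's default sock_conn_timeout
val unreachableMembers = 5     // e.g. members 5 through 9 in the other site

// Worst case: the single TransferQueueBundler thread blocks sequentially
// on every unreachable member before it can send anything else.
val worstCaseStallMs = sockConnTimeoutMs * unreachableMembers

// If the stall exceeds the failure-detection budget, intra-site HEARTBEAT
// messages time out even though the link within the site is healthy.
val fdBudgetMs = 6000  // assumption: e.g. FD timeout * max_tries
val intraSiteSuspicions = worstCaseStallMs > fdBudgetMs
```

With these numbers the stall is 10 seconds, comfortably exceeding the assumed budget, so members in the same site start suspecting each other.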
We've seen cases where a member becomes completely isolated - forming a singleton cluster - and does not recover. Unfortunately we don't have a full trace from that run, so it's not clear why the cluster didn't eventually recover. I suspect we're hitting something like JGRP-1493, in which delays in sending messages (in that case, a delay when failing to get a physical address) caused the MergeKiller to always prevent merging.
It is highly undesirable that several unavailable members, as in a partition between two sites, should cause problems for members that can still see one another.
Should all message sending really be blocked while failing to connect to an unavailable member?
This issue also seems related to JGRP-1815, which raises a similar question: should all message sending really be blocked while failing to find a physical address?
What do you think?
- do you agree that blocking message sending while attempting to connect to an unavailable member is undesirable?
- if so, what do you think the right fix is? If it's not too hard, we may be able to find time to take a look at implementing this ourselves.
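To make the question concrete, one direction we had in mind is per-destination outbound queues, so a blocking connect to one unreachable member never delays traffic to reachable ones. This is purely an illustrative sketch, not JGroups code: PerDestinationSender and all names here are made up.

```scala
import java.util.concurrent.{ConcurrentHashMap, LinkedBlockingQueue}

// Illustrative sketch only (not JGroups API): the bundler-equivalent
// thread enqueues and returns immediately; any sock_conn_timeout stall
// is confined to a per-destination writer thread (not shown).
class PerDestinationSender {
  type Address = String
  type Message = String

  private val queues =
    new ConcurrentHashMap[Address, LinkedBlockingQueue[Message]]()

  // Called by the sending thread: never blocks on connection setup.
  def send(dest: Address, msg: Message): Unit =
    queues
      .computeIfAbsent(dest, _ => new LinkedBlockingQueue[Message]())
      .offer(msg)

  // Messages waiting for a given destination's writer to drain them.
  def pending(dest: Address): Int =
    Option(queues.get(dest)).map(_.size).getOrElse(0)
}
```

The point of the sketch is just that a stalled connect to one destination leaves the queues for every other destination unaffected.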
- is there anything else we can do to help progress this issue?
We're using JGroups 3.4.2. The code fragment with which we configure the stack is included below.
Thanks for your help
David
stack.addProtocol((new TCP)
  .setValue("enable_diagnostics", false)
  .setValue("logical_addr_cache_max_size", 70)
  .setValue("logical_addr_cache_expiration", 10000)
  .setValue("physical_addr_max_fetch_attempts", 1)
  .setValue("bind_addr", localAddr)
  .setValue("bind_port", basePort)
  .setValue("port_range", 0))

val tcpping = new TCPPING
val jhosts = initialHosts map { addr => new IpAddress(addr.getHostAddress, basePort) }
tcpping.setInitialHosts(jhosts)
tcpping.setPortRange(0)
tcpping.setValue("return_entire_cache", true)

stack.addProtocol(tcpping)
  .addProtocol(new MERGE3)
  .addProtocol((new FD_SOCK)
    .setValue("bind_addr", localAddr)
    .setValue("client_bind_port", basePort + 1)
    .setValue("start_port", basePort + 101)
    .setValue("suspect_msg_interval", 1000))
  .addProtocol(new FD)
  .addProtocol((new VERIFY_SUSPECT)
    .setValue("timeout", 1000))
  .addProtocol((new NAKACK2)
    .setValue("use_mcast_xmit", false))
  .addProtocol(new UNICAST3)
  .addProtocol(new STABLE)
  .addProtocol(new MFC)
  .addProtocol(new SEQUENCER)
  .addProtocol((new GMS)
    .setValue("max_join_attempts", 3)
    .setValue("use_delta_views", false))
  .addProtocol(new FRAG2)