Details

Type: Bug
Resolution: Done
Priority: Major
Fix Version/s: 3.3.5, 3.4
Affects Version/s: 3.3.1
Labels: None
Forum Reference: https://community.jboss.org/thread/231100?tstart=0
Git Pull Request: https://github.com/belaban/JGroups/pull/90
Steps to Reproduce:
Start one jgroups instance using TCPPING and let it cluster with itself.

Configure an iptables rule preventing communication to this instance on the jgroups port.

Configure a second jgroups instance with TCPPING, listing the first as an initial_host.

After the second instance has started, you'll see in its logs that the discovery timed out and a new view was created containing just the new instance.

Remove the iptables rule on the old instance.

See that neither instance ever reports a complete view of the cluster.


Description

When using TCPPING for discovery, if the very first attempt to discover the rest of the cluster fails (in my case, the connections are timing out due to a suspected EC2 issue), the new node decides that it is alone in the cluster and creates a new view of just itself.

Later, when performing periodic discovery, the new node successfully connects to the existing cluster and sends a GET_MBRS_REQ but, since it's already in a view (the one where it's alone), it doesn't fill in its local IP address (see the logic for this at https://github.com/belaban/JGroups/blob/master/src/org/jgroups/protocols/Discovery.java#L254). This means that the old nodes in the cluster cannot send a reply or any other cluster messages to the new node. Thus the new node, which never gets a response to its GET_MBRS_REQ, continues to think it's alone in the world, and the old members, who got a record of a new cluster member but no address to communicate with it on, start logging that they're dropping messages to <UUID> because they have no physical address. The cluster never heals.

If the physical address were included in all GET_MBRS_REQ messages (e.g. if the if statement were removed from the file I linked above), then, even if initial discovery fails, future discovery would succeed and the cluster would heal itself.
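
To make the failure mode concrete, here is a minimal, self-contained sketch of the exchange described above. It is a toy model and not JGroups code: the class names (`DiscoverySketch`, `DiscoveryRequest`, `ExistingMember`, `NewNode`), the `alwaysSendAddress` flag, and the UUID and address values are all hypothetical, and the real logic in Discovery.java may differ in detail. It only illustrates why omitting the physical address once a node has a view leaves the old members unable to reply.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

/** Toy model of the discovery exchange in this report. Not the JGroups implementation. */
public class DiscoverySketch {

    /** Hypothetical stand-in for a GET_MBRS_REQ: sender UUID plus an optional physical address. */
    static final class DiscoveryRequest {
        final String senderUuid;
        final Optional<String> physicalAddress;

        DiscoveryRequest(String senderUuid, Optional<String> physicalAddress) {
            this.senderUuid = senderUuid;
            this.physicalAddress = physicalAddress;
        }
    }

    /** A node that already belongs to the established cluster. */
    static final class ExistingMember {
        // Maps logical UUIDs to physical addresses learned from requests.
        private final Map<String, String> addressCache = new HashMap<>();

        /** Returns true if this member knows where to send a reply. */
        boolean handle(DiscoveryRequest req) {
            req.physicalAddress.ifPresent(addr -> addressCache.put(req.senderUuid, addr));
            return addressCache.containsKey(req.senderUuid);
        }
    }

    /** The node that starts while the network is blocked. */
    static final class NewNode {
        final String uuid = "node-B";                  // illustrative value
        final String physicalAddress = "10.0.0.2:7800"; // illustrative value
        final boolean alwaysSendAddress;               // models the proposed change
        boolean hasView = false;

        NewNode(boolean alwaysSendAddress) {
            this.alwaysSendAddress = alwaysSendAddress;
        }

        DiscoveryRequest buildRequest() {
            // Reported behaviour: the address is only filled in while the node has no view.
            boolean include = alwaysSendAddress || !hasView;
            return new DiscoveryRequest(uuid,
                    include ? Optional.of(physicalAddress) : Optional.empty());
        }
    }

    /** Simulates a failed initial discovery followed by one periodic discovery. */
    public static boolean clusterHeals(boolean alwaysSendAddress) {
        ExistingMember oldMember = new ExistingMember();
        NewNode newNode = new NewNode(alwaysSendAddress);

        // Initial discovery: the request is built but never delivered (blocked port),
        // so the new node times out and installs a view containing only itself.
        newNode.buildRequest();
        newNode.hasView = true;

        // Periodic discovery after connectivity is restored: the request is delivered.
        return oldMember.handle(newNode.buildRequest());
    }

    public static void main(String[] args) {
        System.out.println("conditional address: heals=" + clusterHeals(false));
        System.out.println("always send address: heals=" + clusterHeals(true));
    }
}
```

In this model, the conditional variant reports `heals=false` because the old member never learns an address for the new node's UUID, while the variant that always sends the address reports `heals=true`.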

This issue is superficially similar to https://issues.jboss.org/browse/JGRP-1203, but in that case the cluster will heal once A performs a discovery later, while in this one it never heals.

People

Assignee: Bela Ban

Reporter: Andy Caldwell (Inactive)

Votes: 0

Watchers: 3

Dates

Created: 2013/08/02 6:34 AM

Updated: 2013/08/03 5:18 AM

Resolved: 2013/08/03 5:18 AM