Description
This warning is caused by A attempting to send a unicast message to B, but the physical (IP) address of B is not in the cache (TP.logical_addr_cache).
The logical_addr_cache is populated at startup, during the discovery phase. However, when we have 20 members and (PING.)num_initial_members is 3, discovery returns after the first 3 responses, so the cache holds the IP addresses of only those 3 members.
For example, suppose we have a cluster in which A is the coordinator. A new joiner X returns from discovery after receiving the discovery responses of (say) F, B and G. If X next tries to send a JOIN request to A, the send will fail and the (unicast) request is dropped, as the IP address of A is not yet present.
Even worse: if X invokes a unicast RPC on any member P for which it doesn't have the IP address, and sends no further messages to P (e.g. from a separate thread), then the RPC will time out. The exception is UNICAST, which uses a positive ack scheme and keeps retransmitting the request until P sends an ack.
The problem is that - when we don't have an IP address for P - we send a discovery request to fetch the IP address, but drop the current request.
There are 2 levels at which we can fix this problem:
#1 Make sure we receive at least the IP address of the coordinator at startup
This is done by making sure (in the above example) that A's IP address is part of the response set before we return from the discovery phase. Note that when we send a discovery request, everybody will reply, but if we return after reception of the replies from B, F and G, we won't have the coordinator's address to send a JOIN request to. So #1 makes sure that we only return after having the coordinator's IP address.
Note that this can still lead to problems when trying to send a unicast message to a different member whose IP address we don't yet have! This is solved in #2 below.
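The completion condition of #1 can be sketched as follows. This is a minimal illustration only, not the actual JGroups code: the class DiscoveryResponses, its methods, and the assumption that each discovery response carries a flag saying whether the sender is the coordinator are all hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch of a discovery response set; not the JGroups implementation. */
final class DiscoveryResponses {
    private final int numInitialMembers;
    // logical name -> physical (IP) address, in order of arrival
    private final Map<String, String> physicalAddrs = new LinkedHashMap<>();
    private boolean coordinatorSeen;

    DiscoveryResponses(int numInitialMembers) {
        this.numInitialMembers = numInitialMembers;
    }

    /** Records one discovery response (isCoordinator is an assumed field of the response). */
    synchronized void add(String logicalName, String physicalAddr, boolean isCoordinator) {
        physicalAddrs.put(logicalName, physicalAddr);
        if (isCoordinator)
            coordinatorSeen = true;
        notifyAll();
    }

    /** Done only when enough responses arrived AND the coordinator's address is among them. */
    synchronized boolean isDone() {
        return physicalAddrs.size() >= numInitialMembers && coordinatorSeen;
    }

    /** Waits until isDone() or the timeout elapses, then returns a copy of what was collected. */
    synchronized Map<String, String> await(long timeoutMs) {
        long deadline = System.nanoTime() + timeoutMs * 1_000_000L;
        while (!isDone()) {
            long remainingMs = (deadline - System.nanoTime()) / 1_000_000L;
            if (remainingMs <= 0)
                break;
            try {
                wait(remainingMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                break;
            }
        }
        return new LinkedHashMap<>(physicalAddrs);
    }
}
```

With num_initial_members set to 3, responses from F, B and G alone would not complete this response set; it completes only once A's response has been added, or when the timeout expires.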
#2 When asking for the IP address of P, don't drop the current message to P, but loop for a short time until the address has been fetched
We don't block here, but simply loop for a limited time, in order to wait for the IP address. In most cases, this is not even necessary because #1 reduces the chances of an IP address not being available, but if it is, fetching an IP address usually takes a few milliseconds.
Looping just reduces the chances that we have to run into a timeout with a blocking unicast RPC, or wait until stability flushes the pending unicast, causing it to be retransmitted.
Note that we increase the wait time on every loop iteration, to prevent discovery storms. Plus, we also stagger the discovery requests: if 2 threads T1 and T2 trigger a discovery at time 30 and 120 respectively, then T2 will not send a discovery request, because the response to the discovery request triggered by T1 also provides T2 with the IP address.
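The looping and staggering described in #2 might look roughly like the sketch below. It is an illustration under assumptions, not the actual transport code: the class AddressFetchSketch, the minimum interval between discovery requests, the doubling of the wait time, and the use of Thread.sleep as the waiting mechanism are all hypothetical choices, and the times 30 and 120 are treated as milliseconds.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical sketch of bounded looping plus staggered discovery; not the JGroups code. */
final class AddressFetchSketch {
    private static final long NEVER = Long.MIN_VALUE;

    // Stand-in for the logical address cache: logical name -> physical (IP) address
    final Map<String, String> cache = new ConcurrentHashMap<>();
    final AtomicInteger discoveryRequestsSent = new AtomicInteger();

    private final AtomicLong lastDiscoveryMs = new AtomicLong(NEVER);
    private final long minDiscoveryIntervalMs;   // assumed staggering window
    private final Runnable sendDiscoveryRequest; // assumed hook that sends the request

    AddressFetchSketch(long minDiscoveryIntervalMs, Runnable sendDiscoveryRequest) {
        this.minDiscoveryIntervalMs = minDiscoveryIntervalMs;
        this.sendDiscoveryRequest = sendDiscoveryRequest;
    }

    /** Sends a discovery request unless another thread sent one within the window. */
    boolean maybeSendDiscovery(long nowMs) {
        while (true) {
            long last = lastDiscoveryMs.get();
            if (last != NEVER && nowMs - last < minDiscoveryIntervalMs)
                return false; // a recent request is still expected to deliver the address
            if (lastDiscoveryMs.compareAndSet(last, nowMs)) {
                discoveryRequestsSent.incrementAndGet();
                sendDiscoveryRequest.run();
                return true;
            }
        }
    }

    /** Loops for at most maxTotalWaitMs, doubling the wait on every iteration. */
    String awaitPhysicalAddress(String dest, long initialWaitMs, long maxTotalWaitMs) {
        long waitMs = initialWaitMs, waited = 0;
        while (cache.get(dest) == null && waited < maxTotalWaitMs) {
            maybeSendDiscovery(System.currentTimeMillis());
            long sleepMs = Math.min(waitMs, maxTotalWaitMs - waited);
            try {
                Thread.sleep(sleepMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                break;
            }
            waited += sleepMs;
            waitMs *= 2; // growing wait time keeps discovery traffic low
        }
        return cache.get(dest); // null means: give up and let the caller decide
    }
}
```

In this sketch, with a window of 100, a call to maybeSendDiscovery(30) sends a request while a call to maybeSendDiscovery(120) does not, which mirrors the T1/T2 example. A null result from awaitPhysicalAddress corresponds to the case where the address could not be fetched within the time limit.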