Description
We've been having more trouble with concurrent startup and now think
we've isolated a deadlock between FLUSH and GroupRequest.
We have four boxes that join a channel and use MessageDispatcher
immediately after connecting. This frequently blocks indefinitely.
GroupRequest.execute() obtains a lock, then a subsequent view change
comes in and does likewise. The upshot is that we can see that all
Incoming threads are blocked waiting for the lock, and the only way it
can be released is for a stop_flush message to be processed. With all
Incoming threads blocked, that never happens.
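To make the suspected pattern concrete, here is a minimal sketch of the same kind of starvation, independent of JGroups. It assumes a fixed-size pool standing in for the Incoming threads and a latch standing in for the lock that only a stop_flush can release. The class and method names (IncomingPoolStarvation, releaseRuns) are hypothetical and are not JGroups APIs.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Simplified model only: not JGroups code. The pool plays the role of the
// Incoming threads and the latch plays the role of the lock that is released
// when stop_flush is handled.
final class IncomingPoolStarvation {

    /**
     * Returns true if the "release" task ran within the timeout,
     * false if it was still queued behind blocked handlers.
     */
    static boolean releaseRuns(int poolSize, int blockedHandlers, long timeoutMillis)
            throws InterruptedException {
        ExecutorService incoming = Executors.newFixedThreadPool(poolSize);
        CountDownLatch released = new CountDownLatch(1);
        try {
            // Handlers that block until the release happens.
            for (int i = 0; i < blockedHandlers; i++) {
                incoming.execute(() -> {
                    try {
                        released.await();
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
            }
            // The release itself needs a free pool thread to run.
            incoming.execute(released::countDown);
            return released.await(timeoutMillis, TimeUnit.MILLISECONDS);
        } finally {
            // Interrupt any handlers that are still blocked so the pool can exit.
            incoming.shutdownNow();
        }
    }
}
```

With as many blocked handlers as pool threads, the release task never gets a thread and the call times out. With one thread left free, it completes. This only models the shape of the problem we think we are seeing, not the actual FLUSH or GroupRequest code paths.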
In the attached unit test, adding the following after the call to connect("A") makes the test pass, which suggests the hang is a timing-dependent deadlock:
    if (j == 0) {
        Thread.sleep(500);
    }
Additionally, and more speculatively, the wait/notify code in pbcast does not appear to account for spurious wakeups. I don't know under what circumstances they occur, and I don't believe we're seeing them at present, but it should be fixed at some stage.
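For reference, the usual guard against spurious wakeups is to re-check the awaited condition in a loop around wait(), as the Object.wait() Javadoc recommends. The sketch below shows that idiom in isolation. FlushGate, flushDone, awaitFlushDone and signalFlushDone are hypothetical names chosen for illustration and do not correspond to the actual pbcast code.

```java
// Illustrative only: not the pbcast implementation.
final class FlushGate {
    private final Object lock = new Object();
    private boolean flushDone = false; // the condition the waiter actually cares about

    /** Blocks until flushDone is set or the timeout elapses; returns whether it was set. */
    boolean awaitFlushDone(long timeoutMillis) throws InterruptedException {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        synchronized (lock) {
            // Loop, not 'if': a spurious wakeup simply re-checks the condition.
            while (!flushDone) {
                long remainingNanos = deadline - System.nanoTime();
                if (remainingNanos <= 0) {
                    return false; // timed out without the condition becoming true
                }
                // Round up so we never call wait(0), which would wait forever.
                long remainingMillis = (remainingNanos + 999_999L) / 1_000_000L;
                lock.wait(remainingMillis);
            }
            return true;
        }
    }

    /** Sets the condition and wakes all waiters. */
    void signalFlushDone() {
        synchronized (lock) {
            flushDone = true;
            lock.notifyAll();
        }
    }
}
```

Tracking a deadline also means a wakeup that arrives early does not restart the full timeout.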