To reproduce, run JBento in the ATL lab with 6 instances, then run Brian's stress test (3 instances with a total of 3000 threads).
This only occurs with buddy replication (which uses UNICAST); it doesn't occur with NAKACK. It also doesn't occur with TCP as the transport, so that is a workaround.
Let's say we have buddies A and B.
After some time, B's UNICAST AckReceiverWindow for A shows next_to_remove=80265, msgs=[80266-82139]. This means that 80265 is expected as the next seqno, but the lowest seqno received so far is 80266. The window receives new messages every 5 seconds (credit requests from A) and adds them, but it cannot deliver them because it hasn't received 80265 yet!
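To illustrate why a single missing seqno stalls delivery, here is a minimal sketch of an in-order receiver window. ReceiverWindowSketch and its methods are hypothetical names made up for this example; this is not the JGroups AckReceiverWindow code, only an assumption about how such a window behaves based on the state shown above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Hypothetical, simplified in-order receiver window (NOT the JGroups class).
class ReceiverWindowSketch {
    private long nextToRemove;                                   // next seqno we may deliver
    private final TreeMap<Long, String> msgs = new TreeMap<>();  // buffered out-of-order messages

    ReceiverWindowSketch(long initialSeqno) {
        this.nextToRemove = initialSeqno;
    }

    // Buffer a received message; seqnos that were already delivered are dropped.
    void add(long seqno, String payload) {
        if (seqno >= nextToRemove) {
            msgs.putIfAbsent(seqno, payload);
        }
    }

    // Deliver messages strictly in order, stopping at the first gap.
    List<String> removeDeliverable() {
        List<String> delivered = new ArrayList<>();
        String m;
        while ((m = msgs.remove(nextToRemove)) != null) {
            delivered.add(m);
            nextToRemove++;
        }
        return delivered;
    }

    long nextToRemove() { return nextToRemove; }

    int buffered() { return msgs.size(); }

    public static void main(String[] args) {
        ReceiverWindowSketch win = new ReceiverWindowSketch(80265);
        for (long s = 80266; s <= 82139; s++) {
            win.add(s, "msg-" + s);
        }
        // 80265 is missing, so nothing can be delivered yet.
        System.out.println("deliverable: " + win.removeDeliverable().size()
                + ", buffered: " + win.buffered());
        // Once the retransmission of 80265 arrives, the whole backlog drains.
        win.add(80265, "msg-80265");
        System.out.println("deliverable: " + win.removeDeliverable().size()
                + ", buffered: " + win.buffered());
    }
}
```

In this sketch the window keeps growing as long as the gap at next_to_remove is not filled, which matches the msgs=[80266-82139] state seen on B.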
A's UNICAST AckSenderWindow for B shows 1 message in the retransmission queue: 80265. The stack trace shows that the timer thread is still running (waiting for tasks to execute), but for some reason 80265 is never retransmitted to B! We don't see a retransmit() call in the TRACE logs (we do see the other UNICAST methods invoked, e.g. DATA and ACK traces).
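For comparison, this is roughly the behavior expected on the sender side: every unacked seqno should be resent on each timer tick until its ACK arrives. SenderWindowSketch, retransmitTick() and the per-message ACK model are assumptions made for this sketch; it is not the JGroups AckSenderWindow implementation, and the timer is replaced by explicit calls to keep the example deterministic.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentSkipListSet;

// Hypothetical, simplified sender window (NOT the JGroups class).
class SenderWindowSketch {
    // Seqnos that were sent but not yet acknowledged.
    private final ConcurrentSkipListSet<Long> unacked = new ConcurrentSkipListSet<>();
    // Records every transmission and retransmission, standing in for the network.
    private final List<Long> wire = new ArrayList<>();

    void send(long seqno) {
        unacked.add(seqno);
        wire.add(seqno);
    }

    // Assumption: the receiver acknowledges each message individually.
    void ack(long seqno) {
        unacked.remove(seqno);
    }

    // What a timer task would run on each retransmit timeout.
    void retransmitTick() {
        for (long seqno : unacked) {
            wire.add(seqno);
        }
    }

    int pending() { return unacked.size(); }

    // Number of times a seqno was put on the wire (first send plus retransmissions).
    long timesSent(long seqno) {
        return wire.stream().filter(s -> s == seqno).count();
    }

    public static void main(String[] args) {
        SenderWindowSketch win = new SenderWindowSketch();
        win.send(80264);
        win.send(80265);
        win.send(80266);
        win.ack(80264);
        win.ack(80266);          // 80265 was lost, so no ACK for it
        win.retransmitTick();    // expected: 80265 is resent
        win.retransmitTick();    // and resent again
        System.out.println("pending: " + win.pending()
                + ", 80265 sent " + win.timesSent(80265) + " times");
    }
}
```

In the failure described above, the equivalent of retransmitTick() apparently never runs for 80265, even though the message is still in the retransmission queue.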
Relates to: JGRP-507 CloserThread's attempt to interrupt TimeScheduler on closure could be end up being ignored (Resolved)