Details
Bug - Resolution: Done - Major - 2.2.8 - None
Description
While my nodes were under load, I found a deadlock in org.jgroups.protocols.TOTAL in the CVS HEAD version. (It may also be present in 2.2.8. Unfortunately I don't have time to test this, but according to the version history no change after 2.2.8 could have introduced the bug, so it should exist there too.)
I identified the location of the deadlock:
It occurs in the Retransmitter.remove(long) method.
The first thread tries to retransmit a message. It executes Retransmitter$Entry.run(), which synchronizes on "this" (the Entry).
The second thread then calls the remove method and acquires the lock on the msg object at Retransmitter.remove(long) line 123, but it blocks at line 126 because it tries to synchronize on the same Entry object that the first thread locked in Retransmitter$Entry.run().
The first thread then reaches Retransmitter.remove(long) but cannot acquire the lock on the msg object at line 123, because the second thread already holds it.
-> Deadlock
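The lock ordering described above can be reproduced in isolation. The following is only a minimal sketch, not JGroups code: entryLock and msgsLock are hypothetical stand-ins for the Entry monitor and the msg object monitor, and ReentrantLock.tryLock with a timeout replaces "synchronized" so that the demo terminates instead of hanging the way the real threads do.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical stand-ins: entryLock models the monitor held in Retransmitter$Entry.run(),
// msgsLock models the monitor taken at Retransmitter.remove(long) line 123.
class LockOrderInversionDemo {

    /** Returns true if neither thread could obtain its second lock, i.e. the cycle formed. */
    static boolean cycleForms() {
        ReentrantLock entryLock = new ReentrantLock();
        ReentrantLock msgsLock = new ReentrantLock();
        CountDownLatch bothHoldFirst = new CountDownLatch(2);
        CountDownLatch bothTried = new CountDownLatch(2);
        boolean[] gotSecond = new boolean[2];

        // "TimeScheduler" thread: Entry first, then msgs.
        Thread timer = new Thread(() ->
                gotSecond[0] = acquireInOrder(entryLock, msgsLock, bothHoldFirst, bothTried));
        // "UpHandler" thread: msgs first, then Entry (the opposite order).
        Thread upHandler = new Thread(() ->
                gotSecond[1] = acquireInOrder(msgsLock, entryLock, bothHoldFirst, bothTried));
        timer.start();
        upHandler.start();
        try {
            timer.join();
            upHandler.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return !gotSecond[0] && !gotSecond[1];
    }

    static boolean acquireInOrder(ReentrantLock first, ReentrantLock second,
                                  CountDownLatch bothHoldFirst, CountDownLatch bothTried) {
        first.lock();
        try {
            bothHoldFirst.countDown();
            bothHoldFirst.await();      // wait until the other thread holds its first lock
            boolean got = second.tryLock(200, TimeUnit.MILLISECONDS);
            if (got) {
                second.unlock();
            }
            bothTried.countDown();
            bothTried.await();          // keep the first lock until both attempts are over
            return got;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            first.unlock();
        }
    }

    public static void main(String[] args) {
        System.out.println("cycle formed: " + cycleForms());
    }
}
```

With plain "synchronized" blocks in place of tryLock, both threads would wait forever, which matches the two suspended threads in the stack traces.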
This occurs only under very high load, but when it does, it completely locks up JGroups. I am currently trying to find a solution, but as I don't know the code well, could you please have a look at it?
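One common way to break a cycle like this is to avoid taking the Entry lock while the msg lock is still held. The sketch below is an assumption about how that could look, not the actual Retransmitter source and not necessarily the fix that was applied: RetransmitterSketch, its Entry class, and the cancel() method are hypothetical, simplified stand-ins. Whether this is safe in the real class depends on invariants that are not visible from the stack traces.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical, simplified stand-in; not the org.jgroups Retransmitter implementation.
class RetransmitterSketch {

    static class Entry {
        final long seqno;
        private boolean cancelled;

        Entry(long seqno) {
            this.seqno = seqno;
        }

        synchronized void cancel() {
            cancelled = true;
        }

        synchronized boolean isCancelled() {
            return cancelled;
        }
    }

    private final List<Entry> msgs = new ArrayList<>();

    void add(long seqno) {
        synchronized (msgs) {
            msgs.add(new Entry(seqno));
        }
    }

    /** Removes the entry for seqno; the Entry monitor is taken only after msgs is released. */
    Entry remove(long seqno) {
        Entry found = null;
        synchronized (msgs) {
            for (Iterator<Entry> it = msgs.iterator(); it.hasNext();) {
                Entry e = it.next();
                if (e.seqno == seqno) {
                    it.remove();
                    found = e;
                    break;
                }
            }
        }
        if (found != null) {
            found.cancel();   // msgs lock is no longer held here, so the locks never nest
        }
        return found;
    }

    int size() {
        synchronized (msgs) {
            return msgs.size();
        }
    }
}
```

In this arrangement a thread that already holds the Entry lock can still enter remove(long), because no other thread waits for an Entry while holding the msg lock.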
The stack traces:
Thread 1:
Thread [TimeScheduler.Thread] (Suspended)
Retransmitter.remove(long) line: 123
AckSenderWindow.ack(long) line: 107
TOTAL._transmitBcastRequest(long) line: 495
TOTAL._retransmitBcastRequest(long) line: 632
TOTAL.access$000(TOTAL, long) line: 65
TOTAL$Command.retransmit(long, Message) line: 205
AckSenderWindow.retransmit(long, long, Address) line: 125
Retransmitter$Entry.run() line: 335
TimeScheduler._run() line: 396
TimeScheduler.access$000(TimeScheduler) line: 46
TimeScheduler$Loop.run() line: 135
Thread.run() line: 534
Thread 2:
Thread [UpHandler (TOTAL)] (Suspended)
Retransmitter.remove(long) line: 126
AckSenderWindow.ack(long) line: 107
TOTAL._recvBcastReply(TOTAL$Header) line: 604
TOTAL._upMsg(Event) line: 720
TOTAL._up(Event) line: 955
TOTAL.up(Event) line: 1022
UpHandler.run() line: 60