  1. JGroups
  2. JGRP-123

Deadlock in Retransmitter

Details

    • Type: Bug
    • Resolution: Done
    • Priority: Major
    • Fix Version/s: 2.2.9
    • Affects Version/s: 2.2.8
    • None

    Description

      While my nodes were under some load, I found a deadlock in org.jgroups.protocols.TOTAL
      in the CVS HEAD version. (It is probably in 2.2.8 as well. Unfortunately I don't have the time to test this, but according to the version history no change after 2.2.8 could have introduced this bug, so it should exist there too.)

      I identified the location of the deadlock:

      It occurs in the Retransmitter.remove(long) method.
      The first thread retransmits a message: it executes Retransmitter$Entry.run(), which holds the lock on "this" (the Entry) for as long as it runs.

      The second thread then calls the remove method and acquires the lock on the msg object in Retransmitter.remove(long) at line 123, but it blocks at line 126, where it tries to synchronize on the same Entry object that the first thread already locked in Retransmitter$Entry.run().

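      The first thread, still holding the Entry lock, then reaches Retransmitter.remove(long) itself (the retransmit callback leads through TOTAL into AckSenderWindow.ack(long), as its stack trace shows) but cannot acquire the lock on the msg object at line 123, because the second thread already holds it.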

      -> Deadlock
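
The lock ordering described above can be reproduced in isolation. The following is only a minimal sketch and not JGroups code: the class name LockOrderDeadlockDemo and the two monitor objects msgLock and entryLock are stand-ins I made up for the msg object and the Entry, and the latch exists only to force the unlucky interleaving that normally needs high load. The two daemon threads take the monitors in opposite order, and the JVM's ThreadMXBean is then asked whether it sees a monitor deadlock.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;
import java.util.concurrent.CountDownLatch;

public class LockOrderDeadlockDemo {
    // Hypothetical stand-ins for the two monitors named in the report.
    static final Object msgLock = new Object();    // the msg object locked at remove(long) line 123
    static final Object entryLock = new Object();  // the Entry locked by Retransmitter$Entry.run()

    /** Starts two daemon threads that lock in opposite order; returns how many threads the JVM reports as deadlocked. */
    static int provokeDeadlock() {
        CountDownLatch bothHoldFirstLock = new CountDownLatch(2);

        Thread timer = new Thread(() -> {
            synchronized (entryLock) {               // Entry.run() holds the Entry
                rendezvous(bothHoldFirstLock);
                synchronized (msgLock) { }           // remove(long) line 123: blocks here
            }
        }, "TimeScheduler.Thread");

        Thread upHandler = new Thread(() -> {
            synchronized (msgLock) {                 // remove(long) line 123
                rendezvous(bothHoldFirstLock);
                synchronized (entryLock) { }         // remove(long) line 126: blocks here
            }
        }, "UpHandler (TOTAL)");

        // Daemon threads, so the stuck threads do not keep the JVM alive.
        timer.setDaemon(true);
        upHandler.setDaemon(true);
        timer.start();
        upHandler.start();

        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        for (int i = 0; i < 100; i++) {              // poll for up to about 5 seconds
            long[] ids = mx.findMonitorDeadlockedThreads();
            if (ids != null) return ids.length;
            try {
                Thread.sleep(50);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return 0;
            }
        }
        return 0;
    }

    /** Waits until both threads hold their first monitor. */
    static void rendezvous(CountDownLatch latch) {
        latch.countDown();
        try {
            latch.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        System.out.println("deadlocked threads: " + provokeDeadlock());
    }
}
```

In the real code the first thread does not take the second lock directly; it gets there through the retransmit callback chain visible in its stack trace, which is why the inversion is easy to miss when reading Retransmitter on its own.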

      This occurs only under very high load, but when it does, it completely locks up JGroups. I am currently trying to find a solution, but as I don't know the code very well, could you please have a look at it?
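
One direction I can think of, without claiming that it is the right fix for JGroups: never hold the Entry lock while calling out to the retransmit command, so that the only nested locking left is "msg object first, then Entry". The sketch below assumes this approach. SafeRetransmitter, RetransmitCommand, runEntry and the other names are hypothetical and do not match the actual Retransmitter API.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

public class SafeRetransmitter {
    /** Hypothetical callback, invoked for a sequence number that is due for retransmission. */
    public interface RetransmitCommand {
        void retransmit(long seqno);
    }

    /** One pending sequence number; its monitor guards only its own state. */
    public static final class Entry {
        final long seqno;
        private boolean cancelled;

        Entry(long seqno) { this.seqno = seqno; }

        synchronized void cancel() { cancelled = true; }

        public synchronized boolean isCancelled() { return cancelled; }
    }

    private final List<Entry> msgs = new ArrayList<>();
    private final RetransmitCommand cmd;

    public SafeRetransmitter(RetransmitCommand cmd) { this.cmd = cmd; }

    public Entry add(long seqno) {
        Entry e = new Entry(seqno);
        synchronized (msgs) { msgs.add(e); }
        return e;
    }

    /** Lock order here is always: msgs first, then the entry. */
    public void remove(long seqno) {
        synchronized (msgs) {
            for (Iterator<Entry> it = msgs.iterator(); it.hasNext();) {
                Entry e = it.next();
                if (e.seqno == seqno) {
                    e.cancel();
                    it.remove();
                    break;
                }
            }
        }
    }

    /** Timer task body: read state under the entry lock, then call out with no lock held. */
    public void runEntry(Entry e) {
        if (e.isCancelled()) return;   // the entry monitor is released again when isCancelled() returns
        cmd.retransmit(e.seqno);       // may call remove() without inverting the lock order
    }

    public int size() {
        synchronized (msgs) { return msgs.size(); }
    }
}
```

The trade-off is that an entry can be removed between the cancelled check and the callback, so one retransmission may still go out for a message that was just acknowledged. A redundant retransmit should be harmless, but I have not checked whether TOTAL relies on the callback running under the Entry lock.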

      The stack traces:
      Thread 1:
      Thread [TimeScheduler.Thread] (Suspended)
      Retransmitter.remove(long) line: 123
      AckSenderWindow.ack(long) line: 107
      TOTAL._transmitBcastRequest(long) line: 495
      TOTAL._retransmitBcastRequest(long) line: 632
      TOTAL.access$000(TOTAL, long) line: 65
      TOTAL$Command.retransmit(long, Message) line: 205
      AckSenderWindow.retransmit(long, long, Address) line: 125
      Retransmitter$Entry.run() line: 335
      TimeScheduler._run() line: 396
      TimeScheduler.access$000(TimeScheduler) line: 46
      TimeScheduler$Loop.run() line: 135
      Thread.run() line: 534

      Thread 2:
      Thread [UpHandler (TOTAL)] (Suspended)
      Retransmitter.remove(long) line: 126
      AckSenderWindow.ack(long) line: 107
      TOTAL._recvBcastReply(TOTAL$Header) line: 604
      TOTAL._upMsg(Event) line: 720
      TOTAL._up(Event) line: 955
      TOTAL.up(Event) line: 1022
      UpHandler.run() line: 60

          People

            Assignee: Bela Ban (rhn-engineering-bban)
            Reporter: Robert Schaffar-Taurok (rschaffar, Inactive)
