JGroups / JGRP-750

Deadlock between GroupRequest and FLUSH during concurrent startup.

Details

    • Bug
    • Resolution: Done
    • Major
    • 2.6.3, 2.7
    • 2.6.3
    • None
    • Add sleep after connect and avoid concurrent channel startup

    Description

      We've been having more trouble with concurrent startup and now think
      we've isolated a deadlock between FLUSH and GroupRequest during
      concurrent startup.

      We have four boxes that join a channel and use MessageDispatcher
      immediately after connecting. This frequently blocks indefinitely.

      GroupRequest.execute() obtains a lock, and a subsequent view change
      then tries to obtain it as well. The upshot is that all Incoming
      threads end up blocked waiting for the lock, and the only way it can
      be released is for a stop_flush message to arrive. With all Incoming
      threads blocked, that message is never processed.
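
      The starvation pattern described above can be sketched in isolation. This is a simplified model and not JGroups code: the class PoolStarvationSketch, the method stopFlushDelivered and the latch standing in for the lock are all hypothetical names chosen for illustration. A fixed-size pool plays the role of the Incoming threads, and the one task that would unblock them is queued behind them on the same pool.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

/** Hypothetical model of the reported hang; not taken from the JGroups sources. */
final class PoolStarvationSketch {

    /**
     * Queues blockedHandlers tasks that each park a pool thread until the
     * "flush" ends, then queues the task that would end it on the same pool.
     * Returns true if that task managed to run within waitMs.
     */
    static boolean stopFlushDelivered(int poolSize, int blockedHandlers, long waitMs) {
        ExecutorService incoming = Executors.newFixedThreadPool(poolSize);
        CountDownLatch flushStopped = new CountDownLatch(1);
        try {
            for (int i = 0; i < blockedHandlers; i++) {
                incoming.execute(() -> {
                    try {
                        flushStopped.await(); // handler blocked until the flush ends
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
            }
            // The only task that can unblock the handlers needs a pool thread too.
            incoming.execute(flushStopped::countDown);
            return flushStopped.await(waitMs, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            // Interrupt parked workers so the sketch always terminates.
            incoming.shutdownNow();
        }
    }
}
```

      With four pool threads and four blocked handlers, the unblocking task never gets a thread and the call returns false after the timeout. With one thread left free, it returns true.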

      In the attached unit test, adding the following after the call to connect("A") makes the test pass, implying a deadlock:

      if (j == 0) {
          Thread.sleep(500);
      }

      Additionally, and more speculatively, the wait/notify code in pbcast does not seem to account for spurious wakeups. I don't know under what circumstances they occur, and I don't believe we're seeing them at this time, but this should be fixed at some stage.
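
      For reference, the usual defence against spurious wakeups is to re-check the awaited condition in a loop around wait(). The sketch below is a generic illustration and not the pbcast code: the class FlushGate and its methods are hypothetical names, and swallowing the interrupt inside awaitRelease is a simplification made for the example.

```java
import java.util.concurrent.TimeUnit;

/** Hypothetical one-shot gate; shows a wait guarded by a predicate loop. */
final class FlushGate {
    private final Object lock = new Object();
    private boolean released = false; // the condition the waiter actually cares about

    /** Blocks until release() is called or timeoutMs elapses; returns the final state. */
    boolean awaitRelease(long timeoutMs) {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMs);
        synchronized (lock) {
            try {
                // Re-check the predicate after every wakeup: wait() may return
                // without a matching notify, or before the timeout has elapsed.
                while (!released) {
                    long remainingNs = deadline - System.nanoTime();
                    if (remainingNs <= 0) {
                        break; // timed out
                    }
                    // Add 1 ms so we never call wait(0), which would wait forever.
                    lock.wait(TimeUnit.NANOSECONDS.toMillis(remainingNs) + 1);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt for the caller
            }
            return released;
        }
    }

    /** Sets the condition and wakes every waiter. */
    void release() {
        synchronized (lock) {
            released = true;
            lock.notifyAll();
        }
    }
}
```

      A single if (!released) lock.wait(); would return on any wakeup, whether or not release() had been called. The loop makes the result depend only on the flag and the deadline.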

          People

            rhn-engineering-bban Bela Ban
            rnewson_jira Robert Newson (Inactive)