  1. JGroups
  2. JGRP-668

Deadlock condition in BARRIER

Details

    • Type: Bug
    • Resolution: Done
    • Priority: Major
    • Fix Version/s: 2.6.3, 2.7
    • Affects Version/s: 2.6.1
    • None

    Description

      Hey Bela et al:

      We've been fighting a lossy network (UDP receive errors) on a cluster of 50
      machines and managed to produce 2 coordinators that refused to MERGE3. A
      closer examination revealed this stack trace
      (http://www.nabble.com/file/p14991972/barrier_deadlock.txt), which showed:
      one thread trying to satisfy a STATE_REQ msg, blocked down in
      BARRIER.closeBarrier() waiting for in_flight_threads to empty; another
      thread trying to service another STATE_REQ, blocked trying to lock the
      state_requesters table up in STATE_TRANSFER.handleStateReq(); and another
      12 threads blocked waiting for the barrier to open in BARRIER.up().

      We quickly found a problematic deadlock condition in BARRIER itself;
      here's the patch to fix it
      (http://www.nabble.com/file/p14991972/BARRIER.java.patch). However, we
      cannot see an easy way to handle 2 STATE_REQ messages arriving one right
      after the other. Both will enter the in_flight_threads set, and only one
      will come back down to close the barrier, where it will wait forever for
      the other one to leave in_flight_threads. If we let the 2nd come down too,
      it may go back up before in_flight_threads is clear, since all it does is
      see that the barrier is closed and return.
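      The two cases described above can be reproduced with a small model. The
      sketch below is not the JGroups BARRIER source: the class BarrierSketch,
      its methods, the ignoreSelf flag, and the timeouts are all hypothetical,
      and it assumes the closing thread holds the state_requesters lock while
      it waits, which is what the stack trace suggests. A timeout stands in for
      "waits forever" so the demo terminates.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.CountDownLatch;

/** Simplified model of the reported scenario; NOT the JGroups BARRIER code. */
final class BarrierSketch {
    private final Set<Thread> inFlight = new HashSet<>();
    private boolean closed = false;

    /** A thread passing up through the barrier registers itself as in flight. */
    synchronized void enter() throws InterruptedException {
        while (closed) wait();
        inFlight.add(Thread.currentThread());
    }

    synchronized void leave() {
        inFlight.remove(Thread.currentThread());
        notifyAll();
    }

    synchronized void open() {
        closed = false;
        notifyAll();
    }

    /** Close, then wait for in-flight threads to drain; false means timed out. */
    synchronized boolean close(long timeoutMs, boolean ignoreSelf) throws InterruptedException {
        closed = true;
        long deadline = System.currentTimeMillis() + timeoutMs;
        while (!drained(ignoreSelf)) {
            long remaining = deadline - System.currentTimeMillis();
            if (remaining <= 0) return false;
            wait(remaining);
        }
        return true;
    }

    private boolean drained(boolean ignoreSelf) {
        if (inFlight.isEmpty()) return true;
        // Hypothetical fix: do not wait for the closing thread itself.
        return ignoreSelf && inFlight.size() == 1 && inFlight.contains(Thread.currentThread());
    }

    /** One requester closes the barrier while it is itself in flight. */
    static boolean singleRequester(boolean ignoreSelf) {
        BarrierSketch barrier = new BarrierSketch();
        try {
            barrier.enter();
            boolean ok = barrier.close(200, ignoreSelf);
            barrier.leave();
            barrier.open();
            return ok;
        } catch (InterruptedException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Two requesters are in flight; A closes while B waits for A's table lock. */
    static boolean twoRequesters() {
        BarrierSketch barrier = new BarrierSketch();
        Object stateRequesters = new Object(); // stands in for the state_requesters table
        CountDownLatch bEntered = new CountDownLatch(1);
        CountDownLatch aHoldsTable = new CountDownLatch(1);
        Thread b = new Thread(() -> {
            try {
                barrier.enter();
                bEntered.countDown();
                aHoldsTable.await();
                synchronized (stateRequesters) { /* B would handle its STATE_REQ here */ }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                barrier.leave();
            }
        });
        try {
            b.start();
            barrier.enter();           // requester A is in flight too
            bEntered.await();
            boolean ok;
            synchronized (stateRequesters) {
                aHoldsTable.countDown();
                ok = barrier.close(300, true); // B cannot leave until A lets go
            }
            barrier.leave();
            barrier.open();
            b.join();
            return ok;
        } catch (InterruptedException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("self counted:   " + singleRequester(false)); // false
        System.out.println("self ignored:   " + singleRequester(true));  // true
        System.out.println("two requesters: " + twoRequesters());        // false
    }
}
```

      In this model, skipping the closing thread fixes the single-requester
      case, but the two-requester case still cannot drain, because the second
      thread stays in flight until the first releases the lock it holds while
      waiting.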

      Although we may have fixed part of the deadlock we saw, we are looking into
      switching to the FLUSH protocol instead because of the 2 STATE_REQ issue.
      We're curious about others' feedback on this issue: are more folks using
      FLUSH or BARRIER?

      Thanks much for an [otherwise] great code stack. We are excited to be using
      it in our distributed database system project.

      gray

      Attachments

        1. barrier_deadlock.txt
          5 kB
        2. BARRIER.java.patch
          0.9 kB
        3. jgroups_protocol.xml
          3 kB

          People

            rhn-engineering-bban Bela Ban