JGroups / JGRP-1282

Race condition in FLUSH when master leaves cluster

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Done
    • Affects Version/s: 2.6.16
    • Fix Version/s: 2.6.19, 2.11.2, 2.12
    • Labels: None

      Description

      There is a race condition in FLUSH when the master node leaves the cluster.
      It can prevent the master from sending a new view (with a new master) before it leaves.

      The FLUSH is started when GMS sends down an Event.SUSPEND.
      FLUSH.down calls FLUSH.startFlush, which calls FLUSH.onSuspend.
      onSuspend sends a START_FLUSH message down.

      In the working case, the local node gets the START_FLUSH first.
      FLUSH.up calls FLUSH.handleStartFlush, which calls FLUSH.onStartFlush.
      onStartFlush sets the member variable "flushMembers".

      Then the other nodes reply to the START_FLUSH with a FLUSH_COMPLETED.
      FLUSH.up calls FLUSH.onFlushCompleted.
      onFlushCompleted checks "flushMembers" against the list of replies.
      If they match (and flushMembers is not null), the flush completes.

      But in the non-working case, the FLUSH_COMPLETED from the other
      nodes is processed before the local START_FLUSH.
      In this case, flushMembers has not been set, and onFlushCompleted
      does nothing, expecting more replies (which never come).
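
      The two orderings can be replayed in a small, single-threaded model.
      This is only a sketch under stated assumptions: the class below is
      hypothetical and is not the JGroups FLUSH class. Only the names
      flushMembers, onStartFlush and onFlushCompleted come from this report;
      the reply set, the containsAll check and the node names "A", "B", "C"
      are assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical, simplified model of the master-side bookkeeping described in
// this report. It is NOT the JGroups FLUSH implementation.
class FlushRaceSketch {
    private Set<String> flushMembers;                        // null until the local START_FLUSH is handled
    private final Set<String> flushCompleted = new HashSet<>();
    private boolean flushDone;

    // Local handling of START_FLUSH: records who takes part in the flush.
    void onStartFlush(List<String> participants) {
        flushMembers = new HashSet<>(participants);
    }

    // Handling of one FLUSH_COMPLETED reply. The completion check runs only here
    // (assumption: "match" is modelled as containsAll).
    void onFlushCompleted(String sender) {
        flushCompleted.add(sender);
        if (flushMembers != null && flushCompleted.containsAll(flushMembers)) {
            flushDone = true;
        }
    }

    // Replays one message ordering for a leaving master "A" with members "B" and "C".
    static boolean simulate(boolean startFlushFirst) {
        FlushRaceSketch master = new FlushRaceSketch();
        List<String> participants = Arrays.asList("B", "C"); // the master excludes itself
        if (startFlushFirst) {
            master.onStartFlush(participants);
        }
        master.onFlushCompleted("B");
        master.onFlushCompleted("C");
        if (!startFlushFirst) {
            master.onStartFlush(participants);               // too late: nothing re-runs the check
        }
        return master.flushDone;
    }

    public static void main(String[] args) {
        System.out.println("START_FLUSH first:     " + simulate(true));
        System.out.println("FLUSH_COMPLETED first: " + simulate(false));
    }
}
```

      In this model, simulate(true) completes the flush, while simulate(false)
      leaves it pending because no further reply arrives to re-run the check.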

      I believe this is only triggered when the master is leaving,
      because the master does not include itself in the FLUSH. If it were a
      flush member, its own FLUSH_COMPLETED reply would eventually
      trigger the check after flushMembers had been set.
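
      The reasoning above can be illustrated with the same kind of toy model.
      Again, this is a hypothetical sketch, not JGroups code. It assumes that a
      node which is itself a flush member produces its own FLUSH_COMPLETED only
      after it has handled START_FLUSH locally (modelled here as a direct
      call), and that replies received early are kept and re-checked.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical model, not JGroups code: compares a master that is a flush
// member with one that is not, when START_FLUSH is handled after the replies.
class SelfMemberFlushSketch {
    private Set<String> flushMembers;                        // null until START_FLUSH is handled
    private final Set<String> flushCompleted = new HashSet<>();
    private boolean flushDone;

    // Handling START_FLUSH sets flushMembers. If this node is a member, it then
    // contributes its own FLUSH_COMPLETED (simplified to a direct call).
    void onStartFlush(String self, List<String> participants) {
        flushMembers = new HashSet<>(participants);
        if (flushMembers.contains(self)) {
            onFlushCompleted(self);
        }
    }

    // Each reply is stored and the completion check is re-run.
    void onFlushCompleted(String sender) {
        flushCompleted.add(sender);
        if (flushMembers != null && flushCompleted.containsAll(flushMembers)) {
            flushDone = true;
        }
    }

    // Replies from "B" and "C" arrive before master "A" handles START_FLUSH.
    static boolean simulateLateStartFlush(boolean masterIsMember) {
        SelfMemberFlushSketch master = new SelfMemberFlushSketch();
        List<String> participants = masterIsMember
                ? Arrays.asList("A", "B", "C")
                : Arrays.asList("B", "C");
        master.onFlushCompleted("B");
        master.onFlushCompleted("C");
        master.onStartFlush("A", participants);
        return master.flushDone;
    }
}
```

      Under these assumptions, simulateLateStartFlush(true) completes because
      the master's own reply re-runs the check once flushMembers is set, while
      simulateLateStartFlush(false) stays pending, matching the leaving-master case.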

    People

    • Assignee: Vladimir Blagojevic (vblagojevic)
    • Reporter: Dennis Reed (dereed)