  1. JGroups
  2. JGRP-756

FLUSH still needs work

    • Type: Bug
    • Resolution: Done
    • Priority: Major
    • 2.6.3, 2.7
    • None
    • None

      [Michael Newcomb]

      Still debugging concurrent starting issues... Now I'm running into a
      problem with FLUSH.

      So, there are 3 current members (A, B, C) and a new one joins (D)...

      1. coord starts a flush on A,B,C
      2. coord receives FLUSH_COMPLETED from A,B (misses C)
      3. coord times out and sleeps a few seconds
      4. coord starts a new flush on A,B,C

      Here is where the problems start. A and B (and possibly C) are
      already flushing: as far as they are concerned, a flush is in progress
      because they have already sent FLUSH_COMPLETED to the coord.

      So, when they get a new flush, they determine who they are going to
      reject (either the currently flushing coordinator or the flush
      requestor).

      If the flush requestor is < the current flush coordinator, then a
      reject flush is sent to the original flush coordinator and the flush
      proceeds with the flush requestor.

      If the flush requestor is > the current flush coordinator, then a
      reject flush is sent to the flush requestor and the flush proceeds
      with the original flush coordinator.

      If the flush requestor is == the current flush coordinator, it behaves
      the same as if the flush requestor were > the flush coordinator. A
      reject flush is sent to the current coordinator and then a
      FLUSH_COMPLETED is sent to him...
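
      The three cases above can be sketched as follows. This is only a
      model of the rule as described in this report, not the actual FLUSH
      code: addresses are modeled as plain strings, and the class and
      method names (FlushDecision, Outcome, decide) are hypothetical. The
      field names borrow abortFlushCoordinator and proceedFlushCoordinator
      from the description.

```java
/** Hypothetical model of the reject rule described above; not the real FLUSH code. */
final class FlushDecision {

    /** Who receives the reject flush and who the flush proceeds with. */
    static final class Outcome {
        final String abortFlushCoordinator;
        final String proceedFlushCoordinator;

        Outcome(String abortFlushCoordinator, String proceedFlushCoordinator) {
            this.abortFlushCoordinator = abortFlushCoordinator;
            this.proceedFlushCoordinator = proceedFlushCoordinator;
        }

        /** True when the node rejects the very coordinator it proceeds with. */
        boolean rejectsOwnCoordinator() {
            return abortFlushCoordinator.equals(proceedFlushCoordinator);
        }
    }

    /** Addresses are plain strings here (an assumption made for illustration). */
    static Outcome decide(String currentCoordinator, String requestor) {
        if (requestor.compareTo(currentCoordinator) < 0) {
            // requestor < coordinator: reject the original coordinator
            return new Outcome(currentCoordinator, requestor);
        }
        // requestor > coordinator and requestor == coordinator share this branch
        return new Outcome(requestor, currentCoordinator);
    }
}
```

      With decide("A", "A"), both fields end up as "A": the coordinator is
      rejected and proceeded with at the same time, which is the case this
      report is about.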

      The problem is that the FLUSH_COMPLETED is basically ignored because
      the reject flush sets the promise to FALSE, which immediately fails
      the flush. This causes another flush retry, which ends the same way,
      again and again, until all the retries are exhausted and the overall
      flush fails. Furthermore, the node that rejected the flush is left in
      the exact same state: he thinks he is in a flush and will reject any
      new flush requests from the current flush coordinator!

      Essentially, retrying flushes is a waste of time...
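
      A small simulation of why the retries cannot succeed. It is a
      simplified sketch, not the real protocol: the Member class, its
      flushCoordinator field and countFailedAttempts are hypothetical, and
      a boolean return value stands in for the promise being set to FALSE
      by the reject flush.

```java
/** Hypothetical simulation of the retry loop described above. */
final class FlushRetrySimulation {

    /** A member that remembers which coordinator it is flushing for. */
    static final class Member {
        String flushCoordinator; // null means no flush is in progress

        Member(String flushCoordinator) {
            this.flushCoordinator = flushCoordinator;
        }

        /** Returns false when the requestor is rejected (promise set to FALSE). */
        boolean onStartFlush(String requestor) {
            if (flushCoordinator == null) {
                flushCoordinator = requestor; // first flush: accept
                return true;
            }
            if (requestor.compareTo(flushCoordinator) < 0) {
                flushCoordinator = requestor; // lower requestor takes over
                return true;
            }
            // requestor >= current coordinator: reject, state stays unchanged
            return false;
        }
    }

    /** Counts how many of maxRetries attempts are rejected before one succeeds. */
    static int countFailedAttempts(Member member, String coordinator, int maxRetries) {
        int failures = 0;
        for (int attempt = 0; attempt < maxRetries; attempt++) {
            if (member.onStartFlush(coordinator)) {
                break; // flush accepted, stop retrying
            }
            failures++;
        }
        return failures;
    }
}
```

      A member created with new Member("A") rejects every retry from "A",
      because nothing in its state changes between attempts.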

      I think that there are several ways to solve this problem.

      Since the flush is 'restarted' (onStartFlush is called after the reject
      is sent) even when the flush requestor == the current flush coordinator,
      there may be no need to reject the flush when the flush requestor == the
      current flush coordinator. Only send a reject flush if the
      abortFlushCoordinator != proceedFlushCoordinator...
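
      A sketch of that first proposal, under assumptions: the method name
      handleConcurrentStartFlush and the "REJECT_FLUSH" label are
      hypothetical, addresses are plain strings, and the returned list
      merely records which messages would be sent. It also assumes that
      restarting the flush ends with a FLUSH_COMPLETED to the coordinator
      being proceeded with.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of the proposed guard; not the real FLUSH code. */
final class GuardedReject {

    /** Returns the messages a node would send on a second START_FLUSH. */
    static List<String> handleConcurrentStartFlush(String currentCoordinator,
                                                   String requestor) {
        String abortFlushCoordinator;
        String proceedFlushCoordinator;
        if (requestor.compareTo(currentCoordinator) < 0) {
            abortFlushCoordinator = currentCoordinator;
            proceedFlushCoordinator = requestor;
        } else {
            abortFlushCoordinator = requestor;
            proceedFlushCoordinator = currentCoordinator;
        }

        List<String> sent = new ArrayList<>();
        // Proposed guard: never reject the coordinator we are proceeding with.
        if (!abortFlushCoordinator.equals(proceedFlushCoordinator)) {
            sent.add("REJECT_FLUSH -> " + abortFlushCoordinator);
        }
        // The flush is restarted either way, so FLUSH_COMPLETED still goes out.
        sent.add("FLUSH_COMPLETED -> " + proceedFlushCoordinator);
        return sent;
    }
}
```

      With the guard in place, a retry by the same coordinator only
      produces a FLUSH_COMPLETED, so the promise is no longer failed by a
      reject.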

      If that is not sufficient, then when the flush requestor == the current
      flush coordinator, the node that rejects a flush should not 'restart'
      the flush by calling onStartFlush again (only call onStartFlush if
      abortFlushCoordinator != proceedFlushCoordinator). Restarting basically
      sets the next flush attempt up for failure again and again because
      nothing has changed at the node: he still thinks a flush is ongoing and
      will reject any new flushes from the current flush coordinator.

      Again, these cases are for when the flush requestor is == the current
      flush coordinator. I have yet to test concurrent flush attempts by
      different nodes.

            vblagoje Vladimir Blagojevic (Inactive)
            rhn-engineering-bban Bela Ban
