A-MQ Broker / ENTMQBR-882

Standby slave does not announce replication to master when primary slave is down

    Details

    • Type: Bug
    • Status: Done
    • Priority: Critical
    • Resolution: Done
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: high-availability
    • Environment:

      A-MQ 7.0.2, tested on RHEL 6 / 7 with Java 1.8

      • Cluster configuration: JGroups with TCP Unicast
      • Replicated HA
    • Target Release:
    • Sprint:
      AMQ Broker 1836, AMQ Broker 1839
    • Steps to Reproduce:

      Configurations attached.

      1. Start master node
      2. Start slave node and wait for it to become replication partner for master
      3. Start standby node
      4. Produce messages to master node
      5. Kill slave node
      (standby does not announce as backup and remains in the backup activation loop)
      6. Kill master node
      (standby still waiting to become backup)

    • Affects:
      Documentation (Ref Guide, User Guide, etc.), Release Notes
    • Release Notes Text:
      This issue occurs when multiple backup brokers, also referred to as slaves, are serving a single live (master) broker. If the primary backup broker fails, the secondary backup tries to replicate. That operation fails, so the secondary backup cannot take over for the primary backup and high availability is lost.
    • Release Notes Docs Status:
      Documented as Known Issue

      Description

      When testing failover in a scenario with 1 master and 2 slaves, the example scenario in which the master is killed first works correctly - the primary backup becomes the master and the secondary backup becomes the replication node.

      If, however, the primary backup is killed first, the secondary backup remains stopped and does not announce as the replication slave. Instead it continues to log:

      13:31:44,373 WARN  [org.apache.activemq.artemis.core.server] AMQ222040: Server is stopped
      

      When the master is brought down, the secondary slave remains stopped.

      Thread dumps of the secondary backup for this scenario, taken when the primary is killed, suggest that the secondary is stuck looping in NamedLiveNodeLocatorForReplication::locateNode(...).

      "AMQ119000: Activation for server ActiveMQServerImpl::serverUUID=null" #18 prio=5 os_prio=0 tid=0x00007f1920803800 nid=0x642b waiting on condition [0x00007f19028e8000]
         java.lang.Thread.State: WAITING (parking)
              at sun.misc.Unsafe.park(Native Method)
              - parking to wait for  <0x00000000c04b7170> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
              at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
              at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
              at org.apache.activemq.artemis.core.server.impl.NamedLiveNodeLocatorForReplication.locateNode(NamedLiveNodeLocatorForReplication.java:67)
              at org.apache.activemq.artemis.core.server.impl.NamedLiveNodeLocatorForReplication.locateNode(NamedLiveNodeLocatorForReplication.java:54)
              at org.apache.activemq.artemis.core.server.impl.SharedNothingBackupActivation.run(SharedNothingBackupActivation.java:195)
              at org.apache.activemq.artemis.core.server.impl.ActiveMQServerImpl$ActivationThread.run(ActiveMQServerImpl.java:2793)
       
         Locked ownable synchronizers:
              - None
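
      The WAITING (parking) state in the dump is what an untimed Condition.await() produces when no other thread ever signals the condition. The following is a minimal sketch of that pattern only. It is not the Artemis implementation: the class ParkedLocatorSketch, its fields, and the nodeUp/startActivation methods are hypothetical names chosen for illustration.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

/** Hypothetical stand-in for a live-node locator; NOT the Artemis class. */
public class ParkedLocatorSketch {
    private final ReentrantLock lock = new ReentrantLock();
    private final Condition nodeFound = lock.newCondition();
    private String liveNodeId; // guarded by lock; null until a live node is announced

    /** Blocks with an untimed await() until a live node has been announced. */
    public String locateNode() throws InterruptedException {
        lock.lock();
        try {
            while (liveNodeId == null) {
                nodeFound.await(); // parks indefinitely if nobody ever signals
            }
            return liveNodeId;
        } finally {
            lock.unlock();
        }
    }

    /** Announces a live node and wakes any thread waiting in locateNode(). */
    public void nodeUp(String nodeId) {
        lock.lock();
        try {
            liveNodeId = nodeId;
            nodeFound.signalAll();
        } finally {
            lock.unlock();
        }
    }

    /** Starts a thread that calls locateNode() and waits (up to ~5 s) for it to park. */
    public Thread startActivation() throws InterruptedException {
        Thread t = new Thread(() -> {
            try {
                locateNode();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "activation-sketch");
        t.setDaemon(true); // do not keep the JVM alive if the thread stays parked
        t.start();
        long deadline = System.nanoTime() + TimeUnit.SECONDS.toNanos(5);
        while (t.getState() != Thread.State.WAITING && System.nanoTime() < deadline) {
            Thread.sleep(10);
        }
        return t;
    }

    public static void main(String[] args) throws InterruptedException {
        ParkedLocatorSketch locator = new ParkedLocatorSketch();
        Thread activation = locator.startActivation();
        System.out.println("state before announcement: " + activation.getState());
        locator.nodeUp("live-1"); // without this call the thread would stay parked
        activation.join(5000);
        System.out.println("state after announcement: " + activation.getState());
    }
}
```

      In this sketch the waiting thread only makes progress once nodeUp(...) is called. If the event that should trigger the signal never reaches the waiter, the thread stays parked, which is consistent with the behaviour reported for the secondary backup above.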
      

      Expected behavior: if multiple slaves are configured for a master, the nth slave should become the active slave when the current slave(s) are offline.

      Upstream, this is tracked as ARTEMIS-2075: https://issues.apache.org/jira/browse/ARTEMIS-2075

                People

                • Assignee:
                  ataylor Andy Taylor
                  Reporter:
                  hawkinsds Duane Hawkins
                  Tester:
                  Roman Vais
                • Votes:
                  0
                  Watchers:
                  7

                    Time Tracking

                    Estimated:
                    2d
                    Remaining:
                    2d
                    Logged:
                    Not Specified