Uploaded image for project: 'AMQ Broker'
  1. AMQ Broker
  2. ENTMQBR-1466

[3 HA pairs] Slave does not become live after master is killed and isolates itself

    XMLWordPrintable

Details

    • Release Notes
    • Hide
      The quorum voting protocol has been corrected. Previously, in high-availability configurations consisting of three high-availability pairs with replication, this issue occasionally prevented slave brokers from taking over during a failover event. Instead, the slave broker would become isolated from the broker cluster.
      Show
      The quorum voting protocol has been corrected. Previously, in high-availability configurations consisting of three high-availability pairs with replication, this issue occasionally prevented slave brokers from taking over during a failover event. Instead, the slave broker would become isolated from the broker cluster.
    • Documented as Resolved Issue

    Description

      I have a 3 pairs (6 VMs) of HA replication.

      Such slave can't be stopped via "artemis-service stop", it has to be killed.
      Also, the rest of HA loses track of this broker in topology.

      Testing/reproducing scenario:

      1. Make sure HA is formed properly (3 HA pairs)
      2. Loop over master brokers and do following
        1. Kill -9 master broker
        2. Make sure slave becomes live
        3. Resurrect master broker
        4. Make sure master becomes live again, slave is backup again

      Usually, this work OK for first or second execution of such test.
      But there is a problem, when I execute it (f.e.) third time.
      Master is killed, slave does not become live.
      After master resurrection, slave becomes isolated from whole HA.

      It might be connected to Quorum voting process (this seems like split-brain problem?).
      Also, I am not sending ANY messages.

      This time, this happened the very first try on the very first broker.

      Note: After master is restarted, the topology is completely off and some of the brokers are using different topology than others.

      Test log

      [10:19:54] [INFO] tests.JAMQHA000Tests :: master:amq-10-37-145-204, instance:/home/jamq/ha-replication, slave:amq-10-37-145-214
      [10:19:54] [INFO] tests.JAMQHA000Tests :: master:amq-10-37-145-202, instance:/home/jamq/ha-replication, slave:amq-10-37-145-205
      [10:19:54] [INFO] tests.JAMQHA000Tests :: master:amq-10-37-145-203, instance:/home/jamq/ha-replication, slave:amq-10-37-145-223
      [10:19:54] [INFO] tests.JAMQHA000Tests :: slave:amq-10-37-145-214, instance:/home/jamq/ha-replication, master:amq-10-37-145-204
      [10:19:54] [INFO] tests.JAMQHA000Tests :: slave:amq-10-37-145-205, instance:/home/jamq/ha-replication, master:amq-10-37-145-202
      [10:19:54] [INFO] tests.JAMQHA000Tests :: slave:amq-10-37-145-223, instance:/home/jamq/ha-replication, master:amq-10-37-145-203
      [10:19:55] [INFO] tests.JAMQHA000Tests :: Running brokers: 
      amq-10-37-145-223 pid:22065
      amq-10-37-145-205 pid:16071
      amq-10-37-145-204 pid:11722
      amq-10-37-145-214 pid:12820
      amq-10-37-145-203 pid:3704
      amq-10-37-145-202 pid:18809
      [10:19:55] [PASS] dtestlib.Test :: Master is live # result:True (exp. True), dur.:-1.00 
      [10:19:55] [PASS] dtestlib.Test :: Slave is backup # result:True (exp. True), dur.:-1.00 
      [10:19:55] [WARNING] dtestlib.sut.ExternalBrokerOperable :: kill -s 9 11722
      [10:19:55] [PASS] dtestlib.Test :: [root@10.37.145.204]$ kill -s 9 11722 # ecode:0 (exp. True), dur.:0.07 
      [10:19:55] [INFO] tests.JAMQHA000Tests :: Slave should be master
      [10:19:55] [INFO] dtestlib.sut.ExternalBrokerClusterHA :: Wait for broker 'amq-10-37-145-214' to become live.
      [10:19:58] [FAIL] dtestlib.Test :: Connection count compare (exp vs act) # result:0 (exp. 1), dur.:-1.00 err_cnt:1
      [10:19:58] [INFO] dtestlib :: retry: round=1/15, elapsed=3.05/60, result=False
      [10:20:03] [FAIL] dtestlib.Test :: Connection count compare (exp vs act) # result:0 (exp. 1), dur.:-1.00 err_cnt:1
      [10:20:03] [INFO] dtestlib :: retry: round=2/15, elapsed=7.27/60, result=False
      [10:20:08] [FAIL] dtestlib.Test :: Connection count compare (exp vs act) # result:0 (exp. 1), dur.:-1.00 err_cnt:1
      [10:20:08] [INFO] dtestlib :: retry: round=3/15, elapsed=12.30/60, result=False
      [10:20:11] [FAIL] dtestlib.Test :: Connection count compare (exp vs act) # result:0 (exp. 1), dur.:-1.00 err_cnt:1
      [10:20:11] [INFO] dtestlib :: retry: round=4/15, elapsed=15.78/60, result=False
      [10:20:15] [FAIL] dtestlib.Test :: Connection count compare (exp vs act) # result:0 (exp. 1), dur.:-1.00 err_cnt
      .. FAILED
      

      Slave log (214)

      2018-04-25 10:14:06,236 INFO  [io.hawt.web.JolokiaConfiguredAgentServlet] Jolokia overridden property: [key=policyLocation, value=file:/home/jamq/ha-replication/etc/jolokia-access.xml]
      2018-04-25 10:14:06,283 INFO  [io.hawt.web.RBACMBeanInvoker] Using MBean [hawtio:type=security,area=jmx,rank=0,name=HawtioDummyJMXSecurity] for role based access control
      2018-04-25 10:14:06,540 INFO  [io.hawt.system.ProxyWhitelist] Initial proxy whitelist: [localhost, 127.0.0.1, 10.37.145.214, dhcp-145-214.lab.eng.brq.redhat.com]
      2018-04-25 10:14:06,997 INFO  [org.apache.activemq.artemis] AMQ241001: HTTP Server started at http://0.0.0.0:8161
      2018-04-25 10:14:06,997 INFO  [org.apache.activemq.artemis] AMQ241002: Artemis Jolokia REST API available at http://0.0.0.0:8161/console/jolokia
      2018-04-25 10:14:06,997 INFO  [org.apache.activemq.artemis] AMQ241004: Artemis Console available at http://0.0.0.0:8161/console
      2018-04-25 10:14:10,598 INFO  [org.apache.activemq.artemis.core.server] AMQ221024: Backup server ActiveMQServerImpl::serverUUID=974aeba4-4860-11e8-986b-5254007f6e6f is synchronized with live-server.
      2018-04-25 10:14:14,544 INFO  [org.apache.activemq.artemis.core.server] AMQ221031: backup announced
      2018-04-25 10:15:09,741 INFO  [io.hawt.web.LoginServlet] hawtio login is using 1800 sec. HttpSession timeout
      
      
      >> killed master broker (204)
      [10:19:55] [PASS] dtestlib.Test :: [root@10.37.145.204]$ kill -s 9 11722 # ecode:0 (exp. True), dur.:0.07 
      
      
      2018-04-25 10:19:55,869 INFO  [org.apache.activemq.artemis.core.server] AMQ221066: Initiating quorum vote: LiveFailoverQuorumVote
      2018-04-25 10:19:55,872 INFO  [org.apache.activemq.artemis.core.server] AMQ221067: Waiting 30 seconds for quorum vote results.
      2018-04-25 10:19:55,904 INFO  [org.apache.activemq.artemis.core.server] AMQ221060: Sending quorum vote request to 10.37.145.202/10.37.145.202:61616: ServerConnectVote [nodeId=974aeba4-4860-11e8-986b-5254007f6e6f, vote=false]
      2018-04-25 10:19:55,924 INFO  [org.apache.activemq.artemis.core.server] AMQ221061: Received quorum vote response from 10.37.145.202/10.37.145.202:61616: ServerConnectVote [nodeId=974aeba4-4860-11e8-986b-5254007f6e6f, vote=true]
      2018-04-25 10:20:25,874 INFO  [org.apache.activemq.artemis.core.server] AMQ221069: Timeout waiting for quorum vote responses.
      2018-04-25 10:20:35,875 WARN  [org.apache.activemq.artemis.core.client] AMQ212041: Timed out waiting for netty channel to close
      2018-04-25 10:21:14,566 WARN  [org.apache.activemq.artemis.core.client] AMQ212037: Connection failure has been detected: AMQ119011: Did not receive data from server for org.apache.activemq.artemis.core.remoting.impl.netty.NettyConnection@6eebd7a8[ID=08040475, local= /10.37.145.214:59586, remote=10.37.145.202/10.37.145.202:61616] [code=CONNECTION_TIMEDOUT]
      

      Reproducer:
      Use ansible to deploy HA replication on 6 VMs.

      git clone https://github.com/msgqe/ansible-broker-clusters
      ansible-playbook roles/ha-replication/provision.yml -i roles/ha-replication/inventory/deployment.yml -u root
      
      (you will need compiled JMS client - test is using connector client to figure out, if broker is up or not)
      
      Execute tests from dtests
      --dschema DSNJAMQExternalClusterHA --test-node 10.37.145.202 --test-node 10.37.145.203 --test-node 10.37.145.204 --test-node 10.37.145.205 --test-node 10.37.145.214 --test-node 10.37.145.223 --set-option=/test-node-dtest-home-refresh-ena=False --load-json-settings /home/mtoth/work/repos/dtests/dtests/dtestlib/sut/example_6brokerHA_config.json --exec JAMQHA111Tests/test_kill_start_master_failback
      

      Attachments

        1. master_202.log
          1.36 MB
        2. master_202.png
          master_202.png
          22 kB
        3. master_203.log
          1.33 MB
        4. master_203.png
          master_203.png
          23 kB
        5. master_204_killed.log
          9 kB
        6. master_204.png
          master_204.png
          19 kB
        7. slave_205.png
          slave_205.png
          21 kB
        8. slave_214.log
          6 kB
        9. slave_214.png
          slave_214.png
          25 kB
        10. slave_223.png
          slave_223.png
          20 kB

        Issue Links

          Activity

            People

              rh-ee-ataylor Andy Taylor
              mtoth@redhat.com Michal Toth
              Michal Toth Michal Toth
              Votes:
              0 Vote for this issue
              Watchers:
              5 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: