Type: Bug
Status: Done
Priority: Blocker
Resolution: Done
Affects Version/s: None
Fix Version/s: None
Component/s: high-availability
Labels:
Environment: artemis 2.5.0
Target Release:
Affects: Release Notes
Release Notes Text:
Release Notes Docs Status: Documented as Resolved Issue
I have 3 pairs (6 VMs) in an HA replication setup.
A slave that gets into the state described below can't be stopped via "artemis-service stop"; it has to be killed.
Also, the rest of the HA cluster loses track of this broker in the topology.
Testing/reproducing scenario:
- Make sure HA is formed properly (3 HA pairs)
- Loop over the master brokers and do the following for each:
  - Kill -9 the master broker
  - Make sure the slave becomes live
  - Resurrect the master broker
  - Make sure the master becomes live again and the slave is backup again
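The scenario above can be sketched roughly as follows. This is only an illustration of the loop, not the dtests code: `FakePair`, `failback_cycle` and `run_scenario` are hypothetical names, and the in-memory pair always fails over and fails back immediately, whereas a real test would have to poll the brokers until the expected state is reached.

```python
from dataclasses import dataclass, field


@dataclass
class FakePair:
    """In-memory stand-in for one master/slave pair (hypothetical, illustration only)."""
    master: str
    slave: str
    live: str = ""
    running: set = field(default_factory=set)

    def __post_init__(self):
        # A properly formed pair: both brokers run, the master is live.
        self.live = self.master
        self.running = {self.master, self.slave}

    def kill(self, broker):
        self.running.discard(broker)
        if broker == self.master and self.slave in self.running:
            self.live = self.slave  # expected failover

    def start(self, broker):
        self.running.add(broker)
        if broker == self.master:
            self.live = self.master  # expected failback

    def is_live(self, broker):
        return broker in self.running and self.live == broker

    def is_backup(self, broker):
        return broker in self.running and self.live != broker


def failback_cycle(pair):
    """Run one kill/failover/restart/failback cycle; return the failed step or None."""
    pair.kill(pair.master)
    if not pair.is_live(pair.slave):
        return "slave did not become live"
    pair.start(pair.master)
    if not pair.is_live(pair.master):
        return "master did not become live again"
    if not pair.is_backup(pair.slave):
        return "slave did not return to backup"
    return None


def run_scenario(pairs, repeats=3):
    """Repeat the cycle over all pairs and collect (attempt, master, error) tuples."""
    failures = []
    for attempt in range(1, repeats + 1):
        for pair in pairs:
            error = failback_cycle(pair)
            if error:
                failures.append((attempt, pair.master, error))
    return failures
```

Repeating the whole loop several times matters here, because the failure tends to show up only on a later execution.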
Usually, this works OK for the first or second execution of the test.
But there is a problem when I execute it, for example, a third time.
The master is killed, but the slave does not become live.
After the master is resurrected, the slave becomes isolated from the whole HA cluster.
It might be connected to the quorum voting process (this seems like a split-brain problem?).
Also, I am not sending ANY messages.
This time, it happened on the very first try with the very first broker.
Note: After the master is restarted, the topology is completely off and some of the brokers are using a different topology than others.
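One way to spot which brokers hold a diverging topology is to compare each broker's view against the view held by the majority. The sketch below assumes the views have already been collected somehow as sets of (live, backup) pairs; the function name and the input shape are hypothetical, and it does not query a broker itself.

```python
from collections import Counter


def find_topology_outliers(views):
    """Return brokers whose topology view differs from the majority view.

    views maps a broker name to an iterable of (live, backup) pairs as seen
    by that broker (hypothetical input shape). The result maps each outlier
    to the pairs that are present in only one of the two views.
    """
    if not views:
        return {}
    normalized = {name: frozenset(pairs) for name, pairs in views.items()}
    majority, _ = Counter(normalized.values()).most_common(1)[0]
    return {
        name: sorted(view ^ majority, key=repr)
        for name, view in normalized.items()
        if view != majority
    }
```

If no single view is shared by most brokers, the "majority" is just the most common view, so the result should be read as a hint rather than a verdict.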
Test log
[10:19:54] [INFO] tests.JAMQHA000Tests :: master:amq-10-37-145-204, instance:/home/jamq/ha-replication, slave:amq-10-37-145-214
[10:19:54] [INFO] tests.JAMQHA000Tests :: master:amq-10-37-145-202, instance:/home/jamq/ha-replication, slave:amq-10-37-145-205
[10:19:54] [INFO] tests.JAMQHA000Tests :: master:amq-10-37-145-203, instance:/home/jamq/ha-replication, slave:amq-10-37-145-223
[10:19:54] [INFO] tests.JAMQHA000Tests :: slave:amq-10-37-145-214, instance:/home/jamq/ha-replication, master:amq-10-37-145-204
[10:19:54] [INFO] tests.JAMQHA000Tests :: slave:amq-10-37-145-205, instance:/home/jamq/ha-replication, master:amq-10-37-145-202
[10:19:54] [INFO] tests.JAMQHA000Tests :: slave:amq-10-37-145-223, instance:/home/jamq/ha-replication, master:amq-10-37-145-203
[10:19:55] [INFO] tests.JAMQHA000Tests :: Running brokers:
amq-10-37-145-223 pid:22065
amq-10-37-145-205 pid:16071
amq-10-37-145-204 pid:11722
amq-10-37-145-214 pid:12820
amq-10-37-145-203 pid:3704
amq-10-37-145-202 pid:18809
[10:19:55] [PASS] dtestlib.Test :: Master is live # result:True (exp. True), dur.:-1.00
[10:19:55] [PASS] dtestlib.Test :: Slave is backup # result:True (exp. True), dur.:-1.00
[10:19:55] [WARNING] dtestlib.sut.ExternalBrokerOperable :: kill -s 9 11722
[10:19:55] [PASS] dtestlib.Test :: [root@10.37.145.204]$ kill -s 9 11722 # ecode:0 (exp. True), dur.:0.07
[10:19:55] [INFO] tests.JAMQHA000Tests :: Slave should be master
[10:19:55] [INFO] dtestlib.sut.ExternalBrokerClusterHA :: Wait for broker 'amq-10-37-145-214' to become live.
[10:19:58] [FAIL] dtestlib.Test :: Connection count compare (exp vs act) # result:0 (exp. 1), dur.:-1.00 err_cnt:1
[10:19:58] [INFO] dtestlib :: retry: round=1/15, elapsed=3.05/60, result=False
[10:20:03] [FAIL] dtestlib.Test :: Connection count compare (exp vs act) # result:0 (exp. 1), dur.:-1.00 err_cnt:1
[10:20:03] [INFO] dtestlib :: retry: round=2/15, elapsed=7.27/60, result=False
[10:20:08] [FAIL] dtestlib.Test :: Connection count compare (exp vs act) # result:0 (exp. 1), dur.:-1.00 err_cnt:1
[10:20:08] [INFO] dtestlib :: retry: round=3/15, elapsed=12.30/60, result=False
[10:20:11] [FAIL] dtestlib.Test :: Connection count compare (exp vs act) # result:0 (exp. 1), dur.:-1.00 err_cnt:1
[10:20:11] [INFO] dtestlib :: retry: round=4/15, elapsed=15.78/60, result=False
[10:20:15] [FAIL] dtestlib.Test :: Connection count compare (exp vs act) # result:0 (exp. 1), dur.:-1.00 err_cnt
.. FAILED
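The "retry: round=N/15, elapsed=X/60" lines in the test log come from a bounded polling loop that waits for the slave to become live. A minimal sketch of such a loop is shown here; it is not the dtestlib implementation. The limits of 15 rounds and 60 seconds are taken from the log, while the `wait_until` name, the 3 second delay and the injectable `sleep`/`clock` parameters are assumptions.

```python
import time


def wait_until(check, rounds=15, timeout=60.0, delay=3.0,
               sleep=time.sleep, clock=time.monotonic):
    """Poll check() until it returns a truthy value or the limits are exhausted.

    rounds and timeout mirror the values seen in the log; delay is an assumption.
    sleep and clock are injectable so the loop can be exercised without waiting.
    """
    start = clock()
    for round_no in range(1, rounds + 1):
        result = bool(check())
        elapsed = clock() - start
        print(f"retry: round={round_no}/{rounds}, "
              f"elapsed={elapsed:.2f}/{timeout:g}, result={result}")
        if result:
            return True
        if elapsed >= timeout:
            break  # out of time, even if rounds remain
        sleep(delay)
    return False
```

In the failing run every round reports result=False, because the slave never accepts the connection that the check expects.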
Slave log (214)
2018-04-25 10:14:06,236 INFO [io.hawt.web.JolokiaConfiguredAgentServlet] Jolokia overridden property: [key=policyLocation, value=file:/home/jamq/ha-replication/etc/jolokia-access.xml]
2018-04-25 10:14:06,283 INFO [io.hawt.web.RBACMBeanInvoker] Using MBean [hawtio:type=security,area=jmx,rank=0,name=HawtioDummyJMXSecurity] for role based access control
2018-04-25 10:14:06,540 INFO [io.hawt.system.ProxyWhitelist] Initial proxy whitelist: [localhost, 127.0.0.1, 10.37.145.214, dhcp-145-214.lab.eng.brq.redhat.com]
2018-04-25 10:14:06,997 INFO [org.apache.activemq.artemis] AMQ241001: HTTP Server started at http://0.0.0.0:8161
2018-04-25 10:14:06,997 INFO [org.apache.activemq.artemis] AMQ241002: Artemis Jolokia REST API available at http://0.0.0.0:8161/console/jolokia
2018-04-25 10:14:06,997 INFO [org.apache.activemq.artemis] AMQ241004: Artemis Console available at http://0.0.0.0:8161/console
2018-04-25 10:14:10,598 INFO [org.apache.activemq.artemis.core.server] AMQ221024: Backup server ActiveMQServerImpl::serverUUID=974aeba4-4860-11e8-986b-5254007f6e6f is synchronized with live-server.
2018-04-25 10:14:14,544 INFO [org.apache.activemq.artemis.core.server] AMQ221031: backup announced
2018-04-25 10:15:09,741 INFO [io.hawt.web.LoginServlet] hawtio login is using 1800 sec. HttpSession timeout

>> killed master broker (204)
[10:19:55] [PASS] dtestlib.Test :: [root@10.37.145.204]$ kill -s 9 11722 # ecode:0 (exp. True), dur.:0.07

2018-04-25 10:19:55,869 INFO [org.apache.activemq.artemis.core.server] AMQ221066: Initiating quorum vote: LiveFailoverQuorumVote
2018-04-25 10:19:55,872 INFO [org.apache.activemq.artemis.core.server] AMQ221067: Waiting 30 seconds for quorum vote results.
2018-04-25 10:19:55,904 INFO [org.apache.activemq.artemis.core.server] AMQ221060: Sending quorum vote request to 10.37.145.202/10.37.145.202:61616: ServerConnectVote [nodeId=974aeba4-4860-11e8-986b-5254007f6e6f, vote=false]
2018-04-25 10:19:55,924 INFO [org.apache.activemq.artemis.core.server] AMQ221061: Received quorum vote response from 10.37.145.202/10.37.145.202:61616: ServerConnectVote [nodeId=974aeba4-4860-11e8-986b-5254007f6e6f, vote=true]
2018-04-25 10:20:25,874 INFO [org.apache.activemq.artemis.core.server] AMQ221069: Timeout waiting for quorum vote responses.
2018-04-25 10:20:35,875 WARN [org.apache.activemq.artemis.core.client] AMQ212041: Timed out waiting for netty channel to close
2018-04-25 10:21:14,566 WARN [org.apache.activemq.artemis.core.client] AMQ212037: Connection failure has been detected: AMQ119011: Did not receive data from server for org.apache.activemq.artemis.core.remoting.impl.netty.NettyConnection@6eebd7a8[ID=08040475, local= /10.37.145.214:59586, remote=10.37.145.202/10.37.145.202:61616] [code=CONNECTION_TIMEDOUT]
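To make the quorum vote sequence in the slave log easier to read, the relevant lines can be summarized with a small parser. This is a sketch only: the regular expression assumes the line layout seen in this log, the meaning of each AMQ code is taken from the message text printed next to it, and `summarize_quorum` is a hypothetical helper name.

```python
import re
from datetime import datetime

# Line layout assumed from the slave log: "<date> <time>,<ms> LEVEL [logger] AMQnnnnnn: message"
LINE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) "
    r"(?P<level>[A-Z]+) +\[(?P<logger>[^\]]+)\] "
    r"(?P<code>AMQ\d+): (?P<msg>.*)$"
)
VOTE = re.compile(r"vote=(true|false)")

SAMPLE = """\
2018-04-25 10:19:55,869 INFO [org.apache.activemq.artemis.core.server] AMQ221066: Initiating quorum vote: LiveFailoverQuorumVote
2018-04-25 10:19:55,904 INFO [org.apache.activemq.artemis.core.server] AMQ221060: Sending quorum vote request to 10.37.145.202/10.37.145.202:61616: ServerConnectVote [nodeId=974aeba4-4860-11e8-986b-5254007f6e6f, vote=false]
2018-04-25 10:19:55,924 INFO [org.apache.activemq.artemis.core.server] AMQ221061: Received quorum vote response from 10.37.145.202/10.37.145.202:61616: ServerConnectVote [nodeId=974aeba4-4860-11e8-986b-5254007f6e6f, vote=true]
2018-04-25 10:20:25,874 INFO [org.apache.activemq.artemis.core.server] AMQ221069: Timeout waiting for quorum vote responses.
"""


def summarize_quorum(lines):
    """Count vote requests/responses and measure the wait before the timeout."""
    summary = {"requests": 0, "responses": [], "started": None,
               "timed_out": None, "wait_seconds": None}
    for line in lines:
        match = LINE.match(line.strip())
        if not match:
            continue  # not a broker log line (e.g. test harness output)
        ts = datetime.strptime(match["ts"], "%Y-%m-%d %H:%M:%S,%f")
        code = match["code"]
        if code == "AMQ221066":      # "Initiating quorum vote"
            summary["started"] = ts
        elif code == "AMQ221060":    # "Sending quorum vote request"
            summary["requests"] += 1
        elif code == "AMQ221061":    # "Received quorum vote response"
            vote = VOTE.search(match["msg"])
            if vote:
                summary["responses"].append(vote.group(1) == "true")
        elif code == "AMQ221069":    # "Timeout waiting for quorum vote responses"
            summary["timed_out"] = ts
    if summary["started"] and summary["timed_out"]:
        delta = summary["timed_out"] - summary["started"]
        summary["wait_seconds"] = delta.total_seconds()
    return summary


if __name__ == "__main__":
    print(summarize_quorum(SAMPLE.splitlines()))
```

On the excerpt above this reports a single vote request, a single vote=true response, and a wait of about 30 seconds that still ends in the timeout.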
|
Reproducer:
Use ansible to deploy HA replication on 6 VMs.
git clone https://github.com/msgqe/ansible-broker-clusters
ansible-playbook roles/ha-replication/provision.yml -i roles/ha-replication/inventory/deployment.yml -u root

(you will need a compiled JMS client; the test uses a connector client to figure out whether the broker is up or not)

Execute tests from dtests
--dschema DSNJAMQExternalClusterHA --test-node 10.37.145.202 --test-node 10.37.145.203 --test-node 10.37.145.204 --test-node 10.37.145.205 --test-node 10.37.145.214 --test-node 10.37.145.223 --set-option=/test-node-dtest-home-refresh-ena=False --load-json-settings /home/mtoth/work/repos/dtests/dtests/dtestlib/sut/example_6brokerHA_config.json --exec JAMQHA111Tests/test_kill_start_master_failback
|
- relates to: WFWIP-11 [Artemis upgrade] Regression in replicated HA tests (Resolved)
- is blocked by: ARTEMIS-1842