Description
Test scenario:
- Start two nodes in a cluster in a collocated HA topology with a shared journal
- The journal is located on a local disk (not NFS/GFS2)
- Start a producer and send messages to inQueue on node-1; wait for the producer to finish
- Kill node-2 and start it again
- Start a consumer and consume messages from inQueue on node-1
Expected result:
node-2 starts and the consumer receives all messages.
Actual result:
Sometimes node-2 does not start after the kill.
Logs and a thread dump from the node-2 instance that hangs during startup are attached.
Investigation:
It seems that the live Artemis server in node-2 is sometimes unable to acquire the lock on the journal:
"ServerService Thread Pool -- 85" #159 prio=5 os_prio=0 cpu=160.92ms elapsed=235.68s tid=0x00007fe3b82a4000 nid=0x291 waiting on condition [0x00007fe39d5ae000]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
    at java.lang.Thread.sleep(java.base@11.0.2/Native Method)
    at org.apache.activemq.artemis.core.server.impl.FileLockNodeManager.lock(FileLockNodeManager.java:308)
    at org.apache.activemq.artemis.core.server.impl.FileLockNodeManager.startLiveNode(FileLockNodeManager.java:168)
    at org.apache.activemq.artemis.core.server.impl.SharedStoreLiveActivation.run(SharedStoreLiveActivation.java:68)
    at org.apache.activemq.artemis.core.server.impl.ActiveMQServerImpl.internalStart(ActiveMQServerImpl.java:544)
    at org.apache.activemq.artemis.core.server.impl.ActiveMQServerImpl.start(ActiveMQServerImpl.java:481)
    - locked <0x00000000d4ab7640> (a org.apache.activemq.artemis.core.server.impl.ActiveMQServerImpl)
    at org.apache.activemq.artemis.jms.server.impl.JMSServerManagerImpl.start(JMSServerManagerImpl.java:376)
    - locked <0x00000000d4ab7460> (a org.apache.activemq.artemis.jms.server.impl.JMSServerManagerImpl)
    at org.wildfly.extension.messaging.activemq.jms.JMSService.doStart(JMSService.java:206)
    - locked <0x00000000d1706148> (a org.wildfly.extension.messaging.activemq.jms.JMSService)
    at org.wildfly.extension.messaging.activemq.jms.JMSService.access$000(JMSService.java:65)
    at org.wildfly.extension.messaging.activemq.jms.JMSService$1.run(JMSService.java:100)
    at java.util.concurrent.Executors$RunnableAdapter.call(java.base@11.0.2/Executors.java:515)
    at java.util.concurrent.FutureTask.run(java.base@11.0.2/FutureTask.java:264)
    at org.jboss.threads.ContextClassLoaderSavingRunnable.run(ContextClassLoaderSavingRunnable.java:35)
    at org.jboss.threads.EnhancedQueueExecutor.safeRun(EnhancedQueueExecutor.java:1985)
    at org.jboss.threads.EnhancedQueueExecutor$ThreadBody.doRunTask(EnhancedQueueExecutor.java:1487)
    at org.jboss.threads.EnhancedQueueExecutor$ThreadBody.run(EnhancedQueueExecutor.java:1378)
    at java.lang.Thread.run(java.base@11.0.2/Thread.java:834)
    at org.jboss.threads.JBossThread.run(JBossThread.java:485)
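For illustration only: the top frames show the thread sleeping inside FileLockNodeManager.lock, which suggests that the lock attempt is being retried. The sketch below is not the Artemis code. It is a hypothetical standalone probe (the class name LockProbe, the method tryAcquire, and the timeout values are all assumptions) that polls for an exclusive java.nio lock on a file passed as an argument. It might help to check whether something still holds a lock on a file in the journal directory while node-2 hangs.

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;
import java.nio.channels.OverlappingFileLockException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

// Hypothetical diagnostic helper, not part of Artemis or WildFly.
public class LockProbe {

    /**
     * Polls for an exclusive lock on the whole file until it is obtained
     * or the timeout elapses. The lock is released again before returning.
     *
     * @return true if the lock could be obtained within the timeout
     */
    public static boolean tryAcquire(Path file, long timeoutMillis, long pauseMillis)
            throws IOException, InterruptedException {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        try (FileChannel channel = FileChannel.open(file,
                StandardOpenOption.CREATE, StandardOpenOption.WRITE)) {
            while (true) {
                try {
                    // tryLock() returns null when another process holds the lock.
                    FileLock lock = channel.tryLock();
                    if (lock != null) {
                        lock.release();
                        return true;
                    }
                } catch (OverlappingFileLockException e) {
                    // The lock is held by another channel inside this same JVM.
                }
                if (System.nanoTime() - deadline >= 0) {
                    return false;
                }
                Thread.sleep(pauseMillis);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("usage: java LockProbe <path-to-lock-file>");
            return;
        }
        // Assumed values: give up after 5 s, retry every 500 ms.
        boolean acquired = tryAcquire(Paths.get(args[0]), 5_000, 500);
        System.out.println(acquired ? "lock acquired" : "lock still held after timeout");
    }
}
```

The probe locks the whole file, so it reports a conflict with a lock held on any part of it. It also takes the lock briefly itself, so it should only be run against a hung or stopped node, never against a healthy live server.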
Customer impact:
The server does not fully boot, and it cannot return to its original state after a server crash. Manual intervention is required.
Tested on RHEL 7 (JDK 8/11).
Issue Links
- is cloned by ENTMQBR-2389 Sometimes server in collocated HA topology with shared store does not boot after kill and restart (Closed)