  1. JBoss Enterprise Application Platform
  2. JBEAP-16630

Sometimes server in collocated HA topology with shared store does not boot after kill and restart

Details

    • Bug
    • Resolution: Obsolete
    • Critical
    • None
    • 7.2.1.CR1
    • ActiveMQ
    • None

    Description

      Test scenario:

      • start two nodes in a cluster in collocated HA topology with a shared journal
        • the journal is located on a local disk (not NFS/GFS2)
      • start a producer and send messages to inQueue on node-1; wait for the producer to finish
      • kill node-2 and start it again
      • start a consumer and consume messages from inQueue on node-1

      Expected result:
      node-2 starts and the consumer receives all messages

      Actual result:
      Sometimes node-2 does not start after being killed

      Attaching logs and a thread dump from node-2, which hangs during startup.

      Investigation:
      It seems that the Artemis live server in node-2 is sometimes unable to acquire the lock on the journal:

      "ServerService Thread Pool -- 85" #159 prio=5 os_prio=0 cpu=160.92ms elapsed=235.68s tid=0x00007fe3b82a4000 nid=0x291 waiting on condition  [0x00007fe39d5ae000]
         java.lang.Thread.State: TIMED_WAITING (sleeping)
      	at java.lang.Thread.sleep(java.base@11.0.2/Native Method)
      	at org.apache.activemq.artemis.core.server.impl.FileLockNodeManager.lock(FileLockNodeManager.java:308)
      	at org.apache.activemq.artemis.core.server.impl.FileLockNodeManager.startLiveNode(FileLockNodeManager.java:168)
      	at org.apache.activemq.artemis.core.server.impl.SharedStoreLiveActivation.run(SharedStoreLiveActivation.java:68)
      	at org.apache.activemq.artemis.core.server.impl.ActiveMQServerImpl.internalStart(ActiveMQServerImpl.java:544)
      	at org.apache.activemq.artemis.core.server.impl.ActiveMQServerImpl.start(ActiveMQServerImpl.java:481)
      	- locked <0x00000000d4ab7640> (a org.apache.activemq.artemis.core.server.impl.ActiveMQServerImpl)
      	at org.apache.activemq.artemis.jms.server.impl.JMSServerManagerImpl.start(JMSServerManagerImpl.java:376)
      	- locked <0x00000000d4ab7460> (a org.apache.activemq.artemis.jms.server.impl.JMSServerManagerImpl)
      	at org.wildfly.extension.messaging.activemq.jms.JMSService.doStart(JMSService.java:206)
      	- locked <0x00000000d1706148> (a org.wildfly.extension.messaging.activemq.jms.JMSService)
      	at org.wildfly.extension.messaging.activemq.jms.JMSService.access$000(JMSService.java:65)
      	at org.wildfly.extension.messaging.activemq.jms.JMSService$1.run(JMSService.java:100)
      	at java.util.concurrent.Executors$RunnableAdapter.call(java.base@11.0.2/Executors.java:515)
      	at java.util.concurrent.FutureTask.run(java.base@11.0.2/FutureTask.java:264)
      	at org.jboss.threads.ContextClassLoaderSavingRunnable.run(ContextClassLoaderSavingRunnable.java:35)
      	at org.jboss.threads.EnhancedQueueExecutor.safeRun(EnhancedQueueExecutor.java:1985)
      	at org.jboss.threads.EnhancedQueueExecutor$ThreadBody.doRunTask(EnhancedQueueExecutor.java:1487)
      	at org.jboss.threads.EnhancedQueueExecutor$ThreadBody.run(EnhancedQueueExecutor.java:1378)
      	at java.lang.Thread.run(java.base@11.0.2/Thread.java:834)
      	at org.jboss.threads.JBossThread.run(JBossThread.java:485)
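
      The trace above shows the activation thread sleeping inside FileLockNodeManager.lock, which suggests a poll-and-sleep loop waiting for a file lock. The following is a minimal sketch of that general pattern using only java.nio, not the actual Artemis code. The class LockPoller, the method acquire, and the timeout and poll parameters are hypothetical names introduced for illustration. If the lock is never released by its holder, a loop like this without a deadline would wait indefinitely, which would match the observed hang.

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;
import java.nio.channels.OverlappingFileLockException;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical helper, not part of Artemis: polls for an exclusive file lock.
public class LockPoller {

    /**
     * Tries to lock the whole file until the deadline passes.
     * Returns the held lock, or null on timeout. The caller must release
     * the lock and close lock.channel() when done.
     */
    public static FileLock acquire(Path lockFile, long timeoutMillis, long pollMillis)
            throws IOException, InterruptedException {
        FileChannel channel = FileChannel.open(lockFile,
                StandardOpenOption.CREATE, StandardOpenOption.READ, StandardOpenOption.WRITE);
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        try {
            while (true) {
                FileLock lock = null;
                try {
                    // Non-blocking attempt; returns null if another process holds the lock.
                    lock = channel.tryLock();
                } catch (OverlappingFileLockException e) {
                    // The lock is already held inside this JVM; treat it as "not acquired".
                }
                if (lock != null) {
                    return lock;
                }
                if (System.nanoTime() - deadline >= 0) {
                    channel.close();
                    return null; // gave up instead of waiting forever
                }
                // A thread dump taken here shows TIMED_WAITING (sleeping).
                Thread.sleep(pollMillis);
            }
        } catch (IOException | InterruptedException | RuntimeException e) {
            channel.close();
            throw e;
        }
    }
}
```

      With a bounded wait, a caller can log which file could not be locked and fail the startup explicitly rather than hang.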
      

      Customer impact:
      The server does not fully boot, so the cluster cannot return to its original state after a server crash. Manual intervention is required.

      Tested on RHEL 7 (JDK 8/11).

            People

              rhn-support-jbertram Justin Bertram
              mnovak1@redhat.com Miroslav Novak