Uploaded image for project: 'JBoss Enterprise Application Platform'
  1. JBoss Enterprise Application Platform
  2. JBEAP-25777

Rollback of JMS session after failover/failback can leave JMS session stopped

XMLWordPrintable

    • Icon: Bug Bug
    • Resolution: Unresolved
    • Icon: Critical Critical
    • None
    • 8.0.0.Beta-CR1, 8.0.0.GA-CR1
    • ActiveMQ
    • Known Issue
    • Hide

      Updated:

      Steps to reproduce the issue - NOT 100% reproducer, issue is hard to hit:

      EAP 8 Beta CR1

      git clone git@gitlab.mw.lab.eng.bos.redhat.com:jbossqe-eap/messaging-testsuite.git
      cd messaging-testsuite/scripts/
      git checkout master
      
      groovy -DEAP_ZIP_URL=file:///<path-to-zip>/jboss-eap-8.0.0.Beta-CR1.zip PrepareServers7.groovy
      export WORKSPACE=$PWD
      export JBOSS_HOME_1=$WORKSPACE/server1/jboss-eap
      export JBOSS_HOME_2=$WORKSPACE/server2/jboss-eap
      export JBOSS_HOME_3=$WORKSPACE/server3/jboss-eap
      export JBOSS_HOME_4=$WORKSPACE/server4/jboss-eap
      
      cd ../jboss-hornetq-testsuite/
      mvn clean install -B -Dartemis.version=2.21.0 -Deap7.org.jboss.qa.hornetq.apps.clients.version=8.1669154014-SNAPSHOT  -Dmaven.test.failure.ignore=true  -Dsurefire.failIfNoSpecifiedTests=false  -Dmaven.wagon.http.ssl.insecure=true -Dmaven.wagon.http.ssl.allowall=true  -Dtest=ReplicatedDedicatedFailoverTestCase#testMultipleFailoverWithQueueShutdown
      

      EAP 7.1.0:

      git clone git://git.app.eng.bos.redhat.com/jbossqe/eap-tests-hornetq.git
      cd eap-tests-hornetq/scripts/
      groovy -DEAP_VERSION=7.1.0.CR1 PrepareServers7.groovy
      export WORKSPACE=$PWD
      export JBOSS_HOME_1=$WORKSPACE/server1/jboss-eap
      export JBOSS_HOME_2=$WORKSPACE/server2/jboss-eap
      export JBOSS_HOME_3=$WORKSPACE/server3/jboss-eap
      export JBOSS_HOME_4=$WORKSPACE/server4/jboss-eap
      cd ../jboss-hornetq-testsuite/
      mvn clean test -Dtest=ReplicatedDedicatedFailoverTestCase#testMultipleFailoverWithQueueKill  -DfailIfNoTests=false -Deap=7x   | tee log
      
      Show
      Updated: Steps to reproduce the issue - NOT 100% reproducer, issue is hard to hit : EAP 8 Beta CR1 git clone git@gitlab.mw.lab.eng.bos.redhat.com:jbossqe-eap/messaging-testsuite.git cd messaging-testsuite/scripts/ git checkout master groovy -DEAP_ZIP_URL=file: ///<path-to-zip>/jboss-eap-8.0.0.Beta-CR1.zip PrepareServers7.groovy export WORKSPACE=$PWD export JBOSS_HOME_1=$WORKSPACE/server1/jboss-eap export JBOSS_HOME_2=$WORKSPACE/server2/jboss-eap export JBOSS_HOME_3=$WORKSPACE/server3/jboss-eap export JBOSS_HOME_4=$WORKSPACE/server4/jboss-eap cd ../jboss-hornetq-testsuite/ mvn clean install -B -Dartemis.version=2.21.0 -Deap7.org.jboss.qa.hornetq.apps.clients.version=8.1669154014-SNAPSHOT -Dmaven.test.failure.ignore= true -Dsurefire.failIfNoSpecifiedTests= false -Dmaven.wagon.http.ssl.insecure= true -Dmaven.wagon.http.ssl.allowall= true -Dtest=ReplicatedDedicatedFailoverTestCase#testMultipleFailoverWithQueueShutdown EAP 7.1.0: git clone git: //git.app.eng.bos.redhat.com/jbossqe/eap-tests-hornetq.git cd eap-tests-hornetq/scripts/ groovy -DEAP_VERSION=7.1.0.CR1 PrepareServers7.groovy export WORKSPACE=$PWD export JBOSS_HOME_1=$WORKSPACE/server1/jboss-eap export JBOSS_HOME_2=$WORKSPACE/server2/jboss-eap export JBOSS_HOME_3=$WORKSPACE/server3/jboss-eap export JBOSS_HOME_4=$WORKSPACE/server4/jboss-eap cd ../jboss-hornetq-testsuite/ mvn clean test -Dtest=ReplicatedDedicatedFailoverTestCase#testMultipleFailoverWithQueueKill -DfailIfNoTests= false -Deap=7x | tee log

      There is intermittent issue in rollback of JMS session after failover/failback which can leave JMS session stopped.

      Customer impact: Consumer will have stopped session after failover/failback which prevents receiving of new messages. Customer application will starve for messages even though there are messages in queue.

      Tests scenario:
      Start two EAP 7.1.0.CR1 servers in dedicated HA topology with queue testQueue (live and backup)
      Start 3 producers and 1 consumer on live server sending/receiving messages from testQueue in transacted session.
      50 times kill and start live server which causes clients to 50 times failover and failback
      Stop producers and wait for consumer to receive all messages
      Check that number of send and received is the same. No losses or duplicates

      Result:
      In failed test, after 13 failovers, there was issue during failback where consumer ends with stopped session and thus does not consume any new messages.

      Investigation:
      There is problem that consumer which received large message from backup sends ack to live server which results in javax.jms.IllegalStateException:

      09:46:16,123 Thread-1956 ERROR [org.jboss.qa.hornetq.apps.clients.Receiver11:161] RETRY receive for host: 127.0.0.1, Trying to receive message with count: 23804
      javax.jms.IllegalStateException: AMQ119027: Could not find reference on consumer ID=0, messageId = 49,392,176,683 queue = jms.queue.testQueue0
              at org.apache.activemq.artemis.core.protocol.core.impl.ChannelImpl.sendBlocking(ChannelImpl.java:404)
              at org.apache.activemq.artemis.core.protocol.core.impl.ChannelImpl.sendBlocking(ChannelImpl.java:315)
              at org.apache.activemq.artemis.core.protocol.core.impl.ActiveMQSessionContext.sendACK(ActiveMQSessionContext.java:467)
              at org.apache.activemq.artemis.core.client.impl.ClientSessionImpl.acknowledge(ClientSessionImpl.java:746)
              at org.apache.activemq.artemis.core.client.impl.ClientConsumerImpl.doAck(ClientConsumerImpl.java:1105)
              at org.apache.activemq.artemis.core.client.impl.ClientConsumerImpl.acknowledge(ClientConsumerImpl.java:764)
              at org.apache.activemq.artemis.core.client.impl.ClientMessageImpl.acknowledge(ClientMessageImpl.java:96)
              at org.apache.activemq.artemis.core.client.impl.ClientMessageImpl.acknowledge(ClientMessageImpl.java:38)
              at org.apache.activemq.artemis.jms.client.ActiveMQMessageConsumer.getMessage(ActiveMQMessageConsumer.java:241)
              at org.apache.activemq.artemis.jms.client.ActiveMQMessageConsumer.receive(ActiveMQMessageConsumer.java:132)
              at org.jboss.qa.hornetq.apps.clients.Receiver11.receiveMessage(Receiver11.java:153)
              at org.jboss.qa.hornetq.apps.clients.ReceiverTransAck.run(ReceiverTransAck.java:83)
      Caused by: ActiveMQIllegalStateException[errorType=ILLEGAL_STATE message=AMQ119027: Could not find reference on consumer ID=0, messageId = 49,392,176,683 queue = jms.queue.testQueue0]
              ... 12 more
      

      This is because live server has no knowledge of messages being send to client by backup server before failback. This behaviour is still ok as client logic will simply retry receive of message.
      Exception above causes that JMS transaction in which client consumed messages is marked as "rollback only" and when client calls session.commit() then internal Artemis client logic is actually doing rollback. This however results in another javax.jms.IllegalStateException instead of javax.jms.TransactionRollbackException and leaves session stopped:

      09:46:16,139 Thread-1956 ERROR [org.jboss.qa.hornetq.apps.clients.Receiver11:64]  Receiver - COMMIT FAILED - JMSException thrown during commit: AMQ119027: Could not find reference on consumer ID=0, messageId = 49,392,176,659 queue = jms.
      queue.testQueue0. Receiver for node: 127.0.0.1. Received message - count: 23805, retrying receive
      javax.jms.IllegalStateException: AMQ119027: Could not find reference on consumer ID=0, messageId = 49,392,176,659 queue = jms.queue.testQueue0
              at org.apache.activemq.artemis.core.protocol.core.impl.ChannelImpl.sendBlocking(ChannelImpl.java:404)
              at org.apache.activemq.artemis.core.protocol.core.impl.ChannelImpl.sendBlocking(ChannelImpl.java:315)
              at org.apache.activemq.artemis.core.protocol.core.impl.ActiveMQSessionContext.sendACK(ActiveMQSessionContext.java:467)
              at org.apache.activemq.artemis.core.client.impl.ClientSessionImpl.acknowledge(ClientSessionImpl.java:746)
              at org.apache.activemq.artemis.core.client.impl.ClientConsumerImpl.doAck(ClientConsumerImpl.java:1105)
              at org.apache.activemq.artemis.core.client.impl.ClientConsumerImpl.flushAcks(ClientConsumerImpl.java:789)
              at org.apache.activemq.artemis.core.client.impl.ClientSessionImpl.flushAcks(ClientSessionImpl.java:1646)
              at org.apache.activemq.artemis.core.client.impl.ClientSessionImpl.rollback(ClientSessionImpl.java:564)
              at org.apache.activemq.artemis.core.client.impl.ClientSessionImpl.rollback(ClientSessionImpl.java:538)
              at org.apache.activemq.artemis.core.client.impl.ClientSessionImpl.rollbackOnFailover(ClientSessionImpl.java:472)
              at org.apache.activemq.artemis.core.client.impl.ClientSessionImpl.commit(ClientSessionImpl.java:493)
              at org.apache.activemq.artemis.jms.client.ActiveMQSession.commit(ActiveMQSession.java:218)
              at org.jboss.qa.hornetq.apps.clients.Receiver11.commitSession(Receiver11.java:42)
              at org.jboss.qa.hornetq.apps.clients.ReceiverTransAck.run(ReceiverTransAck.java:103)
      Caused by: ActiveMQIllegalStateException[errorType=ILLEGAL_STATE message=AMQ119027: Could not find reference on consumer ID=0, messageId = 49,392,176,659 queue = jms.queue.testQueue0]
              ... 14 more
      

      The problem is that during rollback, client stops session, sends ack for last received message, sends rollback packet and then starts session again and if exception is thrown for sending ack then it never starts session again - ClientSessionImpl.rollback:

      public void rollback(final boolean isLastMessageAsDelivered, final boolean waitConsumers) {
      ...
          boolean wasStarted = started;
      
            if (wasStarted) {
               stop();
            }
      
            // We need to make sure we don't get any inflight messages
            for (ClientConsumerInternal consumer : cloneConsumers()) {
               consumer.clear(waitConsumers);
            }
      
            // Acks must be flushed here *after connection is stopped and all onmessages finished executing
            flushAcks();
      
            sessionContext.simpleRollback(isLastMessageAsDelivered);
      
            if (wasStarted) {
               start();
            }
      ...
      }
      

      This is not a regression. This code is pretty old, the same is in EAP 6/HornetQ and it seems that work on network layer (in Netty processor) to allow more parallel execution increased chances to hit it.

            ehugonne1@redhat.com Emmanuel Hugonnet
            mnovak1@redhat.com Miroslav Novak
            Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

              Created:
              Updated: