AMQ Broker / ENTMQBR-3275

Regression: Backup doesn't activate after shared store is reconnected


Details

      Previously, if you had a live-backup broker pair configured for high availability using shared store, activation of the backup broker upon shutdown of the live broker could fail. Specifically, this situation occurred if the shared store had previously been disconnected and reconnected, before shutdown of the live broker. This issue is now resolved.
    • Documented as Resolved Issue
    • Verified in a release

      To reproduce this issue, I used a dual-homed broker host. One interface is on the network shared with the master broker (192.168.100.0) and is used for cluster communication. A second interface (10.0.0.0) is used for communication with the NFS server.

      Mount options for the share are as below:

      10.0.0.10:/var/nfs on /opt/nfs type nfs4 (rw,sync,lookupcache=none,actimeo=0,noac,soft,addr=10.0.0.10,clientaddr=10.0.0.11
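
For anyone setting up the same environment, the listing above could come from a mount command roughly like the one below. This is a sketch, not the command the reporter ran: the function name `mount_share` and the `MOUNT_CMD` override are hypothetical, and `addr` and `clientaddr` are left out on the assumption that the NFS mount helper fills them in itself.

```shell
#!/bin/bash
# Sketch only: mount the NFSv4 share with the options listed in the report.
# MOUNT_CMD is a hypothetical override for dry runs, e.g. MOUNT_CMD=echo.
mount_share() {
    local server="10.0.0.10" export_path="/var/nfs" mount_point="/opt/nfs"
    # Lookup and attribute caching are disabled (lookupcache=none, actimeo=0, noac)
    # and the mount is soft, so NFS requests can fail instead of retrying forever.
    local opts="rw,sync,lookupcache=none,actimeo=0,noac,soft"
    "${MOUNT_CMD:-mount}" -t nfs4 -o "$opts" "${server}:${export_path}" "$mount_point"
}
```

Calling `MOUNT_CMD=echo mount_share` prints the arguments without mounting anything. A real mount needs root privileges.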
      

      To trigger the issue, I started both the master and slave brokers and waited for the master to go live and the slave to announce itself as backup. Once both brokers were up, I interrupted the network interface between the slave broker and the NFS server for just over one minute (61 seconds):

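      #!/bin/bash

      # Short delay before the interruption starts
      sleep 2
      echo "Interrupting network"
      # eth1 is the interface facing the NFS server
      ip link set eth1 down
      # Keep the link down for 61 seconds
      sleep 61
      ip link set eth1 up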
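
If the interruption script is killed while the link is down, the interface stays down. A possible variant, sketched below, restores the link whenever the function exits. The function name `interrupt_link` and the `IP_CMD` and `SLEEP_CMD` overrides are hypothetical and exist only so the logic can be dry-run without root.

```shell
#!/bin/bash
# Sketch only: take an interface down for a fixed time and always bring it back.
# IP_CMD and SLEEP_CMD are hypothetical overrides for dry runs.
interrupt_link() (
    # The function body runs in a subshell, so the EXIT trap fires when it ends.
    iface="$1"
    seconds="$2"
    ip_cmd="${IP_CMD:-ip}"
    sleep_cmd="${SLEEP_CMD:-sleep}"
    # Restore the link on any exit from the subshell, normal or not.
    trap '"$ip_cmd" link set "$iface" up' EXIT
    "$ip_cmd" link set "$iface" down
    "$sleep_cmd" "$seconds"
)
```

With the values from the report, the call would be `interrupt_link eth1 61`, run as root on the slave host.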
      
      

      After the script completes, stop the master broker.

      The slave logs the loss of its connection to the master and tries to start, which fails with the stack trace shown in the description.

      I could not reproduce the issue on the non-LTS 7.5.0 release.


    Description

      In a shared-store configuration, if the slave broker loses communication with the NFS service and the connection is restored, a subsequent failover to the slave results in a failed start of the broker with:

      2020-02-19 15:28:32,076 ERROR [org.apache.activemq.artemis.core.server] AMQ224000: Failure in initialisation: java.io.IOException: Input/output error
      	at sun.nio.ch.FileDispatcherImpl.pread0(Native Method) [rt.jar:1.8.0_232]
      	at sun.nio.ch.FileDispatcherImpl.pread(FileDispatcherImpl.java:52) [rt.jar:1.8.0_232]
      	at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:220) [rt.jar:1.8.0_232]
      	at sun.nio.ch.IOUtil.read(IOUtil.java:192) [rt.jar:1.8.0_232]
      	at sun.nio.ch.FileChannelImpl.readInternal(FileChannelImpl.java:735) [rt.jar:1.8.0_232]
      	at sun.nio.ch.FileChannelImpl.read(FileChannelImpl.java:721) [rt.jar:1.8.0_232]
      	at org.apache.activemq.artemis.core.server.impl.FileLockNodeManager.getState(FileLockNodeManager.java:256) [artemis-server-2.9.0.redhat-00009.jar:2.9.0.redhat-00009]
      	at org.apache.activemq.artemis.core.server.impl.FileLockNodeManager.awaitLiveNode(FileLockNodeManager.java:135) [artemis-server-2.9.0.redhat-00009.jar:2.9.0.redhat-00009]
      	at org.apache.activemq.artemis.core.server.impl.SharedStoreBackupActivation.run(SharedStoreBackupActivation.java:77) [artemis-server-2.9.0.redhat-00009.jar:2.9.0.redhat-00009]
      	at org.apache.activemq.artemis.core.server.impl.ActiveMQServerImpl$ActivationThread.run(ActiveMQServerImpl.java:3738) [artemis-server-2.9.0.redhat-00009.jar:2.9.0.redhat-00009]
      

      The journal store is visible on the slave host and restarting the slave broker results in a normal startup in live mode.
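
One way to confirm from a shell that the store is readable again is to read a single byte from a file on the share. The sketch below uses a hypothetical helper, `probe_read`. It only shows that a newly opened file can be read. It says nothing about a file descriptor that the broker process opened before the interruption.

```shell
#!/bin/bash
# Sketch only: try to read the first byte of a file on the shared store.
probe_read() {
    local file="$1"
    # dd exits non-zero if the file cannot be opened or the read fails.
    if dd if="$file" of=/dev/null bs=1 count=1 2>/dev/null; then
        echo "read ok: $file"
    else
        echo "read failed: $file"
        return 1
    fi
}
```

On the slave host this could be pointed at the broker's lock file under `/opt/nfs`. The exact path depends on the journal directory configured for the broker and is not given in this report.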

            People

              dbruscin Domenico Francesco Bruscino
              rhn-support-dhawkins Duane Hawkins
              Tiago Bueno Tiago Bueno