  1. AMQ Broker
  2. ENTMQBR-126

Failover with fabric discovery takes more than 1 minute to reconnect after the master AMQ container is killed when replicated LevelDB is used


Details

    • Bug
    • Resolution: Obsolete
    • Major

      To reproduce:
      0) Modify the karaf script and add:

      -Dzookeeper.timeout=3000
      

      1) Start Fuse.
      2) Run the command:

      fabric:create --wait-for-provisioning

      3) Unpack the attached test reconnect-test.zip and run it with mvn clean install.


    Description

      The client takes more than 1 minute to reconnect. In the log below, the transport fails at 12:35:41 and the successful reconnect is logged at 12:39:33:

          !!!Stopping container0
          !!!!Killing container broker-nossl-c0
          16-12-2014 12:35:41,671 | INFO | [SSHClient] | executeCommand - Command: shell:exec kill -9 6528
          Delay = 27ms
          !!!Received 1 messages, number = 0
          16-12-2014 12:35:41,755 | INFO | [SSHClient] | executeCommand - Response:
          16-12-2014 12:35:41,755 | INFO | [Container] | killContainerAtLocalHost - Result of killing master process :
          16-12-2014 12:35:41,758 | INFO | [SSHClient] | executeCommand - Command: shell:exec kill -9 6528
          16-12-2014 12:35:41,773 | WARN | [FailoverTransport] | handleTransportFailure - Transport (tcp://dhcp-10-40-3-26.brq.redhat.com/10.40.3.26:54789@41760) failed, attempting to automatically reconnect
          java.io.EOFException
                  at java.io.DataInputStream.readInt(DataInputStream.java:392)
                  at org.apache.activemq.openwire.OpenWireFormat.unmarshal(OpenWireFormat.java:258)
                  at org.apache.activemq.transport.tcp.TcpTransport.readCommand(TcpTransport.java:221)
                  at org.apache.activemq.transport.tcp.TcpTransport.doRun(TcpTransport.java:213)
                  at org.apache.activemq.transport.tcp.TcpTransport.run(TcpTransport.java:196)
                  at java.lang.Thread.run(Thread.java:744)
          16-12-2014 12:35:41,777 | WARN | [FailoverTransport] | handleTransportFailure - Transport (tcp://dhcp-10-40-3-26.brq.redhat.com/10.40.3.26:54789@41762) failed, attempting to automatically reconnect
          java.io.EOFException
                  at java.io.DataInputStream.readInt(DataInputStream.java:392)
                  at org.apache.activemq.openwire.OpenWireFormat.unmarshal(OpenWireFormat.java:258)
                  at org.apache.activemq.transport.tcp.TcpTransport.readCommand(TcpTransport.java:221)
                  at org.apache.activemq.transport.tcp.TcpTransport.doRun(TcpTransport.java:213)
                  at org.apache.activemq.transport.tcp.TcpTransport.run(TcpTransport.java:196)
                  at java.lang.Thread.run(Thread.java:744)
          16-12-2014 12:35:41,847 | INFO | [SSHClient] | executeCommand - Response:
          16-12-2014 12:35:41,848 | INFO | [Container] | killContainerAtLocalHost - Result of killing master process :
          !!!!Waiting for empty process broker-nossl-c0
           
          ...logs.... Containers restarted.....
           
          16-12-2014 12:35:57,141 | WARN | [FailoverTransport] | doReconnect - Failed to connect to [tcp://dhcp-10-40-3-26.brq.redhat.com:54789] after: 10 attempt(s) continuing to retry.
          16-12-2014 12:35:57,142 | WARN | [FailoverTransport] | doReconnect - Failed to connect to [tcp://dhcp-10-40-3-26.brq.redhat.com:54789] after: 10 attempt(s) continuing to retry.
           
          ...logs.... Containers restarted.....
           
          16-12-2014 12:39:33,918 | INFO | [FailoverTransport] | doReconnect - Successfully reconnected to tcp://dhcp-10-40-3-26.brq.redhat.com:58553
          16-12-2014 12:39:33,920 | INFO | [FailoverTransport] | doReconnect - Successfully reconnected to tcp://dhcp-10-40-3-26.brq.redhat.com:58553
      

      zookeeper.timeout=3000 is set for both Fuse and the client.
      Three containers with replicated LevelDB are created in the fabric:

      mq-create --no-ssl --parent-profile=mq-replicated --group a broker-nossl
      container-create-child --profile mq-broker-a.broker-nossl root broker-nossl-c0
      container-create-child --profile mq-broker-a.broker-nossl root broker-nossl-c1
      container-create-child --profile mq-broker-a.broker-nossl root broker-nossl-c2
      

      The test runs 3 threads: one restarts the AMQ containers one by one, another sends persistent messages, and the third consumes those messages.
      Both the producer and the consumer thread connect with this URL:

      discovery:(fabric://a?useExponentialBackoff=false&trace=true&nested.maxReconnectAttempts=5&nested.initialReconnectDelay=1000)
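
      For readers who want to vary the options when re-running the test, the URL can be assembled from its parts. This is a minimal sketch, not code from the attached test: the class DiscoveryUrl and its build method are hypothetical, and the option names and values are copied verbatim from the URL above without restating how the fabric discovery agent interprets them.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

public class DiscoveryUrl {
    // Hypothetical helper: wraps a fabric group name and its options
    // in the discovery:(fabric://group?k=v&...) form used in this report.
    static String build(String group, Map<String, String> options) {
        String query = options.entrySet().stream()
                .map(e -> e.getKey() + "=" + e.getValue())
                .collect(Collectors.joining("&"));
        return "discovery:(fabric://" + group + "?" + query + ")";
    }

    // Options exactly as they appear in the report, in the same order.
    static Map<String, String> reportedOptions() {
        Map<String, String> options = new LinkedHashMap<>();
        options.put("useExponentialBackoff", "false");
        options.put("trace", "true");
        options.put("nested.maxReconnectAttempts", "5");
        options.put("nested.initialReconnectDelay", "1000");
        return options;
    }

    public static void main(String[] args) {
        // "a" is the broker group created by mq-create --group a.
        System.out.println(build("a", reportedOptions()));
    }
}
```

      The resulting string would be passed to the client's connection factory as the broker URL. That step needs the ActiveMQ client and the fabric discovery agent on the classpath, so it is left out of this sketch.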
      

      Sometimes, after the test run, messages are left in both the DLQ and the test queue. The messages in the queue cannot be browsed, while those in the DLQ can.
      For example, the last test run left 1 message in ActiveMQ.DLQ and 3 messages in the test queue.
      1 message was not received.

      The message from ActiveMQ.DLQ was as follows:

      ActiveMQTextMessage {commandId = 628, responseRequired = true, messageId = ID:dhcp-4-224.brq.redhat.com-55932-1418827020048-1:1:1:1:624, originalDestination = null, originalTransactionId = null, producerId = ID:dhcp-4-224.brq.redhat.com-55932-1418827020048-1:1:1:1, destination = queue://continuousCommonSessionRestartWithBrokersKillTest, transactionId = null, expiration = 0, timestamp = 1418827248606, arrival = 0, brokerInTime = 1418827248606, brokerOutTime = 1418903486296, correlationId = null, replyTo = null, persistent = true, type = null, priority = 4, groupID = null, groupSequence = 0, targetConsumerId = null, compressed = false, userID = null, content = org.apache.activemq.util.ByteSequence@11f55379, marshalledProperties = org.apache.activemq.util.ByteSequence@13b3625, dataStructure = null, redeliveryCounter = 0, size = 0, properties = {timeProperty=1418827248606}, readOnlyProperties = true, readOnlyBody = true, droppable = false, jmsXGroupFirstForConsumer = false, text = 623_Some text}
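
      The brokerInTime and brokerOutTime fields in this dump are epoch milliseconds, so they are easier to read once converted. The sketch below does only that conversion and subtraction; the class name DlqMessageTimes and the helper timeInBroker are made up for illustration, and the two constants are copied from the dump above.

```java
import java.time.Duration;
import java.time.Instant;

public class DlqMessageTimes {
    // Values copied from the ActiveMQTextMessage dump (epoch milliseconds).
    static final long BROKER_IN_MILLIS = 1418827248606L;
    static final long BROKER_OUT_MILLIS = 1418903486296L;

    // Hypothetical helper: time between brokerInTime and brokerOutTime.
    static Duration timeInBroker(long brokerInMillis, long brokerOutMillis) {
        return Duration.ofMillis(brokerOutMillis - brokerInMillis);
    }

    public static void main(String[] args) {
        System.out.println("brokerInTime  = " + Instant.ofEpochMilli(BROKER_IN_MILLIS));
        System.out.println("brokerOutTime = " + Instant.ofEpochMilli(BROKER_OUT_MILLIS));
        System.out.println("difference    = "
                + timeInBroker(BROKER_IN_MILLIS, BROKER_OUT_MILLIS).toMillis() + " ms");
    }
}
```

      The two values are 76237690 ms apart, which is roughly 21 hours 10 minutes between the message entering the broker (2014-12-17T14:40:48.606Z) and leaving it.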
      

            People

              Assignee: Unassigned
              Reporter: Elena Medvedeva (emedvede, Inactive)
              Votes: 0
              Watchers: 2