AMQ Broker / ENTMQBR-2702

Broker unresponsive when many consumers have delayed and negative acknowledgement on the same address


Details

    • Type: Bug
    • Resolution: Done
    • Priority: Major
    • Fix Version: AMQ 7.4.1.GA
    • Affects Version: AMQ 7.2.2.GA
    • Component: broker-core
    • Labels: None
    • Verified in a release
    • Steps to Reproduce:

      1. Set up a single master-slave replicated pair using AMQ 7.2.2.
      2. Unpack and build the attached coreconsume3.zip. This is a stand-alone Java application that just consumes from a broker on thirty separate threads, and rolls back the JMS Session for every message consumed.
      3. Edit the hostname, credentials, and queue name if necessary, and run the client.
      4. Place a message onto the queue __test_destination on the master using any client. You should see the message redelivered several times, and then the broker log will show the message being sent to the DLQ.
      5. Place 1000 messages onto the queue, each with a header JMSXGroupID=01 (any group ID will do, in fact – there just has to be one).
      6. Watch the client while all the messages are processed and sent to the DLQ.
      7. When the client is idle, repeat the test.

      In general, I have to repeat the test two or three times to elicit the failure. When the master fails, it throws out a thread dump and the error message "Timed out waiting for lock on consumer". The slave may promote itself to master but, in my tests, this doesn't always happen. If it does, the new master remains capable of accepting client connections, although it doesn't seem to do any work on them.
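The consuming side of the steps above can be sketched roughly as follows. This is not the attached coreconsume3 source, just a minimal illustration of its described behaviour (thirty threads, each consuming from the same queue in a transacted session and rolling back every message); the URL, credentials, and queue name are placeholders, and it assumes the Artemis core JMS client (artemis-jms-client) is on the classpath and a broker is running.

```java
import javax.jms.Connection;
import javax.jms.Message;
import javax.jms.MessageConsumer;
import javax.jms.Queue;
import javax.jms.Session;

import org.apache.activemq.artemis.jms.client.ActiveMQConnectionFactory;

public class RollbackConsumer {
    public static void main(String[] args) throws Exception {
        // Placeholder connection details -- edit to match your broker.
        ActiveMQConnectionFactory factory =
                new ActiveMQConnectionFactory("tcp://localhost:61616", "admin", "admin");
        Connection connection = factory.createConnection();
        connection.start();

        for (int i = 0; i < 30; i++) {
            new Thread(() -> {
                try {
                    // Transacted session, so rollback() returns the message
                    // to the broker for redelivery.
                    Session session =
                            connection.createSession(true, Session.SESSION_TRANSACTED);
                    Queue queue = session.createQueue("__test_destination");
                    MessageConsumer consumer = session.createConsumer(queue);
                    while (true) {
                        Message message = consumer.receive();
                        if (message != null) {
                            session.rollback(); // never acknowledge: force redelivery
                        }
                    }
                } catch (Exception e) {
                    e.printStackTrace();
                }
            }).start();
        }
    }
}
```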


    Description

      An AMQ 7 JMS client uses the core JMS client runtime with default connection settings (including pre-fetch). The client creates 30 consumers on the same anycast address. In a failure scenario involving some external system, the client either negatively acknowledges some messages, or delays acknowledgement, perhaps by minutes.

      When this situation arises, the entire broker becomes unresponsive. In a master-slave configuration, it can no longer maintain the master role.

      While an external system is the ultimate cause of the problem, we do not expect the broker to become completely unresponsive.
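For step 5 of the reproduction, the grouped load can be produced with a short JMS snippet like the following. Again this is only an illustration, not the attached reproducer: the URL, credentials, and message bodies are placeholders, and it assumes the same Artemis JMS client and a running broker. The one essential detail is the JMSXGroupID property, which pins all 1000 messages to a single message group.

```java
import javax.jms.Connection;
import javax.jms.MessageProducer;
import javax.jms.Queue;
import javax.jms.Session;
import javax.jms.TextMessage;

import org.apache.activemq.artemis.jms.client.ActiveMQConnectionFactory;

public class GroupedProducer {
    public static void main(String[] args) throws Exception {
        // Placeholder connection details -- edit to match your broker.
        ActiveMQConnectionFactory factory =
                new ActiveMQConnectionFactory("tcp://localhost:61616", "admin", "admin");
        try (Connection connection = factory.createConnection()) {
            Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
            Queue queue = session.createQueue("__test_destination");
            MessageProducer producer = session.createProducer(queue);
            for (int i = 0; i < 1000; i++) {
                TextMessage message = session.createTextMessage("message " + i);
                // Any single group ID will do; there just has to be one.
                message.setStringProperty("JMSXGroupID", "01");
                producer.send(message);
            }
        }
    }
}
```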

      Attachments

        1. artemis_reproduced_twice.log
          930 kB
        2. coreconsume.zip
          5 kB
        3. coreconsume2.zip
          5 kB
        4. coreconsume3.zip
          5 kB
        5. oops.txt
          76 kB
        6. thread.dump.single
          213 kB


          People

            Assignee: Francesco Nigro (fnigro)
            Reporter: Kevin Boone (rhn-support-kboone)
            Votes: 0
            Watchers: 5

            Dates

              Created:
              Updated:
              Resolved: