Affects Version/s: 6.1.1
Fix Version/s: JBoss A-MQ 6.3
Standalone A-MQ ActiveMQ 5.9.0.redhat-611443 on two nodes of cluster (master/standby).
We were able to reproduce the problem using a small pool size, we could look into the problem more deeply.
There were two types of blocked threads:
1) borrow connection
2) acquire lock
Looking at the code we were able to identify the #2 lock, which is makes the transactions wait:
The lock has 3 states:
1) open - read lock can be acquired freely
- outstanding read locks needs to finish first
- write lock pending (blocking until read is done)
- no new read lock handed out (new requests blocking)
3) closed - until write lock(s) finish new readers blocks
The cleanupExclusiveLock protects the internal structure of the database from changing in an unexpected way. It is used by the DB cleanup (doDeleteOldMessages) in WRITE mode. All other transactions require READ mode.
The actual deadlock is caused by this lock, which blocks transactions to finish and others are blocked by the lack of free connection.
For the exact explanation please check the overview picture attached.
"Lock" represents the rw lock
"Conn" is the connection pool.
"Tx1-3" represent the transactions.
There are two types of transactions:
tx1 - "not transactional" - they first acquire the read lock and then getConnection.
(they need to come first, in "open mode")
tx3 - "transactional" - they try to acquire a the read lock whilst owning a connection!
(they must come in "half-closed" mode.)
In case of sending/receiving in transaction (TX3), the connection is acquired on BEGIN, so there is a connection already in the TransactionContext.
(eg. "ActiveMQ Transport: tcp:///xx.xx.xx.xx:41767@14016") in jstack.xxxxx.20160531214821.AMQ.log
In case of sending/receiving using local transaction (TX1), the connection is first acquired inside the lock.
(eg. "ActiveMQ Transport: tcp:///xx.xx.xx.xx:27378@14016")
The problem occurs, if TX1 type operations cannot acquire a connection, thus blocking.
Tx2 (cleanup) is blocked, due to Tx1.
TX3 keeps getting connections first and wait for the clean-up to finish whilst owning the connection.
If tx3 quick enough it will make the cycle complete, then this situation will exhaust the database pool, so the server will loose the master lock.
Regarding the fix, we don't think that getting the connection before the clean-up would help, because in our cases the process was not able to enter into the write locked area.
You will need to move all getConnection() calls out of the read-locks or only execute the statement within the read-lock.