Type: Bug
Resolution: Won't Do
Priority: Critical
Status: Needs More Info
Hi Clebert.
While testing HornetQ 2.2.1.QA-10280 for the GFS2 fix, I noticed unstable behaviour on failback with a colocated configuration when many producers were pushing messages into one queue.
Failback was successful in 6 of 7 cases; in the failing run, the server did not start after kill -9 and remained blocked in the 'Waiting to obtain live lock' state. Clients had already been notified to fail back to the starting server, and they later threw 'Timed out waiting for response when sending packet 71' exceptions.
I tested this in the messaging lab: messaging-18 for server A, messaging-19 for server B, and messaging-20 for the producers. HornetQ was configured with a 100 MB limit on the queue's address and the BLOCK address-full policy; to hit the issue, I ran 55 clients continuously producing messages into one queue.
There is no 100% reproducer; it happened in 1 of 7 runs. Please take a look and try to determine whether customers can hit this issue. I'll provide detailed steps for my test later.
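For reference, the queue-size limit and BLOCK policy described above correspond to an address-settings entry roughly like the following sketch (the queue name jms.queue.testQueue is a placeholder for my test queue, not the actual name from my configuration):

```xml
<!-- hornetq-configuration.xml (sketch): cap the address at 100 MB and
     make producers block when the limit is reached, instead of paging -->
<address-settings>
   <address-setting match="jms.queue.testQueue">
      <!-- 100 MB = 104857600 bytes -->
      <max-size-bytes>104857600</max-size-bytes>
      <!-- BLOCK: producers wait until messages are consumed -->
      <address-full-policy>BLOCK</address-full-policy>
   </address-setting>
</address-settings>
```

With BLOCK in effect, all 55 producers stall against the full address during failover, which is why so many sends are in flight when the servers switch roles.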
------------------------------------------------------------------------------------------
LOGS - SERVER A - start after kill -9
------------------------------------------------------------------------------------------
04:19:34,729 INFO [LegacyStateTransferIntegrator] Using version 4096
04:19:34,743 INFO [RPCManagerImpl] Cache local address is 10.16.100.40:55200
04:19:34,744 INFO [RPCManagerImpl] state was retrieved successfully (in 154 milliseconds)
04:19:34,758 INFO [ComponentRegistry] JBoss Cache version: JBossCache 'Malagueta' 3.2.5.GA
04:19:34,758 INFO [DefaultPartition] Fetching serviceState (will wait for 30000 milliseconds):
04:19:34,817 INFO [DefaultPartition] serviceState was retrieved successfully (in 59 milliseconds)
04:19:34,908 INFO [HANamingService] Started HAJNDI bootstrap; jnpPort=1100, backlog=50, bindAddress=messaging-18/10.16.100.40
04:19:34,911 INFO [DetachedHANamingService$AutomaticDiscovery] Listening on /10.16.100.40:1102, group=230.0.0.4, HA-JNDI address=10.16.100.40:1100
04:19:35,745 INFO [UnifiedInvokerHA] Service name is jboss:service=invoker,type=unifiedha
04:19:36,191 INFO [HornetQServerImpl] live server is starting with configuration HornetQ Configuration (clustered=true,backup=false,sharedStore=true,journalDirectory=/mnt/jbm/common/tmp/nodeA/hornetq/journal,bindingsDirectory=/mnt/jbm/common/tmp/nodeA/hornetq/bindings,largeMessagesDirectory=/mnt/jbm/common/tmp/nodeA/hornetq/largemessages,pagingDirectory=/mnt/jbm/common/tmp/nodeA/hornetq/paging)
04:19:36,192 INFO [HornetQServerImpl] Waiting to obtain live lock
04:19:36,242 INFO [JournalStorageManager] Using AIO Journal
04:19:36,271 WARNING [HornetQServerImpl] Security risk! It has been detected that the cluster admin user and password have not been changed from the installation default. Please see the HornetQ user guide, cluster chapter, for instructions on how to do this.
04:19:36,368 INFO [ClusterManagerImpl] announcing backup
04:19:37,258 INFO [ClusterManagerImpl] backup announced
04:19:38,578 INFO [AIOFileLockNodeManager] Waiting to obtain live lock
SERVER A remains stuck at this point.
------------------------------------------------------------------------------------------
LOGS - SERVER B - when server A was killed and is starting again
------------------------------------------------------------------------------------------
04:19:04,890 INFO [DefaultPartition] Suspected member: 10.16.100.40:55200
04:19:05,039 INFO [DefaultPartition] New cluster view for partition DefaultPartition (id: 2, delta: -1) : [10.16.100.42:1099]
04:19:05,039 INFO [RPCManagerImpl] Received new cluster view: [10.16.100.42:55200|2] [10.16.100.42:55200]
04:19:05,053 INFO [DefaultPartition] I am (10.16.100.42:1099) received membershipChanged event:
04:19:05,053 INFO [DefaultPartition] Dead members: 1 ([10.16.100.40:1099])
04:19:05,053 INFO [DefaultPartition] New Members : 0 ([])
04:19:05,053 INFO [DefaultPartition] All Members : 1 ([10.16.100.42:1099])
04:19:08,599 INFO [NettyAcceptor] Started Netty Acceptor version 3.2.3.Final-r${buildNumber} messaging-19:5446 for CORE protocol
04:19:08,917 WARN [ClientSessionFactoryImpl] Failed to connect to server.
04:19:09,132 INFO [HornetQServerImpl] Backup Server is now live
04:19:09,148 INFO [BridgeImpl] Connecting bridge sf.failover-cluster.29248a3a-4577-11e0-a6f4-001b21862475 to its destination [2928f70b-4577-11e0-a6f4-001b21862475]
04:19:09,212 WARN [ChannelImpl] Can't find packet to clear: last received command id 1 first stored command id 1
04:19:09,220 WARN [ChannelImpl] Can't find packet to clear: last received command id 2 first stored command id 2
04:19:09,222 WARN [ChannelImpl] Can't find packet to clear: last received command id 3 first stored command id 3
04:19:09,232 WARN [ChannelImpl] Can't find packet to clear: last received command id 4 first stored command id 4
04:19:09,253 INFO [BridgeImpl] Bridge sf.failover-cluster.29248a3a-4577-11e0-a6f4-001b21862475 is connected [2928f70b-4577-11e0-a6f4-001b21862475-> sf.failover-cluster.29248a3a-4577-11e0-a6f4-001b21862475]
04:19:09,394 WARN [PostOfficeImpl] Duplicate message detected - message will not be routed. Message information:
count=465
_HQ_DUPL_ID=[B@bec220a
_HQ_ROUTE_TO=[B@3baec4a3
04:19:10,120 WARN [DiscoveryGroupImpl] There are more than one servers on the network broadcasting the same node id. You will see this message exactly once (per node) if a node is restarted, in which case it can be safely ignored. But if it is logged continuously it means you really do have more than one node on the same network active concurrently with the same node id. This could occur if you have a backup node active at the same time as its live node. nodeID=2928f70b-4577-11e0-a6f4-001b21862475
04:19:34,470 INFO [DefaultPartition] New cluster view for partition DefaultPartition (id: 3, delta: 1) : [10.16.100.42:1099, 10.16.100.40:1099]
04:19:34,477 INFO [DefaultPartition] I am (10.16.100.42:1099) received membershipChanged event:
04:19:34,485 INFO [DefaultPartition] Dead members: 0 ([])
04:19:34,485 INFO [DefaultPartition] New Members : 1 ([10.16.100.40:1099])
04:19:34,485 INFO [DefaultPartition] All Members : 2 ([10.16.100.42:1099, 10.16.100.40:1099])
04:19:34,646 INFO [RPCManagerImpl] Received new cluster view: [10.16.100.42:55200|3] [10.16.100.42:55200, 10.16.100.40:55200]
04:19:39,219 INFO [HornetQServerImpl] live server wants to restart, restarting server in backup
04:19:39,279 WARN [BridgeImpl] Unable to send message, will try again once bridge reconnects
04:19:39,279 WARN [BridgeImpl] Unable to send message, will try again once bridge reconnects
04:19:39,280 WARN [BridgeImpl] Unable to send message, will try again once bridge reconnects
04:19:39,280 WARN [BridgeImpl] Unable to send message, will try again once bridge reconnects
04:19:39,280 WARN [BridgeImpl] Unable to send message, will try again once bridge reconnects
04:19:39,280 WARN [BridgeImpl] Unable to send message, will try again once bridge reconnects
04:19:39,280 WARN [BridgeImpl] Unable to send message, will try again once bridge reconnects
04:19:39,280 WARN [BridgeImpl] Unable to send message, will try again once bridge reconnects
04:19:39,281 WARN [BridgeImpl] Unable to send message, will try again once bridge reconnects
04:19:39,287 WARN [BridgeImpl] Unable to send message, will try again once bridge reconnects
04:19:39,287 WARN [BridgeImpl] Unable to send message, will try again once bridge reconnects
04:19:39,287 WARN [BridgeImpl] Unable to send message, will try again once bridge reconnects
04:19:39,287 WARN [BridgeImpl] Unable to send message, will try again once bridge reconnects
04:19:39,287 WARN [BridgeImpl] Unable to send message, will try again once bridge reconnects
04:19:39,287 WARN [BridgeImpl] Unable to send message, will try again once bridge reconnects
04:19:39,287 WARN [BridgeImpl] Unable to send message, will try again once bridge reconnects
04:19:39,288 WARN [BridgeImpl] Unable to send message, will try again once bridge reconnects
04:19:39,288 WARN [BridgeImpl] Unable to send message, will try again once bridge reconnects
04:19:39,292 WARN [BridgeImpl] Unable to send message, will try again once bridge reconnects
04:19:39,292 WARN [BridgeImpl] Unable to send message, will try again once bridge reconnects
04:19:39,305 WARN [BridgeImpl] Unable to send message, will try again once bridge reconnects
04:19:39,305 WARN [BridgeImpl] Unable to send message, will try again once bridge reconnects
04:19:39,305 WARN [BridgeImpl] Unable to send message, will try again once bridge reconnects
04:19:39,305 WARN [BridgeImpl] Unable to send message, will try again once bridge reconnects
04:19:39,306 WARN [BridgeImpl] Unable to send message, will try again once bridge reconnects
04:19:39,306 WARN [BridgeImpl] Unable to send message, will try again once bridge reconnects
04:19:39,306 WARN [BridgeImpl] Unable to send message, will try again once bridge reconnects
04:19:39,306 WARN [BridgeImpl] Unable to send message, will try again once bridge reconnects
04:19:39,306 WARN [BridgeImpl] Unable to send message, will try again once bridge reconnects
04:19:39,306 WARN [BridgeImpl] Unable to send message, will try again once bridge reconnects
04:19:39,306 WARN [BridgeImpl] Unable to send message, will try again once bridge reconnects
04:19:39,332 INFO [BridgeImpl] stopped bridge sf.failover-cluster.29248a3a-4577-11e0-a6f4-001b21862475
04:19:39,683 WARN [RemotingConnectionImpl] Connection failure has been detected: The connection was disconnected because of server shutdown [code=4]
04:19:39,683 WARN [RemotingConnectionImpl] Connection failure has been detected: The connection was disconnected because of server shutdown [code=4]
04:20:21,468 WARN [RemotingConnectionImpl] Connection failure has been detected: Did not receive data from /10.16.100.40:59133. It is likely the client has exited or crashed without closing its connection, or the network between the server and client has failed. You also might have configured connection-ttl and client-failure-check-period incorrectly. Please check user manual for more information. The connection will now be closed. [code=3]
04:20:21,468 WARN [ServerSessionImpl] Client connection failed, clearing up resources for session 2dea9dd8-4577-11e0-a6f4-001b21862475
04:20:21,469 WARN [ServerSessionImpl] Cleared up resources for session 2dea9dd8-4577-11e0-a6f4-001b21862475