JBoss Enterprise Application Platform 4 and 5 / JBPAPP-6014

Problem with HornetQ failback under heavy load


Details

    • Type: Bug
    • Resolution: Won't Do
    • Priority: Critical
    • Component/s: Documentation, HornetQ
    • Needs More Info

    Description

      Hi Clebert.

      When I was testing HornetQ 2.2.1.QA-10280 for the GFS2 fix, I noticed unstable behaviour during failback with a colocated configuration when many producers were pushing messages into one queue.
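
      For reference, the colocated shared-store live/backup pair in this kind of test is configured roughly as below in hornetq-configuration.xml. This is only a sketch: the directory paths are the ones printed in the server A log further down, while the remaining element values are assumptions rather than the exact configuration used in the test.

          <!-- live server (sketch; paths taken from the server A log, other values assumed) -->
          <configuration xmlns="urn:hornetq">
             <clustered>true</clustered>
             <shared-store>true</shared-store>
             <backup>false</backup>
             <journal-directory>/mnt/jbm/common/tmp/nodeA/hornetq/journal</journal-directory>
             <bindings-directory>/mnt/jbm/common/tmp/nodeA/hornetq/bindings</bindings-directory>
             <large-messages-directory>/mnt/jbm/common/tmp/nodeA/hornetq/largemessages</large-messages-directory>
             <paging-directory>/mnt/jbm/common/tmp/nodeA/hornetq/paging</paging-directory>
          </configuration>

          <!-- colocated backup on the other node: backup=true and the same shared directories
               on GFS2, so it takes over when the live lock is released and hands back on failback -->
          <configuration xmlns="urn:hornetq">
             <clustered>true</clustered>
             <shared-store>true</shared-store>
             <backup>true</backup>
             <allow-failback>true</allow-failback>
             <!-- journal/bindings/large-messages/paging directories point at the same shared store -->
          </configuration>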

      Failback was successful in 6 out of 7 cases; in one case the server did not start after kill -9 and remained blocked in the 'Waiting to obtain live lock' state. The clients had already been notified to fail back to the starting server and later threw a 'Timed out waiting for response when sending packet 71' exception.

      I tested it in the messaging lab: messaging-18 for server A, messaging-19 for server B, and messaging-20 for the producers. HornetQ was configured with a 100 MB limit for the queue and the BLOCK address-full policy. To hit the issue I used 55 clients continuously producing messages into one queue.
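
      The 100 MB limit with the BLOCK policy corresponds to an address-settings entry along these lines (again only a sketch; the match pattern jms.queue.testQueue is a placeholder, not the queue name actually used in the test):

          <address-settings>
             <!-- producers are blocked once the address holds roughly 100 MB of messages -->
             <address-setting match="jms.queue.testQueue">
                <max-size-bytes>104857600</max-size-bytes>
                <address-full-policy>BLOCK</address-full-policy>
             </address-setting>
          </address-settings>

      With BLOCK, the producers are throttled by the server instead of paging to disk, so the 55 clients keep pushing against a full queue while the servers fail over and fail back.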

      There is no 100% reproducer; it happened in 1 out of 7 runs. Please take a look at it and try to identify whether customers can hit such an issue. I will provide detailed steps of my test later.

      ------------------------------------------------------------------------------------------
      LOGS - SERVER A - start after kill -9
      ------------------------------------------------------------------------------------------
      04:19:34,729 INFO [LegacyStateTransferIntegrator] Using version 4096
      04:19:34,743 INFO [RPCManagerImpl] Cache local address is 10.16.100.40:55200
      04:19:34,744 INFO [RPCManagerImpl] state was retrieved successfully (in 154 milliseconds)
      04:19:34,758 INFO [ComponentRegistry] JBoss Cache version: JBossCache 'Malagueta' 3.2.5.GA
      04:19:34,758 INFO [DefaultPartition] Fetching serviceState (will wait for 30000 milliseconds):
      04:19:34,817 INFO [DefaultPartition] serviceState was retrieved successfully (in 59 milliseconds)
      04:19:34,908 INFO [HANamingService] Started HAJNDI bootstrap; jnpPort=1100, backlog=50, bindAddress=messaging-18/10.16.100.40
      04:19:34,911 INFO [DetachedHANamingService$AutomaticDiscovery] Listening on /10.16.100.40:1102, group=230.0.0.4, HA-JNDI address=10.16.100.40:1100
      04:19:35,745 INFO [UnifiedInvokerHA] Service name is jboss:service=invoker,type=unifiedha
      04:19:36,191 INFO [HornetQServerImpl] live server is starting with configuration HornetQ Configuration (clustered=true,backup=false,sharedStore=true,journalDirectory=/mnt/jbm/common/tmp/nodeA/hornetq/journal,bindingsDirectory=/mnt/jbm/common/tmp/nodeA/hornetq/bindings,largeMessagesDirectory=/mnt/jbm/common/tmp/nodeA/hornetq/largemessages,pagingDirectory=/mnt/jbm/common/tmp/nodeA/hornetq/paging)
      04:19:36,192 INFO [HornetQServerImpl] Waiting to obtain live lock
      04:19:36,242 INFO [JournalStorageManager] Using AIO Journal
      04:19:36,271 WARNING [HornetQServerImpl] Security risk! It has been detected that the cluster admin user and password have not been changed from the installation default. Please see the HornetQ user guide, cluster chapter, for instructions on how to do this.
      04:19:36,368 INFO [ClusterManagerImpl] announcing backup
      04:19:37,258 INFO [ClusterManagerImpl] backup announced
      04:19:38,578 INFO [AIOFileLockNodeManager] Waiting to obtain live lock

      And SERVER A is stuck here

      ------------------------------------------------------------------------------------------
      LOGS - SERVER B - when server A was killed and is starting again
      ------------------------------------------------------------------------------------------
      04:19:04,890 INFO [DefaultPartition] Suspected member: 10.16.100.40:55200
      04:19:05,039 INFO [DefaultPartition] New cluster view for partition DefaultPartition (id: 2, delta: -1) : [10.16.100.42:1099]
      04:19:05,039 INFO [RPCManagerImpl] Received new cluster view: [10.16.100.42:55200|2] [10.16.100.42:55200]
      04:19:05,053 INFO [DefaultPartition] I am (10.16.100.42:1099) received membershipChanged event:
      04:19:05,053 INFO [DefaultPartition] Dead members: 1 ([10.16.100.40:1099])
      04:19:05,053 INFO [DefaultPartition] New Members : 0 ([])
      04:19:05,053 INFO [DefaultPartition] All Members : 1 ([10.16.100.42:1099])
      04:19:08,599 INFO [NettyAcceptor] Started Netty Acceptor version 3.2.3.Final-r${buildNumber} messaging-19:5446 for CORE protocol
      04:19:08,917 WARN [ClientSessionFactoryImpl] Failed to connect to server.
      04:19:09,132 INFO [HornetQServerImpl] Backup Server is now live
      04:19:09,148 INFO [BridgeImpl] Connecting bridge sf.failover-cluster.29248a3a-4577-11e0-a6f4-001b21862475 to its destination [2928f70b-4577-11e0-a6f4-001b21862475]
      04:19:09,212 WARN [ChannelImpl] Can't find packet to clear: last received command id 1 first stored command id 1
      04:19:09,220 WARN [ChannelImpl] Can't find packet to clear: last received command id 2 first stored command id 2
      04:19:09,222 WARN [ChannelImpl] Can't find packet to clear: last received command id 3 first stored command id 3
      04:19:09,232 WARN [ChannelImpl] Can't find packet to clear: last received command id 4 first stored command id 4
      04:19:09,253 INFO [BridgeImpl] Bridge sf.failover-cluster.29248a3a-4577-11e0-a6f4-001b21862475 is connected [2928f70b-4577-11e0-a6f4-001b21862475-> sf.failover-cluster.29248a3a-4577-11e0-a6f4-001b21862475]
      04:19:09,394 WARN [PostOfficeImpl] Duplicate message detected - message will not be routed. Message information:
      count=465
      _HQ_DUPL_ID=[B@bec220a
      _HQ_ROUTE_TO=[B@3baec4a3

      04:19:10,120 WARN [DiscoveryGroupImpl] There are more than one servers on the network broadcasting the same node id. You will see this message exactly once (per node) if a node is restarted, in which case it can be safely ignored. But if it is logged continuously it means you really do have more than one node on the same network active concurrently with the same node id. This could occur if you have a backup node active at the same time as its live node. nodeID=2928f70b-4577-11e0-a6f4-001b21862475
      04:19:34,470 INFO [DefaultPartition] New cluster view for partition DefaultPartition (id: 3, delta: 1) : [10.16.100.42:1099, 10.16.100.40:1099]
      04:19:34,477 INFO [DefaultPartition] I am (10.16.100.42:1099) received membershipChanged event:
      04:19:34,485 INFO [DefaultPartition] Dead members: 0 ([])
      04:19:34,485 INFO [DefaultPartition] New Members : 1 ([10.16.100.40:1099])
      04:19:34,485 INFO [DefaultPartition] All Members : 2 ([10.16.100.42:1099, 10.16.100.40:1099])
      04:19:34,646 INFO [RPCManagerImpl] Received new cluster view: [10.16.100.42:55200|3] [10.16.100.42:55200, 10.16.100.40:55200]
      04:19:39,219 INFO [HornetQServerImpl] live server wants to restart, restarting server in backup
      04:19:39,279 WARN [BridgeImpl] Unable to send message, will try again once bridge reconnects
      04:19:39,279 WARN [BridgeImpl] Unable to send message, will try again once bridge reconnects
      04:19:39,280 WARN [BridgeImpl] Unable to send message, will try again once bridge reconnects
      04:19:39,280 WARN [BridgeImpl] Unable to send message, will try again once bridge reconnects
      04:19:39,280 WARN [BridgeImpl] Unable to send message, will try again once bridge reconnects
      04:19:39,280 WARN [BridgeImpl] Unable to send message, will try again once bridge reconnects
      04:19:39,280 WARN [BridgeImpl] Unable to send message, will try again once bridge reconnects
      04:19:39,280 WARN [BridgeImpl] Unable to send message, will try again once bridge reconnects
      04:19:39,281 WARN [BridgeImpl] Unable to send message, will try again once bridge reconnects
      04:19:39,287 WARN [BridgeImpl] Unable to send message, will try again once bridge reconnects
      04:19:39,287 WARN [BridgeImpl] Unable to send message, will try again once bridge reconnects
      04:19:39,287 WARN [BridgeImpl] Unable to send message, will try again once bridge reconnects
      04:19:39,287 WARN [BridgeImpl] Unable to send message, will try again once bridge reconnects
      04:19:39,287 WARN [BridgeImpl] Unable to send message, will try again once bridge reconnects
      04:19:39,287 WARN [BridgeImpl] Unable to send message, will try again once bridge reconnects
      04:19:39,287 WARN [BridgeImpl] Unable to send message, will try again once bridge reconnects
      04:19:39,288 WARN [BridgeImpl] Unable to send message, will try again once bridge reconnects
      04:19:39,288 WARN [BridgeImpl] Unable to send message, will try again once bridge reconnects
      04:19:39,292 WARN [BridgeImpl] Unable to send message, will try again once bridge reconnects
      04:19:39,292 WARN [BridgeImpl] Unable to send message, will try again once bridge reconnects
      04:19:39,305 WARN [BridgeImpl] Unable to send message, will try again once bridge reconnects
      04:19:39,305 WARN [BridgeImpl] Unable to send message, will try again once bridge reconnects
      04:19:39,305 WARN [BridgeImpl] Unable to send message, will try again once bridge reconnects
      04:19:39,305 WARN [BridgeImpl] Unable to send message, will try again once bridge reconnects
      04:19:39,306 WARN [BridgeImpl] Unable to send message, will try again once bridge reconnects
      04:19:39,306 WARN [BridgeImpl] Unable to send message, will try again once bridge reconnects
      04:19:39,306 WARN [BridgeImpl] Unable to send message, will try again once bridge reconnects
      04:19:39,306 WARN [BridgeImpl] Unable to send message, will try again once bridge reconnects
      04:19:39,306 WARN [BridgeImpl] Unable to send message, will try again once bridge reconnects
      04:19:39,306 WARN [BridgeImpl] Unable to send message, will try again once bridge reconnects
      04:19:39,306 WARN [BridgeImpl] Unable to send message, will try again once bridge reconnects
      04:19:39,332 INFO [BridgeImpl] stopped bridge sf.failover-cluster.29248a3a-4577-11e0-a6f4-001b21862475
      04:19:39,683 WARN [RemotingConnectionImpl] Connection failure has been detected: The connection was disconnected because of server shutdown [code=4]
      04:19:39,683 WARN [RemotingConnectionImpl] Connection failure has been detected: The connection was disconnected because of server shutdown [code=4]
      04:20:21,468 WARN [RemotingConnectionImpl] Connection failure has been detected: Did not receive data from /10.16.100.40:59133. It is likely the client has exited or crashed without closing its connection, or the network between the server and client has failed. You also might have configured connection-ttl and client-failure-check-period incorrectly. Please check user manual for more information. The connection will now be closed. [code=3]
      04:20:21,468 WARN [ServerSessionImpl] Client connection failed, clearing up resources for session 2dea9dd8-4577-11e0-a6f4-001b21862475
      04:20:21,469 WARN [ServerSessionImpl] Cleared up resources for session 2dea9dd8-4577-11e0-a6f4-001b21862475

          People

            Miroslav Novak (mnovak1@redhat.com)
            Rostislav Svoboda (rsvoboda@redhat.com)
            Russell Dickenson (Inactive)