JGroups / JGRP-2137

JGroups: one slow/stuck node slows/freezes entire cluster

Details

    • Bug
    • Resolution: Won't Do
    • Major
    • None
    • 3.6.4
    • None
      To simulate this, we ran one node (Node B) in debug mode and suspended execution at the following location:

      TUNNEL.java, class StubReceiver, method 'run'. Suspend inside the while loop to simulate a slow/stuck reader.

      This leads to the situation described in the 'Description' section.


    Description

      We have a multi-node cluster in which one node (say Node A) runs the GossipRouter. We use TUNNEL, i.e., the other nodes in the cluster can talk to each other only via Node A. If one of the nodes in the cluster (say Node B) is slow to read, or gets stuck while reading from the channel, the entire cluster is affected: inter-node gossip also gets stuck and nodes fall out of the cluster.

      A thread dump on Node A indicates that the 'ConnectionHandler' for Node B is stuck (at SocketOutputStream.socketWrite) while holding a lock, thus blocking the ConnectionHandlers for all other nodes.

      --snip (from thread dump on Node A)--
      "gossip-handlers-129" #1088 daemon prio=5 os_prio=0 tid=0x00007f65d20ce800 nid=0x2353 runnable [0x00007f6557efd000]
         java.lang.Thread.State: RUNNABLE
              at java.net.SocketOutputStream.socketWrite0(Native Method)
              at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:109)
              at java.net.SocketOutputStream.write(SocketOutputStream.java:153)
              at sun.security.ssl.OutputRecord.writeBuffer(OutputRecord.java:431)
              at sun.security.ssl.OutputRecord.write(OutputRecord.java:417)
              at sun.security.ssl.SSLSocketImpl.writeRecordInternal(SSLSocketImpl.java:857)
              at sun.security.ssl.SSLSocketImpl.writeRecord(SSLSocketImpl.java:828)
              at sun.security.ssl.AppOutputStream.write(AppOutputStream.java:123)
              - locked <0x00000005f2445028> (a sun.security.ssl.AppOutputStream)
              at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
              at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140)
              - locked <0x00000005f248a210> (a java.io.BufferedOutputStream)
              at java.io.DataOutputStream.flush(DataOutputStream.java:123)
              at org.jgroups.stack.GossipRouter.sendToMember(GossipRouter.java:607)
              - locked <0x00000005f248a1f0> (a java.io.DataOutputStream)
              at org.jgroups.stack.GossipRouter.sendToAllMembersInGroup(GossipRouter.java:590)
              - locked <0x00000005d4aa1458> (a java.util.concurrent.ConcurrentHashMap)
              at org.jgroups.stack.GossipRouter.route(GossipRouter.java:487)
              at org.jgroups.stack.GossipRouter.access$800(GossipRouter.java:63)
              at org.jgroups.stack.GossipRouter$ConnectionHandler.readLoop(GossipRouter.java:753)
              at org.jgroups.stack.GossipRouter$ConnectionHandler.run(GossipRouter.java:706)
              at java.lang.Thread.run(Thread.java:745)
      --snip end--

      The other gossip-handler threads on Node A (those serving the other nodes in the cluster) wait to acquire the lock on the ConnectionHandler map in GossipRouter.java, method sendToAllMembersInGroup:

      --snip--
      "gossip-handlers-128" #1078 daemon prio=5 os_prio=0 tid=0x00007f65d20ce000 nid=0x2343 waiting for monitor entry [0x00007f654c258000]
         java.lang.Thread.State: BLOCKED (on object monitor)
              at org.jgroups.stack.GossipRouter.sendToAllMembersInGroup(GossipRouter.java:583)
              - waiting to lock <0x00000005d4aa1458> (a java.util.concurrent.ConcurrentHashMap)
              at org.jgroups.stack.GossipRouter.route(GossipRouter.java:487)
              at org.jgroups.stack.GossipRouter.access$800(GossipRouter.java:63)
              at org.jgroups.stack.GossipRouter$ConnectionHandler.readLoop(GossipRouter.java:753)
              at org.jgroups.stack.GossipRouter$ConnectionHandler.run(GossipRouter.java:706)
              at java.lang.Thread.run(Thread.java:745)

      "gossip-handlers-127" #1073 daemon prio=5 os_prio=0 tid=0x00007f65d01a6800 nid=0x233c waiting for monitor entry [0x00007f6697afb000]
         java.lang.Thread.State: BLOCKED (on object monitor)
              at org.jgroups.stack.GossipRouter.sendToAllMembersInGroup(GossipRouter.java:583)
              - waiting to lock <0x00000005d4aa1458> (a java.util.concurrent.ConcurrentHashMap)
              at org.jgroups.stack.GossipRouter.route(GossipRouter.java:487)
              at org.jgroups.stack.GossipRouter.access$800(GossipRouter.java:63)
              at org.jgroups.stack.GossipRouter$ConnectionHandler.readLoop(GossipRouter.java:753)
              at org.jgroups.stack.GossipRouter$ConnectionHandler.run(GossipRouter.java:706)
              at java.lang.Thread.run(Thread.java:745)
      --snip end--
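
      The pattern in the two dumps (one thread holding the map's monitor while blocked in a socket write, the others BLOCKED on the same monitor) can be reproduced in isolation. The following is a minimal sketch, not GossipRouter code: the class name, the `connections` map and the latch that stands in for the stuck write are all hypothetical.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-alone demo; none of these names come from JGroups.
class LockContentionDemo {
    // Stand-in for the router's map of connections, used here only as a monitor.
    static final Map<String, String> connections = new ConcurrentHashMap<>();

    /** Returns true if a second sender ends up BLOCKED while the first "write" is stuck. */
    static boolean secondSenderBlocked() {
        CountDownLatch lockTaken = new CountDownLatch(1);
        CountDownLatch stuckWrite = new CountDownLatch(1); // released only at the end

        // First sender: holds the monitor while its "write" does not return,
        // like the RUNNABLE thread sitting in socketWrite0.
        Thread first = new Thread(() -> {
            synchronized (connections) {
                lockTaken.countDown();
                try {
                    stuckWrite.await();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        });

        // Second sender: only needs the monitor briefly, but cannot get it.
        Thread second = new Thread(() -> {
            synchronized (connections) {
                connections.size();
            }
        });

        try {
            first.start();
            lockTaken.await();
            second.start();

            // Poll until the second thread is reported as BLOCKED (or give up after 2 s).
            long deadline = System.nanoTime() + TimeUnit.SECONDS.toNanos(2);
            while (second.getState() != Thread.State.BLOCKED && System.nanoTime() < deadline) {
                Thread.sleep(10);
            }
            boolean blocked = second.getState() == Thread.State.BLOCKED;

            stuckWrite.countDown(); // "unstick" the write so both threads can finish
            first.join();
            second.join();
            return blocked;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("demo interrupted", e);
        }
    }

    public static void main(String[] args) {
        System.out.println("second sender blocked: " + secondSenderBlocked());
    }
}
```

      The point of the sketch is that the second thread never touches the slow peer at all; it is stalled purely because the shared monitor is held across a blocking I/O call.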

      If Node B goes down, JGroups quickly removes it from the cluster and
      there is no problem. But if it stays in the cluster and is slow or stuck, is
      there a way to keep the rest of the cluster from being affected? We'd
      appreciate any help or pointers. Thanks.
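
      One direction we can imagine, purely as an illustration and not as anything JGroups provides or the maintainers have endorsed, is to give each member its own bounded send queue and writer thread, so that a stuck peer only fills its own queue. The sketch below assumes that dropping messages to a full queue is acceptable (for example because a higher layer retransmits); `QueuedSender` and all of its members are hypothetical names.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

// Hypothetical per-member sender; not part of JGroups.
class QueuedSender {
    private final BlockingQueue<byte[]> queue;
    private final Thread writer;

    /** blockingWrite stands in for the (possibly stalling) socket write to one member. */
    QueuedSender(int capacity, Consumer<byte[]> blockingWrite) {
        queue = new ArrayBlockingQueue<>(capacity);
        writer = new Thread(() -> {
            try {
                while (true) {
                    blockingWrite.accept(queue.take());
                }
            } catch (InterruptedException e) {
                // stop() was called: let the writer thread exit
            }
        });
        writer.setDaemon(true);
        writer.start();
    }

    /** Never blocks the caller; returns false (message dropped) if this member's queue is full. */
    boolean offer(byte[] msg) {
        return queue.offer(msg);
    }

    void stop() {
        writer.interrupt();
    }

    /** Sends 5 messages to a stuck member and a healthy one; true if the healthy one got all 5. */
    static boolean fastMemberUnaffected() {
        CountDownLatch neverReleased = new CountDownLatch(1);
        CountDownLatch delivered = new CountDownLatch(5);

        // The slow member's write never completes until the thread is interrupted.
        QueuedSender slow = new QueuedSender(2, m -> {
            try {
                neverReleased.await();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        QueuedSender fast = new QueuedSender(16, m -> delivered.countDown());

        int dropped = 0;
        for (int i = 0; i < 5; i++) {
            byte[] msg = {(byte) i};
            if (!slow.offer(msg)) {
                dropped++; // slow member's queue is full: drop instead of waiting
            }
            fast.offer(msg);
        }
        try {
            // At most 3 messages fit on the slow side (1 in flight + 2 queued), so >= 2 are dropped.
            return delivered.await(2, TimeUnit.SECONDS) && dropped >= 2;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            slow.stop();
            fast.stop();
        }
    }

    public static void main(String[] args) {
        System.out.println("fast member unaffected: " + fastMemberUnaffected());
    }
}
```

      The trade-off is explicit: the routing thread never waits on a slow socket, at the cost of one extra thread per member and of dropped messages once that member's queue fills up.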

          People

            rhn-engineering-bban Bela Ban
            bharad4 Bharad S (Inactive)