  1. Red Hat Data Grid
  2. JDG-2922

TCP: connection close can block when send() blocks on a full TCP send-window

Details

    • Type: Bug
    • Resolution: Done
    • Priority: Critical
    • Fix Version/s: 7.3.2.CR1
    • Affects Version/s: JDG 7.2.3 GA
    • Component/s: JGroups
    • Labels: None
    • Sprint: DataGrid Sprint #29

    Description

      When a peer is non-responsive (without closing its socket), TcpConnection.send() can block in a socket write. The thread dump still reports the sender as RUNNABLE, because it is inside the native write call.

      The problem is that the TcpConnection cannot be closed either, as TcpConnection.close() tries to acquire the same lock already held by TcpConnection.send().

      See the stack trace below for a sample scenario.

      The use case is this one:

      • Say we have nodes A (coord), B and C
      • There's heavy (clustering) traffic to all 3 nodes, from the 2 clients
      • B is isolated by executing 'ifdown bond0'
      • At this point, the messages going to B will back up at (say) A because A doesn't get any TCP acks from B
      • At some point, depending on the traffic and the size of the sent messages, A will acquire a lock on the send connection to B, to write data, but the write will block as the TCP send-window to B is full (note that the sender thread will still be in state RUNNABLE!)
      • After 40s, A suspects B and emits a new view {A,C}
      • This should cause A's connection to B to be closed and subsequently removed. However, the close never completes, because it needs to acquire the connection lock, which is held by the blocked TCP write

      "main" #1 prio=5 os_prio=31 tid=0x00007fbbd3802000 nid=0x2303 runnable [0x0000700009793000]
         java.lang.Thread.State: RUNNABLE
      	at java.net.SocketOutputStream.socketWrite0(Native Method)
      	at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:111)
      	at java.net.SocketOutputStream.write(SocketOutputStream.java:155)
      	at java.io.BufferedOutputStream.write(BufferedOutputStream.java:122)
      	- locked <0x000000079e790a50> (a java.io.BufferedOutputStream)
      	at java.io.DataOutputStream.write(DataOutputStream.java:107)
      	- locked <0x000000079e790838> (a java.io.DataOutputStream)
      	at org.jgroups.blocks.cs.TcpConnection.doSend(TcpConnection.java:161)
      	at org.jgroups.blocks.cs.TcpConnection.send(TcpConnection.java:131)
      	at org.jgroups.blocks.cs.TcpClient.send(TcpClient.java:103)
      	at org.jgroups.tests.bla6.main(bla6.java:35)

      "Thread-2" #15 prio=5 os_prio=31 tid=0x00007fbbd2150800 nid=0x6503 waiting on condition [0x000070000bcf6000]
         java.lang.Thread.State: WAITING (parking)
      	at sun.misc.Unsafe.park(Native Method)
      	- parking to wait for  <0x000000079e7871a8> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
      	at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
      	at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
      	at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870)
      	at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199)
      	at java.util.concurrent.locks.ReentrantLock$NonfairSync.lock(ReentrantLock.java:209)
      	at java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:285)
      	at org.jgroups.blocks.cs.TcpConnection.close(TcpConnection.java:358)
      	at org.jgroups.util.Util.close(Util.java:422)
      	at org.jgroups.blocks.cs.TcpClient.stop(TcpClient.java:85)
      	at org.jgroups.blocks.cs.BaseServer.close(BaseServer.java:147)
      	at org.jgroups.util.Util.close(Util.java:422)
      	at org.jgroups.tests.bla6.lambda$main$0(bla6.java:27)
      	at org.jgroups.tests.bla6$$Lambda$1/1384010761.run(Unknown Source)
      	at java.lang.Thread.run(Thread.java:748)
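
      The two stacks above show one thread blocked in a write while holding the connection lock, and a second thread parked in close() waiting for that lock. The following is a minimal sketch of that locking shape and of one possible way around it. It is not the JGroups code and not necessarily the fix that was applied for this issue. Connection, StalledStream, sendLock and the timeout value are hypothetical names chosen for illustration, and StalledStream only simulates a socket whose send window is full. The idea sketched here is that close() waits for the lock only for a bounded time and then closes the underlying stream anyway, which makes the blocked write fail and release the lock.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

public class BlockedSendDemo {

    /** Stand-in (assumption) for a socket stream with a full TCP send window:
     *  write() blocks until the stream is closed, then fails. */
    static class StalledStream extends OutputStream {
        private final CountDownLatch closed = new CountDownLatch(1);

        @Override
        public void write(int b) throws IOException {
            try {
                closed.await(); // simulates the blocked native write
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            throw new IOException("stream closed");
        }

        @Override
        public void close() {
            closed.countDown();
        }
    }

    /** Hypothetical connection; mirrors only the locking shape in the report. */
    static class Connection {
        final ReentrantLock sendLock = new ReentrantLock();
        final OutputStream out;

        Connection(OutputStream out) {
            this.out = out;
        }

        void send(byte[] data) throws IOException {
            sendLock.lock();
            try {
                out.write(data); // may block while sendLock is held
            } finally {
                sendLock.unlock();
            }
        }

        /** Waits at most timeoutMs for the lock, then closes the stream
         *  regardless. Returns true if the lock was acquired. */
        boolean close(long timeoutMs) throws IOException, InterruptedException {
            boolean locked = sendLock.tryLock(timeoutMs, TimeUnit.MILLISECONDS);
            try {
                out.close(); // closing unblocks a writer stuck in write()
                return locked;
            } finally {
                if (locked) {
                    sendLock.unlock();
                }
            }
        }
    }

    /** Runs the scenario; returns {closeAcquiredLock, senderFinished}. */
    static boolean[] run() throws Exception {
        Connection conn = new Connection(new StalledStream());
        Thread sender = new Thread(() -> {
            try {
                conn.send(new byte[] {1});
            } catch (IOException expected) {
                // the write fails once the stream has been closed
            }
        });
        sender.start();
        while (!conn.sendLock.isLocked()) {
            Thread.sleep(10); // wait until the sender holds the lock
        }
        boolean gotLock = conn.close(100);
        sender.join(2000);
        return new boolean[] {gotLock, !sender.isAlive()};
    }

    public static void main(String[] args) throws Exception {
        boolean[] r = run();
        System.out.println("close acquired lock: " + r[0]
                + ", sender finished: " + r[1]);
    }
}
```

      With an unconditional lock() in close(), as in the stack trace, the closing thread would park for as long as the write stays blocked. With a real java.net.Socket, the Socket.close() documentation states that a thread blocked in an I/O operation on that socket will throw a SocketException, which is the behavior StalledStream imitates here.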
      

            People

              rhn-support-wfink Wolf Fink