JGroups / JGRP-1427

Race conditions during TCP start up cause broken connections and cluster-wide slow down


Details

    • Type: Bug
    • Resolution: Done
    • Priority: Major
    • Fix Version/s: 3.1, 3.0.5
    • Component/s: None

      I have a patch you can review if you are interested, but the fix is quite trivial.

    Description

      I actually found this on 2.4.1, but I can reproduce it on the Git master.

      When starting our cluster of 18 nodes using TCP, perhaps one start in twenty results in a node with very slow cluster response. Some trace logging eventually revealed that on random occasions one of the TCP connections to a sister node would start and then immediately terminate. From that point on, every cluster broadcast was subject to the response timeout (60 seconds in our case).

      The problem is in org.jgroups.blocks.BasicConnectionTable. There are a number of non-thread-safe field updates during startup, but in particular in BasicConnectionTable.Sender.start() the 'senderThread' field is updated in one thread and immediately referenced in the daughter thread's invocation of BasicConnectionTable.Sender.run().
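
      To illustrate, here is a simplified sketch of the start()/run() pattern described above (approximate names only, not the actual JGroups source):

      // Simplified sketch of the racy pattern described in this report --
      // illustrative only, not the actual BasicConnectionTable code.
      class Sender implements Runnable {
          Thread senderThread;   // plain, non-volatile field

          void start() {
              senderThread = new Thread(this, "ConnectionTable.Connection.Sender");
              senderThread.setDaemon(true);
              senderThread.start();
          }

          public void run() {
              // As described above: if this check in the daughter thread sees a
              // stale value of 'senderThread', the loop body never executes and
              // the sender terminates immediately -- matching the paired
              // "thread started" / "thread terminated" lines in the trace log below.
              while (senderThread != null && Thread.currentThread().equals(senderThread)) {
                  // ... dequeue the next message and write it to the socket ...
              }
          }
      }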

      If the timing works out wrong, 'senderThread' is set in the L2 cache of the PingSender thread, but not yet updated in the L2 cache used by ConnectionTable.Connection.Sender, and hence the run() method exits immediately. Here is the smoking gun from a trace log:

      22:18:22,985 INFO  [org.jgroups.blocks.ConnectionTable] {main} server socket created on 127.0.0.1:7800
      22:18:22,997 DEBUG [org.jgroups.blocks.MessageDispatcher$ProtocolAdapter] {main} setting local_addr (null) to 127.0.0.1:7800
      22:18:23,013 DEBUG [org.jgroups.blocks.ConnectionTable] {PingSender} ConnectionTable.Connection.Receiver started
      22:18:23,013 INFO  [org.jgroups.blocks.ConnectionTable] {PingSender} created socket to 127.0.0.1:7801
      22:18:23,015 DEBUG [org.jgroups.blocks.ConnectionTable] {PingSender} ConnectionTable.Connection.Sender thread started
      22:18:23,018 DEBUG [org.jgroups.blocks.ConnectionTable] {PingSender} ConnectionTable.Connection.Receiver started
      22:18:23,018 INFO  [org.jgroups.blocks.ConnectionTable] {PingSender} created socket to 127.0.0.1:7802
      22:18:23,018 DEBUG [org.jgroups.blocks.ConnectionTable] {ConnectionTable.Connection.Sender [127.0.0.1:44710 - 127.0.0.1:7802]} ConnectionTable.Connection.Sender thread terminated
      22:18:23,018 DEBUG [org.jgroups.blocks.ConnectionTable] {PingSender} ConnectionTable.Connection.Sender thread started
      22:18:23,019 DEBUG [org.jgroups.blocks.ConnectionTable] {PingSender} ConnectionTable.Connection.Receiver started

      You can see that at 22:18:23,018 the PingSender thread created a Sender, and in the same millisecond the ConnectionTable.Connection.Sender thread killed it off.

      The fix is easy: just convert 'senderThread' to an AtomicReference. This is very old code, so I am not sure why we are the first to report this. Perhaps it is because our application is heavily multithreaded and we run on servers with 16 physical CPU cores; there is a very high chance that PingSender and ConnectionTable.Connection.Sender were not bound to the same core, and hence their L2 caches had significant intervals where they were out of sync. There are a number of other sloppy reads and updates of variables used for thread control in BasicConnectionTable; I think this code could benefit from a review for that kind of thing.
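
      For illustration, here is a minimal sketch of what that AtomicReference-based fix could look like (again approximate names, not the actual patch; a volatile field would give the same visibility guarantee):

      import java.util.concurrent.atomic.AtomicReference;

      // Sketch of the suggested fix: publish the sender thread through an
      // AtomicReference so the daughter thread reliably observes it.
      class Sender implements Runnable {
          final AtomicReference<Thread> senderThread = new AtomicReference<>();

          void start() {
              Thread t = new Thread(this, "ConnectionTable.Connection.Sender");
              t.setDaemon(true);
              senderThread.set(t);   // publish before starting the thread
              t.start();
          }

          void stop() {
              // Clearing the reference makes run()'s loop condition fail on the
              // next check; interrupt() unblocks a sender waiting on its queue.
              Thread t = senderThread.getAndSet(null);
              if (t != null)
                  t.interrupt();
          }

          public void run() {
              while (Thread.currentThread().equals(senderThread.get())) {
                  // ... dequeue the next message and write it to the socket ...
              }
          }
      }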

      Cheers!

      Jay Guidos



          People

            Assignee: Bela Ban (rhn-engineering-bban)
            Reporter: Jay Guidos (jay.guidos-bidstrading.com, Inactive)
            Votes: 0
            Watchers: 2
