  1. JGroups
  2. JGRP-1171

Address cache in TP protocol never removes inactive members, which causes enormous delays when sending multicast messages over TCP

Details

    • Bug
    • Resolution: Done
    • Major
    • 2.10
    • 2.8, 2.9
    • None

    Description

      org.jgroups.blocks.LazyRemovalCache, which org.jgroups.protocols.TP uses for its logical address cache, removes entries marked as removable only when its size exceeds max_elements, which is set to 20 in TP.

      I'm using JGroups (tried 2.8 and 2.9) with jboss-cache 3.2.1 over the TCP protocol. I investigated why replication time, initially around 50 ms, increases by about a second whenever a node leaves the cluster.

      Here's what I found:

      When a node leaves the cluster and the view changes:
      1. TP calls logical_addr_cache.retainAll(members);
      2. LazyRemovalCache.retainAll updates the map, setting the removable flag to true on those members that are not in the view.
      3. LazyRemovalCache.checkMaxSizeExceeded NEVER removes them from the cache, because the cache size is always less than max_elements, which is 20.
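
      The three steps above can be illustrated with a minimal sketch. This is a simplified stand-in written for this report, not the actual LazyRemovalCache source: the class name LazyCacheSketch, its fields, and the sample addresses are all hypothetical. It only models the behaviour described here (mark on retainAll, purge only when the size exceeds max_elements).

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Simplified, hypothetical model of a lazy-removal cache; NOT the JGroups class.
class LazyCacheSketch {
    // Cached value plus the flag that is set when the member leaves the view.
    static final class Entry {
        final String physicalAddr;
        boolean removable;
        Entry(String physicalAddr) { this.physicalAddr = physicalAddr; }
    }

    private final Map<String, Entry> map = new HashMap<>();
    private final int maxElements;

    LazyCacheSketch(int maxElements) { this.maxElements = maxElements; }

    void add(String logicalAddr, String physicalAddr) {
        map.put(logicalAddr, new Entry(physicalAddr));
        checkMaxSizeExceeded();
    }

    // Marks every entry that is not in the new view, then purges only if the cache is too large.
    void retainAll(Set<String> members) {
        for (Map.Entry<String, Entry> e : map.entrySet()) {
            if (!members.contains(e.getKey())) {
                e.getValue().removable = true;
            }
        }
        checkMaxSizeExceeded();
    }

    // Marked entries are dropped only when the size exceeds maxElements.
    private void checkMaxSizeExceeded() {
        if (map.size() > maxElements) {
            map.values().removeIf(entry -> entry.removable);
        }
    }

    int size() { return map.size(); }

    public static void main(String[] args) {
        LazyCacheSketch cache = new LazyCacheSketch(20); // same limit as reported for TP
        cache.add("A", "10.0.0.1:7800"); // sample addresses, made up for illustration
        cache.add("B", "10.0.0.2:7800");
        cache.add("C", "10.0.0.3:7800");
        cache.retainAll(Set.of("A", "B")); // C leaves the cluster
        System.out.println(cache.size());  // prints 3: C is marked but still cached
    }
}
```

      In this model, a cluster with fewer than 20 members never reaches the purge branch, so entries for departed members stay in the map indefinitely.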

      Then, whenever a multicast message is sent:
      1. BasicTCP.sendMulticast calls TP.sendToAllPhysicalAddresses
      2. TP.sendToAllPhysicalAddresses iterates through all values in logical_addr_cache, calling sendUnicast for each
      3. logical_addr_cache still contains all the nodes, including those that were killed, so TP tries to connect to each of them, which causes enormous delays

      As a result, replication time increases by one connection timeout for every node removed from the cluster.
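
      A rough way to see how the delay adds up is sketched below. It is an assumption-laden model, not JGroups code: the class MulticastDelaySketch and the method estimateMillis are hypothetical, the sends are assumed to be sequential and blocking, and the 25 ms per-send cost and 1000 ms connect timeout are illustrative values chosen only to match the order of magnitude reported above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical model of a multicast implemented as one unicast per cached address.
class MulticastDelaySketch {
    // cachedAddrs maps a logical address to whether that node is still reachable.
    // A reachable node costs sendMillis; an unreachable one costs a full connect timeout.
    static long estimateMillis(Map<String, Boolean> cachedAddrs,
                               long sendMillis, long connectTimeoutMillis) {
        long total = 0;
        for (boolean alive : cachedAddrs.values()) {
            total += alive ? sendMillis : connectTimeoutMillis;
        }
        return total;
    }

    public static void main(String[] args) {
        Map<String, Boolean> cache = new LinkedHashMap<>();
        cache.put("A", true);
        cache.put("B", true);
        System.out.println(estimateMillis(cache, 25, 1000)); // prints 50

        cache.put("C", false); // C was killed but its entry is still cached
        System.out.println(estimateMillis(cache, 25, 1000)); // prints 1050
    }
}
```

      Under these assumptions, each stale entry adds one full connect timeout to every multicast, which is consistent with the roughly one-second increase observed per departed node.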


            People

              rhn-engineering-bban Bela Ban
              feutche Fedor Cherepanov (Inactive)
