JGroups / JGRP-1529

RELAY2: Intra-site view not being accepted upon inter-site installation failure

    • Type: Bug
    • Resolution: Done
    • Priority: Major
    • 3.3
    • None
    • None

      When a node becomes coordinator, it sends the VIEW_CHANGE event up the stack, which should result in a call to Receiver.viewAccepted(...). However, when RELAY2 is in the stack and the coordinator of the global (inter-site) cluster cannot be reached, the thread is blocked (sending discovery pings) and the viewAccepted callback is therefore postponed.
      In my opinion, the inter-site stack should be created and handled in a different thread.
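      The suggestion above can be illustrated with a minimal sketch in plain Java. It does not use the JGroups API and is not the fix that was applied. The class AsyncBridgeSketch and its members (handleView, connectBridge, remoteReachable, awaitBridge) are hypothetical names invented for this illustration. The idea is only that the blocking inter-site connect is handed to a separate thread, so the local view is delivered without waiting for it.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

/** Hypothetical stand-in for a protocol that starts an inter-site bridge when its node becomes coordinator. */
class AsyncBridgeSketch {
    // Records the order of events so the behaviour can be inspected.
    final List<String> events = new CopyOnWriteArrayList<>();
    // Counted down by the caller to simulate the remote coordinator finally answering.
    final CountDownLatch remoteReachable = new CountDownLatch(1);
    private final CountDownLatch bridgeDone = new CountDownLatch(1);
    private final ExecutorService bridgeStarter = Executors.newSingleThreadExecutor(r -> {
        Thread t = new Thread(r, "bridge-starter");
        t.setDaemon(true);
        return t;
    });

    /** Simulates a blocking join of the global cluster (assumption: it blocks until the remote side answers). */
    private void connectBridge() {
        try {
            remoteReachable.await();
            events.add("bridge connected");
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            bridgeDone.countDown();
        }
    }

    /** Called on the thread that installs the local view. */
    void handleView(String view, boolean becameCoordinator) {
        if (becameCoordinator) {
            bridgeStarter.execute(this::connectBridge); // returns immediately
        }
        events.add("viewAccepted " + view); // stands in for passing VIEW_CHANGE up the stack
    }

    /** Waits for the bridge task to finish; returns false on timeout or interruption. */
    boolean awaitBridge(long timeoutMillis) {
        try {
            return bridgeDone.await(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    void stop() {
        bridgeStarter.shutdownNow();
    }

    public static void main(String[] args) {
        AsyncBridgeSketch node = new AsyncBridgeSketch();
        node.handleView("[B|2]", true);
        System.out.println(node.events);  // the view has been delivered while the bridge is still pending
        node.remoteReachable.countDown(); // the remote side finally answers
        node.awaitBridge(2000);
        System.out.println(node.events);
        node.stop();
    }
}
```

      With this arrangement an unreachable global coordinator only delays the bridge task, not the local view callback. A real implementation would also have to handle retries and a second view change arriving while the bridge is still connecting, which this sketch ignores.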

      Context:
      In my case, the node that was coordinator of both the local cluster and the global (inter-site) cluster was killed. FD_SOCK on the inter-site stack somehow failed to notice that the coordinator had crashed (more investigation required), and the nodes in the global cluster still reported the crashed node as the global coordinator.
      Therefore, the new coordinator of the local cluster failed to join the global cluster (it obviously got no response from the dead global coordinator).
      The restarted node joined the local cluster and then tried to join the local Infinispan cache with a new local view ID. However, the coordinator had not noticed that it had already installed a new JGroups view, because the Infinispan viewAccepted handler had not been called. It therefore did not respond to the cache join request: it was waiting for the new JGroups view, which was already installed in JGroups, but viewAccepted had not notified Infinispan about it.
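      The stall described above can be reproduced in miniature with the following sketch. It is not Infinispan's code. ViewIdGateSketch, viewAccepted(long) and awaitView are hypothetical names, and the assumption is only that the join handler waits until the view callback has reported a view ID at least as high as the one in the request.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

/** Hypothetical gate: a request carrying view ID N is served only after the callback has reported a view ID >= N. */
class ViewIdGateSketch {
    private final ReentrantLock lock = new ReentrantLock();
    private final Condition viewChanged = lock.newCondition();
    private long installedViewId; // last view ID reported through the callback

    /** Stands in for the application's viewAccepted handler. */
    void viewAccepted(long viewId) {
        lock.lock();
        try {
            if (viewId > installedViewId) {
                installedViewId = viewId;
                viewChanged.signalAll();
            }
        } finally {
            lock.unlock();
        }
    }

    /** Returns true if a view with at least the required ID was reported within the timeout. */
    boolean awaitView(long requiredViewId, long timeoutMillis) {
        long remaining = TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        lock.lock();
        try {
            while (installedViewId < requiredViewId) {
                if (remaining <= 0) {
                    return false; // the callback never arrived, so the request stalls
                }
                remaining = viewChanged.awaitNanos(remaining);
            }
            return true;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            lock.unlock();
        }
    }
}
```

      If viewAccepted(2) is never invoked, awaitView(2, ...) can only time out, even though the transport has already installed view 2. That is the situation the coordinator was in here.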

            Assignee: Bela Ban
            Reporter: Radim Vansa (Inactive)