
Server cache manager stop/restart breaks cache topology management


Details

    • Bug
    • Resolution: Obsolete
    • Major
    • None
    • 9.4.0.CR3
    • Server
    • None

    Description

      Before FORK was introduced, ClusterTopologyManagerImpl and LocalTopologyManagerImpl assumed that the coordinator would always reply to other members' requests. After the introduction of FORK we added some hacks to work around the fact that the coordinator may not yet have a ForkChannel with our ID running, but we still expect the FORK setup to become symmetric after a reasonable amount of time.

      Stopping a FORK and starting it again without restarting the underlying channel doesn't work either, because a FORK start/stop does not trigger a new view. When a node sends a request to the coordinator and receives a CacheNotFoundResponse, it assumes that a new view will follow. If the CacheNotFoundResponse was instead caused by stopping a single DefaultCacheManager/ForkChannel, that view never arrives and the node waits until it times out.
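
      The failure mode described above can be modelled with a small standalone sketch. It is not the Infinispan code: ViewWaitSketch, ViewTracker, Coordinator, Response and queryCoordinator are hypothetical names, and the retry loop is a simplification of what LocalTopologyManagerImpl appears to do in the log below. It only assumes that a CACHE_NOT_FOUND reply makes the caller wait for the next view before retrying.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

class ViewWaitSketch {
    /** Hypothetical stand-in for the cluster view id known to the local node. */
    static final class ViewTracker {
        private final ReentrantLock lock = new ReentrantLock();
        private final Condition viewChanged = lock.newCondition();
        private int viewId;

        ViewTracker(int initialViewId) { this.viewId = initialViewId; }

        int currentViewId() {
            lock.lock();
            try { return viewId; } finally { lock.unlock(); }
        }

        void installView(int newViewId) {
            lock.lock();
            try { viewId = newViewId; viewChanged.signalAll(); } finally { lock.unlock(); }
        }

        /** Returns true if the expected view was installed before the timeout. */
        boolean waitForView(int expectedViewId, long timeoutMillis) {
            long remaining = TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
            lock.lock();
            try {
                while (viewId < expectedViewId) {
                    if (remaining <= 0) return false;
                    remaining = viewChanged.awaitNanos(remaining);
                }
                return true;
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            } finally {
                lock.unlock();
            }
        }
    }

    enum Response { OK, CACHE_NOT_FOUND }

    /** Hypothetical coordinator endpoint; one call models one status request. */
    interface Coordinator { Response getStatus(); }

    /**
     * Simplified retry loop: CACHE_NOT_FOUND is read as "the coordinator left",
     * so the caller waits for the next view before retrying.
     * Returns false if that view does not arrive in time.
     */
    static boolean queryCoordinator(Coordinator coordinator, ViewTracker views, long timeoutMillis) {
        while (true) {
            int viewAtRequest = views.currentViewId();
            if (coordinator.getStatus() == Response.OK) return true;
            if (!views.waitForView(viewAtRequest + 1, timeoutMillis)) return false;
        }
    }

    public static void main(String[] args) {
        // Case 1: only the fork was stopped, so no new view is ever installed.
        ViewTracker stuck = new ViewTracker(6);
        boolean forkStopped = queryCoordinator(() -> Response.CACHE_NOT_FOUND, stuck, 100);
        System.out.println("fork stopped, no new view: success=" + forkStopped);

        // Case 2: the coordinator really left, so a new view follows the reply.
        ViewTracker healthy = new ViewTracker(6);
        AtomicInteger calls = new AtomicInteger();
        boolean coordinatorLeft = queryCoordinator(() -> {
            if (calls.getAndIncrement() == 0) {
                healthy.installView(7);
                return Response.CACHE_NOT_FOUND;
            }
            return Response.OK;
        }, healthy, 100);
        System.out.println("coordinator left, view 7 installed: success=" + coordinatorLeft);
    }
}
```

      In the first case the wait can only end by timing out, which matches the "Timed out waiting for view 7, current view is 6" error in the log below. In the second case the view change unblocks the retry.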

      We don't restart individual cache managers in our own tests, but the Spark connector test suite does, and it sometimes fails as a result:

      2018-09-26 21:18:03,035 INFO  [org.infinispan.CLUSTER] (MSC service thread 1-4) ISPN000094: Received new cluster view for channel cluster: [server2|6] (3) [server2, server0, server1]
      2018-09-26 21:18:05,778 TRACE [org.infinispan.remoting.transport.jgroups.JGroupsTransport] (MSC service thread 1-5) server1 sending request 37 to server2: CacheTopologyControlCommand{cache=org.infinispan.spark.suites.DistributedSuite, type=POLICY_GET_STATUS, sender=server1, joinInfo=null, topologyId=0, rebalanceId=0, currentCH=null, pendingCH=null, availabilityMode=null, phase=null, actualMembers=null, throwable=null, viewId=6}
      2018-09-26 21:18:05,795 TRACE [org.infinispan.remoting.transport.jgroups.JGroupsTransport] (jgroups-4,server1) server1 received response for request 37 from server2: CacheNotFoundResponse
      2018-09-26 21:18:05,798 TRACE [org.infinispan.topology.LocalTopologyManagerImpl] (MSC service thread 1-5) Coordinator left the cluster while querying rebalancing status, retrying
      2018-09-26 21:18:05,823 TRACE [org.infinispan.remoting.transport.jgroups.JGroupsTransport] (MSC service thread 1-5) server1 sending request 41 to server2: CacheTopologyControlCommand{cache=org.infinispan.spark.suites.DistributedSuite, type=POLICY_GET_STATUS, sender=server1, joinInfo=null, topologyId=0, rebalanceId=0, currentCH=null, pendingCH=null, availabilityMode=null, phase=null, actualMembers=null, throwable=null, viewId=6}
      2018-09-26 21:18:05,841 TRACE [org.infinispan.remoting.transport.jgroups.JGroupsTransport] (jgroups-19,server1) server1 received response for request 41 from server2: CacheNotFoundResponse
      2018-09-26 21:18:05,846 TRACE [org.infinispan.topology.LocalTopologyManagerImpl] (MSC service thread 1-5) Coordinator left the cluster while querying rebalancing status, retrying
      2018-09-26 21:18:05,871 TRACE [org.infinispan.remoting.transport.jgroups.JGroupsTransport] (MSC service thread 1-5) Waiting for transaction data for view 7, current view is 6
      2018-09-26 21:19:05,779 ERROR [org.jboss.msc.service.fail] (MSC service thread 1-5) MSC000001: Failed to start service jboss.datagrid-infinispan.clustered."org.infinispan.spark.suites.DistributedSuite": org.jboss.msc.service.StartException in service jboss.datagrid-infinispan.clustered."org.infinispan.spark.suites.DistributedSuite": Failed to start service
      	at org.jboss.msc.service.ServiceControllerImpl$StartTask.execute(ServiceControllerImpl.java:1728)
      	at org.jboss.msc.service.ServiceControllerImpl$ControllerTask.run(ServiceControllerImpl.java:1556)
      	at org.jboss.threads.ContextClassLoaderSavingRunnable.run(ContextClassLoaderSavingRunnable.java:35)
      	at org.jboss.threads.EnhancedQueueExecutor.safeRun(EnhancedQueueExecutor.java:1985)
      	at org.jboss.threads.EnhancedQueueExecutor$ThreadBody.doRunTask(EnhancedQueueExecutor.java:1487)
      	at org.jboss.threads.EnhancedQueueExecutor$ThreadBody.run(EnhancedQueueExecutor.java:1364)
      	at java.lang.Thread.run(Thread.java:748)
      Caused by: org.infinispan.util.concurrent.TimeoutException: ISPN000451: Timed out waiting for view 7, current view is 6
      	at org.infinispan.topology.LocalTopologyManagerImpl.waitForView(LocalTopologyManagerImpl.java:558)
      	at org.infinispan.topology.LocalTopologyManagerImpl.executeOnCoordinatorRetry(LocalTopologyManagerImpl.java:598)
      	at org.infinispan.topology.LocalTopologyManagerImpl.isCacheRebalancingEnabled(LocalTopologyManagerImpl.java:580)
      	at org.infinispan.statetransfer.StateTransferManagerImpl.waitForInitialStateTransferToComplete(StateTransferManagerImpl.java:233)
      	at org.infinispan.cache.impl.CacheImpl.start(CacheImpl.java:1056)
      	at org.infinispan.cache.impl.AbstractDelegatingCache.start(AbstractDelegatingCache.java:451)
      	at org.infinispan.manager.DefaultCacheManager.wireAndStartCache(DefaultCacheManager.java:653)
      	at org.infinispan.manager.DefaultCacheManager.createCache(DefaultCacheManager.java:598)
      	at org.infinispan.manager.DefaultCacheManager.internalGetCache(DefaultCacheManager.java:481)
      	at org.infinispan.manager.DefaultCacheManager.getCache(DefaultCacheManager.java:465)
      	at org.infinispan.manager.impl.AbstractDelegatingEmbeddedCacheManager.getCache(AbstractDelegatingEmbeddedCacheManager.java:157)
      	at org.infinispan.server.infinispan.SecurityActions.lambda$startCache$4(SecurityActions.java:122)
      	at org.infinispan.security.Security.doPrivileged(Security.java:44)
      	at org.infinispan.server.infinispan.SecurityActions.doPrivileged(SecurityActions.java:69)
      	at org.infinispan.server.infinispan.SecurityActions.startCache(SecurityActions.java:126)
      	at org.jboss.as.clustering.infinispan.subsystem.CacheService.start(CacheService.java:87)
      	at org.jboss.msc.service.ServiceControllerImpl$StartTask.startService(ServiceControllerImpl.java:1736)
      	at org.jboss.msc.service.ServiceControllerImpl$StartTask.execute(ServiceControllerImpl.java:1698)
      	... 6 more
      

          People

            Unassigned
            dberinde@redhat.com Dan Berindei (Inactive)