Infinispan / ISPN-4996

Problem with capacityFactor=0 and restart of all nodes with capacityFactor > 0

    Details

    • Type: Bug
    • Status: New
    • Priority: Blocker
    • Resolution: Unresolved
    • Affects Version/s: 7.0.2.Final
    • Fix Version/s: None
    • Component/s: Core
    • Labels:
      None
    • Steps to Reproduce:

      1) configure a DIST_SYNC cache
      2) start one node with capacityFactor = 1
      3) start one node with capacityFactor = 0 which reads from the cache every second
      4) start one node with capacityFactor = 0 which writes to the cache every second
      5) stop the first node
      6) restart the first node
      7) the "writer" node fails with org.infinispan.util.concurrent.TimeoutException: Timed out waiting for topology 1

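      A minimal reproducer sketch along the lines of the steps above, using the embedded Infinispan 7.x API with the default JGroups stack; the class name, cluster name, role handling and key names are illustrative and not taken from the actual application:

      import org.infinispan.Cache;
      import org.infinispan.configuration.cache.CacheMode;
      import org.infinispan.configuration.cache.Configuration;
      import org.infinispan.configuration.cache.ConfigurationBuilder;
      import org.infinispan.configuration.global.GlobalConfiguration;
      import org.infinispan.configuration.global.GlobalConfigurationBuilder;
      import org.infinispan.manager.DefaultCacheManager;

      public class CapacityFactorReproducer {

          // args[0] is the node role: "owner" (capacityFactor = 1), "reader" or "writer" (capacityFactor = 0)
          public static void main(String[] args) throws Exception {
              String role = args[0];
              float capacityFactor = "owner".equals(role) ? 1f : 0f;

              GlobalConfiguration global = new GlobalConfigurationBuilder()
                      .transport().defaultTransport().clusterName("capacity-factor-repro")
                      .build();
              Configuration cfg = new ConfigurationBuilder()
                      .clustering().cacheMode(CacheMode.DIST_SYNC)
                      .hash().numOwners(1).capacityFactor(capacityFactor)
                      .build();

              DefaultCacheManager manager = new DefaultCacheManager(global, cfg);
              Cache<String, String> cache = manager.getCache("shared");

              // The reader and writer nodes touch the cache once per second; stopping and
              // restarting the owner node should then trigger the TimeoutException on the writer.
              while (true) {
                  if ("writer".equals(role)) {
                      cache.put("key-" + System.currentTimeMillis(), "value");
                  } else if ("reader".equals(role)) {
                      cache.get("key");
                  }
                  Thread.sleep(1000);
              }
          }
      }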

      Description

      I have only one DIST_SYNC cache. Most of the JVMs in the cluster are configured with capacityFactor = 0 (like the distributed localstorage=false property of Coherence) and some nodes are configured with capacityFactor > 0 (for instance 1000). We are talking about 100 nodes with capacityFactor = 0 and 4 nodes of the other kind; the whole cluster is inside a single "site/rack". Partition handling is off, numOwners is 1.

      When all the nodes with capacityFactor > 0 are down, the cluster enters a degraded state.
      The problem is that even after the nodes with capacityFactor > 0 come back up, the cluster does not recover; a full restart is needed.

      If I enable partition handling, AvailabilityExceptions start to be thrown, which I think is the expected behaviour (see the "Infinispan User Guide").

      I think this is the problem, and it is a bug:

      14/11/17 09:27:25 WARN topology.CacheTopologyControlCommand: ISPN000071: Caught exception when handling command CacheTopologyControlCommand{cache=shared, type=JOIN, sender=testserver1@xxxxxxx-22311, site-id=xxx, rack-id=xxx, machine-id=24 bytes, joinInfo=CacheJoinInfo{consistentHashFactory=org.infinispan.distribution.ch.impl.TopologyAwareConsistentHashFactory@78b791ef, hashFunction=MurmurHash3, numSegments=60, numOwners=1, timeout=120000, totalOrder=false, distributed=true}, topologyId=0, rebalanceId=0, currentCH=null, pendingCH=null, availabilityMode=null, throwable=null, viewId=3}
      java.lang.IllegalArgumentException: A cache topology's pending consistent hash must contain all the current consistent hash's members
          at org.infinispan.topology.CacheTopology.<init>(CacheTopology.java:48)
          at org.infinispan.topology.CacheTopology.<init>(CacheTopology.java:43)
          at org.infinispan.topology.ClusterCacheStatus.startQueuedRebalance(ClusterCacheStatus.java:631)
          at org.infinispan.topology.ClusterCacheStatus.queueRebalance(ClusterCacheStatus.java:85)
          at org.infinispan.partionhandling.impl.PreferAvailabilityStrategy.onJoin(PreferAvailabilityStrategy.java:22)
          at org.infinispan.topology.ClusterCacheStatus.doJoin(ClusterCacheStatus.java:540)
          at org.infinispan.topology.ClusterTopologyManagerImpl.handleJoin(ClusterTopologyManagerImpl.java:123)
          at org.infinispan.topology.CacheTopologyControlCommand.doPerform(CacheTopologyControlCommand.java:158)
          at org.infinispan.topology.CacheTopologyControlCommand.perform(CacheTopologyControlCommand.java:140)
          at org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher$4.run(CommandAwareRpcDispatcher.java:278)
          at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
          at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
          at java.lang.Thread.run(Thread.java:745)

      After that error, every "put" results in:

      14/11/17 09:27:27 ERROR interceptors.InvocationContextInterceptor: ISPN000136: Execution error
      org.infinispan.util.concurrent.TimeoutException: Timed out waiting for topology 1
          at org.infinispan.statetransfer.StateTransferLockImpl.waitForTransactionData(StateTransferLockImpl.java:93)
          at org.infinispan.interceptors.base.BaseStateTransferInterceptor.waitForTransactionData(BaseStateTransferInterceptor.java:96)
          at org.infinispan.statetransfer.StateTransferInterceptor.handleNonTxWriteCommand(StateTransferInterceptor.java:188)
          at org.infinispan.statetransfer.StateTransferInterceptor.visitPutKeyValueCommand(StateTransferInterceptor.java:95)
          at org.infinispan.commands.write.PutKeyValueCommand.acceptVisitor(PutKeyValueCommand.java:71)
          at org.infinispan.interceptors.base.CommandInterceptor.invokeNextInterceptor(CommandInterceptor.java:98)
          at org.infinispan.interceptors.CacheMgmtInterceptor.updateStoreStatistics(CacheMgmtInterceptor.java:148)
          at org.infinispan.interceptors.CacheMgmtInterceptor.visitPutKeyValueCommand(CacheMgmtInterceptor.java:134)
          at org.infinispan.commands.write.PutKeyValueCommand.acceptVisitor(PutKeyValueCommand.java:71)
          at org.infinispan.interceptors.base.CommandInterceptor.invokeNextInterceptor(CommandInterceptor.java:98)
          at org.infinispan.interceptors.InvocationContextInterceptor.handleAll(InvocationContextInterceptor.java:102)
          at org.infinispan.interceptors.InvocationContextInterceptor.handleDefault(InvocationContextInterceptor.java:71)
          at org.infinispan.commands.AbstractVisitor.visitPutKeyValueCommand(AbstractVisitor.java:35)
          at org.infinispan.commands.write.PutKeyValueCommand.acceptVisitor(PutKeyValueCommand.java:71)
          at org.infinispan.interceptors.InterceptorChain.invoke(InterceptorChain.java:333)
          at org.infinispan.cache.impl.CacheImpl.executeCommandAndCommitIfNeeded(CacheImpl.java:1576)
          at org.infinispan.cache.impl.CacheImpl.putInternal(CacheImpl.java:1054)
          at org.infinispan.cache.impl.CacheImpl.put(CacheImpl.java:1046)
          at org.infinispan.cache.impl.CacheImpl.put(CacheImpl.java:1646)
          at org.infinispan.cache.impl.CacheImpl.put(CacheImpl.java:245)

      This is the actual configuration:

      GlobalConfiguration globalConfig = new GlobalConfigurationBuilder()
          .globalJmxStatistics()
              .allowDuplicateDomains(true)
              .cacheManagerName(instanceName)
          .transport()
              .defaultTransport()
              .clusterName(clustername)
              .addProperty("configurationFile", configurationFile) // UDP stack for my cluster, approx. 100 machines
              .machineId(instanceName)
              .siteId("site1")
              .rackId("rack1")
              .nodeName(serviceName + "@" + instanceName)
          .remoteCommandThreadPool().threadPoolFactory(CachedThreadPoolExecutorFactory.create())
          .build();

      Configuration wildcard = new ConfigurationBuilder()
          .locking()
              .lockAcquisitionTimeout(lockAcquisitionTimeout)
              .concurrencyLevel(10000)
              .isolationLevel(IsolationLevel.READ_COMMITTED)
              .useLockStriping(true)
          .clustering()
              .cacheMode(CacheMode.DIST_SYNC)
              .l1().lifespan(l1ttl)
              .hash().numOwners(numOwners).capacityFactor(capacityFactor)
              .partitionHandling().enabled(false)
              .stateTransfer().awaitInitialTransfer(false).timeout(initialTransferTimeout).fetchInMemoryState(false)
          .storeAsBinary().enabled(true).storeKeysAsBinary(false).storeValuesAsBinary(true)
          .jmxStatistics().enable()
          .unsafe().unreliableReturnValues(true)
          .build();
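      For context, a rough sketch of how these two builders are presumably wired up on each node; only the "shared" cache name comes from the log above, and the defineConfiguration call and the variable names beyond globalConfig/wildcard are assumptions, not taken from the actual application:

      // Continuing from the globalConfig and wildcard builders above (assumed wiring).
      EmbeddedCacheManager manager = new DefaultCacheManager(globalConfig, wildcard);
      // Register the same configuration for the "shared" cache seen in the log; every node
      // runs the same code and differs only in the capacityFactor value passed to the builder.
      manager.defineConfiguration("shared", wildcard);
      Cache<String, Object> sharedCache = manager.getCache("shared");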

      One workaround is to set capacityFactor = 1 instead of 0, but I do not want the "simple" nodes (with less RAM) to become key owners.

      For me this is a showstopper.

              People

              • Assignee:
                dan.berindei Dan Berindei
                Reporter:
                enrico.olivelli Enrico Olivelli
              • Votes: 2
                Watchers: 3
