Uploaded image for project: 'Infinispan'
  1. Infinispan
  2. ISPN-4872

During state transfer, a pessimistic transaction can fail due to a ClusteredGetCommand forcing a lock command

    Details

    • Type: Bug
    • Status: Resolved (View Workflow)
    • Priority: Major
    • Resolution: Done
    • Affects Version/s: 5.2.7.Final, 6.0.2.Final, 7.0.0.CR2
    • Fix Version/s: 7.0.0.Final
    • Component/s: Core
    • Labels:
      None

      Description

      This is a strange one:

      A pessimistic transaction started on the primary owner during state transfer can fail because a backup owner issues a ClusteredGetCommand to the primary while processing the write command. This can happen if the backup needs to perform a remote get (due to a DELTA_WRITE, conditional, or reliable return values). In this case, it's a DELTA_WRITE.

      In this chain of events, state transfer is ongoing, and the union CH is (east-dht5, west-dht5, east-dht6) In this case, the pessimistic transaction originated on east-dht5 must be invoked on west-dht5 and east-dht6:

      2014-10-21 02:15:01,309 TRACE [org.infinispan.transaction.TransactionTable] (OOB-1713,session-resource-cluster,240-east-dht5.comcast.net-26106(CMC-Denver-CO)) Created a new local transaction: LocalXaTransaction{xid=null} LocalTransaction{remoteLockedNodes=null, isMarkedForRollback=false, lockedKeys=null, backupKeyLocks=null, isFromStateTransfer=false} globalTx=GlobalTransaction:<240-east-dht5.comcast.net-26106(CMC-Denver-CO)>:1575124:local, topologyId=33, age(ms)=0
      2014-10-21 02:15:01,309 TRACE [org.infinispan.util.concurrent.locks.containers.OwnableReentrantPerEntryLockContainer] (OOB-1713,session-resource-cluster,240-east-dht5.comcast.net-26106(CMC-Denver-CO)) Creating and acquiring new lock instance for key EdgeResourceCacheKey[edgeDeviceId=4211,resourceId=431831]
      2014-10-21 02:15:01,345 TRACE [org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher] (OOB-1713,session-resource-cluster,240-east-dht5.comcast.net-26106(CMC-Denver-CO)) Replication task sending PrepareCommand {modifications=[PutKeyValueCommand{key=EdgeResourceCacheKey[edgeDeviceId=4211,resourceId=431831], value=QamResource [qamId=431831, currentBandwidth=3750000, maxBandwidth=38810000, maxPrograms=255, currentPrograms=1], flags=[SKIP_REMOTE_LOOKUP, DELTA_WRITE], putIfAbsent=false, lifespanMillis=-1, maxIdleTimeMillis=-1, successful=true, ignorePreviousValue=false}], onePhaseCommit=true, gtx=GlobalTransaction:<240-east-dht5.comcast.net-26106(CMC-Denver-CO)>:1575124:local, cacheName='qamAllocation', topologyId=33} to addresses [240-west-dht5.comcast.net-35898(CH2-Chicago-IL), 240-east-dht6.comcast.net-47049(CMC-Denver-CO)] with response mode GET_ALL
      

      Let's examine east-dht6. It gets the write request. It needs to perform a remote get to pull the full key context, since it receives only the delta.

      2014-10-21 02:15:01,362 TRACE [org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher] (OOB-2149,session-resource-cluster,240-east-dht6.comcast.net-47049(CMC-Denver-CO)) Attempting to execute command: PrepareCommand {modifications=[PutKeyValueCommand{key=EdgeResourceCacheKey[edgeDeviceId=4211,resourceId=431831], value=QamResourceDelta [qamId=431831, basePort=2049, currentBandwidth=3750000, maxBandwidth=38810000, maxPrograms=255, currentPrograms=1, baseProgramNumber=1, portStep=1, program=Program [portNumber=2053, programNumber=5, status=UNUSED, sessionToken=, lastUpdatedTime=1413872101310, bandwidth=0, inputId=-1]], flags=[SKIP_REMOTE_LOOKUP, DELTA_WRITE], putIfAbsent=false, lifespanMillis=-1, maxIdleTimeMillis=-1, successful=true, ignorePreviousValue=false}], onePhaseCommit=true, gtx=GlobalTransaction:<240-east-dht5.comcast.net-26106(CMC-Denver-CO)>:1575124:local, cacheName='qamAllocation', topologyId=33} [sender=240-east-dht5.comcast.net-26106(CMC-Denver-CO)]
      ...
      2014-10-21 02:15:01,362 TRACE [org.infinispan.transaction.TransactionTable] (OOB-2149,session-resource-cluster,240-east-dht6.comcast.net-47049(CMC-Denver-CO)) Created and registered remote transaction RemoteTransaction{modifications=...
      globalTx=GlobalTransaction:<240-east-dht5.comcast.net-26106(CMC-Denver-CO)>:1575124:remote, topologyId=33, age(ms)=0
      2014-10-21 02:15:01,362 TRACE [org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher] (OOB-2149,session-resource-cluster,240-east-dht6.comcast.net-47049(CMC-Denver-CO)) Replication task sending ClusteredGetCommand{key=EdgeResourceCacheKey[edgeDeviceId=4211,resourceId=431831], flags=[SKIP_REMOTE_LOOKUP, DELTA_WRITE]} to addresses [240-east-dht5.comcast.net-26106(CMC-Denver-CO), 240-west-dht5.comcast.net-35898(CH2-Chicago-IL)] with response mode GET_FIRST
      

      east-dht5 receives the ClusteredGetCommand request. Because TxDistributionInterceptor::line 325 is true, the acquireRemoteLock flag is set to true.

              boolean acquireRemoteLock = false;
               if (ctx.isInTxScope()) {
                  TxInvocationContext txContext = (TxInvocationContext) ctx;
                  acquireRemoteLock = isWrite && isPessimisticCache && !txContext.getAffectedKeys().contains(key);
               }
      

      Thus, when east-dht5 runs the ClusteredGetCommand. it will create and invoke a LockControlCommand. Because the LCC is created inline, it will not have its origin set. But the LCC will create a RemoteTxInvocationContext. When the LCC runs through the interceptor chain, TxInterceptor::invokeNextInterceptorAndVerifyTransaction will check the originator (because of the remote context). When it doesn't find it, it will force a rollback.

      2014-10-21 02:15:01,347 TRACE [org.infinispan.remoting.InboundInvocationHandlerImpl] (OOB-1721,session-resource-cluster,240-east-dht5.comcast.net-26106(CMC-Denver-CO)) Calling perform() on ClusteredGetCommand{key=EdgeResourceCacheKey[edgeDeviceId=4211,resourceId=431831], flags=[SKIP_REMOTE_LOOKUP, DELTA_WRITE]}
      2014-10-21 02:15:01,347 TRACE [org.infinispan.transaction.TransactionTable] (OOB-1721,session-resource-cluster,240-east-dht5.comcast.net-26106(CMC-Denver-CO)) Created and registered remote transaction RemoteTransaction{modifications=[], lookedUpEntries={}, lockedKeys=null, backupKeyLocks=null, missingLookedUpEntries=false, isMarkedForRollback=false} globalTx=GlobalTransaction:<240-east-dht5.comcast.net-26106(CMC-Denver-CO)>:1575124:local, topologyId=33, age(ms)=0
      2014-10-21 02:15:01,347 TRACE [org.infinispan.statetransfer.StateTransferInterceptor] (OOB-1721,session-resource-cluster,240-east-dht5.comcast.net-26106(CMC-Denver-CO)) handleTopologyAffectedCommand for command LockControlCommand{cache=qamAllocation, keys=[EdgeResourceCacheKey[edgeDeviceId=4211,resourceId=431831]], flags=[SKIP_REMOTE_LOOKUP, DELTA_WRITE], unlock=false}
      2014-10-21 02:15:01,347 TRACE [org.infinispan.interceptors.locking.PessimisticLockingInterceptor] (OOB-1721,session-resource-cluster,240-east-dht5.comcast.net-26106(CMC-Denver-CO)) Locking key EdgeResourceCacheKey[edgeDeviceId=4211,resourceId=431831], no need to check for pending locks.
      2014-10-21 02:15:01,347 TRACE [org.infinispan.interceptors.TxInterceptor] (OOB-1721,session-resource-cluster,240-east-dht5.comcast.net-26106(CMC-Denver-CO)) invokeNextInterceptorAndVerifyTransaction :: originatorMissing=true, alreadyCompleted=false
      2014-10-21 02:15:01,347 TRACE [org.infinispan.interceptors.TxInterceptor] (OOB-1721,session-resource-cluster,240-east-dht5.comcast.net-26106(CMC-Denver-CO)) Rolling back remote transaction GlobalTransaction:<240-east-dht5.comcast.net-26106(CMC-Denver-CO)>:1575124:local because either already completed(false) or originator(null) no longer in the cluster(true).
       
      // ROLLBACK HAPPENS HERE
      

      When the transaction attempts to commit, it fails due to the spurious rollback. For instance:

      2014-10-21 02:15:01,384 WARN  [org.infinispan.remoting.rpc.RpcManagerImpl] (OOB-1713,session-resource-cluster,240-east-dht5.comcast.net-26106(CMC-Denver-CO)) ISPN000071: Caught exception when handling command PrepareCommand {modifications=[PutKeyValueCommand{key=EdgeResourceCacheKey[edgeDeviceId=4211,resourceId=431831], value=QamResource [qamId=431831, currentBandwidth=3750000, maxBandwidth=38810000, maxPrograms=255, currentPrograms=1], flags=[SKIP_REMOTE_LOOKUP, DELTA_WRITE], putIfAbsent=false, lifespanMillis=-1, maxIdleTimeMillis=-1, successful=true, ignorePreviousValue=false}], onePhaseCommit=true, gtx=GlobalTransaction:<240-east-dht5.comcast.net-26106(CMC-Denver-CO)>:1575124:local, cacheName='qamAllocation', topologyId=33}
      org.infinispan.remoting.RemoteException: ISPN000217: Received exception from 240-west-dht5.comcast.net-35898(CH2-Chicago-IL), see cause for remote stack trace
      ...
      	at org.infinispan.commands.tx.PrepareCommand.acceptVisitor(PrepareCommand.java:124)
      	at org.infinispan.interceptors.InterceptorChain.invoke(InterceptorChain.java:343)
      	at org.infinispan.transaction.TransactionCoordinator.commit(TransactionCoordinator.java:175)
      	at org.infinispan.transaction.xa.TransactionXaAdapter.commit(TransactionXaAdapter.java:122)
      ... 
      com.arjuna.ats.internal.jta.resources.arjunacore.XAResourceRecord.topLevelOnePhaseCommit(XAResourceRecord.java:636)
      	at com.arjuna.ats.arjuna.coordinator.BasicAction.onePhaseCommit(BasicAction.java:2619)
      	at com.arjuna.ats.arjuna.coordinator.BasicAction.End(BasicAction.java:1779)
      	at com.arjuna.ats.arjuna.coordinator.TwoPhaseCoordinator.end(TwoPhaseCoordinator.java:88)
      	at com.arjuna.ats.arjuna.AtomicAction.commit(AtomicAction.java:177)
       
      Caused by: org.infinispan.transaction.xa.InvalidTransactionException: This remote transaction GlobalTransaction:<240-east-dht5.comcast.net-26106(CMC-Denver-CO)>:1575124:local is already rolled back
      	at org.infinispan.transaction.RemoteTransaction.checkIfRolledBack(RemoteTransaction.java:137)
      	at org.infinispan.transaction.RemoteTransaction.putLookedUpEntry(RemoteTransaction.java:73)
      	at org.infinispan.context.impl.RemoteTxInvocationContext.putLookedUpEntry(RemoteTxInvocationContext.java:99)
      	at org.infinispan.container.EntryFactoryImpl.wrapInternalCacheEntryForPut(EntryFactoryImpl.java:242)
      	at org.infinispan.container.EntryFactoryImpl.wrapEntryForPut(EntryFactoryImpl.java:166)
      ...
      

      So it looks like there's a couple ways to handle this. One would be to only acquire a remote lock in TxDistributionInterceptor if the current node was the originator. Another – possibly less intrusive – would be to set the origin of the invoked LCC command to that of the ClusteredGetCommand. The former should be preferable, but this code tends to be a bit labyrinthine with the various pessimistic use cases out there.

        Gliffy Diagrams

          Attachments

            Activity

              People

              • Assignee:
                dan.berindei Dan Berindei
                Reporter:
                an1310 Erik Salter
              • Votes:
                0 Vote for this issue
                Watchers:
                2 Start watching this issue

                Dates

                • Created:
                  Updated:
                  Resolved: