
[ISPN-10093] PersistenceManagerImpl stop deadlock with topology update


      DistSyncStoreNotSharedTest.clearContent hung in CI recently:

      "testng-DistSyncStoreNotSharedTest" #16 prio=5 os_prio=0 cpu=11511.26ms elapsed=435.14s tid=0x00007fdb710b6000 nid=0x3222 waiting on condition  [0x00007fdb352d3000]
         java.lang.Thread.State: WAITING (parking)
      	at jdk.internal.misc.Unsafe.park(java.base@11/Native Method)
      	- parking to wait for  <0x00000000c8a22450> (a java.util.concurrent.Semaphore$NonfairSync)
      	at java.util.concurrent.locks.LockSupport.park(java.base@11/LockSupport.java:194)
      	at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(java.base@11/AbstractQueuedSynchronizer.java:885)
      	at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireShared(java.base@11/AbstractQueuedSynchronizer.java:1009)
      	at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireShared(java.base@11/AbstractQueuedSynchronizer.java:1324)
      	at java.util.concurrent.Semaphore.acquireUninterruptibly(java.base@11/Semaphore.java:504)
      	at org.infinispan.persistence.manager.PersistenceManagerImpl.stop(PersistenceManagerImpl.java:222)
      	at jdk.internal.reflect.GeneratedMethodAccessor72.invoke(Unknown Source)
      	at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(java.base@11/DelegatingMethodAccessorImpl.java:43)
      	at java.lang.reflect.Method.invoke(java.base@11/Method.java:566)
      	at org.infinispan.commons.util.SecurityActions.lambda$invokeAccessibly$0(SecurityActions.java:79)
      	at org.infinispan.commons.util.SecurityActions$$Lambda$237/0x0000000100661c40.run(Unknown Source)
      	at org.infinispan.commons.util.SecurityActions.doPrivileged(SecurityActions.java:71)
      	at org.infinispan.commons.util.SecurityActions.invokeAccessibly(SecurityActions.java:76)
      	at org.infinispan.commons.util.ReflectionUtil.invokeAccessibly(ReflectionUtil.java:181)
      	at org.infinispan.factories.impl.BasicComponentRegistryImpl.performStop(BasicComponentRegistryImpl.java:601)
      	at org.infinispan.factories.impl.BasicComponentRegistryImpl.stopWrapper(BasicComponentRegistryImpl.java:590)
      	at org.infinispan.factories.impl.BasicComponentRegistryImpl.stop(BasicComponentRegistryImpl.java:461)
      	at org.infinispan.factories.AbstractComponentRegistry.internalStop(AbstractComponentRegistry.java:431)
      	at org.infinispan.factories.AbstractComponentRegistry.stop(AbstractComponentRegistry.java:366)
      	at org.infinispan.cache.impl.CacheImpl.performImmediateShutdown(CacheImpl.java:1160)
      	at org.infinispan.cache.impl.CacheImpl.stop(CacheImpl.java:1125)
      	at org.infinispan.cache.impl.AbstractDelegatingCache.stop(AbstractDelegatingCache.java:521)
      	at org.infinispan.manager.DefaultCacheManager.terminate(DefaultCacheManager.java:747)
      	at org.infinispan.manager.DefaultCacheManager.stopCaches(DefaultCacheManager.java:799)
      	at org.infinispan.manager.DefaultCacheManager.stop(DefaultCacheManager.java:775)
      	at org.infinispan.test.TestingUtil.killCacheManagers(TestingUtil.java:846)
      	at org.infinispan.test.MultipleCacheManagersTest.clearContent(MultipleCacheManagersTest.java:158)
      
      "persistence-thread-DistSyncStoreNotSharedTest-NodeB-p16432-t1" #53654 daemon prio=5 os_prio=0 cpu=1.26ms elapsed=301.93s tid=0x00007fdb3c3d8000 nid=0x8ef waiting on condition  [0x00007fdb00055000]
         java.lang.Thread.State: WAITING (parking)
      	at jdk.internal.misc.Unsafe.park(java.base@11/Native Method)
      	- parking to wait for  <0x00000000c8b1fb88> (a java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync)
      	at java.util.concurrent.locks.LockSupport.park(java.base@11/LockSupport.java:194)
      	at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(java.base@11/AbstractQueuedSynchronizer.java:885)
      	at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireShared(java.base@11/AbstractQueuedSynchronizer.java:1009)
      	at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireShared(java.base@11/AbstractQueuedSynchronizer.java:1324)
      	at java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.lock(java.base@11/ReentrantReadWriteLock.java:738)
      	at org.infinispan.persistence.manager.PersistenceManagerImpl.pollStoreAvailability(PersistenceManagerImpl.java:196)
      	at org.infinispan.persistence.manager.PersistenceManagerImpl$$Lambda$492/0x00000001007fb440.run(Unknown Source)
      	at java.util.concurrent.Executors$RunnableAdapter.call(java.base@11/Executors.java:515)
      	at java.util.concurrent.FutureTask.runAndReset(java.base@11/FutureTask.java:305)
      	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(java.base@11/ScheduledThreadPoolExecutor.java:305)
      
      "transport-thread-DistSyncStoreNotSharedTest-NodeB-p16424-t5" #53646 daemon prio=5 os_prio=0 cpu=3.15ms elapsed=301.94s tid=0x00007fdb2007a000 nid=0x8e8 waiting on condition  [0x00007fdb0b406000]
         java.lang.Thread.State: WAITING (parking)
      	at jdk.internal.misc.Unsafe.park(java.base@11/Native Method)
      	- parking to wait for  <0x00000000c8d2abb0> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
      	at java.util.concurrent.locks.LockSupport.park(java.base@11/LockSupport.java:194)
      	at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(java.base@11/AbstractQueuedSynchronizer.java:2081)
      	at io.reactivex.internal.operators.flowable.BlockingFlowableIterable$BlockingFlowableIterator.hasNext(BlockingFlowableIterable.java:94)
      	at io.reactivex.Flowable.blockingForEach(Flowable.java:5682)
      	at org.infinispan.statetransfer.StateConsumerImpl.removeStaleData(StateConsumerImpl.java:1011)
      	at org.infinispan.statetransfer.StateConsumerImpl.onTopologyUpdate(StateConsumerImpl.java:453)
      	at org.infinispan.statetransfer.StateTransferManagerImpl.doTopologyUpdate(StateTransferManagerImpl.java:202)
      	at org.infinispan.statetransfer.StateTransferManagerImpl.access$000(StateTransferManagerImpl.java:58)
      	at org.infinispan.statetransfer.StateTransferManagerImpl$1.updateConsistentHash(StateTransferManagerImpl.java:114)
      	at org.infinispan.topology.LocalTopologyManagerImpl.resetLocalTopologyBeforeRebalance(LocalTopologyManagerImpl.java:437)
      	at org.infinispan.topology.LocalTopologyManagerImpl.doHandleRebalance(LocalTopologyManagerImpl.java:519)
      	- locked <0x00000000c8b30b30> (a org.infinispan.topology.LocalCacheStatus)
      	at org.infinispan.topology.LocalTopologyManagerImpl.lambda$handleRebalance$3(LocalTopologyManagerImpl.java:484)
      	at org.infinispan.topology.LocalTopologyManagerImpl$$Lambda$574/0x000000010089a040.run(Unknown Source)
      	at org.infinispan.executors.LimitedExecutor.runTasks(LimitedExecutor.java:175)

      Full thread dump

      Somehow the producer thread for the transport-thread's iteration is blocked, but it is not waiting for the persistence mutex. Maybe it's waiting for a topology? Not sure if it's relevant, but the last test to run was testClearWithFlag, so the data container was empty and the store had 5 entries.


            Will Burns added a comment -

            This is fixed in ISPN-13560


            Will Burns added a comment - edited

            Oh, and also: the semaphore acquire only acquires permits when it can get the entire batch at once; the key is just that it is holding the writeLock, which in turn blocks pollStoreAvailability.
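            For illustration, a minimal sketch of that stop-side ordering; the class name, permit count and method bodies are made up here and are not the actual PersistenceManagerImpl code, only the storesMutex/publisherSemaphore roles match the discussion:

            import java.util.concurrent.Semaphore;
            import java.util.concurrent.locks.ReadWriteLock;
            import java.util.concurrent.locks.ReentrantReadWriteLock;

            public class StopSideSketch {
               private static final int MAX_PUBLISHERS = 10; // illustrative size
               private final ReadWriteLock storesMutex = new ReentrantReadWriteLock();
               private final Semaphore publisherSemaphore = new Semaphore(MAX_PUBLISHERS);

               public void stop() {
                  storesMutex.writeLock().lock(); // held while waiting below
                  try {
                     // acquireUninterruptibly(int) only succeeds once *all* requested
                     // permits are free, i.e. once every outstanding publisher is done.
                     publisherSemaphore.acquireUninterruptibly(MAX_PUBLISHERS);
                     // ... stop the stores ...
                  } finally {
                     storesMutex.writeLock().unlock();
                  }
               }
            }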

            This brings up the point that pollStoreAvailability should probably use a tryLock rather than a lock call, as it shouldn't matter if it can't get the lock.
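            A sketch of that tryLock variant, reusing the storesMutex field from the sketch above; if a writer (e.g. stop()) holds the lock, the availability poll simply skips this round instead of parking the only persistence thread:

            void pollStoreAvailability() {
               if (!storesMutex.readLock().tryLock()) {
                  return; // writer active; check availability again on the next tick
               }
               try {
                  // ... ping each store and update the availability flag ...
               } finally {
                  storesMutex.readLock().unlock();
               }
            }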


            Will Burns added a comment -

            Sounds like you got it.

            Just to clarify this part:

            "I assume needs another thread to acquire a storesMutex read lock to perform the iteration"

            That should never be true, and if so it would be a bug for sure.


            Dan Berindei (Inactive) added a comment -

            I think I got it now: I finally got past storesMutex and started focusing on publisherSemaphore, and it's not related to ISPN-10206 after all.

            At this point, StateConsumerImpl.removeStaleData() has a publisherSemaphore permit and I assume needs another thread to acquire a storesMutex read lock to perform the iteration:

            at io.reactivex.Flowable.blockingForEach(Flowable.java:5682)
            at org.infinispan.statetransfer.StateConsumerImpl.removeStaleData(StateConsumerImpl.java:1011)
            

            But at the same time, PersistenceManagerImpl.stop() has the storesMutex write lock and wants to acquire all the publisherSemaphore permits:

            at java.util.concurrent.Semaphore.acquireUninterruptibly(java.base@11/Semaphore.java:504)
            at org.infinispan.persistence.manager.PersistenceManagerImpl.stop(PersistenceManagerImpl.java:222)
            

            Looks like a deadlock; the only niggle remaining is that the producer thread trying to acquire a storesMutex read lock is not in the thread dump.

            Actually I think I have an explanation for that, too: the single persistence thread is blocked trying to acquire a storesMutex read lock for an otherwise unrelated store availability check:

            	at java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.lock(java.base@11/ReentrantReadWriteLock.java:738)
            	at org.infinispan.persistence.manager.PersistenceManagerImpl.pollStoreAvailability(PersistenceManagerImpl.java:196)
            

            Why a single persistence thread? Because the Jenkins agent machines only have 1 CPU:

                  DEFAULT_THREAD_COUNT.put(PERSISTENCE_EXECUTOR, ProcessorInfo.availableProcessors());
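            Putting the three pieces together, a minimal self-contained sketch of the cycle using only JDK primitives; all names are illustrative rather than Infinispan code, and the sleeps just force the interleaving seen in the thread dump. Running it hangs forever:

            import java.util.concurrent.ExecutorService;
            import java.util.concurrent.Executors;
            import java.util.concurrent.Future;
            import java.util.concurrent.Semaphore;
            import java.util.concurrent.locks.ReadWriteLock;
            import java.util.concurrent.locks.ReentrantReadWriteLock;

            public class StopDeadlockSketch {
               static final int PERMITS = 10;
               static final ReadWriteLock storesMutex = new ReentrantReadWriteLock();
               static final Semaphore publisherSemaphore = new Semaphore(PERMITS);
               // Single thread, like the persistence executor on a 1-CPU Jenkins agent.
               static final ExecutorService persistenceExecutor = Executors.newSingleThreadExecutor();

               public static void main(String[] args) throws Exception {
                  // 1. "removeStaleData" (this thread) takes a publisher permit and keeps
                  //    it until the producer task submitted in step 4 completes.
                  publisherSemaphore.acquireUninterruptibly();

                  // 2. "stop" grabs the storesMutex write lock, then parks waiting for
                  //    every publisherSemaphore permit (one is still held by step 1).
                  new Thread(() -> {
                     storesMutex.writeLock().lock();
                     publisherSemaphore.acquireUninterruptibly(PERMITS); // never returns
                  }, "testng-thread").start();
                  Thread.sleep(200); // let "stop" take the write lock

                  // 3. "pollStoreAvailability" occupies the only persistence thread and
                  //    parks on the read lock, behind the write lock held by "stop".
                  persistenceExecutor.submit(() -> {
                     storesMutex.readLock().lock();
                     storesMutex.readLock().unlock();
                  });

                  // 4. The producer task that would iterate the store is queued behind the
                  //    availability poll and can never run, so the permit from step 1 is
                  //    never released: stop, the poll and the iteration all wait forever.
                  Future<?> producer = persistenceExecutor.submit(() -> {
                     storesMutex.readLock().lock();
                     storesMutex.readLock().unlock();
                  });
                  producer.get(); // parks here, like blockingForEach in the transport thread
               }
            }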
            


            Will Burns added a comment -

            The UnsupportedOperationException was fixed in ISPN-10187. Sorry, I thought it was linked before. That would be unrelated though, as it never starts a Publisher to do expiration.

            The issue can happen in any test hitting the same PersistenceManager. This manifests itself in the exact same way I would expect ISPN-10206 to. If this occurs again I would just reopen this issue.


            Dan Berindei (Inactive) added a comment -

            wburns@redhat.com, in ISPN-10206 you say "this only occurs when a Stream early terminating operation is used with a cache store and terminates early without iterating over all contents in the store", but I can't see DistSyncStoreNotSharedTest doing anything like that. I just re-checked the build log and clearContent() hung after DistSyncStoreNotSharedTest.testClearWithFlag[DIST_SYNC, tx=false, shared=false, preload=false, segmented=true], which only iterates over the data container, not the cache or the store.
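            For reference, a hypothetical example of the kind of early-terminating Stream operation that quote describes; the method, the cache and the value are made up, and this is not something DistSyncStoreNotSharedTest does:

            // anyMatch() is a short-circuiting terminal op, so it may finish before the
            // store-backed publisher has emitted every entry.
            static boolean containsMagicValue(org.infinispan.Cache<String, String> cache) {
               return cache.entrySet().stream()
                     .anyMatch(e -> "magic".equals(e.getValue()));
            }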

            Also, what about the UnsupportedOperationException?


            Will Burns added a comment -

            I am pretty sure this is the same issue as https://issues.jboss.org/browse/ISPN-10206


            Will Burns added a comment - edited

            Have to admit the thread stack here doesn't make much sense. The producer should be the same as the subscriber thread. The DummyInMemoryStore shouldn't be subscribing on a different thread, which means the entries should be published on the invoking thread, so the fact that it is blocked is kinda weird.
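            For illustration, a small RxJava 2 sketch of the distinction being made here (made-up data, not Infinispan code): without subscribeOn, blockingForEach produces and consumes on the calling thread and never parks; only when the producer is moved to another thread does the caller wait in BlockingFlowableIterator.hasNext(), as in the transport-thread stack above:

            import io.reactivex.Flowable;
            import io.reactivex.schedulers.Schedulers;

            public class BlockingForEachSketch {
               public static void main(String[] args) {
                  // Same-thread producer: items are emitted on the calling thread, so the
                  // loop never parks. This is what a simple in-memory store would look like.
                  Flowable.just(1, 2, 3)
                          .blockingForEach(e -> System.out.println("same thread: " + e));

                  // Producer shifted to another thread: the caller now blocks inside
                  // hasNext() waiting for items, matching the WAITING (parking) state.
                  Flowable.just(1, 2, 3)
                          .subscribeOn(Schedulers.io())
                          .blockingForEach(e -> System.out.println("other thread: " + e));
               }
            }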


            Dan Berindei (Inactive) added a comment -

            Found an exception in the log; maybe this is why the producer isn't visible in the thread dump?

            12:42:45,237 WARN  (expiration-thread-DistSyncStoreNotSharedTest-NodeA-p16415-t1:[]) [PersistenceManagerImpl] ISPN000026: Caught exception purging data container!
            java.lang.UnsupportedOperationException: null
            	at org.infinispan.persistence.spi.AdvancedCacheExpirationWriter.purge(AdvancedCacheExpirationWriter.java:59) ~[classes/:?]
            	at org.infinispan.persistence.support.ComposedSegmentedLoadWriteStore.purge(ComposedSegmentedLoadWriteStore.java:169) ~[classes/:?]
            	at org.infinispan.persistence.manager.PersistenceManagerImpl.lambda$purgeExpired$6(PersistenceManagerImpl.java:468) ~[classes/:?]
            	at java.util.ArrayList.forEach(ArrayList.java:1540) ~[?:?]
            	at org.infinispan.persistence.manager.PersistenceManagerImpl.purgeExpired(PersistenceManagerImpl.java:471) [classes/:?]
            	at org.infinispan.expiration.impl.ClusterExpirationManager.processExpiration(ClusterExpirationManager.java:120) [classes/:?]
            	at org.infinispan.expiration.impl.ExpirationManagerImpl$ScheduledTask.run(ExpirationManagerImpl.java:245) [classes/:?]
            	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) [?:?]
            	at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:305) [?:?]
            	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:305) [?:?]
            	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
            	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
            

