Infinispan / ISPN-9988

ScatteredStateConsumerImpl can leak the exclusive topology lock

This issue belongs to an archived project. You can view it, but you can't modify it.

    • Type: Bug
    • Resolution: Done
    • Priority: Major
    • Fix Version/s: 10.1.0.Final
    • Affects Version/s: 9.4.7.Final, 10.0.0.Final
    • Component/s: Core
    • Labels: None

      When an exception happens in ScatteredStateConsumerImpl.beforeTopologyInstalled, the exclusive topology lock is not released in StateConsumerImpl.onTopologyUpdate:

      15:21:54,783 ERROR (transport-thread-FunctionalScatteredInMemoryTest-NodeA-p43135-t5:[Topology-scattered]) [LocalTopologyManagerImpl] ISPN000230: Failed to start rebalance for cache scattered
      java.lang.IllegalArgumentException: The task is already cancelled.
      	at org.infinispan.statetransfer.InboundTransferTask.cancelSegments(InboundTransferTask.java:172) ~[classes/:?]
      	at org.infinispan.statetransfer.StateConsumerImpl.cancelTransfers(StateConsumerImpl.java:959) ~[classes/:?]
      	at org.infinispan.scattered.impl.ScatteredStateConsumerImpl.beforeTopologyInstalled(ScatteredStateConsumerImpl.java:115) ~[classes/:?]
      	at org.infinispan.statetransfer.StateConsumerImpl.onTopologyUpdate(StateConsumerImpl.java:292) ~[classes/:?]
      	at org.infinispan.scattered.impl.ScatteredStateConsumerImpl.onTopologyUpdate(ScatteredStateConsumerImpl.java:102) ~[classes/:?]
      	at org.infinispan.statetransfer.StateTransferManagerImpl.doTopologyUpdate(StateTransferManagerImpl.java:200) ~[classes/:?]
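
      For context, the critical section in StateConsumerImpl.onTopologyUpdate has roughly this shape (a simplified sketch based on the stack traces above, not the actual source; the beforeTopologyInstalled parameters are elided):

      // Simplified sketch of the vulnerable pattern (illustrative, not the real Infinispan code).
      stateTransferLock.acquireExclusiveTopologyLock();
      beforeTopologyInstalled(...);                      // the ScatteredStateConsumerImpl override throws IllegalArgumentException here
      // ... install the new topology, update the write consistent hash, etc. ...
      stateTransferLock.releaseExclusiveTopologyLock();  // never reached when beforeTopologyInstalled() throws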
      

      Because the exclusive topology lock is not released, threads that try to apply a new topology update block forever. This causes random failures with the ISPN-9863 thread leak checker:

      15:26:25,922 WARN  (testng-RehashClusterPublisherManagerTest:[]) [ThreadLeakChecker] Possible leaked thread:
      "transport-thread-FunctionalScatteredInMemoryTest-NodeA-p43135-t3" daemon prio=5 tid=0x236fd nid=NA waiting
         java.lang.Thread.State: WAITING
      	java.base@11/jdk.internal.misc.Unsafe.park(Native Method)
      	java.base@11/java.util.concurrent.locks.LockSupport.park(LockSupport.java:194)
      	java.base@11/java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:885)
      	java.base@11/java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:917)
      	java.base@11/java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1240)
      	java.base@11/java.util.concurrent.locks.ReentrantReadWriteLock$WriteLock.lock(ReentrantReadWriteLock.java:959)
      	app//org.infinispan.statetransfer.StateTransferLockImpl.acquireExclusiveTopologyLock(StateTransferLockImpl.java:42)
      	app//org.infinispan.statetransfer.StateConsumerImpl.onTopologyUpdate(StateConsumerImpl.java:291)
      	app//org.infinispan.scattered.impl.ScatteredStateConsumerImpl.onTopologyUpdate(ScatteredStateConsumerImpl.java:102)
      	app//org.infinispan.statetransfer.StateTransferManagerImpl.doTopologyUpdate(StateTransferManagerImpl.java:200)
      	app//org.infinispan.statetransfer.StateTransferManagerImpl.access$000(StateTransferManagerImpl.java:57)
      	app//org.infinispan.statetransfer.StateTransferManagerImpl$1.updateConsistentHash(StateTransferManagerImpl.java:113)
      	app//org.infinispan.topology.LocalTopologyManagerImpl.doHandleTopologyUpdate(LocalTopologyManagerImpl.java:353)
      	app//org.infinispan.topology.LocalTopologyManagerImpl.lambda$handleTopologyUpdate$1(LocalTopologyManagerImpl.java:275)
      15:26:25,923 ERROR (testng-RehashClusterPublisherManagerTest:[]) [TestSuiteProgress] Test configuration failed: org.infinispan.reactive.publisher.impl.RehashClusterPublisherManagerTest.testClassFinished
      java.lang.AssertionError: Leaked threads: 
        {transport-thread-FunctionalScatteredInMemoryTest-NodeA-p43135-t3: possible sources [org.infinispan.functional.FunctionalScatteredInMemoryTest[bias=ON_WRITE], org.infinispan.statetransfer.ClusterTopologyManagerTest[SCATTERED_SYNC, tx=false], org.infinispan.functional.FunctionalCachestoreTest[passivation=true], org.infinispan.functional.distribution.rehash.FunctionalNonTxBackupOwnerBecomingPrimaryOwnerTest, org.infinispan.functional.distribution.rehash.FunctionalNonTxJoinerBecomingBackupOwnerTest, org.infinispan.api.mvcc.PutForExternalReadTest[REPL_SYNC, tx=false], org.infinispan.functional.distribution.rehash.FunctionalTxTest, org.infinispan.functional.FunctionalEncodingTypeTest[tx=true]]}
      	at org.infinispan.commons.test.ThreadLeakChecker.performCheck(ThreadLeakChecker.java:148) ~[infinispan-commons-test-10.0.0-SNAPSHOT.jar:10.0.0-SNAPSHOT]
      	at org.infinispan.commons.test.ThreadLeakChecker.testFinished(ThreadLeakChecker.java:109) ~[infinispan-commons-test-10.0.0-SNAPSHOT.jar:10.0.0-SNAPSHOT]
      	at org.infinispan.test.fwk.TestResourceTracker.testFinished(TestResourceTracker.java:112) ~[test-classes/:?]
      	at org.infinispan.test.AbstractInfinispanTest.testClassFinished(AbstractInfinispanTest.java:142) ~[test-classes/:?]
      

      The fix should address both problems: release the exclusive topology lock in a finally block, and avoid the IllegalArgumentException either by ignoring already-cancelled transfers or by only cancelling transfers while holding transferMapsLock.
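
      A minimal sketch of the finally-block part of the fix (illustrative only, assuming releaseExclusiveTopologyLock() is the counterpart of the acquire call shown in the traces):

      stateTransferLock.acquireExclusiveTopologyLock();
      try {
         beforeTopologyInstalled(...);
         // ... install the new topology under the exclusive lock ...
      } finally {
         // Release even if beforeTopologyInstalled() or the topology installation throws,
         // so later topology updates are not blocked forever.
         stateTransferLock.releaseExclusiveTopologyLock();
      }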

            [ISPN-9988] ScatteredStateConsumerImpl can leak the exclusive topology lock

            Dan Berindei (Inactive) added a comment:

            StateConsumerImpl can now leak the exclusive topology lock even in a dist/repl cache, because some new code is executed while holding the lock:

                     CompletionStages.join(persistenceManager.addSegments(newWriteSegments));
            

            Sometimes it causes tests to hang during cluster shutdown, sometimes it shows only as a thread leak:

                  org.infinispan.commons.test.ThreadLeakChecker$LeakException: Leaked thread: transport-thread-ClusteredListenerJoinsTest-NodeA-p42036-t1 << testng-ClusteredListenerJoinsTest << org.infinispan.notifications.cachelistener.cluster.ClusteredListenerJoinsTest
            	at java.base@11.0.3/jdk.internal.misc.Unsafe.park(Native Method)
            	at java.base@11.0.3/java.util.concurrent.locks.LockSupport.park(LockSupport.java:194)
            	at java.base@11.0.3/java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:885)
            	at java.base@11.0.3/java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:917)
            	at java.base@11.0.3/java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1240)
            	at java.base@11.0.3/java.util.concurrent.locks.ReentrantReadWriteLock$WriteLock.lock(ReentrantReadWriteLock.java:959)
            	at app//org.infinispan.statetransfer.StateTransferLockImpl.acquireExclusiveTopologyLock(StateTransferLockImpl.java:45)
            	at app//org.infinispan.statetransfer.StateConsumerImpl.onTopologyUpdate(StateConsumerImpl.java:307)
            	at app//org.infinispan.scattered.impl.ScatteredStateConsumerImpl.onTopologyUpdate(ScatteredStateConsumerImpl.java:106)
            	at app//org.infinispan.statetransfer.StateTransferManagerImpl.doTopologyUpdate(StateTransferManagerImpl.java:210)
            	at app//org.infinispan.statetransfer.StateTransferManagerImpl.access$000(StateTransferManagerImpl.java:66)
            	at app//org.infinispan.statetransfer.StateTransferManagerImpl$1.updateConsistentHash(StateTransferManagerImpl.java:122)
            	at app//org.infinispan.topology.LocalTopologyManagerImpl.doHandleTopologyUpdate(LocalTopologyManagerImpl.java:365)
            	at app//org.infinispan.topology.LocalTopologyManagerImpl.lambda$handleTopologyUpdate$1(LocalTopologyManagerImpl.java:286)
            	at app//org.infinispan.topology.LocalTopologyManagerImpl$$Lambda$1134/0x0000000100685840.run(Unknown Source)
            	at app//org.infinispan.executors.LimitedExecutor.runTasks(LimitedExecutor.java:175)
            	at app//org.infinispan.executors.LimitedExecutor.access$100(LimitedExecutor.java:37)
            	at app//org.infinispan.executors.LimitedExecutor$Runner.run(LimitedExecutor.java:227)
            	at java.base@11.0.3/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
            	at java.base@11.0.3/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
            	at java.base@11.0.3/java.lang.Thread.run(Thread.java:834)
            

            I believe both the scattered-cache-specific cancellation of inbound transfers and the addition of new segments can be done *before* acquiring the exclusive topology lock, minimizing the time the lock is held so that we don't need a CompletionStage-based API for it.
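
            A rough sketch of that reordering (method names taken from the traces above; segment sets and error handling elided, so this is illustrative only):

            // Do the potentially blocking work before taking the exclusive topology lock.
            cancelTransfers(removedSegments);                                         // scattered-cache inbound transfers
            CompletionStages.join(persistenceManager.addSegments(newWriteSegments));  // may block on the persistence layer

            stateTransferLock.acquireExclusiveTopologyLock();
            try {
               // Only the quick, in-memory topology switch runs under the lock.
            } finally {
               stateTransferLock.releaseExclusiveTopologyLock();
            }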


              Assignee:
              dberinde@redhat.com Dan Berindei (Inactive)
              Reporter:
              dberinde@redhat.com Dan Berindei (Inactive)
              Archiver:
              rhn-support-adongare Amol Dongare
