Infinispan / ISPN-3791

Silence "Received invalid rebalance confirmation from NodeX" exceptions


Details

    • Type: Bug
    • Resolution: Done
    • Priority: Minor
    • Fix Version/s: 9.4.1.Final
    • Affects Version/s: 6.0.0.Final
    • Component/s: State Transfer
    • Labels: None

    Description

      When the coordinator shuts down, it tries to shut down each of its caches first. This triggers a rebalance for the rest of the members, but the rebalance usually finishes only after the coordinator's channel also shuts down.

      The nodes that finish their state transfer then send a REBALANCE_CONFIRM command to the new coordinator, but the new coordinator doesn't know about that rebalance (it will start the rebalance process from scratch). This results in exceptions like the following in the new coordinator's log:

      12:36:04,977 WARN  [org.infinispan.topology.CacheTopologyControlCommand] (remote-thread-2,ISPN-Node-1) ISPN000071: Caught exception when handling command CacheTopologyControlCommand{cache=MyCoolCache, type=REBALANCE_CONFIRM, sender=ISPN-Node-3-54019, joinInfo=null, topologyId=8, currentCH=null, pendingCH=null, throwable=null, viewId=4}: org.infinispan.commons.CacheException: Received invalid rebalance confirmation from ISPN-Node-3-54019 for cache MyCoolCache, we don't have a rebalance in progress
      	at org.infinispan.topology.ClusterTopologyManagerImpl.handleRebalanceCompleted(ClusterTopologyManagerImpl.java:190) [infinispan-core-6.0.0.Final.jar:6.0.0.Final]
      	at org.infinispan.topology.CacheTopologyControlCommand.doPerform(CacheTopologyControlCommand.java:147) [infinispan-core-6.0.0.Final.jar:6.0.0.Final]
      	at org.infinispan.topology.CacheTopologyControlCommand.perform(CacheTopologyControlCommand.java:124) [infinispan-core-6.0.0.Final.jar:6.0.0.Final]
      	at org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher$4.run(CommandAwareRpcDispatcher.java:270) [infinispan-core-6.0.0.Final.jar:6.0.0.Final]
      	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [rt.jar:1.7.0_45]
      	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [rt.jar:1.7.0_45]
      	at java.lang.Thread.run(Thread.java:744) [rt.jar:1.7.0_45]
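
      To make the failure mode concrete, here is a minimal, self-contained sketch of the coordinator-side check (the class, field and method names are hypothetical, not the actual ClusterTopologyManagerImpl code). A newly elected coordinator starts with no record of rebalances begun by its predecessor, so any confirmation for them is rejected:

      import java.util.Map;
      import java.util.concurrent.ConcurrentHashMap;

      // Illustrative sketch only; names do not match Infinispan's internals.
      public class RebalanceConfirmSketch {
          // Rebalances this coordinator itself started, keyed by cache name.
          private final Map<String, Integer> rebalancesInProgress = new ConcurrentHashMap<>();

          void handleRebalanceConfirm(String cacheName, String sender, int topologyId) {
              Integer expectedTopologyId = rebalancesInProgress.get(cacheName);
              if (expectedTopologyId == null) {
                  // A freshly elected coordinator has an empty map, so a
                  // confirmation for a rebalance started by the old
                  // coordinator always ends up here.
                  throw new IllegalStateException("Received invalid rebalance confirmation from "
                          + sender + " for cache " + cacheName
                          + ", we don't have a rebalance in progress");
              }
              // ... otherwise count the confirmation and finish the rebalance
              // once every member has confirmed.
          }

          public static void main(String[] args) {
              RebalanceConfirmSketch newCoordinator = new RebalanceConfirmSketch();
              // The old coordinator started topology 8 before leaving; the new
              // coordinator never registered it, so this call throws.
              newCoordinator.handleRebalanceConfirm("MyCoolCache", "ISPN-Node-3-54019", 8);
          }
      }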
      

      A simple way to avoid these warnings would be to track, on each node, which coordinator initiated a particular rebalance, and to send the confirmation message only to that coordinator. The same warnings seem to appear on the old coordinator when it receives a confirmation after its ClusterTopologyManager has started shutting down, so we may need another check there.
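
      A rough sketch of the per-node side of that fix, using simplified stand-ins (String addresses, hypothetical method names) rather than Infinispan's real API:

      import java.util.Objects;

      // Per-node sketch: remember who started the rebalance and only confirm
      // back to that coordinator. Names here are illustrative.
      public class LocalRebalanceState {
          private volatile String rebalanceOriginator; // coordinator that started the rebalance
          private volatile int rebalanceTopologyId = -1;

          void onRebalanceStart(String coordinator, int topologyId) {
              rebalanceOriginator = coordinator;
              rebalanceTopologyId = topologyId;
              // ... install the pending consistent hash and start state transfer
          }

          void onStateTransferFinished(String cacheName, String currentCoordinator) {
              if (!Objects.equals(currentCoordinator, rebalanceOriginator)) {
                  // The coordinator changed mid-rebalance. The new coordinator
                  // will restart the rebalance from scratch anyway, so sending
                  // the confirmation would only trigger the warning above.
                  return;
              }
              sendRebalanceConfirm(rebalanceOriginator, cacheName, rebalanceTopologyId);
          }

          private void sendRebalanceConfirm(String coordinator, String cacheName, int topologyId) {
              // ... send a REBALANCE_CONFIRM command to the originating coordinator
          }
      }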

      A more ambitious approach would be to keep the old rebalance when the new coordinator takes over, and to add another round to the cluster state recovery that asks whether any members have already sent REBALANCE_CONFIRM commands (once the new coordinator is ready to process those commands). This should eliminate the duplicate state transfer that happens now.
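
      The shape of that recovery round, again as a hedged sketch with every name hypothetical and the member RPC stubbed out:

      import java.util.HashSet;
      import java.util.Set;

      // Coordinator-side sketch: instead of restarting the rebalance, query the
      // members for confirmations that were lost in the coordinator change.
      public class RebalanceRecoverySketch {
          void recoverRebalance(String cacheName, int topologyId, Set<String> members) {
              Set<String> alreadyConfirmed = new HashSet<>();
              for (String member : members) {
                  // Extra recovery round: ask each member whether it already
                  // sent REBALANCE_CONFIRM for this topology id.
                  if (hasAlreadyConfirmed(member, cacheName, topologyId)) {
                      alreadyConfirmed.add(member);
                  }
              }
              if (alreadyConfirmed.containsAll(members)) {
                  // Everyone already finished state transfer: install the
                  // pending topology directly, skipping the duplicate transfer.
              } else {
                  // Keep the old rebalance alive and wait only for the members
                  // that have not confirmed yet.
              }
          }

          private boolean hasAlreadyConfirmed(String member, String cacheName, int topologyId) {
              // ... would be an RPC to the member; stubbed for this sketch
              return false;
          }
      }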

People

    Assignee: Dan Berindei (dberinde@redhat.com)
    Reporter: Dan Berindei (dberinde@redhat.com)
    Votes: 1
    Watchers: 4
