WildFly / WFLY-6213

Failed to recover cluster state after the current node became the coordinator

      Description

      Seen in failover tests: HA singleton deployment scenarios, jvmkill failover type, random election policy.

      Something goes wrong when a new election should take place after any node is killed (the killed node does not have to be the cluster coordinator or the singleton provider):

      Timeline:

      • perf18 was killed around 03:40:07
      • perf19 was elected, JBEAP-2254 occurred, perf19 left the cluster
      • right after that perf20 was elected, JBEAP-2254 occurred, perf20 left the cluster
      • right after that perf21 was elected and logged these errors (perf21 was the only node in the cluster at that time):
      [JBossINF] 03:40:07,816 INFO  [org.wildfly.clustering.server] (notification-thread--p2-t1) WFLYCLSV0001: This node will now operate as the singleton provider of the jboss.deployment.unit."clusterbench-ee7-singleton-jbossall.ear".FIRST_MODULE_USE service
      [JBossINF] 03:40:07,821 ERROR [org.infinispan.topology.ClusterTopologyManagerImpl] (transport-thread--p15-t5) ISPN000196: Failed to recover cluster state after the current node became the coordinator: org.infinispan.commons.CacheException: Unsuccessful response received from node perf20: CacheNotFoundResponse
      [JBossINF] 	at org.infinispan.topology.ClusterTopologyManagerImpl.executeOnClusterSync(ClusterTopologyManagerImpl.java:480)
      [JBossINF] 	at org.infinispan.topology.ClusterTopologyManagerImpl.recoverClusterStatus(ClusterTopologyManagerImpl.java:348)
      [JBossINF] 	at org.infinispan.topology.ClusterTopologyManagerImpl.handleClusterView(ClusterTopologyManagerImpl.java:286)
      [JBossINF] 	at org.infinispan.topology.ClusterTopologyManagerImpl$ClusterViewListener$1.run(ClusterTopologyManagerImpl.java:617)
      [JBossINF] 	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
      [JBossINF] 	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
      [JBossINF] 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
      [JBossINF] 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
      [JBossINF] 	at java.lang.Thread.run(Thread.java:745)
      ...
      [JBossINF] 03:40:07,905 INFO  [org.infinispan.remoting.transport.jgroups.JGroupsTransport] (Incoming-8,ee,perf21) ISPN000094: Received new cluster view for channel server: [perf21|6] (1) [perf21]
      [JBossINF] 03:40:07,904 ERROR [org.infinispan.topology.ClusterTopologyManagerImpl] (transport-thread--p14-t15) ISPN000196: Failed to recover cluster state after the current node became the coordinator: java.util.concurrent.ExecutionException: org.infinispan.remoting.transport.jgroups.SuspectException: Cache not running on node perf20
      [JBossINF] 	at java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357)
      [JBossINF] 	at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1915)
      [JBossINF] 	at org.infinispan.topology.ClusterTopologyManagerImpl.executeOnClusterSync(ClusterTopologyManagerImpl.java:471)
      [JBossINF] 	at org.infinispan.topology.ClusterTopologyManagerImpl.recoverClusterStatus(ClusterTopologyManagerImpl.java:348)
      [JBossINF] 	at org.infinispan.topology.ClusterTopologyManagerImpl.handleClusterView(ClusterTopologyManagerImpl.java:286)
      [JBossINF] 	at org.infinispan.topology.ClusterTopologyManagerImpl$ClusterViewListener$1.run(ClusterTopologyManagerImpl.java:617)
      [JBossINF] 	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
      [JBossINF] 	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
      [JBossINF] 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
      [JBossINF] 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
      [JBossINF] 	at java.lang.Thread.run(Thread.java:745)
      [JBossINF] Caused by: org.infinispan.remoting.transport.jgroups.SuspectException: Cache not running on node perf20
      [JBossINF] 	at org.infinispan.remoting.transport.AbstractTransport.checkResponse(AbstractTransport.java:46)
      [JBossINF] 	at org.infinispan.remoting.transport.jgroups.JGroupsTransport.checkRsp(JGroupsTransport.java:763)
      [JBossINF] 	at org.infinispan.remoting.transport.jgroups.JGroupsTransport.lambda$invokeRemotelyAsync$174(JGroupsTransport.java:599)
      [JBossINF] 	at java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:602)
      [JBossINF] 	at java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:577)
      [JBossINF] 	at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
      [JBossINF] 	at java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:1962)
      [JBossINF] 	at org.infinispan.remoting.transport.jgroups.SingleResponseFuture.futureDone(SingleResponseFuture.java:30)
      [JBossINF] 	at org.jgroups.blocks.Request.checkCompletion(Request.java:169)
      [JBossINF] 	at org.jgroups.blocks.UnicastRequest.viewChange(UnicastRequest.java:164)
      [JBossINF] 	at org.jgroups.blocks.RequestCorrelator.receiveView(RequestCorrelator.java:331)
      [JBossINF] 	at org.jgroups.blocks.RequestCorrelator.receive(RequestCorrelator.java:242)
      [JBossINF] 	at org.jgroups.blocks.MessageDispatcher$ProtocolAdapter.up(MessageDispatcher.java:684)
      [JBossINF] 	at org.jgroups.JChannel.up(JChannel.java:738)
      [JBossINF] 	at org.jgroups.fork.ForkProtocolStack.up(ForkProtocolStack.java:123)
      [JBossINF] 	at org.jgroups.stack.Protocol.up(Protocol.java:374)
      [JBossINF] 	at org.jgroups.protocols.FORK.up(FORK.java:118)
      [JBossINF] 	at org.jgroups.protocols.FRAG2.up(FRAG2.java:165)
      [JBossINF] 	at org.jgroups.protocols.FlowControl.up(FlowControl.java:394)
      [JBossINF] 	at org.jgroups.protocols.FlowControl.up(FlowControl.java:394)
      [JBossINF] 	at org.jgroups.protocols.pbcast.GMS.installView(GMS.java:735)
      [JBossINF] 	at org.jgroups.protocols.pbcast.CoordGmsImpl.handleViewChange(CoordGmsImpl.java:244)
      [JBossINF] 	at org.jgroups.protocols.pbcast.GMS.up(GMS.java:925)
      [JBossINF] 	at org.jgroups.stack.Protocol.up(Protocol.java:412)
      [JBossINF] 	at org.jgroups.protocols.pbcast.STABLE.up(STABLE.java:294)
      [JBossINF] 	at org.jgroups.protocols.UNICAST3.up(UNICAST3.java:474)
      [JBossINF] 	at org.jgroups.protocols.pbcast.NAKACK2.deliverBatch(NAKACK2.java:982)
      [JBossINF] 	at org.jgroups.protocols.pbcast.NAKACK2.removeAndPassUp(NAKACK2.java:912)
      [JBossINF] 	at org.jgroups.protocols.pbcast.NAKACK2.handleMessage(NAKACK2.java:846)
      [JBossINF] 	at org.jgroups.protocols.pbcast.NAKACK2.up(NAKACK2.java:618)
      [JBossINF] 	at org.jgroups.protocols.VERIFY_SUSPECT.up(VERIFY_SUSPECT.java:155)
      [JBossINF] 	at org.jgroups.protocols.FD_ALL.up(FD_ALL.java:200)
      [JBossINF] 	at org.jgroups.protocols.FD_SOCK.up(FD_SOCK.java:310)
      [JBossINF] 	at org.jgroups.protocols.MERGE3.up(MERGE3.java:285)
      [JBossINF] 	at org.jgroups.protocols.Discovery.up(Discovery.java:295)
      [JBossINF] 	at org.jgroups.protocols.TP.passMessageUp(TP.java:1577)
      [JBossINF] 	at org.jgroups.protocols.TP$3.run(TP.java:1511)
      [JBossINF] 	... 3 more
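
      For triaging longer console logs, the following is a minimal Python sketch that counts which node each ISPN000196 error names. It is not part of the test suite. The function name `nodes_blamed` and the regular expression are assumptions, derived only from the two ISPN000196 messages quoted above; other message variants may need additional patterns.

```python
import re
from collections import Counter

# Assumption: the offending node follows one of the two phrasings seen in
# the ISPN000196 messages quoted in this report.
ISPN000196 = re.compile(
    r"ISPN000196: .*?"
    r"(?:response received from node|Cache not running on node) "
    r"(?P<node>\w+)"
)


def nodes_blamed(log_lines):
    """Count how often each node is named in an ISPN000196 error line."""
    counts = Counter()
    for line in log_lines:
        match = ISPN000196.search(line)
        if match:
            counts[match.group("node")] += 1
    return counts


# The two error lines from this report, plus a "Caused by" line that must
# not be counted because it carries no ISPN000196 code.
sample = [
    "[JBossINF] 03:40:07,821 ERROR [org.infinispan.topology.ClusterTopologyManagerImpl] "
    "(transport-thread--p15-t5) ISPN000196: Failed to recover cluster state after the "
    "current node became the coordinator: org.infinispan.commons.CacheException: "
    "Unsuccessful response received from node perf20: CacheNotFoundResponse",
    "[JBossINF] 03:40:07,904 ERROR [org.infinispan.topology.ClusterTopologyManagerImpl] "
    "(transport-thread--p14-t15) ISPN000196: Failed to recover cluster state after the "
    "current node became the coordinator: java.util.concurrent.ExecutionException: "
    "org.infinispan.remoting.transport.jgroups.SuspectException: "
    "Cache not running on node perf20",
    "[JBossINF] Caused by: org.infinispan.remoting.transport.jgroups.SuspectException: "
    "Cache not running on node perf20",
]

if __name__ == "__main__":
    for node, count in nodes_blamed(sample).items():
        print(node, count)
```

      On the excerpt above, both errors point at perf20, the node that had just left the cluster.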
      

      Link:
      http://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/eap-7x-failover-singleton-deployment-jvmkill-random-election-policy/9/console-perf21/

                People

                • Assignee: Paul Ferraro (pferraro)
                • Reporter: Michal Vinkler (mvinkler)