Uploaded image for project: 'WildFly'
  1. WildFly
  2. WFLY-9890

"ISPN000476: Timed out waiting for responses for request X from node Y" immediately after node Y rejoins the cluster (failover)

    XMLWordPrintable

Details

    Description

      Seen in SF failover tests - scenarios:
      failover-http-granular-shutdown-repl-sync (replication granularity is set to ATTRIBUTE level)
      failover-http-session-jvmkill-repl-sync (replication granularity is set to SESSION level)

      Description: In the failover test, each server is shut down and after some time it rejoins the cluster. After joining the cluster again, the load is distributed to this node. It seems that first request for some session handled by that node produces ERROR in the log of some other server until the time another cluster topology change occurs.

      Stacktrace of the error:

      [JBossINF] [0m[31m10:18:00,256 ERROR [org.infinispan.interceptors.impl.InvocationContextInterceptor] (timeout-thread--p12-t1) ISPN000136: Error executing command PrepareCommand, writing keys [SessionCreationMetaDataKey(f25_wwXXpvtNC9QTHZUtSwhmEUmg-wI77RL_17b9), SessionAccessMetaDataKey(f25_wwXXpvtNC9QTHZUtSwhmEUmg-wI77RL_17b9), SessionAttributeKey(f25_wwXXpvtNC9QTHZUtSwhmEUmg-wI77RL_17b9[1])]: org.infinispan.util.concurrent.TimeoutException: ISPN000476: Timed out waiting for responses for request 30250 from perf18
      [JBossINF] 	at org.infinispan.remoting.transport.impl.MultiTargetRequest.onTimeout(MultiTargetRequest.java:163)
      [JBossINF] 	at org.infinispan.remoting.transport.AbstractRequest.call(AbstractRequest.java:86)
      [JBossINF] 	at org.infinispan.remoting.transport.AbstractRequest.call(AbstractRequest.java:21)
      [JBossINF] 	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
      [JBossINF] 	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
      [JBossINF] 	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
      [JBossINF] 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
      [JBossINF] 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
      [JBossINF] 	at java.lang.Thread.run(Thread.java:748)
      

      Client gets the stale data:

      2018/02/21 10:18:00:272 EST [WARN ][Runner - 16] HOST perf17.mw.lab.eng.bos.redhat.com:rootProcess:c - Error sampling data:  <org.jboss.smartfrog.loaddriver.RequestProcessingException: Stale session data received. Expected 60, received 59, Runner: 16>
              org.jboss.smartfrog.loaddriver.RequestProcessingException: Stale session data received. Expected 60, received 59, Runner: 16
      	at org.jboss.smartfrog.loaddriver.http.AbstractSerialNumberValidatorFactoryImpl$SerialNumberValidator.processRequest(AbstractSerialNumberValidatorFactoryImpl.java:133)
      	at org.jboss.smartfrog.loaddriver.CompoundRequestProcessorFactoryImpl$CompoundRequestProcessor.processRequest(CompoundRequestProcessorFactoryImpl.java:52)
      	at org.jboss.smartfrog.loaddriver.Runner.run(Runner.java:103)
      	at java.lang.Thread.run(Thread.java:748)
      
      

      Example - timeline:

      4 servers are running, 2000 clients, no issue in the beggining
      2018/02/21 10:16:43:234 EST [INFO ][TestController] HOST perf17.mw.lab.eng.bos.redhat.com:rootProcess:c - Node 0 (perf18) is down. 
      - still no issues
      2018/02/21 10:17:43:234 EST [INFO ][TestController] HOST perf17.mw.lab.eng.bos.redhat.com:rootProcess:c - Bringing back node 0 (perf18)
      2018/02/21 10:17:55:611 EST [INFO ][TestController] HOST perf17.mw.lab.eng.bos.redhat.com:rootProcess:c - Node 0 (perf18) is up.
      2018/02/21 10:18:00:272 EST [WARN ][Runner - 16] HOST perf17.mw.lab.eng.bos.redhat.com:rootProcess:c - Error sampling data:  <org.jboss.smartfrog.loaddriver.RequestProcessingException: Stale session data received. Expected 60, received 59, Runner: 16>
      

      perf20 started logging the errors meanwhile and stopped only after perf19 was shut down:

      [JBossINF] [0m[0m10:17:50,227 INFO  [org.infinispan.CLUSTER] (thread-19,ejb,perf20) ISPN000094: Received new cluster view for channel ejb: [perf19|5] (4) [perf19, perf20, perf21, perf18]
      [JBossINF] [0m[0m10:17:50,228 INFO  [org.infinispan.CLUSTER] (thread-19,ejb,perf20) ISPN100000: Node perf18 joined the cluster
      [JBossINF] [0m[31m10:18:00,256 ERROR [org.infinispan.interceptors.impl.InvocationContextInterceptor] (timeout-thread--p12-t1) ISPN000136: Error executing command PrepareCommand, writing keys [SessionCreationMetaDataKey(f25_wwXXpvtNC9QTHZUtSwhmEUmg-wI77RL_17b9), SessionAccessMetaDataKey(f25_wwXXpvtNC9QTHZUtSwhmEUmg-wI77RL_17b9), SessionAttributeKey(f25_wwXXpvtNC9QTHZUtSwhmEUmg-wI77RL_17b9[1])]: org.infinispan.util.concurrent.TimeoutException: ISPN000476: Timed out waiting for responses for request 30250 from perf18
      [JBossINF] 	at org.infinispan.remoting.transport.impl.MultiTargetRequest.onTimeout(MultiTargetRequest.java:163)
      [JBossINF] 	at org.infinispan.remoting.transport.AbstractRequest.call(AbstractRequest.java:86)
      [JBossINF] 	at org.infinispan.remoting.transport.AbstractRequest.call(AbstractRequest.java:21)
      [JBossINF] 	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
      [JBossINF] 	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
      [JBossINF] 	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
      [JBossINF] 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
      [JBossINF] 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
      [JBossINF] 	at java.lang.Thread.run(Thread.java:748)
      [JBossINF] 
      
      

      There were no issues in this scenario in EAP 7.1.0.GA.

      Links:
      Clients: http://jenkins.hosts.mwqe.eng.bos.redhat.com/hudson/job/perflab_eap-7x-failover-http-granular-shutdown-repl-sync/25/console-perf17/
      perf18: http://jenkins.hosts.mwqe.eng.bos.redhat.com/hudson/job/perflab_eap-7x-failover-http-granular-shutdown-repl-sync/25/console-perf18/
      perf20: http://jenkins.hosts.mwqe.eng.bos.redhat.com/hudson/job/perflab_eap-7x-failover-http-granular-shutdown-repl-sync/25/console-perf20/

      Another test:
      perf19: http://jenkins.hosts.mwqe.eng.bos.redhat.com/hudson/job/perflab_eap-7x-failover-http-session-jvmkill-repl-sync/27/console-perf19/

      Attachments

        Issue Links

          Activity

            People

              pferraro@redhat.com Paul Ferraro
              mvinkler1@redhat.com Michal Vinkler
              Votes:
              0 Vote for this issue
              Watchers:
              6 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: