[ISPN-1704] IllegalStateException in surviving nodes during node crash in cluster

This issue belongs to an archived project.

      This bug appeared in EDG build 96: http://hudson.qa.jboss.com/hudson/view/EDG6/view/EDG-QE/job/edg-60-build-edg-from-source/96/artifact/edg-srcbuild.zip
      which contains Infinispan 5.1.0.CR3.

      Test scenario:

      1. start 4 nodes (distributed cache)
      2. wait 2 min
      3. kill node2
      4. wait 2 min
      5. start node2
      6. wait 2 min and end the test

      server side logs:
      http://hudson.qa.jboss.com/hudson/view/EDG6/view/EDG-QE/job/edg-60-experiments-mlinhard-perflab/8/artifact/report/serverlogs.zip
      client side logs:
      http://hudson.qa.jboss.com/hudson/view/EDG6/view/EDG-QE/job/edg-60-experiments-mlinhard-perflab/8/console-perf05/consoleText
      http://hudson.qa.jboss.com/hudson/view/EDG6/view/EDG-QE/job/edg-60-experiments-mlinhard-perflab/8/console-perf07/consoleText

      After node2 crashed, no further requests succeeded; most of them ended with this error:

      ERROR [org.infinispan.interceptors.InvocationContextInterceptor] (HotRodServerWorker-1-43) ISPN000136: Execution error: java.lang.IllegalStateException: Trying to release state transfer shared lock without acquiring it first
      

      Before the error showed up on the client side, the requests had been blocked for around 1.5 min.


            Dan Berindei (Inactive) added a comment - No longer relevant since NBST landed.

            Dan Berindei (Inactive) added a comment - I just created ISPN-1799.

            Manik Surtani (Inactive) added a comment (edited) - @Dan, have you got a new JIRA for this, to revisit in 5.2? It would be good to link to it here.

            Dan Berindei (Inactive) added a comment - The method StateTransferLockImpl.waitForStateTransferToEnd() didn't have any way of signalling
            that it failed to re-acquire the state transfer lock.

            I've added a new exception, StateTransferLockReacquisitionException, but we'll have to revisit this for 5.2.
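
The failure mode described in this comment can be illustrated with a minimal, self-contained Java sketch. This is not Infinispan's code: `ToySharedLock`, `ReacquisitionFailedException`, and `runCommand` are hypothetical names invented for the illustration, the lock is single-threaded, and a failed `tryAcquireShared()` stands in for a re-acquisition that timed out. Only the exception message is taken from the log line in the report. The sketch assumes that the caller releases the shared lock in a `finally` block, which is one plausible way the reported IllegalStateException could arise.

```java
/** Hypothetical stand-in for the new exception mentioned above; not the Infinispan class. */
class ReacquisitionFailedException extends RuntimeException {
    ReacquisitionFailedException(String message) {
        super(message);
    }
}

/** Toy single-threaded shared lock that counts holds (an assumption, not Infinispan's design). */
class ToySharedLock {
    private int holds;
    private boolean transferInProgress;

    /** Simulates a rehash starting, e.g. after a node crash. */
    void beginStateTransfer() {
        transferInProgress = true;
    }

    /** Returns false while a state transfer is in progress (stands in for a timed-out wait). */
    boolean tryAcquireShared() {
        if (transferInProgress) {
            return false;
        }
        holds++;
        return true;
    }

    void releaseShared() {
        if (holds == 0) {
            throw new IllegalStateException(
                "Trying to release state transfer shared lock without acquiring it first");
        }
        holds--;
    }
}

class StateTransferLockSketch {
    /** Silent variant: the caller cannot tell that re-acquisition failed. */
    static void waitSilently(ToySharedLock lock) {
        lock.releaseShared();
        lock.tryAcquireShared(); // result ignored on purpose to show the problem
    }

    /** Signalling variant: a failed re-acquisition is reported with a dedicated exception. */
    static void waitOrThrow(ToySharedLock lock) {
        lock.releaseShared();
        if (!lock.tryAcquireShared()) {
            throw new ReacquisitionFailedException("could not re-acquire the shared lock");
        }
    }

    /** Runs one simulated command; a state transfer starts while the lock is held. */
    static String runCommand(ToySharedLock lock, boolean signalFailure) {
        if (!lock.tryAcquireShared()) {
            return "not acquired";
        }
        boolean held = true;
        try {
            lock.beginStateTransfer();
            if (signalFailure) {
                waitOrThrow(lock);
            } else {
                waitSilently(lock);
            }
            return "ok";
        } catch (ReacquisitionFailedException e) {
            held = false; // the lock is known to be gone, so skip the release
            return "reacquisition failed";
        } finally {
            if (held) {
                lock.releaseShared(); // throws in the silent variant: nothing is held any more
            }
        }
    }

    public static void main(String[] args) {
        System.out.println(runCommand(new ToySharedLock(), true));
        try {
            runCommand(new ToySharedLock(), false);
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

In the silent variant the caller still believes it holds the lock and releases it a second time, which triggers the IllegalStateException. In the signalling variant the dedicated exception lets the caller skip the release and fail the command cleanly.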

            Michal Linhard (Inactive) added a comment - I ran it 5 times with 5.1.0.CR4 and the same settings that reproduced it on 5.1.0.CR3, but the error did not reappear.

            Michal Linhard (Inactive) added a comment - I managed to reproduce it once again with pure Infinispan 5.1.0.CR3 (four HotRod servers test):
            http://hudson.qa.jboss.com/hudson/view/EDG6/view/EDG-QE/job/edg-60-experiments-mlinhard-perflab/32/artifact/report/serverlogs.zip

            Michal Linhard (Inactive) added a comment - Now the challenge is to repeat it with TRACE logging.

            Michal Linhard (Inactive) added a comment - It appeared in the log again (for 5.1.0.CR4), though in a very different situation;
            see node04.log in http://hudson.qa.jboss.com/hudson/view/EDG6/view/EDG-QE/job/edg-60-experiments-mlinhard-perflab/25/artifact/report/serverlogs.zip

            Previously it happened in StateTransferLockInterceptor.visitPutKeyValueCommand, many times, starting shortly after the node crash.
            Now it's in StateTransferLockInterceptor.visitPrepareCommand and happened only 2 times, at the end of the test.

            Manik Surtani (Inactive) added a comment - Thanks for keeping an eye on this one, Dan.

            Michal Linhard (Inactive) added a comment - Hmm, one more idea: I'll try setting rehashWait to 5 sec and see if that increases the chance of the exception, because it might be connected with StateTransferInProgressException.

              dberinde@redhat.com Dan Berindei (Inactive)
              mlinhard Michal Linhard (Inactive)
              Archiver:
              rhn-support-adongare Amol Dongare
