[ISPN-6388] Spark integration - TimeoutException: Replication timeout on application execution


    • Type: Bug
    • Resolution: Done
    • Priority: Major
    • Affects Version: 8.2.0.Final
    • Component: Analytics

      The issue occurs sporadically while the application is executing (e.g. the WordCount example). To some degree it seems to be affected by the number of partitions used (i.e. the higher the count, the less likely the issue is to occur).

      Using an 8-node cluster (1 Spark worker / 1 ISPN server per physical node), connector v. 0.2.

      Sample driver, server, and application logs attached.

        1. app_0.txt
          1.34 MB
        2. driver.txt
          896 kB
        3. server.txt
          175 kB
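The partition effect noted in the description can be illustrated with a small sketch (plain Java; this is not the connector's actual partitioning code and the names are hypothetical): Infinispan hashes keys into a fixed number of segments (256 by default), and when segment ranges are spread over Spark partitions, more partitions mean fewer segments fetched per task, so each iteration request is smaller and less exposed to a replication timeout.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of distributing Infinispan hash segments over
// Spark partitions. Not the connector's real code; names are made up.
public class SegmentSplit {

    // Distribute numSegments segment ids as evenly as possible over numPartitions.
    static List<List<Integer>> split(int numSegments, int numPartitions) {
        List<List<Integer>> partitions = new ArrayList<>();
        for (int p = 0; p < numPartitions; p++) {
            partitions.add(new ArrayList<>());
        }
        for (int s = 0; s < numSegments; s++) {
            partitions.get(s % numPartitions).add(s);
        }
        return partitions;
    }

    public static void main(String[] args) {
        // 256 segments (Infinispan's default) over 8 vs. 64 partitions:
        System.out.println(split(256, 8).get(0).size());  // 32 segments per task
        System.out.println(split(256, 64).get(0).size()); // 4 segments per task
    }
}
```

With only 8 partitions each task must iterate 32 segments per request, so a single slow or failed server affects far more in-flight work than with 64 partitions, which is consistent with the reporter's observation that a higher partition count makes the issue less likely.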


            Matej Čimbora (Inactive) added a comment:

            Closing for now, will reopen if the issue occurs again.

            Matej Čimbora (Inactive) added a comment:

            Indeed, running with ISPN 8.2.1.Final & Spark connector 0.3 shows no problems.

            Gustavo Fernandes (Inactive) added a comment:

            Here's my theory of what happened in your test.

            There were failures during the iteration: either a server was down or it stopped responding for some reason, possibly due to GC (the exact reason does not matter). When such a failure occurs, the client retries with the segments that were not yet completed. The logs show you were using Hot Rod client version 8.1.0.Final, which was affected by https://issues.jboss.org/browse/ISPN-6234: after a failover it would retry with the wrong segments. Because the segments were wrong, the iteration was not confined to the local server that was contacted, causing remote RPCs to obtain the segments and ultimately provoking a cascade effect that resulted in timeouts.

            I believe the timeouts should no longer occur (I was not able to reproduce them). Could you test again with Infinispan 8.2.1.Final (both client and server) and Spark connector 0.3?
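The failure mode described in that theory can be sketched in a few lines of plain Java (a simplification, not Infinispan code): if a failover retry asks a server for segments it does not own, every non-owned segment forces a remote RPC from that server, which is where the cascade starts.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch of the ISPN-6234 failure mode; not real Infinispan code.
// On failover, the client should retry only the segments that were not yet
// completed. The 8.1.0.Final bug made it retry with a wrong segment set, so
// the contacted server no longer owned everything it was asked for.
public class FailoverRetrySketch {

    // Segments in the request that the retry target does not own locally;
    // each of these forces a remote RPC from the target to another node.
    static Set<Integer> remoteFetchesNeeded(Set<Integer> requested, Set<Integer> ownedByTarget) {
        Set<Integer> remote = new HashSet<>(requested);
        remote.removeAll(ownedByTarget);
        return remote;
    }

    public static void main(String[] args) {
        Set<Integer> ownedByTarget = Set.of(0, 1, 2, 3);
        Set<Integer> correctRetry = Set.of(2, 3);        // incomplete segments, all owned locally
        Set<Integer> buggyRetry = Set.of(2, 3, 10, 11);  // wrong segment set after failover

        System.out.println(remoteFetchesNeeded(correctRetry, ownedByTarget)); // empty: fully local
        System.out.println(remoteFetchesNeeded(buggyRetry, ownedByTarget));   // 10 and 11 go remote
    }
}
```

Under load, each such remote fetch competes with regular replication traffic, which matches the TimeoutException: Replication timeout cascade seen in the logs.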

            Matej Čimbora (Inactive) added a comment:

            I looked into the issue some time ago, but couldn't finish it due to a context switch. DistributedCacheStream.rehashAwareIteration shows multiple stayLocal=false evaluations.
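Roughly speaking, the stayLocal decision referenced in that comment can be modeled as a subset check (a deliberate simplification, not DistributedCacheStream's actual logic): the stream can stay local only when the local node owns every requested segment, so repeated stayLocal=false evaluations mean the node kept being asked for segments it did not own, consistent with the wrong-segments theory above.

```java
import java.util.Set;

// Simplified model of a "stay local" check, loosely inspired by
// DistributedCacheStream.rehashAwareIteration; not the real implementation.
public class StayLocalCheck {

    static boolean stayLocal(Set<Integer> requestedSegments, Set<Integer> localSegments) {
        // The iteration can be served locally only if the local node
        // owns every segment the request asks for.
        return localSegments.containsAll(requestedSegments);
    }

    public static void main(String[] args) {
        Set<Integer> local = Set.of(0, 1, 2, 3);
        System.out.println(stayLocal(Set.of(1, 2), local));    // prints true
        System.out.println(stayLocal(Set.of(1, 200), local));  // prints false: segment 200 is remote
    }
}
```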

              gfernand@redhat.com Gustavo Fernandes (Inactive)
              mcimbora_jira Matej Čimbora (Inactive)
              Archiver:
              rhn-support-adongare Amol Dongare
