[ISPN-6388] Spark integration - TimeoutException: Replication timeout on application execution


    • Type: Bug
    • Resolution: Done
    • Priority: Major
    • Affects Version: 8.2.0.Final
    • Component: Analytics

      The issue occurs sporadically while the application is executing (e.g. the WordCount example). To some degree it seems to be affected by the number of partitions used (i.e. the higher the count, the less likely the issue is to occur).

      Using an 8-node cluster (1 Spark worker / 1 ISPN server per physical node), connector v. 0.2.

      Sample driver, server, and application logs attached.

        1. app_0.txt
          1.34 MB
        2. driver.txt
          896 kB
        3. server.txt
          175 kB
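The partition effect noted in the description can be illustrated with a small sketch (plain Java; this is not the connector's actual partitioning code and the names are hypothetical): Infinispan hashes keys into a fixed number of segments (256 by default), and when segment ranges are spread over Spark partitions, more partitions mean fewer segments fetched per task, so each iteration request is smaller and less exposed to a replication timeout.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of distributing Infinispan hash segments over
// Spark partitions. Not the connector's real code; names are made up.
public class SegmentSplit {

    // Distribute numSegments segment ids as evenly as possible over numPartitions.
    static List<List<Integer>> split(int numSegments, int numPartitions) {
        List<List<Integer>> partitions = new ArrayList<>();
        for (int p = 0; p < numPartitions; p++) {
            partitions.add(new ArrayList<>());
        }
        for (int s = 0; s < numSegments; s++) {
            partitions.get(s % numPartitions).add(s);
        }
        return partitions;
    }

    public static void main(String[] args) {
        // 256 segments (Infinispan's default) over 8 vs. 64 partitions:
        System.out.println(split(256, 8).get(0).size());  // 32 segments per task
        System.out.println(split(256, 64).get(0).size()); // 4 segments per task
    }
}
```

With only 8 partitions each task must iterate 32 segments per request, so a single slow or failed server affects far more in-flight work than with 64 partitions, which is consistent with the reporter's observation that a higher partition count makes the issue less likely.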


            Matej Čimbora (Inactive) added a comment:

            Closing for now, will reopen if the issue occurs again.

            Matej Čimbora (Inactive) added a comment:

            Indeed, running with ISPN 8.2.1.Final & Spark connector 0.3 shows no problems.

            Gustavo Fernandes (Inactive) added a comment:

            Here's my theory of what happened in your test.

            There were failures during the iteration: either a server was down or it stopped responding for some reason, possibly due to GC (the exact reason does not matter). When such a failure occurs, the client retries with the segments that were not yet completed. The logs show you were using Hot Rod client version 8.1.0.Final, which was affected by https://issues.jboss.org/browse/ISPN-6234: after a failover it would retry with the wrong segments. Because the segments were wrong, the iteration was not confined to the local server that was contacted, causing remote RPCs to obtain the segments and ultimately provoking a cascade effect that resulted in timeouts.

            I believe the timeouts should no longer occur (I was not able to reproduce them). Could you test again with Infinispan 8.2.1.Final (both client and server) and Spark connector 0.3?
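The failure mode described in that theory can be sketched in a few lines of plain Java (a simplification, not Infinispan code): if a failover retry asks a server for segments it does not own, every non-owned segment forces a remote RPC from that server, which is where the cascade starts.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch of the ISPN-6234 failure mode; not real Infinispan code.
// On failover, the client should retry only the segments that were not yet
// completed. The 8.1.0.Final bug made it retry with a wrong segment set, so
// the contacted server no longer owned everything it was asked for.
public class FailoverRetrySketch {

    // Segments in the request that the retry target does not own locally;
    // each of these forces a remote RPC from the target to another node.
    static Set<Integer> remoteFetchesNeeded(Set<Integer> requested, Set<Integer> ownedByTarget) {
        Set<Integer> remote = new HashSet<>(requested);
        remote.removeAll(ownedByTarget);
        return remote;
    }

    public static void main(String[] args) {
        Set<Integer> ownedByTarget = Set.of(0, 1, 2, 3);
        Set<Integer> correctRetry = Set.of(2, 3);        // incomplete segments, all owned locally
        Set<Integer> buggyRetry = Set.of(2, 3, 10, 11);  // wrong segment set after failover

        System.out.println(remoteFetchesNeeded(correctRetry, ownedByTarget)); // empty: fully local
        System.out.println(remoteFetchesNeeded(buggyRetry, ownedByTarget));   // 10 and 11 go remote
    }
}
```

Under load, each such remote fetch competes with regular replication traffic, which matches the TimeoutException: Replication timeout cascade seen in the logs.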

            Matej Čimbora (Inactive) added a comment:

            I looked into the issue some time ago, but couldn't finish it due to a context switch. DistributedCacheStream.rehashAwareIteration shows multiple stayLocal=false evaluations.
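Roughly speaking, the stayLocal decision referenced in that comment can be modeled as a subset check (a deliberate simplification, not DistributedCacheStream's actual logic): the stream can stay local only when the local node owns every requested segment, so repeated stayLocal=false evaluations mean the node kept being asked for segments it did not own, consistent with the wrong-segments theory above.

```java
import java.util.Set;

// Simplified model of a "stay local" check, loosely inspired by
// DistributedCacheStream.rehashAwareIteration; not the real implementation.
public class StayLocalCheck {

    static boolean stayLocal(Set<Integer> requestedSegments, Set<Integer> localSegments) {
        // The iteration can be served locally only if the local node
        // owns every segment the request asks for.
        return localSegments.containsAll(requestedSegments);
    }

    public static void main(String[] args) {
        Set<Integer> local = Set.of(0, 1, 2, 3);
        System.out.println(stayLocal(Set.of(1, 2), local));    // prints true
        System.out.println(stayLocal(Set.of(1, 200), local));  // prints false: segment 200 is remote
    }
}
```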

              gfernand@redhat.com Gustavo Fernandes (Inactive)
              mcimbora_jira Matej Čimbora (Inactive)
              Archiver:
              rhn-support-adongare Amol Dongare
