[ISPN-3918] Inconsistent view of the cache with putIfAbsent in a non-tx cache during state transfer

Type: Bug
Resolution: Obsolete
Priority: Major
Fix Version/s: None
Affects Version/s: 6.0.0.Final
Component/s: Core, State Transfer
Labels:
- consistency

In a non-tx cache, sometimes it's possible for a get(k) to return null even though a previous putIfAbsent(k, v) returned a non-null value and the only concurrent operations on the cache are concurrent putIfAbsent calls.

Say [B, A, C] are the owners of k (C just joined)
1. A starts a putIfAbsent(k, v1) command, sends it to B
2. B forwards the command to A and C
3. C writes k=v1
4. C becomes the primary owner of k (owners are now [C, A])
5. A/B see the new topology before committing and throw an OutdatedTopologyException
6. A retries the command, sends it to C
7. C forwards the command to A, which writes k=v1
8. C doesn't have to update the entry, returns null

If, between steps 3 and 7, another thread on A starts a putIfAbsent(k, v2) command, the command will fail and return v1 (because the primary owner already has a value). However, a subsequent get(k) command will return null, because A is an owner and doesn't have the value.
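
The interleaving above can be replayed deterministically in a toy model. This is only a sketch: each node's data container is modelled as a plain `HashMap`, and the class and method names (`DescriptionReplay`, `replay`) are made up for illustration, not Infinispan APIs.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy replay of the steps above; each node's data container is a plain map. */
final class DescriptionReplay {

    /** Returns { result of putIfAbsent(k, v2) issued on A, result of the following get(k) on A }. */
    static String[] replay() {
        Map<String, String> a = new HashMap<>(); // data container of A
        Map<String, String> c = new HashMap<>(); // data container of C (B never commits k)

        // Steps 1-3: A sends putIfAbsent(k, v1) to B, B forwards it, C writes k=v1.
        c.putIfAbsent("k", "v1");

        // Steps 4-5: C becomes the primary owner; A and B throw
        // OutdatedTopologyException before committing, so A still has no entry.
        Map<String, String> primary = c;

        // Between steps 3 and 7: another thread on A issues putIfAbsent(k, v2).
        // The primary already holds v1, so the command fails and returns v1.
        String putResult = primary.putIfAbsent("k", "v2");

        // The subsequent get(k) is served from A's own container, because A is an owner.
        String getResult = a.get("k");

        // Steps 6-7: only now does the retried command write k=v1 on A.
        a.putIfAbsent("k", "v1");

        return new String[] {putResult, getResult};
    }

    public static void main(String[] args) {
        String[] r = replay();
        System.out.println("putIfAbsent(k, v2) = " + r[0] + ", get(k) = " + r[1]);
    }
}
```

In this model the failed putIfAbsent returns v1 while the get that follows it on A returns null, which is the inconsistency reported here.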

Attachments:
- NonTxPutIfAbsentDuringLeaveStressTest.testNodeLeavingDuringPutIfAbsent_8.log.gz (187 kB, 2017/08/09 7:30 AM)
- NonTxPutIfAbsentDuringRebalanceStressTest.testPutIfAbsentDuringJoin_1.log.gz (261 kB, 2017/08/09 11:24 AM)
- ntpiadjst.log.gz (301 kB, 2014/01/22 6:45 AM)

is related to:
- ISPN-7638 Observing non-final values on backup owner (Closed)
- ISPN-7590 Invocation idempotence (Closed)

relates to:
- ISPN-4286 Two concurrent putIfAbsent operations can both return null during rebalance (Closed)
- ISPN-6451 NonTxPutIfAbsentDuringLeaveStressTest.testNodeLeavingDuringPutIfAbsent still fails randomly (Closed)
- ISPN-2956 putIfAbsent on Hot Rod Java client doesn't reliably fulfil contract (Closed)

Pedro Ruivo added a comment - 2017/08/25 9:15 AM

If the primary replies to the originator with a regular message, the reply will be ordered with the backup command, which solves the problem when the topology is stable.
If the topology changes during the putIfAbsent, we would need versioning to keep track of which command invocation id generated which version (and return value), in order to decide whether the command should be handled or not.
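
A minimal single-node sketch of that bookkeeping, assuming every command carries a unique invocation id that survives retries. `InvocationTrackingStore`, `Outcome` and the string ids are hypothetical names for illustration; this is not how Infinispan implements it.

```java
import java.util.HashMap;
import java.util.Map;

/** Remembers, per invocation id, which version the command produced or saw and what it returned. */
final class InvocationTrackingStore {

    private static final class Outcome {
        final long version;
        final String returnValue;

        Outcome(long version, String returnValue) {
            this.version = version;
            this.returnValue = returnValue;
        }
    }

    private final Map<String, String> data = new HashMap<>();
    private final Map<String, Long> versions = new HashMap<>();
    private final Map<String, Outcome> outcomes = new HashMap<>();

    synchronized String putIfAbsent(String key, String value, String invocationId) {
        Outcome seen = outcomes.get(invocationId);
        if (seen != null) {
            // Retry of a command that was already handled: replay its original
            // return value instead of evaluating it against its own write.
            return seen.returnValue;
        }
        String previous = data.putIfAbsent(key, value);
        long version = (previous == null)
                ? versions.merge(key, 1L, Long::sum)  // this command created a new version
                : versions.getOrDefault(key, 0L);     // this command only observed one
        outcomes.put(invocationId, new Outcome(version, previous));
        return previous;
    }

    /** Version produced or observed by the given invocation, or -1 if it was never handled. */
    synchronized long versionOf(String invocationId) {
        Outcome seen = outcomes.get(invocationId);
        return seen == null ? -1L : seen.version;
    }

    synchronized String get(String key) {
        return data.get(key);
    }
}
```

With this bookkeeping a retried putIfAbsent(k, v1) returns null again rather than v1. The sketch never expires recorded outcomes, which a real implementation would have to do.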

Dan Berindei (Inactive) added a comment - 2017/08/10 4:21 AM

Indeed, fixed.

Radim Vansa (Inactive) added a comment - 2017/08/10 4:17 AM

Dan, at point 29, the data container (DC) on B should contain v3, shouldn't it? Not that it would change much...

Dan Berindei (Inactive) added a comment - 2017/08/10 4:06 AM - edited

I found another failure mode while trying to work around this issue in NonTxPutIfAbsentDuringLeaveStressTest (I was also trying to merge it with NonTxPutIfAbsentDuringJoinStressTest into a NonTxPutIfAbsentDuringRebalanceStressTest, so the stack traces in the attached log NonTxPutIfAbsentDuringRebalanceStressTest.testPutIfAbsentDuringJoin_1.log.gz don't match master).

Say owners(k) = AB in topology t, and owners(k) = BA in topology t+1. In the following scenario, putIfAbsent fails in thread C-app1 and C-app2 with different values:

1. C-app1: start putIfAbsent(k, v1)
2. C-app1: send putIfAbsent(k, v1) to A (primary)
3. A-remote1: receive putIfAbsent(k, v1)
4. A-remote1: send backup request for putIfAbsent(k, v1) to B
5. A-remote1: write k=v1
6. C-remote1: receive null for putIfAbsent(k, v1)
7. C-app2: start putIfAbsent(k, v2)
8. C-app2: send putIfAbsent(k, v2) to A (primary)
9. A-remote2: receive putIfAbsent(k, v2)
10. A-remote2: read v1 from data container, fail the command
11. C-remote2: receive v1 for putIfAbsent(k, v2)
12. C-app2: return v1 (put was unsuccessful)
13. B-remote1: install topology t+1, B is now primary owner
14. B-remote1: receive backup request for putIfAbsent(k, v1)
15. B-remote1: check topology, send OutdatedTopologyException ack to C
16. C-remote3: install topology t+1
17. C-app3: start putIfAbsent(k, v3)
18. C-app3: send putIfAbsent(k, v3) to B (primary)
19. B-remote3: receive putIfAbsent(k, v3)
20. B-remote3: send backup request for putIfAbsent(k, v3) to A
21. B-remote3: write k=v3
22. C-remote3: receive null for putIfAbsent(k, v3)
23. A-remote3: receive backup request for putIfAbsent(k, v3)
24. A-remote3: write k=v3
25. A-remote3: send backup ack for putIfAbsent(k, v3) to C
26. C-remote3: receive backup ack for putIfAbsent(k, v3)
27. C-app3: return null (put was successful)
28. B-remote1: retry, B is now primary owner
29. B-remote1: read v3 from data container, fail the command
30. C-remote1: receive v3 for putIfAbsent(k, v1)
31. C-app1: return v3 (put was unsuccessful)

I think both this scenario and the previous one are worse than the initial report of seeing null, because we're not respecting the READ_COMMITTED isolation level (a transaction is seeing a value that was never committed).
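
One way to state what goes wrong is as a check over the values returned to the application threads. The sketch below assumes a key that starts out absent and is never removed, so exactly one putIfAbsent may succeed and every failed call must return the winner's value. `PutIfAbsentHistoryCheck` is a hypothetical helper, not part of the stress tests.

```java
import java.util.HashMap;
import java.util.Map;

/** Checks the results of concurrent putIfAbsent calls on one initially absent key. */
final class PutIfAbsentHistoryCheck {

    /**
     * @param results maps each proposed value to what its putIfAbsent call returned
     *                (null means the put was successful)
     */
    static boolean consistent(Map<String, String> results) {
        String winner = null;
        int successes = 0;
        for (Map.Entry<String, String> e : results.entrySet()) {
            if (e.getValue() == null) {
                winner = e.getKey();
                successes++;
            }
        }
        if (successes != 1) {
            return false; // no winner, or more than one
        }
        for (String returned : results.values()) {
            if (returned != null && !returned.equals(winner)) {
                return false; // a failed call saw a value that never won
            }
        }
        return true;
    }

    public static void main(String[] args) {
        Map<String, String> scenario = new HashMap<>();
        scenario.put("v1", "v3");  // C-app1 returned v3
        scenario.put("v2", "v1");  // C-app2 returned v1
        scenario.put("v3", null);  // C-app3 returned null
        System.out.println("consistent = " + consistent(scenario));
    }
}
```

For the scenario above the check fails, because C-app2 was told that v1 was already present although v3 is the only value whose put succeeded.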

Dan Berindei (Inactive) added a comment - 2017/08/09 7:23 AM

The situation is even worse when the topology changes: a get after a failed putIfAbsent can return not only null, but also a completely different value.

Say owners(k) = AB, and there is a topology change but the owners of k stay the same. In the following scenario, thread B-app3 first sees putIfAbsent(k, v3) = v2, and then get(k) = v1 (B-appX means application thread X on B, and A-remoteX means remote thread X on A):

B-app1: start putIfAbsent(k, v1)
B-app1: send putIfAbsent(k, v1) to A (primary)
A-remote1: receive putIfAbsent(k, v1)
A-remote1: send backup request for putIfAbsent(k, v1) to B
B-remote1: receive backup request for putIfAbsent(k, v1)
B-remote1: write k=v1
B-remote1: send backup ack for putIfAbsent(k, v1)
A-remote1: check topology, send OutdatedTopologyException back to B
A-app2: start putIfAbsent(k, v2)
A-app2: send backup request for putIfAbsent(k, v2) to B
A-app2: write k=v2
B-app3: start putIfAbsent(k, v3)
B-app3: send putIfAbsent(k, v3) to A (primary)
A-remote3: receive putIfAbsent(k, v3)
A-remote3: read v2 from data container, fail the command
B-remote3: receive v2 for putIfAbsent(k, v3)
B-app3: return v2 for putIfAbsent(k, v3) (put was unsuccessful)
B-app3: start get(k)
B-app3: read v1 from local container
B-app3: return v1 for get(k)
B-remote2: receive putIfAbsent(k, v2) backup command
B-remote2: write k=v2
B-remote2: send backup ack for putIfAbsent(k, v2)
A-remote2: receive backup ack for putIfAbsent(k, v2)
A-app2: return null for putIfAbsent(k, v2) (put was successful)
B-remote1: receive OutdatedTopologyException for putIfAbsent(k, v1)
B-remote1: retry, send putIfAbsent(k, v1) to A (primary)
A-remote1: receive putIfAbsent(k, v1)
A-remote1: read v2 from data container, fail the command
B-remote1: receive v2 for putIfAbsent(k, v1)
B-app1: return v2 for putIfAbsent(k, v1) (put was unsuccessful)

Radim Vansa (Inactive) added a comment - 2017/03/20 7:43 AM

This situation can happen with the triangle algorithm even without any topology change: because the primary does not hold the lock during replication, a second putIfAbsent may fail before the first putIfAbsent is executed on the backup.
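
A toy model of that window, assuming the primary applies a write locally and only queues the backup command without holding a lock until it is delivered. `TriangleOrderingSketch` and its methods are illustrative names, not Infinispan classes.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

/** Primary and backup containers with an explicit queue of undelivered backup commands. */
final class TriangleOrderingSketch {
    private final Map<String, String> primary = new HashMap<>();
    private final Map<String, String> backup = new HashMap<>();
    private final Queue<Runnable> inFlight = new ArrayDeque<>();

    /** Runs on the primary: applies locally, then queues the backup write without waiting. */
    String putIfAbsent(String key, String value) {
        String previous = primary.putIfAbsent(key, value);
        if (previous == null) {
            inFlight.add(() -> backup.put(key, value));
        }
        return previous;
    }

    /** A read served by the backup owner from its local container. */
    String getOnBackup(String key) {
        return backup.get(key);
    }

    /** Delivers every queued backup command. */
    void deliverAll() {
        Runnable command;
        while ((command = inFlight.poll()) != null) {
            command.run();
        }
    }
}
```

Calling `putIfAbsent("k", "v1")` and then `putIfAbsent("k", "v2")` makes the second call fail with v1, yet `getOnBackup("k")` still returns null until `deliverAll()` has run.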

Dan Berindei (Inactive) added a comment - 2015/06/17 6:35 AM

FORCE_WRITE_LOCK doesn't work in non-tx caches.

Radim Vansa (Inactive) added a comment - 2014/11/13 5:13 AM

A possible workaround is to use the FORCE_WRITE_LOCK flag for the get() operation.

Dan Berindei (Inactive) added a comment - 2014/01/22 6:45 AM

This is causing a random failure in NonTxPutIfAbsentDuringJoinStressTest.

Assignee: Unassigned
Reporter: Dan Berindei (Inactive)
Archiver: Amol Dongare
Created: 2014/01/22 6:44 AM
Updated: 2023/05/25 1:38 PM
Resolved: 2023/05/25 1:38 PM
Archived: 2024/11/28 6:21 AM

Comment of 2017/08/10 4:06 AM edited by Dan Berindei - 2017/08/10 4:20 AM