Infinispan / ISPN-9187

Lost segments logged when node leaves during rebalance


Details

    • Type: Bug
    • Resolution: Done
    • Priority: Major
    • Fix Version/s: 9.3.0.Final
    • Affects Version/s: 9.2.3.Final, 9.3.0.Beta1
    • Component/s: Core
    • Labels: None

    Description

      When a node leaves during rebalance, we remove the leaver from both the current CH and the pending CH with ConsistentHashFactory.updateMembers(). However, since the 4-phase rebalance changes, joiners are removed from the pending CH as well.
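
      A minimal sketch of that failure mode (plain Java collections stand in for the real ConsistentHash and Address types; the class and method bodies below are illustrative, not Infinispan's code): the member list used to update the pending CH lacks the joiners, so updateMembers() drops a joiner together with the leaver, and a segment owned only by those two ends up with arbitrary replacement owners.

      import java.util.*;

      // Toy model of the leave handling described above. updateMembers() is a
      // simplified stand-in for ConsistentHashFactory.updateMembers().
      public class UpdateMembersSketch {
         static final int NUM_OWNERS = 2;

         // A consistent hash reduced to its routing table: segment -> owner list.
         static Map<Integer, List<String>> updateMembers(Map<Integer, List<String>> ch,
                                                         List<String> newMembers) {
            Map<Integer, List<String>> updated = new HashMap<>();
            ch.forEach((segment, owners) -> {
               List<String> remaining = new ArrayList<>(owners);
               remaining.retainAll(newMembers); // drop every node missing from newMembers
               if (remaining.isEmpty()) {
                  // All owners are gone, so arbitrary replacement owners are
                  // elected - exactly what happens to segment 239 in the log.
                  remaining.addAll(newMembers.subList(0, Math.min(NUM_OWNERS, newMembers.size())));
               }
               updated.put(segment, remaining);
            });
            return updated;
         }

         public static void main(String[] args) {
            // Pending CH before the leave: segment 239 owned by leaver B and joiner D.
            Map<Integer, List<String>> pendingCH = Map.of(239, List.of("B", "D"));
            // The member list used to update the pending CH contains neither the
            // leaver nor the joiners, per the description above.
            List<String> membersWithoutJoiners = List.of("C", "A");
            // Prints {239=[C, A]}: both real owners lost, random owners elected.
            System.out.println(updateMembers(pendingCH, membersWithoutJoiners));
         }
      }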

      In the following log, numOwners=2 and the pending CH had one segment (239) owned by a joiner (D) and a leaver (B). The updated pending CH contains neither B nor D, which means both owners are lost and random owners (C and A) are elected. A sees that it was allocated new segments by updateMembers and logs that the segment has been lost:

      08:53:01,528 INFO  (remote-thread-test-NodeA-p2-t6:[cluster-listener]) [CLUSTER] [Context=cluster-listener] ISPN100002: Starting rebalance with members [test-NodeA-15001, test-NodeB-55628, test-NodeC-62395, test-NodeD-29215], phase READ_OLD_WRITE_ALL, topology id 10
      08:53:01,554 TRACE (transport-thread-test-NodeD-p28-t2:[Topology-cluster-listener]) [CacheTopology] Current consistent hash's routing table: test-NodeA-15001 primary: {... 237-238 244-245 248-250 252 254}, backup: {... 235-236 243 246-247 251 253}
        test-NodeB-55628 primary: {... 231 234 240 242}, backup: {... 232-233 239 241 255}
        test-NodeC-62395 primary: {... 232-233 235-236 239 241 243 246-247 251 253 255}, backup: {... 231 234 237-238 240 242 244-245 248-250 252 254}
      08:53:01,554 TRACE (transport-thread-test-NodeD-p28-t2:[Topology-cluster-listener]) [CacheTopology] Pending consistent hash's routing table: test-NodeA-15001 primary: {... 237-238 245 248 252}, backup: {... 235-236 244 246-247 251}
        test-NodeB-55628 primary: {... 231 240 242}, backup: {... 230 239 241}
        test-NodeC-62395 primary: {... 232-233 235-236 241 243 246-247 251 253 255}, backup: {... 231 234 240 242 245 249-250 252 254}
        test-NodeD-29215 primary: {... 234 239 244 249-250 254}, backup: {... 232-233 237-238 243 248 253 255}
      
      08:53:01,606 TRACE (remote-thread-test-NodeA-p2-t5:[cluster-listener]) [ClusterCacheStatus] Removed node test-NodeB-55628 from cache cluster-listener: members = [test-NodeA-15001, test-NodeC-62395, test-NodeD-29215], joiners = [test-NodeD-29215]
      08:53:01,611 TRACE (remote-thread-test-NodeA-p2-t5:[cluster-listener]) [CacheTopology] Current consistent hash's routing table: test-NodeA-15001 primary: {... 237-238 244-245 248-250 252 254}, backup: {... 235-236 243 246-247 251 253}
        test-NodeC-62395 primary: {... 230-236 239-243 246-247 251 253 255}, backup: {... 237-238 244-245 248-250 252 254}
      08:53:01,611 TRACE (remote-thread-test-NodeA-p2-t5:[cluster-listener]) [CacheTopology] Pending consistent hash's routing table: test-NodeA-15001 primary: {... 237-238 244-245 248 252}, backup: {... 235-236 239 246-247 251}
        test-NodeC-62395 primary: {... 227-236 239-243 246-247 249-251 253-255}, backup: {... 226 245 252}
      
      08:53:01,613 TRACE (transport-thread-test-NodeA-p4-t1:[Topology-cluster-listener]) [StateTransferManagerImpl] Installing new cache topology CacheTopology{id=11, phase=READ_OLD_WRITE_ALL, rebalanceId=4, currentCH=DefaultConsistentHash{ns=256, owners = (2)[test-NodeA-15001: 134+45, test-NodeC-62395: 122+50]}, pendingCH=DefaultConsistentHash{ns=256, owners = (2)[test-NodeA-15001: 129+45, test-NodeC-62395: 127+28]}, unionCH=DefaultConsistentHash{ns=256, owners = (2)[test-NodeA-15001: 134+62, test-NodeC-62395: 122+68]}, actualMembers=[test-NodeA-15001, test-NodeC-62395, test-NodeD-29215], persistentUUIDs=[0506cc27-9762-4703-ad56-6a3bf7953529, c365b93f-e46c-4f11-ab46-6cafa2b2d92b, d3f21b0d-07f2-4089-b160-f754e719de83]} on cache cluster-listener
      08:53:01,662 TRACE (transport-thread-test-NodeA-p4-t1:[Topology-cluster-listener]) [StateConsumerImpl] On cache cluster-listener we have: new segments: {1-18 21 30-53 56-63 67 69-73 75-78 81-85 88-101 107-115 117-118 125-134 136-139 142-156 158-165 168-170 172-174 177-179 181-188 193-211 214-226 228-229 235-239 243-254}; old segments: {1-15 30-53 56-63 69-73 75-77 81-85 88-101 107-115 117-118 125-131 136-139 142-145 150-156 158-165 168-170 172-174 177-179 183-188 193-211 214-218 220-226 228-229 235-238 243-254}
      08:53:01,663 TRACE (transport-thread-test-NodeA-p4-t1:[Topology-cluster-listener]) [StateConsumerImpl] On cache cluster-listener we have: added segments: {16-18 21 67 78 132-134 146-149 181-182 219 239}; removed segments: {}
      08:53:01,663 DEBUG (transport-thread-test-NodeA-p4-t1:[Topology-cluster-listener]) [StateConsumerImpl] Not requesting segments {16-18 21 67 78 132-134 146-149 181-182 219 239} because the last owner left the cluster
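
      The last two trace lines show how A computes the segment delta and then gives up on requesting state. A rough model of that bookkeeping, with made-up names and only the set arithmetic taken from the log (the real StateConsumerImpl also consults the union CH and tracks per-segment inbound transfers):

      import java.util.*;

      // Rough model of the segment arithmetic in the trace above. The
      // "not requesting" branch paraphrases the description: segments gained
      // through updateMembers rather than through a rebalance have no
      // surviving previous owner to fetch state from.
      public class SegmentDiffSketch {
         public static void main(String[] args) {
            Set<Integer> oldSegments = new TreeSet<>(List.of(235, 236, 237, 238, 243));
            Set<Integer> newSegments = new TreeSet<>(List.of(235, 236, 237, 238, 239, 243));

            Set<Integer> added = new TreeSet<>(newSegments);
            added.removeAll(oldSegments);        // {239}
            Set<Integer> removed = new TreeSet<>(oldSegments);
            removed.removeAll(newSegments);      // {}
            System.out.println("added segments: " + added + "; removed segments: " + removed);

            if (!added.isEmpty()) {
               // The previous owners of the added segments are gone, so there
               // is nobody to request state from: the segments start out empty.
               System.out.println("Not requesting segments " + added
                     + " because the last owner left the cluster");
            }
         }
      }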
      

      There isn't any visible inconsistency: A owns segment 239 only for writing, and the coordinator immediately starts a new rebalance, ignoring the pending CH that it sent out earlier. However, the new rebalance causes its own problems: ISPN-8240.


            People

              Assignee: Dan Berindei (dberinde@redhat.com) (Inactive)
              Reporter: Dan Berindei (dberinde@redhat.com) (Inactive)
              Votes: 0
              Watchers: 1
