Infinispan / ISPN-3140

JMX operation to suppress state transfer

      This feature request is to expose a JMX operation on each node that suppresses state transfer for a period of time. The corresponding flag would be false by default (i.e. state transfer stays enabled unless suppressed).

      The use case of this flag would be to ease bringing down (and up) a cluster for maintenance work. A typical workflow would be:

      1) Shut down application requests to the data grid
      2) Suppress state transfer on all nodes via JMX
      3) Bring down all nodes
      4) Perform maintenance work
      5) Bring up nodes, one at a time. As each node comes up, disable state transfer for the node via JMX.
      6) Once all nodes are up, enable state transfer for each node again via JMX
      7) Allow application requests to reach the grid again.

      The purpose of this is to allow a smooth and fast shutdown and startup, and to remove the risk of OOM errors when bringing a grid down.

      This is a small but useful subset of full manual state transfer as defined in ISPN-1394.
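
      As a rough sketch, suppressing state transfer from a standard JMX client might look like the following. The ObjectName, component, and attribute name ("RebalancingEnabled" on a LocalTopologyManager component) are assumptions for illustration; the names that finally ship may differ.

      import javax.management.Attribute;
      import javax.management.MBeanServerConnection;
      import javax.management.ObjectName;
      import javax.management.remote.JMXConnector;
      import javax.management.remote.JMXConnectorFactory;
      import javax.management.remote.JMXServiceURL;

      public class SuppressStateTransfer {
          public static void main(String[] args) throws Exception {
              // Host and port are placeholders for a real node.
              JMXServiceURL url = new JMXServiceURL(
                      "service:jmx:rmi:///jndi/rmi://node1:9999/jmxrmi");
              try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
                  MBeanServerConnection mbs = connector.getMBeanServerConnection();

                  // Assumed ObjectName; not confirmed by this issue.
                  ObjectName topology = new ObjectName(
                          "org.infinispan:type=CacheManager,name=\"DefaultCacheManager\","
                                  + "component=LocalTopologyManager");

                  // Step 2 of the workflow above: suppress state transfer on this
                  // node before bringing the cluster down.
                  mbs.setAttribute(topology,
                          new Attribute("RebalancingEnabled", Boolean.FALSE));
              }
          }
      }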

            RH Bugzilla Integration added a comment - Anna Manukyan <amanukya@redhat.com> changed the Status of bug 974402 from ON_QA to VERIFIED

            RH Bugzilla Integration added a comment - Anna Manukyan <amanukya@redhat.com> made a comment on bug 974402 Verified! Thanks a lot.

            RH Bugzilla Integration added a comment - Tristan Tarrant <ttarrant@redhat.com> changed the Status of bug 974402 from MODIFIED to ON_QA

            RH Bugzilla Integration added a comment - Tristan Tarrant <ttarrant@redhat.com> changed the Status of bug 974402 from ASSIGNED to MODIFIED

            RH Bugzilla Integration added a comment - Anna Manukyan <amanukya@redhat.com> changed the Status of bug 974402 from ON_QA to ASSIGNED

            RH Bugzilla Integration added a comment - Anna Manukyan <amanukya@redhat.com> made a comment on bug 974402 Tested for ER1 and the issue described above still appears.

            RH Bugzilla Integration added a comment - Tristan Tarrant <ttarrant@redhat.com> changed the Status of bug 974402 from MODIFIED to ON_QA

            RH Bugzilla Integration added a comment - Tristan Tarrant <ttarrant@redhat.com> changed the Status of bug 974402 from ASSIGNED to MODIFIED

            RH Bugzilla Integration added a comment - Anna Manukyan <amanukya@redhat.com> changed the Status of bug 974402 from ON_QA to ASSIGNED

            RH Bugzilla Integration added a comment - Tristan Tarrant <ttarrant@redhat.com> changed the Status of bug 974402 from MODIFIED to ON_QA

            RH Bugzilla Integration added a comment - Tristan Tarrant <ttarrant@redhat.com> changed the Status of bug 974402 from ASSIGNED to MODIFIED

            RH Bugzilla Integration added a comment - Tristan Tarrant <ttarrant@redhat.com> made a comment on bug 974402 The HotRod topology cache MUST not be configured by hand, but only by using the <topology-state-transfer> configuration element. ISPN-3373 adds support for this.

            RH Bugzilla Integration added a comment - Tristan Tarrant <ttarrant@redhat.com> made a comment on bug 974402 Anna, disabling

            RH Bugzilla Integration added a comment - Anna Manukyan <amanukya@redhat.com> changed the Status of bug 974402 from ON_QA to ASSIGNED

            RH Bugzilla Integration added a comment - Michal Linhard <mlinhard@redhat.com> made a comment on bug 974402 Anna, can you please verify this? (You did this for the patch.)

            RH Bugzilla Integration added a comment - Tristan Tarrant <ttarrant@redhat.com> changed the Status of bug 974402 from MODIFIED to ON_QA

            RH Bugzilla Integration added a comment - Tristan Tarrant <ttarrant@redhat.com> changed the Status of bug 974402 from NEW to MODIFIED

            RH Bugzilla Integration added a comment - Tristan Tarrant <ttarrant@redhat.com> made a comment on bug 974402 Resolved upstream

            Dan Berindei (Inactive) added a comment - We should broadcast (to all the members) the suspend request so that when the coordinator dies, the new coordinator would pick it up and it wouldn't start the rebalance.
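
            A minimal sketch of that idea, with hypothetical names (Transport, SuspendRebalanceCommand, RebalancePolicy are illustrative, not Infinispan API): the suspend request is broadcast so every member holds a copy of the flag, and a newly elected coordinator checks its own copy before starting a rebalance.

            // All names are hypothetical; sketch of broadcasting the suspend flag.
            interface Transport {
                void broadcast(SuspendRebalanceCommand cmd) throws Exception;
            }

            record SuspendRebalanceCommand(boolean suspended) {}

            class RebalancePolicy {
                private volatile boolean rebalancingSuspended;

                // Invoked via JMX on any node: broadcast rather than set locally.
                void suspendRebalancing(Transport transport) throws Exception {
                    transport.broadcast(new SuspendRebalanceCommand(true));
                }

                // Every member (coordinator or not) applies the command.
                void onSuspendRebalanceCommand(SuspendRebalanceCommand cmd) {
                    rebalancingSuspended = cmd.suspended();
                }

                // A new coordinator consults its own copy, so the flag survives failover.
                boolean shouldStartRebalance() {
                    return !rebalancingSuspended;
                }
            }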

            Adrian Nistor (Inactive) added a comment - Integrated in master. Thanks!

            Tristan Tarrant added a comment - This needs to have a "Server" counterpart, so that it can be exposed via the Server RHQ plugin (which doesn't use JMX, but DMR)
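
            For the DMR side, a hedged sketch of writing such an attribute through the native management API follows; the management port, resource address, and attribute name are guesses for illustration, since the issue does not specify them.

            import org.jboss.as.controller.client.ModelControllerClient;
            import org.jboss.dmr.ModelNode;

            public class SuppressViaDmr {
                public static void main(String[] args) throws Exception {
                    // Host and management port are placeholders.
                    try (ModelControllerClient client =
                                 ModelControllerClient.Factory.create("localhost", 9999)) {
                        ModelNode op = new ModelNode();
                        op.get("operation").set("write-attribute");
                        // Assumed resource address and attribute name.
                        ModelNode address = op.get("address");
                        address.add("subsystem", "infinispan");
                        address.add("cache-container", "clustered");
                        op.get("name").set("rebalancing");
                        op.get("value").set(false);

                        ModelNode result = client.execute(op);
                        System.out.println(result.get("outcome").asString());
                    }
                }
            }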

            Adrian Nistor (Inactive) added a comment - As pointed out by Dennis Reed ( http://markmail.org/message/al7elzaqme5jri22 ) it makes sense to forward the jmx operation from any member to the coordinator to make it easier for the user.
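
            One way that forwarding could look, sketched with hypothetical names (Address, ClusterTransport, SetRebalancingCommand are illustrative): whichever node receives the JMX call either applies it locally (if it is the coordinator) or relays it to the coordinator.

            interface Address {}

            record SetRebalancingCommand(boolean enabled) {}

            interface ClusterTransport {
                boolean isCoordinator();
                Address getCoordinator();
                void sendTo(Address target, SetRebalancingCommand cmd) throws Exception;
            }

            class TopologyManagement {
                private final ClusterTransport transport;

                TopologyManagement(ClusterTransport transport) {
                    this.transport = transport;
                }

                // JMX entry point, valid on any node.
                void setRebalancingEnabled(boolean enabled) throws Exception {
                    if (transport.isCoordinator()) {
                        applyOnCoordinator(enabled);
                    } else {
                        // Relay to the coordinator so the admin does not
                        // have to locate it first.
                        transport.sendTo(transport.getCoordinator(),
                                new SetRebalancingCommand(enabled));
                    }
                }

                private void applyOnCoordinator(boolean enabled) {
                    // Update the ClusterTopologyManager state here.
                }
            }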

            Dan Berindei (Inactive) added a comment (edited)

            Pasted from Adrian's message (http://markmail.org/message/ns7aojy7v7su2t7p):

            1. Add a JMX-writable attribute (or operation?) to ClusterTopologyManager (name it suppressRehashing?) that is false by default but should also be configurable via API or XML. While this attribute is true, the ClusterTopologyManager queues all join/leave/exclude (see below) requests and does not execute them on the spot as would normally happen. [...] When it is set back to false, all queued operations (except the ones that cancel each other out) are executed. The setter should be synchronous, so that setting it back to false does not return until the queue is empty and all rehashing has been processed.

            2. We add a JMX operation excludeNodes(list of addresses) to ClusterTopologyManager. [...] This operation removes the node from the topology (almost as if it left) and forces a rebalance. The node is still present in the current CH but not in the pending CH; it basically disowns all of its data, which is then transferred to other (non-excluded) nodes. At the end of the rebalance the node is removed from the topology for good and can be shut down without losing data. Note that if suppressRehashing==true, excludeNodes(..) just queues the nodes for later removal. We can batch multiple such exclusions and then re-activate the rehashing.

            The parts that need to be implemented are written in italic above. Everything else is already there.

            excludeNodes is a way of achieving a soft shutdown and should be used only if we care about preserving data in the extreme case where the nodes are the last/single owners. We can just kill the node directly if we do not care about its data.

            suppressRehashing is a way of achieving some kind of batching of topology changes. This should speed up state transfer a lot because it avoids a lot of pointless reshuffling of data segments when we have many successive joiners/leavers.

            So what happens if the current coordinator dies for whatever reason? The new one will take control without any knowledge of the existing rehash queue or the previous value of the suppressRehashing attribute, so it will just get the current cache membership status from all members of the current view and proceed with the rehashing as usual. If the user does not want this, he can set a default value of true for suppressRehashing. The admin now has to interact via JMX with the new coordinator, but that's not as bad as the alternative where all the nodes are involved in this JMX scheme; I think having only the coordinator involved is a plus.

            We're actually going to implement only point 1 now, and point 2 will be a separate issue (or perhaps as a part of ISPN-1394).
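
            A minimal sketch of point 1, under assumed names (suppressRehashing as the JMX-writable flag, one Runnable per deferred topology update); coalescing of join/leave pairs that cancel each other out is omitted for brevity.

            import java.util.ArrayDeque;
            import java.util.Queue;

            class RehashSuppressingTopologyManager {
                private final Queue<Runnable> pending = new ArrayDeque<>();
                private boolean suppressRehashing; // false by default, as proposed

                // Called for each join/leave; defers the rebalance while suppressed.
                synchronized void onTopologyEvent(Runnable topologyUpdate) {
                    if (suppressRehashing) {
                        pending.add(topologyUpdate);
                    } else {
                        topologyUpdate.run();
                    }
                }

                // JMX-writable attribute. The setter is synchronous: returning from
                // setSuppressRehashing(false) means every queued update has run.
                synchronized void setSuppressRehashing(boolean suppress) {
                    this.suppressRehashing = suppress;
                    if (!suppress) {
                        Runnable update;
                        while ((update = pending.poll()) != null) {
                            update.run();
                        }
                    }
                }
            }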

            Mircea Markus (Inactive) added a comment - Where would the data in this cluster be persisted during the shutdown? Simpler with a shared cache store; each cache persisting locally would complicate things a bit.

              Dan Berindei (Inactive) (dberinde@redhat.com)
              Manik Surtani (Inactive) (manik_jira)
              Archiver: Amol Dongare (rhn-support-adongare)