  Hawkular / HAWKULAR-801

Provide support for running repair on Cassandra cluster


Details

    • Type: Feature Request
    • Resolution: Won't Do
    • Priority: Critical

    Description

      Repair is a process in Cassandra in which data is made consistent across replicas. There are two kinds - read repair and anti-entropy repair. The former happens automatically in the background on queries. The latter is done via JMX. Although nodes can remain operational while anti-entropy repair runs, it is very resource intensive and can take a long time to complete. It can easily be on the order of hours or even days. The Cassandra docs recommend running regularly scheduled anti-entropy repair within gc_grace_seconds, which is the time to wait before Cassandra garbage collects tombstones (i.e., deletion markers). The reason for running it within gc_grace_seconds is to ensure deletes get propagated and to prevent them from being undone (i.e., to prevent deleted data from being resurrected). gc_grace_seconds is configured per table and defaults to 10 days. In RHQ we set gc_grace_seconds to 8 days and ran anti-entropy repair weekly in a Quartz job named StorageClusterReadRepairJob.
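      To make the gc_grace_seconds/repair relationship concrete, here is a minimal sketch using the DataStax Java driver (3.x API). The keyspace/table names are placeholders and the 8-day value simply mirrors what we did in RHQ:

{code:java}
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;

public class GcGraceExample {
    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect()) {
            // 8 days expressed in seconds. Weekly anti-entropy repair then fits
            // comfortably inside the window before tombstones are GC'd.
            session.execute("ALTER TABLE hawkular_metrics.data WITH gc_grace_seconds = 691200");
        }
    }
}
{code}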

      As long as replicas are up, data will be consistent between them. Note, however, that the consistency levels used influence when data becomes consistent between replicas. If we had a cluster where nodes never went down, there would be no need to run anti-entropy repair for the sake of data consistency. Of course nodes do go down, and Cassandra has another mechanism, hinted handoff, that comes into play. When a target replica is down, the coordinator node (the one receiving the request) stores a hint of the mutation intended for that replica. When the replica comes back up, it receives the hints, making it consistent with the other replicas.
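      For example, with a replication factor of 3, writing and reading at QUORUM means the write and read replica sets always overlap in at least one node, so a read sees the latest write even if one replica missed it (and that replica is later fixed by hints or repair). A sketch with the DataStax Java driver; the table and columns are placeholders, not the actual Hawkular Metrics schema:

{code:java}
import java.util.Date;

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ConsistencyLevel;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SimpleStatement;

public class QuorumExample {
    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect("hawkular_metrics")) {
            // Write is acknowledged by 2 of 3 replicas.
            SimpleStatement write = new SimpleStatement(
                    "INSERT INTO data (id, time, value) VALUES (?, ?, ?)",
                    "gauge-1", new Date(), 3.14);
            write.setConsistencyLevel(ConsistencyLevel.QUORUM);
            session.execute(write);

            // Read consults 2 of 3 replicas; at least one of them saw the write above.
            SimpleStatement read = new SimpleStatement(
                    "SELECT value FROM data WHERE id = ?", "gauge-1");
            read.setConsistencyLevel(ConsistencyLevel.QUORUM);
            System.out.println(session.execute(read).one());
        }
    }
}
{code}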

      There is a maximum amount of time a node can be down while other nodes will still store hints for it. This is defined by the max_hint_window_in_ms property in cassandra.yaml, which defaults to 3 hours. If a node is down longer than that, other nodes stop storing hints for it until it comes back up. So if we do not run scheduled repair and a node is down for more than max_hint_window_in_ms, we need to run a full repair on that node when it comes back up to account for any dropped hints.
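      A minimal sketch of kicking off such a full repair over JMX. This assumes the repairAsync(String, Map) operation that Cassandra 2.2+ exposes on the StorageService MBean (older versions expose forceRepairAsync variants instead); the keyspace name and JMX port are just examples:

{code:java}
import java.util.HashMap;
import java.util.Map;

import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class TriggerRepair {
    public static void main(String[] args) throws Exception {
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://127.0.0.1:7199/jmxrmi");
        try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
            MBeanServerConnection mbeans = connector.getMBeanServerConnection();
            ObjectName storageService =
                    new ObjectName("org.apache.cassandra.db:type=StorageService");
            // An empty options map repairs every table and every token range the
            // node replicates for this keyspace; more specific options are
            // sketched further below.
            Map<String, String> options = new HashMap<>();
            Object cmd = mbeans.invoke(storageService, "repairAsync",
                    new Object[] { "hawkular_metrics", options },
                    new String[] { "java.lang.String", "java.util.Map" });
            System.out.println("Submitted repair command " + cmd);
        }
    }
}
{code}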

      With metric data we are dealing with append-only data, where each column is only ever written once and never updated. If some metric data were deleted on one replica and still live on another, we know that it has a TTL set and will expire; therefore, we do not need to worry about deletes being undone. I should note that there is a PR for HWKMETRICS-193 that stops using TTL and performs deletes manually. Even then, we should not have to worry about deletes being undone for metric data, since we never do in-place updates. Other tables in metrics, as well as tables in other parts of Hawkular, are not necessarily append-only, and in those situations resurrected data is a concern. The takeaway is that the frequency at which anti-entropy repair needs to run can vary from table to table.
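      For instance, a metric data point is written once with a TTL (pre-HWKMETRICS-193), so even an unpropagated delete self-corrects when the column expires. Reusing the driver Session from the sketches above, with placeholder table/columns and an arbitrary 7-day TTL:

{code:java}
// Written once, never updated, and expired by Cassandra itself after 7 days,
// so there is no in-place update for a missed tombstone to resurrect.
session.execute(new SimpleStatement(
        "INSERT INTO data (id, time, value) VALUES (?, ?, ?) USING TTL 604800",
        "gauge-1", new Date(), 3.14));
{code}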

      There are basically three ways to schedule anti-entropy repair. The most coarse-grained way is to specify only the keyspace; Cassandra will then run repair over every table in it. The next way is to specify the tables to run against. The most fine-grained way is to specify a token range in addition to the table. In every case Cassandra schedules repair sessions per token range for a table. Say, for example, that we have token ranges 1-10, 10-20, ..., 80-90, and 90-100; when repair runs against a table, Cassandra schedules a session for each of those ranges.
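      In terms of the repairAsync options map used in the JMX sketch above, the three granularities look roughly like this. The option keys follow my reading of Cassandra's RepairOption (2.2+) and should be treated as assumptions, and the table names and token values are made up:

{code:java}
import java.util.HashMap;
import java.util.Map;

public class RepairGranularity {

    // 1) Whole keyspace: every table, every range this node replicates.
    static Map<String, String> wholeKeyspace() {
        return new HashMap<>();
    }

    // 2) Specific tables only.
    static Map<String, String> specificTables() {
        Map<String, String> opts = new HashMap<>();
        opts.put("columnFamilies", "data,metrics_idx");
        return opts;
    }

    // 3) A specific table restricted to explicit start:end token ranges.
    static Map<String, String> specificRanges() {
        Map<String, String> opts = new HashMap<>();
        opts.put("columnFamilies", "data");
        opts.put("ranges", "10:20,80:90");
        return opts;
    }
}
{code}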

      There are a couple of big challenges with managing repair that have been major pain points for RHQ/JON users. First, Cassandra does not report overall progress, nor does it provide a good, high-level notification mechanism for knowing that repair has finished. When repair runs for hours or days, this is very problematic because users have a really difficult time determining whether repair has finished. Because repair can and does fail, and because it can take a long time to complete, being able to track its progress is vital. You can register a JMX NotificationListener to receive updates about finished token ranges, which provides a lot of what we need for determining whether or not there were failures. There is nothing, though, that provides approximate estimates of how long repair might take. Some sort of reasonably intelligent estimate could be very helpful for users.
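      A sketch of what that listener registration looks like. The JMX plumbing below is standard; the type and payload of the repair notifications vary across Cassandra versions, so the parsing is deliberately left vague:

{code:java}
import javax.management.Notification;
import javax.management.NotificationListener;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class RepairProgressListener {
    public static void main(String[] args) throws Exception {
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://127.0.0.1:7199/jmxrmi");
        JMXConnector connector = JMXConnectorFactory.connect(url);
        ObjectName storageService =
                new ObjectName("org.apache.cassandra.db:type=StorageService");

        NotificationListener listener = (Notification n, Object handback) -> {
            // Repair emits per-session/per-range notifications; a real
            // implementation would parse n.getUserData() (version-dependent)
            // to track which ranges finished or failed.
            System.out.println(n.getType() + ": " + n.getMessage());
        };
        connector.getMBeanServerConnection()
                 .addNotificationListener(storageService, listener, null, null);

        // ... submit repairAsync and keep this connection open until the
        // notifications indicate the repair command finished or failed ...
    }
}
{code}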

      The second challenge is that Cassandra does not maintain any persistent state about a scheduled anti-entropy repair job. If repair fails for a subset of token ranges of a particular table, it would be really nice to be able to tell Cassandra to rerun repair for only the token ranges that previously failed. This capability would be very useful since repair is resource intensive and can take a long time to finish. We would need to build this sort of state management into Hawkular, since Cassandra does not provide it. Based on experience in RHQ/JON, I definitely think this is something we need to do.
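      As a rough idea of the state we might keep, here is a hypothetical tracking table (not an existing Hawkular schema), with one row per table/token range per run, so that a failed run can be resumed by re-submitting only the ranges marked FAILED:

{code:java}
// Executed once at startup, reusing the driver Session from the earlier sketches.
session.execute(
        "CREATE TABLE IF NOT EXISTS hawkular_metrics.repair_runs (" +
        "  run_id timeuuid, " +
        "  table_name text, " +
        "  range_start text, " +
        "  range_end text, " +
        "  status text, " +          // e.g. PENDING, RUNNING, FINISHED, FAILED
        "  updated timestamp, " +
        "  PRIMARY KEY ((run_id), table_name, range_start)" +
        ")");
{code}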

      Attachments

        Activity

          People

            Assignee: John Sanda (Inactive)
            Reporter: John Sanda (Inactive)
            Votes: 0
            Watchers: 2
