Hawkular Metrics / HWKMETRICS-494

Data compression job fails easily and is not fault tolerant

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Done
    • Affects Version/s: 0.20.0
    • Fix Version/s: 0.21.0
    • Component/s: Core, Scheduler
    • Labels:
      None

      Description

      The data compression job substantially increases the load we place on Cassandra. Every two hours it runs and does the following for every metric (see the sketch after this list):

      • Fetch past two hours of raw data
      • Compress the raw data
      • Write the compressed data to the data_compressed table
      • Delete the raw data
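
      For illustration, the per-metric pass amounts to roughly the loop below. This is only a sketch; fetchRawData, compressRawData, writeCompressed, and deleteRawData are hypothetical placeholders for whatever the job actually calls, not the real API.

      // Rough sketch of one compression pass per metric (not the actual implementation).
      // blockStart and blockEnd bound the 2-hour block being compressed; the types and
      // helper methods are placeholders.
      for (String metricId : metricIds) {
          List<DataPoint> raw = fetchRawData(metricId, blockStart, blockEnd); // read the past 2 hours of raw data
          byte[] compressed = compressRawData(raw);                           // compress in memory
          writeCompressed(metricId, blockStart, compressed);                  // insert into data_compressed
          deleteRawData(metricId, blockStart, blockEnd);                      // delete the raw rows
      }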

      On my dev machine I have had a server running for several hours against a single Cassandra node configured with a 1 GB heap. I was pumping data with the Gatling load test in the metrics repo. Here is the command line I used for Gatling:

      mvn clean gatling:execute -Dloops=500 -Dinterval=10 -Dmetrics=1000
      

      This results in pushing 1,000 data points every 10 seconds. Nothing too crazy. I started seeing errors by the 3rd run of the compression job. The job triggers lots of GC activity in Cassandra, which almost certainly contributes to a lot of the timeouts I was seeing. I also noticed lots of read timeouts.

      I am attaching my server.log and cassandra.log files for review. Things start to fall apart in the logs a few seconds after 15:00.

      We may need to do some GC tuning in Cassandra due to the new read workload. We might also need to explore some throttling. Since the job only runs every couple of hours, we should look at spreading the work out over the time window.
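
      One possible way to spread the work out, sketched with RxJava (rx.Observable), is to pace the per-metric work by zipping the metric stream with an interval timer instead of firing everything at once. The 100 ms spacing and the compressBlock helper below are illustrative assumptions, not a concrete proposal.

      // Illustration only: pace compression so we do not hammer Cassandra all at once.
      // compressBlock is a hypothetical helper that compresses one metric's block and
      // returns an Observable that completes when that metric is done.
      Observable.zip(
              Observable.from(metricIds),                      // the metrics to compress
              Observable.interval(100, TimeUnit.MILLISECONDS), // one tick per metric
              (metricId, tick) -> metricId)                    // keep the id, drop the tick
          .concatMap(metricId -> compressBlock(metricId, blockStart, blockEnd))
          .subscribe(
              ignored -> { },
              t -> t.printStackTrace());                       // real code would log the failure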

      The compression job has to be more fault tolerant. Right now, if there is an error during the job, like a request timeout, the job is effectively aborted. Say, for example, that the 15:00 run fails part of the way through. The job will be scheduled to run again at 17:00. We need to make sure that the data from the 13:00 to 15:00 block still gets compressed and also that its raw data gets deleted.
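
      One way to get that guarantee, sketched below: instead of hard-coding "compress the previous two hours", each run derives the list of blocks that are still pending and works through all of them, only marking a block done after it completes. findPendingBlocks and compressAndMarkDone are hypothetical helpers used purely for illustration.

      // Sketch only: each run compresses every block that has not yet been marked done,
      // so a block whose 15:00 run failed gets picked up again by the 17:00 run.
      long now = System.currentTimeMillis();
      for (long blockStart : findPendingBlocks(now)) {  // e.g. both the 13:00 and 15:00 blocks
          try {
              compressAndMarkDone(blockStart, blockStart + TimeUnit.HOURS.toMillis(2));
          } catch (Exception e) {
              // Leave the block unmarked so the next run retries it.
          }
      }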

      I think it is particularly important to make sure that the raw data gets deleted since we are no longer using TTL.

      Lastly, I think we should consider keeping track of the work done. Suppose the job fails on the 999th of 1,000 metrics. We should try to avoid redoing all of that work if possible. Sure, the job can run quickly, but the additional load leads to a lot of instability, at least with a single-node setup.
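
      For tracking the work done, something as simple as a per-block progress table might be enough; a rerun could then skip metrics that were already compressed. The table, column names, and statements below (written against the DataStax Java driver's Session) are only a rough idea, not a proposed final schema.

      // Rough idea only: record per-metric completion for each 2-hour block.
      session.execute(
          "CREATE TABLE IF NOT EXISTS compression_progress (" +
          "    time_block timestamp," +
          "    metric_id  text," +
          "    PRIMARY KEY (time_block, metric_id))");

      // After a metric's block is compressed and its raw data deleted:
      session.execute("INSERT INTO compression_progress (time_block, metric_id) VALUES (?, ?)",
          new Date(blockStart), metricId);

      // On a rerun of the same block, skip metrics that are already recorded:
      boolean alreadyDone = session.execute(
          "SELECT metric_id FROM compression_progress WHERE time_block = ? AND metric_id = ?",
          new Date(blockStart), metricId).one() != null;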

          Attachments

          1. cassandra.log
            13.70 MB
          2. server.log
            1.40 MB

              People

              • Assignee:
                john.sanda John Sanda
                Reporter:
                john.sanda John Sanda
              • Votes: 0
                Watchers: 2

                Dates

                • Created:
                  Updated:
                  Resolved: