Hawkular Metrics / HWKMETRICS-774

Add logging to SimpleTagQueryParser for tags queries

    Details

    • Type: Task
    • Status: Closed
    • Priority: Major
    • Resolution: Done
    • Affects Version/s: None
    • Fix Version/s: 0.31.0, 0.30.4
    • Component/s: Core
    • Labels:
      None

      Description

      We have had problems with the POST /metrics/metrics/stats/query endpoint for as long as it has been used in OpenShift. Yesterday I was investigating an OCP 3.6 cluster, and I saw this in the logs:

      WARN  [org.hawkular.metrics.api.jaxrs.log.time.RequestTimeLogger] (RxComputationScheduler-10) Request POST /hawkular/metrics/m/stats/query took: 186166 ms, exceeds 10000 ms threshold, tenant-id: myproject
      

      For reference, here are a couple of examples of the tags parameter in these requests:

      descriptor_name:network/tx_rate|network/rx_rate,type:pod,pod_id:ae01bf2f-3439-11e8-bdf7-54e1ad486be8|ae01ee22-3439-11e8-bdf7-54e1ad486be8|adfc47e9-3439-11e8-bdf7-54e1ad486be8|adfe6edc-3439-11e8-bdf7-54e1ad486be8|adfe4cf4-3439-11e8-bdf7-54e1ad486be8
      

      and

      descriptor_name:memory/usage|cpu/usage_rate,type:pod_container,pod_id:ae01bf2f-3439-11e8-bdf7-54e1ad486be8|ae01ee22-3439-11e8-bdf7-54e1ad486be8|adfc47e9-3439-11e8-bdf7-54e1ad486be8|adfe6edc-3439-11e8-bdf7-54e1ad486be8|adfe4cf4-3439-11e8-bdf7-54e1ad486be8,container_name:nodeapp
      

      The cluster has only about 1k pods, and this particular project had only 17 pods. I dug a bit deeper and found that the tags queries executed returned result sets with more than 300k rows. With the default page size of 1000 for the Cassandra driver, we are looking at 300+ round trips to and from Cassandra for each tag query.

      We have virtually no visibility into the result set sizes we are dealing with, short of manual inspection like I did.

      I want to log a DEBUG message for each tag query that includes the tag's key and value(s) and the result set size. We can establish a threshold based on the configured page size for the Cassandra driver, above which a similar message is logged at INFO or WARN instead. For example, if the page size is 1000 and the threshold is 10, then a result set with 300,000 rows should trigger the INFO or WARN message.
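
      The threshold check described above could be sketched roughly as follows. This is only an illustration, not the actual SimpleTagQueryParser change: the class and method names are hypothetical, the threshold is assumed to be a number of driver pages, and WARN is picked arbitrarily where the description says INFO or WARN. The sketch returns the level and message rather than calling a real logger.

```java
import java.util.Locale;

// Hypothetical helper; none of these names come from Hawkular Metrics.
class TagQueryLogSketch {

    enum Level { DEBUG, WARN }

    // Number of driver pages (round trips) needed to fetch `rows` rows.
    static long pages(long rows, int pageSize) {
        return (rows + pageSize - 1) / pageSize; // ceiling division
    }

    // Assumption: the threshold is expressed in pages, so the row limit
    // is pageSize * thresholdPages (1000 * 10 = 10,000 rows in the example).
    static Level levelFor(long rows, int pageSize, int thresholdPages) {
        return pages(rows, pageSize) > thresholdPages ? Level.WARN : Level.DEBUG;
    }

    // Message carrying the tag key, its value(s), and the result set size.
    static String message(String tagKey, String tagValues, long rows, int pageSize) {
        return String.format(Locale.ROOT,
                "Tag query %s:%s returned %d rows (%d pages of %d)",
                tagKey, tagValues, rows, pages(rows, pageSize), pageSize);
    }

    public static void main(String[] args) {
        long rows = 300_000;
        int pageSize = 1000;
        int thresholdPages = 10;
        System.out.println(levelFor(rows, pageSize, thresholdPages) + " "
                + message("type", "pod", rows, pageSize));
    }
}
```

      With the numbers from the example (page size 1000, threshold 10), a 300,000-row result set spans 300 pages and is reported at the higher level, while a result set of 10,000 rows or fewer stays at DEBUG.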

    People

    • Assignee: Unassigned
    • Reporter: John Sanda (john.sanda)