This problem was first brought to our attention by Frank Eigler, who hit it while using the data source plugin for Grafana with an OpenShift 3.3 cluster. Loading metric names with the tags query tags=container_name:* fails. In some instances I got an HTTP 504 gateway timeout, which means Grafana did not hear back from the Hawkular Metrics server soon enough. Other times we saw a 500 response with a timeout reported by the Cassandra driver.
The method being invoked is MetricHandler.findMetrics. It first queries the metrics_tags_idx table to look up metrics that have the container_name tag. We then perform a separate query against metrics_idx for each metric name found in the tags index. I am not sure yet which of these queries are causing the problems.
There might be some optimizations we can make to reduce the number of queries against the metrics_idx table. If the metrics_tags_idx query returns many metric names that belong to the same partition in metrics_idx, then we can fetch those in a single query instead of one query per metric name.
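
To make the idea concrete, here is a rough sketch of grouping the metric names returned by the tags index by partition and building one query per group. It is not the existing Hawkular Metrics code. It assumes that metrics_idx is partitioned by (tenant_id, type) with metric as the clustering column, and that the selected columns are named metric and tags; both need to be checked against the actual schema. MetricIdxBatching, MetricId, groupByPartition, and selectForPartition are hypothetical names used only for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class MetricIdxBatching {

    // Hypothetical stand-in for one row returned by the metrics_tags_idx query.
    static final class MetricId {
        final String tenantId;
        final int type;
        final String name;

        MetricId(String tenantId, int type, String name) {
            this.tenantId = tenantId;
            this.type = type;
            this.name = name;
        }
    }

    // Groups metric names by the ASSUMED partition key (tenant_id, type).
    // Each map entry corresponds to one query against metrics_idx.
    static Map<List<Object>, List<String>> groupByPartition(List<MetricId> ids) {
        Map<List<Object>, List<String>> groups = new LinkedHashMap<>();
        for (MetricId id : ids) {
            List<Object> key = Arrays.<Object>asList(id.tenantId, id.type);
            groups.computeIfAbsent(key, k -> new ArrayList<>()).add(id.name);
        }
        return groups;
    }

    // Builds a single-partition SELECT with one bind marker per metric name.
    // Column names here are assumptions, not taken from the real schema.
    static String selectForPartition(int nameCount) {
        String markers = String.join(", ", Collections.nCopies(nameCount, "?"));
        return "SELECT metric, tags FROM metrics_idx "
                + "WHERE tenant_id = ? AND type = ? AND metric IN (" + markers + ")";
    }

    public static void main(String[] args) {
        List<MetricId> fromTagsIndex = Arrays.asList(
                new MetricId("tenant-a", 0, "pod1/cpu"),
                new MetricId("tenant-a", 0, "pod2/cpu"),
                new MetricId("tenant-b", 0, "pod9/cpu"));

        // Three metric names collapse into two queries, one per partition.
        groupByPartition(fromTagsIndex).forEach((partition, names) ->
                System.out.println(partition + " " + names + " -> "
                        + selectForPartition(names.size())));
    }
}
```

With this grouping, the number of metrics_idx queries is bounded by the number of distinct partitions rather than the number of metric names. A very large IN list has its own cost in Cassandra, so a real implementation would probably also cap the number of names per query.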
It may turn out that there is no new problem here. I got the logs from the OpenShift cluster, and I see that it is running into HWKMETRICS-590. That tombstone issue can cause a lot of performance problems. This is an OpenShift hosted cluster, which means it is likely also running into HWKMETRICS-606. Together these two bugs can cause all sorts of problems. I wanted to create this ticket to track the issue; however, there might not be a new bug to fix. We need to test with a cluster that has the fixes for 590 and 606. I am working on that.