The purpose of this ticket is to address BZ 1569096, at least partially. Requests to the POST /hawkular/metrics/m/stats/query endpoint frequently result in HTTP timeouts, which means that graphs in the OpenShift web console fail to render. Part of the problem is unbounded growth within partitions of the metrics_tags_idx table.
The tag queries for this endpoint look like (whitespace added for readability):
As noted in BZ 1569096, the type tag query in one small OpenShift cluster returned over 1 million rows. With a default page size of 1,000, fetching them takes more than 1,000 round trips to Cassandra for that one query alone. To make things worse, the entire result set is fully realized in memory, which can lead to GC pressure.
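The round-trip count above is simple paging arithmetic. A minimal sketch follows, assuming one round trip per page; the function name `round_trips` is hypothetical and is not part of Hawkular Metrics or the Cassandra driver. The result is best read as a lower bound, since a driver may issue one extra fetch to detect the end of the result set.

```python
def round_trips(row_count: int, page_size: int = 1000) -> int:
    """Lower bound on page fetches needed to read row_count rows.

    Assumes one round trip per page (hypothetical helper for illustration).
    """
    if row_count < 0:
        raise ValueError("row_count must be non-negative")
    if page_size <= 0:
        raise ValueError("page_size must be positive")
    # Integer ceiling division: a partially filled last page still costs a trip.
    return (row_count + page_size - 1) // page_size
```

For example, `round_trips(1_000_000)` gives 1000, and any row count above 1 million pushes it past 1,000.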
The number of metrics per pod is fixed, which means that we can pre-calculate the result set size for the pod_id query. Unless we are dealing with an extremely large number of pod ids (tens of thousands), the result set for the pod_id query will be small. By executing this query first, we can narrow the potential set of metrics to be included in the stats/query endpoint response much more quickly.
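The reordering can be illustrated with a small in-memory model. This is only a sketch under stated assumptions, not the Hawkular Metrics implementation: the dictionary `index` stands in for metrics_tags_idx, `metric_tags` stands in for a per-metric tag lookup, and all function names are hypothetical. It also assumes that the type tag of each candidate metric can be checked individually once the pod_id query has produced the candidates.

```python
from typing import Dict, Iterable, Set, Tuple

# Hypothetical stand-in for metrics_tags_idx: (tag name, tag value) -> metric ids.
TagIndex = Dict[Tuple[str, str], Set[str]]


def build_index(metric_tags: Dict[str, Dict[str, str]]) -> TagIndex:
    """Invert a metric id -> tags mapping into a tag -> metric ids index."""
    index: TagIndex = {}
    for metric_id, tags in metric_tags.items():
        for name, value in tags.items():
            index.setdefault((name, value), set()).add(metric_id)
    return index


def expected_pod_rows(pod_count: int, metrics_per_pod: int) -> int:
    """Pre-calculated size of the pod_id result set (fixed metrics per pod)."""
    return pod_count * metrics_per_pod


def query_pods_first(
    index: TagIndex,
    metric_tags: Dict[str, Dict[str, str]],
    pod_ids: Iterable[str],
    type_value: str,
) -> Set[str]:
    """Run the selective pod_id lookup first, then filter candidates by type."""
    candidates: Set[str] = set()
    for pod_id in pod_ids:
        candidates |= index.get(("pod_id", pod_id), set())
    # Only the small candidate set is checked against the type tag, so the
    # large (type, value) partition is never read in full.
    return {m for m in candidates if metric_tags[m].get("type") == type_value}
```

In this model the work is bounded by `expected_pod_rows(pod_count, metrics_per_pod)` rather than by the number of rows carrying the type tag.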
Reordering the queries does not fully address the problems in BZ 1569096, but it should make a substantial improvement.