Type: Bug
Resolution: Won't Do
Priority: Normal
Fix Version/s: None
Affects Version/s: 4.13.z
Component/s: Monitoring
Labels:

Severity: Moderate
Regression: No
Release Blocker: Rejected
Blocked: False
Blocked Reason: None

Description of problem:

Customer-installed Prometheus Operator-based monitoring stacks collide with the managed OpenShift Prometheus Operator monitoring stacks (cluster-monitoring-operator, User Workload Monitoring, RHOBS) and attempt to control their resources.

In almost all cases, the cause has been a customer Prometheus operator installed without being restricted to a set of namespaces. Such an operator discovers the `prometheus` custom resources owned by the managed operators and attempts to control them.
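
As a rough illustration of the unrestricted case, the sketch below checks the container arguments of a Prometheus operator for namespace restrictions. It assumes the operator is configured through the upstream prometheus-operator `--namespaces` and `--deny-namespaces` flags, and that the managed stacks live in `openshift-monitoring` and `openshift-user-workload-monitoring`. The helper names are hypothetical and not part of any product; RHOBS namespaces are not covered.

```python
# Namespaces assumed to hold the managed stacks (assumption for this sketch).
MANAGED_NAMESPACES = {"openshift-monitoring", "openshift-user-workload-monitoring"}


def parse_flag(args, name):
    """Collect comma-separated values given as --name=value or --name value."""
    values = set()
    for i, arg in enumerate(args):
        if arg.startswith(f"--{name}="):
            raw = arg.split("=", 1)[1]
        elif arg == f"--{name}" and i + 1 < len(args):
            raw = args[i + 1]
        else:
            continue
        values.update(v.strip() for v in raw.split(",") if v.strip())
    return values


def watched_managed_namespaces(args):
    """Return the managed namespaces an operator with these args would still watch."""
    allowed = parse_flag(args, "namespaces")
    denied = parse_flag(args, "deny-namespaces")
    if allowed:
        # An explicit allow-list limits the operator to the listed namespaces.
        return sorted(allowed & MANAGED_NAMESPACES)
    # With no allow-list, every namespace is watched except the denied ones.
    return sorted(MANAGED_NAMESPACES - denied)
```

An empty result suggests the operator is kept away from the managed namespaces; a non-empty result lists the namespaces where a collision of the kind described here could occur.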

As a result, the `prometheus-k8s-1` and `alertmanager-main-1` pods cycle repeatedly through the pending, init, running, and deleting states as the operators fight one another.
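
One way to confirm this symptom from a captured sequence of pod statuses is to count how often a pod re-enters Pending. The sketch below is a hypothetical helper, not an existing tool; it assumes the statuses were collected in order for a single pod name, and the status strings in the usage example are illustrative only.

```python
def count_recreations(statuses):
    """Count how many times a pod re-enters Pending after having left it."""
    recreations = 0
    previous = None
    for status in statuses:
        normalized = status.strip().lower()
        # A return to Pending after any other status means the pod was recreated.
        if normalized == "pending" and previous not in (None, "pending"):
            recreations += 1
        previous = normalized
    return recreations


# Illustrative observation for one pod; the values are made up for the example.
observed = ["Pending", "Init:0/1", "Running", "Terminating",
            "Pending", "Init:0/1", "Running", "Terminating", "Pending"]
recreated = count_recreations(observed)
```

A count that keeps growing over a short window is consistent with two operators repeatedly reconciling the same resources.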

In practical terms, there are stretches where the cluster is essentially unmonitored: SRE teams cannot receive alerts from the cluster, or the cluster appears to have disappeared because DeadMansSnitch alerts have triggered. With no insight into what is happening on-cluster, SRE and the customer risk service degradation or outages.

Version-Release number of selected component (if applicable):

4.x

How reproducible:

100%

Steps to Reproduce:

1. Install an unrestricted Prometheus operator stack. Example: the Prometheus operator bundled with Cisco Service Mesh management (https://www.cisco.com/c/en/us/products/collateral/cloud-systems-management/intersight/nb-06-service-mesh-mgr-aag-cte-en.html).

Actual results:

The cluster monitoring operator, UWM, and RHOBS operators are degraded or down, as is the customer-installed monitoring operator. Prometheus and Alertmanager replica sets are degraded as pods are created and killed. Alerts from the cluster are sometimes suppressed, and DeadMansSnitch check-in failures are frequent.

Expected results:

Managed monitoring operator stacks are unaffected by most, if not all, customer-installed Prometheus operators.

Additional info:

I understand it is impossible to restrict the actions of customers who have been granted cluster-admin, and in an ideal world the policy and responsibility matrix would be enough to prevent these issues. However, in the interest of customer experience, it would be beneficial to prevent these operator collisions with some mix of hardening and obfuscating the managed monitoring solutions, so that out-of-the-box Prometheus operator stacks do not interfere even if they are unconfigured or inexpertly configured. Perhaps the names of the CRs could be extended or renamed to something like "RH-managed-prometheus", or given a randomized string, and RBAC rules hardened to prevent other operators from reading or writing managed resources. If the lowest-hanging fruit could be addressed here, it would probably solve 85% of the issues we see in this area.

relates to

RFE-4733 Allow installation of Prometheus operators without touching built-in CRDs

Under Review

Assignee: Daniel Mellado Area

Reporter: Chris Collins

QA Contact: Junqi Zhao

Votes: 0

Watchers: 12

Created: 2023/05/22 9:13 PM

Updated: 2024/05/08 9:46 PM

Resolved: 2024/05/08 9:46 PM
