Red Hat Fuse / ENTESB-14347

[Observability] Provide SOPs for SLOs


    • Type: Enhancement
    • Resolution: Won't Do
    • Priority: Blocker
    • 2021-M2
    • None
    • Camel-K

      What

      Create standard operating procedures (SOPs) for addressing breaches of SLOs

      Why

      So that SREs can investigate the cause of an Alert without extensive service-specific knowledge, and ultimately get the Service back into a good state before the SLO is breached.

      How

      An SOP, in the context of RHMI Monitoring & Alerting, is a document with a clear set of steps for troubleshooting why an Alert might be firing and for fixing the problem. SOPs should assume the reader has a high level of OpenShift and Kubernetes knowledge but little, if any, service-specific knowledge. Any service-specific terms or concepts relevant to the Alert should be clearly defined, with an explanation of how they relate to the firing Alert. The SOP should also specify how to verify that the issue is fixed after remedial action is taken.
      An example SOP can be seen in the Appendix.
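
The requirements above could be turned into a simple completeness check for SOP drafts. The following is only a minimal sketch: the `Sop` class, its field names, and `missing_sections` are hypothetical and are not part of RHMI or any existing tooling. It assumes an SOP needs at least one troubleshooting step, one remediation step, and one verification step, and treats service-specific terms as optional because some Alerts may not need any.

```python
from dataclasses import dataclass, field


@dataclass
class Sop:
    """Hypothetical model of an SOP; all field names are illustrative."""

    alert_name: str
    # Service-specific term -> definition and why it matters for this Alert.
    glossary: dict[str, str] = field(default_factory=dict)
    troubleshooting_steps: list[str] = field(default_factory=list)
    remediation_steps: list[str] = field(default_factory=list)
    verification_steps: list[str] = field(default_factory=list)


def missing_sections(sop: Sop) -> list[str]:
    """Return the names of required sections that contain no steps."""
    required = {
        "troubleshooting_steps": sop.troubleshooting_steps,
        "remediation_steps": sop.remediation_steps,
        "verification_steps": sop.verification_steps,
    }
    return [name for name, steps in required.items() if not steps]


# Usage: a draft that only covers troubleshooting so far.
draft = Sop(
    alert_name="ExampleAlert",  # placeholder, not a real Alert name
    troubleshooting_steps=["Check the pod status in the service namespace."],
)
print(missing_sections(draft))
```

A check like this only confirms that each section exists; whether the steps are clear enough for a reader without service-specific knowledge still needs human review.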

      Further Information:

            astefanu@redhat.com Antonin Stefanutti
            dffrench@redhat.com David Ffrench
