  1. Red Hat Advanced Cluster Management
  2. ACM-9856

New feature: spoke time drift protection in Multicluster Observability


      Create an informative issue (See each section, incomplete templates/issues won't be triaged)

      Using the current documentation as a model, please complete the issue template. 

      Note: Doc team updates the current version and the two previous versions (n-2). For earlier versions, we will address only high-priority, customer-reported issues for releases in support.

      Prerequisite: Start with what we have

      Always look at the current documentation to describe the change that is needed. Use the source or portal link for Step 4:

       - Use the Customer Portal: https://access.redhat.com/documentation/en-us/red_hat_advanced_cluster_management_for_kubernetes

       - Use the GitHub link to find the staged docs in the repository: https://github.com/stolostron/rhacm-docs 

      Describe the changes in the doc and link to your dev story

      Provide info for the following steps:

      1. - [x] Mandatory Add the required version to the Fix version/s field.

      2. - [x] Mandatory Choose the type of documentation change.

            - [x] New topic in an existing section or new section

      In the release notes for ACM Observability, add a section stating that we are adding a level of protection against issues caused by a spoke whose clock has drifted into the future: https://github.com/stolostron/rhacm-docs/blob/2.10_stage/release_notes/whats_new.adoc

            - [ ] Update to an existing topic

      3. - [ ] Mandatory for GA content:
                  
             - [ ] Add steps and/or other important conceptual information here:

                  
             - [ ] Add Required access level for the user to complete the task here:
             

             - [ ] Add verification at the end of the task, how does the user verify success (a command to run or a result to see?)
           
           
             - [x] Add link to dev story here: https://issues.redhat.com/browse/ACM-5860

      4. - [x] Mandatory for bugs: What is the diff? Clearly define what the problem is, what the change is, and link to the current documentation:

      Thanos Receive is the component that ACM Observability uses to ingest metric time series from the spoke clusters. The timestamps of these time series are critical for the TSDB, which by default, and by the design of its format, refuses data that it considers to be in the past.

      If a spoke connected to a hub experiences time drift and pushes the latest timestamp of the TSDB to a time in the future, every other spoke with a correct clock has its data refused at ingest time for being in the "past". In this situation, the metrics of all spokes except one are lost.

      The situation normalizes only after the spoke syncs its clock and real time catches up to the future moment to which the TSDB was advanced.

      To help prevent this situation, we are configuring the Thanos Receive components to refuse any data that is more than 5 minutes in the future. We believe this value is sufficient, because the time drifts we often see are much larger than 5 minutes. With this protection in place, only the metrics of the one drifted spoke are lost.
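
The behavior described above can be illustrated with a small toy model. This is a sketch under stated assumptions, not the Thanos Receive implementation: `ToyReceiver`, `simulate`, the spoke names, and the two-hour drift are hypothetical, and the "in the past" rule is simplified to "older than the newest timestamp accepted so far".

```python
from dataclasses import dataclass, field
from typing import Optional

FUTURE_LIMIT_SECONDS = 5 * 60  # the 5-minute window described in this issue


@dataclass
class ToyReceiver:
    """Toy ingest endpoint (hypothetical; not the real Thanos Receive code)."""

    future_limit: Optional[float] = None  # None means no protection
    max_seen: float = float("-inf")       # newest timestamp accepted so far
    accepted: list = field(default_factory=list)

    def ingest(self, spoke: str, sample_ts: float, now: float) -> bool:
        # Protection: refuse samples too far ahead of the receiver's own clock.
        if self.future_limit is not None and sample_ts > now + self.future_limit:
            return False
        # Simplified "in the past" rule: refuse samples older than the newest one.
        if sample_ts < self.max_seen:
            return False
        self.max_seen = sample_ts
        self.accepted.append((spoke, sample_ts))
        return True


def simulate(future_limit: Optional[float]) -> dict:
    """Send one sample per spoke; spoke-b's clock is assumed to be 2 hours ahead."""
    rx = ToyReceiver(future_limit=future_limit)
    now = 1_000_000.0  # arbitrary receiver clock, in seconds
    return {
        "spoke-b": rx.ingest("spoke-b", now + 2 * 3600, now),
        "spoke-a": rx.ingest("spoke-a", now + 1, now + 1),
        "spoke-c": rx.ingest("spoke-c", now + 2, now + 2),
    }


if __name__ == "__main__":
    print("without protection:", simulate(None))
    print("with 5-minute limit:", simulate(FUTURE_LIMIT_SECONDS))
```

In this model, without the limit only the drifted `spoke-b` is accepted and the correct spokes are refused. With the limit, only `spoke-b` is refused and the other spokes keep ingesting.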

            jberger@redhat.com Jacob Berger
            rh-ee-doolivei Douglas Camata
            Xiang Yin Xiang Yin