Type: Task
Resolution: Won't Do
Priority: Undefined
Fix Version/s: ACM 2.10.0
Affects Version/s: ACM 2.9.0
Component/s: Documentation, Observability
Labels:

Blocked: False
Blocked Reason: None
Ready: False
Regression: No
Intelligence Requested:
Market:

SFDC Cases Links:
SFDC Cases Counter:
SFDC Cases Open:

Create an informative issue (See each section, incomplete templates/issues won't be triaged)

Using the current documentation as a model, please complete the issue template.

Note: Doc team updates the current version and the two previous versions (n-2). For earlier versions, we will address only high-priority, customer-reported issues for releases in support.

Prerequisite: Start with what we have

Always look at the current documentation to describe the change that is needed. Use the source or portal link for Step 4:

- Use the Customer Portal: https://access.redhat.com/documentation/en-us/red_hat_advanced_cluster_management_for_kubernetes

- Use the GitHub link to find the staged docs in the repository: https://github.com/stolostron/rhacm-docs

Describe the changes in the doc and link to your dev story

Provide info for the following steps:

1. - [x] Mandatory Add the required version to the Fix version/s field.

2. - [x] Mandatory Choose the type of documentation change.

- [x] New topic in an existing section or new section

In the release notes for ACM Observability, add a section mentioning that we are adding some protection against issues caused by a spoke whose clock has drifted into the future: https://github.com/stolostron/rhacm-docs/blob/2.10_stage/release_notes/whats_new.adoc

- [ ] Update to an existing topic

3. - [ ] Mandatory for GA content:

- [ ] Add steps and/or other important conceptual information here:

- [ ] Add Required access level for the user to complete the task here:

- [ ] Add verification at the end of the task, how does the user verify success (a command to run or a result to see?)

- [x] Add link to dev story here: https://issues.redhat.com/browse/ACM-5860

4. - [x] Mandatory for bugs: What is the diff? Clearly define what the problem is, what the change is, and link to the current documentation:

Thanos Receive is the component that ACM Observability uses to ingest metric time series from the spoke clusters. The timestamps of these time series are critical to the TSDB, which by default, and by the design of its format, refuses data that it considers to be in the past.

If a spoke connected to a hub experiences clock drift and pushes the latest timestamp of the TSDB to a time in the future, every other spoke with a correct clock has its data refused at ingest time for being in the "past". In this situation, the metrics of all spokes except one are lost.
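The failure mode can be illustrated with a deliberately simplified toy model. This is not Thanos or Prometheus code: `ToyTSDB`, its strict "older than the newest timestamp is rejected" rule, and the spoke names are all assumptions made for illustration, and a real TSDB applies more nuanced bounds.

```python
from dataclasses import dataclass, field


@dataclass
class ToyTSDB:
    """Hypothetical model: refuses any sample older than the newest one seen."""

    max_ts: float = 0.0
    accepted: list = field(default_factory=list)

    def append(self, spoke: str, ts: float) -> bool:
        # Data "in the past" relative to the newest timestamp is refused.
        if ts < self.max_ts:
            return False
        self.max_ts = ts
        self.accepted.append((spoke, ts))
        return True


now = 1_000_000.0  # arbitrary "real" time in seconds
db = ToyTSDB()

results = [
    db.append("spoke-a", now),               # correct clock, accepted
    db.append("spoke-drifted", now + 3600),  # clock one hour ahead, accepted
    db.append("spoke-a", now + 30),          # correct clock, now refused
    db.append("spoke-b", now + 30),          # correct clock, now refused
]
print(results)
```

In this sketch, once the drifted spoke advances `max_ts`, every spoke with a correct clock is refused until real time passes the advanced timestamp.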

The situation normalizes only after the spoke syncs its clock and real time catches up with the future timestamp to which the TSDB was advanced.

To help prevent this situation, we are configuring the Thanos Receive components to refuse any data that is more than 5 minutes in the future. We believe this threshold is sufficient, because the time drifts we typically see are far larger than 5 minutes. With this protection in place, only the drifted spoke's metrics are lost.
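The 5-minute limit amounts to a simple check at ingest time, sketched below. The function name `accept_sample` and its parameters are hypothetical and only mirror the behavior described in this issue; they are not the Thanos Receive implementation or its configuration.

```python
MAX_FUTURE_SECONDS = 5 * 60  # the 5-minute limit described in this issue


def accept_sample(sample_ts: float, receiver_now: float,
                  max_future: float = MAX_FUTURE_SECONDS) -> bool:
    """Return False for samples more than max_future seconds ahead of the receiver clock."""
    return sample_ts <= receiver_now + max_future


receiver_now = 1_000_000.0  # arbitrary receiver time in seconds

on_time = accept_sample(receiver_now + 30, receiver_now)      # small skew
drifted = accept_sample(receiver_now + 3600, receiver_now)    # one hour ahead
print(on_time, drifted)
```

Because the drifted spoke's samples are refused before they reach the TSDB, the latest timestamp is never advanced and the other spokes keep ingesting normally.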

Assignee: Jacob Berger

Reporter: Douglas Camata

QA Contact: Xiang Yin

Votes: 0

Watchers: 3

Created: 2024/02/08 5:50 PM

Updated: 2024/02/26 1:22 PM

Resolved: 2024/02/26 1:22 PM
