OpenShift Bugs / OCPBUGS-22368

Missing alerting when specific resources are reaching potential gRPC limitation of about 2147483647 bytes



      Description of problem:

      In gRPC, MaxSendMsgSize is set to 2147483647 bytes. If specific resources in OpenShift Container Platform 4 exceed 2147483647 bytes (for example, if all secrets in etcd together contain more than 2147483647 bytes), OpenShift Container Platform 4 starts misbehaving, as different controllers can no longer list those resources and therefore remain stuck in an error loop.
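
      For reference, 2147483647 is math.MaxInt32, which matches the limit printed in the logs below. The following is a minimal Go sketch of how such a send limit is configured on a gRPC server, assuming the ceiling comes from a server-side MaxSendMsgSize of math.MaxInt32; it does not reproduce the exact etcd code path:

      package main

      import (
          "fmt"
          "math"

          "google.golang.org/grpc"
      )

      func main() {
          // A gRPC server refuses to send any single response larger than its
          // configured MaxSendMsgSize; math.MaxInt32 is the value seen in the logs.
          server := grpc.NewServer(grpc.MaxSendMsgSize(math.MaxInt32))
          defer server.Stop()

          fmt.Printf("max send size: %d bytes\n", math.MaxInt32)
          // The list response in the logs below was 2169698338 bytes,
          // i.e. 22214691 bytes over that ceiling.
          fmt.Printf("overshoot: %d bytes\n", 2169698338-math.MaxInt32)
      }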
      
      I0821 23:03:24.662977      17 trace.go:205] Trace[419549052]: "List(recursive=true) etcd3" key:/secrets,resourceVersion:,resourceVersionMatch:, limit:10000,continue: (21-Aug-2023 23:03:23.767) (total time: 895 ms): Trace[419549052]: [895.681318ms] [895.681318ms] END
      W0821 23:03:24.662996      17 reflector.go:324] storage/cacher.go:/secrets: failed to list *core.Secret: rpc error: code = ResourceExhausted desc = grpc: trying to send message larger than max (2169698338 vs.  2147483647)
      E0821 23:03:24.663007      17 cacher.go:425] cacher (*core.Secret): unexpected ListAndWatch error: failed to list *core.Secret: rpc error: code = ResourceExhausted desc = grpc: trying to send message larger than max (2169698338 vs. 2147483647); reinitializing...
      
      {"level":"warn","ts":"2023-08-17T23:03:24.662Z","logger":"etcd-client","caller":"v3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc00357ae00/10.1.1.3:2379","attempt":0,"error":"rpc error: code = ResourceExhausted desc = grpc: trying to send message larger than max (2169698338 vs. 2147483647)"}
      
      This in turn causes pods to remain stuck in the Terminating state, prevents new pods from being created, and triggers problems with kubelets and many other components. Given the various failures associated with the issue, the OpenShift Container Platform 4 API will eventually become overloaded as well, with requests piling up in the Priority and Fairness queue, until the OpenShift Container Platform 4 cluster as a whole stops working.
      
      The problem is also documented in https://access.redhat.com/solutions/7040736; reducing the overall size of the affected resources to below 2147483647 bytes resolved it.
      
      To prevent this from happening, either the limitation of about 2147483647 bytes for gRPC messages needs to be changed, or alerting needs to be put in place that helps customers understand when a given resource type is approaching that critical size, so that they can start reducing the size of that resource type in time.
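
      Until such an alert exists, the overall size can be estimated from the client side. The following is a rough sketch, assuming cluster-admin access and a kubeconfig at the default location; it sums the payload of all secrets via paginated list calls (the size stored in etcd differs somewhat because of encoding and encryption, so this is only an approximation):

      package main

      import (
          "context"
          "fmt"
          "log"

          metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
          "k8s.io/client-go/kubernetes"
          "k8s.io/client-go/tools/clientcmd"
      )

      const grpcMaxMsgSize = 2147483647 // math.MaxInt32, the ceiling from the logs above

      func main() {
          config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
          if err != nil {
              log.Fatal(err)
          }
          client, err := kubernetes.NewForConfig(config)
          if err != nil {
              log.Fatal(err)
          }

          // Page through all secrets so this check itself stays well below the limit.
          var total int
          continueToken := ""
          for {
              secrets, err := client.CoreV1().Secrets(metav1.NamespaceAll).List(context.TODO(),
                  metav1.ListOptions{Limit: 500, Continue: continueToken})
              if err != nil {
                  log.Fatal(err)
              }
              for _, s := range secrets.Items {
                  for _, v := range s.Data {
                      total += len(v)
                  }
              }
              continueToken = secrets.Continue
              if continueToken == "" {
                  break
              }
          }

          fmt.Printf("approximate total secret payload: %d bytes (%.1f%% of the %d-byte gRPC limit)\n",
              total, float64(total)/float64(grpcMaxMsgSize)*100, grpcMaxMsgSize)
      }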
      
      

      Version-Release number of selected component (if applicable):

      All OpenShift Container Platform 4 versions
      

      How reproducible:

      Always
      

      Steps to Reproduce:

      1. Install an OpenShift Container Platform 4 cluster
      2. Create many large secrets so that the overall size of all secrets exceeds 2147483647 bytes (see the sketch after this list)
      3. Try to start/stop pods and check whether it still works
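
      For step 2, a quick way to generate that volume of data is to create a few thousand secrets of close to 1 MiB each. A minimal sketch follows; only run this against a disposable test cluster, and note that the secret names, the count, and the target namespace are illustrative:

      package main

      import (
          "context"
          "crypto/rand"
          "fmt"
          "log"

          corev1 "k8s.io/api/core/v1"
          metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
          "k8s.io/client-go/kubernetes"
          "k8s.io/client-go/tools/clientcmd"
      )

      func main() {
          config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
          if err != nil {
              log.Fatal(err)
          }
          client, err := kubernetes.NewForConfig(config)
          if err != nil {
              log.Fatal(err)
          }

          // ~900 KiB per secret stays below the 1 MiB limit on individual secrets;
          // 2500 of them add up to roughly 2.3 GB, just past 2147483647 bytes.
          payload := make([]byte, 900*1024)
          if _, err := rand.Read(payload); err != nil {
              log.Fatal(err)
          }

          for i := 0; i < 2500; i++ {
              secret := &corev1.Secret{
                  ObjectMeta: metav1.ObjectMeta{Name: fmt.Sprintf("grpc-limit-test-%04d", i)},
                  Data:       map[string][]byte{"blob": payload},
              }
              _, err := client.CoreV1().Secrets("default").Create(context.TODO(), secret, metav1.CreateOptions{})
              if err != nil {
                  log.Fatalf("secret %d: %v", i, err)
              }
          }
      }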
      

      Actual results:

      I0821 23:03:24.662977      17 trace.go:205] Trace[419549052]: "List(recursive=true) etcd3" key:/secrets,resourceVersion:,resourceVersionMatch:, limit:10000,continue: (21-Aug-2023 23:03:23.767) (total time: 895 ms): Trace[419549052]: [895.681318ms] [895.681318ms] END
      W0821 23:03:24.662996      17 reflector.go:324] storage/cacher.go:/secrets: failed to list *core.Secret: rpc error: code = ResourceExhausted desc = grpc: trying to send message larger than max (2169698338 vs.  2147483647)
      E0821 23:03:24.663007      17 cacher.go:425] cacher (*core.Secret): unexpected ListAndWatch error: failed to list *core.Secret: rpc error: code = ResourceExhausted desc = grpc: trying to send message larger than max (2169698338 vs. 2147483647); reinitializing...
      
      The above is reported by the kube-apiserver, and as a result many activities, such as stopping and creating pods, no longer work.
      

      Expected results:

      Either have alerting in place that helps customers understand when they are about to reach that critical size for a given resource type, or change the gRPC limitation of 2147483647 bytes to something larger that is less likely to be reached.
      

      Additional info:

      
      

            tjungblu@redhat.com Thomas Jungblut
            rhn-support-sreber Simon Reber
            ge liu ge liu