OpenShift Bugs / OCPBUGS-22368

Missing alerting when specific resources are reaching potential gRPC limitation of about 2147483647 bytes



      Description of problem:

      In gRPC, MaxSendMsgSize is set to 2147483647 bytes. If specific resources in OpenShift Container Platform 4 exceed 2147483647 bytes (for example, if all secrets in etcd together contain more than 2147483647 bytes), OpenShift Container Platform 4 starts misbehaving, as different controllers can no longer list those resources and therefore remain stuck in an error loop.
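
      For reference, 2147483647 is math.MaxInt32, which matches the limit printed in the logs below. The following is a minimal Go sketch of how such a send limit is configured on a gRPC server, assuming the ceiling comes from a server-side MaxSendMsgSize of math.MaxInt32; it does not reproduce the exact etcd code path:

      package main

      import (
          "fmt"
          "math"

          "google.golang.org/grpc"
      )

      func main() {
          // A gRPC server refuses to send any single response larger than its
          // configured MaxSendMsgSize; math.MaxInt32 is the value seen in the logs.
          server := grpc.NewServer(grpc.MaxSendMsgSize(math.MaxInt32))
          defer server.Stop()

          fmt.Printf("max send size: %d bytes\n", math.MaxInt32)
          // The list response in the logs below was 2169698338 bytes,
          // i.e. 22214691 bytes over that ceiling.
          fmt.Printf("overshoot: %d bytes\n", 2169698338-math.MaxInt32)
      }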
      
      I0821 23:03:24.662977      17 trace.go:205] Trace[419549052]: "List(recursive=true) etcd3" key:/secrets,resourceVersion:,resourceVersionMatch:, limit:10000,continue: (21-Aug-2023 23:03:23.767) (total time: 895 ms): Trace[419549052]: [895.681318ms] [895.681318ms] END
      W0821 23:03:24.662996      17 reflector.go:324] storage/cacher.go:/secrets: failed to list *core.Secret: rpc error: code = ResourceExhausted desc = grpc: trying to send message larger than max (2169698338 vs.  2147483647)
      E0821 23:03:24.663007      17 cacher.go:425] cacher (*core.Secret): unexpected ListAndWatch error: failed to list *core.Secret: rpc error: code = ResourceExhausted desc = grpc: trying to send message larger than max (2169698338 vs. 2147483647); reinitializing...
      
      {"level":"warn","ts":"2023-08-17T23:03:24.662Z","logger":"etcd-client","caller":"v3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc00357ae00/10.1.1.3:2379","attempt":0,"error":"rpc error: code = ResourceExhausted desc = grpc: trying to send message larger than max (2169698338 vs. 2147483647)"}
      
      This in turn causes pods to remain stuck in the Terminating state, prevents new pods from being created, and triggers problems with kubelets and many other components. Given the various failures associated with the issue, the OpenShift Container Platform 4 API will eventually become overloaded as well, with requests piling up in the Priority and Fairness queue, until the OpenShift Container Platform 4 cluster as a whole stops working.
      
      The problem is also documented in https://access.redhat.com/solutions/7040736; reducing the overall size of the affected resources to below 2147483647 bytes resolved it.
      
      To prevent this from happening, either the limitation of about 2147483647 bytes for gRPC messages needs to be changed, or alerting needs to be put in place that helps customers understand when a given resource type is approaching that critical size, so that they can start reducing the size of that resource type in time.
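
      Until such an alert exists, the overall size can be estimated from the client side. The following is a rough sketch, assuming cluster-admin access and a kubeconfig at the default location; it sums the payload of all secrets via paginated list calls (the size stored in etcd differs somewhat because of encoding and encryption, so this is only an approximation):

      package main

      import (
          "context"
          "fmt"
          "log"

          metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
          "k8s.io/client-go/kubernetes"
          "k8s.io/client-go/tools/clientcmd"
      )

      const grpcMaxMsgSize = 2147483647 // math.MaxInt32, the ceiling from the logs above

      func main() {
          config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
          if err != nil {
              log.Fatal(err)
          }
          client, err := kubernetes.NewForConfig(config)
          if err != nil {
              log.Fatal(err)
          }

          // Page through all secrets so this check itself stays well below the limit.
          var total int
          continueToken := ""
          for {
              secrets, err := client.CoreV1().Secrets(metav1.NamespaceAll).List(context.TODO(),
                  metav1.ListOptions{Limit: 500, Continue: continueToken})
              if err != nil {
                  log.Fatal(err)
              }
              for _, s := range secrets.Items {
                  for _, v := range s.Data {
                      total += len(v)
                  }
              }
              continueToken = secrets.Continue
              if continueToken == "" {
                  break
              }
          }

          fmt.Printf("approximate total secret payload: %d bytes (%.1f%% of the %d-byte gRPC limit)\n",
              total, float64(total)/float64(grpcMaxMsgSize)*100, grpcMaxMsgSize)
      }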
      
      

      Version-Release number of selected component (if applicable):

      All OpenShift Container Platform 4 versions
      

      How reproducible:

      Always
      

      Steps to Reproduce:

      1. Install an OpenShift Container Platform 4 cluster
      2. Create many large secrets so that the overall size of all secrets exceeds 2147483647 bytes (see the sketch after this list)
      3. Try to start/stop pods and check whether it still works
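
      For step 2, a quick way to generate that volume of data is to create a few thousand secrets of close to 1 MiB each. A minimal sketch follows; only run this against a disposable test cluster, and note that the secret names, the count, and the target namespace are illustrative:

      package main

      import (
          "context"
          "crypto/rand"
          "fmt"
          "log"

          corev1 "k8s.io/api/core/v1"
          metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
          "k8s.io/client-go/kubernetes"
          "k8s.io/client-go/tools/clientcmd"
      )

      func main() {
          config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
          if err != nil {
              log.Fatal(err)
          }
          client, err := kubernetes.NewForConfig(config)
          if err != nil {
              log.Fatal(err)
          }

          // ~900 KiB per secret stays below the 1 MiB limit on individual secrets;
          // 2500 of them add up to roughly 2.3 GB, just past 2147483647 bytes.
          payload := make([]byte, 900*1024)
          if _, err := rand.Read(payload); err != nil {
              log.Fatal(err)
          }

          for i := 0; i < 2500; i++ {
              secret := &corev1.Secret{
                  ObjectMeta: metav1.ObjectMeta{Name: fmt.Sprintf("grpc-limit-test-%04d", i)},
                  Data:       map[string][]byte{"blob": payload},
              }
              _, err := client.CoreV1().Secrets("default").Create(context.TODO(), secret, metav1.CreateOptions{})
              if err != nil {
                  log.Fatalf("secret %d: %v", i, err)
              }
          }
      }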
      

      Actual results:

      I0821 23:03:24.662977      17 trace.go:205] Trace[419549052]: "List(recursive=true) etcd3" key:/secrets,resourceVersion:,resourceVersionMatch:, limit:10000,continue: (21-Aug-2023 23:03:23.767) (total time: 895 ms): Trace[419549052]: [895.681318ms] [895.681318ms] END
      W0821 23:03:24.662996      17 reflector.go:324] storage/cacher.go:/secrets: failed to list *core.Secret: rpc error: code = ResourceExhausted desc = grpc: trying to send message larger than max (2169698338 vs.  2147483647)
      E0821 23:03:24.663007      17 cacher.go:425] cacher (*core.Secret): unexpected ListAndWatch error: failed to list *core.Secret: rpc error: code = ResourceExhausted desc = grpc: trying to send message larger than max (2169698338 vs. 2147483647); reinitializing...
      
      The above is reported by the kube-apiserver, and as a result many activities, such as stopping and creating pods, no longer work.
      

      Expected results:

      Either have alerting in place that helps customers understand when they are about to reach that critical size for a given resource type, or change the gRPC limitation of 2147483647 bytes to something larger that is less likely to be reached.
      

      Additional info:

      
      

            tjungblu@redhat.com Thomas Jungblut
            rhn-support-sreber Simon Reber
            ge liu ge liu