OpenShift Logging / LOG-3293

log-file-metric-exporter exhausting the resources of the node


    • Before this update, the log file size map generated by the `log-file-metrics-exporter` component did not remove entries for deleted files, resulting in continually increasing file size and process memory usage. With this update, the log file size map does not contain entries for deleted files.
    • Log Collection - Sprint 227, Log Collection - Sprint 228
    • Important

      Description of problem:

      The collector pod contains two containers:

      1. `collector`, which starts the fluentd process
      2. `logfilesmetricexporter`, which starts the `/usr/local/bin/log-file-metric-exporter` process
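
      For example, the containers of one collector pod can be listed with the following command (assuming the collector pods carry the `component=collector` label applied by the operator); it should return the `collector` and `logfilesmetricexporter` containers shown below:

          $ oc get pods -n openshift-logging -l component=collector \
              -o jsonpath='{.items[0].spec.containers[*].name}'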

      Reviewing both containers, only the first one has CPU and memory requests and limits, and they can be managed from the ClusterLogging Operator:

          - name: COLLECTOR_CONF_HASH
            value: fb4ebfa073fd0ea24153c48f22abdaa9
          image: registry.redhat.io/openshift-logging/fluentd-rhel8@sha256:1140e317d111e13c4900c1b6d128c5fdef05b9f319b0bd693665d67f3139d03a
          imagePullPolicy: IfNotPresent
          name: collector
          ports:
          - containerPort: 24231
            name: metrics
            protocol: TCP
          resources:   # <-- limited, as set in the ClusterLogging instance
            limits:
              memory: 2Gi
            requests:
              cpu: 100m
              memory: 1Gi
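
      For reference, the collector limits above come from the ClusterLogging instance, along the lines of the following sketch (`spec.collection.logs.fluentd.resources` in the `logging.openshift.io/v1` API):

          apiVersion: logging.openshift.io/v1
          kind: ClusterLogging
          metadata:
            name: instance
            namespace: openshift-logging
          spec:
            collection:
              logs:
                type: fluentd
                fluentd:
                  resources:
                    limits:
                      memory: 2Gi
                    requests:
                      cpu: 100m
                      memory: 1Gi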

      The second container, however, which runs the `/usr/local/bin/log-file-metric-exporter` process, has no limits/requests set by default, and there is not even an option to set them from the ClusterLogging Operator:

        - command:
          - /usr/local/bin/log-file-metric-exporter    # <-- the same process seen consuming 8 GB of RAM in the node output below
          - '  -verbosity=2'
          - ' -dir=/var/log/containers'
          - ' -http=:2112'
          - ' -keyFile=/etc/fluent/metrics/tls.key'
          - ' -crtFile=/etc/fluent/metrics/tls.crt'
          image: registry.redhat.io/openshift-logging/log-file-metric-exporter-rhel8@sha256:2f43018b00df04dcdb0eebb7ae90e91dd60970494d13fd0851d91b996c8b0daf
          imagePullPolicy: IfNotPresent
          name: logfilesmetricexporter
          ports:
          - containerPort: 2112
            name: logfile-metrics
            protocol: TCP
          resources: {}     # <-- no limits, and no option in the ClusterLogging CR instance to set them
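
      This can be confirmed directly against any running collector pod (the pod name below is a placeholder); the command returns an empty resources object:

          $ oc get pod <collector-pod> -n openshift-logging \
              -o jsonpath='{.spec.containers[?(@.name=="logfilesmetricexporter")].resources}'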

      Then, for an unknown reason, the `/usr/local/bin/log-file-metric-exporter` process started increasing its memory usage until it was consuming 8 GB, at which point the OCP master node began to have serious problems that degraded the performance of the whole cluster, since etcd requests reaching this node were answered with high latency.

      The memory usage of the process was captured in a sosreport:

       Top MEM-using processes: 
      USER PID %CPU %MEM VSZ-MiB RSS-MiB TTY STAT START TIME COMMAND 
      root 7047 8.1 41.3 9616 8295 ? - Mar10 28603:18 /usr/local/bin/log-file-metric-exporter -verbosity=2 -dir=/var/log/containers 
      root 1531508 20.9 10.3 2859 2080 ? - Nov08 299:43 kube-apiserver --openshift-config=/etc/kubernetes/static-pod-resources/configmaps/config/config.yaml
      root 3882 9.1 5.3 10252 1064 ? - Mar10 32027:28 etcd --logger=zap --log-level=info
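
      The same can be checked live on the affected node (the node name is a placeholder), similar to the sosreport output above:

          $ oc debug node/<node-name> -- chroot /host ps aux --sort=-rss | head -n 5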

      Version-Release number of selected component (if applicable):

      cluster-logging.5.3.2-20

      The same issue is also present in the latest version.

      How reproducible:

      Not able to reproduce the memory growth on demand, but it is easy to verify that the container running the `/usr/local/bin/log-file-metric-exporter` process has no limits and that there is no way to set them.

      Actual results:

      The `logfilesmetricexporter` container in the collector pods has no limits, so when, for an unknown reason, it started consuming 8 GB of RAM it impacted the node (a master) and the whole cluster.

      Expected results:

      The `logfilesmetricexporter` container should have limits/requests set by default so that it cannot consume resources without bound, and ideally there should be an option to set them from the ClusterLogging Operator.

      Then, if something causes the process to consume excessive memory or CPU, the limits contain it.
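
      Purely as an illustration of the kind of option requested (the `logFileMetricExporter` field below is hypothetical and does not exist in the ClusterLogging API of the affected version; the resource values are examples only):

          apiVersion: logging.openshift.io/v1
          kind: ClusterLogging
          metadata:
            name: instance
            namespace: openshift-logging
          spec:
            collection:
              logs:
                type: fluentd
                logFileMetricExporter:   # hypothetical field, mirroring the existing fluentd resources block
                  resources:
                    limits:
                      memory: 256Mi      # example values only
                    requests:
                      cpu: 100m
                      memory: 128Mi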

      Additional info:
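
      Until limits are in place, the per-container memory usage of the collector pods can be tracked via the metrics API (again assuming the `component=collector` label):

          $ oc adm top pods -n openshift-logging -l component=collector --containers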


            vimalkum@redhat.com Vimal Kumar
            rhn-support-ocasalsa Oscar Casal Sanchez