OpenShift Bugs / OCPBUGS-33355

[OVN] Stale NAT rules for egressIP at large pod scale

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major
    • Affects Version/s: 4.12
    • Severity: Important
      Description of problem:
      When there are hundreds of pods in a namespace matched by an EgressIP object, rebooting the egress node several times can leave stale SNAT rules behind.
       

      Version-Release number of selected component (if applicable):
      % oc get clusterversion
      NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
      version   4.12.33   True        False         10h     Cluster version is 4.12.33
       

      How reproducible:
      Randomly
       

      Steps to Reproduce:

      1. % oc get clusterversion
      NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
      version   4.12.33   True        False         3h47m   Cluster version is 4.12.33

      2. % oc get nodes
      NAME                                  STATUS   ROLES                  AGE     VERSION
      huirwang-41233-vc5cq-compute-0        Ready    worker                 4h2m    v1.25.12+26bab08
      huirwang-41233-vc5cq-compute-1        Ready    worker                 4h2m    v1.25.12+26bab08
      huirwang-41233-vc5cq-compute-2        Ready    worker                 4h2m    v1.25.12+26bab08
      huirwang-41233-vc5cq-compute-3        Ready    worker                 4h2m    v1.25.12+26bab08
      huirwang-41233-vc5cq-compute-4        Ready    worker                 3h55m   v1.25.12+26bab08
      huirwang-41233-vc5cq-control-plane-0  Ready    control-plane,master   4h11m   v1.25.12+26bab08
      huirwang-41233-vc5cq-control-plane-1  Ready    control-plane,master   4h12m   v1.25.12+26bab08
      huirwang-41233-vc5cq-control-plane-2  Ready    control-plane,master   4h12m   v1.25.12+26bab08

      3. Label 4 nodes as egress nodes: huirwang-41233-vc5cq-compute-0, huirwang-41233-vc5cq-compute-1, huirwang-41233-vc5cq-compute-2 and huirwang-41233-vc5cq-compute-3, e.g. as shown below.
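      The labeling command is not shown in the report; presumably the standard OVN-Kubernetes egress-assignable label was used, e.g.:

      % for n in 0 1 2 3; do oc label node huirwang-41233-vc5cq-compute-$n k8s.ovn.org/egress-assignable=""; done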

      4. Create an egressIP object:
      % oc get egressip -o yaml
      apiVersion: v1
      items:
      - apiVersion: k8s.ovn.org/v1
        kind: EgressIP
        metadata:
          creationTimestamp: "2024-05-07T06:25:29Z"
          generation: 10
          name: egressip-2
          resourceVersion: "153956"
          uid: f14969c0-1ee5-4bd4-8515-63f34370b9ab
        spec:
          egressIPs:
          - 192.168.221.100
          namespaceSelector:
            matchLabels:
              name: qe
        status:
          items:
          - egressIP: 192.168.221.100
            node: huirwang-41233-vc5cq-compute-0
      kind: List
      metadata:
        resourceVersion: ""
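      For reference, a standalone manifest matching the spec above (reconstructed from the output shown; the exact file used is not in the report):

      apiVersion: k8s.ovn.org/v1
      kind: EgressIP
      metadata:
        name: egressip-2
      spec:
        egressIPs:
        - 192.168.221.100
        namespaceSelector:
          matchLabels:
            name: qe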

      5. Create a namespace test and a deployment in it, then scale up to 350 pods (see the sketch below):
      % oc get pods -n test | grep Running | wc -l
      350
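      Any simple workload works for this; a sketch (the deployment name and image are illustrative, not from the original report):

      % oc create namespace test
      % oc create deployment test-pods -n test --image=quay.io/openshifttest/hello-sdn
      % oc scale deployment test-pods -n test --replicas=350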

      6. Label the namespace test with name=qe so it matches the EgressIP namespaceSelector.
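      For example:
      % oc label namespace test name=qe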

      7. Check the SNAT rules in the OVN NB database (run from an OVN pod; see the note below):
      sh-4.4# ovn-nbctl --format=csv find nat external_ids:name=egressip-2 | grep k8s-huirwang-41233-vc5cq-compute-0 | wc -l
      350
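      The sh-4.4# prompts in this report are a shell inside the OVN pod hosting the NB database; one way to get there on a 4.12 cluster (the pod name is a placeholder):

      % oc rsh -n openshift-ovn-kubernetes -c nbdb <ovnkube-master-pod>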

      8. Reboot the egress node (one way to do so is sketched below) and, after the egressIP fails over to another egress node, check the SNAT rules again.
      9. Repeat step 8 several times.
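      A sketch of one way to reboot the current egress node from the CLI (any reboot method should do):

      % oc debug node/huirwang-41233-vc5cq-compute-0 -- chroot /host systemctl reboot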

      When the issue occurred, the egress node was huirwang-41233-vc5cq-compute-0:
      % oc get egressip -o yaml
      apiVersion: v1
      items:
      - apiVersion: k8s.ovn.org/v1
        kind: EgressIP
        metadata:
          creationTimestamp: "2024-05-07T06:25:29Z"
          generation: 10
          name: egressip-2
          resourceVersion: "153956"
          uid: f14969c0-1ee5-4bd4-8515-63f34370b9ab
        spec:
          egressIPs:
          - 192.168.221.100
          namespaceSelector:
            matchLabels:
              name: qe
        status:
          items:
          - egressIP: 192.168.221.100
            node: huirwang-41233-vc5cq-compute-0
      kind: List
      metadata:
        resourceVersion: ""
      There is one stale SNAT rule left behind for node huirwang-41233-vc5cq-compute-2:

      sh-4.4# ovn-nbctl --format=csv find nat external_ids:name=egressip-2 | grep k8s-huirwang-41233-vc5cq-compute-2
      a6136d02-f4b3-4806-b5e4-cb846a6c53ef,[],[],{name=egressip-2},"""192.168.221.100""",[],"""""",[],"""10.128.2.75""",k8s-huirwang-41233-vc5cq-compute-2,"{stateless=""false""}",snat
      sh-4.4# ovn-nbctl --format=csv find nat external_ids:name=egressip-2 | grep k8s-huirwang-41233-vc5cq-compute-2 | wc -l
      1
      sh-4.4# ovn-nbctl --format=csv find nat external_ids:name=egressip-2 | grep k8s-huirwang-41233-vc5cq-compute-0 | wc -l
      350
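      To compare SNAT counts across all four egress nodes at once, the same grep-based counting can be looped (sketch):

      sh-4.4# for n in 0 1 2 3; do echo -n "compute-$n: "; ovn-nbctl --format=csv find nat external_ids:name=egressip-2 | grep -c k8s-huirwang-41233-vc5cq-compute-$n; done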

       

      Actual results:
      Stale SNAT rules are left in the OVN NB database after egress node reboots and egressIP failover.
       

      Expected results:
      No stale SNAT rules remain after failover.
       

      Additional info:

      With the namespace scaled to 750 pods, 149 stale SNAT rules were left after rebooting the egress node 4~5 times. A possible manual cleanup is sketched below.
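      If manual cleanup is needed, a stale entry can be detached from the cluster router by UUID (a sketch; it assumes the default OVN-Kubernetes router name ovn_cluster_router, so verify against your NB DB before use):

      sh-4.4# ovn-nbctl remove logical_router ovn_cluster_router nat a6136d02-f4b3-4806-b5e4-cb846a6c53ef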

      Affected Platforms:

      Is it an

      1. internal CI failure 
      2. customer issue / SD
      3. internal RedHat testing failure

       

      If it is an internal RedHat testing failure:

      • Please share a kubeconfig or creds to a live cluster for the assignee to debug/troubleshoot along with reproducer steps (especially if it's a telco use case like ICNI, secondary bridges or BM+kubevirt).

       

      If it is a CI failure:

       

      • Did it happen in different CI lanes? If so please provide links to multiple failures with the same error instance
      • Did it happen in both sdn and ovn jobs? If so please provide links to multiple failures with the same error instance
      • Did it happen in other platforms (e.g. aws, azure, gcp, baremetal etc) ? If so please provide links to multiple failures with the same error instance
      • When did the failure start happening? Please provide the UTC timestamp of the networking outage window from a sample failure run
      • If it's a connectivity issue,
      • What is the srcNode, srcIP and srcNamespace and srcPodName?
      • What is the dstNode, dstIP and dstNamespace and dstPodName?
      • What is the traffic path? (examples: pod2pod? pod2external?, pod2svc? pod2Node? etc)

       

      If it is a customer / SD issue:

       

      • Provide enough information in the bug description that Engineering doesn’t need to read the entire case history.
      • Don’t presume that Engineering has access to Salesforce.
      • Please provide must-gather and sos-report with an exact link to the comment in the support case with the attachment.  The format should be: https://access.redhat.com/support/cases/#/case/<case number>/discussion?attachmentId=<attachment id>
      • Describe what each relevant attachment is intended to demonstrate (failed pods, log errors, OVS issues, etc).  
      • When showing the results from commands, include the entire command in the output.
      • Referring to the attached must-gather, sosreport or other attachment, please provide the following details:
        • If the issue is in a customer namespace then provide a namespace inspect.
        • If it is a connectivity issue:
          • What is the srcNode, srcNamespace, srcPodName and srcPodIP?
          • What is the dstNode, dstNamespace, dstPodName and  dstPodIP?
          • What is the traffic path? (examples: pod2pod? pod2external?, pod2svc? pod2Node? etc)
          • Please provide the UTC timestamp networking outage window from must-gather
          • Please provide tcpdump pcaps taken during the outage filtered based on the above provided src/dst IPs
        • If it is not a connectivity issue:
          • Describe the steps taken so far to analyze the logs from networking components (cluster-network-operator, OVNK, SDN, openvswitch, ovs-configure etc) and the actual component where the issue was seen based on the attached must-gather. Please attach snippets of relevant logs around the window when problem has happened if any.
      • For OCPBUGS in which the issue has been identified, label with “sbr-triaged”
      • For OCPBUGS in which the issue has not been identified and needs Engineering help for root cause, label with “sbr-untriaged”
      • Note: bugs that do not meet these minimum standards will be closed with label “SDN-Jira-template”

            Assignee: Patryk Diak (pdiak@redhat.com)
            Reporter: Huiran Wang (huirwang)
            Votes: 0
            Watchers: 3