OpenShift Virtualization / CNV-30386

[2218435] Queueing multiple VM migrations causes virt-controller to hit a deadlock.


Details

    • CNV Virtualization Sprint 239, CNV Virtualization Sprint 240, CNV Virtualization Sprint 241, CNV Virtualization Sprint 243, CNV Virtualization Sprint 244, CNV Virtualization Sprint 245, CNV Virtualization Sprint 246, CNV Virtualization Sprint 247, CNV Virtualization Sprint 248, CNV Virtualization Sprint 252
    • High

    Description

      I'm running a scale regression setup on:
      =========================================
      OpenShift 4.13.2
      OpenShift Virtualization 4.13.1
      OpenShift Container Storage 4.12.4-rhodf
      This is a large-scale setup with 132 nodes running 6000 RHEL VMs on an external RHCS (Red Hat Ceph Storage) cluster.

      While I was testing idle VM migrations in bulks - meaning I schedule 100 VM migrations, wait for completion, and then schedule another 100 - I noticed that
      the migration completion rate was slowly degrading with every bulk, starting at 20 seconds per VM and reaching up to 1570 seconds per VM in the last bulks.
      In order to debug this issue I scheduled 800 VM migrations at once so it would be easier to spot the root cause.
      Ideally, the expected result is that all those migration jobs get queued and then executed at a rate of parallelMigrationsPerCluster.
      However, what actually happened is that all of them got stuck in the virt-controller migration queue.
      They remained there indefinitely while consuming memory and CPU; even after the VMIMs had already failed, the queue remained unfazed. In fact, the only thing that caused a few of those queue items to be eliminated was when nonvoluntary_ctxt_switches were triggered. I eventually killed the active virt-controller after 4.5 hours - see the attached image virt-controller-queue.
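
      For reference, the cluster-wide limit mentioned above can be inspected on the HyperConverged CR; a minimal sketch, assuming the default CR name "kubevirt-hyperconverged" in the "openshift-cnv" namespace and that the value is exposed under spec.liveMigrationConfig:

      # print the configured parallelMigrationsPerCluster value
      oc get hyperconverged kubevirt-hyperconverged -n openshift-cnv \
        -o jsonpath='{.spec.liveMigrationConfig.parallelMigrationsPerCluster}'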

      The way I found to avoid triggering this issue is to make sure, through automation, that the number of scheduled migrating VMs is always <= parallelMigrationsPerCluster.
      By doing that I was able to complete 1200 VM migrations in just over 12 minutes.
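
      A minimal sketch of this kind of throttling (not the exact automation used here; the "default" namespace and the limit value are placeholders, and the limit should stay <= parallelMigrationsPerCluster):

      LIMIT=5
      for vm in $(oc get vm -n default -o jsonpath='{.items[*].metadata.name}'); do
        # wait until the number of in-flight migrations drops below the limit
        while [ "$(oc get vmim -n default -o json | jq '[.items[] | select(.status.phase != "Succeeded" and .status.phase != "Failed")] | length')" -ge "$LIMIT" ]; do
          sleep 5
        done
        virtctl migrate "$vm" -n default
      done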

      It's important to note that this issue is exclusive to the migration flow; for example, when I mass-scheduled 6000 VMs for starting I didn't experience any issues.

      Note that I was using the following debug verbosity configuration, but the rate at which those logs were generated and overwritten made them useless at this scale.
      ============================================================================================================================================

      spec:
        logVerbosityConfig:
          kubevirt:
            virtController: 9
            virtHandler: 9
            virtLauncher: 9
            virtAPI: 9
      ============================================================================================================================================
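
      For reference, a verbosity setting like the above can be applied to the HyperConverged CR with a merge patch; a minimal sketch, assuming the default CR name and namespace:

      oc patch hyperconverged kubevirt-hyperconverged -n openshift-cnv --type merge \
        -p '{"spec":{"logVerbosityConfig":{"kubevirt":{"virtController":9,"virtHandler":9,"virtLauncher":9,"virtAPI":9}}}}'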

      Steps to reproduce (this issue is 100% reproducible):
      1. Create a cluster with 800 VMs.
      2. Initiate a large number of migrations (as easy as running a bunch of "virtctl migrate" commands; see the sketch below).
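
      The burst in step 2 can be approximated with something like the following sketch (VM names and the "default" namespace are placeholders):

      # fire off migrations for every VM in the namespace with no throttling
      for vm in $(oc get vm -n default -o jsonpath='{.items[*].metadata.name}'); do
        virtctl migrate "$vm" -n default
      done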

People

    jelejosne Jed Lejosne
    bbenshab Boaz Ben Shabat
    Guy Chen
    Votes: 0
    Watchers: 6
