  1. OpenShift Virtualization
  2. CNV-17024

[2066222] Large scale VMs Migration is failing due to different HT configurations on nodes


Details

    • CNV Virtualization Sprint 220, CNV Virtualization Sprint 221, CNV Virtualization Sprint 222, CNV Virtualization Sprint 231, CNV Virtualization Sprint 232
    • High

    Description

      Some background:
      -------------------------
      I'm running a scale OpenShift setup with 100 OpenShift nodes and 47 RHCS 5.0 hosts as an external storage cluster, in preparation for an environment that was requested by a customer.

      This setup is currently running 3000 VMs:
      1500 RHEL 8.5 persistent storage VMs
      500 Windows 10 persistent storage VMs
      1000 Fedora ephemeral storage VMs

      The workers are divided into 3 zones:
      worker000 - worker031 = Zone0
      worker032 - worker062 = Zone1
      worker063 - worker096 = Zone2

      I start the migration by applying an empty MachineConfig to the zone,
      which causes the nodes in that zone to start draining.

      This is my migration config:


      liveMigrationConfig:
        completionTimeoutPerGiB: 800
        parallelMigrationsPerCluster: 11
        parallelOutboundMigrationsPerNode: 22
        progressTimeout: 150
      workloads: {}
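To make the two timeouts above concrete, here is a minimal sketch of how they bound a single migration, assuming `completionTimeoutPerGiB` scales with the amount of data to transfer and `progressTimeout` limits how long the copy may stall. The function names and the 4 GiB example size are hypothetical (the VM sizes are not stated here), and KubeVirt's real accounting of what counts as data may differ.

```python
# Sketch only: models the abort conditions implied by the migration config.
# The helper names and the example VM size are hypothetical.

COMPLETION_TIMEOUT_PER_GIB = 800  # seconds, from liveMigrationConfig
PROGRESS_TIMEOUT = 150            # seconds, from liveMigrationConfig


def completion_timeout_seconds(data_gib: float,
                               per_gib: int = COMPLETION_TIMEOUT_PER_GIB) -> float:
    """Total time budget for a migration that has to move data_gib GiB."""
    return data_gib * per_gib


def should_abort(elapsed_s: float, stalled_s: float, data_gib: float) -> bool:
    """True if either the overall budget or the stall limit is exceeded."""
    over_budget = elapsed_s > completion_timeout_seconds(data_gib)
    stalled = stalled_s > PROGRESS_TIMEOUT
    return over_budget or stalled


if __name__ == "__main__":
    # A hypothetical 4 GiB VM gets 4 * 800 = 3200 seconds in total.
    print(completion_timeout_seconds(4))
    print(should_abort(elapsed_s=100, stalled_s=151, data_gib=4))
```

Under these assumptions, a migration that keeps making progress is only cut off by the per-GiB budget, while one that stalls is cut off after 150 seconds regardless of size.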


      Another thing worth mentioning is that I'm running a custom KubeletConfig, which is required due to the additional 21,400 pods on the cluster:


      spec:
        kubeletConfig:
          kubeAPIBurst: 200
          kubeAPIQPS: 100
          maxPods: 500
        machineConfigPoolSelector:
          matchLabels:
            custom-kubelet: enabled


      Issue number 1:
      The first problem I encountered was that right after starting the migration,
      it got stuck for a few hours and nothing happened.
      I also tried to manually run virtctl migrate for a few of the VMs that were scheduled on cordoned nodes, and the CLI failed with timeouts.
      I resolved that by patching the virt-api Deployment to run additional replicas. This issue is already discussed at https://github.com/kubevirt/kubevirt/issues/7101


      apiVersion: hco.kubevirt.io/v1beta1
      kind: HyperConverged
      metadata:
        annotations:
          deployOVS: "false"
          kubevirt.kubevirt.io/jsonpatch: '[{"op": "add", "path": "/spec/customizeComponents/patches",
            "value": [{"resourceType": "Deployment", "resourceName": "virt-api", "type":
            "json", "patch": "[{\"op\": \"replace\", \"path\": \"/spec/replicas\", \"value\": 5}]"}]}]'
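The annotation value above is a JSON patch whose `patch` field is itself a JSON document encoded as a string, which makes the escaping easy to get wrong by hand. A minimal sketch of generating it with Python's standard `json` module follows; the function name is hypothetical, and the field names and values are copied from the annotation shown here.

```python
import json


def virt_api_replicas_annotation(replicas: int) -> str:
    """Build the jsonpatch annotation value that scales virt-api.

    The inner patch is serialized first, because the outer document
    carries it as a string rather than as nested JSON.
    """
    inner_patch = json.dumps(
        [{"op": "replace", "path": "/spec/replicas", "value": replicas}]
    )
    outer_patch = [{
        "op": "add",
        "path": "/spec/customizeComponents/patches",
        "value": [{
            "resourceType": "Deployment",
            "resourceName": "virt-api",
            "type": "json",
            "patch": inner_patch,
        }],
    }]
    return json.dumps(outer_patch)


annotation = virt_api_replicas_annotation(5)

if __name__ == "__main__":
    # Paste the printed value, single-quoted, under
    # metadata.annotations["kubevirt.kubevirt.io/jsonpatch"].
    print(annotation)
```

Serializing in two steps lets `json.dumps` produce the backslash-escaped quotes of the inner patch, so the result can be round-tripped with `json.loads` to check it before applying.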


      Issue number 2:
      Once the migration started to run, I hoped that was it. However, a few VMs are currently failing to migrate for various reasons. These are the VMs:

      rhel82-vm0074 3d23h Migrating True
      rhel82-vm0188 3d22h Migrating True
      rhel82-vm0253 3d21h Migrating True
      rhel82-vm0443 3d19h Migrating True
      rhel82-vm0451 3d19h Migrating True
      rhel82-vm0611 3d18h Migrating True
      rhel82-vm0784 3d17h Migrating True
      rhel82-vm1184 3d14h Migrating True
      rhel82-vm1428 3d12h Migrating True

      Here are a few examples:

      VM rhel82-vm0451 - running on worker031 failing due to Assertion on kvm_buf_set_msrs
      ---------------------------------------
      Type Reason Age From Message
      ---- ------ ---- ---- -------
      Normal SuccessfulCreate 148m disruptionbudget-controller Created Migration kubevirt-evacuation-mvr7r
      Normal PreparingTarget 143m virt-handler Migration Target is listening at 10.131.44.5, on ports: 36763,37373
      Normal PreparingTarget 143m (x24 over 11h) virt-handler VirtualMachineInstance Migration Target Prepared.
      Warning Migrated 143m virt-handler VirtualMachineInstance migration uid b9fd0b54-26a5-4063-bd46-7b1e5dbeddd5 failed. reason:Live migration failed error encountered during MigrateToURI3 libvirt api call: virError(Code=9, Domain=20, Message='operation failed: domain is not running')
      Normal SuccessfulCreate 89m disruptionbudget-controller Created Migration kubevirt-evacuation-nvjbk
      Normal PreparingTarget 85m virt-handler Migration Target is listening at 10.130.2.5, on ports: 37595,32775
      Normal PreparingTarget 85m (x12 over 10h) virt-handler VirtualMachineInstance Migration Target Prepared.
      Warning Migrated 80m virt-handler VirtualMachineInstance migration uid 7c9f1f23-43e3-4af7-b423-6b0088cf563f failed. reason:Live migration failed error encountered during MigrateToURI3 libvirt api call: virError(Code=9, Domain=20, Message='operation failed: domain is not running')
      Normal SuccessfulCreate 35m disruptionbudget-controller Created Migration kubevirt-evacuation-7fstf
      Normal PreparingTarget 33m virt-handler Migration Target is listening at 10.128.44.6, on ports: 38759,34661
      Warning Migrated 27m virt-handler VirtualMachineInstance migration uid 5b8e1b88-1b03-4668-9e1a-755989f7c868 failed. reason:Live migration failed error encountered during MigrateToURI3 libvirt api call: virError(Code=1, Domain=10, Message='internal error: qemu unexpectedly closed the monitor: 2022-03-21T08:48:41.882103Z qemu-kvm: error: failed to set MSR 0x38f to 0x7000000ff
      qemu-kvm: ../target/i386/kvm.c:2701: kvm_buf_set_msrs: Assertion `ret == cpu->kvm_msr_buf->nmsrs' failed.')
      Normal SuccessfulCreate 27m disruptionbudget-controller Created Migration kubevirt-evacuation-ts6nn
      Normal PreparingTarget 23m virt-handler Migration Target is listening at 10.128.44.6, on ports: 41015,35881
      Warning Migrated 17m virt-handler VirtualMachineInstance migration uid 74324d10-c606-41b2-8c8c-baeb94ccaa04 failed. reason:Live migration failed error encountered during MigrateToURI3 libvirt api call: virError(Code=9, Domain=20, Message='operation failed: domain is not running')
      Normal SuccessfulCreate 14m disruptionbudget-controller Created Migration kubevirt-evacuation-6sqfb
      Normal SuccessfulUpdate 13m (x32 over 11h) virtualmachine-controller Expanded PodDisruptionBudget kubevirt-disruption-budget-cnqj5
      Normal PreparingTarget 8m53s (x2 over 8m53s) virt-handler Migration Target is listening at 10.128.44.6, on ports: 46429,34129
      Normal PreparingTarget 8m52s (x13 over 33m) virt-handler VirtualMachineInstance Migration Target Prepared.
      Normal Migrating 8m52s (x116 over 11h) virt-handler VirtualMachineInstance is migrating.
      Normal SuccessfulUpdate 7m47s (x32 over 11h) disruptionbudget-controller shrank PodDisruptionBudget%!(EXTRA string=kubevirt-disruption-budget-cnqj5)
      Warning SyncFailed 3m40s (x32 over 10h) virt-handler server error. command Migrate failed: "migration job already executed"
      Warning Migrated 3m40s virt-handler VirtualMachineInstance migration uid ed604e03-8658-4739-9875-95b88f2e0dd0 failed. reason:Live migration failed error encountered during MigrateToURI3 libvirt api call: virError(Code=9, Domain=20, Message='operation failed: domain is not running')
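The value in the `failed to set MSR 0x38f to 0x7000000ff` error can be decoded to see what the target host rejected. The sketch below assumes MSR 0x38f is IA32_PERF_GLOBAL_CTRL as described in the Intel SDM, where the low bits enable general-purpose performance counters and the bits from 32 upward enable fixed counters. The link to hyper-threading is an assumption consistent with the issue title, not something the events confirm: on many Intel CPUs a logical CPU sees 8 general-purpose counters with HT disabled and 4 with HT enabled, so a guest started on an HT-off node would ask for more counters than an HT-on node can provide.

```python
# Sketch only: decodes the MSR value from the qemu-kvm error message.
# Assumes the IA32_PERF_GLOBAL_CTRL layout from the Intel SDM.

IA32_PERF_GLOBAL_CTRL = 0x38F


def decode_perf_global_ctrl(value: int) -> dict:
    """Count the enabled general-purpose and fixed performance counters."""
    gp_mask = value & 0xFFFFFFFF            # bits 0..31: general-purpose counters
    fixed_mask = (value >> 32) & 0xFFFFFFFF  # bits 32..63: fixed counters
    return {
        "general_purpose": bin(gp_mask).count("1"),
        "fixed": bin(fixed_mask).count("1"),
    }


decoded = decode_perf_global_ctrl(0x7000000FF)

if __name__ == "__main__":
    print(decoded)  # {'general_purpose': 8, 'fixed': 3}
```

If the assumption holds, the guest is requesting 8 general-purpose counters, which would fit a source node with HT disabled and fail on a target node with HT enabled.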

      VM rhel82-vm0660 - running on worker031 failing due to what seems to be a race condition
      ---------------------------------------
      Type Reason Age From Message
      ---- ------ ---- ---- -------
      Normal SuccessfulCreate 151m disruptionbudget-controller Created Migration kubevirt-evacuation-6zlsd
      Normal PreparingTarget 145m virt-handler Migration Target is listening at 10.131.0.7, on ports: 45093,37935
      Normal PreparingTarget 145m (x12 over 7h47m) virt-handler VirtualMachineInstance Migration Target Prepared.
      Warning Migrated 140m virt-handler VirtualMachineInstance migration uid 09832eb9-bed5-403c-9020-3e2f586a41e7 failed. reason:Live migration failed error encountered during MigrateToURI3 libvirt api call: virError(Code=9, Domain=20, Message='operation failed: domain is not running')
      Normal SuccessfulCreate 92m disruptionbudget-controller Created Migration kubevirt-evacuation-n8267
      Normal SuccessfulUpdate 91m (x11 over 11h) virtualmachine-controller Expanded PodDisruptionBudget kubevirt-disruption-budget-twdcv
      Normal PreparingTarget 88m virt-handler Migration Target is listening at 10.128.4.6, on ports: 39099,40877
      Normal Migrating 88m (x35 over 11h) virt-handler VirtualMachineInstance is migrating.
      Normal PreparingTarget 88m (x8 over 8h) virt-handler VirtualMachineInstance Migration Target Prepared.
      Normal SuccessfulUpdate 87m (x11 over 11h) disruptionbudget-controller shrank PodDisruptionBudget%!(EXTRA string=kubevirt-disruption-budget-twdcv)
      Warning SyncFailed 83m (x11 over 11h) virt-handler server error. command Migrate failed: "migration job already executed"
      Warning Migrated 83m virt-handler VirtualMachineInstance migration uid fb9e3a06-21ab-4e54-8e6e-861f44bbee36 failed. reason:Live migration failed error encountered during MigrateToURI3 libvirt api call: virError(Code=9, Domain=20, Message='operation failed: domain is not running')
      ---------------------------------------

      VM rhel82-vm0836 - running on worker024 failing again due to what seems to be a race condition and failed API calls
      ---------------------------------------

      Type Reason Age From Message
      ---- ------ ---- ---- -------
      Normal SuccessfulCreate 168m disruptionbudget-controller Created Migration kubevirt-evacuation-ddrdp
      Normal PreparingTarget 165m virt-handler Migration Target is listening at 10.131.0.7, on ports: 46719,42575
      Normal PreparingTarget 165m (x12 over 7h43m) virt-handler VirtualMachineInstance Migration Target Prepared.
      Warning Migrated 159m virt-handler VirtualMachineInstance migration uid 61b4e7c8-1348-4bac-9768-6a6be9c129e0 failed. reason:Live migration failed error encountered during MigrateToURI3 libvirt api call: virError(Code=9, Domain=20, Message='operation failed: domain is not running')
      Normal SuccessfulCreate 140m disruptionbudget-controller Created Migration kubevirt-evacuation-pnbpm
      Normal PreparingTarget 135m (x9 over 3h34m) virt-handler VirtualMachineInstance Migration Target Prepared.
      Normal PreparingTarget 135m (x2 over 135m) virt-handler Migration Target is listening at 10.130.30.5, on ports: 35207,37223
      Warning Migrated 129m virt-handler VirtualMachineInstance migration uid acca7d9e-723e-4f3c-adad-97d432db3a1b failed. reason:Live migration failed error encountered during MigrateToURI3 libvirt api call: virError(Code=1, Domain=10, Message='internal error: qemu unexpectedly closed the monitor: 2022-03-21T07:18:04.563275Z qemu-kvm: error: failed to set MSR 0x38f to 0x7000000ff
      qemu-kvm: ../target/i386/kvm.c:2701: kvm_buf_set_msrs: Assertion `ret == cpu->kvm_msr_buf->nmsrs' failed.')
      Normal SuccessfulCreate 124m disruptionbudget-controller Created Migration kubevirt-evacuation-frtqg
      Normal PreparingTarget 120m virt-handler Migration Target is listening at 10.131.44.5, on ports: 45403,37683
      Normal PreparingTarget 120m (x8 over 8h) virt-handler VirtualMachineInstance Migration Target Prepared.
      Warning Migrated 120m virt-handler VirtualMachineInstance migration uid 0f6f18b5-3fde-4beb-8a9c-c112b7f8da02 failed. reason:Live migration failed error encountered during MigrateToURI3 libvirt api call: virError(Code=9, Domain=20, Message='operation failed: domain is not running')
      Normal SuccessfulCreate 107m disruptionbudget-controller Created Migration kubevirt-evacuation-wsjrc
      Normal PreparingTarget 102m (x17 over 11h) virt-handler VirtualMachineInstance Migration Target Prepared.
      Normal PreparingTarget 102m virt-handler Migration Target is listening at 10.130.2.5, on ports: 35511,35451
      Warning Migrated 95m virt-handler VirtualMachineInstance migration uid ab614074-848a-4ee4-8c37-07d4fcfbd872 failed. reason:Live migration failed error encountered during MigrateToURI3 libvirt api call: virError(Code=1, Domain=10, Message='internal error: qemu unexpectedly closed the monitor: 2022-03-21T07:51:10.157656Z qemu-kvm: error: failed to set MSR 0x38f to 0x7000000ff
      qemu-kvm: ../target/i386/kvm.c:2701: kvm_buf_set_msrs: Assertion `ret == cpu->kvm_msr_buf->nmsrs' failed.')
      Normal SuccessfulCreate 11m disruptionbudget-controller Created Migration kubevirt-evacuation-zv85q
      Normal SuccessfulUpdate 10m (x38 over 16h) virtualmachine-controller Expanded PodDisruptionBudget kubevirt-disruption-budget-ln2kq
      Normal PreparingTarget 6m39s virt-handler Migration Target is listening at 10.128.44.6, on ports: 42829,45929
      Normal Migrating 6m38s (x126 over 16h) virt-handler VirtualMachineInstance is migrating.
      Normal PreparingTarget 6m38s (x4 over 6m39s) virt-handler VirtualMachineInstance Migration Target Prepared.
      Normal SuccessfulUpdate 5m14s (x38 over 16h) disruptionbudget-controller shrank PodDisruptionBudget%!(EXTRA string=kubevirt-disruption-budget-ln2kq)
      Warning SyncFailed 95s (x38 over 16h) virt-handler server error. command Migrate failed: "migration job already executed"
      Warning Migrated 95s virt-handler VirtualMachineInstance migration uid 1d146437-b031-47f7-accb-9ca42e960025 failed. reason:Live migration failed error encountered during MigrateToURI3 libvirt api call: virError(Code=9, Domain=20, Message='operation failed: domain is not running')
      ---------------------------------------
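The three event listings above mix a small number of distinct failure messages. A minimal sketch for tallying them follows, assuming the Message column of each Warning event has already been extracted into a list of strings. The labels and helper names are hypothetical, and the patterns only cover the messages that appear in these examples.

```python
import re
from collections import Counter

# Patterns for the failure messages seen in the events above.
PATTERNS = {
    "msr_assertion": re.compile(r"failed to set MSR 0x[0-9a-fA-F]+"),
    "domain_not_running": re.compile(r"domain is not running"),
    "job_already_executed": re.compile(r"migration job already executed"),
}


def classify(message: str) -> str:
    """Return the label of the first pattern found in the message."""
    for label, pattern in PATTERNS.items():
        if pattern.search(message):
            return label
    return "other"


def summarize(messages) -> Counter:
    """Count how many messages fall into each failure class."""
    return Counter(classify(m) for m in messages)


if __name__ == "__main__":
    sample = [
        "virError(Code=9, Domain=20, Message='operation failed: domain is not running')",
        "qemu-kvm: error: failed to set MSR 0x38f to 0x7000000ff",
        'server error. command Migrate failed: "migration job already executed"',
    ]
    print(summarize(sample))
```

Grouping by message makes it easier to see whether a VM is mostly hitting the MSR assertion or the "domain is not running" error across retries.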

      Versions of all relevant components:
      CNV 4.9.2
      RHCS 5.0
      OCP 4.9.15

      CNV must-gather:
      -----------------
      http://perf148h.perf.lab.eng.bos.redhat.com/share/BZ_logs/must-gather-failed-migration.tar.gz

            People

              bmordeha@redhat.com Barak Mordehai
              bbenshab Boaz Ben Shabat
              Kedar Bidarkar Kedar Bidarkar