  1. OpenShift Virtualization
  2. CNV-23056

[2149960] CNV 4.12.0-745 (dev nightly) & 4.11.1 | miniscale testing - VMs are crashing during migration / migration is failing

    • CNV Virtualization Sprint 228, CNV Virtualization Sprint 229, CNV Virtualization Sprint 230, CNV Virtualization Sprint 238
    • High

      Some background:
      -------------------------
      I'm running a mini-scale OpenShift setup with 30 OpenShift nodes. The CNV build is the latest nightly, 4.12.0-745. I'm currently running 1500 Fedora VMs and have been doing some migration testing. Unfortunately, after triggering 1000 migrations, 161 VMs are stuck in an error state, e.g.:

      virt-launcher-fedora-vm1421-7qjfj 0/2 Error 0 98m
      virt-launcher-fedora-vm1422-plnpg 0/2 Error 0 64m
      virt-launcher-fedora-vm1423-vw5nv 0/2 Error 0 81m
      virt-launcher-fedora-vm1424-85bh7 0/2 Error 0 15m
      virt-launcher-fedora-vm1425-vvflp 0/2 Error 0 48m
      virt-launcher-fedora-vm1427-xmzls 0/2 Error 0 31m
      virt-launcher-fedora-vm1428-8c7cf 0/2 Error 0 15m
      virt-launcher-fedora-vm1430-wm56l 0/2 Error 0 15m
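
To count how many launcher pods are affected, a listing like the one above can be tallied by status. This is a minimal sketch, assuming the columns are NAME, READY, STATUS, RESTARTS, AGE (the listing has no header row, so that layout is an assumption) and that the text was saved from the CLI. The function names are hypothetical.

```python
from collections import Counter

# Three lines copied from the listing above.
# Assumed column order: NAME READY STATUS RESTARTS AGE.
SAMPLE = """\
virt-launcher-fedora-vm1421-7qjfj 0/2 Error 0 98m
virt-launcher-fedora-vm1422-plnpg 0/2 Error 0 64m
virt-launcher-fedora-vm1424-85bh7 0/2 Error 0 15m
"""


def count_by_status(listing: str, prefix: str = "virt-launcher-") -> Counter:
    """Tally pods whose name starts with `prefix` by their STATUS column."""
    counts = Counter()
    for line in listing.splitlines():
        fields = line.split()
        # Skip blank lines, header rows, and pods that are not launchers.
        if len(fields) < 5 or not fields[0].startswith(prefix):
            continue
        counts[fields[2]] += 1
    return counts


def pods_in_status(listing: str, status: str, prefix: str = "virt-launcher-") -> list:
    """Return the names of matching pods that are in the given status."""
    names = []
    for line in listing.splitlines():
        fields = line.split()
        if len(fields) >= 5 and fields[0].startswith(prefix) and fields[2] == status:
            names.append(fields[0])
    return names


if __name__ == "__main__":
    print(count_by_status(SAMPLE))
    print(pods_in_status(SAMPLE, "Error"))
```

Run against the full listing, the `Error` count should match the 161 stuck VMs reported above if each has exactly one failed launcher pod.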

      logs show:
      -----------
      {
      "component": "virt-launcher",
      "level": "info",
      "msg": "Collected all requested hook sidecar sockets",
      "pos": "manager.go:86",
      "timestamp": "2022-12-01T12:41:01.973479Z"
      }
      {
      "component": "virt-launcher",
      "level": "info",
      "msg": "Sorted all collected sidecar sockets per hook point based on their priority and name: map[]",
      "pos": "manager.go:89",
      "timestamp": "2022-12-01T12:41:01.973531Z"
      }
      {
      "component": "virt-launcher",
      "level": "info",
      "msg": "Connecting to libvirt daemon: qemu+unix:///session?socket=/var/run/libvirt/libvirt-sock",
      "pos": "libvirt.go:497",
      "timestamp": "2022-12-01T12:41:01.974885Z"
      }
      {
      "component": "virt-launcher",
      "level": "info",
      "msg": "Connecting to libvirt daemon failed: virError(Code=38, Domain=7, Message='Failed to connect socket to '/var/run/libvirt/libvirt-sock': No such file or directory')",
      "pos": "libvirt.go:505",
      "timestamp": "2022-12-01T12:41:01.975225Z"
      }
      {
      "component": "virt-launcher",
      "level": "info",
      "msg": "libvirt version: 8.0.0, package: 5.module+el8.6.0+14495+7194fa43 (Red Hat, Inc. <http://bugzilla.redhat.com/bugzilla>, 2022-03-16-19:03:54, )",
      "subcomponent": "libvirt",
      "thread": "46",
      "timestamp": "2022-12-01T12:41:02.286000Z"
      }
      {
      "component": "virt-launcher",
      "level": "info",
      "msg": "hostname: virt-launcher-fedora-vm1430-wm56l",
      "subcomponent": "libvirt",
      "thread": "46",
      "timestamp": "2022-12-01T12:41:02.286000Z"
      }
      {
      "component": "virt-launcher",
      "level": "error",
      "msg": "internal error: Unable to get session bus connection: Cannot autolaunch D-Bus without X11 $DISPLAY",
      "pos": "virGDBusGetSessionBus:128",
      "subcomponent": "libvirt",
      "thread": "46",
      "timestamp": "2022-12-01T12:41:02.286000Z"
      }
      {
      "component": "virt-launcher",
      "level": "error",
      "msg": "internal error: Unable to get system bus connection: Could not connect: No such file or directory",
      "pos": "virGDBusGetSystemBus:101",
      "subcomponent": "libvirt",
      "thread": "46",
      "timestamp": "2022-12-01T12:41:02.286000Z"
      }
      {
      "component": "virt-launcher",
      "level": "info",
      "msg": "Connected to libvirt daemon",
      "pos": "libvirt.go:513",
      "timestamp": "2022-12-01T12:41:02.476806Z"
      }
      {
      "component": "virt-launcher",
      "level": "info",
      "msg": "Registered libvirt event notify callback",
      "pos": "client.go:510",
      "timestamp": "2022-12-01T12:41:02.479265Z"
      }
      {
      "component": "virt-launcher",
      "level": "info",
      "msg": "Marked as ready",
      "pos": "virt-launcher.go:74",
      "timestamp": "2022-12-01T12:41:02.479412Z"
      }
      parse error: Invalid numeric literal at line 12, column 6
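
The `parse error: Invalid numeric literal` line above looks like the output of a JSON pretty-printer (such as jq) hitting a log line that is not valid JSON, though the report does not show the command used. A minimal sketch of a more tolerant reader follows. It assumes the raw launcher log has one JSON object per line and sets aside anything that does not parse. The non-JSON line in the sample is a placeholder, not taken from the real log, and the function names are hypothetical.

```python
import json

# Two entries from the log above in single-line form, plus a placeholder
# non-JSON line (hypothetical; the real offending line is not in the report).
SAMPLE = """\
{"component": "virt-launcher", "level": "info", "msg": "Marked as ready", "pos": "virt-launcher.go:74"}
this line is not JSON
{"component": "virt-launcher", "level": "error", "msg": "internal error: Unable to get system bus connection: Could not connect: No such file or directory", "pos": "virGDBusGetSystemBus:101"}
"""


def split_log_lines(raw: str):
    """Return (records, unparsed): JSON objects and the lines that failed to parse."""
    records, unparsed = [], []
    for line in raw.splitlines():
        line = line.strip()
        if not line:
            continue
        try:
            obj = json.loads(line)
        except json.JSONDecodeError:
            unparsed.append(line)
            continue
        # A bare number or string is valid JSON but not a log record.
        if isinstance(obj, dict):
            records.append(obj)
        else:
            unparsed.append(line)
    return records, unparsed


def errors_only(records):
    """Keep only records logged at error level."""
    return [r for r in records if r.get("level") == "error"]


if __name__ == "__main__":
    records, unparsed = split_log_lines(SAMPLE)
    for rec in errors_only(records):
        print(rec.get("pos"), rec.get("msg"))
    print("unparsed lines:", unparsed)
```

Keeping the unparsed lines rather than dropping them matters here, since the line that broke the parser may be the one that explains the crash.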

      this is my migration config:
      ----------------------------
      liveMigrationConfig:
        completionTimeoutPerGiB: 800
        parallelMigrationsPerCluster: 20
        parallelOutboundMigrationsPerNode: 4
        progressTimeout: 150
      workloads: {}
      --------------------
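
For context on what these limits imply for a 1000-migration run, here is a rough back-of-the-envelope sketch. It rests on assumptions not stated in this report: that `parallelMigrationsPerCluster` caps concurrent migrations cluster-wide, that `parallelOutboundMigrationsPerNode` caps them per source node, and that the completion timeout scales linearly with the amount of data to migrate in GiB. The 4 GiB VM size is hypothetical, since the report does not give the VM memory size.

```python
import math

# Values copied from the migration config above.
CONFIG = {
    "completionTimeoutPerGiB": 800,
    "parallelMigrationsPerCluster": 20,
    "parallelOutboundMigrationsPerNode": 4,
    "progressTimeout": 150,
}


def max_concurrent(nodes: int, cfg: dict) -> int:
    """Upper bound on simultaneous migrations under both caps (assumed semantics)."""
    per_node_total = nodes * cfg["parallelOutboundMigrationsPerNode"]
    return min(cfg["parallelMigrationsPerCluster"], per_node_total)


def min_batches(total_migrations: int, concurrent: int) -> int:
    """Fewest sequential waves needed if every wave runs at full concurrency."""
    return math.ceil(total_migrations / concurrent)


def completion_timeout_seconds(size_gib: float, cfg: dict) -> float:
    """Assumed completion timeout: per-GiB value multiplied by the data size."""
    return size_gib * cfg["completionTimeoutPerGiB"]


if __name__ == "__main__":
    concurrent = max_concurrent(30, CONFIG)            # 30 nodes, as in the report
    print("concurrent migrations:", concurrent)
    print("minimum waves for 1000 migrations:", min_batches(1000, concurrent))
    # Hypothetical 4 GiB VM; the real VM size is not given in the report.
    print("completion timeout (s):", completion_timeout_seconds(4, CONFIG))
```

Under these assumptions the cluster-wide cap of 20 is the binding limit on 30 nodes, so 1000 migrations need at least 50 waves.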

      Versions of all relevant components:
      CNV 4.12.0-745
      OCP 4.11.4

      CNV must-gather:
      -----------------
      http://perf148h.perf.lab.eng.bos.redhat.com/share/BZ_logs/pods_crah_after_migration.tar.gz

            jelejosne Jed Lejosne
            bbenshab Boaz Ben Shabat
            Kedar Bidarkar Kedar Bidarkar
