  1. OpenShift Hive
  2. HIVE-2224

Lockstep hibernation with MachineConfigPool upgrades


    • In Progress
    • OCPSTRAT-543 - Shutdown/Resume of managed OSD/ROSA clusters

      One of the criteria for hibernation in OSD/ROSA is:

      Cluster shutdown must be blocked if the MachineConfigPools are in updating state.

      There is an inherent timing problem with simply effecting

      if isUpgrading(MCO) {
        return errors.New("Can't hibernate during MCO upgrade")
      }
      // Race: an upgrade can start here, after the check but before hibernation.
      hibernate()
      

      as an upgrade could kick off between when we check and when we initiate the hibernation. Thus it would have to look more like:

      freezeMCOUpgrades()
      if isUpgrading(MCO) {
        unfreezeMCOUpgrades()
        return errors.New("Can't hibernate during MCO upgrade")
      }
      hibernate()
      
      // ...and then in the resume flow
      resume()
      unfreezeMCOUpgrades()
      

      This assumes a freezeMCOUpgrades() is possible. I'm told it is, but if you freeze in the middle of an upgrade, MCO will finish whichever machine it's on and leave the rest un-upgraded. So some additional coordination will be necessary to make sure the freeze lands either before or after that whole process.

      I'm also told that in 4.13+, cert rotation is now done independently of the upgrade procedure. Assuming cert rotation is the motivation behind the original restriction ("no hibernation during MCO upgrades"), this may make this issue moot for 4.13+, but it would add an extra criterion (a version check) to the logic for <4.13.
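
      If the restriction does turn out to apply only below 4.13, the extra criterion might look something like the sketch below. The function name `needsMCPUpgradeCheck` is hypothetical, and the sketch assumes the cluster version is available as a plain "major.minor[.patch]" string; it is not taken from Hive's code.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// needsMCPUpgradeCheck reports whether a cluster at the given version is
// older than 4.13 and therefore (under the assumption above) still needs
// the "no hibernation during MCO upgrade" guard.
func needsMCPUpgradeCheck(version string) (bool, error) {
	parts := strings.SplitN(version, ".", 3)
	if len(parts) < 2 {
		return false, fmt.Errorf("malformed version %q", version)
	}
	major, err := strconv.Atoi(parts[0])
	if err != nil {
		return false, fmt.Errorf("malformed major version in %q: %w", version, err)
	}
	minor, err := strconv.Atoi(parts[1])
	if err != nil {
		return false, fmt.Errorf("malformed minor version in %q: %w", version, err)
	}
	return major < 4 || (major == 4 && minor < 13), nil
}

func main() {
	for _, v := range []string{"4.12.30", "4.13.0", "4.14.2"} {
		needed, err := needsMCPUpgradeCheck(v)
		fmt.Println(v, needed, err)
	}
}
```

      A caller would probably want to treat a parse error conservatively and apply the guard anyway, though that is a judgment call.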

            leah_leshchinsky Leah Leshchinsky
            efried.openshift Eric Fried
            Ju Lim Ju Lim