OpenStack: Migrating from Intel to AMD hypervisors

Following a management decision to move to more energy-efficient AMD-based hypervisors in the future, we needed to migrate an OpenStack cluster from Intel Xeon Gold 6148 to AMD EPYC-SP3-7702 based systems. To speed up the migration and keep the impact on the customer as low as possible, we decided to temporarily restrict the CPU feature set and use live-migration.
In the end we successfully migrated 288 VMs with minimal downtime and no data loss using live-migration.

Scenario

In August 2019 the new generation of AMD EPYC processors (codename Rome) was released. The specifications of this processor looked intriguing to us at x-ion, as it offered more cores at a better TDP than our previous Intel-based systems. This potentially allowed us to offer the same compute power at a lower price, which also makes it attractive for our customers. At the same time, lower energy consumption is a step towards a lower CO₂ footprint in our data centers. This led us at x-ion to invest in some test systems and to create an action plan for migrating production systems.

Live-migration is not really necessary for our staging workloads. Still, a technical mind will always aim to complete a task with zero impact on the customer, and when that is not possible, least impact is the next best option. So for our production clusters we tested live-migration as a way to reduce or even eliminate the impact on our customers during a transition from Intel to AMD.

Action: Live-migration of VMs from Intel to AMD hypervisors and vice versa.

Error

In our test environment we attempted a live-migration of VMs from Intel to AMD hypervisors, which led to an (expected) error message caused by the incompatible CPU features.

Error: Unacceptable CPU info: CPU doesn’t have compatibility.
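
To illustrate why this check fails, the following minimal sketch compares the CPU flags of two hosts and lists the flags a guest could be using on the source that the destination cannot provide. It is only an illustration of the idea, not how Nova or libvirt implement the check. The function names and the shortened flag lists are assumptions; on a real host the text would come from /proc/cpuinfo.

```python
def parse_flags(cpuinfo_text: str) -> set[str]:
    """Return the CPU flags from the first 'flags' line of /proc/cpuinfo-style text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def missing_on_destination(source_flags: set[str], dest_flags: set[str]) -> list[str]:
    """Flags available on the source host that the destination host lacks."""
    return sorted(source_flags - dest_flags)


# Shortened, illustrative flag lists (real hosts report far more flags).
intel_cpuinfo = "model name : Intel host\nflags : fpu sse2 avx2 avx512f"
amd_cpuinfo = "model name : AMD host\nflags : fpu sse2 avx2 sse4a"

intel_flags = parse_flags(intel_cpuinfo)
amd_flags = parse_flags(amd_cpuinfo)

# Any non-empty result in either direction blocks a migration of a VM
# that was started with the full host feature set.
print("Intel -> AMD missing:", missing_on_destination(intel_flags, amd_flags))
print("AMD -> Intel missing:", missing_on_destination(amd_flags, intel_flags))
```

A guest restricted to a small common model such as kvm64 only uses flags that both hosts offer, which is the idea behind the solution described next.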

As we’d like to ease the transition for our customers, we needed to come up with a solution for this, in case the customer did not want to recreate all VMs.

Acceptable Solution

We tested a feasible solution in our staging environment and came up with a change of the cpu_model to kvm64 and a plan of action for migration between our Intel and AMD hypervisors.

Adding cpu_mode=custom and cpu_model=kvm64 to the [libvirt] section of /etc/nova/nova.conf and restarting the nova-compute service on the Intel hypervisors made libvirt present the generic kvm64 CPU model to the VMs, which made live-migration work. Since live-migration is possible from a less capable CPU model to a more capable one but not the other way around, nothing needs to be changed on the AMD hypervisors.

## Changes in nova.conf to reduce the CPU flags exposed to a VM to those of kvm64

[...]
[libvirt]
+ cpu_mode=custom
+ cpu_model=kvm64
[...]
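
Before restarting nova-compute it can be worth double-checking that the change really landed in the right section. The following is a minimal sketch using Python's standard configparser; the function name and the sample text are assumptions, and a real nova.conf may contain constructs that configparser reads differently from oslo.config, so treat the result as a sanity check only.

```python
import configparser


def uses_kvm64(nova_conf_text: str) -> bool:
    """Return True if [libvirt] sets cpu_mode=custom and cpu_model=kvm64."""
    # strict=False tolerates repeated options; interpolation=None keeps '%' literal.
    parser = configparser.ConfigParser(strict=False, interpolation=None)
    parser.read_string(nova_conf_text)
    if not parser.has_section("libvirt"):
        return False
    libvirt = parser["libvirt"]
    mode = libvirt.get("cpu_mode", fallback="").strip()
    model = libvirt.get("cpu_model", fallback="").strip()
    return mode == "custom" and model == "kvm64"


# Illustrative excerpt; on a hypervisor you would read /etc/nova/nova.conf instead.
sample_conf = """
[DEFAULT]
debug = false

[libvirt]
cpu_mode=custom
cpu_model=kvm64
"""

print(uses_kvm64(sample_conf))
```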

Restarting the nova-compute service does not have any impact on running VMs; they keep their connections alive. Certain operations, such as rebooting or deleting a VM, will wait until the service is available again.

Since a VM needs to be stopped and started to pick up the change, the ideal sequence is to stop and start the VM, migrate it, and then stop and start it again so that it picks up the CPU model of the AMD node. As kvm64 is a very old CPU model with far fewer CPU flags, it is not advisable to keep it in a running production environment, but it can be used as a one-time solution to migrate VMs from Intel to AMD (and vice versa). If necessary, a few additional CPU flags can carefully be exposed to the VMs.
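
To confirm that a VM has actually picked up the reduced model after the stop and start, one can look at the cpu element of its libvirt domain XML (for example the output of virsh dumpxml). The sketch below extracts the model name from such XML; the function name and the sample document, including the instance name, are illustrative assumptions.

```python
import xml.etree.ElementTree as ET
from typing import Optional


def guest_cpu_model(domain_xml: str) -> Optional[str]:
    """Return the CPU model named in a libvirt domain XML, or None if absent."""
    root = ET.fromstring(domain_xml.strip())
    model = root.find("./cpu/model")
    if model is None or model.text is None:
        return None
    return model.text.strip()


# Illustrative, heavily shortened domain XML.
sample_xml = """
<domain type='kvm'>
  <name>instance-00000001</name>
  <cpu mode='custom' match='exact'>
    <model fallback='allow'>kvm64</model>
  </cpu>
</domain>
"""

print(guest_cpu_model(sample_xml))
```

A VM that still reports a different model has not been restarted since the configuration change and would most likely fail the compatibility check again.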

Production

One of our customers needed more resources in one of their private OpenStack-based clouds. Together with the customer we decided that it would be in our shared best interest to use the new AMD platform. The high-level idea was to completely replace the Intel hypervisors in one private cloud with AMD-based hypervisors and to integrate the Intel hypervisors into another private cloud of the same customer. This way we would have two homogeneous environments: one with Intel hypervisors and one with AMD-based ones.

However, the customer’s private cloud is used for production, so minimizing the impact was mandatory. At the same time it was important to finish the migration quickly, as the customer needed the additional resources.

In order to help our customer have the resources as soon as possible, we formulated an action plan to migrate his staging VMs during the day and announce a maintenance window at night to allow for short downtimes of production VMs during restart.

Migrating the staging VMs first gave us experience, safety and confidence for the production VM migration.

The complete process for the migration looked as follows:

  1. Disable the source hypervisor in the scheduler
  2. Change the nova.conf on the source hypervisor to use limited CPU flags
  3. Restart nova-compute on the source hypervisor
  4. Stop the VMs which need to be migrated
  5. Start the VMs which need to be migrated
  6. Change the nova.conf on the source hypervisor back to use all CPU flags
  7. Restart nova-compute on the source hypervisor
  8. Start a live-migration to the destination hypervisor
    1. wait until this is finished
    2. repeat for next VM
  9. Stop the migrated VMs 
  10. Start the migrated VMs again
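
The steps above can be sketched as a dry-run plan generator that only prints what would be done, in order, without touching the cloud. This is a sketch under assumptions, not the tooling we used: the host and VM names are hypothetical, the steps on the hypervisor itself are left as comments because service names and config management differ per installation, and the exact live-migration flags of the openstack client vary between client versions (older clients use a different spelling), so check them against your installed version.

```python
def migration_plan(source: str, destination: str, vms: list[str]) -> list[str]:
    """Return the ordered migration steps as command strings (dry run only)."""
    steps = [
        # Step 1: keep the scheduler from placing new VMs on the source.
        f"openstack compute service set --disable {source} nova-compute",
        # Steps 2-3: performed on the hypervisor itself.
        f"# on {source}: set cpu_mode=custom and cpu_model=kvm64 in nova.conf",
        f"# on {source}: restart nova-compute",
    ]
    # Steps 4-5: restart the VMs so they pick up the reduced CPU model.
    steps += [f"openstack server stop {vm}" for vm in vms]
    steps += [f"openstack server start {vm}" for vm in vms]
    # Steps 6-7: restore the full CPU flag set on the source.
    steps += [
        f"# on {source}: revert nova.conf to the full CPU flag set",
        f"# on {source}: restart nova-compute",
    ]
    # Step 8: live-migrate one VM at a time, waiting for each to finish.
    steps += [
        f"openstack server migrate --live-migration --host {destination} --wait {vm}"
        for vm in vms
    ]
    # Steps 9-10: restart on the destination to pick up its CPU model.
    steps += [f"openstack server stop {vm}" for vm in vms]
    steps += [f"openstack server start {vm}" for vm in vms]
    return steps


plan = migration_plan("intel-hv-01", "amd-hv-01", ["vm-a", "vm-b"])
for step in plan:
    print(step)
```

A real run would additionally wait for each VM to reach the stopped state before starting it again and would stop at the first failed migration.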

We migrated 228 staging VMs during the day in around 6 hours and 60 production VMs in 90 minutes during the maintenance window. We did not observe reduced performance while the VMs were running with the kvm64 feature set. However, our customer neither relies on special CPU features nor runs compute-heavy workloads, so your own experience may vary.
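
For planning a similar maintenance window, these figures translate into a rough average time per VM. The short calculation below uses only the numbers given above; since the 6 hours are an approximation, so is the result. The function name is an assumption.

```python
def seconds_per_vm(vm_count: int, duration_minutes: float) -> float:
    """Average wall-clock seconds spent per VM for a migration batch."""
    return duration_minutes * 60 / vm_count


staging = seconds_per_vm(228, 6 * 60)   # 228 VMs in about 6 hours
production = seconds_per_vm(60, 90)     # 60 VMs in 90 minutes

print(f"staging: {staging:.1f} s per VM")
print(f"production: {production:.1f} s per VM")
```

Both batches come out at roughly a minute and a half per VM, including the restarts before and after the migration.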

Conclusion

Migrating one VM at a time is key to staying in control of the process and being able to react if something goes wrong. Migrating multiple VMs at the same time can be a bad idea. Experience from the community shows that using reduced CPU features, as in the kvm64 CPU model, may have a severe impact in some environments, and some have even reported a loss of all data. In most cases, however, this works, and it worked flawlessly in our environment too.

Be mindful when using this solution and test it thoroughly before applying it to your OpenStack clusters.

Live-migration of VMs from Intel to AMD hypervisors works. Being aware of your own environment and understanding the impact helps in making the right decision.
