This four-part series focuses on improving Hyper-V cluster performance. Part one covered how firmware, drivers, patches and updates affect virtual host cluster stability. Part two offered personal workarounds for two Hyper-V problems, both of which have improved the overall stability of my virtual environment. Here, in part three, I present more personal fixes to address Hyper-V cluster performance issues.
Hyper-V cluster performance issue No. 3: Volume GUID changes
Because workloads grow naturally over time, it's sometimes necessary to change the size of the logical unit number (LUN) on which a VM resides. After a LUN is extended in a Hyper-V cluster, however, its volume GUID can change. This causes Quick Migration problems, and System Center Virtual Machine Manager (SCVMM) displays an "unsupported cluster configuration" status for the VM.
This problem occurs because the LUN has changed its volume GUID, but the Hyper-V setting has the old volume GUID.
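As a rough illustration of that mismatch, the sketch below compares the volume GUID recorded for each VM against the GUID its LUN reports now. It is only a minimal Python sketch under stated assumptions: the function name, the VM and LUN names, and the GUID values are all hypothetical, and the two dictionaries stand in for whatever inventory export you already keep. It does not query Hyper-V or the cluster.

```python
# Hypothetical inventory data; in practice these values would come from your
# own export of VM settings and of the volumes the cluster currently sees.

def find_stale_volume_guids(configured, current):
    """Return {vm: (recorded_guid, actual_guid)} where the two disagree.

    configured: {vm_name: (lun_id, GUID recorded in the VM's settings)}
    current:    {lun_id: GUID the volume reports now}
    """
    stale = {}
    for vm_name, (lun_id, recorded_guid) in configured.items():
        actual_guid = current.get(lun_id)
        # GUIDs are case-insensitive, so compare them in lower case.
        if actual_guid is None or actual_guid.lower() != recorded_guid.lower():
            stale[vm_name] = (recorded_guid, actual_guid)
    return stale


configured = {
    "vm-app01": ("lun-01", "11111111-1111-1111-1111-111111111111"),
    "vm-db01": ("lun-02", "22222222-2222-2222-2222-222222222222"),
}
current = {
    "lun-01": "11111111-1111-1111-1111-111111111111",
    # Assumed to have changed after the LUN was extended.
    "lun-02": "9a9a9a9a-0000-0000-0000-000000000000",
}

for vm, (old, new) in find_stale_volume_guids(configured, current).items():
    print(f"{vm}: settings reference {old}, volume now reports {new}")
```

Any VM this flags is a candidate for the migration failure described here, so it is worth checking before the next planned move rather than during one.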
In most cases, the VM runs fine on its current cluster node. When you attempt to move the VM to another node, however, the LUN fails to mount. Eventually, the VM returns to the node it was on before the attempted move.
Once a VM enters this state, there is a creative workaround that involves shutting down the VM and using the cluster.exe command to re-register the VM's configuration. I have used this method with some success. Generally, though, I shut down the VM in Failover Cluster Manager, delete it in Hyper-V Manager and re-provision the VM (pointing to the new volume GUID and attaching the existing Virtual Hard Disks). My method requires reconfiguring the VM's network settings, but it gets the VM running.
To prevent this from recurring, install KB970529 on every Hyper-V cluster node. This addresses the volume GUID changes, so you won't have to use workarounds to correct the problem. Unfortunately, it will not fix VMs that are already affected.
(Note: I delete the VM in Hyper-V Manager rather than SCVMM because Hyper-V Manager does not delete the VM files.)
Hyper-V cluster performance issue No. 4: IT administrative errors
Some Hyper-V cluster performance problems are not the vendor's fault or the result of unexpected failures. At times, IT administrative errors happen, and you need to take the blame.
In Hyper-V R1, there are complex requirements for Quick Migration, such as having a LUN for each VM. In one cluster, I have more than 100 VMs, meaning there are more than 100 LUNs of varying sizes. On top of that, each LUN is presented to six nodes, so the VM LUNs can mount on any node.
A problem occurs, however, if a LUN isn't presented to every node. One time, I had a handful of VMs that would not move to a particular host. The host was new, so I thought there was a firmware or driver issue. A VM would go into a saved state and unmount its disk. Then, when the cluster tried to move the LUN to the new host, the move would fail and the VM would bounce to another cluster node.
After the firmware and drivers checked out, I investigated the configuration of servers. Ultimately, I had forgotten to present the older, existing VM LUNs to the new host. Because there wasn't a Fibre Channel path to the VM LUNs, the new node could not mount the LUN.
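A check along these lines could have caught my mistake before the first failed move. The following is a minimal Python sketch, not a cluster tool: the node names, LUN names and the function itself are hypothetical, and the input is assumed to be a hand-maintained or exported map of which LUNs each node can see.

```python
def find_missing_presentations(cluster_nodes, presented):
    """Return {node: [LUNs that node cannot see]} for incomplete nodes.

    cluster_nodes: iterable of node names in the cluster
    presented:     {node_name: set of LUN names presented to that node}
    """
    # Every LUN seen by at least one node should be visible to all of them.
    all_luns = set().union(*presented.values())
    missing = {}
    for node in cluster_nodes:
        gap = all_luns - presented.get(node, set())
        if gap:
            missing[node] = sorted(gap)
    return missing


# Hypothetical example: node6 is new and was only given the newest LUN.
nodes = ["node1", "node2", "node6"]
presented = {
    "node1": {"vm-lun-001", "vm-lun-002", "vm-lun-003"},
    "node2": {"vm-lun-001", "vm-lun-002", "vm-lun-003"},
    "node6": {"vm-lun-003"},
}

for node, luns in find_missing_presentations(nodes, presented).items():
    print(f"{node} is missing: {', '.join(luns)}")
```

Running a comparison like this after adding a node or a LUN is one inexpensive way to re-certify the one-LUN-per-VM layout.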
Luckily, this issue has been resolved with Hyper-V R2's Cluster Shared Volumes (CSV) or through the use of a third-party product like Melio FS, because these solutions do not rely on the one-LUN-per-VM architecture. Until there is a product that can catch everything that slips my mind, careful assessment and re-certification of virtual cluster environments after changes are necessary to prevent IT administrative errors.
Ultimately, for all the stability and redundancy that a Hyper-V cluster can add to a virtual environment, it does create a significant level of complexity as well. In my opinion, the trade-off is definitely worth it. But there are bound to be implementation shortcomings because of bugs or IT administrative errors. Knowing how to quickly stabilize your environment is a skill that needs to be developed.
In part four, I will focus on some strange virtual network issues and explain when it's necessary to take drastic action to recover from virtual network problems. Until then, send me any feedback or issues you have seen.
About the expert
Rob McShinsky is a senior systems engineer at Dartmouth Hitchcock Medical Center in Lebanon, N.H., and has more than 12 years of experience in the industry -- including a focus on server virtualization since 2004. He has been closely involved with Microsoft as an early adopter of Hyper-V and System Center Virtual Machine Manager 2008, as well as a customer reference. In addition, he blogs at VirtuallyAware.com, writing tips and documenting experiences with various virtualization products.