This four-part series focuses on problems with Microsoft Hyper-V virtual machine (VM) clusters. The following Hyper-V virtual machine problems and fixes include tips from Microsoft and hardware vendors, as well as personal workarounds that have helped the overall stability of my virtual environment.
Many of these pointers are not exclusive to Hyper-V problems, and they may also apply to VMware and Citrix XenServer. Part one covers hardware, drivers, patches and configurations that may cause virtual environment instability.
All these virtual machine problems have plagued me at one time or another and reduced the reliability of my Hyper-V clustered environment. My goal is to expose these problems so that you may address them before they become an issue.
Upgrading firmware is crucial for an environment's stability. In a clustered arrangement, this involves more than a BIOS update. This setup is more complicated than a standalone environment because you need to consider the entire data path. One firmware update can affect the BIOS, host bus adapter (HBA), Fibre Channel switches and storage area network (SAN) storage controller.
After I moved most of my Hyper-V hosts to blade servers, there were numerous factors that could affect the stability of my virtualization environment. This arrangement requires more component firmware updates for the blade chassis than a rackmount setup. Because of this, I can rarely update a component's firmware without considering its interaction with other, older firmware in the environment.
Previously, I noticed our HP Virtual Connect network devices automatically resetting. Other blade servers in the enclosure, however, did not experience this problem. But issues arose for the Hyper-V cluster when the main network and cluster heartbeat became disconnected for 30 to 45 seconds. Behaving as though there was a failure, the other cluster nodes would move VMs to the remaining hosts.
Reviewing the HP documentation and bulletins revealed a fault in the Virtual Connect network devices' firmware. Before that firmware could be updated, however, the BIOS, HBA, HP Onboard Administrator and Virtual Connect Fibre Channel switch firmware needed to be upgraded. Across the six enclosures, this took more than three weeks to coordinate. Once the upgrades were complete, though, the system was stable again.
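When a fix depends on other firmware being upgraded first, it helps to write the prerequisites down and derive the order from them rather than keep it in your head. The following is a minimal Python sketch of that idea, not an HP tool. The component names and the prerequisite map are assumptions that mirror the situation described above; substitute the requirements from your own vendor bulletins.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical prerequisite map: component -> firmware that must be
# upgraded before it. Names are illustrative labels, not vendor identifiers.
PREREQUISITES = {
    "bios": set(),
    "hba": set(),
    "onboard_administrator": set(),
    "virtual_connect_fibre": set(),
    "virtual_connect_network": {
        "bios",
        "hba",
        "onboard_administrator",
        "virtual_connect_fibre",
    },
}


def upgrade_order(prerequisites):
    """Return the components in an order that respects every prerequisite."""
    return list(TopologicalSorter(prerequisites).static_order())


if __name__ == "__main__":
    for step, component in enumerate(upgrade_order(PREREQUISITES), start=1):
        print(step, component)
```

With this map, the Virtual Connect network firmware always comes last. If two bulletins ever contradict each other and create a circular requirement, `static_order()` raises `graphlib.CycleError` instead of producing a misleading plan.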
New drivers are released all the time for existing hardware. While I don't upgrade drivers just because they are new, some circumstances require an update. Often when firmware is updated, various drivers require updates to correspond with the new firmware revision.
Similar to firmware upgrades, driver updates affect numerous interactions on clustered hosts. Remember: Driver consistency across hosts is imperative in a clustered arrangement.
Take, for example, Fibre Channel HBA or iSCSI drivers. Most likely, each connects to the multipath I/O (MPIO) framework. When using EMC PowerPath or the HP MPIO framework, it is important that every cluster node runs the same driver and that the driver matches the MPIO level.
In some cases, mixing and matching drivers with MPIO levels can cause the clustered resources failover feature to malfunction. This problem is not limited to HBA drivers, as other cluster problems may occur when the network or power management drivers are inconsistent across cluster nodes.
I have experienced these problems when adding new cluster nodes. At the time, I installed the latest MPIO, HBA and network drivers on the new nodes. The mismatch between older and newer nodes resulted in more instability and unpredictability within my clustered virtual environment.
What is my recommendation? Use the same driver level on every clustered host, and make sure that level is compatible with your current firmware. The most recent release is not always the best, so I tend to stick with stable configurations. That said, if there is a reason to install new drivers, try to get the new revision out to every host as soon as possible.
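A consistency check like this is easy to automate once you have the driver versions from each node. Here is a minimal Python sketch that flags any driver installed at more than one version across the cluster. The node names, driver names and version strings are made up for illustration, and how you collect the inventory (a script on each host, your management tool's export) is left as an assumption.

```python
from collections import defaultdict

# Hypothetical inventory: node name -> {driver name: version string}.
INVENTORY = {
    "node1": {"hba": "9.1.8.25", "mpio": "5.3", "nic": "4.6.0"},
    "node2": {"hba": "9.1.8.25", "mpio": "5.3", "nic": "4.6.0"},
    "node3": {"hba": "9.1.9.26", "mpio": "5.5", "nic": "4.6.0"},
}


def find_driver_mismatches(inventory):
    """Return {driver: {version: [nodes]}} for drivers that differ across nodes.

    A driver that is absent from a node is reported as "not installed".
    """
    driver_names = {name for drivers in inventory.values() for name in drivers}
    mismatches = {}
    for driver in sorted(driver_names):
        nodes_by_version = defaultdict(list)
        for node in sorted(inventory):
            version = inventory[node].get(driver, "not installed")
            nodes_by_version[version].append(node)
        if len(nodes_by_version) > 1:
            mismatches[driver] = dict(nodes_by_version)
    return mismatches


if __name__ == "__main__":
    for driver, versions in find_driver_mismatches(INVENTORY).items():
        print(driver, versions)
```

In this sample data, node3 stands in for a newly added node built with the latest HBA and MPIO drivers, which is the situation that caused trouble in my cluster. The sketch compares version strings only; it does not know which combinations a vendor actually supports.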
Server virtualization is still maturing. Vendors keep pushing new features and capabilities into virtualized environments, but these offerings have shortcomings that create problems. Patches are released frequently to fix issues, but they can be hard to find at times.
In my Hyper-V clustered environment, there have been only a few instances when I've had lengthy support calls to fix a problem. In most cases, I've found a patch before a problem arose, or an issue was solved after a short call with Microsoft support.
Below are three sites I use to find new patches.
These sites are useful, but the Microsoft support blogs are usually the most helpful. Next time you are on the phone with Microsoft support, ask whether there is a blog about your concern. Some of the best insights into recent patches or enhancements come from people on the front lines. Here are a few of my favorite support blogs:
Whether it's a clustered or standalone environment, it's critical to keep up with host or VM patches from your virtualization vendor. This technology evolves rapidly, and losing a host because of a product bug can be devastating. If you want to add a complex cluster arrangement to your virtual environment, you periodically need to play detective to discover new patches.
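Part of the detective work is simply knowing which hosts are missing a patch that another host already has. The Python sketch below compares the installed patch lists of several hosts and reports the gaps. The host names and patch identifiers are placeholders, not real Microsoft KB numbers, and gathering the lists from each host is assumed to happen elsewhere.

```python
# Hypothetical data: host name -> set of installed patch IDs.
# The IDs are placeholders, not real KB article numbers.
INSTALLED = {
    "host1": {"KB0000001", "KB0000002", "KB0000003"},
    "host2": {"KB0000001", "KB0000003"},
    "host3": {"KB0000001", "KB0000002"},
}


def missing_patches(installed):
    """Return {host: [patch IDs]} for patches present elsewhere but not on that host."""
    all_patches = set().union(*installed.values())
    gaps = {}
    for host, patches in installed.items():
        missing = sorted(all_patches - patches)
        if missing:
            gaps[host] = missing
    return gaps


if __name__ == "__main__":
    for host, patches in missing_patches(INSTALLED).items():
        print(host, "is missing", ", ".join(patches))
```

This only shows where hosts disagree with one another. It cannot tell you about a patch that no host has yet, which is why the support sites and blogs still matter.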
Automatic Server Recovery (ASR) reboots
ASR is a server reset mechanism that aids in restarting a server "gracefully" when an installed agent senses a problem with the system (e.g., a thermal event or an OS lockup). If you don't use HP hardware, most vendors have a similar feature.
My exposure to ASR comes from HP hardware, where numerous false positives have resulted in my clustered hosts hard-powering down. (Others have reported the same problem on HP hardware.) For this reason, I disable ASR. The technology's reliability has been suspect, and I've lost confidence in the feature because it automatically brings down servers without consideration for the VMs running on the host.
In my experience, the HP ProLiant BL460c virtual hosts have been solid. A memory chip may occasionally go down, and drives may fail intermittently; otherwise, their performance has been good. The accompanying HP software, however, is a different story. To improve virtual host cluster reliability, I recommend disabling both the ASR BIOS setting and the agents that trigger the reboots.
Ultimately, matching the firmware and drivers, updating patches and disabling ASR reboots will provide a more stable foundation for your virtual clustered hosts. In the remaining three parts of this series, I will address other Hyper-V cluster problems. While some of these issues are product deficiencies, others are administrative errors and oversights. In any case, I will provide a few tips to avoid these problems and VM downtime.
Until then, send along any experience or issues you might have seen with your clustered virtual hosts.
About the expert
Rob McShinsky is a senior systems engineer at Dartmouth Hitchcock Medical Center in Lebanon, N.H., and has more than 12 years of experience in the industry, including a focus on server virtualization since 2004. He has been closely involved with Microsoft as an early adopter of Hyper-V and System Center Virtual Machine Manager 2008, as well as a customer reference. In addition, he blogs at VirtuallyAware.com, writing tips and documenting experiences with various virtualization products.