High-availability software and other products may look the same on vendor data sheets, but under the hood they can vary significantly.
It's important to explore and test each product's features carefully, because they affect recovery performance, scalability, failure detection and the management of your high-availability architecture.
Evaluating high-availability software
When testing hypervisor vendor high-availability software and products or validating internal configurations, you should look at the following:
- automated virtual machine (VM) recovery following a physical server failure;
- automated VM recovery following a VM guest OS failure;
- criteria that determine VM placement;
- VM balancing and redistribution following a server failure; and
- VM deployment and management TCO associated with the cluster's underlying storage architecture.
Naturally, any high-availability software should restart VMs on surviving cluster nodes following an outage. Basic failover is often tested by unplugging the power cord from a physical cluster node. You should also determine whether a less catastrophic hardware issue -- such as a failed network or storage port that reduces service levels below acceptable standards -- triggers failover in your high-availability architecture.
In addition, you need to test the cluster's "split brain" avoidance features. Do so by unplugging a cluster heartbeat network connection from one of the physical cluster nodes and confirming that the cluster heartbeat traffic continues without disruption over an alternate path -- such as another network connection or Fibre Channel.
Finally, you should force a potential split brain by disconnecting all network heartbeat ports on one physical node -- assuming redundancy isn't provided through the Fibre Channel storage area network. Removing all heartbeat paths isolates the node and leaves it unable to communicate with the remaining cluster nodes. If the cluster responds by trying to mount and start the same VMs on multiple nodes -- the original node and one of the surviving nodes -- the cluster does not handle split brain well, and VM data could be corrupted as a result.
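The outcome of the forced split-brain test can be checked mechanically. The following is a minimal sketch, assuming you can collect a list of running VMs from each node after the test; the function name, node names and VM names are hypothetical and not tied to any vendor's tooling.

```python
from collections import defaultdict


def find_split_brain_vms(running_vms_by_node):
    """Return {vm: [nodes]} for every VM reported as running on more than one node."""
    nodes_by_vm = defaultdict(set)
    for node, vms in running_vms_by_node.items():
        for vm in vms:
            nodes_by_vm[vm].add(node)
    return {vm: sorted(nodes) for vm, nodes in nodes_by_vm.items() if len(nodes) > 1}


# Hypothetical inventory gathered after isolating node1 from all heartbeat paths.
observed = {
    "node1": ["web01", "db01"],   # isolated node, still running its VMs
    "node2": ["web01", "app01"],  # survivor that also started web01
    "node3": ["db02"],
}

duplicates = find_split_brain_vms(observed)
print(duplicates)  # {'web01': ['node1', 'node2']}
```

An empty result suggests the cluster fenced the isolated node correctly; any entry is a VM whose disk may have been mounted by two hosts at once.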
Comparing hypervisors for a high-availability architecture
Many hypervisors include capabilities to detect major guest OS failures, such as Windows stop errors or Linux kernel panics inside a VM, and will automatically restart the VM following such a failure.
To force a stop error on Windows, you can use the feature that generates a memory dump file from the keyboard. For Linux, the procedure Set Up Linux Kernel Crash Dump on SUSE Linux Enterprise Server covers the corresponding crash-dump configuration.
Criteria that determine VM host placement may include the following:
- available CPU, memory, network I/O and storage I/O;
- service-level requirement awareness; and
- security- and compliance-related placement restrictions.
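As a rough illustration of how such criteria narrow the set of candidate hosts, here is a minimal sketch. It models only free CPU, free memory and a single compliance restriction; the `Host` and `VM` classes, the `required_zone` field and all values are assumptions made for this example, and service-level awareness and I/O are not modeled.

```python
from dataclasses import dataclass, field


@dataclass
class Host:
    name: str
    free_cpu: int
    free_mem_gb: int
    zones: set = field(default_factory=set)  # hypothetical compliance tags


@dataclass
class VM:
    name: str
    cpu: int
    mem_gb: int
    required_zone: str = ""  # empty string means no placement restriction


def eligible_hosts(vm, hosts):
    """Return names of hosts with enough free resources that satisfy the VM's restriction."""
    return [
        h.name
        for h in hosts
        if h.free_cpu >= vm.cpu
        and h.free_mem_gb >= vm.mem_gb
        and (not vm.required_zone or vm.required_zone in h.zones)
    ]


hosts = [
    Host("node2", free_cpu=8, free_mem_gb=32, zones={"pci"}),
    Host("node3", free_cpu=2, free_mem_gb=64, zones={"pci"}),
    Host("node4", free_cpu=16, free_mem_gb=64),
]

print(eligible_hosts(VM("db01", cpu=4, mem_gb=16, required_zone="pci"), hosts))  # ['node2']
print(eligible_hosts(VM("web01", cpu=4, mem_gb=16), hosts))  # ['node2', 'node4']
```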
For service-level requirement awareness, ask yourself whether the high-availability software bases VM physical host selection on your VMs' required service levels. You should also see how VMs are placed following a failure. Some VM cluster products place VMs by node order instead of fanning VMs out across all surviving physical nodes.
If cluster node 1 fails, for example, all VMs running on node 1 would try to start on node 2. VMs that couldn't start on node 2 would then try to start on node 3. The process would continue until all VMs have started.
A hypervisor's management software should offer some form of intelligent placement that spreads VM restarts across all remaining nodes in the cluster. Of course, you need at least a three-node cluster to validate whether a particular product spreads VM restart jobs across multiple surviving nodes, because a two-node cluster leaves only one survivor.
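The difference between the two restart behaviors can be sketched in a few lines. This is an illustrative model only, assuming a single abstract capacity number per node; the function names, VM sizes and node capacities are hypothetical, and real products weigh many more factors.

```python
def place_by_node_order(vms, capacity):
    """Fill surviving nodes in order: try node 2 first, overflow to node 3, and so on."""
    free = dict(capacity)
    placement = {}
    for vm, size in vms:
        for node in free:  # dicts preserve insertion order
            if free[node] >= size:
                free[node] -= size
                placement[vm] = node
                break
    return placement


def place_by_spreading(vms, capacity):
    """Restart each VM on the surviving node that currently has the most free capacity."""
    free = dict(capacity)
    placement = {}
    for vm, size in vms:
        node = max(free, key=free.get)  # ties go to the first node listed
        if free[node] >= size:
            free[node] -= size
            placement[vm] = node
    return placement


# Hypothetical scenario: node 1 failed while running four VMs of equal size.
failed_vms = [("vm1", 4), ("vm2", 4), ("vm3", 4), ("vm4", 4)]
survivors = {"node2": 12, "node3": 12, "node4": 12}

print(place_by_node_order(failed_vms, survivors))
# {'vm1': 'node2', 'vm2': 'node2', 'vm3': 'node2', 'vm4': 'node3'}
print(place_by_spreading(failed_vms, survivors))
# {'vm1': 'node2', 'vm2': 'node3', 'vm3': 'node4', 'vm4': 'node2'}
```

In the node-order case node 2 is filled completely before node 3 receives anything, which is the pattern to watch for when you run the failover test on a cluster of three or more nodes.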
Shared storage's effect on a high-availability architecture
Cluster products rely on shared storage, so it's important to evaluate the impact of shared storage management on a high-availability infrastructure.
For example, clustered hypervisor products that require a logical unit number (LUN) per VM may require server administrators to contact a storage administrator each time they plan to deploy a new VM, so that a new LUN can be provisioned and mapped to the appropriate physical hosts in the cluster.
Vendors such as Citrix Systems have started to offer products that integrate LUN provisioning with hypervisor management, but not all storage arrays are supported. Although it may seem a small point, a LUN-per-VM architecture can add to the cost of managing the system.
Shared cluster file systems such as VMware's Virtual Machine File System (VMFS) allow VMs to share large volumes of 2 TB and above. As a result, they allow virtualization administrators to deploy new VMs without having to provision new LUNs.
The level of integration between VMFS and storage arrays varies widely, so evaluate the level of support before buying an array. Otherwise, you may discover that many advanced array features are unusable. Of course, many hypervisors also support Network File System, another option that involves relatively simple and familiar management.
When evaluating storage for virtual environments, there are also many factors to consider. Storage is critical to a high-availability architecture, so take your time and choose wisely. The storage model you choose should be easily and efficiently managed. It should also be well integrated with your organization's virtualization and storage infrastructure.
High availability is essential for production virtualization environments. Having too many eggs in one basket -- or too many VMs on one physical host -- means that a single server failure can take multiple resources offline at once. So thoroughly evaluate hypervisor cluster features. Otherwise, you may experience buyer's remorse when high-availability software doesn't behave as expected following a failure.
About the Author
Chris Wolf, an analyst in the Data Center Strategies service at Midvale, Utah-based Burton Group, has more than 15 years of experience in the IT trenches and nine years of experience with enterprise virtualization technologies. Wolf provides enterprise clients with practical research and advice about server virtualization, data center consolidation, business continuity and data protection. He authored Virtualization: From the Desktop to the Enterprise, the first book published on the topic, and has published dozens of articles on advanced virtualization topics, high availability and business continuity.