Which high-availability strategy is best for your VMs?

Did you know that a software bug could crash a VM and even cripple a VM platform? Learn HA strategies that keep workloads running when trouble threatens a mission-critical VM.

Virtual machines (VMs) bring flexibility and improved management capabilities to an enterprise, but VMs are not immune from problems. An unexpected software bug can crash a VM, or the underlying physical server might fail and cripple all VMs on that platform. Having a high-availability (HA) strategy in place can keep workloads running and available to users when trouble threatens a mission-critical VM.

There are three principal HA strategies available for VMs -- host clusters, guest failover clusters and network load balancing (NLB) clusters. Each method provides a valid approach for VM fault tolerance. Which method to use, however, isn't always apparent. This tip describes each approach and helps you carefully weigh each to decide which is best for your organization.

Introduction to Virtualization e-book
This article is excerpted from Chapter 4 of the Introduction to Virtualization e-book, which covers the basics of server virtualization technology. Learn about server consolidation, disaster recovery, high availability and more.

Table 1 shows some guidelines for choosing and implementing an HA solution for production VMs. Although the table outlines some basic concerns, you should, at the very least, create host failover clusters. Each host runs several production VMs, so if a host fails and no HA solution exists, every VM that resides on that host fails with it. That is a much larger exposure than when you run single workloads on individual physical machines. In addition, there's no reason why you can't run a host-level cluster while simultaneously running a guest-level HA solution such as failover clustering or NLB.

Table 1. Choosing a VM high-availability method.

| VM characteristics | Host server clustering | Guest failover clustering | Guest NLB clustering |
|---|---|---|---|
| Operating system edition | Web, Standard, Enterprise, Datacenter | Enterprise, Datacenter | Web, Standard, Enterprise, Datacenter |
| Number of guest nodes | Single nodes only | Generally two, but up to 16 | Up to 32 |
| Required resources in the virtual machine | At least one virtual network adapter | iSCSI disk connectors and a minimum of three virtual network adapters: cluster public, cluster private and iSCSI | Minimum of two virtual network adapters: cluster public and cluster private |
| Potential server role | Any server role | Stateful application servers, file and print servers, storage components for collaboration servers, network infrastructure servers such as Dynamic Host Configuration Protocol (DHCP) servers | Stateless application servers, dedicated Web servers, front-end collaboration servers, front-end terminal servers |
| Internal VM application | Any application | SQL or database servers, Exchange servers, message queuing servers, file servers, print servers | Web farms, Exchange client access servers, Internet Security and Acceleration (ISA) Server, VPN servers, streaming media servers, unified communications servers |

Use these guidelines in concert with your organization's existing service-level requirements to determine which level of HA you need to configure for each VM. You also must consider the support policy of the application you intend to run in the VM.
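As a rough illustration of how the Table 1 guidelines might be applied, the Python sketch below suggests a guest-level method from two inputs: whether the workload is stateful and which operating system edition the guest runs. The `Workload` type, the `suggest_guest_ha` function and the returned strings are hypothetical names invented for this example. Service-level requirements and application support policies still need to be weighed separately.

```python
from dataclasses import dataclass

# Guest operating system editions listed in Table 1 for each guest-level method.
FAILOVER_EDITIONS = {"enterprise", "datacenter"}
NLB_EDITIONS = {"web", "standard", "enterprise", "datacenter"}


@dataclass
class Workload:
    name: str
    stateful: bool    # True if the application records data from user sessions
    os_edition: str   # guest operating system edition, e.g. "Enterprise"


def suggest_guest_ha(workload: Workload) -> str:
    """Suggest a guest-level HA method using only the Table 1 guidelines."""
    edition = workload.os_edition.lower()
    if workload.stateful:
        # Stateful workloads map to guest failover clustering when the edition allows it.
        if edition in FAILOVER_EDITIONS:
            return "guest failover clustering"
        return "host clustering only"
    # Stateless workloads map to guest NLB clustering.
    if edition in NLB_EDITIONS:
        return "guest NLB clustering"
    return "host clustering only"
```

For example, `suggest_guest_ha(Workload("web-farm", False, "Web"))` suggests guest NLB clustering, while a stateful workload on a Standard edition guest falls back to host clustering alone.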

Single-site and multi-site host clusters
Single-site and multi-site clusters are available for host servers. Single-site clusters are based on shared storage in various forms. VMware, for example, uses two key technologies for host clustering: HA and the virtual machine file system (VMFS).

VMFS is a sharable file system that lets multiple host servers connect to the same storage container. VMFS usually requires some form of SAN, network attached storage (NAS) or iSCSI storage targets. VMware can also perform this via the Network File System (NFS), which enables small organizations to access HA configurations for host servers. VMware's HA component then manages potential host server failures. VMware host clusters can include up to 32 nodes.

Citrix XenServer can also rely on shared storage -- usually in the form of NFS, NAS, SAN or even iSCSI targets -- to provide HA for host servers. In a Citrix host server environment, you create highly available configurations by configuring host server resource pools. While other hypervisors rely on management databases to control multi-host configurations, each Citrix XenServer host stores its own copy of the resource pool configuration data. This removes a potential single point of failure from resource pool configurations. Citrix resource pools can also include up to 32 host nodes.

Microsoft Hyper-V relies on Windows Server 2008 Failover Clustering to create host clusters. Single-site Hyper-V host clusters require shared storage in the form of SANs or iSCSI targets. No other storage format is supported. Hyper-V single-site clusters can include up to 16 host nodes.

Hyper-V can also support multi-site clusters, which span more than one site to protect against disasters that might affect an entire site. Because the nodes sit in different locations, the Hyper-V multi-site cluster does not require shared storage and can rely on the much faster direct-attached storage (DAS) to operate. However, to provide VM high availability, those DAS repositories must be synchronized at all times with a third-party replication tool.

No matter which hypervisor you use, it's best to create host clusters when possible to provide two different levels of service continuity:

  • Host clusters support continuous VM operation. If a host fails or indicates that it is failing, all VMs running on that host will be transferred automatically to another node on the cluster.

  • Host clusters support VM operation during maintenance. If you need to work on one cluster node to install software updates, for example, you can move VMs off of the node during operation. Move them back to the node once the operation is complete. Repeat this process if other cluster nodes also require maintenance.

In either case, moving VMs may still interrupt service to some degree. When the cluster detects that a node is failing, the cluster service causes VMs to fail over to another node, using a migration process to move each VM from one node to another. Depending on which hypervisor you use, this may cause a service interruption. VMware, Citrix and Microsoft Hyper-V can perform live migrations -- moves that occur while the VM is running.

When a node completely fails, the cluster service moves the VM by restarting it on another node. In this case, VM downtime increases because all of the virtual machines on that node are turned off. When you need to perform maintenance on a node, use the migration process to move a VM from one host node to another.

Remember that you must have spare resources on each host server in a cluster to support moving VMs from one host to another. Ideally, each host server will possess enough spare resources to support the failure of at least one other node in the cluster.
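The spare-capacity requirement can be checked with simple arithmetic. The sketch below is a minimal example, assuming each host's capacity and current VM load are expressed in a single resource unit (such as GB of RAM). The function name and its inputs are hypothetical. It tests whether the pooled spare capacity of the surviving hosts could absorb the load of any one failed host. It does not check whether individual VMs actually fit on a particular host, so treat a passing result as a necessary condition, not a guarantee.

```python
def survives_single_host_failure(capacity: dict, load: dict) -> bool:
    """Return True if the remaining hosts have enough pooled spare capacity
    to absorb the VM load of any single failed host.

    capacity: host name -> total resource units available on that host
    load:     host name -> resource units currently used by its VMs
    """
    for failed in capacity:
        # Spare capacity on every host except the one assumed to have failed.
        spare = sum(capacity[h] - load[h] for h in capacity if h != failed)
        if load[failed] > spare:
            return False
    return True
```

With three hosts of 64 units each carrying 40 units of VMs, any one failure leaves 48 spare units for 40 units of displaced load, so the check passes. At 50 units per host, only 28 spare units remain and the check fails.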

Guest failover clustering
You can make any VM highly available by adding it as an application within a host cluster. But a VM is not like a traditional application. Although a VM on a host cluster keeps running as much as possible, this model won't suit every workload in your production network. Host server clustering does not affect the applications contained within the VM. Those apps are unaware of the host's HA feature -- unlike applications that are installed directly into a cluster through guest failover clustering, for example. Host server clustering does, however, ensure that the VM will run if a host fails. This HA model works for most applications, even though they aren't aware of transfers from one node to another.

Some state-sensitive applications, such as Microsoft Exchange, do not behave properly under this model and may lose data when a transfer occurs. Transactional applications, especially those that support very high-speed transactions, do not work well with this model because the applications are designed to behave in a particular way when failover occurs. The applications cannot behave as planned when a VM has been failed over.

Because of this, you should consider building highly available VMs -- creating clusters within the VM layer -- to produce application-aware clusters. These clusters ensure continuous availability and stability of the applications you move into the virtual layer of a resource pool. Failover clusters only work for stateful workloads -- those that record data from user sessions.

To ensure the HA of stateful applications within virtual workloads, most organizations opt to run single-site clusters. Single-site clusters are often easiest to create in the virtual infrastructure and don't require a replication engine, which often must be procured from third-party sources.

When you create single-site guest clusters, consider these key points:

Use anti-affinity rules. If you create a two-node virtual machine cluster to run on top of a host cluster, you must make sure that the two nodes of the VM cluster aren't located on the same node of the host cluster. If they both reside on the same host node and that node fails, your entire VM cluster will fail, which nullifies any benefit of having created it. To control VM locations on host nodes, use anti-affinity rules or place each node of the VM cluster on a different host cluster. In Windows, anti-affinity rules are set using the Cluster.exe command. Other hypervisors use different methods to set these values.

Rely on virtual LANs (VLANs). Rely on a hypervisor's guest VLAN capabilities to segregate intra-cluster traffic required for the guest cluster from other network traffic. Each virtual network adapter in a VM can use a different VLAN setting.

Rely on iSCSI storage. To target shared storage for guest clusters, use iSCSI storage. It lets you create shared storage infrastructures that rely on network interfaces to access the storage. VMs can easily consume iSCSI shared storage since they only need network adapters to access it.

Through these three approaches, you can configure single-site guest failover clusters and enable them to run in the virtual layer of the resource pool. Each VM in the cluster needs a network for private cluster traffic, a network for iSCSI storage and a network for public end-user traffic, so you must configure several virtual network adapters in the VM and on host servers.
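The anti-affinity point above lends itself to a quick audit. The following sketch assumes you can export two mappings from your management tooling: which host node each VM currently runs on, and which VMs belong to each guest cluster. Both mappings and the function name are hypothetical. The function reports any guest cluster with two or more nodes on the same host. It only detects violations and does not set or enforce anti-affinity rules in any hypervisor.

```python
from collections import defaultdict


def find_anti_affinity_violations(placement: dict, guest_clusters: dict) -> dict:
    """Report guest clusters that have more than one node on the same host.

    placement:      VM name -> host node the VM currently runs on
    guest_clusters: guest cluster name -> list of VM names in that cluster
    Returns:        guest cluster name -> {host node: [co-located VM names]}
    """
    violations = {}
    for cluster, vms in guest_clusters.items():
        by_host = defaultdict(list)
        for vm in vms:
            by_host[placement[vm]].append(vm)
        # A host carrying two or more nodes of one guest cluster is a violation.
        shared = {host: names for host, names in by_host.items() if len(names) > 1}
        if shared:
            violations[cluster] = shared
    return violations
```

An empty result means every guest cluster node sits on a different host, so a single host failure cannot take down an entire guest cluster.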

When a failure occurs on a host server on which one node of the guest cluster is running, the second node will detect the guest VM failure and automatically take over the application that was running within the failed VM. End users won't experience downtime during the transfer.

When an application running in a failover cluster is moved from one node to another, there may be a delay in response time for end users. However, this delay is usually brief -- its length depends on the application -- and often goes unnoticed.

Guest NLB clusters
NLB is an HA solution, but it is different from failover clustering. In a failover cluster, only one node in the cluster runs a given service. When that node fails, the service is passed on to another node, which then becomes the owner of the service. Because of the structure of the failover cluster model, only one node can access a given storage volume at a time. Therefore, the clustered application can only run on a single node at one time.
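The single-owner model described above can be pictured with a small Python sketch. The `FailoverCluster` class is purely illustrative and does not correspond to any clustering API. It tracks which node owns the service and passes ownership to the next healthy node when the owner fails.

```python
class FailoverCluster:
    """Toy model of a failover cluster: exactly one node owns the service."""

    def __init__(self, nodes):
        if not nodes:
            raise ValueError("a cluster needs at least one node")
        self.healthy = list(nodes)
        self.owner = self.healthy[0]  # only the owner runs the service

    def fail(self, node):
        """Mark a node as failed and return the current owner (None if none is left)."""
        if node in self.healthy:
            self.healthy.remove(node)
        if node == self.owner:
            # Ownership passes to the next healthy node, if there is one.
            self.owner = self.healthy[0] if self.healthy else None
        return self.owner
```

Failing a node that does not own the service leaves ownership unchanged, which mirrors the point that the clustered application runs on only one node at a time.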

In NLB or server load balancing clusters, each member of the cluster offers the same service. Users are directed to a single IP address when connecting to a particular service. The NLB service then redirects users to the first available node in the cluster. Because each member in the cluster can provide the same services, they are usually in read-only mode and considered stateless.
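To contrast this with the failover model, here is a minimal sketch of a load-balanced cluster in which every node offers the same service. The `LoadBalancer` class is hypothetical, and round-robin selection is an assumption chosen for simplicity. It is not a description of how any particular NLB product distributes connections. The sketch shows only the core idea: requests arrive at one entry point and go to a node that is currently available.

```python
class LoadBalancer:
    """Toy load balancer: every node offers the same stateless service."""

    def __init__(self, nodes):
        self.nodes = list(nodes)
        self.available = set(self.nodes)
        self._next = 0  # index of the next node to try (round-robin assumption)

    def mark_down(self, node):
        self.available.discard(node)

    def mark_up(self, node):
        if node in self.nodes:
            self.available.add(node)

    def route(self):
        """Return the next available node, skipping nodes marked down."""
        for _ in range(len(self.nodes)):
            node = self.nodes[self._next]
            self._next = (self._next + 1) % len(self.nodes)
            if node in self.available:
                return node
        raise RuntimeError("no available nodes in the cluster")
```

Because the service is stateless, losing one node simply shrinks the pool, and later requests go to the remaining nodes without any transfer of ownership.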

NLB clusters are fully supported in VMs because the hypervisor network layer provides a full set of networking services, one of which is NLB redirection. This means that you can create a multi-node cluster -- up to 32 NLB nodes -- to provide HA for the stateless services available in production VMs. However, each computer participating in an NLB cluster should include at least two network adapters -- one for management traffic and another for public traffic. This can be done in VMs by adding another virtual network adapter.

Danielle Ruest and Nelson Ruest are IT experts focused on continuous service availability and infrastructure optimization. They are authors of multiple books, including Virtualization: A Beginner's Guide and Windows Server 2008, The Complete Reference for McGraw-Hill Osborne. Contact them at [email protected]
