Virtualization has vastly improved the utilization of server hardware and brought greater flexibility to workload provisioning and migration, but virtualization has also introduced potential vulnerabilities. Today's server may host 10, 15 or even more virtual machines, so any fault or disruption to the server hardware can affect many workloads rather than just one. This has raised the importance of resilience and reliability for server hardware as well as hypervisors and VMs. Let's examine some of the most common issues around VM high availability.
VM clustering options
The idea behind clustering is to duplicate workloads on two or more servers and keep each duplicate in perfect synchronization. If the primary cluster server (node) fails, it can be isolated from the cluster and a secondary instance can take over without disruption. When the original server is restored, it can safely rejoin the cluster and resynchronize with the other instances. This is an example of cluster failover, where only one instance of the workload handles all of the traffic and the other instances are kept on standby.
The alternative approach to clustering is load sharing, where all the servers within the cluster are active, and each duplicate workload shares part of the traffic load all the time. If a node fails, it is also isolated, and remaining instances rebalance traffic in order to handle the load. Both of these traditional clustering concepts translate directly to virtual environments, so existing clustering software like Windows Server Failover Clustering will generally support the addition of a hypervisor and VMs. Still, it's always important to verify support for virtualization and upgrade to virtualization-aware clustering software if necessary.
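The two clustering models above can be summarized in a minimal sketch. This is purely illustrative pseudologic, not tied to any clustering product; the node names and load figures are made up for the example.

```python
def failover_target(nodes, primary):
    """Active/standby failover: the primary handles all traffic
    unless it is down, in which case a standby is promoted."""
    if nodes[primary]:                      # primary healthy
        return primary
    for name, healthy in nodes.items():     # promote first healthy standby
        if healthy:
            return name
    raise RuntimeError("no healthy node available")

def rebalance(nodes, total_load):
    """Load sharing: all healthy nodes split the traffic evenly.
    When a node fails, the survivors absorb its share."""
    healthy = [name for name, ok in nodes.items() if ok]
    share = total_load / len(healthy)
    return {name: share for name in healthy}

nodes = {"node-a": True, "node-b": True, "node-c": True}
assert failover_target(nodes, "node-a") == "node-a"

nodes["node-a"] = False                     # simulate a node failure
assert failover_target(nodes, "node-a") == "node-b"
assert rebalance(nodes, 900) == {"node-b": 450.0, "node-c": 450.0}
```

The key operational difference shows up on failure: failover promotes one standby, while load sharing simply recomputes each survivor's share of the traffic.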
There are also alternatives to traditional server clustering software, such as everRun MX from Stratus Technologies and Double-Take Availability 7.0 from Vision Solutions. These tools work independently to provide duplication and synchronization capabilities for selected VMs under a major hypervisor like Microsoft Hyper-V or VMware vSphere.
There are two important issues when considering resilient VMs. First, use clustering judiciously. Remember that any kind of resilient VM architecture represents an added expense that is really only justified for top-tier mission-critical workloads. Other less-essential workloads can typically be protected with non-cluster technologies like snapshots.
Second, understand that resilience is primarily intended to guard against hardware failures, so it is crucial to prevent multiple instances of a VM from residing on the same physical server. As a consequence, resilient workloads may cause problems with automatic migration or workload balancing tools. It may be necessary to set migration rules within the resilience software that expressly dictate where duplicated workloads may (or may not) be moved.
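The anti-affinity rule described above can be expressed as a simple placement check. Hypervisor tooling such as VMware DRS anti-affinity rules or Hyper-V availability sets enforces this natively; the function below is only a hedged sketch of the rule itself, with hypothetical host and workload names.

```python
from collections import defaultdict

def anti_affinity_violations(placement):
    """placement maps host -> list of (workload, replica_id) pairs.
    Returns (host, workload) pairs where two or more replicas of the
    same workload share one physical server -- a single point of failure."""
    violations = []
    for host, vms in placement.items():
        counts = defaultdict(int)
        for workload, _replica in vms:
            counts[workload] += 1
        for workload, n in counts.items():
            if n > 1:
                violations.append((host, workload))
    return violations

placement = {
    "host-1": [("sales-db", 1), ("web", 1)],
    "host-2": [("sales-db", 2)],            # replicas on separate hosts: OK
}
assert anti_affinity_violations(placement) == []

placement["host-1"].append(("sales-db", 2))  # a migration colocates replicas
assert anti_affinity_violations(placement) == [("host-1", "sales-db")]
```

A check like this is what automatic migration tools must honor: moving a replica is fine only as long as the move does not place it on a host already running its sibling.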
Using snapshots for VM resilience
Clustering and snapshots are two different concepts that are sometimes conflated, simply because of the proliferation of tools and technologies currently available to protect workloads. Resilience techniques (such as clustering and VM duplication) are designed to maintain workload availability without disruption. If one instance of the VM fails, a duplicate instance in lockstep on another server takes over immediately. Clustering behavior is seamless; users never even know that a cluster node has failed. Applications that cannot tolerate any loss of availability (like sales or other transactional software) are best served with clustering.
By comparison, snapshots capture and save workload states at certain points in time. This means there is a definite recovery point objective (RPO). It also takes time to restore the VM from a snapshot, so there is a recovery time objective (RTO). Even though snapshots can be taken frequently and restored quickly -- yielding short RPOs and RTOs -- there may still be a notable lapse in the workload's availability while a snapshot is recovered to an available server.
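The RPO/RTO trade-off above reduces to simple arithmetic, sketched below. The 15-minute interval and 5-minute restore time are made-up figures for illustration only.

```python
def worst_case_rpo(snapshot_interval_min):
    """RPO: data loss is bounded by the time elapsed since the last
    snapshot, which at worst is one full snapshot interval."""
    return snapshot_interval_min

def worst_case_downtime(restore_min):
    """RTO: the workload is unavailable while the snapshot is
    restored to an available server."""
    return restore_min

# Snapshots every 15 minutes, with a 5-minute restore:
assert worst_case_rpo(15) == 15        # up to 15 minutes of lost work
assert worst_case_downtime(5) == 5     # up to 5 minutes of outage

# A tighter RPO target means a shorter snapshot interval:
assert worst_case_rpo(5) <= 10         # 5-minute snapshots meet a 10-min RPO
```

Compare this with clustering, where a lockstep duplicate gives an RPO and RTO of effectively zero at a correspondingly higher cost.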
For non-critical virtual workloads that can tolerate short periods of downtime, snapshots are usually an adequate solution for data protection. Snapshots are typically centralized in the local storage area network (SAN). Although snapshots can remain on the local SAN, they are often duplicated to an off-site SAN if an additional layer of data protection is needed (e.g., protecting against a total loss of the entire local facility due to some natural disaster, act of war, etc.).
Ultimately, the decision between clustering and snapshots must depend on the individual workload and its importance to the business.
Hardware choices to improve VM high availability
One important strategy to improve VM high availability is to focus on the resilience of the underlying hardware. If the server doesn't fail, its workloads shouldn't fail. Modern servers now sport a wealth of resilience features that organizations can consider for their next hardware refresh.
For example, processors like Intel's latest Xeon chips now include internal data bus checking to enhance the integrity of data within the processor itself. Some single-bit errors within the processor can now be detected and corrected. New processor machine check features can report and record soft errors for later evaluation by management software tools, so administrators can make informed decisions about pre-emptive system repairs and replacements before hard faults actually occur.
At the broader system level, look for redundant server elements built from quality components. For example, redundant power supplies have long been standard equipment on enterprise-class servers. Processors with 10, 12 or more cores now provide ample headroom for workload growth and auxiliary computing resources (if one core fails, the workload can be restarted on another available core within the same server).
Memory features like hot spares allow the contents of a failed memory module to be rebuilt on a spare memory module. By comparison, memory mirroring duplicates memory contents within complementary memory modules -- if one module fails, the mirror takes over without disruption (much like RAID 1 for memory).
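The memory mirroring behavior described above can be illustrated with a small sketch. Real mirroring is handled transparently by the memory controller in hardware; the class below only models the principle, with a made-up address and value.

```python
class MirroredStore:
    """Toy model of memory mirroring: every write lands in both the
    primary and mirror modules, so reads survive a module failure."""

    def __init__(self):
        self.primary = {}
        self.mirror = {}
        self.primary_ok = True

    def write(self, addr, value):
        self.primary[addr] = value   # duplicated on every write
        self.mirror[addr] = value

    def read(self, addr):
        if self.primary_ok:
            return self.primary[addr]
        return self.mirror[addr]     # failover to the mirror copy

store = MirroredStore()
store.write(0x10, 42)
store.primary_ok = False             # simulate a failed memory module
assert store.read(0x10) == 42        # contents survive without disruption
```

This is the RAID 1 analogy from the text: capacity is halved because every value is stored twice, but a single module failure causes no data loss and no interruption.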
Other server components should include redundant elements. For example, a network interface card might include three or four individual Ethernet ports. This not only provides additional bandwidth to support multiple bandwidth-intensive workloads, it also allows some traffic to fail over to other ports if the need arises.
Workload availability is an important attribute of enterprise-class workloads, but organizations should apply the technologies that are most appropriate for each particular workload, rather than spending capital on high-end protection for less important workloads or settling for mediocre protection that exposes critical workloads to disruption. Regardless of which VM high availability strategies you adopt, remember to test the technologies to verify that the implementations actually work as expected. Don't wait for an actual disaster to find unplanned or unacceptable flaws in system availability.