Before you build a multi-site cluster for Hyper-V failover purposes, check out the first two articles of this series on the hardware and storage requirements for highly available Hyper-V environments. After connecting servers to the proper networking and shared storage equipment, building a multi-site cluster is a relatively simple process.
Hyper-V high availability
In this three-part series on Hyper-V high availability, I'll explain how to successfully deploy a highly available environment. Its architecture will support automated virtual machine failover and the storage and processing of VMs, and it can even expand into a multi-site, disaster-resistant, fully automated infrastructure, all without a large budget.
First, use the Validate a Configuration wizard in Failover Cluster Manager to run a series of tests. If everything passes, you can create a Hyper-V failover cluster.
Stretching a cluster across multiple sites isn't an exceedingly complex process. Today's Windows Failover Cluster service in Windows Server 2008 R2 already includes most of the necessary components for creating a multi-site cluster, also known as a "stretch cluster" or "GeoCluster." The only missing component is a mechanism to replicate shared storage between sites.
Single-site clusters limit Hyper-V failover
To truly understand the utility of a multi-site cluster, consider the types of failures against which a single-site cluster protects. In this arrangement, the Hyper-V hosts connect to a piece of shared storage. The "shared" part of this storage is important, because any connections between servers and storage are limited to their maximum cable lengths. Both Fibre Channel and iSCSI storage have a maximum effective distance that limits how far you can physically spread your servers.
While this architecture is excellent for protecting virtual machines (VMs) against the loss of a single host, it does little when an entire site goes down. Such an outage can result from a catastrophic event, such as a natural disaster. More commonly, it results from a site-wide problem, such as a network or power outage.
During a site-wide failure, a single-site cluster cannot protect against the loss of VM functionality, because the hosted VM and its processing, storage and networking reside at the same location. Therefore, all these components will experience a failure during a site-wide problem.
Creating a multi-site cluster for Hyper-V failover
A multi-site cluster protects cluster functionality by extending it to one or more additional physical locations. Windows Failover Clustering manages the hosts at every site, while a separate replication mechanism copies the shared storage contents to the secondary site.
Data replication for cluster storage is typically accomplished through synchronous or asynchronous replication. With synchronous replication, each piece of data that is replicated between the two interconnected storage area networks (SANs) must be confirmed at the secondary site before the next piece of data can be processed. This acknowledgement ensures that the data is transferred between storage devices, thus guaranteeing that the two SANs are always synchronized.
Synchronous replication is excellent for data preservation, but it comes at the cost of performance. Because each piece of transferred data must be acknowledged before the next data is processed, the transfer speed can quickly bottleneck overall performance.
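As a rough back-of-the-envelope sketch of that bottleneck, assume writes are strictly serialized and each one waits for a full round trip to the secondary site. The function name and the latency figures below are illustrative assumptions, not measurements, and real SANs may pipeline writes in ways this simple model ignores.

```python
def max_serialized_writes_per_sec(local_write_ms: float, round_trip_ms: float) -> float:
    """Upper bound on writes/sec when each write waits for a remote acknowledgement."""
    per_write_ms = local_write_ms + round_trip_ms
    return 1000.0 / per_write_ms


# Hypothetical figures: a 1 ms local write, without and with a 9 ms inter-site round trip.
local_only = max_serialized_writes_per_sec(1.0, 0.0)
with_sync = max_serialized_writes_per_sec(1.0, 9.0)
print(f"local only: {local_only:.0f} writes/sec, synchronous: {with_sync:.0f} writes/sec")
```

Under these assumed numbers, the inter-site round trip, not the disk, dominates the time each write takes.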
Asynchronous replication circumvents this performance problem by allowing the sending SAN to queue up data that requires replication. Data is then sent in a batch at configurable intervals, with the entire batch acknowledged at once. So asynchronous replication does not pose the performance bottlenecks involved with synchronous replication, but when a site failure occurs, you risk losing some data.
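The trade-off between the two modes can be sketched with a toy model. This is not how any particular SAN implements replication. The class, its method names and the count-based batch trigger (used here in place of a timed interval, to keep the example deterministic) are all assumptions made for illustration.

```python
class ReplicatedVolume:
    """Toy model of a primary SAN replicating writes to a secondary SAN."""

    def __init__(self, synchronous: bool, batch_size: int = 3):
        self.synchronous = synchronous
        self.batch_size = batch_size  # hypothetical: ship a batch after this many queued writes
        self.primary = []             # writes committed at the primary site
        self.secondary = []           # writes acknowledged by the secondary site
        self.queue = []               # writes awaiting replication (asynchronous only)

    def write(self, block: str) -> None:
        self.primary.append(block)
        if self.synchronous:
            # Synchronous: the write completes only once the secondary confirms it.
            self.secondary.append(block)
        else:
            # Asynchronous: queue the write and ship it later as part of a batch.
            self.queue.append(block)
            if len(self.queue) >= self.batch_size:
                self.flush()

    def flush(self) -> None:
        """Send the whole queue to the secondary, acknowledged as one batch."""
        self.secondary.extend(self.queue)
        self.queue.clear()

    def lost_on_primary_failure(self) -> list:
        """Writes that exist only at the primary if the primary site fails now."""
        return self.primary[len(self.secondary):]


def simulate(synchronous: bool) -> list:
    volume = ReplicatedVolume(synchronous)
    for i in range(5):
        volume.write(f"block-{i}")
    return volume.lost_on_primary_failure()


sync_lost = simulate(synchronous=True)
async_lost = simulate(synchronous=False)
print("lost with synchronous replication:", sync_lost)
print("lost with asynchronous replication:", async_lost)
```

In this model, the writes still sitting in the queue when the primary site fails are the data you risk losing with asynchronous replication.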
Solutions for asynchronous replication are implemented as features within your SAN storage or as software add-ons to VMs or a Hyper-V host. Each approach comes with benefits and drawbacks. Ultimately, you must weigh the possibility of data loss against reduced system performance to decide which option is best.
Architecting and implementing replicated storage between your two sites is arguably the most difficult part of creating a multi-site cluster. Once the storage is correctly configured, you'll find the remaining tasks are trivial in comparison, including provisioning additional Hyper-V hosts at the secondary site, adding them into the existing cluster, and configuring failover and other cluster settings to ensure that VMs migrate only during a full-site failure.
Two other considerations that require special attention with multi-site clusters are the reconfiguration of the cluster quorum and the reconvergence of name resolution at the secondary site.
Quorums and Hyper-V failover
A cluster by nature is always prepared for failure. At its core, a Hyper-V failover cluster always watches for components to go down and, when a failure occurs, knows which action to take.
One way the cluster facilitates this task is through a quorum. In essence, a quorum is a collection of cluster elements that determine whether there are enough resources available for the cluster to function.
Quorums use a "voting system" to decide whether a cluster should remain online. There are several ways to configure the voting process:
- counting the number of votes cast by individual hosts;
- counting votes from hosts plus shared storage; or
- counting votes from hosts plus a file share witness in a third and separate site.
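The voting options above can be illustrated with a simplified majority check. The sketch assumes one vote per host and one per witness, and the host counts are hypothetical. It models only the arithmetic of the vote, not the cluster service itself.

```python
def has_quorum(votes_online: int, total_votes: int) -> bool:
    """The cluster stays online while more than half of all votes are reachable."""
    return votes_online * 2 > total_votes


# Hypothetical layout: two hosts at each of two sites, then the primary site fails.
hosts_per_site = 2

# Hosts only: 4 votes in total, 2 survive at the secondary site.
hosts_only = has_quorum(votes_online=hosts_per_site, total_votes=2 * hosts_per_site)

# Hosts plus a file share witness in a third site: 5 votes in total, 3 survive.
with_witness = has_quorum(votes_online=hosts_per_site + 1,
                          total_votes=2 * hosts_per_site + 1)

print("hosts only keeps quorum:", hosts_only)
print("hosts plus witness keeps quorum:", with_witness)
```

With an even split of hosts between two sites, the surviving site cannot form a majority on its own, which is one reason a witness in a third site is worth considering.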
When creating a multi-site cluster, carefully consider your quorum options. I detail best practices for configuring quorums in multi-site clusters in Chapter Four of my free e-book The Shortcut Guide to Architecting iSCSI Storage for Microsoft Hyper-V.
Domain Name System resolution after Hyper-V failover
The final consideration for a multi-site cluster is the need for name resolution after a VM fails over from one site to another. Today's Windows Failover Cluster service has the ability to span subnets (and IP address ranges). This capability simplifies a cluster installation, because the network subnets no longer have to span between sites. But when failed-over VMs relocate to a new IP address range, the move complicates name resolution.
In short, when your VMs fail over from a primary site to a secondary site, they fail over to the secondary site's IP address scheme. As a result, the IP address configuration for these VMs must be reconfigured at the time of failover. Also, clients must flush their local Domain Name System (DNS) cache to receive the server's new address information.
Setting up virtual servers to use the Dynamic Host Configuration Protocol (DHCP) for address configuration simplifies their configuration update. For clients, this problem can be resolved by a reboot, by clearing their cache with the ipconfig /flushdns command, or by minimizing the time-to-live (TTL) setting for server DNS entries.
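To see why a cached entry matters, consider a minimal sketch of a client-side DNS cache. The class, the host name, the addresses and the 300-second TTL are all hypothetical. The model only shows the effect of the TTL and of a flush, not how the Windows DNS client is implemented.

```python
class ClientDnsCache:
    """Toy client-side cache that honors each record's time-to-live."""

    def __init__(self):
        self._entries = {}  # name -> (address, expires_at)

    def resolve(self, name: str, now: int, records: dict) -> str:
        cached = self._entries.get(name)
        if cached and now < cached[1]:
            return cached[0]             # still within TTL: answer from cache
        address, ttl = records[name]     # otherwise ask the (simulated) DNS server
        self._entries[name] = (address, now + ttl)
        return address

    def flush(self) -> None:
        """Discard all cached entries, as a client-side cache flush would."""
        self._entries.clear()


# Hypothetical record: name -> (address, TTL in seconds).
records = {"vm01.example.com": ("10.1.0.20", 300)}
client = ClientDnsCache()

before = client.resolve("vm01.example.com", now=0, records=records)

# Failover at t=10: the VM now lives in the secondary site's address range.
records["vm01.example.com"] = ("10.2.0.20", 300)

stale = client.resolve("vm01.example.com", now=60, records=records)
client.flush()
fresh = client.resolve("vm01.example.com", now=61, records=records)
print(before, stale, fresh)
```

Until the entry expires or the cache is flushed, the client keeps using the primary site's address. A shorter TTL narrows that window at the cost of more frequent lookups.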
While multi-site Hyper-V failover clusters have special requirements for storage and data replication, the extension of your Windows Failover Cluster should not be difficult to set up. With the right technology and good planning, you can extend your Hyper-V high availability to protect against a full-site disaster.
Additional resources on Hyper-V high availability
The following resources provide more information about Hyper-V high availability:
- Fixing virtual machine cluster problems in Hyper-V
- Hyper-V clustering and VM configuration problems
- Cluster performance problems in Hyper-V and how to fix them
- Killing Hyper-V high-availability cluster services and network issues
Greg Shields is an independent author, instructor, Microsoft MVP and IT consultant based in Denver. He is a co-founder of Concentrated Technology LLC and has nearly 15 years of experience in IT architecture and enterprise administration. Shields specializes in Microsoft administration, systems management and monitoring, and virtualization. He is the author of several books, including Windows Server 2008: What's New/What's Changed, available from Sapien Press.