Bandwidth needs may put a stretched cluster out of reach

A stretched cluster can provide true failover and VM mobility, but its networking and storage requirements make it an expensive approach.

When it comes to server virtualization, failover clustering has long been used to prevent a physical server failure from resulting in an outage. But what happens if an entire data center fails? Both Microsoft and VMware offer the ability to fail over VMs to a secondary data center by using a stretched cluster. However, there are a number of factors you must consider before you can build one.

WAN bandwidth requirements

Regardless of whether you are using Hyper-V or VMware, there are network bandwidth requirements for stretched clusters. After all, your remote data center is essentially acting as an extension of a local cluster, so you must have reliable and reasonably efficient communications between the two sites.

VMware recommends that your data centers be within about 100 kilometers of one another because of the latency requirements. The WAN link must have a round-trip latency of no more than 5 ms (10 ms or less for vMotion if the organization has vSphere Enterprise Plus licenses, but 1 ms or less for cross-site fault tolerance). Additionally, VMware requires redundant network links operating at 622 Mbps or greater.
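To make these ceilings concrete, here is a minimal sketch that checks a measured round-trip time against the thresholds above. The threshold values come from this article; the function name and parameters are illustrative, not part of any VMware tool.

```python
# Illustrative latency check against the vMSC-style ceilings cited above.
# The constants mirror the article's figures; everything else is hypothetical.

VMOTION_MAX_RTT_MS = 5.0          # standard vMotion ceiling
METRO_VMOTION_MAX_RTT_MS = 10.0   # with vSphere Enterprise Plus licensing
FAULT_TOLERANCE_MAX_RTT_MS = 1.0  # cross-site fault tolerance

def link_supports(rtt_ms: float, enterprise_plus: bool = False,
                  fault_tolerance: bool = False) -> bool:
    """Return True if the measured RTT meets the strictest
    requirement implied by the features in use."""
    limit = METRO_VMOTION_MAX_RTT_MS if enterprise_plus else VMOTION_MAX_RTT_MS
    if fault_tolerance:
        # Fault tolerance imposes the tightest bound of all.
        limit = min(limit, FAULT_TOLERANCE_MAX_RTT_MS)
    return rtt_ms <= limit

print(link_supports(7.0))                        # False: exceeds 5 ms
print(link_supports(7.0, enterprise_plus=True))  # True: within 10 ms
```

The point of the sketch is that the binding constraint depends on which features you enable: a link that is fine for Metro vMotion can still be far too slow for cross-site fault tolerance.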

Hyper-V makes use of Windows Server Failover Clustering. Some of the Microsoft documentation is contradictory when it comes to best practices for Windows failover clusters, so I recommend taking a cue from Exchange Server. Exchange Server database availability groups (which also make use of failover clustering) require a round-trip latency of no more than 500 ms. However, Microsoft cautions that "round trip latency requirements may not be the most stringent network bandwidth and latency requirement for a multi-data center configuration. You must evaluate the total network load, which includes client access, Active Directory, transport, continuous replication and other application traffic, to determine the necessary network requirements for your environment."

This same general advice also applies to Hyper-V. Latency is an important consideration, but the most important consideration is to make sure that there is sufficient bandwidth to allow your VMs to perform efficiently and to allow them to live-migrate from one data center to another as needed.
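A quick back-of-the-envelope calculation shows why bandwidth, not just latency, matters for live migration. The sketch below estimates the minimum transfer time for a VM's memory over a given link; it ignores memory dirtying during the migration, compression and protocol overhead, so treat the result as a lower bound. The 622 Mbps figure is the minimum link speed cited earlier; the function itself is illustrative.

```python
# Lower-bound estimate of live-migration transfer time. Assumes the entire
# VM memory crosses the link once, with no re-transmission of dirtied pages.

def migration_seconds(vm_memory_gb: float, link_mbps: float) -> float:
    bits = vm_memory_gb * 8 * 1024**3    # VM memory expressed in bits
    return bits / (link_mbps * 1_000_000)  # link speed in bits per second

# A 16 GB VM over a 622 Mbps link needs at least ~3.7 minutes of transfer.
print(round(migration_seconds(16, 622) / 60, 1))
```

Even under these generous assumptions, a single mid-sized VM ties up the minimum-spec link for minutes, which is why busy clusters need considerably more bandwidth than the floor figure suggests.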

Storage needs

Within a single data center, VMware and Microsoft have long recommended using a storage architecture in which the VMs reside on storage that is accessible from every node in the cluster. However, this shared storage model is not appropriate for long-distance clusters.

The reason for this is simple. Imagine for a moment that an administrator built a Hyper-V cluster within a local data center and that the cluster made use of a Cluster Shared Volume (CSV), which is Microsoft's approach to Hyper-V shared storage. Now imagine that the cluster was eventually extended with nodes in a secondary data center, and consider what would happen if the primary data center were destroyed. The CSV would also be destroyed because it resided in the primary data center, and the VMs would not be able to fail over to the secondary site.

There are a number of different ways to get around this problem. For example, some organizations take advantage of the fact that Microsoft removed the requirement for clustered VMs to share a CSV in Windows Server 2012 Hyper-V. Other organizations use storage replication as an alternative.

Even so, storage and bandwidth are not the only obstacles. Before the release of Windows Server 2012 R2, node placement was also a major consideration. Clusters had to be designed so that the majority of the cluster nodes would remain online in the event of a data center failure (or a WAN link failure). Typically, this meant placing an equal number of cluster nodes in each data center, then placing a witness server at a third location.
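The node-placement constraint above comes down to simple vote arithmetic. Here is a rough sketch of the pre-2012 R2 static quorum model: each node (plus an optional witness) gets one vote, and a partition keeps quorum only while a strict majority of votes remains reachable. The function and scenarios are illustrative.

```python
# Static quorum arithmetic: a partition survives only if it holds a strict
# majority of the total votes. Node and witness placement is hypothetical.

def has_quorum(votes_online: int, total_votes: int) -> bool:
    return votes_online > total_votes // 2

# Two nodes per site plus a witness at a third site: 5 votes total.
# Losing an entire data center (2 votes) leaves 3 of 5 -> quorum survives.
print(has_quorum(3, 5))   # True

# The same layout without the witness: 4 votes total.
# Losing a site leaves 2 of 4 -> a tie, so neither side has quorum.
print(has_quorum(2, 4))   # False
```

This is why the classic design placed equal node counts in each data center and put the tie-breaking witness at a third location: without the extra vote, a site failure splits the cluster evenly and both halves go offline.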

Windows Server 2012 R2 made it a lot easier to build a stretched cluster by making the quorum voting mechanism much more dynamic. Even so, some organizations have chosen to simply use two clusters rather than deal with the complexities of stretching a single cluster. With that approach, VMs are replicated to standby hosts in a secondary data center. Of course, the disadvantages include paying for hardware that isn't actively running VMs and the lack of instant failover. Failover can still be performed, but it must be done manually.

This approach isn't unique to Hyper-V. VMware also has customers that use a secondary site as a rapid recovery mechanism rather than stretching a cluster to it. Doing so involves replicating data to the recovery site and using vCenter Site Recovery Manager.

IP address assignments

Regardless of the technique used to move a VM to a remote data center, the VM's IP address must be taken into account. Without a mechanism for network virtualization or for IP address reassignment, the VM could become inaccessible at its new location.

Before a multisite failover cluster is deployed, there are a number of requirements that must be considered. Failover clusters can provide VM mobility, but these capabilities come at a cost and may not be an option unless sufficient bandwidth is available. For organizations with modest budgets or inadequate bandwidth, it might be better to treat the secondary data center as a recovery site rather than attempting to build a stretched cluster.
