Virtualization offers many advantages, but server consolidation also presents new challenges. Failover capabilities within clusters help recover VMs when a host fails, but there are many aspects to consider when designing these failover clusters to ensure you protect each VM.
One common principle of these failover clusters, and one of the basic advantages of virtualization, is improving resource utilization by ensuring hosts in the cluster run VMs at all times. Under this strategy, each host sets aside a percentage of memory and computing resources so that it can accept more VMs if another host in the cluster fails. During normal operations, this strategy maximizes resource use by distributing spare capacity and avoiding the need for idle servers. It also minimizes the number of VMs affected by a host failure, since each host runs fewer VMs.
When a host in this cluster fails, software distributes VMs from that host among the remaining hosts with spare capacity. Of course, there must be sufficient resources on the surviving hosts to power on every VM, particularly the largest VMs, which also tend to be the most important.
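The restart step can be illustrated with a minimal sketch. It assumes RAM is the only constraint and uses a simple largest-first greedy placement; real high-availability software applies its own placement logic, and the function name `restart_vms`, the host names and the VM sizes are all hypothetical.

```python
def restart_vms(failed_vms, spare_gb):
    """Greedy illustration: place the largest VMs first on the host with the most free RAM.

    failed_vms: dict of VM name -> RAM required (GB), the VMs from the failed host.
    spare_gb:   dict of surviving host name -> spare RAM (GB).
    Returns (placed, unplaced): a dict of VM -> host, and a list of VMs that fit nowhere.
    """
    if not spare_gb:
        raise ValueError("no surviving hosts")
    free = dict(spare_gb)  # copy so the caller's data is not modified
    placed, unplaced = {}, []
    # Largest VMs first, because they are the hardest to fit.
    for vm, ram in sorted(failed_vms.items(), key=lambda kv: kv[1], reverse=True):
        host = max(free, key=free.get)  # surviving host with the most free RAM
        if free[host] >= ram:
            free[host] -= ram
            placed[vm] = host
        else:
            unplaced.append(vm)  # not even the emptiest host can hold this VM
    return placed, unplaced


# Hypothetical example: three surviving hosts with 21 GB spare each.
spare = {"host-b": 21, "host-c": 21, "host-d": 21}
failed = {"db": 32, "web": 8, "auth": 4}
placed, unplaced = restart_vms(failed, spare)
print("restarted:", placed)
print("could not restart:", unplaced)
```

In this example the two small VMs restart, but the 32 GB VM cannot, even though the surviving hosts hold 63 GB of spare RAM between them. The spare capacity exists, but no single host has enough of it.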
Sizing spare capacity
The percentage of spare capacity set aside on each server is directly related to the number of hosts in a cluster. Generally, a small cluster will set aside one host's worth of resources. Larger clusters may set aside two or three. For example, a four-node cluster might set aside 25% of each host's resources to be available to restart VMs after a fault. A 12-node cluster that needs to survive two hosts failing will set aside two-twelfths of each host's resources, or about 17%.
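The arithmetic behind these percentages is the number of host failures to tolerate divided by the number of hosts. A minimal sketch, assuming identically sized hosts (the function name `reserve_fraction` is illustrative, not part of any product):

```python
def reserve_fraction(hosts, host_failures_tolerated=1):
    """Fraction of each host to keep free so the cluster can absorb the given failures.

    Assumes all hosts are the same size and the reserve is spread evenly.
    """
    if hosts < 2:
        raise ValueError("a failover cluster needs at least two hosts")
    if not 1 <= host_failures_tolerated < hosts:
        raise ValueError("failures tolerated must be between 1 and hosts - 1")
    return host_failures_tolerated / hosts


# The two examples from the text: four nodes/one failure, 12 nodes/two failures.
for hosts, failures in [(4, 1), (12, 2)]:
    pct = reserve_fraction(hosts, failures) * 100
    print(f"{hosts}-node cluster, {failures} failure(s): reserve {pct:.0f}% per host")
```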
However, large VMs pose a problem for this strategy. Since this model distributes spare capacity across several hosts, a single host may not have enough spare capacity to start a large VM. In the 12-node example above, assuming identically sized hosts, a host may not be able to restart a VM that consumed more than 17% of its original host's resources.
Ten years ago, when virtualization was first growing in popularity, the average VM required only a couple of gigabytes of RAM and a single CPU, while the average physical server had four cores and tens of gigabytes of RAM. The amount of resources required to power on each VM was small. It was easy to find spare capacity on remaining hosts to restart these VMs when a host failed. Clusters were often small and lightly loaded, so there was plenty of spare capacity. Many organizations still have many small VMs acting as authentication or file and print servers, but increasingly they also rely on large VMs.
Large VMs can cause big problems
Now that virtualization is well established and widely deployed, we see larger clusters. Clusters are much more likely to be heavily loaded now that we trust the hypervisor to manage resources. Many companies are also now virtualizing large and important workloads, such as databases, ERP servers and Exchange servers. These big workloads require large VMs that need a lot of CPU and RAM.
If you have a fully loaded cluster of six physical hosts, each with 128 GB of RAM, they will have one-sixth of their RAM available as spare capacity for failover (about 21 GB on each host). Clearly, a single VM that requires 32 GB of RAM will not be able to restart in the cluster. One way to protect the 32 GB VM in this example would be to increase the RAM in each host to 192 GB, so that one-sixth of each host comes to 32 GB.
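The same check can be written as a short calculation. This is a sketch under simplifying assumptions: identically sized hosts, a fully loaded cluster, RAM as the only constraint, and no allowance for hypervisor overhead. Both function names are hypothetical.

```python
def spare_ram_per_host_gb(host_ram_gb, hosts, host_failures_tolerated=1):
    """RAM reserved on each host when the failover reserve is spread evenly."""
    return host_ram_gb * host_failures_tolerated / hosts


def min_host_ram_gb(largest_vm_ram_gb, hosts, host_failures_tolerated=1):
    """Smallest per-host RAM at which one host's reserve can hold the largest VM."""
    return largest_vm_ram_gb * hosts / host_failures_tolerated


largest_vm_gb = 32
for host_ram_gb in (128, 192):
    spare = spare_ram_per_host_gb(host_ram_gb, hosts=6)
    verdict = "fits" if spare >= largest_vm_gb else "does not fit"
    print(f"{host_ram_gb} GB hosts: {spare:.1f} GB spare each, {largest_vm_gb} GB VM {verdict}")

print(f"Minimum RAM per host: {min_host_ram_gb(largest_vm_gb, hosts=6):.0f} GB")
```

Working backward from the largest VM, as `min_host_ram_gb` does, gives the host size directly instead of requiring trial and error.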
Another way to avoid the issue is to use a standby host instead of a distributed capacity model. A standby host that is empty during normal operation can receive every VM when another host fails. This goes against the principle of maximizing resource utilization since the standby host is idle and unused most of the time. However, it does ensure that you can restart even the largest VM running in the cluster.
A simple principle for sizing failover clusters is to consider your largest VMs. For example, a few large VMs are best suited to a small cluster of large physical hosts. Just be sure that your largest VM does not exceed the spare capacity on any host. Small VMs work well on any type of cluster. If you only have small VMs, you can maximize resources by using larger clusters and leaving less spare capacity on each host. In general, design your cluster to support the largest VM and the smaller VMs will likely fall into place.