How does incorporating virtualization into a failover strategy minimize downtime in the case of hardware failures?
I hear this question quite often. There are different ways to approach incorporating virtualization into a failover strategy. The most widely used method is VMware High Availability (HA).
Let's look at an example of how VMware HA failover works. The VMware HA agent, which is installed on each host, exchanges heartbeat signals with the other hosts in the cluster. If one of your hosts suffers a hardware failure and crashes, the agents on the remaining hosts detect the loss of its heartbeat and automatically restart the workloads that were running on the failed host on the surviving hosts. The affected workloads are unavailable only for the time it takes them to restart. You can then determine the root cause of the host failure, resolve it and return the host to service while your workloads remain available.
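To make the heartbeat idea concrete, here is a minimal Python sketch of the detect-and-restart logic. It is a toy model, not VMware's implementation: the Host class, the fail_over function, the 15-second timeout and the "fewest VMs" placement rule are all assumptions chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Host:
    name: str
    vms: list = field(default_factory=list)
    last_heartbeat: float = 0.0  # time of the most recent heartbeat, in seconds


def fail_over(hosts, now, timeout=15.0):
    """Restart VMs from hosts whose last heartbeat is older than `timeout`.

    Returns a dict mapping each restarted VM to the surviving host it landed on.
    The timeout value is an illustrative assumption, not a VMware default.
    """
    alive = [h for h in hosts if now - h.last_heartbeat <= timeout]
    failed = [h for h in hosts if now - h.last_heartbeat > timeout]
    placements = {}
    if not alive:
        return placements  # no survivors, nothing can be restarted
    for dead in failed:
        for vm in list(dead.vms):
            # Hypothetical placement rule: pick the survivor running the fewest VMs.
            target = min(alive, key=lambda h: len(h.vms))
            target.vms.append(vm)
            dead.vms.remove(vm)
            placements[vm] = target.name
    return placements


hosts = [
    Host("esx01", ["web01", "db01"], last_heartbeat=80.0),  # silent since t=80
    Host("esx02", ["app01"], last_heartbeat=100.0),
    Host("esx03", [], last_heartbeat=100.0),
]
print(fail_over(hosts, now=100.0))
```

In this run, esx01 has been silent for 20 seconds, so its two workloads are restarted on esx02 and esx03 while esx01 is left empty for you to troubleshoot.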
To use VMware HA, you must have at least two vSphere hosts, and they must be part of a cluster. To create an HA cluster, first create a data center within your vCenter Server, then create a cluster within the data center and finally add your hosts to the cluster.
When creating the cluster, select the option to enable HA and adjust the settings appropriately for your environment. Be aware that you also need to set the Admission Control Policy, which imposes failover constraints based on available resources. If you do not have adequate hardware resources available on the surviving host(s), your failover strategy will not work. For example, if you only have two vSphere hosts in your cluster, you would set the number of host failures the cluster can tolerate to one. This setting reserves enough spare capacity in the cluster to absorb the workloads from a failed host. As a result, it may prevent you from starting a new workload if doing so would leave too little capacity to absorb workloads in the event of a host failure. You may be tempted to disable this constraint, but it is required in order to provide a functional failover strategy.
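The arithmetic behind admission control can be sketched in a few lines of Python. This is a deliberately simplified model that looks only at memory and assumes the largest hosts are the ones that fail; it does not reproduce VMware's actual calculation, and the function name can_power_on and its parameters are hypothetical.

```python
def can_power_on(host_capacity_gb, reserved_gb, new_vm_gb, failures_to_tolerate=1):
    """Return True if the new VM fits while keeping failover headroom.

    Worst case: the largest hosts fail, so their capacity is excluded
    before checking whether all reserved memory still fits.
    """
    if failures_to_tolerate >= len(host_capacity_gb):
        return False  # no host would be left to absorb the workloads
    keep = len(host_capacity_gb) - failures_to_tolerate
    surviving = sorted(host_capacity_gb)[:keep]  # smallest hosts survive
    return reserved_gb + new_vm_gb <= sum(surviving)


# Two 128 GB hosts, tolerating one host failure: only 128 GB is usable in total.
print(can_power_on([128, 128], reserved_gb=96, new_vm_gb=16))  # 112 <= 128
print(can_power_on([128, 128], reserved_gb=96, new_vm_gb=48))  # 144 > 128
```

The second call is refused even though the two hosts together have 256 GB, which is exactly the behavior that tempts people to disable the constraint.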
When creating your cluster, be sure to enable VMware Distributed Resource Scheduler (DRS) as well. DRS is a feature that uses vMotion to distribute workloads among the hosts in the cluster, either automatically or by recommending migrations that you apply manually. HA will bring your workloads back online, but after you revive the failed host they will keep running on the surviving hosts, which leaves the cluster unbalanced. Without DRS, you would need to move workloads back to the recovered host yourself. With DRS set to automatic mode, the cluster rebalances on its own.
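The rebalancing step can be pictured with a small greedy sketch in Python. It is only an illustration of the idea of moving workloads from the busiest host to the idlest one. The rebalance function, the single "load" number per VM and the greedy rule are assumptions, and the real DRS algorithm is considerably more sophisticated.

```python
def rebalance(hosts):
    """Greedily move VMs from the busiest host to the idlest one.

    `hosts` maps host name -> {vm name: load}. Returns a list of
    (vm, source, destination) moves and mutates `hosts` in place.
    """
    def load(name):
        return sum(hosts[name].values())

    moves = []
    while True:
        src = max(hosts, key=load)
        dst = min(hosts, key=load)
        gap = load(src) - load(dst)
        # Only consider VMs whose move strictly narrows the gap.
        candidates = [vm for vm, vm_load in hosts[src].items() if 0 < vm_load < gap]
        if not candidates:
            return moves
        vm = max(candidates, key=lambda v: hosts[src][v])
        hosts[dst][vm] = hosts[src].pop(vm)
        moves.append((vm, src, dst))


# esx01 has just been returned to service and is empty.
cluster = {
    "esx01": {},
    "esx02": {"web01": 4, "db01": 8},
    "esx03": {"app01": 4, "app02": 2},
}
print(rebalance(cluster))
```

Here the only move is db01 from esx02 to the recovered esx01, after which no further migration would narrow the gap between the busiest and idlest hosts.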
This was first published in January 2014