Far too many people have implemented virtualization without a real understanding of the implications. Even worse, administrators who downplayed high availability during their server cluster implementation now find that HA can be a source of problems as much as it is a solution to problems.
In fact, high availability protects against a limited set of problems. It's simply a service that -- no matter which hypervisor you use -- restarts VMs on surviving hosts after a host failure. That's it. Continuous availability is the ideal objective, but with HA alone, VMs will still experience some downtime.
High availability is commonly associated with live migration, XenMotion, or vMotion, but in fact, it has little to do with those "never-go-down technologies." As a result of this confusion, I see a number of problems occur in server clusters the first time a host fails.
High-availability technologies are getting smarter, but watch out for the following problems that could kill your server cluster.
How DNS affects high availability
With a VMware HA implementation, domain name system (DNS) resolution can be a serious gotcha. To allow server cluster nodes to intercommunicate, VMware places a heavy responsibility on DNS resolution. Normally, that wouldn't be a problem. But many of today's IT professionals have gotten used to the notion that DNS is a service they don't need to manage much.
Part of the reason for this hands-off approach is Windows' dynamic DNS functionality. Many admins simply haven't had to deal with DNS as much as in the old days, because dynamic DNS now performs most of the work automatically. But VMware servers don't use dynamic DNS.
If you use VMware HA in your server cluster, make sure that your management network IP addresses and associated host names are all entered into DNS. You'll need to do this manually and maintain it over time as you make changes and additions to your virtual environment. VMware has gotten better at presenting notifications when DNS isn't properly configured, but it's easy to miss those notifications until it's too late.
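Manual maintenance like this benefits from a periodic sanity check. The sketch below cross-checks forward and reverse DNS entries for cluster management hostnames; the zone data and hostnames are hypothetical, and in a real environment you would resolve live against your DNS servers rather than use hard-coded dictionaries.

```python
# Illustrative sketch: verify that every management hostname's forward (A)
# record agrees with its reverse (PTR) record. The zone data is hypothetical;
# live checks could use socket.gethostbyname / socket.gethostbyaddr instead.

forward_zone = {                      # hostname -> management IP (example data)
    "esx01.corp.local": "10.0.10.11",
    "esx02.corp.local": "10.0.10.12",
}
reverse_zone = {                      # IP -> hostname
    "10.0.10.11": "esx01.corp.local",
    # note: 10.0.10.12 is missing its PTR record
}

def find_dns_gaps(forward, reverse):
    """Return hostnames whose forward and reverse records don't match."""
    gaps = []
    for host, ip in forward.items():
        if reverse.get(ip) != host:
            gaps.append(host)
    return gaps

print(find_dns_gaps(forward_zone, reverse_zone))  # ['esx02.corp.local']
```

Running a check like this on a schedule catches the silent drift that creeps in as hosts are added or readdressed over time.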
DNS resolution in a multi-site server cluster
This problem with DNS also affects multi-site Hyper-V clusters. Hyper-V's Windows Failover Clustering service can now span subnets. In some respects, this architecture is great because you no longer need complex network magic to stretch a single subnet across geographically separate sites. But on the other hand, VMs that fail over to a secondary site often have to deal with new subnets.
That isn't a big problem on the server side, but it can create problems on the client side. Clients cache DNS records for the duration of each record's time-to-live (TTL) value. After a failover that changes a VM's IP address, those cached records become stale. In a physical disaster, that usually isn't much of a problem, because you're probably dealing with more important issues, such as, "The data center is now a crater!" But in virtual infrastructures, the problem arises when VMs accidentally get migrated to an alternate site.
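The exposure window is easy to reason about: a client can keep connecting to the old, pre-failover IP until its cached record's TTL expires. The small sketch below makes that arithmetic explicit; the TTL and elapsed-time values are hypothetical.

```python
# Illustrative sketch: after a cross-subnet failover, the worst-case time a
# client keeps using the stale (pre-failover) IP is whatever remains of the
# cached DNS record's TTL. Values below are hypothetical examples.

def seconds_until_clients_recover(ttl_seconds, seconds_since_failover):
    """Worst-case remaining time a client may hold the stale record."""
    return max(0, ttl_seconds - seconds_since_failover)

# A record cached with a 3600-second TTL, checked 600 seconds after failover:
print(seconds_until_clients_recover(3600, 600))   # 3000
print(seconds_until_clients_recover(3600, 4000))  # 0
```

This is why shortening TTLs on records for failover-capable VMs is a common mitigation: it trades a little extra DNS query load for a much smaller stale-record window.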
This high-availability problem isn't specific to Hyper-V clusters. Any server cluster that enables the failover of VMs to different subnets will experience a similar problem.
The importance of failover order
The DNS problem highlights the fact that failover order is an important facet of server cluster management. Some platforms organize failover order better than others. VMware HA, for example, lets the server cluster determine placement by itself. Others, such as Hyper-V, let admins manually determine where VMs should go after a failure.
What you don't want is VMs moving to server cluster nodes that are inappropriate, such as those on the other end of a multi-site cluster or those that are overloaded. Pay careful attention to your failover order and make sure that you always balance your cluster load appropriately.
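The placement rule described above can be sketched as a simple filter over a preferred-owner list: walk the list in priority order and skip any node that is down, overloaded, or at the remote site. All node names, the load threshold, and the cluster data below are hypothetical, not any vendor's API.

```python
# Hedged sketch of manual failover ordering: choose a surviving node from a
# preferred-owner list, skipping inappropriate targets (down, overloaded, or
# on the wrong end of a multi-site cluster). Data is hypothetical.

def pick_failover_target(preferred_order, nodes):
    """nodes: name -> dict with 'up', 'remote_site', 'cpu_load' (0.0-1.0)."""
    for name in preferred_order:
        node = nodes.get(name)
        if node and node["up"] and not node["remote_site"] and node["cpu_load"] < 0.8:
            return name
    return None  # no acceptable local target; better to alert than migrate blindly

cluster = {
    "hv01": {"up": False, "remote_site": False, "cpu_load": 0.2},  # failed host
    "hv02": {"up": True,  "remote_site": False, "cpu_load": 0.9},  # overloaded
    "hv03": {"up": True,  "remote_site": False, "cpu_load": 0.4},
    "hv04": {"up": True,  "remote_site": True,  "cpu_load": 0.1},  # wrong site
}
print(pick_failover_target(["hv01", "hv02", "hv03", "hv04"], cluster))  # hv03
```

Note the fallback: returning no target at all is preferable to silently landing a VM on an overloaded node or at the alternate site.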
What to do when host isolation occurs
Host isolation occurs when a server cluster host remains online, but it can no longer communicate with the other nodes for some reason. The problem with host isolation is that the isolated host still runs its VMs. In a VMware HA isolation incident, those VMs are often attached to vSwitches that are unaffected by the isolation, so they keep serving traffic. The cluster may want to fail over these VMs to take them out of isolation, but as long as the isolated host holds locks on the VMs' disk files, it can't.
There are several ways to fix this problem. Obviously, getting an isolated host back online is one of the best solutions. But if that can't be done, you may need to power down VMs so that the surviving cluster nodes can fail over the VMs. Pay attention to your high-availability solution's isolation response settings and decide which of them make sense for your particular needs. Most solutions let you choose to leave VMs powered on or power them off when host isolation occurs.
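The trade-off between those two isolation responses can be captured in a few lines. The policy names below mirror the common "leave powered on" and "power off" options described above, but this is an illustrative sketch, not any hypervisor's actual API.

```python
# Sketch of the isolation-response decision: when a host loses cluster
# heartbeats but its VMs may still be serving traffic, the configured policy
# determines what the isolated host does with them. Policy names are
# hypothetical labels, not vendor settings.

def isolation_response(policy):
    """Return the action an isolated host takes for its running VMs."""
    if policy == "leave_powered_on":
        # Safe when VM traffic rides vSwitches unaffected by the isolation,
        # but the host keeps its disk-file locks, so no failover can occur.
        return "keep running"
    if policy == "power_off":
        # Releases the disk-file locks so surviving nodes can restart the VMs.
        return "power off and allow failover"
    raise ValueError(f"unknown policy: {policy}")

print(isolation_response("leave_powered_on"))  # keep running
print(isolation_response("power_off"))         # power off and allow failover
```

Which branch is right depends on whether your VM networks typically survive a management-network isolation; if they do, powering VMs off trades real uptime for a failover that may not be needed.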
High availability is a useful component of a virtual infrastructure, but don't bypass its important settings to manage the more-exciting load-balancing capabilities in the server cluster. If you do, you might be in for trouble down the road.
About the expert
Greg Shields is an independent author, instructor, Microsoft MVP and IT consultant based in Denver. He is a co-founder of Concentrated Technology LLC and has nearly 15 years of experience in IT architecture and enterprise administration. Shields specializes in Microsoft administration, systems management and monitoring, and virtualization. He is the author of several books, including Windows Server 2008: What's New/What's Changed, available from Sapien Press.
This was first published in November 2010