VMware's Distributed Resource Scheduler (DRS) and High Availability (HA) are two cornerstones of the company's virtualization management family. DRS uses vMotion to balance VMs across multiple hosts for both performance and reliability, and HA restarts VMs in the event of a host failure. Both features have existed for several versions of ESX/ESXi and have become tools administrators rely on. That does not mean everyone takes full advantage of them, however. The rules for each feature overlap quite a bit, so neither is exclusive of the other; they need to work in unison to truly give you the environment you want. Let's take a look at a few things you might be able to do to help improve performance and uptime.
While DRS exists to ensure proper load balancing to maintain performance, it does not take the number of VMs per host into account. This means you could have three ESXi hosts online, two with no VMs, while the remaining host runs 30. If a performance issue or resource contention does not exist on the host with 30 VMs, DRS will not migrate any VMs. Ideally, a more distributed balance of VMs across all three hosts would help reduce potential problems if a single host failed. DRS will not make that adjustment based on VM count. However, with carefully placed rules, you can give DRS a helping hand.
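Because DRS balances on resource demand rather than VM count, spotting this kind of lopsided placement is something you have to audit yourself. The sketch below is a minimal, self-contained illustration in plain Python; the `inventory` dict and host names are hypothetical stand-ins for data you would actually pull from vCenter (for example, with PowerCLI or an automation toolkit):

```python
# Hypothetical inventory: host name -> list of VM names. In a real
# environment this data would come from vCenter, not be hard-coded.
def count_imbalance(inventory, max_spread=5):
    """Flag hosts whose VM count deviates from the cluster average
    by more than max_spread, so an admin can rebalance manually."""
    counts = {host: len(vms) for host, vms in inventory.items()}
    avg = sum(counts.values()) / len(counts)
    return {host: n for host, n in counts.items() if abs(n - avg) > max_spread}

inventory = {
    "esxi-01": [f"vm{i}" for i in range(30)],  # 30 VMs on one host
    "esxi-02": [],                              # two idle hosts
    "esxi-03": [],
}
print(count_imbalance(inventory))
```

A report like this highlights exactly the scenario above: one host carrying 30 VMs while two sit empty, even though DRS sees no contention to act on.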
Let's start with two key categories that exist when setting up HA and DRS rules: VMs you want to separate and VMs you want to keep together.
Separating VMs seems pretty easy and comes with clear benefits. If you have two Active Directory domain controllers or print servers, you may want to keep them on separate hosts in case a host fails. As your environment grows, though, the simple rule of keeping things apart gets more complex. When you specify that particular VMs stay apart, you're doing it to protect against failure. However, if you separate the VMs and both hosts sit in the same equipment rack, how much protection have you really gained? If a single host has an issue you might be OK, but a rack-level power or networking failure can still take both VMs down.
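One way to catch this blind spot is to check anti-affinity groups against physical rack placement. The following is a hedged sketch in plain Python, assuming you maintain (or can export) a VM-to-host map and a host-to-rack map; all names here are illustrative, not pulled from any real vCenter:

```python
# Hypothetical placement data: which host each VM runs on, and which
# rack each host sits in. Rack data typically lives in a CMDB or
# spreadsheet, since vSphere itself doesn't track it.
def rack_conflicts(vm_to_host, host_to_rack, separated_groups):
    """For each group of VMs an anti-affinity rule keeps on separate
    hosts, report groups whose hosts all share a single rack."""
    conflicts = []
    for group in separated_groups:
        racks = {host_to_rack[vm_to_host[vm]] for vm in group}
        if len(racks) == 1:
            conflicts.append((tuple(group), racks.pop()))
    return conflicts

vm_to_host = {"dc01": "esxi-01", "dc02": "esxi-02"}
host_to_rack = {"esxi-01": "rack-A", "esxi-02": "rack-A"}
print(rack_conflicts(vm_to_host, host_to_rack, [["dc01", "dc02"]]))
```

Here the two domain controllers are on different hosts, so the DRS rule is satisfied, yet both hosts live in rack-A, which is exactly the exposure the paragraph above warns about.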
That is where the ability to set preferred (rather than required) hosts for VMs becomes important. Combining preferred-host rules with VM separation rules lets critical VMs reside not just on different hosts but on hardware in different racks, protecting against rack outages. This matters even more in blade server environments. Having multiple critical VMs on different blades in the same blade enclosure only protects against a single blade failure. A chassis or rack failure can be devastating to a virtualized environment because blades pack a higher density of VMs per rack. While we often design for consolidation and performance, it is important to design with failure in mind as well. With blade systems, this may mean limiting the number of blade enclosures in one rack to better support the desired HA and DRS rules.
There are also reasons to keep certain VMs together on the same host. This is ideal for VMs that need to communicate with each other frequently. A front-end web server and a back-end database server that talk heavily are ideal candidates to remain on the same host. If they reside on the same VLAN, their traffic stays internal to the host, avoiding excessive physical network traffic. Of course, this presents a problem if you're weighing possible HA rules; but does it really matter if they both go down? If the servers are dependent on each other, having one up and running without the other has no value. One issue with both VMs on the same host, however, is that they may not start in the correct order. This can be addressed by putting the VMs in a vApp, which lets you control the startup order and add a delay as needed.
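The ordered power-on that a vApp provides can be sketched in a few lines. This is not vSphere code, just a minimal Python model of the behavior, with a made-up plan in which the database gets a head start before the web server boots:

```python
import time

# Illustrative startup plan mimicking a vApp's ordered power-on:
# each entry is (vm_name, seconds_to_wait_before_starting_the_next).
startup_order = [("db01", 120), ("web01", 0)]

def power_on_sequence(plan, power_on, sleep=time.sleep):
    """Start VMs in order, pausing after each for its configured
    delay so dependent services (e.g., the database) come up first."""
    started = []
    for vm, delay in plan:
        power_on(vm)          # in vSphere, the vApp issues the power-on
        started.append(vm)
        if delay:
            sleep(delay)      # vApp startup delay between entries
    return started
```

The `power_on` and `sleep` callables are injected so the sequencing logic can be exercised without real hosts, which mirrors how a vApp separates the ordering policy from the power operations themselves.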
While a lot of focus is given to VM startup priorities, having VMs that do not automatically restart is a valid and worthwhile option. Test and development servers can be left off the initial startup list, freeing resources so more production VMs can be started. HA also gives you the option of allowing "VMs to violate resource constraints." By default, if the cluster cannot guarantee the resources a VM needs, that VM will not be started; if you override this, the VM starts anyway but may suffer performance issues from the lack of resources. Not restarting test and development VMs frees up additional resources, and overriding the resource-constraint setting keeps the remaining VMs running. They may run slower, but in some cases that is better than not running at all.
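The triage described above amounts to a simple policy: skip anything marked as never-restart, then order the rest by priority. A small sketch, with hypothetical VM names and priority labels standing in for the HA restart-priority setting:

```python
# Illustrative per-VM restart priorities; "disabled" VMs are
# deliberately excluded from automatic restart after a host failure.
priorities = {"prod-db": "high", "prod-web": "medium", "dev01": "disabled"}

def restart_list(priorities):
    """Return VMs eligible for HA restart, highest priority first."""
    order = {"high": 0, "medium": 1, "low": 2}
    return sorted(
        (vm for vm, p in priorities.items() if p != "disabled"),
        key=lambda vm: order[priorities[vm]],
    )

print(restart_list(priorities))
```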
Safeguard rules with documentation
Within the HA and DRS configuration screens, there is no simple way to export the rules or configuration. That means a critical piece of your infrastructure goes undocumented. The chance you would ever lose these settings seems remote; in fact, aside from deleting the rules, the only way is to completely disable HA and DRS. No one would do that, right? Unfortunately, it happens all too often. On two occasions I have seen VMware Certified Professionals who, intending to disable host monitoring and set DRS to manual mode, instead cleared the HA and DRS check boxes in the cluster settings, which removes both features from the cluster. Sure, there is a warning box, but who has time to read those? Doing this won't reboot any guests or hosts, but it removes every HA and DRS rule you had, including resource pools. Unlike some applications, there is no undo button: your rules are gone forever -- unless you've taken the time to write them down. A spreadsheet recording the VM separation rules, HA startup priorities and resource pool settings can be a lifesaver in this unlikely event.
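If a spreadsheet feels too manual, even a tiny script can turn a hand-maintained rule list into a CSV you keep under version control. The records below are hypothetical examples of what you might transcribe from the vSphere client; only the export step is shown:

```python
import csv
import io

# Hypothetical rule records -- in practice you would transcribe these
# from the vSphere client or pull them with an automation toolkit.
rules = [
    {"rule": "separate-dcs", "type": "anti-affinity", "vms": "dc01;dc02"},
    {"rule": "web-db-together", "type": "affinity", "vms": "web01;db01"},
]

def export_rules(rules, fh):
    """Write the rule list as CSV so it survives an accidental
    disable of HA or DRS on the cluster."""
    writer = csv.DictWriter(fh, fieldnames=["rule", "type", "vms"])
    writer.writeheader()
    writer.writerows(rules)

buf = io.StringIO()       # swap in open("drs_rules.csv", "w", newline="")
export_rules(rules, buf)
print(buf.getvalue())
```

Re-running the export after every rule change gives you a dated record to rebuild from if the check boxes ever get cleared.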