Large VMware hosts can seem attractive, but the resultant VM density can reduce your ability to quickly recover...
It's tempting to examine VMware hosts from a hardware perspective and focus on going as large as possible. In a shared environment, more resources will always seem better than the risk of not having enough.
The memory capacities in hardware platforms have grown, and the sheer number of cores that can fit in a physical CPU is getting hard to count. The ability to scale up hosts often means VMs can reach staggering sizes, but that doesn't happen often.
Application development has shifted to a more scaled-out approach, so the massive, single-server model isn't as popular as it once was. This leaves IT staff with high-capacity ESXi hosts that can lead to increased VM density.
Control VM density
Increased VM density isn't necessarily a bad thing. It can help reduce OS licensing costs, network connections and the size of a data center's footprint.
The key to controlling VM density is to evaluate how your environment will behave in the event of a failure. VMware offers high availability (HA), predictive HA and fault tolerance to help recover from system failures. These technologies help prevent workloads from going offline and reduce the need to quickly restart them after a host failure. These features work with both small and large hosts, but their effectiveness can vary depending on the size of the VMware hosts and VM density.
Host size is often based on CPU or memory load. CPU is usually the top consideration because memory is easier to adjust than CPU cores. Depending on your math, you might want to allocate two to four vCPUs per core.
On the low end -- with an older CPU -- this could be a quad-core CPU that hyper-threading presents as eight logical cores. If you allocate three vCPUs per core, that results in 24 vCPUs per physical CPU socket. A dual CPU socket server would then give you 48 vCPUs for your VMs, which averages 24 VMs per host if each VM gets 2 vCPUs.
VM configuration will vary depending on need, but this example is broadly applicable. On the high end, a newer CPU with hyper-threading raises the usable core count of each physical CPU from 8 to 48.
With the same math -- 3 vCPUs per CPU core on a dual CPU socket server -- you go from 16 cores in your host to 96. This means that the supported number of vCPUs goes from 48 to 288. That increases your possible VM density from 24 VMs to 144 VMs per host.
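The density arithmetic above can be reproduced with a short Python sketch. The function name `vm_density` and its parameters are illustrative, not part of any VMware tool, and the defaults (dual socket, 3 vCPUs per core, 2 vCPUs per VM) simply mirror the assumptions in the example.

```python
def vm_density(cores_per_socket, sockets=2, vcpus_per_core=3, vcpus_per_vm=2):
    """Return (host_cores, supported_vcpus, vms_per_host) for one host.

    cores_per_socket is the hyper-threaded (logical) core count per CPU.
    The defaults are the article's example ratios, not VMware guidance.
    """
    host_cores = cores_per_socket * sockets
    supported_vcpus = host_cores * vcpus_per_core
    vms_per_host = supported_vcpus // vcpus_per_vm
    return host_cores, supported_vcpus, vms_per_host


# Low-end host: 8 logical cores per socket
print(vm_density(8))   # (16, 48, 24)

# High-end host: 48 logical cores per socket
print(vm_density(48))  # (96, 288, 144)
```

Changing `vcpus_per_core` to 2 or 4 shows how sensitive the resulting density is to the overcommit ratio you choose.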
This might sound great because of the hardware footprint reduction and possible OS savings, but consider what will happen during a failure. Losing 24 VMs is bad, but losing 144 at once? Fault tolerance can only protect four VMs per host, so that might not be helpful enough. Any increase in density requires examination with regard to potential failures.
Evaluate HA restart speed
Having more hosts reduces density, which can make technologies like fault tolerance more attractive because they can better handle fewer VMs per host.
For example, supporting 400 VMs at a ratio of 25 VMs per host takes 16 hosts, so you need a minimum of 17 hosts to handle the loss of one. Spreading the 25 VMs from a failed host across the 16 survivors means no host has to restart more than 2 VMs, which keeps restart time short.
If you do the same math with 4 hosts and 400 VMs, you end up with an average of 100 VMs per host. Losing one means each of the 3 remaining VMware hosts has to restart roughly 33 more VMs. Adding 2 VMs to each of 16 hosts won't make a noticeable difference, but adding 33 VMs to each of 3 hosts in restart mode is a substantial jump that can significantly affect your service-level agreements.
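To compare the two layouts side by side, here is a minimal Python sketch of the restart math. It assumes VMs are spread evenly and that a failed host's VMs are redistributed evenly across the survivors, which real HA placement does not guarantee. The helper names `hosts_needed` and `restart_load` are hypothetical.

```python
import math


def hosts_needed(total_vms, vms_per_host, spare_hosts=1):
    """Hosts required to run total_vms at the target ratio, plus spares."""
    return math.ceil(total_vms / vms_per_host) + spare_hosts


def restart_load(vms_on_failed_host, surviving_hosts):
    """Return (average, worst_case) VMs each survivor must restart.

    Assumes an even spread; worst_case rounds the average up.
    """
    average = vms_on_failed_host / surviving_hosts
    return average, math.ceil(average)


# Scale-out: 400 VMs at 25 per host, one spare host
print(hosts_needed(400, 25))   # 17
print(restart_load(25, 16))    # (1.5625, 2)

# Scale-up: 400 VMs on 4 hosts, so 100 VMs land on 3 survivors
average, worst_case = restart_load(100, 3)
print(round(average, 1), worst_case)  # 33.3 34
```

The article's figure of 33 VMs is the rounded average; with an even spread, at least one of the three surviving hosts would actually take 34.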
The downside to scaling out is an increased footprint in your data center, more network connections and, of course, more VMware licenses.
It's technically possible to find the best density balance by examining costs, but it's difficult to attach a fixed cost to an extensive outage. This is due to the nature of virtualization and the way workload movement obscures your ability to predict exactly what will happen. You can plan for disaster and schedule recovery testing, but disaster preparation starts as soon as you make the commitment to expand virtual infrastructure. This requires finding a balance between massive VMware hosts and scaling out for potential failures.