It's important to design a resilient infrastructure that ensures your workloads can continue to run when a server or component fails, but many organizations aren't aware of the level of resiliency they need and what it will cost.
When we design an infrastructure, we often talk in terms of how many failures the system can sustain before critical operations are disrupted. We can use the N+1 formula to express an infrastructure that includes one more component (+1) than is needed for the expected workload, giving us the ability to withstand a single failure. However, more organizations are now considering the more complex N+X+Y formula.
Why would we want to consider N+X+Y resiliency? In a virtualized environment, we have a feature called high availability that automatically restarts failed virtual machines on another server. On the surface, this is a great concept, but as we look deeper, we begin to see an issue: As we drive higher levels of VM density, we leave less spare capacity to absorb a failure.
Let's consider an example in which we have two hosts with 512 GB of RAM each and 12 cores per host.
At any time, we cannot exceed 50% memory or CPU usage, because if either host fails, the one remaining host needs to be able to take on the failed host's entire load. So, in effect, we are idling 512 GB of the 1024 GB of memory across the two servers.
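The two-host arithmetic above can be sketched as follows. The host count and RAM figures come from the example; the variable names are illustrative:

```python
# N+1 with two hosts: the surviving host must absorb the full load,
# so usable cluster memory equals the capacity of a single host.
hosts = 2
ram_per_host_gb = 512

total_gb = hosts * ram_per_host_gb      # 1024 GB raw capacity
usable_gb = total_gb - ram_per_host_gb  # reserve one host's worth for failover
print(total_gb, usable_gb)              # 1024 512
```

Half the raw memory sits idle purely to guarantee the failover.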
However, most clusters include more than two hosts to reduce this waste, so let's expand this example to 10 hosts. In that case, we would have 5 TB of RAM and 120 cores in the cluster. As a rule, we do not want to exceed 90% utilization of the memory (nothing bad will happen, but you would have no room to maneuver VMs in and out of systems).
So, we can only run each host up to 460.8 GB of RAM, or 4608 GB total -- not taking a host failure into account. If one host fails, we will go over the 90% barrier and some guests will not restart. So, now we need to account for 512 GB of memory for one failed host. We can now only use 4096 GB of memory, or 80% of total capacity -- the equivalent of losing two full servers.
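The same ladder for the 10-host cluster can be sketched with the figures from the example (integer math keeps the results exact):

```python
# 10-host cluster: apply the 90% utilization ceiling, then reserve
# one host's worth of memory so guests can restart after a failure.
hosts = 10
ram_per_host_gb = 512

total_gb = hosts * ram_per_host_gb        # 5120 GB raw
ceiling_gb = total_gb * 9 // 10           # 4608 GB: stay under 90%
usable_gb = ceiling_gb - ram_per_host_gb  # 4096 GB after reserving one host
print(usable_gb, usable_gb / total_gb)    # 4096 0.8
```

That final ratio is the 80% figure above: two hosts' worth of capacity is effectively lost.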
Another factor we need to consider is very large VMs. For example, if we have a guest that uses 72 GB or 128 GB of RAM, and it needs to be restarted on a different host, there may not be enough resources available on any single host to restart that machine.
At this juncture, we can drop back our density to a number that allows a host failure and offers the ability to restart very large VMs. Let's say we have three very large VMs (one requiring 72 GB and two needing 128 GB) totaling 328 GB. We need to have 128 GB free on at least two hosts, but we cannot designate which hosts, just in case those servers happen to be the ones that fail or are not available at the time of a failure. So, every host needs to be capable of running these VMs, which means we'll need:
- 5120 GB total available memory
- 4608 GB to keep below 90% utilization
- 4096 GB for a single host failure
- 3072 GB for the ability to restart one 128 GB VM
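The list above forms a capacity ladder; it can be reproduced as a quick sketch. The 3072 GB figure is the article's sizing for restarting one 128 GB VM with headroom on every host:

```python
# Capacity ladder for the 10-host cluster (all figures in GB).
total = 5120                          # 10 hosts x 512 GB
below_ceiling = total * 9 // 10       # 4608: stay under 90% utilization
after_failure = below_ceiling - 512   # 4096: absorb one failed host
with_vm_headroom = 3072               # per-host headroom for a 128 GB restart

lost_pct = 100 - 100 * with_vm_headroom // total
print(lost_pct)                       # 40 -> 40% of raw capacity lost
```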
If we size our cluster to be able to restart just one large guest, we've lost 40% of our capacity. If we are trying to push maximum density, this is an unacceptable solution.
Now, if we implement an N+1+1 or an N+2+1 scenario, we could take the cluster up to 4608 GB (90%) and still sustain one or two host failures, as well as perform maintenance, without impacting the other guest workloads or our high-availability resiliency. Another way to look at it is to say that by adding a dedicated standby server, we can recover 40% of our resources.
N+1+1 vs. N+2+1
Statistics and years of experience drive us to predictions on which components will fail and when. Needless to say, every component will fail at some point. So, the real value of X in the N+X+1 equation comes down to how many failures we want to prepare for and how we'll deal with them. Remember that multiple failures may not mean the same component failing on two machines -- it simply means two machines have failed for some reason -- one may be a CPU problem and the other a bad software patch. Regardless of the cause, we still have two servers down. So X is the number of failures that a system can have before we start seeing degradation in performance and resources.
The cost of redundancy
If we consider a simple N+1 scenario, it will cost us about 40% of our resources, which on 10 hosts is 2 TB of RAM. If we go with N+1+1, we will lose only the 10% needed for moving systems in and out of hosts, plus the two to three hosts (1-1.5 TB) that sit in standby and maintenance mode. As our clusters get larger and clients require more resiliency and more very large VMs, N+1+1 is the less expensive option compared to losing 30% to 50% of our capacity to accommodate a single failure.
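A rough comparison of the wasted capacity in the two approaches can be sketched as follows. The 40% and 10% figures come from the section above; the standby-host count (two) is one of the illustrative cases mentioned:

```python
# Comparing reserved (unusable) memory under N+1 vs. N+1+1
# for ten 512 GB hosts (all figures in GB, integer math).
hosts, ram_per_host = 10, 512
total = hosts * ram_per_host          # 5120 GB raw

n1_waste = total * 40 // 100          # 2048 GB (~2 TB) reserved under N+1
standby_hosts = 2                     # illustrative: two dedicated standbys
n11_waste = total // 10 + standby_hosts * ram_per_host  # 10% + standby RAM

print(n1_waste, n11_waste)            # 2048 1536
```

Even after paying for dedicated standby hardware, less memory sits idle than under plain N+1 at this cluster size.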
Deciding between N+1+1 and N+2+1 is a matter of determining the capital available for additional servers. The question is, "Will we have multiple failures at the same time?" Assuming an average enterprise-grade server costs about $80,000, it's not hard to imagine how cost can be a big factor in deciding which approach to choose. Going with an N+2+1 solution can cost twice as much as an N+1+1 solution, but it is still less costly than going with an N+1 high-availability solution.
After careful examination of the facts, N+X+1 is the best approach: it makes the best use of our capacity and resources. The question of whether to go with one or two standby servers comes down to whether your business is willing to tolerate the performance hit of inadequate capacity in the event of a failure versus the cost of the standby servers. Remember that anything with a moving part, or made by a human, will fail at some point; it is just a matter of when. How we deal with that failure, and how many of those failures happen at the same time, is a matter of conjecture.