Nonstop computing is IT nirvana, especially if it's economically achievable. If we didn't have to worry about stoppages...
in app processing or communications, coupled with perpetual storage that never loses a bit, we would have an ideal computer system. This is the dream that drives the desire for five 9s availability -- a system that needs no more than roughly 5.26 minutes of total downtime per year.
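That downtime budget follows directly from the arithmetic. A quick illustrative sketch:

```python
# Annual downtime budget for a given number of nines of availability.
def downtime_minutes_per_year(nines: int) -> float:
    availability = 1 - 10 ** -nines          # e.g. five 9s -> 0.99999
    return (1 - availability) * 365.25 * 24 * 60

print(round(downtime_minutes_per_year(5), 2))       # five 9s: about 5.26 minutes
print(round(downtime_minutes_per_year(6) * 60, 1))  # six 9s: about 31.6 seconds
```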
The traditional approach to high availability (HA) has involved specialized computers with expensive hardware features and significant duplication. Moreover, a robust deployment also needs a geographically distant standby site to ride out natural disasters in real time.
Sensibly, all but the largest or most specialized IT shops avoid the cost and associated admin nightmares. It's not that they wouldn't like HA; it's that they have limited budgets and must be realistic. Meanwhile, they know that many IT workloads can tolerate some downtime.
Virtualization transformed HA
Enter virtualization. The virtual server approach is a crucial change in the HA paradigm because it encourages many more parallel-active instances of the same app code. Now, if one instance dies, there are, say, nine -- or 99 -- other instances still running.
Looked at simplistically, this isn't a five 9s operation, since a small percentage of the total service is lost and operations no longer run at 100%. In practice, adding one more instance to the pool solves that issue, typically for far less than the cost of special hardware.
Is this just playing a numbers game? If the objective of five 9s availability is to keep 100 instances running, then the extra instance in the overprovisioned model meets that objective -- with more spares added if there's a need for additional robustness.
Look at it another way: You can use orchestration to create a new replacement instance and quickly return to 100%. If detection and replacement take only seconds, the system can deliver five 9s total availability as long as the failure count stays low.
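The overprovisioning argument is simple binomial arithmetic: with a pool of independent instances, each spare sharply raises the odds that enough of them are up at any moment. A sketch, with instance counts and per-instance availability chosen purely for illustration:

```python
from math import comb

def pool_availability(total: int, needed: int, p_up: float) -> float:
    """Probability that at least `needed` of `total` independent
    instances are up, each with availability `p_up`."""
    return sum(comb(total, k) * p_up**k * (1 - p_up)**(total - k)
               for k in range(needed, total + 1))

# Service needs 100 instances; each instance is only 99.9% available.
print(pool_availability(100, 100, 0.999))  # no spares: roughly 0.905
print(pool_availability(101, 100, 0.999))  # one spare: roughly 0.995
print(pool_availability(103, 100, 0.999))  # three spares: past five 9s
```

Three cheap spares take a pool of mediocre three-9s instances past the five 9s threshold -- which is the whole economic case against special hardware.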
Virtualized clusters offer better upgrade management
Of course, the devil is in the details. Slow failure detection and the time required to create and boot virtual instances bog down the typical automated detect-and-recover cycle. The overprovisioned model might be the answer and remains cheaper than special hardware.
Even with overprovisioning, clusterwide failures and software upgrades are hard to manage within the five 9s budget of barely five minutes of annual downtime. Cloud service providers get spectacular headlines for zone-level failures, typically triggered when updating network router software. These can take hours to resolve and blow five 9s claims out of the water.
Software upgrades often involve synchronizing all the changes within instances, so each instance has to go offline to upgrade safely. Fortunately, virtual clusters usually have plenty of unused instances available. Rather than take down and change the instance, you can create another one with the upgraded code, then switch over when ready and kill the old instances off.
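That create-then-switch upgrade pattern is often called blue-green deployment. A toy sketch of the flow -- the `Pool` class and its methods are illustrative stand-ins, not any real orchestrator's API:

```python
class Pool:
    """Toy stand-in for a virtual-instance pool; real orchestrators
    expose equivalent create/route/destroy operations."""
    def __init__(self, instances):
        self.live = list(instances)   # instances currently taking traffic

    def create(self, version):
        return f"instance@{version}"  # stand-in for booting a fresh instance

    def upgrade(self, version):
        # Blue-green style: stand up upgraded copies alongside the old set...
        new = [self.create(version) for _ in self.live]
        # ...switch traffic to the new set only once it is ready...
        old, self.live = self.live, new
        # ...then the retired instances can be destroyed at leisure.
        return old

pool = Pool(["instance@v1"] * 3)
retired = pool.upgrade("v2")
print(pool.live)   # ['instance@v2', 'instance@v2', 'instance@v2']
```

The old set stays bootable until the switch completes, so a bad upgrade can be rolled back by routing traffic straight back to it.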
There are some challenges to this due to pending transactions, so this isn't completely app-transparent. If a set of microservices replaces an app, there are likely revision-level dependencies that require enormous care to manage.
Virtualization has potential for a refocused HA service
Getting to HA nirvana isn't only an IT measurement. What does a mobile user see? As far as a paying customer is concerned, a mobile app environment that restarts every tenth transaction isn't even close to five 9s.
Recovering transactions-in-progress is a difficult challenge. It involves journaling and restart points, and the right approach is use case-dependent. The location of the failure also affects the detection and response mechanisms.
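The journaling idea reduces to a write-ahead log: record each transaction durably before applying it, so a replacement instance can replay in-progress work after a failure. A minimal sketch, using an in-memory stream in place of durable storage:

```python
import io
import json

class Journal:
    """Minimal write-ahead journal: every transaction is logged before it
    is applied, so a restarted instance can replay the log to recover."""
    def __init__(self, stream):
        self.stream = stream

    def log(self, txn: dict) -> None:
        self.stream.write(json.dumps(txn) + "\n")

    def replay(self, apply) -> None:
        self.stream.seek(0)
        for line in self.stream:
            apply(json.loads(line))

# Log two transactions, then simulate a restart by replaying the journal.
journal = Journal(io.StringIO())
journal.log({"op": "credit", "amount": 10})
journal.log({"op": "debit", "amount": 4})

balance = []
journal.replay(lambda t: balance.append(
    t["amount"] if t["op"] == "credit" else -t["amount"]))
print(sum(balance))  # 6
```

Real systems add restart points (checkpoints) so replay covers only the tail of the log, and idempotency checks so a half-applied transaction isn't applied twice.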
Virtualization holds a great deal of promise for HA service, but the focus of HA has migrated from expensive platforms to user applications. This is easier for new apps, but even the current trend to break down legacy apps into microservices fits this paradigm shift. As they are created, microservices can have HA features built in.
Storage advancements protect against data loss
We've come a long way from RAID arrays with only drive-level redundancy. Newer storage uses replication or erasure coding and protects against appliance failure rather than drive failure. With three-way replication, data survives the failure of two of the three appliances; with erasure coding, it can survive the loss of as many as 10 drives or two appliances. With data downtimes in the 32 seconds/six 9s range, it's unlikely for data to ever be lost.
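The core mechanism behind erasure coding can be shown with the simplest case: one XOR parity block lets any single lost data block be rebuilt from the survivors. Production systems use Reed-Solomon-style codes with multiple parity blocks to tolerate the multi-drive failures described above, but the principle is the same:

```python
def make_parity(blocks):
    """XOR all data blocks together into one parity block."""
    parity = bytes(len(blocks[0]))
    for block in blocks:
        parity = bytes(x ^ y for x, y in zip(parity, block))
    return parity

def reconstruct(surviving, parity):
    """Rebuild the single missing block: XOR parity with the survivors."""
    missing = parity
    for block in surviving:
        missing = bytes(x ^ y for x, y in zip(missing, block))
    return missing

data = [b"AAAA", b"BBBB", b"CCCC"]
parity = make_parity(data)
# Lose the middle block, then rebuild it from the other two plus parity:
print(reconstruct([data[0], data[2]], parity))  # b'BBBB'
```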
Here, the devil is again in the details. The switch to remote copies in another zone can be slow, especially if many tenants are trying to fail over at the same time. This is a function of your disaster recovery procedures, which are within IT's control.
Virtualization makes five 9s economically feasible
Overall, virtualization opens up the option to achieve five 9s availability cheaply. The burden lies with your apps rather than your infrastructure. Going partway down the road to five 9s is a worthwhile exercise for any virtualized systems admin, because your customers will see a better and more stable IT environment.
One final consideration: Rigorously test this type of HA environment so that you'll learn what still needs fixing. Netflix, for example, has a chaos generator for just that purpose, and this is the right approach, rather than highly structured testing. With proper testing procedures in place, virtualization can offer practical HA.