Intentionally breaking your infrastructure may seem like a crazy idea, but some prominent and forward-looking companies do it on a regular basis. I always enjoy talking with enterprise IT pros about how Netflix builds its applications under the assumption they will fail, and how the company tests resilience by purposely sabotaging random parts of its infrastructure. Talking about a disposable VM and suggesting that services should be crashed at random is a great way to give everyone from the architects to the operations team nightmares. Why does Netflix break things, and is this something that enterprise IT shops can learn from?
Embrace the disposable VM
Companies like Netflix Inc. have decided it's easier and more cost effective to assume a virtual machine will fail and to design applications that can cope with that failure. The motivation is simple: Netflix wants to buy compute resources as cheaply as possible. This means some parts of its infrastructure run on cloud resources that aren't guaranteed -- hence the lower cost. Amazon could tear down some of the VMs the company uses at any time if it needs the resources to serve a customer willing to pay more. When these VMs are torn down, Netflix doesn't want its customers to have a poor experience, so the infrastructure must be able to handle the loss of these VMs without a loss of service.
If you want to get good at something, you have to practice, so Netflix practices tearing down VMs with an application called "Chaos Monkey," which is designed to randomly disable production workloads. Part of the objective is to instill the idea of the disposable VM in the minds of developers so they get good at building applications that can cope with failure. A failure should be detected and recovered from automatically, without intervention from the operations team and without customers noticing.
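The kill-and-recover cycle described above can be sketched as a toy simulation. This is not Netflix's Chaos Monkey code and it talks to no real cloud API; the `Fleet` class, the `unleash_monkey` function and the VM names are hypothetical, invented purely to illustrate the idea of random termination followed by automatic replacement.

```python
import random


class Fleet:
    """Toy in-memory pool of disposable VMs (hypothetical, not a real cloud API)."""

    def __init__(self, desired_count):
        self.desired_count = desired_count
        self._next_id = 0
        self.running = set()
        self.reconcile()  # bring the fleet up to its desired size

    def _launch(self):
        vm_id = f"vm-{self._next_id}"
        self._next_id += 1
        self.running.add(vm_id)
        return vm_id

    def terminate(self, vm_id):
        self.running.discard(vm_id)

    def reconcile(self):
        """Launch replacements until the fleet is back at its desired size."""
        launched = []
        while len(self.running) < self.desired_count:
            launched.append(self._launch())
        return launched


def unleash_monkey(fleet, rng):
    """Terminate one randomly chosen VM; return its id, or None if none are running."""
    if not fleet.running:
        return None
    victim = rng.choice(sorted(fleet.running))
    fleet.terminate(victim)
    return victim


if __name__ == "__main__":
    rng = random.Random()
    fleet = Fleet(desired_count=4)
    for _ in range(3):
        victim = unleash_monkey(fleet, rng)
        replacements = fleet.reconcile()  # recovery happens with no operator involved
        print(f"killed {victim}, launched {replacements}, running {len(fleet.running)}")
```

In a real environment the `reconcile` step would be the job of whatever automation keeps the service at its target capacity; the point of the exercise is that the loop never needs a person to step in.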
Instead of traditional hardware-based high availability, Netflix leans on application-level high availability -- where the application runs on several different VMs and is able to withstand the loss of a VM without failing. Most enterprise IT applications, on the other hand, are designed to run on perfect or near-perfect infrastructure. These enterprises use virtualization to build a more reliable infrastructure; if a physical server fails, software restarts the VMs that were running on it on other servers in the cluster. This means that recovery from hardware failure is fast, often taking just a few minutes. But this highly available, redundant physical infrastructure is expensive.
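As a rough illustration of application-level high availability, the sketch below sends a request to several replicas in turn and returns the first successful answer, so losing one VM does not fail the request. The names `ReplicaDown`, `make_replica` and `call_with_failover` are hypothetical, and the replicas are plain in-process functions standing in for copies of a service running on separate VMs.

```python
class ReplicaDown(Exception):
    """Raised when a replica cannot serve the request."""


def make_replica(name, healthy=True):
    """Build a stand-in for one copy of the service running on its own VM."""
    def handle(request):
        if not healthy:
            raise ReplicaDown(f"{name} is unavailable")
        return f"{name} served {request}"
    return handle


def call_with_failover(replicas, request):
    """Try each replica in order and return the first successful response."""
    errors = []
    for replica in replicas:
        try:
            return replica(request)
        except ReplicaDown as exc:
            errors.append(str(exc))  # remember the failure and move on
    raise ReplicaDown(f"all {len(replicas)} replicas failed: {errors}")


if __name__ == "__main__":
    replicas = [
        make_replica("vm-a", healthy=False),  # this VM has been torn down
        make_replica("vm-b"),
        make_replica("vm-c"),
    ]
    print(call_with_failover(replicas, "GET /titles"))
```

A production system would more likely put a load balancer with health checks in front of the replicas, but the principle is the same: availability comes from the application's redundancy rather than from the hardware underneath any single VM.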
Rethink in-house service levels
There may be developers or business units in your company that could accept low-cost VMs that could be down between 1 p.m. and 3 p.m. on the same day every month for planned maintenance. If you can design an infrastructure where a virtual machine can be down for a few hours, then there may be no need for that VM to connect to an expensive storage area network, since shared storage is largely what allows a VM to be restarted quickly on another host.
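To put a number on that service level, the short sketch below works out what a two-hour monthly window costs in availability and checks whether a given moment falls inside the window. The 30-day month and the choice of the 15th as the maintenance day are assumptions made for the example; the function names are hypothetical.

```python
from datetime import datetime

MAINTENANCE_DAY = 15            # assumption: any fixed day of the month would do
START_HOUR, END_HOUR = 13, 15   # 1 p.m. to 3 p.m.
HOURS_PER_MONTH = 30 * 24       # assumption: a 30-day month


def availability(downtime_hours, period_hours):
    """Fraction of the period during which the VM is expected to be up."""
    return 1 - downtime_hours / period_hours


def in_maintenance_window(moment):
    """True if the VM may be down for planned maintenance at this moment."""
    return moment.day == MAINTENANCE_DAY and START_HOUR <= moment.hour < END_HOUR


if __name__ == "__main__":
    window_hours = END_HOUR - START_HOUR
    print(f"{availability(window_hours, HOURS_PER_MONTH):.2%}")  # prints 99.72%
    print(in_maintenance_window(datetime(2024, 5, 15, 13, 30)))  # prints True
```

Under these assumptions the planned window alone still leaves the VM up well over 99 percent of the time, a useful figure to have on hand when agreeing a lower service level with the business.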
A clearly communicated service level is important here, along with the management authority to ignore requests to defer maintenance. IT needs to define a lower service level to go with the lower cost and to have the authority to stick to it. This is essentially an internal cloud with public cloud service levels.
Once developers become accustomed to using this type of VM, they may start developing real cloud applications -- with application-level availability -- that are suited to this type of platform. There should be no need to host an application that is used in-house on a public cloud platform. An internal cloud run to the same service levels should deliver the same cost savings. Of course, there are workloads that are not appropriate for a disposable VM, and these could remain on the expensive and resilient enterprise hardware.
There are a lot of benefits to having a service offering that matches the availability and cost of a public cloud. Enterprise IT spends a lot of money avoiding failure, while there is a growing movement in application development to embrace the disposable VM and keep services operational even when components fail. Adding a failure-prone level of service may achieve public cloud savings without the public cloud headaches.