To achieve high levels of operational efficiency and maximize profit, cloud providers often have to approach their data center designs differently from traditional enterprise IT. Some of these approaches may not work for your business, but others can help you think outside the box to find new savings.
Some cloud providers go to great lengths to maximize efficiency. One request I didn't expect came from a cloud provider that asked a server vendor to put all of the connectors, status lights and swappable parts on one side of its new servers. Having calculated that it was an average of 200 steps from the front of its rack rows to the back, the cloud provider wanted to minimize the time a maintenance engineer would spend walking back and forth.
This example highlights the operational efficiency cloud providers are looking for -- a level of efficiency that could be useful for architects and designers in enterprise IT. An old adage says that enterprise IT spends 70% of its time and budget just keeping things running. The only way to focus more time and money on moving forward is to improve the way we approach the day-to-day tasks that make up that 70%. By now, most large IT teams are on board with automation. Repetitive tasks should be scripted, and more complex activities should be orchestrated for consistent outcomes with minimum human effort. If you are not yet using scripting and automation, start learning now.
Delay and schedule maintenance
Another area where cloud IT operations differ from enterprise IT is in replacing or, rather, not replacing failed components. This approach is most often seen in containerized data centers. These containers are packed with low-cost servers and scale-out storage, connected to the network, and then left alone. The containerized data center is used as a pool of resources that runs a pool of workloads. Individual components -- fans, disks and CPUs -- are not replaced as they fail. Failed servers are simply shut down remotely, and their workloads are automatically started on other servers. Failed disks or storage systems have their load redistributed among the remaining nodes. When the capacity of the container drops below a set threshold, all the workloads are migrated to other containers and the original is sent to the recyclers. As with the servers that have all their components on one side, the aim is to minimize human effort over the whole life of the system.
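The fail-in-place policy described above can be sketched in a few lines of Python. This is only an illustration of the decision logic, not any provider's actual tooling: the `Container` class, its field names and the 50% retirement threshold are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Container:
    """A sealed pool of servers in which failed servers are never repaired."""
    name: str
    total_servers: int
    # Assumed value for illustration: retire once under half the servers remain.
    retire_threshold: float = 0.5
    failed: set = field(default_factory=set)

    def healthy_fraction(self) -> float:
        """Share of the original servers that are still in service."""
        return (self.total_servers - len(self.failed)) / self.total_servers

    def record_failure(self, server_id: int) -> str:
        """Mark a server as failed and return the action the operator takes."""
        self.failed.add(server_id)
        if self.healthy_fraction() < self.retire_threshold:
            # Capacity is too low: migrate everything and recycle the container.
            return "retire"
        # Otherwise shut the server down and restart its workloads elsewhere.
        return "redistribute"
```

Calling `record_failure` once per failed server returns "redistribute" until the healthy share drops below the threshold, at which point it returns "retire". Nobody is dispatched to the container in either case, which is the point of the design.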
Most enterprise IT organizations aren't ready for containerized data centers. On the other hand, there is a big move in enterprise IT toward using converged infrastructure as a platform for virtualization, which is a smaller-scale application of a similar idea. The infrastructure is built into a pool of resources, and it runs a pool of workloads. The idea of not replacing failed components may seem a bit revolutionary, but it wouldn't be too hard to build the infrastructure with some spare capacity to step in for the failed parts. Hot spare disks and servers are already common practice, and by having a few more, you can remove the need to rapidly replace failed units. Having a hardware replacement window once a month or once a quarter, where failed hardware is swapped all at once, can improve efficiency by minimizing maintenance time and reducing the urgency of getting replacement parts, which saves on expedited shipping charges.
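To get a rough feel for how many extra spares a monthly or quarterly replacement window might require, one option is a simple Poisson estimate. The sketch below assumes failures are independent and occur at a constant annual rate, which is a simplification. The function name, the 4% annual failure rate and the 99% confidence target are illustrative assumptions, not figures from any vendor.

```python
import math


def spares_needed(units: int, annual_failure_rate: float,
                  window_days: int, confidence: float = 0.99) -> int:
    """Smallest spare count that covers the failures expected between
    maintenance windows with the given probability (Poisson assumption)."""
    if not 0 < confidence < 1:
        raise ValueError("confidence must be between 0 and 1")
    # Expected number of failures during one maintenance interval.
    expected = units * annual_failure_rate * window_days / 365.0
    k = 0
    term = math.exp(-expected)   # P(exactly 0 failures)
    cumulative = term            # P(at most k failures)
    while cumulative < confidence:
        k += 1
        term *= expected / k     # P(exactly k failures)
        cumulative += term
    return k


# Assumed example: 100 servers with a 4% annual failure rate.
monthly = spares_needed(100, 0.04, window_days=30)
quarterly = spares_needed(100, 0.04, window_days=90)
```

Under these assumed numbers, a longer interval between windows calls for more spares, so the choice between a monthly and a quarterly window is a trade-off between spare hardware cost and maintenance visits.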
I think the biggest impediment to this approach in enterprise IT is the project-based nature of most shops. Specific projects have finite budgets, and IT needs to bring a product to production in a short time. Decisions that provide better lifecycle cost and operational efficiency will only be accepted if they don't cost more or delay the project.
Reducing the cost of "keeping the lights on" in enterprise IT may require some rethinking of how we design our data centers and handle failures. If we are ever going to spend more of the IT budget on improvement, then we'd better focus on improving the way we run things.