In recent weeks we've heard of several data center outages affecting high-profile U.S. companies -- The Wall Street Journal, the New York Stock Exchange and United Airlines all went down in a single week. While it would be impossible to prevent every outage, these highly publicized failures can cost a lot of money and significantly affect how customers perceive a company. So, this month we asked our Advisory Board: What can and should companies do to maintain high levels of uptime? What common mistakes lead to downtime? And is the average data center as secure and resilient as customers hope and expect -- or are occasional outages just a fact of life?
Brian Kirsch, Milwaukee Area Technical College
The balance between availability and everything else is one of the cornerstones of IT. We all want our systems to be there whenever we need them -- and so do our managers and executives. The problem comes in when you have to balance availability against what it takes to achieve it. The concern is not simply cost; complexity and testing are the keys to making it all work together. No single hardware or software product can deliver availability on its own. While the backup and disaster recovery products we use today have become more wide-ranging and effective, applications have become even more complex. This constant race between applications and their availability results in large-scale outages when the disaster recovery products we have fail to keep up with our application needs and designs.
However, hardware and software are only a small part of the outage puzzle. Many outages occur due to system failure and change. We design to prevent failures; we secure systems to prevent unauthorized changes. All of this effort on the front end, however, cannot prevent every outage. We are overdue for a new way of looking at disaster recovery and outages. Let's design our systems with the idea that we will have an outage, instead of simply trying to prevent one. Embracing failure gives us true application resiliency because the failure protection is no longer only skin deep. We can then test and demonstrate that we have the ability to handle failure.
Nowhere is this more apparent than at Netflix, with its aptly named Chaos Monkey failure-injection tool. Netflix faced a massive reboot of the Amazon EC2 cloud while having to keep its services online. For many companies, the EC2 reboot presented something they thought they would never see, and few had plans to prevent an outage. Netflix, on the other hand, had a plan. Chaos Monkey's role at Netflix is to repeatedly and regularly exercise failure. By continually testing and correcting issues before they can create widespread outages, Netflix has built a service designed with failure in mind to ensure availability.
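The core idea behind Chaos Monkey can be sketched in a few lines: randomly terminate one instance in a redundant fleet and verify the service survives. This is a minimal illustration, not Netflix's implementation; the instance names and states are hypothetical.

```python
import random

# Hypothetical inventory of redundant service instances.
instances = {"api-1": "up", "api-2": "up", "api-3": "up"}

def chaos_monkey(fleet, rng=random):
    """Terminate one randomly chosen healthy instance to exercise failover."""
    healthy = [name for name, state in fleet.items() if state == "up"]
    if len(healthy) <= 1:
        return None  # in this sketch, never kill the last survivor
    victim = rng.choice(healthy)
    fleet[victim] = "down"
    return victim

def service_survives(fleet):
    """The service is available as long as at least one instance is up."""
    return any(state == "up" for state in fleet.values())

victim = chaos_monkey(instances)
assert service_survives(instances), "failure drill exposed an outage"
```

The point of running such a drill continually, rather than once, is that topology and code change constantly; yesterday's passing drill proves nothing about today's deployment.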
Dave Sobel, LogicNow
It's almost shameful for companies like the NYSE, Wall Street Journal and United Airlines to have downtime of any kind. Outages can be expensive -- and with computing resources relatively inexpensive, advance planning can keep downtime to a minimum. Businesses with a critical need can now easily build backup systems that live in the cloud and are used only in an emergency. Windows Azure, for instance, only charges for active computing loads, meaning that an entire backup network can sit in cold standby awaiting a problem. Hot standbys can be set to minimal usage levels as well, ensuring a failover is ready to go. Monitoring and management software, which should always be used, continues to get more advanced, with predictive analytics now able to anticipate downtime.
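The cold-standby pattern described above can be sketched as a monitor that boots the standby site only when the primary fails a health check -- which is why a pay-for-active-compute cloud keeps it cheap. Site names and the health-check logic here are illustrative assumptions, not any vendor's API.

```python
class Site:
    """A grossly simplified model of a data center site."""
    def __init__(self, name, running):
        self.name = name
        self.running = running

    def healthy(self):
        # Stand-in for a real health probe (ping, HTTP check, heartbeat).
        return self.running

def monitor_and_failover(primary, standby):
    """If the primary stops responding, start the cold standby and route to it."""
    if not primary.healthy():
        standby.running = True   # compute charges begin only at this point
        return standby
    return primary

primary = Site("prod-east", running=False)    # simulate an outage
standby = Site("standby-west", running=False) # cold: powered off, unbilled
active = monitor_and_failover(primary, standby)
```

A hot standby differs only in that `standby.running` is already `True` at minimal capacity, trading a small steady cost for faster failover.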
However, communication is the most important part of mitigating the effects of an outage. As a passenger who was actually stranded by the United Airlines outage, I found the most frustrating element of the incident to be the lack of information. Companies have to acknowledge the issue proactively, then underpromise and overdeliver. Silence on social media and a lack of information for employees are among the worst customer service offenses possible.
Jim O'Reilly, Volanto
How can fail-safe systems fail? That may seem like an oxymoron, but United Airlines, the New York Stock Exchange (NYSE) and others all recently went out to lunch. What is wrong with their IT infrastructure?
Evolving complexity is certainly part of the problem. Often, these are old systems that have been patched and extended many times, which leads to vulnerabilities in hardware and software. The United Airlines problem was blamed on a failed router, but that raises the question of how a highly redundant system had a single point of failure.
Problems in communications aren't a United Airlines specialty, of course. The giant of the cloud, Amazon Web Services (AWS), lost a couple of zones a while back when the software on a router was incorrectly updated. Such failures reek of poor operational procedures, a lack of checks and balances in configuring the setup, and complacency -- or worse, laziness -- in figuring out whether reliability is compromised.
The mess at the NYSE, like the AWS zone failure, arose from a bad software update -- in this case to the "matching engines" that connect buy and sell orders.
Despite the blame placed on hardware or software, the real culprit in all of these problems was human error. Failures are to be expected in highly evolved systems, where administrators have to change gears to handle different platforms and application approaches. Poor network topology, untested updates and misapplied updates are all avoidable errors. The question is how to avoid them without generating a monster bill.
Automating operations is the answer to update problems. Anyone who uses Windows is familiar with its update approach: sometimes it's automatic and done in the background, sometimes there are questions to answer, but the bulk of the effort -- including any reconfiguration for the new code -- is handled by the software.
Linux, on the other hand -- in the best traditions of the command-line interface -- often requires administrators to type at phenomenal rates to execute updates. Scripting is considered state of the art, though, of course, the scripts always need tweaks to work correctly.
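The missing ingredient in both the AWS and NYSE updates was a validate-before-commit step: apply the change to a candidate configuration, check it, and roll back automatically if the check fails. A minimal sketch, with a made-up router configuration and an assumed sanity check:

```python
def apply_update(config, patch, validate):
    """Apply patch only if the result passes validation; otherwise roll back.

    Returns (resulting_config, applied_flag). The known-good config is
    kept whenever validation fails -- the check-and-balance step that a
    hand-typed or hastily scripted update skips.
    """
    candidate = {**config, **patch}
    if validate(candidate):
        return candidate, True
    return config, False

good = {"route_table": "v1", "mtu": 1500}      # hypothetical router config
bad_patch = {"mtu": 0}                         # an obviously invalid value
sane = lambda c: c["mtu"] >= 576               # minimal illustrative check

config, applied = apply_update(good, bad_patch, sane)
# applied is False and config is unchanged
```

Real configuration-management tools build exactly this loop (plus staged rollout) into every change, which is why automated updates fail more gracefully than scripted ones.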
Systems with high levels of manual interaction are inherently failure-prone, and the AWS and NYSE events are typical results. United Airlines had a different problem: clearly, a single point of failure existed. Preventing this isn't rocket science; a manual review of the routing structure should have identified that one router could cripple the system. To be fair, manual checks aren't easy when the topology of the application suite and the underlying platforms are always in flux.
This is where software would be valuable in detecting problem points in the system. Companies like Continuity Software address configuration issues, and big data analytics may add further sophistication to such tools.
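One way such software can flag a single point of failure of the United Airlines kind is to treat the network as a graph and find its articulation points -- nodes whose removal disconnects everything else. A sketch using the standard depth-first-search algorithm, over a deliberately fragile hypothetical topology:

```python
def articulation_points(graph):
    """Return nodes whose removal disconnects the graph (DFS low-link method)."""
    disc, low, points = {}, {}, set()
    timer = [0]

    def dfs(u, parent):
        disc[u] = low[u] = timer[0]
        timer[0] += 1
        children = 0
        for v in graph[u]:
            if v == parent:
                continue
            if v in disc:                      # back edge to an ancestor
                low[u] = min(low[u], disc[v])
            else:
                children += 1
                dfs(v, u)
                low[u] = min(low[u], low[v])
                # No path from v's subtree bypasses u: u is a cut point.
                if parent is not None and low[v] >= disc[u]:
                    points.add(u)
        if parent is None and children > 1:    # root with 2+ DFS subtrees
            points.add(u)

    for node in graph:
        if node not in disc:
            dfs(node, None)
    return points

# Hypothetical topology: every path between the sites crosses one router.
topology = {
    "core-router": ["site-a", "site-b"],
    "site-a": ["core-router"],
    "site-b": ["core-router"],
}
```

Here `articulation_points(topology)` flags `core-router` -- exactly the review that, done against the real routing structure, would have caught the United Airlines weak point before it failed.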
Even so, poor application design, especially with less resilient legacy systems, is a problem that will continue to plague us. The answer to this class of issue is the sandbox and rigorous testing.
Can failure-free data centers exist? The answer is that we are far from achieving the ideal, but we can do much better.