When I built weather satellites for the U.S. government, we used a framework called failure mode and effects analysis (FMEA). It analyzes a system's potential failure modes and calculates the anticipated consequences.
But how does this relate to virtualization and storage area network (SAN) infrastructures? Well, today's virtual environments place an exceptionally heavy responsibility on their centralized storage infrastructures. Live Migration and vMotion both require centralized SANs for virtual machines (VMs) to fail over and load-balance.
SANs: The linchpin of virtual infrastructure
Because of this requirement, almost every virtual environment has to implement a SAN infrastructure. But that dependence also increases the adverse effects and costs associated with a SAN failure.
To illustrate my point, draw out the interdependencies between each component in your virtual infrastructure. For each component, draw a line to another component on which it relies. Continue this exercise until you map out the entire dependency tree. (The end result is similar to an FMEA scenario.)
In most virtual environments, that map leads back to the SAN: If it fails, every VM that relies on it goes down with it. That's a terrible situation, and it's not easy to restart completely from scratch. If these resources go offline, the interconnections between your servers and applications will likely require a specific, time-consuming startup procedure.
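The dependency-mapping exercise can also be done in a few lines of code rather than on paper. The following is a minimal sketch, not a real inventory tool: the component names in `DEPENDS_ON` are hypothetical, and it assumes every dependency is a hard one (no failover or redundancy), so it shows the worst case.

```python
from collections import defaultdict, deque

# Hypothetical inventory: each component maps to the components it relies on.
DEPENDS_ON = {
    "web-app": ["vm-web"],
    "database": ["vm-db"],
    "vm-web": ["host-1", "san"],
    "vm-db": ["host-2", "san"],
    "host-1": ["san"],
    "host-2": ["san"],
    "san": [],
}


def impacted_by(failed, depends_on):
    """Return every component that directly or transitively relies on `failed`."""
    # Invert the map: component -> the components that rely on it.
    dependents = defaultdict(set)
    for component, requirements in depends_on.items():
        for requirement in requirements:
            dependents[requirement].add(component)

    # Breadth-first walk outward from the failed component.
    seen = set()
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for dependent in dependents[current]:
            if dependent not in seen:
                seen.add(dependent)
                queue.append(dependent)
    return seen


def rank_by_blast_radius(depends_on):
    """List (impact count, component) pairs, largest impact first."""
    return sorted(
        ((len(impacted_by(c, depends_on)), c) for c in depends_on),
        reverse=True,
    )


if __name__ == "__main__":
    for count, component in rank_by_blast_radius(DEPENDS_ON):
        print(f"{component}: {count} dependent component(s) affected")
```

In this made-up inventory, the SAN sits at the top of the ranking because every host, VM and application traces back to it. Running the same walk against your own component list shows where your environment's single points of failure are.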
Storage vendors recognize this fact. Last year, Hitachi announced 100% storage uptime with its Hitachi High Availability Manager. DataCore's storage virtualization software now advertises 100% uptime at one of its hosting partners. High-end solutions from EMC, Hewlett-Packard and Dell offer zero-downtime options or the assurance of zero downtime during certain SAN operations. Even software-based SAN vendor StarWind Software promises zero downtime through storage replication across an active/active, two-node storage cluster.
You can achieve 100% storage availability through a combination of technologies and techniques. You need multiple levels of redundancy for SAN power, disk drives, storage connections, storage processors and even fully redundant storage nodes (e.g., HP's modular storage solutions). Adding storage replication to secondary on-site and off-site SANs will further protect your data.
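To see why each layer of redundancy matters, it helps to run the standard reliability arithmetic: components that must all be up multiply their availabilities, while redundant copies fail only if every copy fails. The sketch below is illustrative only. The availability figures are hypothetical, not vendor numbers, and it assumes failures are independent, which real power, firmware and human-error failures often are not.

```python
from math import prod

HOURS_PER_YEAR = 24 * 365


def series(*availabilities):
    """Every component must be up, so the availabilities multiply."""
    return prod(availabilities)


def parallel(availability, copies):
    """At least one of `copies` identical, independent components must be up."""
    return 1 - (1 - availability) ** copies


def downtime_hours(availability):
    """Expected hours of downtime per year at the given availability."""
    return (1 - availability) * HOURS_PER_YEAR


if __name__ == "__main__":
    # Hypothetical per-component availabilities: power, storage processor, fabric.
    power, processor, fabric = 0.999, 0.999, 0.995

    single_path = series(power, processor, fabric)
    redundant = series(
        parallel(power, 2),
        parallel(processor, 2),
        parallel(fabric, 2),
    )

    print(f"single path: {single_path:.6f} ({downtime_hours(single_path):.2f} h/yr)")
    print(f"redundant:   {redundant:.6f} ({downtime_hours(redundant):.2f} h/yr)")
```

Under these assumptions, duplicating each layer cuts expected downtime sharply, but the weakest unduplicated component still sets the ceiling. That is the reason to look for redundancy at every level rather than in just one place.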
In the end, how much storage downtime can your virtual environment handle? The answer is not much, if any. Design multiple levels of redundancy, if you can afford it. Also, before a SAN infrastructure purchase, ask your vendor where the infrastructure's weak points are. A year down the road, you don't want a complete SAN failure to take down your entire computing infrastructure.
This was first published in July 2010