
Fail at scale: Build failure into virtual infrastructure design

It's better to be over-prepared than surprised. Design virtual infrastructure around the possibility that it might fail and have a plan in mind for when it does.

Virtual infrastructure design requires IT administrators to predict the possibility of failure and integrate that potential into the planning process.

Virtualization infrastructure design already demands planning around outages. This means separating hosts, placing distributed resource scheduler (DRS) and high availability (HA) rules correctly, and establishing redundant connections to networking and storage fabrics.

Converged infrastructure platforms are among the biggest changes in the modern data center's hardware, offering both advantages and new requirements to virtual infrastructure design. These platforms are a bridge between the density of blade servers and single-host platforms. They combine networking, compute and storage in one frame, which creates a source of vulnerability, as well as convenience.

As with blade servers, it's necessary to prepare for the loss of an entire platform -- not just a blade. This is where it gets more complex. In a converged infrastructure, storage is part of that platform, so the effect of a chassis failure can exceed the compute consequences. Traditional recovery technology works best when storage is outside the main virtualization environment, but with converged platforms, it becomes part of the enclosed infrastructure and presents new failure possibilities.
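One way to reason about this is to list, for each VM, the frame that runs it and the frames that hold copies of its data, and then ask what a single frame failure takes away. The sketch below is only an illustration under that assumption; the `VmPlacement` type, the frame names and the four impact labels are hypothetical and do not come from any vendor tool.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List


@dataclass(frozen=True)
class VmPlacement:
    """Where a VM runs and which frames hold copies of its data (hypothetical model)."""
    compute_frame: str
    storage_frames: FrozenSet[str]


def classify_frame_failure(
    placements: Dict[str, VmPlacement], failed_frame: str
) -> Dict[str, List[str]]:
    """Group VMs by what the loss of one whole frame does to them."""
    impact: Dict[str, List[str]] = {
        "unaffected": [],
        "degraded_storage": [],
        "restart_elsewhere": [],
        "data_unavailable": [],
    }
    for name, placement in sorted(placements.items()):
        surviving_copies = placement.storage_frames - {failed_frame}
        if not surviving_copies:
            # Every copy of the data lived in the failed frame.
            impact["data_unavailable"].append(name)
        elif placement.compute_frame == failed_frame:
            # Compute is gone, but a data copy survives on another frame.
            impact["restart_elsewhere"].append(name)
        elif failed_frame in placement.storage_frames:
            # The VM keeps running but has lost one of its data copies.
            impact["degraded_storage"].append(name)
        else:
            impact["unaffected"].append(name)
    return impact


# Illustrative inventory only; names are made up.
placements = {
    "app-1": VmPlacement("frame-a", frozenset({"frame-a"})),
    "app-2": VmPlacement("frame-a", frozenset({"frame-a", "frame-b"})),
    "app-3": VmPlacement("frame-b", frozenset({"frame-a", "frame-b"})),
    "app-4": VmPlacement("frame-b", frozenset({"frame-b"})),
}
impact = classify_frame_failure(placements, "frame-a")
```

In this made-up inventory, a restart on another host is only an option for app-2. app-1 lost its compute and its only data copy together, which is the case that separates a converged frame failure from a blade chassis failure backed by external storage.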

Admins like to debate whether a single node in the platform will fail or whether the entire frame will. Salespeople will always claim the backplanes are passive and can't fail.

In reality, passive blade frames have caused issues that required a complete shutdown of the frame and all of the blades to get the system back online. If these were virtualization hosts, the impact would be staggering.

Though this is rare, the claim that any technology is impervious to failure doesn't hold up in enterprise environments. Virtual infrastructure design, especially when using converged and hyper-converged infrastructures, requires the understanding that anything can fail.

New challenges for virtual infrastructure design

Converged and hyper-converged infrastructure have a distinct advantage over traditional blades because they offer reduced density. You can take advantage of this to balance hardware density choices with workload placements.

Converged products typically come in densities of one, two or four compute nodes per frame. Power and cooling are important factors because these frames aren't light when it comes to rack power requirements in the data center.

It's also necessary to evaluate not only whether a converged infrastructure frame has enough nodes to spread out the virtual workloads, but also how many frames there are. Though it might be possible to fit everything on several four-node converged platforms, application recovery needs might demand a two-node density per frame, which spreads the same workloads across more frames and gives a wider recovery footprint. All of this depends on storage requirements, as well, so serious examination should focus on balancing storage I/O and capacity with the number of nodes and frames for the applications.
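The trade-off between node density and frame count can be made concrete with some simple arithmetic. The following is a minimal sketch, assuming every node has the same capacity and demand is expressed in whole node-equivalents; the `FrameLayout` type, the function name and the numbers are hypothetical, and storage I/O and capacity are deliberately left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FrameLayout:
    """A converged deployment described only by frame count and node density."""
    frames: int
    nodes_per_frame: int

    @property
    def total_nodes(self) -> int:
        return self.frames * self.nodes_per_frame

    def nodes_after_frame_loss(self, failed_frames: int = 1) -> int:
        """Nodes still available after whole frames (not single nodes) fail."""
        surviving_frames = max(self.frames - failed_frames, 0)
        return surviving_frames * self.nodes_per_frame


def survives_frame_loss(
    layout: FrameLayout, demand_nodes: int, failed_frames: int = 1
) -> bool:
    """True if the remaining nodes can still carry the workload demand."""
    return layout.nodes_after_frame_loss(failed_frames) >= demand_nodes


# Two ways to buy the same 12 nodes (illustrative numbers only).
dense = FrameLayout(frames=3, nodes_per_frame=4)
wide = FrameLayout(frames=6, nodes_per_frame=2)

# Assume the workloads need the equivalent of 9 nodes to run.
demand_nodes = 9
dense_ok = survives_frame_loss(dense, demand_nodes)  # 8 nodes left after one frame fails
wide_ok = survives_frame_loss(wide, demand_nodes)    # 10 nodes left after one frame fails
```

Both layouts have the same total capacity, but under these assumptions only the two-node layout still covers the demand after one frame fails. A real sizing exercise would also have to account for power, cooling and the storage balance described above.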

Many distributed designs, including converged and hyper-converged infrastructures and storage technologies such as VMware vSAN, require a different approach to application placement and virtual infrastructure design. This is complex, but modern application design makes it a little easier than it sounds.

The distributed approach to application design means there's less contention with the monolithic application stack. While the integration of storage removes many of the complexities with connection and fiber and offers lower cost options, it does increase reliance on the proper setup of failure rules.

DRS and HA recovery rules require even more focus after moving to a converged platform. They can't solve all the problems, however, because storage isn't centralized. Admins could move storage to the background, but that isn't efficient, so it comes down to application interaction and placement.
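A rule that keeps two VMs on different hosts can still leave both in the same frame. One way to catch that is to audit placement at the frame level. The sketch below works on plain dictionaries exported from an inventory; it does not call the DRS or HA APIs, and the function name, host names and VM names are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def frames_shared_by_group(
    group: Iterable[str],
    vm_to_host: Dict[str, str],
    host_to_frame: Dict[str, str],
) -> Dict[str, List[str]]:
    """Return {frame: [vms]} for every frame holding more than one VM of the group."""
    vms_by_frame: Dict[str, List[str]] = defaultdict(list)
    for vm in group:
        frame = host_to_frame[vm_to_host[vm]]
        vms_by_frame[frame].append(vm)
    # Keep only frames where a single frame failure would hit several group members.
    return {
        frame: sorted(vms)
        for frame, vms in vms_by_frame.items()
        if len(vms) > 1
    }


# Illustrative inventory only; names are made up.
host_to_frame = {
    "host-a1": "frame-a",
    "host-a2": "frame-a",
    "host-b1": "frame-b",
}
vm_to_host = {
    "web-1": "host-a1",
    "web-2": "host-a2",  # different host than web-1, but the same frame
    "db-1": "host-a1",
    "db-2": "host-b1",
}

web_conflicts = frames_shared_by_group(["web-1", "web-2"], vm_to_host, host_to_frame)
db_conflicts = frames_shared_by_group(["db-1", "db-2"], vm_to_host, host_to_frame)
```

Here the two web VMs sit on separate hosts, so a host-level separation rule would be satisfied, yet both would go down with frame-a. The database pair spans two frames and reports no conflict.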

Admins must understand not only how the application works and how its pieces work with each other, but also what its infrastructure needs are. These needs range from delivery to the Active Directory components used for authorization.

A distributed infrastructure can have a positive effect on cost and complexity, but it comes with some additional challenges. No one wants to talk about the failure of passive parts or what happens in a large-scale failure. Don't avoid the topic because it will supposedly never happen. In IT, the impossible will eventually happen, and while the availability of a perfect resolution is unlikely, a solid virtual infrastructure design that incorporates the possibility of failure can at least point you in the right direction during the recovery process.

This was last published in July 2018
