While storage and networking are both critical to the data center, how we view, deploy and manage them as resources could not be more different. Looking at storage and network resources in a virtual data center, we see opposite ends of the spectrum in how we approach -- and how we abuse -- the overcommitment of these resources.
It's common practice for administrators to overcommit storage through thin provisioning -- a technology that stretches resources and reduces waste, but can also get an administrator into trouble. Overcommitment of network resources, on the other hand, remains relatively low, with most companies preferring to invest in upgrades instead.
Networking is critical to the virtual environment; it's how our data gets in and out. Oddly enough, while we hear about network outages and other issues, we rarely hear about slow network performance caused by overcommitment. Several years ago that was common, with limited bandwidth and excessive network collisions, but today's office network is typically 1 Gigabit Ethernet (GbE) and the data center usually runs 10 GbE to the virtualization hosts. This doesn't mean network congestion can't occur, but the bottleneck typically sits elsewhere in the data center stack.
Most network troubles stem from configuration changes and errors rather than bandwidth shortages. That can change, however, when you run storage over an Ethernet network. Storage is quite capable of saturating a network link. A single drive usually can't cause saturation on its own -- unless it's an SSD -- but a network-attached storage device with several spindles can deliver more than enough data to do so.
With networking we can take advantage of virtual LANs (VLANs) and quality of service (QoS) to prioritize traffic and ensure proper throughput to our critical endpoints. Most of the time this works very well, since much of the traffic on our networks doesn't cause excessive congestion. While storage can greatly affect a network, storage networks are often physically separated from data networks. Traditional 1 GbE networks are simply too slow for storage in virtualization hosts; even local SATA 3 connections offer several times the bandwidth with none of the Ethernet contention concerns. This often forces organizations to separate the storage network or upgrade it to 10 GbE. Because of this, overcommitment of the network remains low. There will always be exceptions, and that is where VLANs, QoS and network traffic shaping come in. Organizations should always monitor bandwidth, but it is often not the limiting factor or the source of the problem.
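To put the "several times the bandwidth" claim in perspective, here is a back-of-the-envelope comparison of nominal link rates. The figures are the advertised line rates only; real-world throughput is lower once you account for protocol overhead (and SATA 3's 8b/10b encoding brings its usable rate closer to 600 MB/s).

```python
# Nominal line rates only -- a rough illustration, not measured throughput.
GBIT = 1_000_000_000  # bits per second

links = {
    "1 GbE": 1 * GBIT,
    "SATA 3": 6 * GBIT,   # 6 Gb/s line rate
    "10 GbE": 10 * GBIT,
}

base = links["1 GbE"]
for name, rate in links.items():
    mb_per_s = rate / 8 / 1_000_000  # bits/s -> MB/s
    print(f"{name}: ~{mb_per_s:.0f} MB/s ({rate / base:.0f}x 1 GbE)")
```

Even with encoding overhead, a single local SATA 3 drive outruns a 1 GbE link several times over, which is why shared storage traffic so often gets its own 10 GbE network.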
Storage is often the most overcommitted resource in your virtualized environment, and not simply because it's the easiest to overcommit. Storage resources are the most abused and, unfortunately, among the most expensive pieces of your infrastructure. No one specifically sets out to waste resources; it's usually the consequence of failing to plan for the future. Too often we hear "we'll grow into it" when VMs are requested. While consumer storage is relatively cheap, enterprise data center storage is an expensive resource, which makes the "just in case" attitude hard to justify.
The starting point for dealing with storage shortages is capacity management. Often, what is requested and what is needed are two very different numbers. This is where thin provisioning comes into play. You can promise the requester 100 GB, but with thin provisioning they may only consume 20 GB. That is a huge saving in capacity and, best of all, the requester is happy, thinking they got what they asked for. Sure, it's a small lie -- and what are the chances they will ever need the full 100 GB anyway? As it turns out, thin provisioning is relatively safe up to a certain level of overcommitment -- around 30% in most cases. Any higher and you start to get nervous that the requesters will call your bluff.
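The capacity math behind that 30% comfort level is simple enough to sketch. The datastore size and per-VM allocations below are hypothetical numbers for illustration only:

```python
# Rough datastore overcommitment check -- all figures are hypothetical.
physical_capacity_gb = 1000                   # usable capacity on the datastore
provisioned_gb = [100, 100, 250, 400, 500]    # thin-provisioned sizes promised to VMs

total_provisioned = sum(provisioned_gb)
overcommit_pct = (total_provisioned - physical_capacity_gb) / physical_capacity_gb * 100

print(f"Provisioned {total_provisioned} GB against {physical_capacity_gb} GB "
      f"physical: {overcommit_pct:.0f}% overcommitted")
if overcommit_pct > 30:
    print("Above the ~30% comfort level -- time to challenge new requests")
```

Here 1,350 GB has been promised against 1,000 GB of physical capacity, a 35% overcommitment -- past the point where a handful of VMs growing into their allocations could exhaust the datastore.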
Thin provisioning -- whether on the storage frame or through the hypervisor -- is often the first choice when overcommitting storage resources. It's fairly easy to do, and in many cases both the VM and the end user are unaware of it. However, this willingness to agree to excessive requests has created a problem of its own. Many administrators don't bother to challenge the requester, and without pushback, requesters will continue to ask for more and more. Administrators who fall into this pattern drift from what was a safe overcommitment level into overcommitting 50% to 60% of their capacity. At that level, the risk is far greater should one, or a handful of, VMs actually reach their provisioned capacity.
Storage overcommitment works best as a capacity adjustment, not as a capacity policy. The requester must be held accountable. Targeting a 30% overall overcommitment level is ideal when individual VMs are only overcommitted by 5% to 10% each. Keeping the level in check spreads out the risk, since several VMs would have to consume everything they were allocated to create a problem. If VMs are each overcommitted by 40% to 50%, however, just one or two of them could eat up a great deal of capacity. Your requesters may not always be happy with a more accurately sized VM, but you can remind them that adding storage capacity later is usually a nondisruptive process.
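The difference between spread-out and concentrated overcommitment is easy to see with numbers. The two hypothetical fleets below promise the same 1,300 GB against 1,000 GB of physical capacity, but in fleet B two VMs hold most of the unused allocation:

```python
# Hypothetical comparison: same total promise, different per-VM spread.
physical_gb = 1000

# Fleet A: ten VMs, unused allocation spread evenly
fleet_a = [{"provisioned": 130, "used": 80} for _ in range(10)]

# Fleet B: two large VMs hold most of the unused allocation
fleet_b = ([{"provisioned": 450, "used": 100} for _ in range(2)]
           + [{"provisioned": 50, "used": 50} for _ in range(8)])

for name, fleet in (("A", fleet_a), ("B", fleet_b)):
    free = physical_gb - sum(vm["used"] for vm in fleet)
    worst = max(vm["provisioned"] - vm["used"] for vm in fleet)
    print(f"Fleet {name}: {free} GB free, largest single-VM claim {worst} GB")
```

In fleet A, no single VM can claim more than 50 GB of the free space; in fleet B, one VM growing into its allocation would consume 350 GB, and two doing so would exceed the free space entirely. Same promise, very different risk.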
This article is the third in a three-part series about overcommit technologies and tips for managing resources in a virtualized data center. Read part one of this series to learn how to track and regulate CPU overcommit. Read part two of this series for strategies to balance memory overcommit and mitigate risk.
The other concern with storage overcommitment is its effect on I/O performance. Capacity is usually the first worry, but you can just as easily overload a single LUN serving several VMs with high I/O demands. Each hypervisor has built-in tools to address this problem, and many third-party tools can help monitor storage queue issues. With proper monitoring, you can normally identify and prepare for storage I/O demands.
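The same kind of quick check used for capacity works for I/O. A minimal sketch, assuming hypothetical VM names, peak IOPS figures and a made-up LUN capability, flags a LUN whose tenants can collectively outrun it:

```python
# Hypothetical check: will the VMs sharing a LUN exceed its sustainable IOPS?
lun_capable_iops = 5000   # assumed backend capability -- measure yours

vm_peak_iops = {"db01": 2200, "web01": 600, "web02": 600, "batch01": 2400}

total_demand = sum(vm_peak_iops.values())
print(f"Peak demand {total_demand} IOPS vs {lun_capable_iops} IOPS capable")
if total_demand > lun_capable_iops:
    # List migration candidates, heaviest consumer first
    for vm, iops in sorted(vm_peak_iops.items(), key=lambda kv: -kv[1]):
        print(f"  consider moving {vm} ({iops} IOPS) to another LUN")
```

Peak demands rarely all land at once, so a total above the LUN's capability is a signal to watch queue depths and latency, not necessarily an emergency.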
Overcommitment is not something to be avoided -- it's an important part of the modern virtualized data center. We need to embrace it, be aware of its possible impact and manage it properly. Yet modern overcommitment technologies can be so easy to apply that we overlook the root issue of right-sizing VMs in the first place.
Instead, overcommitment should be used in concert with correct VM sizing practices as a way to make our infrastructure more efficient. If we get complacent in how we provision our VMs, we will open ourselves up to failures from which we cannot recover or promises we cannot keep.