Every system has its limits. Some element of a system inevitably becomes the persistent bottleneck, the limiting factor preventing an organization from making full use of its IT capabilities. For decades, the storage subsystem has been that bottleneck in the overwhelming majority of virtualization deployments.
Virtualization's key resources are CPU, RAM, disk and networking. With the exception of some fringe workloads, CPUs stopped being the bottleneck about a decade ago. RAM is a question of economics: servers usually support more RAM than anyone actually installs, because the sweet-spot module sizes are half the maximum supported sizes -- and you'll never get the disk subsystem fast enough to want the maximum amount of RAM anyway.
Depending on how your disk is attached to your virtualization cluster, networking can be part of the storage bottleneck problem. Or it can be as idle as the CPUs, big networking pipes waiting for something to do.
More often than not, however, it's the storage subsystem holding up everything. Solving storage I/O bottlenecks is a complicated process that depends on factors ranging from physical infrastructure choices to configuration changes. Let's look at how to identify the issues in the storage subsystem and how they can be overcome.
Before you can solve I/O problems in your storage subsystem, you first need to find them. Tuning VMs to consume more storage performance, for example, won't help if the real problem is that you can't deliver adequate performance to the virtual hosts in the first place. Similarly, there's no point combing through the infrastructure if a simple tweak to a VM's config file solves the problem.
Benchmarking is a great way to find out where the problem lies. Benchmarks can be run with the same configuration on physical hosts as well as VMs, and on VMs that run on different virtual hosts. This ability to generate identical workloads with identical I/O profiles allows you to locate the problem.
To use benchmarks to solve your storage I/O difficulties, you need to establish baseline measurements of how storage performs under normal conditions. Since determining what's normal is a fairly subjective process, you'll want to keep in mind a few principles. First, do you have storage troubles in one place but not another? If so, you'll want to take baseline readings where there are problems and where there aren't.
For example, you might have two storage systems, one slow and one not. Alternately, you may have a single storage system where VMs are slow on one host but not on another. Run benchmarks throughout the day, while the other VMs are doing their various tasks, to get a good picture of how things currently behave.
Always record your benchmark runs. What configuration did you use? How many IOPS did you get? What was the throughput? What were the peak and average latencies?
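One way to keep those records consistent is a small script that appends each run to a CSV log. This is a minimal sketch; the field names and the sample figures are illustrative, not output from any particular benchmark tool.

```python
import csv
import os

# Illustrative field set: the configuration plus the headline results worth keeping.
FIELDS = ["config", "threads", "read_pct", "iops",
          "throughput_mbps", "avg_latency_ms", "peak_latency_ms"]

def record_run(path, run):
    """Append one benchmark run to a CSV log, writing a header on first use."""
    new_file = not os.path.exists(path)
    with open(path, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if new_file:
            writer.writeheader()
        writer.writerow(run)

# Made-up sample results, not real measurements.
record_run("bench_log.csv", {
    "config": "4-thread full random", "threads": 4, "read_pct": 50,
    "iops": 18500, "throughput_mbps": 72.3,
    "avg_latency_ms": 1.8, "peak_latency_ms": 41.0,
})
```

A flat file like this is enough to compare runs taken weeks apart, which matters when you're trying to establish what "normal" looks like.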
Vary your measuring. A four-thread 50% read/50% write full random benchmark is a great middle-of-the-road benchmark for testing generic performance. But the exact mix of read and write will vary dramatically from workload to workload and organization to organization. If your storage product has any means of reporting to you what your read/write ratio is, it might be a good idea to run benchmarks with that ratio as well.
Try some benchmarks at 100% read and some at 100% write. If your storage problems are rooted in one side of the storage balance failing to deliver the required performance, then benchmarks tilted to extremes are going to make that apparent. It may be necessary to make multiple virtual disks available to the benchmarking VM. You may also have to implement multiple benchmarking VMs.
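If your storage product or hypervisor exposes cumulative read and write operation counters, you can derive the mix to benchmark directly from them. A minimal sketch, assuming the counters are simple running totals (the figures below are made up):

```python
def rw_mix(reads, writes):
    """Return the workload's read percentage, rounded to the nearest 5%
    (a convenient granularity for benchmark mix settings)."""
    pct = 100 * reads / (reads + writes)
    return int(round(pct / 5) * 5)

# Hypothetical cumulative operation counters from an array stats report.
observed = rw_mix(820_000, 410_000)
print(observed)  # a roughly 2:1 read-heavy workload rounds to a 65/35 mix

# Test the observed ratio plus both extremes, per the advice above.
mixes_to_test = [observed, 100, 0]
```

Running the same benchmark at the observed mix and at both extremes brackets the real workload and exposes whichever side of the read/write balance is underperforming.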
Once you've done your basic benchmarking, it's time to perform background work tests. Use Iometer, an essential I/O measurement tool, to determine your peak IOPS under various conditions -- 100% read, 100% write and 50% of each. Load the storage system to 25%, 33%, 50% and 75% of IOPS capacity. Now run various common administrative tasks and time them.
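Turning a measured peak into those background-load targets is simple arithmetic; the peak figure below is an assumed example, and Iometer itself would then be configured to drive those rates.

```python
peak_iops = 24_000  # assumed peak from the 100% read/write/mixed runs

# Background-load targets at the fractions suggested above.
targets = {f"{int(frac * 100)}%": round(peak_iops * frac)
           for frac in (0.25, 0.33, 0.50, 0.75)}
print(targets)
```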
The first task is a full VM backup using your backup software. Then attempt snapshot and clone operations. Try to create a VM from a template. Next, run multiple workload-specific benchmark tests, such as Exchange Server Jetstress and similar programs, at the same time. This simulates multiple simultaneous workloads operating while the system is being stressed by Iometer.
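Timing those administrative tasks needs nothing more than a stopwatch helper. In this sketch the `sleep` calls are stand-ins for the real backup, snapshot and clone jobs; only the timing pattern is the point.

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(label, results):
    """Record the wall-clock duration of one administrative task."""
    start = time.perf_counter()
    yield
    results[label] = time.perf_counter() - start

results = {}
with timed("snapshot", results):
    time.sleep(0.05)  # stand-in for the real snapshot operation
with timed("clone", results):
    time.sleep(0.10)  # stand-in for the real clone operation

# Report slowest first -- these are the numbers to compare across load levels.
for label, seconds in sorted(results.items(), key=lambda kv: -kv[1]):
    print(f"{label}: {seconds:.2f}s")
```

Repeat the timings at each background-load level; how sharply task durations degrade as Iometer load rises tells you how much headroom the array really has.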
Analyzing benchmarking results
Making sense of all the numbers obtained is a challenge, but it doesn't need to be overwhelming. Sometimes the numbers tell the tale quite plainly. Do you have reasonable read IOPS but abysmal write IOPS? This could be because write caching is not turned on at some layer of your storage -- and you should probably turn that on, assuming you have the battery backup technology for that to make sense.
You might try adding a flash tier, if your storage product allows it. Alternately, if you are using some form of networked storage, you could try implementing server-side caching; this uses flash on the virtual hosts to absorb writes during peak times, and drains them to the storage array during lulls in activity.
Are you seeing abysmal numbers on both the read and write sides of the equation? This could be because the storage product is simply not up to the job, or because there is a bottleneck between the virtual hosts and the storage.
If your storage is network-based, and poor storage performance coincides with the networking being saturated, then there's a good chance the problem is the network card. If you're sure the array should provide more oomph than you're getting, but you aren't seeing saturation of the network card, the problem might be either with an intermediate link between the virtual host and the storage or in the network card driver.
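A rough sanity check for the saturation question is to compare observed storage throughput against the usable line rate of the NIC. The link speed and the ~10% protocol-overhead figure below are assumptions for illustration; real iSCSI or NFS overhead depends on framing, MTU and so on.

```python
def link_utilization(observed_mb_s, link_gbit_s, protocol_overhead=0.10):
    """Fraction of usable line rate consumed, assuming ~10% protocol
    overhead (an assumption -- tune it for your own storage protocol)."""
    usable_mb_s = link_gbit_s * 1000 / 8 * (1 - protocol_overhead)
    return observed_mb_s / usable_mb_s

# Example: 420 MB/s of storage traffic over an assumed 10GbE link.
util = link_utilization(observed_mb_s=420, link_gbit_s=10)
print(f"{util:.0%}")
if util < 0.7:
    print("link not saturated -- suspect the array, an intermediate hop, or the driver")
```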
Do not underestimate the importance of latency in your analysis. It is entirely possible to see fantastic IOPS and throughput, but still see terrible application performance, because the applications in question are strongly sensitive to latency. For many applications, latency matters more than any other dimension of storage performance.
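Averages can hide exactly the tail behavior those applications feel. This sketch, using made-up latency samples, shows how a mean under 5 ms can coexist with a 99th percentile two orders of magnitude worse.

```python
import statistics

def percentile(samples, pct):
    """Nearest-rank percentile: the value at rank ceil(pct/100 * n)."""
    ordered = sorted(samples)
    rank = -(-pct * len(ordered) // 100)  # ceiling division
    return ordered[rank - 1]

# Fabricated sample: mostly fast I/Os plus a handful of long stalls.
latencies_ms = [1.0] * 985 + [250.0] * 15

mean = statistics.mean(latencies_ms)
tail = percentile(latencies_ms, 99)
print(f"mean={mean:.1f}ms p99={tail:.1f}ms")
```

If your benchmark tool reports percentile latencies directly, prefer those; the point is to judge the tail, not just the average.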
Don't think of benchmark analysis as gospel that contains great truths. It won't reveal all ills. It's more akin to testing each link in a chain meticulously and in sequence. Benchmark analysis often leads to running new tests with different parameters. No guide can tell you exactly what to do, nor can it provide analysis for every eventuality. Study the available data and the design of your virtualization infrastructure. When you see something that doesn't make sense, then that's the place to start refining the tests.
When it comes to identifying and eliminating bottlenecks in your storage subsystem, benchmarking and analyzing the results are just the first step. Once you have enough information to address the problem head-on, you'll need a plan to resolve whatever configuration problems or storage hardware shortfalls you've uncovered.