Developing a strategy to identify and fix VM performance issues

Many factors can hamper VM performance. A careful troubleshooting process can help IT administrators identify and correct VM performance issues.

VM performance issues are always an important consideration. Resource bottlenecks can occur unexpectedly. Physical resource demands vary in a virtualized environment, and the environment's setup and even an application's suitability for virtualization can all affect performance under real load conditions. This article provides a series of guidelines to help IT administrators track down and resolve possible causes of poor virtual machine behavior.

Start with a sanity check

The challenge with virtual machine (VM) performance monitoring and evaluation is that the difference between "good" and "bad" VM performance is often subjective. In the absence of detailed benchmark data, a technician might perceive VM performance issues and wind up investing countless hours chasing a problem that doesn't actually exist. Before you make the decision to tweak performance, understand what the performance should be or have a baseline for comparison.

This means benchmarking. For example, a physical workload typically performs just slightly better than a virtualized workload because the hypervisor layer adds a small amount of computing overhead. Running a benchmark on the physical workload before a P2V migration is a perfect way to draw a baseline. If the workload clearly demonstrates performance issues when a new post-migration benchmark is performed, then it is easier to narrow the potential range of problems.

Similarly, the initial benchmark drawn on a VM may reflect acceptable performance. If the VM experiences performance problems later on, a later benchmark on the same VM might show a clear degradation in a particular subsystem or resource that can help IT personnel quantify the problem and identify a plan of corrective action without time-consuming and disruptive troubleshooting.
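
The baseline comparison described above can be sketched in a few lines of Python. This is only an illustration: the metric names, the scores and the 10% tolerance are assumptions, and the function name find_regressions is hypothetical rather than part of any benchmarking tool.

```python
def find_regressions(baseline, current, tolerance=0.10):
    """Return metrics whose score fell by more than `tolerance` versus baseline.

    Both arguments map a metric name to a score where higher is better.
    The result maps each regressed metric to its percentage change.
    """
    regressions = {}
    for metric, base_score in baseline.items():
        if metric not in current or base_score <= 0:
            continue  # nothing meaningful to compare against
        change = (current[metric] - base_score) / base_score
        if change < -tolerance:
            regressions[metric] = round(change * 100, 1)
    return regressions


# Hypothetical scores from two runs of the same benchmark on the same workload.
baseline = {"cpu_ops": 1000.0, "memory_mbps": 8000.0, "disk_iops": 5000.0}
current = {"cpu_ops": 980.0, "memory_mbps": 7900.0, "disk_iops": 3500.0}

print(find_regressions(baseline, current))
```

With these made-up numbers, only the disk metric falls outside the tolerance, which points troubleshooting toward storage rather than CPU or memory.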

Get the platform up-to-date

Hardware is driven by software, and you'd be shocked at how much better a hardware system can work when all of the software elements work properly together. All too often, an operating system (OS) patch or update applied to the server or VM, a hypervisor patch or update applied to the server, or even an application update introduces changes. Any of these might affect the workload's ultimate performance.

And it's not just new software updates -- problems can also occur when certain updates or patches are missed. For example, a seemingly routine hypervisor update may require that the host OS be patched to a certain level. If the OS patch is overlooked, the hypervisor update may actually cause stability, performance or other problems that could be mistaken for hardware faults and trigger needless hardware troubleshooting.

This is the principal reason why software updates are first tested in a lab setting before being rolled out to a production environment. Benchmark the system and workloads before making any changes, and benchmark the system again after making any changes, so that the results can be compared directly. If there is a performance problem, you'll know that it's related to the software update that was just applied, and it should be a simple matter for well-prepared administrators to roll back the update and resolve the problem until the issue can be studied.

And never allow antivirus software on the host to scan VM files -- scanning them is very likely to create VM performance issues. Instead, antivirus software should be installed in each VM to scan only that VM.

Check resource allocation

Too many organizations create new VMs by allocating arbitrary amounts of computing resources that provide too much or too little memory, processor cycles, I/O and so on. Over-allocating resources rarely harms the VM itself, but the excess resources are wasted, which limits the consolidation potential of the server. Under-allocating VM resources is a more serious problem that is very likely to create VM performance issues. For example, if the VM does not have enough memory, it will rely on much slower disk swap files to make up any shortfall.

Organizations typically start by allocating computing resources based on the recommended system requirements for each application -- plus an additional 10% to 20% to accommodate variations in resource demands over time.
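
As a rough sketch of that sizing rule, the following Python applies a percentage of headroom to a set of recommended requirements. The resource names and values are invented for illustration, and size_with_headroom is a hypothetical helper, not a feature of any hypervisor.

```python
def size_with_headroom(recommended, headroom_pct=20):
    """Add `headroom_pct` percent to each recommended value, rounding up.

    `recommended` maps a resource name to a whole-number requirement.
    Integer arithmetic avoids floating-point rounding surprises.
    """
    if not 0 <= headroom_pct <= 100:
        raise ValueError("headroom_pct must be between 0 and 100")
    return {
        name: (value * (100 + headroom_pct) + 99) // 100  # ceiling division
        for name, value in recommended.items()
    }


# Hypothetical recommended requirements taken from an application's documentation.
recommended = {"memory_mb": 4096, "disk_gb": 40}

print(size_with_headroom(recommended, headroom_pct=10))
print(size_with_headroom(recommended, headroom_pct=20))
```

The result is only a starting point; later benchmarks should confirm whether the allocation is actually adequate.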

Resources can usually be verified by benchmarking the VM and studying the amount of free resources or the percentage of utilization for each resource. Benchmark results that report little (or no) resource availability may be a good indication of a resource shortage. For example, if the CPU utilization for a VM routinely tops out into the 90% to 100% range, the VM may need additional processor cycles. It's a similar story for memory.
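
One way to separate a routine shortage from a brief spike is to look at how many utilization samples sit at or above a threshold. The sketch below assumes the samples have already been collected by a monitoring tool; the 90% threshold comes from the example above, while the 80% "routinely" fraction and the function name are assumptions.

```python
def is_saturated(samples, threshold=90.0, min_fraction=0.8):
    """Return True if at least `min_fraction` of the samples are >= threshold.

    `samples` is a sequence of utilization percentages (0 to 100).
    """
    if not samples:
        return False
    high = sum(1 for sample in samples if sample >= threshold)
    return high / len(samples) >= min_fraction


# Hypothetical CPU utilization samples for two VMs.
busy_vm = [95, 97, 92, 99, 91]
spiky_vm = [95, 40, 30, 20, 92]

print(is_saturated(busy_vm))
print(is_saturated(spiky_vm))
```

The same check could be applied to memory utilization samples.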

When a potential resource shortage is identified, administrators can allocate additional resources to the VM and re-benchmark the VM to quantify any changes in performance.

Remember that the resource requirements for many VMs can change over time. For example, an email server may perform adequately today, but more users and increased email traffic volumes into the future can eventually result in performance problems. Every VM should be benchmarked again and assessed periodically as part of a capacity-planning exercise. This can help administrators adjust computing resources preemptively to prevent performance problems from affecting the user experience.

Check the processor setup

Review the processor setup in the server BIOS to ensure that any relevant virtualization options are enabled and power management features are disabled, if necessary.

For example, although processor virtualization capabilities are almost certainly enabled by default, advanced virtualization features like I/O virtualization (e.g., Intel VT-d) may not be fully enabled -- or may be enabled improperly. Verify the server's virtualization capabilities and enable only the virtualization features that correspond to the server's design.
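
On a Linux-based host, one quick check is whether the processor reports its hardware virtualization flag: vmx for Intel VT-x or svm for AMD-V. The following sketch assumes a Linux system that exposes /proc/cpuinfo, and the function name is hypothetical. It does not cover I/O virtualization features such as VT-d, which still need to be confirmed in the BIOS setup or hypervisor tools.

```python
def virtualization_flags(cpuinfo_text):
    """Return the CPU virtualization flags (vmx or svm) found in cpuinfo text."""
    found = set()
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags") and ":" in line:
            flags = line.split(":", 1)[1].split()
            found.update(flag for flag in flags if flag in ("vmx", "svm"))
    return found


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as cpuinfo:
            flags = virtualization_flags(cpuinfo.read())
        print(flags or "no vmx/svm flag reported")
    except OSError:
        print("/proc/cpuinfo is not available on this system")
```

A missing flag on the host usually means the feature is disabled in the BIOS or not supported by the processor. Inside a guest, the result depends on whether the hypervisor exposes the flag to VMs.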

Also be wary of processor power-management capabilities. For example, processors should generally not be allowed to drop a virtualized server into "standby" or "hibernation" mode. Although this should work in principle, the integration between workloads, hypervisors, OSes and server hardware is not always precise enough to handle shifting power modes (especially recovery from those modes) properly. As a result, workloads can suffer performance problems after the server returns from a power-saving mode.

Reboot the server into its BIOS setup and check the power conservation modes. Then disable any power conservation modes that might disrupt VM operation, document your changes and restart the system. Compare performance results over time to determine if performance has improved.

Common storage issues

The server's local disk storage can be another source of performance problems because workloads often must wait for storage reads/writes. Delays in the disk subsystem, therefore, can cause delays in VM responsiveness. (Network storage issues are a separate matter and can appear as network issues.) Local storage problems are almost always attributable to inefficiencies in the way that limited disk units are employed on the server.

Snapshots are one important example. Anytime that a snapshot is created or read, disk I/O increases because multiple files must be opened and processed. It is often better to store snapshots on a separate disk or virtual disk; another independent spindle (disk) available in the server can alleviate I/O stress that can slow storage performance.

Swap files are another source of problems, much for the same reason as snapshots. Swap files use disk space as supplemental memory and cause additional (sometimes significant) disk I/O, which can slow workload performance. One option is to eliminate the use of swap files by allocating more physical memory to the VM. If swap files are unavoidable, try moving the swap files to a disk that is separate from the one housing the OS.
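
On a Linux system, the amount of swap currently in use can be read from /proc/meminfo. The sketch below assumes that file's usual "Name: value kB" layout; the sample figures are invented and the function name is hypothetical. Nonzero swap use does not prove that swapping is the cause of a slowdown, but it is a reasonable prompt to review the memory allocation.

```python
def swap_used_kb(meminfo_text):
    """Return swap in use (SwapTotal minus SwapFree) in kB from meminfo text."""
    fields = {}
    for line in meminfo_text.splitlines():
        name, separator, rest = line.partition(":")
        parts = rest.split()
        if separator and parts:
            fields[name.strip()] = int(parts[0])
    return fields.get("SwapTotal", 0) - fields.get("SwapFree", 0)


# Hypothetical excerpt in the /proc/meminfo layout.
sample = (
    "MemTotal:       16384000 kB\n"
    "SwapTotal:       2097148 kB\n"
    "SwapFree:        1572860 kB\n"
)

print(swap_used_kb(sample), "kB of swap in use")
```

To check a live system, pass the contents of /proc/meminfo to the function instead of the sample text.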

Consider hardware problems in the disk or controller. For example, bad sectors and excess file fragmentation can cause extra disk work, which slows apparent storage performance. Defragment the disk and check for bad sectors. The server may include specific diagnostics for this purpose. In addition, it may be necessary to upgrade the local storage subsystem to use smaller high-performance disks (such as 2.5 inch, 15K RPM SAS disks), or add multiple disks for reconfiguration into RAID groups, where multiple spindles can enhance storage I/O.

Look for network contention

Network problems are common on servers where multiple virtualized workloads vie for limited bandwidth and port availability. For example, network contention might make it take longer to copy a file to another network node than to copy the same file to another location on the server. Similarly, network-intensive workloads on the server like Web servers or transactional database servers can demand substantial bandwidth. Benchmarks can typically confirm network performance issues by reporting on network bandwidth utilization and timing. There are several tactics to address network contention.

First, investigate the setup of the server's physical network adapter and verify that the NIC is configured for optimum speed. For example, a gigabit NIC that has inadvertently been configured for 10/100 megabit operation cannot deliver optimum throughput. Also check the NIC drivers and firmware and verify that both are recommended for the OS and hypervisor versions on the server; if not, firmware or driver updates may be required. And don't overlook the influence of firewalls, load balancing software or intrusive network-monitoring tools on the server. Savvy technicians may choose to disable that software temporarily and look for any change in performance.
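
On a Linux-based host, the negotiated link speed of a physical adapter is typically exposed in sysfs as /sys/class/net/<interface>/speed, in megabits per second. The following sketch reads that value and flags an adapter running below its expected speed. The interface name eth0 and the 1,000 Mbps expectation are assumptions, and both function names are hypothetical.

```python
from pathlib import Path


def link_speed_mbps(interface, sysfs_root="/sys/class/net"):
    """Return the negotiated link speed in Mb/s, or None if it is unavailable."""
    speed_file = Path(sysfs_root) / interface / "speed"
    try:
        value = int(speed_file.read_text().strip())
    except (OSError, ValueError):
        return None  # missing interface, link down, or unreadable value
    return value if value > 0 else None  # some drivers report -1 for unknown


def is_below_expected(interface, expected_mbps, sysfs_root="/sys/class/net"):
    """Return True only when a speed is reported and it is below expectations."""
    speed = link_speed_mbps(interface, sysfs_root)
    return speed is not None and speed < expected_mbps


if __name__ == "__main__":
    # "eth0" and 1000 Mb/s are placeholders; substitute the real adapter and rating.
    print(link_speed_mbps("eth0"), is_below_expected("eth0", 1000))
```

A gigabit adapter that reports 100 here would match the misconfiguration described above and is worth checking against the switch port settings.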

Second, IT professionals can often alleviate network contention by rebalancing VMs across different servers or by installing additional network interface card ports using a PCIe adapter with offload capabilities. Once additional network ports are available on the server, workloads can be reconfigured to use specific ports, or ports may be trunked for enhanced bandwidth.

Workload performance was often an afterthought in physical server deployment. As long as the server met or exceeded the application's recommended system requirements, performance was almost never a worry. Virtualized servers hosting consolidated workloads are very different, and workloads often contend for limited resources that can easily be improperly allocated within a data center architecture that is poorly designed for the level of consolidation in use. Savvy IT professionals must be sensitive to VM performance and have an approach to locate and resolve VM performance issues when they occur. 
