
Simplify the VM troubleshooting process with these basic tips

Tools such as Windows Performance Monitor, along with a close look at memory, CPU and disk usage metrics, can reduce the stress of the VM troubleshooting process.

It can be difficult to troubleshoot VMs, especially if there doesn't appear to be an immediate issue. There are, however, some basic steps you can take to make the VM troubleshooting process easier.

First, find out whether the issue affects several VMs or just one. If it affects multiple VMs, there's likely a problem with the underlying hardware or hypervisor configuration. If it affects only one VM, it's likely a localized issue or an issue with the machine's resources.

I find that the best place to start your search for potential issues is Windows log files. Although it sounds obvious, many people neglect to start here and miss out on assistance with VM troubleshooting. Some complain that the logs contain too much information, but I think an administrator can never have too many logs. Investigate any issues you come across.

If there are no glaring problems with the logs, move on to the machine's resources. There are three areas to investigate: memory, CPU and disk. You can investigate all of them with the built-in Windows Performance Monitor tool. Start Performance Monitor (PerfMon for short) from the command line by typing perfmon, then use the pane on the left to open the monitor view and add the counters you want to track.

Figure: Windows Performance Monitor. Check performance levels with the PerfMon tool.
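
If you would rather capture the same counters from a script than from the PerfMon window, one option is typeperf, the command-line tool that ships with Windows and reads the same performance counters. This is a swapped-in approach rather than something the steps above require. The sketch below only builds the command. The counter paths are standard Windows counters, while the function name, the 5-second interval and the 12-sample count are assumptions chosen for illustration.

```python
# Standard Windows performance counter paths for the three areas
# discussed in this article: memory, CPU and disk.
COUNTERS = [
    r"\Memory\Available MBytes",
    r"\Memory\Pages/sec",
    r"\Processor(_Total)\% Processor Time",
    r"\PhysicalDisk(_Total)\Avg. Disk Queue Length",
]


def build_typeperf_command(counters, interval_s=5, samples=12, out_csv=None):
    """Build an argument list for typeperf (hypothetical helper).

    interval_s and samples are illustrative defaults, not recommendations.
    -si sets the sample interval in seconds, -sc the number of samples.
    """
    cmd = ["typeperf", *counters, "-si", str(interval_s), "-sc", str(samples)]
    if out_csv is not None:
        # -f selects the output format and -o the output file.
        cmd += ["-f", "CSV", "-o", out_csv]
    return cmd
```

On the Windows guest, the resulting list can be handed to subprocess.run(cmd, check=True). Keeping it as a list rather than a single string avoids quoting problems with counter names that contain spaces.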

Memory metrics

RAM is usually the first resource to become scarce in a virtual environment. When a VM runs out of memory, it starts to execute memory management techniques, which can negatively affect performance. The most common form of memory management is page swapping. Think of memory as a set of pages; the computer uses these memory pages during normal operation.

You run into trouble when the computer (or, in this case, the VM) wants to use more pages than are available. Page swapping writes memory pages that aren't heavily used out to disk and keeps a pointer to their location so they can be brought back later. The freed pages are then filled with the data that needs to be in RAM, which is either read from disk or newly created.

A page fault occurs when a VM requests a memory page that isn't readily available. There are several varieties of page faults, but the most common are hard and soft faults. A hard fault occurs when the requested page is no longer in RAM and has to be read back from disk, which often means other pages must first be swapped out to make room. Because it involves disk I/O, a hard fault can be detrimental to performance. A soft page fault occurs when the page is still in memory but isn't currently mapped to the process that requested it. Since all soft fault operations take place in RAM, they don't have a significant effect on performance.

A high rate of hard page faults sustained over a long period indicates that the VM does not have access to enough RAM. Allocate more virtual RAM and performance should improve.
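
To separate a sustained problem from a brief spike, you can apply a simple rule to a series of samples, for example values of the \Memory\Pages/sec counter collected over several minutes. The following is a minimal sketch: the function name, the threshold of 50 and the 80% fraction are assumptions for illustration, so tune them against your own baseline.

```python
def sustained_hard_faults(samples, threshold=50.0, min_fraction=0.8):
    """Return True when most samples exceed the threshold.

    samples: hard-fault-related readings, e.g. Pages/sec values.
    threshold and min_fraction are illustrative assumptions, not
    vendor guidance.
    """
    if not samples:
        return False
    over = sum(1 for value in samples if value > threshold)
    return over / len(samples) >= min_fraction


# A single burst does not count as sustained pressure...
spike = sustained_hard_faults([0, 0, 300, 0, 0])
# ...but readings that stay high across the whole window do.
pressure = sustained_hard_faults([120, 90, 200, 150, 110])
```

Here spike is False and pressure is True, which matches the distinction drawn above between occasional faults and a VM that is short on RAM.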

CPU metrics

Another potential problem area is CPU utilization, though this is a bit easier to diagnose. The VM has a number of virtual CPUs that the hypervisor schedules onto the physical CPUs with a time-sharing mechanism. With most hypervisors, each VM receives a number of shares by default, which gives every VM an equal claim on CPU time. You can change these share values to give a VM higher or lower priority. If a VM's share value is too small relative to the others, it won't be scheduled on the CPU often when the host is busy. The time a VM spends waiting for access to CPU resources is called CPU wait time. VMware's term for a similar measure is CPU Ready Time.
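
The effect of share values is easier to see with numbers. The sketch below assumes every VM is competing for CPU at the same moment and that the scheduler divides time strictly in proportion to shares. Real schedulers also account for reservations, limits and idle VMs, so treat this as an approximation. The function name, VM names and share values are made up for illustration.

```python
def cpu_entitlement(shares):
    """Map each VM to its fraction of CPU time under full contention.

    shares: dict of VM name -> share value (hypothetical numbers).
    Assumes purely proportional scheduling with every VM demanding CPU.
    """
    total = sum(shares.values())
    if total <= 0:
        raise ValueError("total shares must be positive")
    return {vm: value / total for vm, value in shares.items()}


# Doubling one VM's shares doubles its claim relative to the others.
fractions = cpu_entitlement({"web": 1000, "db": 2000, "batch": 1000})
```

With these values, db is entitled to half of the contended CPU time and the other two VMs to a quarter each. A VM with a very small share value ends up with a correspondingly small fraction, which is when CPU wait time starts to climb.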

If several of your VMs exhibit poor performance, check the hypervisor's memory and CPU resources. Many hypervisors have memory management techniques that involve swapping out memory to disk when resources are in short supply. The VMs aren't aware of this swapping and will carry on, business as usual. Swapping RAM at the hypervisor level is best avoided, however, because it degrades performance significantly. To check physical CPU performance at the hypervisor level, use tools such as esxtop in a VMware environment.

Disk-usage metrics

Disk use sometimes causes problems because it's a shared resource. This means that, at times, the disk's resources will be in contention. You can identify problems with disk usage with disk queue depth. As the name suggests, disk queue depth is the number of items in the disk queue waiting for the storage system to process them -- the deeper the queue, the lower the performance level.

As with all other metrics, be sure to review disk queue depth periodically. One of the most important parts of VM troubleshooting is to be familiar with the day-to-day behavior of the environment, and disk queue depth is no different. What was the disk queue depth yesterday? What is it today? Why is it different today?
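
The yesterday-versus-today comparison can be sketched in a few lines. This assumes you already have two lists of queue depth samples, for example exported from PerfMon. The function name and the 1.5x alert ratio are arbitrary assumptions for illustration, not a recommended threshold.

```python
from statistics import mean


def queue_depth_change(baseline, current, ratio_alert=1.5):
    """Compare average disk queue depth against a baseline period.

    baseline, current: non-empty lists of queue depth samples.
    ratio_alert is an illustrative assumption; pick a value that
    fits your own environment.
    """
    base_avg = mean(baseline)
    cur_avg = mean(current)
    if base_avg > 0:
        ratio = cur_avg / base_avg
    else:
        # No queueing in the baseline: any queueing now is a change.
        ratio = float("inf") if cur_avg > 0 else 1.0
    return {
        "baseline_avg": base_avg,
        "current_avg": cur_avg,
        "ratio": ratio,
        "alert": ratio >= ratio_alert,
    }


report = queue_depth_change([1, 2, 1, 2], [3, 4, 3, 2])
```

In this example the average queue depth doubles from 1.5 to 3, so the report raises the alert flag. The flag only tells you that something changed; finding out why is still the troubleshooting work described here.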

A consistently high value coupled with poor performance indicates that the storage subsystem is unable to keep up with the demands of the infrastructure. This leaves you with essentially one option: reduce the overall load on the disk system. Free tools, such as Process Explorer, make the VM troubleshooting process easier by helping you understand which application generates the most I/O requests.
