It's easy for IT administrators to get frustrated when a VM is frozen, but with an established troubleshooting process that involves isolating the scope of the issue and checking common problem areas, admins will have everything up and running in no time.
A VM can appear to lock up, freeze or otherwise stop responding. Such problems can take numerous forms and can exhibit several symptoms.
For example, VM tasks might refuse to start, time out or fail in progress. The guest OS might also be unresponsive, refusing to respond to input or network activity, or the user interface or console for the VM might not update or refresh. Admins typically can't stop, restart or even power cycle the VM, and they might need to take more drastic action, such as killing the VM process via the hypervisor interface.
The challenge when a VM is frozen is that the cause can be rooted in several different areas, such as the guest OS, resource contention on the host server, or even problems outside the host in storage or the network. Before taking any action, first narrow the problem scope down to determine what is or isn't functioning.
If the problem appears to affect multiple host servers, with many unresponsive VMs spread across those systems, chances are the trouble is rooted in a fault common to all of the affected VMs, such as shared network or storage resources.
If the problem appears to affect multiple VMs on the same host server, it's likely the trouble is located in that host itself: a defective network interface, a failure in the hypervisor (such as ESXi) or another system-level fault common to all of the affected VMs on that specific system.
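As a rough illustration of this scoping step, the sketch below groups unresponsive VMs by host and reports the likely fault domain. The function name, the input shape (pairs of VM name and host name) and the returned labels are all assumptions made for this example; they don't come from any hypervisor API.

```python
from collections import defaultdict


def classify_scope(unresponsive):
    """Guess the fault domain from (vm_name, host_name) pairs of frozen VMs.

    Hypothetical helper: the input shape and labels are illustrative only.
    """
    vms_by_host = defaultdict(set)
    for vm, host in unresponsive:
        vms_by_host[host].add(vm)

    if not vms_by_host:
        return "no unresponsive VMs reported"
    if len(vms_by_host) > 1:
        # Frozen VMs on several hosts point at something they all share.
        return "shared infrastructure (network or storage)"

    # Exactly one host is involved from here on.
    (host, vms), = vms_by_host.items()
    if len(vms) > 1:
        return f"host-level fault on {host}"
    return f"single VM: {next(iter(vms))}"
```

For instance, `classify_scope([("vm1", "hostA"), ("vm2", "hostA")])` returns `"host-level fault on hostA"`. The result is only a starting hypothesis; it still has to be confirmed against logs and hardware checks.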
If only one VM is frozen, check that the VM is actually powered on. A powered-off VM is completely unresponsive, so power it back on and see if it functions normally. If the VM is powered on as expected, confirm that it is truly inaccessible by trying to reach it through the workload, the guest OS and any other available interface.
If the VM responds through some interfaces but not others, the VM itself is probably still running, and the trouble can usually be traced to the workload, OS services or network connectivity. For example, if the guest OS responds but the guest application doesn't, it might be possible to restart the VM in an orderly manner. Similarly, admins might be able to check logs or error messages at the hypervisor console to identify specific faults in the guest VMs.
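The single-VM checks can be sketched as a small decision function. This is a minimal sketch, assuming the admin has already gathered the power state and a set of probe results by hand or with other tooling; `triage_single_vm`, the interface names and the returned messages are hypothetical.

```python
def triage_single_vm(powered_on, probes):
    """Summarize the state of one VM.

    powered_on: bool reported by the hypervisor.
    probes: dict mapping an interface name (e.g. "guest_os", "application",
            "console") to True if that interface responded.
    All names here are illustrative assumptions.
    """
    if not powered_on:
        return "powered off: power on and retest"
    if not probes:
        return "no probe results: test at least one interface"

    responding = sorted(name for name, ok in probes.items() if ok)
    failing = sorted(name for name, ok in probes.items() if not ok)

    if not responding:
        return "frozen: no interface responds"
    if failing:
        # The VM is alive; the fault is likely in a workload, service or network path.
        return "partially responsive: investigate " + ", ".join(failing)
    return "responsive on all tested interfaces"
```

A call such as `triage_single_vm(True, {"guest_os": True, "application": False})` returns `"partially responsive: investigate application"`, which matches the case where an orderly restart might still be possible.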
Finding the cause when a VM is frozen
Once admins and technicians understand the overall scope of the problem, they can work to isolate the possible cause. In most cases, the underlying cause of an unresponsive VM will fall into one of three categories.
First, consider whether instances of unresponsiveness can be traced to or triggered by any particular task. For example, snapshots and live migrations, such as vMotion migrations, can render a VM unresponsive or stunned for short periods, potentially triggering a more serious or longer-lasting unresponsiveness.
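One way to test for a task-related trigger is to line up the time of each freeze against the task history. The sketch below assumes the task list has been exported as plain tuples of name, start and end time; the function name and the five-minute window are arbitrary choices for illustration, not values from any product.

```python
from datetime import datetime, timedelta


def tasks_near_freeze(freeze_time, tasks, window=timedelta(minutes=5)):
    """Return names of tasks that ran within `window` of the freeze.

    tasks: list of (name, start, end) tuples with datetime values.
    The window size is an assumption; tune it to the environment.
    """
    hits = []
    for name, start, end in tasks:
        if start - window <= freeze_time <= end + window:
            hits.append(name)
    return hits


# Made-up example data: a snapshot overlaps the freeze, an earlier backup does not.
freeze = datetime(2024, 1, 10, 2, 3)
history = [
    ("snapshot", datetime(2024, 1, 10, 2, 0), datetime(2024, 1, 10, 2, 4)),
    ("backup", datetime(2024, 1, 10, 0, 0), datetime(2024, 1, 10, 0, 30)),
]
suspects = tasks_near_freeze(freeze, history)
```

A task that shows up repeatedly across several freezes is a stronger lead than a single coincidence.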
Second, inspect the configuration of the VM and its host system and verify that adequate resources are available. For example, setting low limits on resources such as memory and CPU can starve the VM and trigger performance problems. Similarly, VMs that suffer resource scheduling problems can hesitate or become unresponsive; a VM running at 100% processor utilization, for instance, might stop responding.
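A resource review along these lines might look like the following sketch. The `VmStats` fields and the 95% CPU threshold are assumptions chosen for the example; real figures would come from whatever monitoring the hypervisor provides.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class VmStats:
    """Hypothetical snapshot of one VM's resource figures."""
    name: str
    cpu_percent: float                      # current CPU utilization, 0-100
    memory_demand_mb: int                   # memory the guest is trying to use
    memory_limit_mb: Optional[int] = None   # None means no limit is configured


def resource_warnings(vm, cpu_threshold=95.0):
    """List resource conditions that could starve the VM (threshold is an assumption)."""
    warnings = []
    if vm.cpu_percent >= cpu_threshold:
        warnings.append(f"{vm.name}: CPU at {vm.cpu_percent:.0f}%")
    if vm.memory_limit_mb is not None and vm.memory_limit_mb < vm.memory_demand_mb:
        warnings.append(
            f"{vm.name}: memory limit {vm.memory_limit_mb} MB "
            f"below demand {vm.memory_demand_mb} MB"
        )
    return warnings
```

An empty list doesn't prove the VM is healthy; it only means neither of these two specific conditions was detected.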
Third, when a VM is frozen, consider the availability of the supporting infrastructure, such as the network and shared storage. For example, problems with shared storage connectivity can stop a VM from responding while it attempts to connect to the storage resource. Similarly, a VM that is waiting on a shared resource might get stuck if that resource is unavailable, such as when it tries to read a scratched disc in a CD-ROM drive.
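A basic reachability probe can help rule the infrastructure in or out. The sketch below uses Python's standard `socket.create_connection` with a timeout so the probe itself can't hang; the function name and default timeout are assumptions, and the host and port to test depend entirely on the environment.

```python
import socket


def tcp_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port opens within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False
```

For example, an admin might call `tcp_reachable("storage.example.internal", 3260)` against an iSCSI target, where the hostname is a placeholder and 3260 is the standard iSCSI port. An open port only shows that the network path works; it doesn't confirm that the storage service behind it is healthy.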