Clustering problems with Hyper-V VM configuration files, VM states

Microsoft Hyper-V clustering problems, such as orphaned VM configuration files and unsynchronized VM states, require these workarounds to preserve system stability.

This four-part series focuses on clustering problems with Microsoft Hyper-V virtual machines (VMs). Part one covered how firmware, drivers, patches and updates affect virtual host cluster stability. Part two offers personal workarounds to two Hyper-V clustering problems that have helped the overall stability of my virtual environment.

Clustering problem No. 1: Unsynchronized VM states 

Recently, I had problems with my HP Virtual Connect firmware. I experienced prolonged public and private network interface card outages that caused nodes to sense other host failures in the cluster. As a result, VMs attempted to restart on alternate nodes.

In some instances, Failover Cluster Manager would display VMs in a "saving" or "starting" state. Hyper-V Manager would show these VMs as saved or running, but this information would not register in Failover Cluster Manager.
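The mismatch described above can be expressed as a simple comparison. The following is a minimal sketch, not a supported tool: it assumes you have already collected each VM's state from both consoles into two dictionaries, and the function name and input shape are hypothetical.

```python
# States taken from the article: the cluster view hangs in a transitional
# state while Hyper-V Manager reports the settled one.
TRANSITIONAL = {"saving", "starting"}
SETTLED = {"saved", "running"}


def find_unsynced(cluster_states, hyperv_states):
    """Return names of VMs whose cluster state is stuck while Hyper-V has settled.

    Both arguments map a VM name to a state string (hypothetical input shape).
    """
    stuck = []
    for name, cluster_state in cluster_states.items():
        hyperv_state = hyperv_states.get(name, "").lower()
        if cluster_state.lower() in TRANSITIONAL and hyperv_state in SETTLED:
            stuck.append(name)
    return sorted(stuck)
```

A VM that both consoles report as "starting" is not flagged, because it may simply still be booting.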

The following three workarounds sync the correct VM state with the cluster:

  • In Hyper-V Manager, resume/start the VM that is in a saved state. Then, manually save the VM in Hyper-V Manager. Most of the time, this triggers the cluster to show the true VM state.
  • The second workaround is similar. Instead of manually saving the VM after you start it, shut it down in Hyper-V Manager. At times, this releases the VM's hung state within Failover Cluster Manager.
  • The third option is more involved. A Microsoft TechNet forum suggests using Sysinternals Process Monitor to locate the VMWP.exe process associated with the troublesome VM. Killing this process crashes the VM and restarts it on another cluster node, which syncs the VM state in Failover Cluster Manager. It's not the best option, but sometimes a hammer is necessary. It also beats having to kill other cluster services that affect every VM on a node.

(Note: I use Hyper-V Manager because Failover Cluster Manager and System Center Virtual Machine Manager are not functional with VMs in this problem state. Hyper-V Manager is responsive and accurately displays the VM's true state.)
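For the third workaround, the hard part is matching a VMWP.exe process to the right VM. The sketch below assumes that each worker process carries its VM's GUID on the command line and that you have copied a list of (PID, command line) pairs out of your process viewer. The function name and input shape are hypothetical, and the code only identifies the process; it does not kill anything.

```python
def find_worker_pids(processes, vm_guid):
    """Return PIDs of vmwp.exe processes whose command line mentions vm_guid.

    processes: iterable of (pid, command_line) pairs (hypothetical input shape).
    vm_guid: the VM's GUID, with or without surrounding braces.
    """
    guid = vm_guid.strip("{}").lower()
    matches = []
    for pid, command_line in processes:
        line = command_line.lower()
        # Assumption: the worker process is started with the VM GUID as an argument.
        if "vmwp.exe" in line and guid in line:
            matches.append(pid)
    return matches
```

If the result contains anything other than exactly one PID, stop and check by hand before killing a process.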

Clustering problem No. 2: Orphaned VM configuration files

After an unexpected VM failover, a few manual cleanup routines are necessary to return Hyper-V virtual cluster environments to their top efficiency levels.

One routine involves deleting orphaned link files found at C:\ProgramData\Microsoft\Windows\Hyper-V\Virtual Machines. These link files point Hyper-V to the location of the VM extensible markup language (XML) configuration files.

During a planned or controlled failover, the link files are deleted after the VMs shift to another node. When unexpected failure occurs, however, the failed node's VM link files are orphaned. The orphaned configuration files have little effect on a system, but I've seen instances when a quick migration of a failed VM back to a previous node causes a failure. The biggest nuisance of orphaned VM configuration files, though, is the continual appearance of 4096 errors in the event log, as seen in Figure 1.

Figure 1

These event log errors point directly to the files that need to be deleted. In this example, notice the hardware configuration globally unique identifier (GUID). In the Virtual Machines folder noted above, there will be a link file with the same GUID as the one in the error message. Delete this orphaned VM link file, and the event log error will be resolved.
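Matching the GUID in an error message to a file name is easy to get wrong by eye. As a rough sketch, the snippet below pulls every GUID out of a pasted event message and builds the link file path to look for. The `.xml` extension on the link file and the function name are assumptions, and nothing here touches the file system.

```python
import re
from pathlib import PureWindowsPath

# Folder named in the article.
LINK_DIR = PureWindowsPath(r"C:\ProgramData\Microsoft\Windows\Hyper-V\Virtual Machines")

# Standard 8-4-4-4-12 hexadecimal GUID layout.
GUID_RE = re.compile(r"[0-9A-Fa-f]{8}(?:-[0-9A-Fa-f]{4}){3}-[0-9A-Fa-f]{12}")


def candidate_link_files(event_text, extension=".xml"):
    """Return the link file paths implied by the GUIDs found in event_text.

    The extension is an assumption; adjust it to match the files on your host.
    """
    guids = sorted({g.upper() for g in GUID_RE.findall(event_text)})
    return [LINK_DIR / (g + extension) for g in guids]
```

The paths it returns are only candidates for deletion until you confirm that no running VM uses them.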

Figure 2

Be careful, though. If an active link file is deleted, that VM will fail/crash, and you will have to add the VM back to the cluster.
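One way to guard against that mistake is to compare the link files against the GUIDs of VMs currently registered on the node before removing anything. The sketch below assumes each link file is named after its VM's GUID and that you supply the list of active GUIDs yourself. The function name is hypothetical, and the code only reports; it never deletes.

```python
from pathlib import Path


def split_link_files(link_dir, active_guids):
    """Split link file names into (keep, orphan_candidates).

    link_dir: folder holding the link files.
    active_guids: GUIDs of VMs currently on this node, with or without braces.
    Assumption: each link file's name (minus extension) is the VM GUID.
    """
    active = {g.strip("{}").upper() for g in active_guids}
    keep, orphan_candidates = [], []
    for path in sorted(Path(link_dir).iterdir()):
        # Count broken links too, since an orphan's target may be gone.
        if not (path.is_file() or path.is_symlink()):
            continue
        target = keep if path.stem.upper() in active else orphan_candidates
        target.append(path.name)
    return keep, orphan_candidates
```

Review the orphan candidates against the GUIDs in the event log before deleting any of them by hand.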

After the orphaned VM link files are deleted, the error messages will stop and the VM failover process will be more stable.

Stay tuned, because in part three of this series, I present more personal fixes for Hyper-V virtual machine cluster problems. Until then, send me any feedback or issues you have seen.

About the expert
Rob McShinsky is a senior systems engineer at Dartmouth Hitchcock Medical Center in Lebanon, N.H., and has more than 12 years of experience in the industry -- including a focus on server virtualization since 2004. He has been closely involved with Microsoft as an early adopter of Hyper-V and System Center Virtual Machine Manager 2008, as well as a customer reference. In addition, he blogs at VirtuallyAware.com, writing tips and documenting experiences with various virtualization products.
