This four-part series focuses on improving Hyper-V high-availability cluster performance. Part one covered how
firmware, drivers, patches and updates affect virtual host cluster stability. In parts two and three, I offered personal workarounds to Hyper-V clustering problems that have improved the stability of my virtual environment. Here, in part four, I present some perplexing network issues and explain when and how to kill Hyper-V high-availability cluster services.
Hyper-V network issue No. 1: duplicate IP address, or Automatic Private IP Addressing (APIPA), after VM reboot
This network issue occurs after a sudden loss of the private or public host network, or a Fibre Channel drop, on a Hyper-V cluster node, which triggers virtual machines (VMs) to restart on different hosts. I find that watching a large number of VMs try to find alternate hosts can be chaotic.
In many cases, a VM will try to restart on a surviving host and then move to another host until it restarts successfully. As a result, some VMs will restart and report a "duplicate IP address on the network" if the VM runs Windows Server 2003 or XP, or receive an APIPA address if it runs Windows Server 2008 or Vista. Other than the network problems, all other VM functions should work normally. Unfortunately, performing a repair or disabling and re-enabling the VM's network interface card has no effect. But manually restarting the affected VM one more time does work.
Here's a tip: As a shortcut, open Hyper-V Manager and right-click on the VM. Then, choose "shut down." The system will shut down, but the VM will immediately reboot because it is a part of the high-availability cluster.
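If you prefer the command line to the Hyper-V Manager shortcut, cluster.exe can cycle the VM's cluster resource offline and back online, which also forces a clean restart through the cluster. This is a dry-run sketch that only prints the commands; the VM name "MyVM" and the "Virtual Machine ..." resource naming are assumptions, so check your actual resource names with "cluster res" first.

```shell
#!/bin/sh
# Dry-run sketch: print the cluster.exe commands that take a clustered
# VM's resource offline and bring it back online, forcing a restart
# through the cluster. "MyVM" is a placeholder VM name.
vm_cycle_commands() {
  vm="$1"
  # cluster.exe addresses the VM through its cluster resource name.
  echo "cluster res \"Virtual Machine $vm\" /offline"
  echo "cluster res \"Virtual Machine $vm\" /online"
}

vm_cycle_commands "MyVM"
```

Printing the commands first, rather than running them, lets you confirm the resource names before touching a fragile cluster.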
This problem results from a mixture of bad timing and incorrect VM configurations within a cluster. In my observation, this situation occurs when the integration components do not match the Hyper-V versions installed on the host.
So, for instance, if your environment consists of a Hyper-V host with Windows Server 2008 Service Pack 2 and VMs with Hyper-V integration components from the Hyper-V release to manufacturing edition, these issues manifest after an upgrade of integration components. If these problems occurred before you upgraded integration components, however, a simple, manual reboot of the affected VM should resolve the issues.
Hyper-V network issue No. 2: VM pings after shutdown
In most cases, a reboot will fix a VM network problem, such as the one previously mentioned. Similarly, when a Hyper-V clustered host fails unexpectedly and VMs are forced to restart on alternate nodes, I have encountered systems that restarted fully and reported that they were pinging correctly.
But on deeper inspection, these VMs cannot be reached through any remote management process (e.g., Remote Desktop Protocol (RDP), eventvwr or a universal naming convention path); they respond only to pings. The VM also cannot ping outward. Even stranger, if you completely shut down such a VM, it will continue to ping.
To solve this network issue, use Failover Cluster Manager or System Center Virtual Machine Manager (SCVMM) to shut down the clustered VM. Shutting down a clustered VM in Hyper-V Manager instead triggers a high-availability reaction from the cluster, which restarts the VM.
It can be strange to see Failover Cluster Manager display the server as off while it continues to answer pings. In my experience, this situation results from configuring a VM with a legacy network adapter.
Fixing this issue is a bit trickier, and it requires both Failover Cluster Manager and Hyper-V Manager. Here are the steps:
- After a failure of a host or hosts in a cluster, it may be necessary to restart the Hyper-V Management Service on every node to refresh true VM statuses while using the Hyper-V Manager utility.
- Then, in Failover Cluster Manager, right-click on the configuration of the VM experiencing this problem, and choose Shut Down.
- After shutdown, check the VM's status in Hyper-V Manager and ping it remotely. The VM will most likely report an off status in Hyper-V Manager but continue to ping.
- Use Failover Cluster Manager to move the VM to each cluster node, performing the process outlined in step two. Notice that after each VM move, the status reported in Hyper-V Manager will change to running, even though your VM still reports as off in Failover Cluster Manager.
- To correct this issue, right-click on the VM in Hyper-V Manager and choose Turn Off. At this point, the status will change to off and the pings will cease as well.
- Restart the VM. It will now return to full functionality.
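The first step above -- restarting the Hyper-V Management Service on every node -- can be scripted. The Windows service name for the Hyper-V Virtual Machine Management service is vmms, which sc.exe can stop and start remotely. This is a dry-run sketch that only prints the commands; the node names are placeholders for your actual cluster nodes.

```shell
#!/bin/sh
# Dry-run sketch: print the sc.exe commands that restart the Hyper-V
# Virtual Machine Management service (vmms) on each cluster node so
# Hyper-V Manager reports true VM statuses. Node names are placeholders.
refresh_vmms_commands() {
  for node in "$@"; do
    echo "sc \\\\$node stop vmms"
    echo "sc \\\\$node start vmms"
  done
}

refresh_vmms_commands HVNODE1 HVNODE2
```

Restarting vmms does not stop running VMs' worker processes; it only refreshes the management layer that Hyper-V Manager talks to.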
To eliminate this problem, limit the use of VMs with legacy network adapters, which route traffic through the host partition.
Killing Hyper-V high-availability cluster services
At times, I've come to the realization that there is nothing I can do for an unresponsive virtual cluster node. Whether it's a driver issue, Volume Shadow Copy Service crash or some other unknown problem, there have been instances in which I've had to take out the hammer to kill high-availability cluster services on a node. This takes some courage when there are multiple virtual workloads in unknown states running on a node, but this may be necessary for the stability of the cluster.
Before taking this drastic step, however, it's important to understand the consequences. When you kill high-availability cluster services, it creates a high-availability reaction for the remaining cluster nodes. VMs residing on the problem host will be distributed to other nodes and restarted as if there were a power failure. In my experience, Failover Cluster Manager will now be available, and restarting the failed host should be possible. Before moving VMs back to the node, however, look through the event logs and other monitoring logs in detail.
Again, before taking this approach, you should exhaust every option.
In a couple of instances, for example, my Hyper-V nodes have been completely unresponsive to external cluster management. The management functionality of cluster utilities -- such as the cluster.exe command or the graphical user interface (GUI) management tools (e.g., Failover Cluster Manager, SCVMM and Hyper-V Manager) -- was unavailable or unresponsive. Nevertheless, some VMs functioned while others didn't.
If this situation occurs, the following are some concrete items to check before killing high-availability cluster services:
- Use the cluster.exe command to query the affected node. This utility may retain limited functionality for querying the statuses of VMs on a node that is unresponsive in the GUI. From this feedback, noting which VM cluster resources are experiencing problems may lead you to the root cause.
- Use a utility such as PsKill or Taskkill. In the article "Clustering problems with Hyper-V VM configuration files, VM states," I covered how to find the VMWP.exe process for a particular VM and kill it. If the cluster.exe command reveals VMs stuck in a transitioning state, terminating the problematic VM's worker process may spare you from killing the cluster service on the node.
- Try to save the VM workloads from a crash. You may not be able to get to the clustered host, but you may be able to get to the guests through RDP or another remote management process. Because manually shutting down a VM on a high-availability cluster will result only in a restart, it's wise to shut down applications that react poorly to a hard power-down.
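The checklist above can be sketched as a sequence of commands: query resource states with cluster.exe, list the vmwp.exe worker processes (each running VM has its own) with tasklist, and kill only the stuck VM's worker with taskkill. This dry-run sketch only prints the commands; the node name and process ID are placeholders you would fill in from the tasklist output.

```shell
#!/bin/sh
# Dry-run sketch of the pre-kill checklist: print the commands that
# query VM resource states, list vmwp.exe worker processes on the
# stuck node, and kill one worker by PID. Placeholders: node name, PID.
precheck_commands() {
  node="$1"; pid="$2"
  # 1. Query the status of all cluster resources, including VMs.
  echo "cluster res /status"
  # 2. List the vmwp.exe worker processes on the affected host.
  echo "tasklist /s $node /fi \"IMAGENAME eq vmwp.exe\" /v"
  # 3. Kill only the problem VM's worker process, not clussvc.exe.
  echo "taskkill /s $node /pid $pid /f"
}

precheck_commands HVNODE1 4242
```

Killing a single vmwp.exe worker is far less disruptive than killing the cluster service, which is why these checks come first.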
Inevitably, you may come to the point where you need to kill the high-availability cluster services to regain control. I have done so successfully with both Taskkill and PsKill:
Taskkill /s CLUSTERNODENAME /IM clussvc.exe
PsKill \\CLUSTERNODENAME clussvc.exe
(Note: After killing a high-availability cluster service, some of the previous issues may reappear, such as a duplicate IP address, or APIPA, after a VM reboot, or VMs still pinging after shutdown.)
Even with the issues outlined in this series on Hyper-V clusters, I believe the advantages of clustering your virtual hosts far outweigh the disadvantages. These problems do not happen often, but when they do, they can cause a fair amount of head scratching and nail biting.
Ultimately, these issues point to the growing pains of Hyper-V and other virtualization platforms. As more users adopt virtualization technology and the breadth of use cases expands, more problems -- like the ones detailed in this series -- will emerge.
About the expert
Rob McShinsky is a senior systems engineer at Dartmouth Hitchcock Medical Center in Lebanon, N.H., and has more than 12 years of experience in the industry -- including a focus on server virtualization since 2004. He has been closely involved with Microsoft as an early adopter of Hyper-V and System Center Virtual Machine Manager 2008, as well as a customer reference. In addition, he blogs at VirtuallyAware.com, writing tips and documenting experiences with various virtualization products.