How to troubleshoot common ESX problems in VMware

In this excerpt from VMware VI3 Implementation and Administration, learn some steps on how to identify ESX problems in virtual environments and solve them.


Common Problems and Resolutions

There are some problems that occur with ESX hosts that are fairly common, and we cover those here. Many times, you can take simple steps to correct them, but some require more in-depth troubleshooting and resolutions.

About the book:

This chapter excerpt on troubleshooting ESX (download PDF) is taken from the book VMware VI3 Implementation and Administration. The book is a comprehensive guide to planning for, implementing, securing, maintaining and monitoring VMware VI3 in any IT environment.

Purple Screen of Death

One occurrence that can happen on both ESX and ESXi hosts is called the purple screen of death (PSoD), which is VMware's version of Microsoft's infamous blue screen of death (BSoD) and the result of a host crashing and becoming inoperative. A PSoD, as shown in Figure 10.6, is definitely not something you want to see on your hosts.

Figure 10.6 ESX host purple screen of death

When a PSoD occurs, it completely halts the host and leaves it unresponsive. The causes of PSoDs are typically hardware-related (defective memory is the most common cause) or a bug in the ESX software. Your only recourse when it happens is to power off the host and power it back on. The information displayed on the screen is useful, however, and you should attempt to capture it by writing it down, taking a picture of it with a camera phone, or taking a screen capture from a remote management board if one is present. You might not be able to make much sense of the information, but it will be very useful to VMware support. The screen shows the ESX version and build number, the exception type, a register dump, what was running on each CPU when the crash occurred, a back-trace, server uptime, error messages, and memory core information.

When you reboot a host after experiencing a PSoD, a file whose name begins with vmkernel-zdump should be present in your host's /root directory. This file will be useful to VMware support, and you can also use it to help determine the cause yourself: the vmkdump utility extracts the VMkernel log file from the dump, and you can then look through that log for clues as to the cause of the PSoD. To use this command, type vmkdump -l followed by the name of the dump file. As previously mentioned, defective memory is a common cause of a PSoD. You can use the dump file to help identify the memory module that caused the problem so that it can be replaced.
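The lookup step described above can be sketched in shell. This is only an illustration, assuming the dump file sits in /root and its name begins with vmkernel-zdump as stated; the helper name newest_zdump is hypothetical and not a VMware tool. Only vmkdump -l comes from the text.

```shell
#!/bin/sh
# Print the most recently modified vmkernel-zdump* file in a directory
# (default: /root). Prints nothing if no dump file is found.
# newest_zdump is a hypothetical helper name used for illustration only.
newest_zdump() {
    dir="${1:-/root}"
    # ls -t lists newest first; the error is hidden when nothing matches.
    ls -t "$dir"/vmkernel-zdump* 2>/dev/null | head -n 1
}

# Example use from the Service Console after a PSoD reboot:
#   dump=$(newest_zdump /root)
#   [ -n "$dump" ] && vmkdump -l "$dump"
```

If more than one dump file has accumulated, picking the newest one keeps the analysis tied to the most recent crash.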

About the author:

Eric Siebert has more than 25 years of experience in the IT industry and specializes in Windows server administration and virtualization technology. He is a senior systems administrator in Golden, Colo., serves as a Guru on VMware's community support forums and offers tips at his website, VMware-land.com. He also writes a blog for TechTarget.

If you suspect defective memory is the cause, you can test your host's memory by using a utility application that burns in memory. These utilities require you to shut down your host and boot from a CD to run the memory tests. One commonly used utility is Memtest86+, which does extensive memory testing, including checking the interaction of adjacent memory cells to ensure that writing to one cell does not overwrite an adjacent cell. You can download this utility at www.memtest.org/.

It is a good idea to burn in and test your host's memory when you are first building it to avoid disruptions later. Most memory problems are not obvious and will not be detected by the simple memory test a server does as part of its POST boot procedure. You can download the free Memtest86+ utility as a small 2MB ISO file, burn it to a CD, boot the host from it, and let it run for at least 24 hours so that it works through its various memory tests. The more RAM you have in the system, the longer it will take to complete one pass. A server with 32GB of RAM will generally take about one day to complete a pass. Besides the system memory, Memtest86+ will also test your CPU's L1 and L2 cache memory. Memtest86+ will run indefinitely, and the pass counter will increment each time all the tests have been run.

Service Console Problems

You might sometimes experience a problem with your Service Console where it hangs and will not allow you to log on locally. The condition, which can be caused by hardware lockups or a deadlock, will usually not affect the operation of the VMs running on the host. Rebooting is often the only way to recover from this condition. Before doing this, however, you should shut down the VMs or VMotion them to other hosts. Do this by whatever method still works, such as using the VI Client, connecting to the Service Console remotely via SSH, or trying an alternative/emergency console, which can be accessed by pressing Alt-F2 through Alt-F6. After you have moved the VMs or shut them down, you can reboot the host with the reboot command. If all the console methods are unresponsive, you will need to cold boot the host instead.

Networking Problems

You may also experience a condition that causes you to lose all or part of your networking configuration or where a configuration change causes the Service Console to lose network connectivity. If this happens, you will not be able to connect to the host by any remote method, including the VI Client or SSH. Your only recourse will be to rebuild or fix the network configuration from the local Service Console using the esxcfg- command-line utilities. Here are some of the commands that you can use to configure networking from the ESX CLI:

  • esxcfg-nics This command displays a list of physical network adapters along with information about the driver, PCI device, and link state of each NIC. You can also use this command to control a physical network adapter's speed and duplexing. Type esxcfg-nics -l to display NIC information and esxcfg-nics -h to display available options for this command. Here are some examples:
    • Set the speed and duplex of a NIC (vmnic2) to 100/Full:
      esxcfg-nics -s 100 -d full vmnic2
    • Set the speed and duplex of a NIC (vmnic2) to autonegotiate:
      esxcfg-nics -a vmnic2
  • esxcfg-vswif Creates and updates Service Console network settings, including IP address and port group. Type esxcfg-vswif -l to display current settings and esxcfg-vswif -h to display all available options for changing settings. Here are some examples:
    • Change your Service Console (vswif0) IP and subnet mask:
      esxcfg-vswif -i 172.20.20.5 -n 255.255.255.0 vswif0
    • Add a Service Console (vswif0):
      esxcfg-vswif -a vswif0 -p "Service Console" -i 172.20.20.40 -n 255.255.255.0
  • esxcfg-vswitch Creates and updates VM (vSwitch) network settings, including uplink NICs, port groups, and VLAN IDs. Type esxcfg-vswitch -l to display current vSwitch configurations and esxcfg-vswitch -h to display all available options for changing settings. Here are some examples:
    • Add a physical NIC (vmnic2) to a vSwitch (vSwitch1):
      esxcfg-vswitch -L vmnic2 vSwitch1
    • Remove a pNIC (vmnic3) from a vSwitch (vSwitch0):
      esxcfg-vswitch -U vmnic3 vSwitch0
    • Create a port group (VM Network 3) on a vSwitch (vSwitch1):
      esxcfg-vswitch -A "VM Network 3" vSwitch1
    • Assign a VLAN ID (3) to a port group (VM Network 3) on a vSwitch (vSwitch1):
      esxcfg-vswitch -v 3 -p "VM Network 3" vSwitch1
  • esxcfg-route Sets or retrieves the default VMkernel gateway route. Type esxcfg-route -l to display current routes and esxcfg-route -h to display all available options for changing settings. Here are some examples:
    • Set the VMkernel default gateway route:
      esxcfg-route 172.20.20.1
    • Add a route to the VMkernel:
      esxcfg-route -a default 255.255.255.0 172.20.20.1
  • esxcfg-vmknic Creates and updates VMkernel TCP/IP settings for VMotion, NAS, and iSCSI. Type esxcfg-vmknic -l to display VMkernel NICs and esxcfg-vmknic -h to display all available options for changing settings. Here is an example:
    • Add a VMkernel NIC and set the IP and subnet mask:
      esxcfg-vmknic -a "VM Kernel" -i 172.20.20.19 -n 255.255.255.0

In addition, you can restart your Service Console network by using the command service network restart.
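As a rough sketch, the commands above can be chained into one recovery pass run from the local Service Console. The helper names run and fix_console_network and the DRY_RUN switch are hypothetical additions for illustration; the esxcfg- and service invocations are the ones listed above. Which NIC, vSwitch, and address apply depends on your host.

```shell
#!/bin/sh
# Sketch: re-link an uplink and reset the Service Console IP address.
# With DRY_RUN=1 each command is printed instead of executed, so the
# sequence can be reviewed before it touches the host.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# fix_console_network NIC VSWITCH IP NETMASK  (hypothetical helper)
fix_console_network() {
    nic="$1"; vswitch="$2"; ip="$3"; mask="$4"
    run esxcfg-nics -l                            # check link state of the physical NICs
    run esxcfg-vswitch -l                         # review the current vSwitch layout
    run esxcfg-vswitch -L "$nic" "$vswitch"       # attach the uplink to the vSwitch
    run esxcfg-vswif -i "$ip" -n "$mask" vswif0   # set Service Console IP and subnet mask
    run service network restart                   # restart Service Console networking
}
```

For example, setting DRY_RUN=1 and calling fix_console_network vmnic2 vSwitch1 172.20.20.5 255.255.255.0 prints the five commands without running them. Inspect the output of the two listing commands first, because the right fix depends on what is actually missing.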

Other Problems

Sometimes just restarting some of the ESX services will resolve problems and not affect the VMs running on the host. Two services that can be restarted and often fix many problems are the hostd service and the vpxa service. The hostd service runs in the Service Console and is responsible for managing most of the operations on the ESX host. To restart the hostd service, log on to the Service Console and type service mgmt-vmware restart.

The vpxa service is the management agent that handles communication between the ESX host and its clients, including the vCenter Server and anyone who connects to the host using the VI Client. If you experience problems with vCenter Server showing a host as disconnected, not showing current information, or any other strange problems involving vCenter Server and a host, restarting this service may resolve them. To restart the vpxa service, log on to the Service Console and type service vmware-vpxa restart. Restarting these two services is a good first step whenever you are experiencing problems, because it often resolves them without affecting the running VMs.
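The two restarts can be wrapped in a small shell function, shown here as a sketch only. The function name restart_mgmt_agents and the SERVICE_CMD override are hypothetical; the service names mgmt-vmware and vmware-vpxa are the ones given above.

```shell
#!/bin/sh
# Restart hostd (mgmt-vmware) and then vpxa (vmware-vpxa), stopping at
# the first failure. SERVICE_CMD defaults to "service"; overriding it is
# only meant for rehearsing the sequence away from a real host.
restart_mgmt_agents() {
    svc="${SERVICE_CMD:-service}"
    for name in mgmt-vmware vmware-vpxa; do
        echo "Restarting $name"
        "$svc" "$name" restart || {
            echo "Restart of $name failed" >&2
            return 1
        }
    done
}
```

On the host, calling restart_mgmt_agents from the Service Console runs service mgmt-vmware restart followed by service vmware-vpxa restart.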


Printed with permission from Prentice Hall PTR. Copyright 2009. VMware VI3 Implementation and Administration by Eric Siebert. For more information about this title and other similar books, please visit http://www.pearsonhighered.com.

This was first published in June 2009
