It doesn't matter what software you use; eventually, one of your hosts will crash. After all, the hypervisor is software, and all software at some point will have a bug or issue that can cause a crash. While crashes are common for most people in IT, a hypervisor host server crash is different because it can affect a much larger footprint in your datacenter, which makes it more visible than a single server or VM crash. As such, the risks of virtualization are put on display in an event like this.
One of the first questions management will ask is how this could have happened. Having high availability doesn't mean a crash of this nature is 100% avoidable. It's always best to explain before a crash occurs that high availability works by restarting the affected VMs on another host, so a brief outage is still part of the design. This is a point that management may not understand.
Uptime and system checklist
Before you jump into root-cause analysis and upload your logs to VMware support, take a look at the uptime of the hypervisor host. Has it been online for days, weeks, months or years? While we'd like software to run continuously with no issues, the reality is that all machines need to be rebooted at some point. If your host crashed after running for only 30 days, you will need to troubleshoot it and keep it out of production until you have a solution. However, if your host had been online and stable for 300 days, you could simply reboot it and place it back into production. There is no golden rule for how many days something should be online before a reboot. As a general guideline, people gravitate toward 90 to 120 days of uptime before a reboot. It might seem old-fashioned to schedule reboots, but with vMotion and other migration technologies, we really don't have an excuse not to be proactive with maintenance.
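The uptime check above can be written down as a small triage rule. The following is only a minimal sketch: the function names and the 90-day cutoff are assumptions chosen for illustration (the cutoff is the low end of the 90-to-120-day guideline), and since there is no golden rule, the threshold is left as a parameter to adjust for your own environment.

```python
from datetime import datetime


def uptime_days(boot_time: datetime, crash_time: datetime) -> int:
    """Return the whole days the host was up before it crashed."""
    return (crash_time - boot_time).days


def post_crash_action(days_up: int, reboot_guideline_days: int = 90) -> str:
    """Suggest a first step after a host crash, based on uptime alone.

    reboot_guideline_days is an assumed cutoff, not a vendor rule.
    """
    if days_up < 0:
        raise ValueError("uptime cannot be negative")
    if days_up < reboot_guideline_days:
        # A crash after a short uptime points to a real problem:
        # keep the host out of production and troubleshoot it.
        return "troubleshoot"
    # A long, stable uptime suggests the host was simply overdue
    # for a reboot and can return to production afterward.
    return "reboot"


if __name__ == "__main__":
    days = uptime_days(datetime(2024, 1, 1), datetime(2024, 1, 31))
    print(days, post_crash_action(days))
```

A rule like this only decides where to start. A host that keeps crashing after a reboot still needs a proper root-cause analysis, whatever its uptime was.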
Before you contact support, there are a few items you should check. First, is your hardware on your hypervisor vendor's compatibility list? It should go without saying, but unsupported hardware is very common and easily overlooked. Keep in mind this also includes any additional networking or storage cards installed in the hypervisor hosts. Second, are the drivers and patches up to date? Just because many hypervisors don't require a lot of patching doesn't mean you can ignore the patches they do release. While they are not as frequent as traditional operating system patches, they are just as important.
If you are current with patches and have adhered to a reboot schedule, then it is time to diagnose the issue. This is often a two-pronged approach with many of the hypervisor vendors. Part one involves opening a support case; part two is what you can find on your own. With each vendor, the logs are the most important aspect of troubleshooting. You will often need to upload them to the vendor to start the troubleshooting process, but that doesn't mean you can't use them to find a root cause yourself. Vendors typically give you some ability to view and search the logs; however, the built-in tools are often not robust enough to complete the process efficiently. While it might be a more challenging approach, you can still use this method to identify critical events that can lead you to the root cause.
In the event you can't find anything on your own, the vendor's technical support group can usually find out what's going on. Part of this is due to skills, but part is also due to the amount of logging that is done. In fact, with some vendors, the logging is so extensive that it is almost impossible for the average person to open some of the log files without specialized software designed to handle large amounts of data. This can limit what customers can research on their own and makes a support contract even more important. This doesn't mean your only option is the vendor's technical support; it just means you have to be prepared with additional tools, such as Notepad++ or other applications that can handle larger files.
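When a log file is too large to open comfortably in an editor, one option is to stream it line by line and keep only the lines that look critical. The sketch below assumes a plain-text log; the keyword list and the `scan_log` name are hypothetical choices for illustration, not any vendor's log format or tooling.

```python
import re
from collections import Counter

# Hypothetical keywords; adjust to whatever your hypervisor actually logs.
CRITICAL_PATTERN = re.compile(r"\b(panic|fatal|critical|error)\b", re.IGNORECASE)


def scan_log(path, pattern=CRITICAL_PATTERN, max_hits=50):
    """Stream a log file and collect lines that match the pattern.

    The file is read one line at a time, so memory use stays small
    regardless of the file size. Returns the first max_hits matches as
    (line_number, text) tuples, plus a count of matches per keyword.
    """
    hits = []
    counts = Counter()
    # errors="replace" keeps the scan going if the log has stray bytes.
    with open(path, "r", encoding="utf-8", errors="replace") as log:
        for line_number, line in enumerate(log, start=1):
            match = pattern.search(line)
            if match:
                counts[match.group(1).lower()] += 1
                if len(hits) < max_hits:
                    hits.append((line_number, line.rstrip("\n")))
    return hits, counts
```

Calling `scan_log` with the path to an exported log returns the first matching lines with their line numbers, which gives you a starting point for the time window around the crash before or while the vendor reviews the full log bundle.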
A hypervisor host server crash isn't the end of virtualization or your career; it just means the software still has a few hiccups. While the audience exposed to the failure is much bigger than normal, the benefits of virtualization make up for it. The rules and guidelines may have changed, but in the end, a hypervisor is still a piece of technology that we have to install, support and fix.