Before the phone calls start streaming in, build an IT outage communication plan to ensure all your stakeholders remain up to date during a recovery process.
Virtualization administrators never want to see that a host has crashed. Though multiple alerts from VMs going down and hopefully coming back up are likely, a host failure is different from a traditional Windows or Linux machine crash. The scale of the effect is larger, which means the incident will garner more attention. It won't be business as usual when a host crashes because the infrastructure requires different treatment and a more comprehensive IT outage communication plan.
Ensure the virtual guests are back up and running before doing anything else. The host infrastructure should have been designed with N+1 redundancy so that it can absorb the loss of one host.
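As a rough illustration of what N+1 means in practice, the sketch below checks whether a cluster could still carry its VMs with its largest host offline. It considers memory only, and the function name and figures are hypothetical; a real capacity check would also weigh CPU, storage and any admission-control settings the hypervisor enforces.

```python
def survives_host_loss(host_capacity_gb, vm_demand_gb):
    """Return True if the remaining hosts can hold all VM memory demand
    after the largest single host is removed (a simple N+1 check)."""
    if not host_capacity_gb:
        return False
    # Worst case: the biggest host is the one that fails.
    remaining = sum(host_capacity_gb) - max(host_capacity_gb)
    return remaining >= sum(vm_demand_gb)


# Hypothetical three-host cluster with 256 GB per host.
print(survives_host_loss([256, 256, 256], [100, 150, 200]))
```

If the check fails, the decision about returning a crashed host to service becomes far more urgent, because the cluster has no slack to wait out an investigation.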
Here, you must make a key decision. If the host recovered and it wasn't a clear hardware error, should it return to service? Depending on how much excess infrastructure is available, the need for host resources can be critical. But, on the other hand, it's risky to put a critical piece of infrastructure back into service without having done a full root cause analysis.
Balance is necessary because root cause analysis can take a considerable amount of time. The logs can be massive and, even with tech support, you may go multiple days without that host's capacity.
There is a middle ground, however, that a few procedures can help admins reach. Evaluate how long the host was in operation since its last reboot. The difference between 30 days online and 200 is significant. Memory leaks happen even with the best hypervisors, and the odds of a crash only rise as the reboot window grows longer.
If a crash occurs within 60 days of the host's last reboot, be prepared to take that host offline for a longer period to do a thorough crash investigation. Once the host has been up for more than 90 days, it's easier to put it back in service without a full investigation. These ranges are estimates, so use them as starting points.
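The uptime rule of thumb above can be written down as a small triage helper. This is only a sketch: the 60- and 90-day values come from the estimates in the text, while the function name, the handling of the 60-to-90-day gap and the hardware-error branch are assumptions added for illustration.

```python
from datetime import datetime, timedelta

# Starting-point estimates from the text; tune them for your environment.
FULL_INVESTIGATION_BELOW_DAYS = 60
LIGHTER_REVIEW_ABOVE_DAYS = 90


def triage_host_crash(last_boot, crashed_at, clear_hardware_error=False):
    """Suggest a next step based on how long the host ran before crashing."""
    uptime_days = (crashed_at - last_boot).days
    if clear_hardware_error:
        # Assumption: a clear hardware fault always keeps the host out.
        return "keep offline: repair hardware"
    if uptime_days < FULL_INVESTIGATION_BELOW_DAYS:
        return "keep offline: full crash investigation"
    if uptime_days > LIGHTER_REVIEW_ABOVE_DAYS:
        return "return to service: send logs to vendor"
    # Assumption: the text gives no guidance between 60 and 90 days.
    return "judgment call: weigh spare capacity against risk"


boot = datetime(2024, 1, 1)
print(triage_host_crash(boot, boot + timedelta(days=30)))
print(triage_host_crash(boot, boot + timedelta(days=200)))
```

Whatever the helper suggests, the logs still go to the hypervisor vendor; the output only informs how long the host stays out of the cluster.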
No matter how quickly the host comes back online, you still need to get the logs to the hypervisor vendor. If the host is stable and online, the case doesn't need to take a production environment down. Resolving the problem is imperative, but you must be patient with the hypervisor vendor.
IT outage communication plans require transparency
Work with the vendor and document the process in a predefined outage form located in a shared location. This is critical to the IT outage communication plan and recovery process because it keeps management in the loop.
Keep a regularly updated form in a place where management can easily check the status of an incident. Update the time and date frequently to prevent calls asking for status updates. Admins won't be able to resolve issues while they're on the phone with application owners or managers. The form updates can be time-consuming, but it's more efficient to proactively update once rather than deal with numerous calls.
Add details about the recovery process to the outage form. Admins often take VM recovery for granted, but recovery technology, such as VMware's vSphere High Availability, can have a significant effect. By highlighting how quickly VMs recovered, the form can reassure application owners that though an issue occurred, the automated recovery process worked quickly.
The outage form should also include the results of the vendor investigation once they are available. Whether the cause was a bug, a change in the environment, human error or an act of nature, admins should state it clearly and concisely in the documentation. Complete honesty and transparency are requisite for trust.
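One way to picture such a form is as a small record with timestamped status updates, a recovery summary and a slot for the vendor's findings. The sketch below is a hypothetical structure, not a standard template; the class name, field names and the host name in the usage example are all invented for illustration, and timestamps are assumed to be in UTC.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Optional, Tuple


@dataclass
class OutageForm:
    incident: str
    host: str
    updates: List[Tuple[datetime, str]] = field(default_factory=list)
    recovery_summary: Optional[str] = None
    vendor_findings: Optional[str] = None

    def add_update(self, note, when=None):
        """Record a status note; defaults to the current UTC time."""
        when = when or datetime.now(timezone.utc)
        self.updates.append((when, note))

    def render(self):
        """Return the plain-text view that management would read."""
        lines = [f"Incident: {self.incident}", f"Host: {self.host}"]
        if self.updates:
            last = self.updates[-1][0]
            lines.append(f"Last updated: {last:%Y-%m-%d %H:%M} UTC")
        lines.append(f"Recovery: {self.recovery_summary or 'pending'}")
        lines.append(f"Vendor findings: {self.vendor_findings or 'pending'}")
        for when, note in self.updates:
            lines.append(f"  [{when:%Y-%m-%d %H:%M}] {note}")
        return "\n".join(lines)


form = OutageForm("Host crash", "esx-01")  # hypothetical host name
form.add_update("VMs restarted on surviving hosts")
form.recovery_summary = "Automated restart completed"
print(form.render())
```

Putting the "Last updated" line near the top reflects the point about status calls: a reader can see at a glance whether the information is current.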
Admins can choose to close forms before the issue is resolved. Not every crash has a resolution, and many can't be resolved without undue effort and cost. Some problems are simply one-time issues that likely won't occur again. Be honest about this in the documentation.
Documented status updates can also help establish a pattern of issues. A thorough IT outage communication plan can help admins pinpoint and troubleshoot recurring issues.
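If closed forms are kept in a consistent shape, spotting a pattern can be as simple as counting. The sketch below assumes each closed form has been reduced to a dictionary with hypothetical "host" and "cause" keys; the host names and causes are made up.

```python
from collections import Counter


def recurring_issues(closed_forms, min_count=2):
    """Return (host, cause) pairs that appear at least min_count times."""
    counts = Counter((f["host"], f["cause"]) for f in closed_forms)
    return {key: n for key, n in counts.items() if n >= min_count}


history = [
    {"host": "host-a", "cause": "unknown"},
    {"host": "host-a", "cause": "unknown"},
    {"host": "host-b", "cause": "firmware bug"},
]
print(recurring_issues(history))
```

A host that keeps crashing for an "unknown" cause is a good candidate for the longer offline investigation, even if each individual incident was closed without a resolution.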
IT outages aren't the end of the world. They're stressful, but with a few best practices, admins can handle them and communicate in a constructive manner.