An event like a complete data center power failure is something you never want to experience. Having recently gone through one I thought I would share some lessons learned from it.
This particular data center had a full UPS (uninterruptible power supply) system and backup diesel generator when a routine battery maintenance performed on the UPS shorted some circuits causing power loss to the entire data center. This event made me realize that a little preparedness can go along way in getting servers and virtual machines (VMs) back online after a power failure.
First and foremost, the DNS (Domain Name System) is probably the most important service in your data center. Most servers and workstations use DNS names instead of IP addresses to communicate with each other. Without DNS, servers can’t get to anything by hostname and will effectively be isolated from each other. Most administrators are used to using DNS names, so when DNS is not available they usually do not know the IP addresses of the server and subsequently can’t connect to them. So it is a good idea to have a hard copy of all your servers and their IP addresses somewhere in your data center for you to reference when DNS is not available.
Virtual servers can be even more problematic. If you have all your DNS servers virtualized which cannot be started because of network or shared storage issues, you can run into problems starting other servers and services that rely on DNS. Consider having at least one physical DNS server or having one or two DNS servers running on local storage instead of shared storage.
Another helpful insight: Make sure you know command line procedures for administration on your host servers. You may not be able to connect to your host via a graphic user interface (GUI) until certain systems are up so the command line can be your only way to check the host server health and perform VM operations. Again, it helps to have paper documentation of the host command line utilities and their syntaxes.
Finally you want to make sure you start your servers back up in the proper order due to dependencies that certain servers and applications have. Obviously, with the network unavailable, not much is going to function properly. The storage-area network (SAN) is also critical for your host servers that utilize shared storage for VMs. Windows servers also take a very long time to boot if a DNS server and domain controller are not available when they are starting.
Below is a general order for restarting your servers and applications.
- DNS servers
- DHCP servers
- Database servers
- Application/Web servers
The boy scout motto ‘be prepared’ holds true. A little preparation and planning can go along way to ensuring a smoother recovery.