What hardware or system options are available for improving virtual server reliability?
Improving server reliability starts with the careful selection of server hardware and features, beginning with redundant power supplies. Enterprise-class servers typically incorporate two modular power supplies, either of which can power the server on its own. When one power supply fails, the other takes over without disrupting the system. Redundant power supplies are certainly not a new idea, but it's important to ensure that older, single-supply systems are replaced with redundant-supply versions in future technology refresh cycles.
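The benefit of a second supply can be estimated with simple availability arithmetic. The following is a minimal sketch, assuming the supplies fail independently and that either one can carry the full load; the 99% figure in the usage note is purely illustrative and not a vendor specification, and the function names are hypothetical.

```python
HOURS_PER_YEAR = 24 * 365


def parallel_availability(component_availability: float, count: int) -> float:
    """Availability of `count` redundant components when any one suffices.

    Assumes independent failures, which is optimistic if the supplies
    share a single power feed or circuit.
    """
    if not 0.0 <= component_availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    if count < 1:
        raise ValueError("count must be at least 1")
    # The group is down only when every component is down at once.
    return 1.0 - (1.0 - component_availability) ** count


def expected_downtime_hours(availability: float) -> float:
    """Expected hours of downtime per year for a given availability."""
    return (1.0 - availability) * HOURS_PER_YEAR
```

Under these assumptions, a single supply at 99% availability implies roughly 87.6 hours of supply-related downtime per year, while a redundant pair works out to about 0.9 hours. Real figures depend on how quickly a failed supply is replaced.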
Virtual machines run as images in server memory, so IT professionals can also enhance server availability by selecting systems with memory reliability technologies. For example, memory module sparing (hot sparing) provides a server with extra (spare) memory modules that the system can invoke when errors occur on another module. Memory patrol scrubbing scans memory in the background, independent of normal memory accesses, to locate and correct errors before they accumulate. Double device data correction (DDDC) can allow the server to recover from multi-bit errors caused by failures of up to two memory chips, while enhanced DDDC (dubbed DDDC+1) can detect and correct an additional single-bit error on top of the protection in DDDC. Memory mirroring duplicates memory content across two memory modules -- effectively providing RAID 1 in memory.
Other reliability tactics often include fitting the server with multiple network interface ports. When all of the hardware is working properly, the extra ports provide additional bandwidth and ensure connectivity for all of the server's workloads. If a network interface port fails, the remaining ports keep the server connected and minimize the disruption to virtual machines (VMs).
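The behavior described above, more bandwidth when every port is healthy and continued connectivity when one fails, can be modeled in a few lines. This is a simplified sketch rather than any vendor's teaming implementation; the Port and NicTeam names and the port speeds are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Port:
    name: str
    speed_gbps: float
    link_up: bool = True


class NicTeam:
    """A group of network ports that share a server's traffic."""

    def __init__(self, ports: List[Port]) -> None:
        if not ports:
            raise ValueError("a team needs at least one port")
        self.ports = list(ports)

    def healthy_ports(self) -> List[Port]:
        # Only ports with a live link can carry traffic.
        return [port for port in self.ports if port.link_up]

    def usable_bandwidth_gbps(self) -> float:
        # Bandwidth shrinks as ports fail, but does not drop to zero
        # until the last port is lost.
        return sum(port.speed_gbps for port in self.healthy_ports())

    def has_connectivity(self) -> bool:
        return bool(self.healthy_ports())
```

With two 10 Gbps ports, the team offers 20 Gbps when both links are up and 10 Gbps after one fails; VMs lose connectivity only if both ports are down.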
But IT professionals must do more than simply buy more reliable machines -- it's important to implement policies and procedures to address server faults as they occur. Remember that high-reliability technologies do not make servers immune from faults -- these technologies simply help the server to continue running when faults occur. When a memory module or power supply fails, the server's reliability is compromised until technicians are able to make repairs, so consider the alerting, failover processes and troubleshooting needs of the virtualized servers as well.
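One way to act on this is to alert whenever a server is running on reduced redundancy, not only when it goes down. The sketch below assumes component health is already being collected by some monitoring system; the RedundantGroup structure and the status labels are hypothetical and not taken from any particular tool.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class RedundantGroup:
    name: str
    installed: int      # components fitted in the server
    healthy: int        # components currently working
    required: int = 1   # minimum needed to keep the server running


def classify(group: RedundantGroup) -> str:
    """Return 'ok', 'degraded' (still running, redundancy lost) or 'failed'."""
    if group.healthy < group.required:
        return "failed"
    if group.healthy < group.installed:
        return "degraded"
    return "ok"


def alerts(groups: Iterable[RedundantGroup]) -> List[str]:
    """Build one alert line for every group that is not fully healthy."""
    return [
        f"{g.name}: {classify(g)} ({g.healthy}/{g.installed} healthy)"
        for g in groups
        if classify(g) != "ok"
    ]
```

A server with one of two power supplies failed would still be running, but this check reports it as degraded so that technicians can schedule a repair before a second fault causes an outage.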
Software tools for virtual server reliability
One of the most effective approaches for mission-critical server reliability is the use of server clusters. A cluster is a group of servers, each of which runs redundant VMs. When one server fails, clustering software removes the troubled server from the cluster and another copy of the VM takes over without disruption. The principal advantage of clustering is that servers within the cluster can often forgo many high-reliability features -- control is simply given to another server in the cluster.
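The failover logic described above can be illustrated with a toy model. This is a minimal sketch, assuming each VM has copies on an ordered list of hosts and that the first surviving host in that list is the active one; the Cluster class and its method names are hypothetical and do not correspond to any clustering product's API.

```python
from typing import Dict, List, Optional, Set


class Cluster:
    """Toy cluster: tracks member hosts and where each VM's copies live."""

    def __init__(self, placements: Dict[str, List[str]]) -> None:
        # VM name -> hosts holding a copy, in order of preference.
        self.placements = {vm: list(hosts) for vm, hosts in placements.items()}
        self.members: Set[str] = {
            host for hosts in placements.values() for host in hosts
        }

    def fail_host(self, host: str) -> None:
        # Remove the troubled server from the cluster.
        self.members.discard(host)

    def active_host(self, vm: str) -> Optional[str]:
        # The first host still in the cluster runs the active copy.
        for host in self.placements[vm]:
            if host in self.members:
                return host
        return None  # every copy of this VM has been lost
```

If a VM is placed on host-a and host-b, failing host-a leaves host-b as the active host. The model omits state synchronization between copies, which is what allows real clusters to take over without disruption.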
Tools like Stratus Technologies' everRun adopt a similar approach, supporting synchronized copies of selected workloads across different servers. When the original workload is disrupted, the duplicate copy becomes active with little (if any) disruption. Although this is not clustering in the traditional sense, the approach achieves a level of redundancy that rivals true clustering.
There are other variations on this idea. For example, hypervisors like VMware's offer high-availability tools that can automatically restart affected VMs on other servers. Although the workload is briefly disrupted during the restart, the automated process helps organizations quickly recover applications that can tolerate a short outage.
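Restart-based high availability ultimately comes down to finding a surviving server with enough spare capacity for each affected VM. The sketch below is a simplified placement heuristic that considers memory only; it is not VMware's algorithm, and the function name, host names and sizes are assumptions made for illustration.

```python
from typing import Dict, Optional


def plan_restarts(
    vm_memory_gb: Dict[str, int],
    free_memory_gb: Dict[str, int],
) -> Dict[str, Optional[str]]:
    """Assign each affected VM to a surviving host, or None if nothing fits."""
    remaining = dict(free_memory_gb)  # copy so the caller's data is untouched
    plan: Dict[str, Optional[str]] = {}
    # Place the largest VMs first, since they are the hardest to fit.
    for vm, need in sorted(vm_memory_gb.items(), key=lambda kv: kv[1], reverse=True):
        candidates = [host for host, free in remaining.items() if free >= need]
        if not candidates:
            plan[vm] = None  # insufficient capacity; needs operator attention
            continue
        # Prefer the host with the most free memory to spread the load.
        target = max(candidates, key=lambda host: remaining[host])
        remaining[target] -= need
        plan[vm] = target
    return plan
```

A VM that maps to None in the result cannot be restarted anywhere, which is why spare capacity needs to be reserved across the remaining servers for this approach to work.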
These are only a few simple examples of the high-availability software options available for mission-critical enterprise workloads. IT planners must match these tools to the relative value of each workload and ensure that each VM receives appropriate protection.
This was first published in December 2013