Murphy's Law states that if something can go wrong, it will go wrong. Perhaps nowhere does this law apply more firmly than in the world of server virtualization. Back in the days of physical-only data centers, a server failure typically impacted a single workload. In contrast, virtualization hosts run multiple workloads, which means a single host failure has the potential to result in a major outage.
The vast majority of the organizations that use server virtualization make use of technologies such as failover clustering or replication as a way to protect against hypervisor-level failures. Although such technologies do go a long way toward protecting virtualized workloads, clustering alone can be inadequate. It is possible for a major outage to occur even if the virtualization hosts are clustered and the VMs have been made highly available. Such failures can occur if some piece of the virtualization infrastructure becomes a single point of failure.
Although it is possible to eliminate every conceivable single point of failure, doing so requires deep pockets. In most cases, organizations must identify potential risks and then evaluate the likelihood of each risk turning into a problem. That way, an organization can spend its money on the threats that are deemed to be the greatest risks.
Of course this raises the question of what potential single points of failure exist. The actual failure risks can vary considerably depending on which vendors are being used and how the virtualization infrastructure has been implemented. Some risks are hardware related, while others are software related.
Hardware-related risk applies to any piece of hardware whose failure could potentially bring down the entire virtualization infrastructure. Take power, for example. Many virtualization hosts are equipped with redundant power supplies. The idea is that a power supply failure will not bring down a host server because a secondary power supply can take over on the fly. Even so, administrators must consider what would happen if the power itself were to fail.
Virtualization hosts are usually connected to an uninterruptible power supply, and larger organizations may even have a generator that can take over in the event of a power failure. However, that generator could itself become a single point of failure if every server depends on the same generator once the main power has failed.
Of course, this is where the concept of risk assessment comes into play. A lot of things would have to go wrong before the failure of a backup generator could bring down the entire virtualization infrastructure. The power would have to go out, and all of the backup batteries would have to be depleted. Never mind the fact that the backup generator would have to malfunction. Consequently, the odds of the backup generator becoming a single point of failure are relatively low.
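The reasoning above can be sketched with a small calculation. The following is only an illustration: the probabilities are hypothetical placeholders rather than measured values, and the sketch assumes that the events are independent of one another, which may not hold in a real data center.

```python
from math import prod

def joint_failure_probability(event_probabilities):
    """Return the probability that every event in the chain occurs.

    Assumes the events are independent; correlated failures (for
    example, a storm that affects both utility power and the generator)
    would make the real figure higher.
    """
    return prod(event_probabilities)

# Hypothetical annual probabilities, chosen only for illustration.
chain = {
    "utility power fails": 0.2,
    "outage outlasts the backup batteries": 0.25,
    "backup generator malfunctions": 0.02,
}

p_outage = joint_failure_probability(chain.values())
print(f"Chance that all three events occur: {p_outage:.4f}")
```

Because every event in the chain has to occur, the combined probability is smaller than the probability of any single event, which is why the generator ranks low as a risk under these assumptions.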
As mentioned before, it is possible to eliminate virtually all single points of failure, but doing so would be very expensive. Imagine what would be involved in setting up a separate backup generator for each group of servers. Even that would not necessarily eliminate the potential for single points of failure. If the fuel for those backup generators all came from the same place, and the fuel happened to be contaminated with water, then the generator fuel could become a single point of failure. Again though, a lot of things would have to go wrong before that could happen.
In clustered environments it is much more common for the shared storage to become a single point of failure. Cluster storage is usually designed with redundant disks, but failures can occur at the array level, the switch level, or the cabling level if the proper degree of redundancy is not put into place.
On the software side, infrastructure servers can become a single point of failure if they are not implemented in a redundant manner. For example, suppose that an organization were to deploy System Center Virtual Machine Manager (SCVMM) as a tool for managing Hyper-V. SCVMM could become a single point of failure unless it resides on a highly available VM. Likewise, SCVMM depends on a SQL Server database, which could also become a single point of failure unless it is made redundant. Other potential single points of failure include DNS servers, domain controllers, DHCP servers, backup servers and Internet gateways.
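One way to look for this kind of dependency is to write the infrastructure down as a small map and test what happens when each component fails on its own. The sketch below is a simplified illustration, not a discovery tool: the component names (`scvmm-1`, `sql-1`, `dns-1`, `dns-2`) are hypothetical, the map has to be maintained by hand, and the code assumes there are no circular dependencies.

```python
# Each component lists its requirements. Each requirement is a set of
# interchangeable providers; any one healthy provider satisfies it.
# All names are hypothetical and exist only for this example.
DEPENDENCIES = {
    "vm-management": [{"scvmm-1"}],
    "scvmm-1": [{"sql-1"}, {"dns-1", "dns-2"}],
    "sql-1": [],
    "dns-1": [],
    "dns-2": [],
}

def is_available(component, failed, deps):
    """Return True if the component and everything it needs is healthy."""
    if component in failed:
        return False
    return all(
        any(is_available(provider, failed, deps) for provider in providers)
        for providers in deps.get(component, [])
    )

def single_points_of_failure(service, deps):
    """List components whose failure alone makes the service unavailable."""
    candidates = set(deps) - {service}
    return sorted(
        c for c in candidates if not is_available(service, {c}, deps)
    )

spofs = single_points_of_failure("vm-management", DEPENDENCIES)
print(spofs)
```

In this hypothetical layout the lone SCVMM server and its lone SQL Server instance are both reported, while the two DNS servers are not, because either one can cover for the other.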
It probably isn't realistic for most organizations to eliminate every potential single point of failure. A better strategy is to identify the single points of failure and then assess each one based on the risk that it poses.
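As a rough illustration of that assessment, each single point of failure can be given an estimated likelihood and an estimated outage cost, and the list can then be sorted by the product of the two. The scoring method and every figure below are assumptions made up for the example; a real assessment would use an organization's own estimates and possibly a different scoring model.

```python
from dataclasses import dataclass

@dataclass
class Risk:
    name: str
    annual_likelihood: float  # estimated chance of failure per year
    outage_cost: float        # estimated cost if the failure occurs

    @property
    def expected_loss(self):
        """Likelihood multiplied by cost: a simple expected annual loss."""
        return self.annual_likelihood * self.outage_cost

def rank_risks(risks):
    """Return the risks ordered from highest to lowest expected loss."""
    return sorted(risks, key=lambda r: r.expected_loss, reverse=True)

# Hypothetical figures, for illustration only.
risks = [
    Risk("shared backup generator", 0.001, 500_000),
    Risk("shared storage array", 0.05, 200_000),
    Risk("standalone SCVMM server", 0.10, 20_000),
]

ranked = rank_risks(risks)
for risk in ranked:
    print(f"{risk.name}: {risk.expected_loss:,.0f}")
```

With these made-up numbers the shared storage array comes out on top and the backup generator at the bottom, even though a generator failure would be the most expensive single event.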