EXPERT RESPONSE
People often ask me that question.
Although it may seem obvious, the more virtual machines (VMs) that are
put on a single physical server, the greater the detrimental effect
will be when that physical server suffers an outage. But you might
say, "Andrew, I thought the point of virtualization was to create a
zero-downtime infrastructure where if a physical server goes offline
then all of its VMs are just failed over to a separate physical
server?" You are correct; that is the idea in theory, but
unfortunately reality has a nasty habit of deviating from pristine
theory.
High availability technologies, such as VMware VI3 High
Availability (HA), work by creating a single, logical resource cluster
out of several physical servers that all have access to the same
shared storage (fibre channel (FC) and iSCSI typically). When a
physical server in the cluster fails, the cluster's HA agent directs
the remaining online physical servers to take ownership of the
offline VMs and restart them. Because there is a resource penalty
during a VM boot, the virtualization software typically starts only
three VMs at a time on a given host. If you have deployed a
virtualization solution on several 4U servers with eight processors
and 64 gigabytes (GB) of random access memory (RAM) in order to host
30 VMs on each server, then a single server failure forces all 30 VMs
to relocate. The speed at which this process happens depends
completely on the virtualization solution at hand; VMware VI3
offers one of the most intelligent HA agents available (and highly
so! HA! I kill me.).
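To put rough numbers on that restart penalty, here is a minimal sketch in Python. It is not how VMware's HA agent actually schedules restarts. The three-at-a-time limit comes from the paragraph above; the even spread of VMs across surviving hosts, the two-minute boot time, and the function names are all assumptions made for illustration.

```python
import math


def failover_waves(vms_on_failed_host, surviving_hosts, concurrent_boots_per_host=3):
    """Sequential restart waves needed after one host fails.

    Assumes (hypothetically) that the failed host's VMs are spread
    evenly across the surviving hosts.
    """
    vms_per_host = math.ceil(vms_on_failed_host / surviving_hosts)
    return math.ceil(vms_per_host / concurrent_boots_per_host)


def failover_minutes(vms_on_failed_host, surviving_hosts,
                     boot_minutes=2.0, concurrent_boots_per_host=3):
    """Rough time until the last VM is back, given an assumed boot time."""
    waves = failover_waves(vms_on_failed_host, surviving_hosts,
                           concurrent_boots_per_host)
    return waves * boot_minutes


if __name__ == "__main__":
    # 30 VMs lost, three surviving hosts, assumed 2-minute boots.
    print(failover_minutes(30, 3))
    # Same cluster size, but only 10 VMs per host to begin with.
    print(failover_minutes(10, 3))
```

Under these assumptions the 30-VM host needs four restart waves where the 10-VM host needs two, so the last VM on the denser host waits roughly twice as long to come back.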
The above scenario represents all too well a real-world situation that
provides evidence that a high consolidation ratio (30 VMs on a
single server) also represents a high risk ratio.
The more you consolidate, the greater
the risk you assume. This cannot be avoided. The same is true
of any type of consolidation project, physical or virtual. If you were
to consolidate 10 web sites on a single physical web server using
Apache virtual hosts then you have a consolidation ratio of 10 to 1,
but you have also assumed a risk ratio of 10 to 1. If one server goes
offline, you lose ten services. The level of risk is diminished with
virtualization through its high availability feature set, but it is
not erased. High availability failovers take time -- down time. The
question is: how much down time are you willing to risk for greater
consolidation? A rack with 42 1U rack-mounted servers that only host
5-10 VMs apiece will provide much speedier failover times, but you
have not consolidated physical hardware nearly as much as you could
with 4U servers, or even blade servers. Blade servers themselves
represent another type of consolidation, physical hardware
consolidation via use of shared components such as power supply units
(PSUs), network interface cards (NICs), and fibre channel host bus
adapters (HBAs) for SANs. This just goes to show that in a
virtualization project the number of VMs to place on a single server
is not the only consolidation choice you make; the type of server you
choose also affects the consolidation ratio. If you use blade servers
then you increase the consolidation ratio, also increasing assumed risk.
A data center that is risk free will have completely redundant power,
cooling, networking, and servers. And those servers will only host a
single service, and each service will operate in a cluster. Get the
picture? Being risk free is not without its own price -- the risk
ratio is inversely proportional to money spent.
To operate a close-to-risk-free data
center costs a pretty penny. The secret is to discover the most
harmonious balance between risk management and the amount of money
spent to get there. This will help you determine your consolidation
ratio, and subsequently assumed risk.
In conclusion, you should not approach a consolidation project with a
specific consolidation ratio in your head. Approach the project with
the goal of finding the middle ground between acquisition cost and
total cost of ownership (TCO) of an IT infrastructure and the amount
of money lost if that infrastructure, or part of it, goes down for a
given amount of time. This will help you determine the right
amount of assumed risk, which in turn should be your target
consolidation ratio.
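As a rough illustration of that balancing act, the sketch below compares candidate consolidation ratios by adding a yearly hardware cost to an expected yearly downtime cost. Every figure in it (host cost, failure rate, cost per VM-minute of downtime, boot time) is a made-up placeholder, and the model is deliberately pessimistic: it assumes all of a failed host's VMs restart on one surviving host, three at a time, and charges every VM for the full restart window. Treat it as a way to frame the question, not as a sizing tool.

```python
import math


def annual_cost(vms_per_host, total_vms=120, host_cost_per_year=10_000,
                failures_per_host_per_year=0.5, cost_per_vm_minute=50,
                boot_minutes=2.0, concurrent_boots=3):
    """Hardware cost plus expected downtime cost for one consolidation ratio.

    All default values are hypothetical placeholders.
    """
    hosts = math.ceil(total_vms / vms_per_host)
    hardware = hosts * host_cost_per_year
    # Worst case: every VM from the failed host waits for the last wave.
    outage_minutes = math.ceil(vms_per_host / concurrent_boots) * boot_minutes
    downtime = (hosts * failures_per_host_per_year
                * vms_per_host * outage_minutes * cost_per_vm_minute)
    return hardware + downtime


def best_ratio(candidates):
    """Return the candidate ratio with the lowest modelled annual cost."""
    return min(candidates, key=annual_cost)


if __name__ == "__main__":
    for ratio in (10, 30, 60):
        print(ratio, annual_cost(ratio))
    print("lowest modelled cost at ratio:", best_ratio([10, 30, 60]))
```

With these placeholder numbers the middle ratio comes out cheapest: the sparse layout pays too much for hardware and the dense layout pays too much for downtime. Change the downtime cost or the failure rate and the answer moves, which is exactly why the ratio should come out of the analysis rather than go into it.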