Eliminate I/O bottlenecks, improve VM performance

Virtualization deployments often introduce I/O bottlenecks at various levels of your physical infrastructure and undercut the gains of virtual servers. But with today’s tools, you can identify and loosen I/O choke holds at their source.

If you're a virtualization administrator struggling with input/output (I/O) bottlenecks, you're in a common, but not hopeless, situation. I/O clogs in virtualized infrastructures have many sources, including storage systems and networks. While some automated virtualization tools that can help you avoid I/O bottlenecks are on the way and some exist now, there's no panacea for network ills. So for the present, your best bet is to identify and ameliorate bottlenecks at their source.

In this article, I focus on the I/O bottlenecks that occur at various levels of your data center, which include CPU, memory, storage, the system bus, the network, the blade chassis and others. It's easy to get mired in predictions about the future and how you'll exploit next-generation data center automation tools to solve performance problems, but discussion of solutions on the horizon doesn't help you today. So I'll keep my discussion of upcoming I/O fix-it products to a minimum and focus on what's possible with technologies available now.

Products and promises

With today's products, you can get a handle on I/O bottlenecks using consolidation planning tools. Good ones can thoroughly analyze an environment, perform what-if modeling for prospective virtual machine (VM) physical host server platforms, and then map out a consolidation blueprint. CiRBA and PowerRecon from PlateSpin Ltd. (now Novell Inc.) are my favorite tools for this task.

Each VM's performance and I/O requirements will change over time, so if your company has room in its budget, I would recommend that you continue to monitor performance trends long after a consolidation project has concluded. Depending on the volatility of changes in VM performance patterns, a general rule of thumb is to conduct performance evaluations every three months.

Of course, technologies such as VMware Inc.'s Distributed Resource Scheduler (DRS), Virtual Iron Software Inc.'s Capacity Management or Platform Computing Corp.'s VM Orchestrator (VMO) enable a virtual infrastructure to adjust dynamically to changes in performance workloads, but these technologies are reactive, not proactive. Regular performance audits can spot potential performance bottlenecks before they trigger a DRS migration job, for example.

In the not-too-distant future, I expect many enterprise orchestration tools to perform preemptive live-migration jobs based on expected performance spikes. Such spikes are predicted based on historical performance data. So, for example, instead of waiting for VMware DRS to move a VM, an enterprise orchestration tool will move a VM to a physical host with greater available resources before the anticipated performance spike occurs.

If you look even further out, ultimately enterprise orchestration tools will add hardware (e.g., servers, storage and network resources) to VM clusters as performance demands require and then remove hardware when policies determine it's no longer needed.

I/O pain points

Data center managers are sometimes so eager to solve performance issues that they force a virtualization technology into an environment without first evaluating true needs.

So the essential rule of thumb here is that today's performance bottlenecks are best avoided by (1) following sound sizing practices up front and (2) not starting a consolidation project with a predetermined solution. Server vendors that offer free virtualization assessments, for example, may have a particular technology in mind before they even analyze your environment. As a result, your consolidated environment may be shoehorned onto a platform that's not ideal.

If you use a consolidation planning tool capable of modeling consolidation blueprints based on different prospective hardware platforms, exploit these features. If not, you need to consider the following sources of I/O contention when planning to consolidate systems via server virtualization:

  • Expected consolidation ratio
  • CPU
  • Memory
  • Storage I/O
  • Network I/O
  • System bus I/O
  • Blade chassis I/O
  • Application selection

Just as it's not beneficial to follow a predetermined solution, it's rarely beneficial to begin a consolidation project with an expected consolidation ratio in mind. It often forces organizations to choose between high-availability response and consolidation density.

Let's say you want a 30:1 consolidation ratio. For starters, you'll likely need a pretty large server (probably 4U) to handle the VMs' I/O requirements. Since several applications today are licensed based on physical server resources, your application licensing costs may increase. In addition, a major hardware failure on a server would result in the need to fail over 30 or more VMs to the remaining hosts in the VM physical host cluster.

By comparison, a more modest consolidation ratio of 15:1 could be handled by a 2U server, and VM failover/restart operations could complete more quickly. In addition, application software licensing may not be as costly. I've found that consolidation ratios of 8:1 to 12:1 are quite common. They meet the typical organization's consolidation needs while still offering plenty of headroom -- typically 30% to 40% of available resources -- for future expansion.
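To make the failover trade-off concrete, here is a minimal sketch of the arithmetic. The four-host cluster size and the function name `failover_load` are assumptions chosen for illustration; the text above does not specify a cluster size.

```python
import math

def failover_load(hosts: int, vms_per_host: int) -> int:
    """VMs each surviving host must absorb if one host in the cluster fails.

    Assumes the failed host's VMs are spread evenly across the survivors.
    """
    if hosts < 2:
        raise ValueError("need at least two hosts to fail over")
    return math.ceil(vms_per_host / (hosts - 1))

# Hypothetical four-host cluster, compared at the two ratios discussed above.
for ratio in (30, 15):
    extra = failover_load(hosts=4, vms_per_host=ratio)
    print(f"{ratio}:1 -> each surviving host absorbs up to {extra} more VMs")
```

The higher the ratio, the more restart work and spare capacity each surviving host needs, which is why the headroom figure matters as much as the ratio itself.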

At the extreme end, I've worked with organizations that wound up with 1:1 consolidation ratios when consolidating Exchange servers to VMs. The resulting benefit was the system portability offered by server virtualization.

When high consolidation ratios are a set requirement, you may need to take a heterogeneous approach to the virtualization consolidation project. Operating system (OS) virtualization solutions such as Sun Microsystems Inc.'s Solaris Containers or Parallels' Virtuozzo Containers exploit shared OS and application files and libraries. The result is a reduction in the overhead requirements of consolidated systems and, thus, higher consolidation ratios.

CPU Bottlenecks

The continued evolution of multicore CPUs has helped to remove the CPU as a bottleneck source, but it's still a consideration. When an organization draws up a consolidation plan, CPU and memory are often the first resources that are evaluated. Today's typical two-way, 2U server yields a total of eight CPU cores (four cores per CPU), which typically provide an average CPU core-to-VM ratio of less than 1:2 (one core for every two VMs). Again, I'm talking about an industry average. Many organizations have virtualized servers with higher consolidation numbers; ultimately, the workload requirements of each VM determine consolidation density.

When VM CPU performance is a chief consideration, a good rule of thumb is to limit the number of VM virtual CPUs running on a physical server to be less than or equal to the number of physical CPU cores on the server. By ensuring that the number of virtual CPUs does not exceed the number of physical CPUs or cores, you remove a significant performance burden from a hypervisor's CPU scheduler. Note that resource pools are required to guarantee correct CPU affinity levels.
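The rule of thumb above is easy to check with a few lines of code. This is a minimal sketch; the VM list and the function name `vcpu_to_core_ratio` are hypothetical, and the eight-core host matches the two-way, quad-core server described earlier.

```python
def vcpu_to_core_ratio(vcpus_per_vm: list[int], physical_cores: int) -> float:
    """Ratio of assigned virtual CPUs to physical cores on one host."""
    if physical_cores <= 0:
        raise ValueError("physical_cores must be positive")
    return sum(vcpus_per_vm) / physical_cores

# Hypothetical host: six VMs, two of them with two virtual CPUs each.
vms = [1, 1, 2, 1, 2, 1]
ratio = vcpu_to_core_ratio(vms, physical_cores=8)
status = "within the rule of thumb" if ratio <= 1.0 else "over the rule of thumb"
print(f"{sum(vms)} virtual CPUs on 8 cores: ratio {ratio:.2f} ({status})")
```

A ratio at or below 1.0 satisfies the guideline; anything higher means the hypervisor's scheduler has to time-slice cores among virtual CPUs.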

When it comes to CPU bottlenecks, physical resources are not the only concern. Hypervisors play a significant role in scheduling VM access to CPU resources, and the more virtual CPUs that you assign per VM, the greater the potential for virtual CPU bottlenecks. If VMs don't run applications as multiple threads, they won't see a benefit from multiple virtual CPUs. In fact, performance may actually degrade given the added CPU scheduling overhead introduced by the additional virtual CPUs. For VMware environments, you'll find more information on virtual CPU performance considerations in VMware's "Performance Tuning Best Practices for ESX Server 3" technical note.

Memory Bottlenecks

Memory bottlenecks often occur both in physical system memory and in shadow page tables, which hypervisors use to present memory to VMs. Physical memory-sizing guidelines vary with different virtualization platforms, so it's best to consult your preferred virtualization vendor's planning guides for specific details on memory sizing.

To determine memory requirements, you need to look at the hypervisor requirements as well as the peak memory requirements of each planned VM. Depending on the platform, determining memory requirements may not be as straightforward as adding up the hypervisor memory and the memory allocated to each VM. Assume, for example, that a hypervisor requires 512 MB of RAM (note that memory overhead varies depending on the number of VMs managed by a hypervisor), that four VMs each require 512 MB of RAM and that an additional four VMs each require 1,024 MB of RAM. If you add up the total memory required, you have 512 MB + (4 x 512 MB) + (4 x 1,024 MB), which equals 6,656 MB, or 6.5 GB, of required RAM.
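The baseline sum above can be reproduced with a short script, which makes it easy to rerun as VM counts change. This is only a sketch of the simple additive case; the function name is illustrative, and per-VM hypervisor overhead is not modeled.

```python
def required_memory_mb(hypervisor_mb: int, vm_allocations_mb: list[int]) -> int:
    """Baseline RAM requirement: hypervisor memory plus every VM's allocation."""
    return hypervisor_mb + sum(vm_allocations_mb)

# Four VMs at 512 MB and four at 1,024 MB, as in the example above.
vms = [512] * 4 + [1024] * 4
total_mb = required_memory_mb(512, vms)
print(f"{total_mb:,} MB = {total_mb / 1024:.1f} GB")
```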

There are a couple of X factors that ultimately skew the actual physical memory requirement: memory overcommit support and shared memory support. With memory overcommit, you can allocate more memory to VMs than the physical memory that exists on a system. This feature is especially useful for VMs whose performance spikes occur at different times. VMware, for example, manages over-committed memory using its memory balloon driver (vmmemctl).

With memory sharing, redundant physical memory pages can be consolidated into shared read-only memory pages, reducing physical memory requirements by up to 40%. Grouping like operating systems (e.g., Windows Server 2003) and like applications (e.g., SQL Server 2005) will provide the best memory-sharing results. Of course, the hypervisor needs to support memory sharing in order to realize this benefit.
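A rough sketch of how these two factors change the sizing picture follows. The 4 GB host and the function names are assumptions for illustration, and applying the savings only to VM memory (not to the hypervisor's own footprint) is a simplification. Actual savings depend on how similar the guests are, so treat 40% as a ceiling rather than an expectation.

```python
def overcommit_ratio(vm_allocations_mb: list[int], physical_mb: int) -> float:
    """Memory allocated to VMs divided by the physical memory on the host."""
    return sum(vm_allocations_mb) / physical_mb

def vm_memory_after_sharing(vm_allocations_mb: list[int], savings_fraction: float) -> float:
    """VM memory footprint after page sharing removes a fraction of it."""
    if not 0.0 <= savings_fraction <= 1.0:
        raise ValueError("savings_fraction must be between 0 and 1")
    return sum(vm_allocations_mb) * (1.0 - savings_fraction)

vms = [512] * 4 + [1024] * 4           # four 512 MB VMs and four 1,024 MB VMs
print(f"overcommit on a hypothetical 4 GB host: {overcommit_ratio(vms, 4096):.2f}x")
for savings in (0.0, 0.2, 0.4):        # 0.4 is the upper bound cited above
    footprint = vm_memory_after_sharing(vms, savings)
    print(f"{savings:.0%} shared -> {footprint:,.0f} MB of VM memory")
```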

Finally, you can't overlook the latency introduced by shadow page tables. Latency is most often visible when multithreaded applications run inside a VM that services a heavy load (greater than 80 concurrent connections). At that point, application latency is noticeable to connected clients. Note that suspected memory bottlenecks relating to shadow page table latency are often confirmed by running a performance monitor and checking for a high number of page faults in a VM's guest OS. Ultimately, shadow page table latency results in some enterprise applications simply not being virtualized on a server virtualization platform; alternatively, organizations can look to OS virtualization architectures for high-performance workloads.

Hardware-assisted memory virtualization (i.e., Intel extended page tables, AMD nested page tables), which will be available later in 2008, will enable VMs to manage their own physical page tables and, as a result, improve memory performance. That being said, vendors with which I have worked have confirmed that some applications perform better with shadow page tables than with hardware-assisted memory virtualization. So thus far, it doesn't look like hardware-assisted memory virtualization is the cure-all that some had hoped for.

Storage I/O Bottlenecks

The majority of server virtualization deployments are configured for high availability and, as a result, are heavily reliant on shared networked storage. As the number of VMs on a physical host increases, shared storage can quickly lead to storage I/O bottlenecks. The following issues are the typical root causes of storage I/O bottlenecks:

  • limited expansion slots in the physical server platform;
  • limited storage I/O bandwidth; and
  • storage I/O contention.

When evaluating server platforms for virtualized environments, physical server expansion slots are a significant consideration. In addition to the number of slots, the slot type (e.g., PCI Express [PCIe], Peripheral Component Interconnect Extended [PCI-X]) should be a consideration as well. The table "Popular 2U Server Expansion Capabilities" compares the expansion capabilities among popular 2U form-factor servers.

Since you have a limited number of PCIe slots to work with, how you divide them up is a significant design consideration. Two physical interfaces should be dedicated to storage, with load balancing and failover support provided by multipath device drivers. Depending on the storage requirements, 1 Gbps or 2 Gbps I/O connections may not be enough. You may need to deploy 4 Gbps or 8 Gbps Fibre Channel (FC), 10 gigabit Ethernet (GbE) or InfiniBand.

Naturally, upgrading to a faster storage transport such as 4 Gbps FC may require you to upgrade your edge storage area network (SAN) switches in order to realize the total bandwidth available to each 4 Gbps FC adapter. The average difference in price between dual-port 4 Gbps FC and dual-port 8 Gbps FC is about $500, so if there's room in the project budget, you could consider going with 8 Gbps FC adapters rather than replacing 4 Gbps adapters at a later date as I/O requirements change. When additional storage I/O bandwidth is needed, with 8 Gbps FC interfaces already in place, you'll just need to upgrade your SAN switches to increase your storage bandwidth.

Using two dual-port adapters gives you a total of four storage ports. Assuming dual-port 4 Gbps FC, you'll have a total of 16 Gbps of available storage throughput. That number sounds high, but remember that it will be shared by all VMs on a physical host. Assuming 12 VMs on a host, 16 Gbps of shared storage I/O leaves you with an average of 1.33 Gbps of storage bandwidth per VM. In comparison, an equal number of 1 Gbps iSCSI or FC ports leaves you with 0.33 Gbps of storage I/O per VM.
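The per-VM averages above come from a single division, sketched here so you can substitute your own port counts and VM densities. The function name is illustrative, and the result is a theoretical average that ignores protocol overhead and multipath behavior.

```python
def avg_storage_gbps_per_vm(ports: int, gbps_per_port: float, vms: int) -> float:
    """Average storage bandwidth per VM when all ports are shared evenly."""
    if vms <= 0:
        raise ValueError("vms must be positive")
    return ports * gbps_per_port / vms

# Two dual-port adapters (four ports) shared by 12 VMs.
fc_4g = avg_storage_gbps_per_vm(ports=4, gbps_per_port=4, vms=12)
one_g = avg_storage_gbps_per_vm(ports=4, gbps_per_port=1, vms=12)
print(f"4 Gbps FC ports: {fc_4g:.2f} Gbps per VM")
print(f"1 Gbps ports:    {one_g:.2f} Gbps per VM")
```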

But because each VM's storage I/O spikes can come at different times, averaging storage I/O isn't always realistic. I prefer to plan for worst-case scenarios and err on the side of caution. Again, when you do the math, it's easy to see how 1 Gbps storage interconnects could cause a problem. Still, for some virtualization projects, 1 Gbps iSCSI is fine. Indeed, Dell EqualLogic Inc. and LeftHand Networks Inc. have plenty of happy customers to prove it. Ultimately, historical performance data and expected future requirements should drive the decision about which storage interface to purchase.
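One way to plan for the worst case is to compare total link capacity against the sum of every VM's peak demand, as if all the spikes overlapped. The sketch below does that; the per-VM peak figures are invented for illustration and would normally come from your historical performance data.

```python
def worst_case_headroom_gbps(link_gbps: float, peak_demand_gbps: list[float]) -> float:
    """Capacity left over if every VM hits its peak at once.

    A negative result means the links would be oversubscribed.
    """
    return link_gbps - sum(peak_demand_gbps)

# Hypothetical peak storage demand (Gbps) for 12 VMs on one host.
peaks = [0.2, 0.5, 0.3, 1.0, 0.4, 0.6, 0.2, 0.3, 0.8, 0.5, 0.4, 0.3]

for label, capacity in (("4 x 4 Gbps FC", 16.0), ("4 x 1 Gbps iSCSI", 4.0)):
    headroom = worst_case_headroom_gbps(capacity, peaks)
    verdict = "fits" if headroom >= 0 else "oversubscribed"
    print(f"{label}: headroom {headroom:+.1f} Gbps ({verdict})")
```

With these made-up peaks, the 1 Gbps configuration is fine on average but falls short when the spikes coincide, which is the gap that averaging hides.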

Storage I/O contention is the result of a limited number of physical storage ports and limited storage I/O bandwidth. Ideally, you want to avoid overlapping I/O performance spikes on the same physical server. Backup is a typical storage I/O performance drain, and storage I/O challenges related to backup can be avoided by leveraging serverless backups. While booting from SAN is an option for many virtualization platforms, I prefer to keep the hypervisor and the hypervisor's swap file on local hard disks. This way, paging operations remain on a dedicated I/O channel and will not affect VM storage I/O flowing through networked storage (e.g., FC, iSCSI or Network File System).

Network I/O Bottlenecks

Most popular 2U servers today come with two onboard 1 GbE network interfaces that include TCP offload engine support. Many organizations prefer to use two network interface cards (NICs) configured as a team for hypervisor console connections and hypervisor cluster heartbeat communications, so you shouldn't count the embedded NICs (if you have only two of them) when scoping your VM network requirements. Assuming that you dedicate two server PCIe slots to storage, you'll have up to two expansion slots left for network interfaces. You can maximize the two slots by installing two quad-port 1 Gbps network interfaces, which gives you eight available gigabit ports to divide among virtual switches. Depending on the ultimate consolidation ratio, eight 1 Gbps ports are usually adequate.

As 10 GbE interfaces and switches decrease in price, we should expect that 10 GbE interfaces will become common for both network and storage (via iSCSI or FC over Ethernet) interconnectivity.

Price is often a concern with virtualization projects, and quad-port interfaces are costly in comparison with dual-port or single-port interfaces. A dual-port Intel Pro/1000 PT adapter, for example, retails for $190, while a quad-port Pro/1000 PT adapter retails for around $475. So the quad-port adapter costs about 2.5 times as much. But on shared systems where I/O throughput is a paramount concern, added ports are well worth the investment. The alternative, if limited network I/O results in lower consolidation ratios, is to purchase additional physical servers and pay for the related hardware, power, cooling and maintenance. In the end, the added throughput offered by quad-port adapters is usually easy to justify.
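The price comparison above is easier to weigh per port and per slot. A minimal sketch, using the two retail prices quoted in the text; the function name is illustrative.

```python
def cost_per_port(price_usd: float, ports: int) -> float:
    """Adapter price divided by the number of ports it provides."""
    return price_usd / ports

dual = cost_per_port(190, 2)    # dual-port adapter at $190
quad = cost_per_port(475, 4)    # quad-port adapter at around $475
print(f"dual-port: ${dual:.2f} per port, 2 ports per slot")
print(f"quad-port: ${quad:.2f} per port, 4 ports per slot")
print(f"price ratio, quad to dual: {475 / 190:.1f}x")
```

The quad-port card costs more per port, but it doubles the ports you get from each scarce expansion slot, which is the resource that actually limits a 2U server.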

With networking, upstream bottlenecks can't be overlooked either. It's always great to have plenty of available bandwidth on a physical host, but upstream devices such as routers can quickly become choke points if they go unmonitored.

System Bus I/O Bottlenecks

Having top-of-the-line interfaces is great, but if they're connected to a slow expansion bus, you'll never realize the full benefit of the interface. It's no coincidence that server vendors are engineering new server platforms with top-of-the-line PCIe expansion buses. If some of your servers are connecting multiport 1 Gbps NICs to legacy first-generation PCI-X interfaces, the source of a bottleneck may be the PCI-X bus itself and should not be overlooked.
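A quick back-of-the-envelope check shows how a multiport NIC can outrun its bus. The sketch below assumes a 64-bit, 66 MHz PCI-X slot and treats the bus as a shared parallel bus whose theoretical peak is width times clock, with traffic in both directions counted against it. Those figures are assumptions for illustration, so check your server's specifications for the real slot width and clock.

```python
def parallel_bus_gbps(width_bits: int, clock_mhz: float) -> float:
    """Theoretical peak of a shared parallel bus: width x clock, in Gbps."""
    return width_bits * clock_mhz / 1000

def nic_demand_gbps(ports: int, gbps_per_port: float, full_duplex: bool = True) -> float:
    """Peak traffic a NIC can generate; full duplex counts both directions."""
    return ports * gbps_per_port * (2 if full_duplex else 1)

bus = parallel_bus_gbps(width_bits=64, clock_mhz=66)    # assumed slot
demand = nic_demand_gbps(ports=4, gbps_per_port=1)      # quad-port 1 Gbps NIC
print(f"bus peak: {bus:.2f} Gbps, NIC peak demand: {demand:.0f} Gbps")
print("bus is the bottleneck" if demand > bus else "bus has headroom")
```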

Blade chassis I/O

Chassis I/O has long been the Achilles' heel of many blade servers. Some blade vendors have overcome limitations in chassis I/O, but others have not. Some vendors, for example, offer as few as 18 physical network I/O ports per chassis. While that might seem like a reasonable number, let's assume that you'd like to run a modest 10 VMs per blade and load the blade chassis with 14 blades. This means that you'll have 140 VMs sharing 18 physical I/O ports. Such an architecture could be extremely I/O challenged.
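The chassis math above can be sketched in a few lines. The 1 Gbps port speed is an assumption added for illustration (the text gives only the port count), and the function name is hypothetical.

```python
def vms_per_chassis_port(blades: int, vms_per_blade: int, chassis_ports: int) -> float:
    """Average number of VMs sharing each physical chassis I/O port."""
    return blades * vms_per_blade / chassis_ports

sharing = vms_per_chassis_port(blades=14, vms_per_blade=10, chassis_ports=18)
port_gbps = 1.0    # assumed port speed, not stated in the text
print(f"{14 * 10} VMs over 18 ports: {sharing:.1f} VMs per port")
print(f"average of {port_gbps / sharing:.3f} Gbps per VM at {port_gbps:.0f} Gbps per port")
```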

When evaluating blade systems, I/O between individual blades is important. But equally important is the blade chassis's I/O capabilities, since that is what determines the bandwidth available for VMs communicating with systems outside the blade chassis.

Selecting virtualization applications with I/O in mind

When selecting applications to virtualize, limiting I/O bottleneck possibilities is a key consideration. Unfortunately, I can think of no virtualization planning scheme that provides a cut-and-dried formula for application selection. The immortal IT phrase "It depends" often creeps into discussions of virtualization planning.

But some approaches can help you sort through options and reduce confusion. So first, let's look at the common server applications and services, their common bottlenecks and their suitability for virtualization. The table "General Guidelines for Application/Service Virtualization Suitability" lists bottlenecks in order of occurrence. But you'll probably find that even the commonality of bottlenecks is debatable, as their source is highly dependent on each application's structure, as well as the configuration of the server on which it runs.

I don't like blanket statements about virtualization feasibility; an application's workload -- not the application -- tends to be the deciding factor. I've worked with plenty of organizations that have virtualized Oracle Database 11g, and I've worked with an equal number that could not virtualize Oracle. While the application was the same, the workload was completely different, and the workload analysis ultimately determined virtualization feasibility.

Also, a lot can be gained from exploiting the proper virtualization architecture for a given problem. For example, a high-performance Oracle database may not be feasible in a virtual machine but will run well as a Solaris Container. Alternatively, if the database needs the full resources of a physical server and is configured for high availability using Oracle Real Application Clusters (RAC), then it should be left alone.

As you determine whether an application or service is a good fit for virtualization, keep the following considerations in mind.

  • If the application requires the majority of the resources (i.e., more than 50%) on bare metal, it's probably a poor virtualization candidate. I/O virtualization and improvements in hardware-assisted virtualization (Intel VT, AMD-V) may change that down the road, but right now your best bet is to hold off on virtualizing such applications.
  • Enterprise database servers (Oracle, DB2 and SQL Server, for example) are resource-intensive and often require configurations of two or more virtual CPUs and 4 GB of RAM per virtual CPU. Remember that even if the consolidation ratio isn't high, the system portability gained by virtualization is still worth the effort.
  • Enterprise email servers (Exchange, Lotus Domino and GroupWise) should be sized in a way similar to enterprise database servers. Exchange 2003, however, cannot take advantage of more than 4 GB of RAM, so no benefit will be realized by allocating more than 4 GB of RAM to a VM running Exchange Server 2003.
  • For performance-intensive workloads, limit the number of virtual CPUs on a given physical host to less than or equal to the number of physical CPU cores on that system.

Availability of 10 Gbps network interfaces that support I/O virtualization, such as the Neterion x3100, will further allow I/O-bound services or applications to run in VMs. Of course, for Ethernet-based storage, it's likely that the storage array could become the bottleneck if VM storage isn't distributed among multiple arrays. Alternatively, you can evaluate products that perform in-band I/O caching, such as DataCore Software Corp.'s SANsymphony, which improve I/O performance because of the read/write caching that they provide.
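Two of the considerations listed above (the 50% bare-metal threshold and the virtual-CPU-to-core limit) lend themselves to a simple first-pass screen. The sketch below is only that: the `Workload` type, its field names and the sample figures are hypothetical, and a real assessment should rest on measured workload data rather than on a checklist.

```python
from dataclasses import dataclass

@dataclass
class Workload:
    name: str
    peak_utilization: float   # fraction of a bare-metal server used at peak
    vcpus: int                # virtual CPUs the VM would be given

def screening_notes(w: Workload, physical_cores: int, vcpus_in_use: int = 0) -> list[str]:
    """Return warnings for a candidate workload; an empty list means none."""
    notes = []
    if w.peak_utilization > 0.5:
        notes.append("uses more than 50% of bare metal: probably a poor candidate")
    if vcpus_in_use + w.vcpus > physical_cores:
        notes.append("would push virtual CPUs past the host's physical cores")
    return notes

# Hypothetical candidates for an eight-core host already running six virtual CPUs.
for candidate in (Workload("file server", 0.10, 1), Workload("database", 0.70, 4)):
    notes = screening_notes(candidate, physical_cores=8, vcpus_in_use=6)
    print(candidate.name, "->", notes or ["no warnings"])
```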

The bottleneck blues

Several months after a virtualization deployment, it's easy to get a case of the bottleneck blues. So it's key to continually monitor I/O performance and spot potential bottlenecks before they become problems. Tools such as Akorri BalancePoint or eG Innovations' eG Monitor can also help to isolate the cause of performance bottlenecks. Ultimately, avoiding I/O bottlenecks is part planning and part monitoring. A properly planned virtual infrastructure can be highly resilient in the face of I/O bottlenecks. Still, I/O requirements always change over time, and continual monitoring of virtual infrastructure performance can help you to solve I/O and performance problems before users ever know a problem exists.

About the author:
Chris Wolf, a senior analyst in the Data Center Strategies service at Midvale, Utah-based Burton Group, has more than 14 years of experience in the IT trenches and eight years of experience with enterprise virtualization technologies. Wolf provides enterprise clients with practical research and advice about server virtualization, data center consolidation, business continuity and data protection. He is the author of Virtualization: From the Desktop to the Enterprise, the first book published on the topic, and has published dozens of articles on advanced virtualization topics, high availability and business continuity.
