Virtualization places greater demands on server memory, which makes careful memory allocation more important. Allocating less memory than a workload needs can seriously degrade its performance, forcing a virtual machine to rely on disk swapping to make up the shortfall. Conversely, allocating too much memory wastes resources and reduces the potential for consolidation.
Modern server designs also present allocation challenges in the way that processors and memory interact under non-uniform memory access (NUMA). This FAQ explains the concept of NUMA and shows how it affects memory allocation for virtual machines.
What is NUMA, and how does it affect memory and virtual machine (VM) performance on my servers?
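Non-uniform memory access (NUMA) is a memory design for multiprocessor servers in which the time a processor needs to reach memory depends on where that memory is located relative to the processor.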
NUMA changes the way that memory is seen by the processors. It does so by assigning a portion of the server's memory to each processor. Each portion (or block) of memory is called a NUMA node, and the associated processor can access its own NUMA node faster, and with less contention, than it can access other NUMA nodes (the blocks of memory assigned to other processors) on the server.
The concept of NUMA is closely related to cache. Processors run much faster than memory, so data is often moved to faster, more local cache memory, where the processor can reach it much sooner than it could in main memory. NUMA essentially gives each processor its own portion of the total system memory, reducing the contention and delays that occur when multiple processors try to use the same memory.
NUMA is completely compatible with server virtualization, and NUMA still allows any processor to access any memory on the server. A processor can certainly access data that is located in a different block of memory, but requests must travel outside of the local NUMA node and be accepted by the target NUMA node. This adds some overhead that slows overall performance of the CPU and memory subsystem.
How does NUMA fit into a typical data center architecture?
NUMA is a hardware-level architecture that requires support from the processors and underlying server chipset. For example, Intel announced NUMA compatibility in late 2007 for its Nehalem and Tukwila processor architectures, which share the same chipset. To support NUMA nodes, the traditional front-side bus (FSB) approach to connecting processors and memory was replaced with a new point-to-point processor interconnect called the QuickPath Interconnect (QPI). AMD Opteron processors provide a similar interconnect called HyperTransport.
Thus, any server hardware that uses a traditional FSB architecture does not support NUMA, but today almost all x86 and Itanium servers support NUMA nodes. Support is provided through the selection of hardware devices and firmware. If you're not sure, check the server's memory specifications for NUMA support.
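Beyond the specifications, a running host can usually report its NUMA layout. The following is a minimal sketch that assumes a Linux host, where the kernel lists each node as a nodeN directory under /sys/devices/system/node with a cpulist file inside. The function names parse_cpulist and read_numa_nodes are illustrative, not part of any tool, and other operating systems and hypervisors expose this information differently.

```python
import os
import re

def parse_cpulist(text):
    """Expand a sysfs CPU list such as '0-9,20-29' into a list of CPU ids."""
    cpus = []
    for part in text.strip().split(","):
        if not part:
            continue  # empty input or stray comma
        if "-" in part:
            start, end = part.split("-")
            cpus.extend(range(int(start), int(end) + 1))
        else:
            cpus.append(int(part))
    return cpus

def read_numa_nodes(base="/sys/devices/system/node"):
    """Return {node_id: [cpu ids]} for each nodeN directory under base.

    Returns an empty dict when base does not exist (for example on a
    non-Linux system); this sketch treats that case as 'unknown'.
    """
    nodes = {}
    if not os.path.isdir(base):
        return nodes
    for name in os.listdir(base):
        match = re.fullmatch(r"node(\d+)", name)
        if not match:
            continue  # skip entries that are not NUMA node directories
        try:
            with open(os.path.join(base, name, "cpulist")) as handle:
                nodes[int(match.group(1))] = parse_cpulist(handle.read())
        except OSError:
            continue  # unreadable node entry: leave it out
    return nodes
```

Calling read_numa_nodes() on such a host and counting the keys gives the number of NUMA nodes the kernel sees. A single node does not by itself prove that the hardware lacks NUMA support, since firmware settings can also hide the topology.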
NUMA does not pose any compatibility problems for virtual machine (VM) workloads, but VMs should ideally be deployed to run within a single NUMA node. This keeps the processor from having to communicate with other NUMA nodes, which would reduce the workload's performance.
However, taking advantage of NUMA requires that the operating system and hypervisor support memory affinity so that the OS will not move workloads between processors and NUMA nodes. For example, Hyper-V in Windows Server 2008 and 2008 R2 does not handle memory affinity for NUMA nodes, so it's impossible to set up a VM on a particular NUMA node. Fortunately, Windows Server 2012 does a much better job of supporting memory affinity for NUMA nodes.
What are some of the challenges of using NUMA techniques?
NUMA is an excellent means of boosting computing performance, especially for symmetric multi-processing (SMP) tasks. But since workloads that cross NUMA node boundaries can suffer performance problems, administrators must have a strong understanding of each workload's memory requirements, along with the memory allocation skills needed to keep workloads within NUMA boundaries.
The problem with virtual machines and NUMA is that if a VM uses more memory than a single NUMA node contains, it will spill over to another NUMA node. This won't affect the VM's stability or prevent it from functioning, but the added overhead needed for the processor to communicate across multiple NUMA nodes can reduce the workload's performance. So the question is: Is the VM big enough to need more than one NUMA node?
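The effect of spilling over can be sketched with simple arithmetic. The example below assumes that memory accesses are spread evenly across the VM's memory and that a remote access costs 1.5 times a local one. That ratio is a placeholder chosen for illustration, not a measurement, and the function names are hypothetical.

```python
def split_vm_memory(vm_gb, node_free_gb):
    """Return (local_gb, remote_gb) for a VM placed on a node with node_free_gb free."""
    if vm_gb < 0 or node_free_gb < 0:
        raise ValueError("sizes must be non-negative")
    local_gb = min(vm_gb, node_free_gb)
    return local_gb, vm_gb - local_gb

def average_access_cost(local_gb, remote_gb, local_cost=1.0, remote_cost=1.5):
    """Weighted average access cost, relative to a purely local VM.

    Assumes accesses are spread evenly over the VM's memory.
    remote_cost=1.5 is a placeholder ratio, not a measured value.
    """
    total_gb = local_gb + remote_gb
    if total_gb == 0:
        return local_cost
    return (local_gb * local_cost + remote_gb * remote_cost) / total_gb

# A 12 GB VM on a node with 8 GB free keeps 8 GB local and spills 4 GB.
local_gb, remote_gb = split_vm_memory(12, 8)
relative_cost = average_access_cost(local_gb, remote_gb)
```

Under these assumptions, a VM that fits entirely in one node has a relative cost of 1.0, and the cost rises as a larger share of its memory lands on a remote node. Real workloads rarely touch memory evenly, so benchmarks remain the reliable guide.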
For example, suppose your server has two 10-core processors, for a total of 20 cores. If the server has 256 GB of memory installed and that memory is split evenly between the two processors, each NUMA node would be 128 GB. So as long as the VM is allocated less than 128 GB, chances are that it could run on one NUMA node (though it's not a guarantee, especially if the processor is already running other VMs within the NUMA node).
Reducing the VM memory allocation below the NUMA node size will increase the likelihood of the VM running on a single NUMA node. For example, if the server provides 8 GB NUMA nodes and there are already several VMs totaling 4 GB on the NUMA node, deploying another VM with 4 GB or more will almost guarantee that it will cross NUMA nodes. Instead, setting the VM memory size below 4 GB (if appropriate) will increase the likelihood of the VM remaining on one NUMA node. Also remember that a VM can end up spanning NUMA nodes if the hypervisor performs workload rebalancing, which may unexpectedly shuffle workloads across processors and memory spaces.
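The sizing rule in this example can be written as a quick check. The sketch below is only an illustration of that rule of thumb: the function pick_node and its parameters are hypothetical, it assumes every node is the same size, and it ignores hypervisor overhead and rebalancing.

```python
def pick_node(vm_gb, node_size_gb, used_gb_per_node):
    """Return the index of the first NUMA node the VM should fit in, or None.

    Follows the rule of thumb in the text: the VM must be smaller than
    the node's free memory, so a VM equal to the free space is rejected.
    """
    for index, used_gb in enumerate(used_gb_per_node):
        free_gb = node_size_gb - used_gb
        if vm_gb < free_gb:
            return index
    return None  # the VM would likely span more than one node

# Two 8 GB nodes, each already hosting 4 GB of VMs.
fits_3gb = pick_node(3, 8, [4, 4])   # 0: fits on the first node
fits_4gb = pick_node(4, 8, [4, 4])   # None: expected to cross nodes
```

A result of None suggests either reducing the VM's allocation, if the workload allows it, or accepting that the VM will span nodes.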
Benchmarking tools can assist administrators in identifying and resolving VM performance problems. For example, if a VM requires more memory, benchmark the VM's performance before allocating the additional memory, and then run new benchmarks for comparison. If VM performance declines, it may be a cue to check for NUMA node performance issues.
This was first published in November 2013