Environments that support supercomputing and high-performance computing have traditionally been deployed to bare-metal hardware. Improvements in virtualization technologies now make it possible to run an HPC workload in a virtual environment, enabling IT teams to take advantage of benefits such as more flexibility, improved resource allocation and better fault isolation.
But these benefits can be realized only if the hardware and software are properly optimized to support the resource demands of HPC. Here are seven best practices for optimizing a virtualized HPC workload.
Plan your cluster around the HPC workload
Not all HPC workloads are the same, and your cluster must be configured for the workloads it will actually run. For example, HPC applications typically break down workloads into multiple tasks that run simultaneously. However, some of those workloads require far more communication between compute nodes during processing, which often calls for specialized hardware and software.
A workload's computational requirements will determine the number of compute nodes needed in the cluster, the hardware requirements for each node and any special software or firmware that should be installed. A set of management nodes to keep systems running, along with the resources necessary to maintain security and implement disaster recovery, is also required. Throughout the planning process, optimizing workflows should be at the forefront of your thinking.
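As a rough illustration of how computational requirements translate into a node count, the following Python sketch sizes a cluster from aggregate core and memory demand. The `NodeSpec` class, the function names and the per-node reservations for the hypervisor are all hypothetical values chosen for the example, not vendor guidance. Management nodes are not included in the count.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeSpec:
    """Hypothetical hardware profile for one compute node."""
    cores: int
    memory_gib: int
    reserved_cores: int = 2        # assumed hypervisor overhead
    reserved_memory_gib: int = 16  # assumed hypervisor overhead


def ceil_div(a: int, b: int) -> int:
    """Integer division that rounds up."""
    return -(-a // b)


def compute_nodes_needed(workload_cores: int,
                         workload_memory_gib: int,
                         node: NodeSpec) -> int:
    """Return the node count that satisfies both CPU and memory demand."""
    usable_cores = node.cores - node.reserved_cores
    usable_memory = node.memory_gib - node.reserved_memory_gib
    if usable_cores <= 0 or usable_memory <= 0:
        raise ValueError("node has no capacity left after reservations")
    # The scarcer resource decides how many nodes are required.
    return max(ceil_div(workload_cores, usable_cores),
               ceil_div(workload_memory_gib, usable_memory))


node = NodeSpec(cores=64, memory_gib=512)
print(compute_nodes_needed(workload_cores=1024,
                           workload_memory_gib=4096,
                           node=node))
```

A real plan would also account for interconnect requirements, GPUs and growth, which this estimate ignores.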
Take server configuration seriously
Before HPC applications are deployed, compute nodes must be configured to support a high-demand virtual environment. One of the most important areas to address is the system BIOS. Not only should each host server be configured with the latest version of BIOS, but the BIOS settings must be optimized to run the hypervisor and its virtualized workloads.
In addition to system BIOS, you'll need to configure other server components, but be careful how you proceed. No setting is universal when it comes to HPC. For example, in some cases, you might consider enabling memory interleaving, but this should depend on the system's hardware, installed hypervisor and supported workloads. These factors should be considered for all system settings.
Don't forget the hypervisor
No two hypervisors are the same, and their ability to support an HPC workload will vary. Because HPC virtualization is still a relatively young technology, pick a hypervisor with a proven track record in this area.
Whichever hypervisor is deployed, use the most recent version, especially if it includes features to better handle HPC. The hypervisor should also be configured to support the specific workloads. For example, you might need to configure settings related to memory access, power management or GPU access. For many of these settings, look for recommendations from the hypervisor vendor. If a vendor does not provide configuration recommendations for the HPC workload, you might consider a different hypervisor.
Prioritize GPU configurations
Few components in an HPC cluster will have a greater impact on workloads than the GPUs and how they're implemented in your virtual environment. Most major GPU manufacturers offer software that integrates with the hypervisor to virtualize the GPUs and make them available to the VMs. However, the way in which virtual GPUs are implemented depends on the selected hypervisor, supported workloads and the GPUs themselves.
GPU virtualization will need to be addressed on a case-by-case basis, but in general, configurations will fall into one of three categories: One VM maps to one GPU, one VM maps to multiple GPUs and multiple VMs map to one GPU. Most HPC workloads require one of the first two configurations, which rely on pass-through technologies that enable VMs to communicate directly with the physical GPUs, cutting out the hypervisor overhead.
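To make the three categories concrete, the following Python sketch classifies a planned VM-to-GPU assignment. The VM names, GPU identifiers and the `classify_gpu_mappings` function are hypothetical. The code only inspects a planning table and does not call any hypervisor or GPU vendor API.

```python
from collections import Counter


def classify_gpu_mappings(assignments: dict[str, list[str]]) -> dict[str, str]:
    """Label each VM with the mapping category its GPU assignment falls into."""
    # Count how many distinct VMs reference each physical GPU.
    usage = Counter(gpu for gpus in assignments.values() for gpu in set(gpus))
    categories = {}
    for vm, gpus in assignments.items():
        unique = set(gpus)
        if not unique:
            categories[vm] = "no GPU"
        elif any(usage[gpu] > 1 for gpu in unique):
            categories[vm] = "multiple VMs to one GPU"
        elif len(unique) == 1:
            categories[vm] = "one VM to one GPU"
        else:
            categories[vm] = "one VM to multiple GPUs"
    return categories


plan = {
    "vm-a": ["gpu0"],
    "vm-b": ["gpu1", "gpu2"],
    "vm-c": ["gpu3"],
    "vm-d": ["gpu3"],
}
for vm, category in classify_gpu_mappings(plan).items():
    print(vm, "->", category)
```

In this plan, only `vm-c` and `vm-d` share a GPU, so they are the ones that could not use a dedicated pass-through configuration.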
Properly size and configure VMs
Getting the VM size right is crucial to planning your HPC environment. Some workloads are more CPU-intensive, others require a greater amount of reserved memory and still others need both. As part of this process, you need to determine how many VMs can be hosted on each cluster node without compromising performance.
Assess which guest OS to install for the HPC workloads and how the OS should be configured to maximize performance, keeping in mind such factors as firewall rules and security policies. Before installing the OS, determine whether the VM will be configured to use Unified Extensible Firmware Interface booting, which is the preferred approach for an HPC workload. Some workloads might benefit from CPU oversubscription, although, like many other strategies, this can be detrimental to certain HPC workloads.
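The question of how many VMs fit on a node, and what CPU oversubscription changes, can be sketched in a few lines of Python. The `vms_per_host` function, the default hypervisor reservations and the example host and VM sizes are assumptions made for illustration. Memory is treated as fully reserved and is never oversubscribed here.

```python
def vms_per_host(host_cores: int,
                 host_memory_gib: int,
                 vm_vcpus: int,
                 vm_memory_gib: int,
                 vcpus_per_core: float = 1.0,
                 reserved_cores: int = 2,
                 reserved_memory_gib: int = 16) -> int:
    """Return how many identical VMs fit on one host.

    vcpus_per_core is the CPU oversubscription ratio; 1.0 means none.
    The reserved_* defaults are assumed hypervisor overhead.
    """
    if vcpus_per_core <= 0:
        raise ValueError("vcpus_per_core must be positive")
    schedulable_vcpus = int((host_cores - reserved_cores) * vcpus_per_core)
    by_cpu = schedulable_vcpus // vm_vcpus
    by_memory = (host_memory_gib - reserved_memory_gib) // vm_memory_gib
    # The tighter of the two limits wins; never report a negative count.
    return max(0, min(by_cpu, by_memory))


# Hypothetical host: 64 cores, 512 GiB. Hypothetical VM: 16 vCPUs, 64 GiB.
print(vms_per_host(64, 512, 16, 64))                      # no oversubscription
print(vms_per_host(64, 512, 16, 64, vcpus_per_core=2.0))  # 2:1 oversubscription
```

With these numbers the host is CPU-bound without oversubscription and memory-bound at a 2:1 ratio. A higher VM count does not mean better performance, so any ratio above 1.0 should be validated against the actual workload.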
Don't treat storage as an afterthought
Without the right storage, an HPC offering is destined to fail. The storage system must meet performance demands, provide the necessary capacity and store data reliably and securely. It should also accommodate both application data and VM virtual disks and, in both cases, be optimized to meet the needs of the virtualized workloads.
Then determine which types of storage to use. For example, if VMs won't be moved around, local storage can probably be used for the virtual disks. On the other hand, application data usually requires shared storage so it can be accessed by all cluster nodes. An HPC workload requires high-performing storage systems that support parallel processing. Whatever type of storage is used, the systems must be optimized to meet workload demands. To this end, some offerings might benefit from software-defined storage or intelligent storage with self-optimization capabilities.
Configure the network with HPC in mind
The network that supports HPC workloads must be fast, reliable and easy to manage. The greater the demand for high throughput and low latency, the more important a performant network becomes. Your HPC environment requires high-speed networking components that can support all operations, and those components must be optimized to meet the needs of an HPC workload.
When optimizing your networks, consider communication requirements between the compute nodes as well as between the compute and storage nodes. And don't forget about the management network. Some HPC environments might benefit from software-defined networking, which abstracts the underlying hardware to increase agility and improve network flow.