High-performance computing, perhaps more than any other segment of computing, has traditionally meant dedicated use of supercomputer clusters. These clusters typically run applications consecutively for a single tenant, rather than under the multi-tenant, multi-app model commonly found in commercial clusters or the cloud.
As compute power has grown, businesses have made it their mission to get more out of their investments. That has pushed HPC toward a multi-tenant model in which many smaller users can share the computing resource. As a result, virtual HPC (vHPC) is on the rise, but it isn't without its challenges.
From HPC to vHPC
The need for HPC is obvious. In environments where graduate students and faculty can process data and simulate theories themselves, projects have accelerated by as much as 70%; HPC also allows for more thorough analysis and simulation. The entire spectrum of HPC use cases, from medical research to finding new gas fields, has seen similar improvements.
HPC -- especially supercomputer HPC -- typically employs leading-edge technology in each new installation. We're a far cry from mainframes as the performance leaders in computing. Most new supercomputers consist of top-end Xeon processors surrounded by a large amount of dynamic RAM and use remote direct memory access (RDMA) networks to reduce latency throughout the cluster. Most clusters use GPUs for parallelization to dramatically lift performance. In other words, HPC continuously pushes the edge of computing, and virtualization has taken a while to catch up with that necessary reality.
Areas in need of improvement include hardware assists for hypervisors, which reduce the overhead the hypervisor layer adds. RDMA support is relatively recent and still evolving, as the industry comes up with new ways to shrink overhead even further. GPU virtualization as a standard feature of VMware and other common hypervisors is also relatively new. The good news is that these pieces are now in place, and HPC virtualization is becoming more common in the various communities HPC serves.
Challenges of HPC and virtualization
Despite these improvements, there are still challenges on the software side. HPC is often highly tuned and, moreover, might run on a nonmainstream Unix or Linux distribution that has many proprietary tweaks. Examples include Catamount OS, which the U.S. Department of Energy's National Nuclear Security Administration Advanced Simulation and Computing Program uses on the Red Storm supercomputer; the Compute Node Linux used in some Cray models; and IBM's Compute Node Kernel. These are all lightweight kernels that minimize OS overhead.
It can be difficult to get these operating systems through any hypervisor certification process, especially if they involve custom device drivers. One might think the answer is to move to a certified and supported Linux distribution, but a major issue with parallelized operations is that a compute cycle -- say, an iteration of a simulation -- isn't complete until the last server finishes. Standard OS versions add unpredictable overhead, often called OS jitter, that effectively slows down the whole cluster. For user apps that push the performance envelope, such as a simulation of a nuclear explosion, these delays are too long for comfort. Still, for multi-tenant operations, the ease-of-use and hypervisor-support benefits of a standard environment usually outweigh those of a tightly tuned setup.
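The "last server finishes" point can be made concrete with a toy simulation. The node count, jitter probability and delay below are illustrative assumptions, not measurements; the sketch only shows why rare per-node interruptions stall every step of a tightly synchronized cluster.

```python
import random

random.seed(42)

NODES = 256
BASE_STEP_MS = 100.0   # ideal per-node compute time for one iteration
JITTER_PROB = 0.02     # assumed chance a node hits an OS interruption
JITTER_MS = 50.0       # assumed cost of that interruption

def step_time(jitter_prob):
    """One iteration takes as long as the slowest node's finish time."""
    times = [
        BASE_STEP_MS + (JITTER_MS if random.random() < jitter_prob else 0.0)
        for _ in range(NODES)
    ]
    return max(times)

# Average step time over many iterations, with and without jitter.
tuned = sum(step_time(0.0) for _ in range(1000)) / 1000          # lightweight kernel
noisy = sum(step_time(JITTER_PROB) for _ in range(1000)) / 1000  # standard OS

print(f"tuned kernel: {tuned:.1f} ms per step")
print(f"standard OS:  {noisy:.1f} ms per step")
```

Even though each node is delayed only 2% of the time, with 256 nodes almost every iteration has at least one straggler, so the whole cluster pays nearly the full jitter penalty on nearly every step.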
There's another aspect to ease of use. Researchers typically aren't data scientists, but there's enormous value in providing them with ready-made environments so they can operate autonomously. Academic researchers keep irregular hours, so waiting on a 9-to-5 IT admin is a serious schedule hit, especially on a Friday night. One good way to work around this is to maintain a library of prebuilt images, coupled with a usable template system for system setup scripts and so on.
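A self-service image library plus script templates can be sketched in a few lines. Everything here -- the catalog entries, image names and template fields -- is hypothetical and not tied to any particular vHPC product; the point is that a researcher can provision a known-good environment without waiting for an admin.

```python
from string import Template

# Hypothetical catalog of prebuilt, admin-vetted images.
IMAGE_LIBRARY = {
    "bioinformatics": "vhpc/bio-node:2024.1",
    "cfd":            "vhpc/cfd-node:2024.1",
}

# Hypothetical setup-script template a researcher fills in at any hour.
SETUP_TEMPLATE = Template("""\
#!/bin/sh
# Auto-generated node setup for $project
IMAGE=$image
NODES=$nodes
echo "Provisioning $nodes nodes of $image for $project"
""")

def render_setup(project, workload, nodes):
    """Fill the template from the image library for self-service provisioning."""
    return SETUP_TEMPLATE.substitute(
        project=project,
        image=IMAGE_LIBRARY[workload],
        nodes=nodes,
    )

script = render_setup("protein-folding", "bioinformatics", 64)
print(script)
```

A real deployment would hand the rendered script to the cluster's provisioning system; the template approach keeps the admin-curated images and the researcher-supplied parameters cleanly separated.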
Containers are starting to appear in the vHPC space. Once the technology matures a bit more, containers will be an efficient platform in parallelized environments because they are easy to use, quick to deploy and use less memory per instance than VMs.
Storage brings its own complications. Many HPC admins swear by Lustre or Gluster, but these can be a challenge to manage in a multi-tenant environment that demands large amounts of storage at short notice during a job run. Achieving the level of automation seen in the cloud is still more aspiration than reality here.
vHPC and the cloud
Speaking of the cloud, the big three cloud service providers (CSPs) -- Amazon Web Services, Google Cloud and Microsoft Azure -- are all keen to own a part of HPC virtualization. New "very large" instances and GPU instance platforms abound, and Google in particular aims to take a good share of the opportunity. On the downside, cloud cluster configurations -- especially network bandwidth -- are weak compared to well-tuned supercomputers, so the CSPs have to make up the performance with more instances. That doesn't always work, because the compute cycle still only completes when the last server finishes.
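A simplified cost model shows why adding instances can't fully compensate for a slow network. The alpha-beta style formula and all the numbers below are illustrative assumptions, not benchmarks of any provider: once the synchronizing data exchange dominates, doubling the node count barely improves the per-iteration time.

```python
import math

def step_time_ms(n_nodes, compute_ms=1000.0, msg_mb=64.0,
                 bw_gbps=10.0, hop_us=20.0):
    """Toy per-iteration time: parallel compute plus a synchronizing exchange."""
    if n_nodes == 1:
        return compute_ms
    compute = compute_ms / n_nodes
    latency = (hop_us / 1000.0) * math.log2(n_nodes)  # ms of hop latency
    transfer = msg_mb * 8.0 / bw_gbps                 # Mb / (Gb/s) comes out in ms
    return compute + latency + transfer

def speedup(n_nodes, bw_gbps):
    return step_time_ms(1) / step_time_ms(n_nodes, bw_gbps=bw_gbps)

cloud_64  = speedup(64,  bw_gbps=10.0)   # assumed commodity cloud network
cloud_128 = speedup(128, bw_gbps=10.0)   # same network, double the instances
rdma_64   = speedup(64,  bw_gbps=100.0)  # assumed RDMA-class fabric

print(f"cloud,  64 nodes: {cloud_64:5.1f}x speedup")
print(f"cloud, 128 nodes: {cloud_128:5.1f}x speedup")
print(f"RDMA,   64 nodes: {rdma_64:5.1f}x speedup")
```

Under these assumptions, doubling the cloud instance count buys only a marginal gain, while the same 64 nodes on a faster fabric deliver several times the speedup -- the communication step, not the instance count, is the bottleneck.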
One benefit of cloud competition, though, is that multi-tenant vHPC providers mainly offer cloud-like pay-as-you-go pricing, which makes supercomputing available even to small projects.