
Accessing Graphics Processing Units for Code Development



Author: Luke Abraham, SSI Fellow

Posted on 15 July 2024


[Image: a dark aisle of server racks illuminated by small green and blue equipment lights.]

I’ve written before about using AWS for training, and I have used this system for courses on the United Kingdom Chemistry and Aerosol (UKCA) model that I develop. The set-up worked well again for a course I organised recently, where I also learned that MobaXterm can provide a full graphical connection to the LXDE desktop used by the virtual machines.

A virtual machine as a development environment

Another interesting use for this environment is testing and development, particularly when developing code for different types of hardware, such as graphics processing units (GPUs). Buying new hardware just to run some porting tests can be expensive, but cloud computing gives you access to a wide range of hardware on which to test your code.

The complicated software stack of the Unified Model means that this virtual machine environment gives a similar experience to running on HPC resources such as ARCHER2 or Monsoon2, but with the ability to include GPUs as well as CPUs, making it easier to port code to run efficiently on these accelerators. This work was funded by the ExCALIBUR programme, and as a result a large number of UKCA’s routines have now been offloaded to NVIDIA GPUs using OpenACC. The initial porting work was done on Amazon EC2 instances.
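UKCA itself is Fortran, but the offloading pattern is the same in any language OpenACC supports. As a rough illustration only (the function, loop, and variable names below are invented for this sketch, not taken from UKCA), a single directive asks the compiler to run a loop on the GPU and manage the data movement:

```c
/* Hypothetical example: apply a first-order chemical loss rate to a
 * tracer field. The "acc parallel loop" directive offloads the loop
 * to the GPU, copying the arrays to and from device memory.
 * Build with the NVIDIA HPC compilers, e.g. nvc -acc decay.c */
void decay_tracer(int n, double *tracer, const double *loss_rate, double dt)
{
    #pragma acc parallel loop copy(tracer[0:n]) copyin(loss_rate[0:n])
    for (int i = 0; i < n; i++) {
        tracer[i] -= dt * loss_rate[i] * tracer[i];
    }
}
```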

Is there a cheaper option?

However, using cloud computing can be expensive, especially if you want more esoteric hardware or lots of GPUs. What if you already have a server somewhere with a suitable graphics card that you could make use of? The Met Office virtual machine configuration uses Vagrant and VirtualBox as standard and, unfortunately, VirtualBox doesn’t allow the virtualised “guest” machines to access the “host’s” GPU. I had also enabled this system to work with VMware Workstation Player, another virtualisation method, but it too does not let virtual machines access the GPU.

A solution can be found in libvirt, a toolkit for managing virtualisation platforms that does allow PCI passthrough for the GPU and works with Vagrant via the QEMU machine emulator and virtualiser. Some set-up is required to enable this, though. The BIOS settings need to be updated to allow this virtualisation, using the Intel VT-d (Virtualization Technology for Directed I/O) option, and kernel options need to be set so that the GPU, identified by its PCI addresses, is reserved exclusively for the guest machine. A second GPU may also be required for the host server, as it would otherwise have no display output.
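The exact steps vary by distribution, but on a typical GNU/Linux host using GRUB the sequence looks roughly like this. The vendor:device IDs below are placeholders; use the values that lspci -nn reports for your own card (the GPU and its audio function usually need to be passed through together):

```sh
# Enable the IOMMU in the BIOS/UEFI (the Intel VT-d option), then turn
# it on in the kernel command line by editing /etc/default/grub:
#   GRUB_CMDLINE_LINUX="intel_iommu=on iommu=pt vfio-pci.ids=10de:1eb0,10de:10f8"
# The vfio-pci.ids entry binds the GPU to the vfio-pci stub driver at
# boot so that the host never claims it.

sudo update-grub   # grub2-mkconfig -o /boot/grub2/grub.cfg on some distributions
sudo reboot

# After rebooting, confirm that vfio-pci now owns the NVIDIA devices:
lspci -nnk -d 10de:
```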

Once the hardware addresses of the GPU have been determined, for instance using lspci on a GNU/Linux host, it is straightforward to point to them in your Vagrantfile to enable GPU passthrough. Once your guest is up and running, you can install the necessary drivers, utilities such as nvidia-smi, and compilers, so that you can run the software you need on your GPU-enabled virtual machine. For the Unified Model, this also involved compiling the netCDF libraries and other dependencies with the NVIDIA compiler suite.
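With the vagrant-libvirt provider, pointing to those addresses is a small addition to the provider block. A minimal sketch, in which the box name, resources, and PCI address are all placeholders:

```ruby
# Vagrantfile sketch for GPU passthrough with the libvirt provider.
# The PCI address below corresponds to an lspci entry such as
# "06:12.5"; replace it with the address of your own GPU.
Vagrant.configure("2") do |config|
  config.vm.box = "generic/ubuntu2204"   # placeholder box

  config.vm.provider :libvirt do |libvirt|
    libvirt.memory = 16384
    libvirt.cpus   = 8
    # Hand the host GPU through to the guest; this requires the
    # VFIO/IOMMU set-up described above.
    libvirt.pci :domain => "0x0000", :bus => "0x06",
                :slot => "0x12", :function => "0x5"
  end
end
```

Once the guest boots, running nvidia-smi inside it is a quick way to confirm that the card is visible before building the rest of the software stack.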

Is it worth it?

Even though cloud computing can be expensive, is it really worth the effort of configuring all this on an existing server? There were a lot of steps to complete to get a working system, and once it is configured only one virtual machine can access the GPU at a time; the host operating system cannot use the GPU at all. You are also restricted to the GPU you have: my system had an NVIDIA Quadro RTX 5000, whereas the development work I was doing ideally required access to multiple V100s or A100s. As an alternative, supercomputing systems such as Isambard, JASMIN, DAWN, and now ARCHER2 provide access to compute nodes with multiple GPUs from NVIDIA, Intel, and AMD. While it was satisfying to set up this environment on my local server, in the end the larger hardware systems were the better option.

Image by National Centre for Atmospheric Science on flickr.
