
Running MPI Jobs


MPI Parallel Jobs

If your job just needs a number of CPU cores then the simplest option is to specify the compute partition and the number of tasks you require with --ntasks, where a task is equivalent to an MPI rank. SLURM will then select the number of nodes required to provide that number of tasks. The downside of this approach is that there are nodes with 12 cores and nodes with 20 cores, so if you specify the need for 20 cores without additional hints then the job may run on two 12 core nodes and you will be charged for 24 cores even though you are only using 20.

If you know you definitely want a multiple of 20 cores and a 1:1 mapping of tasks to physical cores, then select the appropriate partition for the 20 core nodes, but accept that if the 20 core portion of the system is busy your job will have a longer wait time before it runs.
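
For example, a minimal submission sketch for the 20 core partition (the partition name compute-20 is the one used in the submission example further down this page; the task count and script name are illustrative):

sbatch --partition=compute-20 --ntasks=40 myjob.sh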

If benchmarking suggests that your job will run acceptably with more tasks than there are cores on a node, then you may overload nodes via the SLURM option --ntasks-per-node=some_value, where some_value is greater than the number of cores on the node. This is not recommended without benchmarking: if your code has significant memory bandwidth requirements this approach is inappropriate. Conversely, if your memory bandwidth requirements are large, you may use --ntasks-per-node=some_value to under-load nodes. There are other SLURM options that allow even finer control; please read the SLURM documentation for more information.
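
As an illustrative sketch only (the values are hypothetical and the benefit depends on your code), the following directives under-load 20 core nodes by placing 10 ranks on each, so 40 ranks are spread over 4 nodes:

#SBATCH --ntasks=40
#SBATCH --ntasks-per-node=10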

It is not recommended to use the generic compute partition, as SLURM may try to run the job split across 12 and 20 core node types, unless you are running on just one node, which you can force with --nodes=1. In that case bear in mind the value of --ntasks= and the potential for overloading a node.
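
If you do use the generic compute partition for a single node job, a sketch submission (values illustrative) that keeps the task count within the smaller 12 core node type looks like:

sbatch --partition=compute --nodes=1 --ntasks=12 myjob.sh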

MPI with Bull xmpi

Bull xmpi is the default MPI version and is well integrated with SLURM.

#!/bin/bash
#SBATCH --time=00:01:00
#SBATCH --job-name=example
#SBATCH --output=slurm.out
#SBATCH --error=slurm.err
#SBATCH --ntasks=60
#SBATCH --account=ITTEST
#SBATCH --mail-type=ALL
#SBATCH --mail-user=a.turner@lboro.ac.uk

# Should be loaded by default, but we force it for certainty
module load bullxmpi

mpirun myprogram.exe

which you should submit with sbatch --partition=some_part myjob.sh where some_part is either compute-12 or compute-20 (60 is a multiple of both 12 and 20, so it fits into an integer number of nodes whichever partition is used).

In the above, --time specifies how long in the real world (wallclock time) the job is expected to run (in this case 1 minute), --job-name gives the job name, --output and --error the output locations, --ntasks the number of MPI ranks to use (which can run on either 12 or 20 core nodes), --account the account, and --mail-type and --mail-user the mail options. The last line actually runs the program: mpirun used without any parameters other than the program picks up the number of ranks defined by --ntasks=60 and an automatically generated hostfile.

MPI with Intel MPI

This is generally similar to the above except that:

You must ensure that your job script loads the Intel MPI module with module unload bullxmpi; module load intel_mpi before the mpirun command.

#!/bin/bash
#SBATCH --time=00:01:00
#SBATCH --job-name=example
#SBATCH --output=slurm.out
#SBATCH --error=slurm.err
#SBATCH --partition=compute
#SBATCH --ntasks=60
#SBATCH --account=ITTEST
#SBATCH --mail-type=ALL
#SBATCH --mail-user=a.turner@lboro.ac.uk

module unload bullxmpi; module load intel_mpi

mpirun myprogram.exe
        

MPI with openmpi

This is generally similar to the above except that:

You must ensure that your job script loads the openmpi module with module unload bullxmpi; module load openmpi before the mpirun command.

There is generally no advantage to using openmpi over bullxmpi unless the latest version of openmpi contains a feature or bugfix that you require. Bullxmpi is based on openmpi, but the installed version of openmpi is more recent than the version on which bullxmpi is based.
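
A sketch job script for openmpi, assuming the module is simply named openmpi as above; apart from the module line it mirrors the Bull xmpi example:

#!/bin/bash
#SBATCH --time=00:01:00
#SBATCH --job-name=example
#SBATCH --output=slurm.out
#SBATCH --error=slurm.err
#SBATCH --ntasks=60
#SBATCH --account=ITTEST
#SBATCH --mail-type=ALL
#SBATCH --mail-user=a.turner@lboro.ac.uk

# Swap the default MPI stack for openmpi before launching
module unload bullxmpi; module load openmpi

mpirun myprogram.exe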

Running when the number of cores required is less than a node

Use an allocation (see Allocations) and submit with sbatch into this. The number of tasks set in the allocation is the number of separate MPI jobs that can be run at once, the number of ranks for each individual MPI job is the value of --cpus-per-task set for each task, and -np must be used to pass this value to mpirun.
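
A minimal sketch, assuming the allocation was created with --cpus-per-task set (the program name follows the examples above); SLURM exports the value as SLURM_CPUS_PER_TASK inside the job, so it can be passed straight to -np:

# Each MPI job uses as many ranks as --cpus-per-task allows
mpirun -np $SLURM_CPUS_PER_TASK myprogram.exe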

Allocating Ranks Unequally Over Nodes

Whilst SLURM offers a facility for distributing ranks unequally (e.g. rank 0 alone on the first node for extra memory when gathering data from many worker ranks on other nodes), it is slightly unfriendly in that it expects a file of the form

hostnameforrank0
hostnameforrank1
hostnameforrank2
      

Thus there is a utility, mpi_hostfile_from_mapping, to generate such a file.

The output from this command should be sent to a file, e.g. within a job script, and can then be consumed either through SLURM's --distribution=arbitrary option or via an MPI hostfile, as in the example below.

module load utilities
# Generate the per-rank hostfile for this job from the mapping in some_index_file
mpi_hostfile_from_mapping some_index_file -n > hostfile.$SLURM_JOB_ID
# Launch 25 ranks placed according to the generated hostfile
mpirun -np 25 --hostfile hostfile.$SLURM_JOB_ID program arguments
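
The same hostfile can alternatively be consumed via SLURM itself. A minimal sketch, assuming the standard SLURM behaviour that --distribution=arbitrary takes the node list from the file named in the SLURM_HOSTFILE environment variable:

# Tell SLURM which node to use for each rank, then launch with srun
export SLURM_HOSTFILE=hostfile.$SLURM_JOB_ID
srun --ntasks=25 --distribution=arbitrary program arguments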