
IT Services : High Performance Computing

Small Jobs


Introduction

This advice applies when the degree of parallelism you wish to run at is less than the number of cores available on a node, but you wish to maximise the use of your allocation by filling nodes up with work. That is, rather than running a 4 core job on a 12 core node and wasting 8 cores, you would run three 4 core jobs on that node.

Where possible use parallel programs or applications.

Overview

The following is generic advice: by using sbatch, srun and Array Jobs you can submit any number of jobs of varying lengths and varying numbers of cores (fewer than 12 or fewer than 20, depending on the partition) as an allocation (a 'pipe' for your throughput) on the system, sized to minimise execution time while maximising the chance of your jobs running.

This is a little more complex than it might otherwise be because node exclusivity is enforced on the nodes, which requires some additional workarounds.

Key Concepts (Please Read)

--tasks-per-node

Use --tasks-per-node=somevalue to control the number of task instances run on a node. (The number of CPU cores available to each task is then 12/somevalue or 20/somevalue, depending on whether you submit to a partition with 12 or 20 core nodes.)

--cpus-per-task

Use --cpus-per-task=somevalue to indicate how many CPUs will be assigned to each unique job. For single core jobs somevalue will normally be 1. Make sure that this value multiplied by the value of --tasks-per-node equals the number of CPU cores on the nodes in the selected partition, unless you are doing something special (e.g. deliberately under-allocating a node). Under-allocating will still cost you the whole node, i.e. the number of CPU cores on the node multiplied by the run time, even if you don't use them all. Over-allocating may result in poor performance.
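
As a rough sanity check of that arithmetic, something like the following sketch could be run before submitting; the variable names and values here are purely illustrative.

# Sanity check: tasks per node x CPUs per task should equal the cores per node.
TASKS_PER_NODE=3
CPUS_PER_TASK=4
CORES_PER_NODE=12          # 20 on a 20 core partition
if [ $(( TASKS_PER_NODE * CPUS_PER_TASK )) -ne $CORES_PER_NODE ]; then
    echo "Warning: node not exactly filled" >&2
fi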

--nodes

You will set --nodes=1 to ensure that each group of jobs uses only one node.

srun

You will use srun to run multiple versions of a shell script.

--array

You will use Array Jobs, via the --array=startvalue-stopvalue option, to fire up many copies. The value of startvalue is normally 0, and the final value is calculated as described in the paragraph below.

Note that the number of elements in the array job is the number of whole nodes' worth of jobs that will be used. For example, if you have a workload of 1200 single core jobs and wish to submit these to 12 core nodes, then the number of whole nodes' worth of jobs is not 1200, but 1200/12 = 100, and you would use --array=0-99. The minimum number of nodes, even if you are submitting just one single core job, is 1; if the answer is 1 then you should use --array=0-0.

If the division does not come out exactly, round up so that the number of array elements covers a whole number of nodes.
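
As a sketch of that calculation (the variable names and the 1200 job workload are just the example above), the --array upper bound can be worked out with shell arithmetic using ceiling division:

# Example values: 1200 single core jobs on 12 core nodes.
NJOBS=1200
CORES_PER_NODE=12
# Ceiling division: the number of whole nodes' worth of work.
BLOCKS=$(( (NJOBS + CORES_PER_NODE - 1) / CORES_PER_NODE ))
# --array counts from 0, so the last element is BLOCKS - 1.
echo "--array=0-$(( BLOCKS - 1 ))"    # prints --array=0-99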

sbatch

You will use sbatch to fire off the jobs.

Shell arithmetic

You will use shell arithmetic and SLURM variables to determine the data for each unique job.
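
For example (the numbers are purely illustrative), with 12 tasks per node, the sub-job with local ID 2 in array element 5 gets the unique index 5 * 12 + 2 = 62. The calculation, as used in myjob.sh below, is:

# Unique index for this sub-job, built from variables SLURM sets at run time.
ID=$(( SLURM_ARRAY_TASK_ID * SLURM_NTASKS_PER_NODE + SLURM_LOCALID ))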

Scripts

srun.sh

This script runs your job script via srun. Each copy of the job script can compute a block-based index to determine the required parameters for that sub-job: the array task ID multiplied by the number of tasks per node gives the start of the block, and the local task ID is added to it (see myjob.sh below). The number of tasks per node here is the number of sub-jobs that will run in parallel on a node; for single core jobs this will be set to the number of cores on the node.

#!/bin/bash -l
# Set account name, etc. here if required
# The first argument is the job script to run; pass any remaining arguments through.
JOB_SCRIPT=$1
shift
srun $JOB_SCRIPT "$@"

If you customise your jobs (e.g. set hard coded account names via #SBATCH directives in the script), do that here.
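
For instance, a version of srun.sh with a hard-coded account might look like the sketch below; the account name is a placeholder, and any #SBATCH directives must appear before the first executable line of the script.

#!/bin/bash -l
#SBATCH --account=youraccountname    # placeholder: replace with your own account
# The first argument is the job script to run; pass any remaining arguments through.
JOB_SCRIPT=$1
shift
srun $JOB_SCRIPT "$@"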

Note that a wrapper script called srun.sh is used here because running sbatch .... srun job.sh parameters directly is not allowed by SLURM.

Final Scripts

We assume this is saved as myjob.sh. It is probably similar to your standard job script, with the SBATCH directives stripped out and additional logic added to work out the unique ID for each sub-job, and so on.

#!/bin/bash
# Put module loads, and so on here.
# Unique index for this sub-job, built from variables SLURM sets at run time.
ID=$(( SLURM_ARRAY_TASK_ID * SLURM_NTASKS_PER_NODE + SLURM_LOCALID ))
# Replace some_application with your own program (and arguments, e.g. $ID).
some_application

The value of ID is unique to each invocation of myjob.sh and can be used in a similar way to SLURM_ARRAY_TASK_ID, as outlined in Array Jobs.

Job scripts are individual to your work, so it is hard to give a concrete example, but you can pass common parameters through to myjob.sh as outlined below.
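
One common pattern, shown here purely as an illustration, is to keep one line of arguments per sub-job in a file (called parameters.txt here, as in the limitations section below) and use ID to select the line belonging to this sub-job:

#!/bin/bash
# Put module loads, and so on here.
ID=$(( SLURM_ARRAY_TASK_ID * SLURM_NTASKS_PER_NODE + SLURM_LOCALID ))
# Illustrative only: take the (ID+1)-th line of parameters.txt as this
# sub-job's arguments (sed numbers lines from 1).
PARAMS=$(sed -n "$(( ID + 1 ))p" parameters.txt)
some_application $PARAMS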

Single core jobs

An example job submission might be sbatch --account=youraccountname --time=1:0:0 --array=0-31 --partition=compute-12 --tasks-per-node=12 --cpus-per-task=1 --nodes=1 srun.sh myjob.sh 1 2 3 where the time specification should be replaced by the maximum time you expect for any task. Here one node is used per array element, and 12 tasks are specified on 12 core nodes, so each task gets a single core. Adjust the array job count to suit. This example assumes that you have 32 elements in the array (0..31), each of which runs 12 single core jobs, for a total job count of 32*12 = 384, and ID will run from 0 to 383. The common parameters 1 2 and 3 will also be passed through.

MPI Jobs

In this case the job definition is changed in myjob.sh by replacing the line some_application ...... with mpirun -np $SLURM_CPUS_PER_TASK some_application ........
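
Putting that together, a minimal MPI variant of myjob.sh might look like the sketch below; the application name is a placeholder and any module loads are assumed to be added as needed.

#!/bin/bash
# Put module loads, and so on here.
ID=$(( SLURM_ARRAY_TASK_ID * SLURM_NTASKS_PER_NODE + SLURM_LOCALID ))
# Run this sub-job's MPI application over the CPUs assigned to its task.
mpirun -np $SLURM_CPUS_PER_TASK some_application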

An example job submission might be sbatch --account=youraccountname --time=1:0:0 --array=0-31 --partition=compute-12 --tasks-per-node=3 --cpus-per-task=4 --nodes=1 srun.sh myjob.sh 1 2 3 if you want each MPI job to use 4 CPUs (12/3). Adjust the array job count to suit. This example assumes that you have 32 elements in the array (0..31), each of which runs 3 quad-core jobs, for a total job count of 32*3 = 96, and ID will run from 0 to 95. The common parameters 1 2 and 3 will also be passed through.

OpenMP Jobs

Here use the serial job script but replace some_application ...... with export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK; some_application ........
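
Similarly, a minimal OpenMP variant of myjob.sh might look like this sketch (again the application name is a placeholder):

#!/bin/bash
# Put module loads, and so on here.
ID=$(( SLURM_ARRAY_TASK_ID * SLURM_NTASKS_PER_NODE + SLURM_LOCALID ))
# Give this sub-job as many OpenMP threads as CPUs assigned to its task.
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
some_application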

An example job submission might be sbatch --account=youraccountname --time=1:0:0 --array=0-31 --partition=compute-12 --tasks-per-node=3 --cpus-per-task=4 --nodes=1 srun.sh myjob.sh 1 2 3 if you want each OpenMP job to use 4 CPUs (12/3). Adjust the array job count to suit. This example assumes that you have 32 elements in the array (0..31), each of which runs 3 quad-core jobs, for a total job count of 32*3 = 96, and ID will run from 0 to 95. The common parameters 1 2 and 3 will also be passed through.

Caveat on job times

Normally the value for the --time=sometimespec should be the maximum time required for any of the sub jobs to be run. If times for each element vary widely, see the task farmer link below.

Caveat on data output

If the main body of your job script was:

# module loads, etc.
my_application > some_output_file
      

then you need to use the unique ID for the job, noted above, to distinguish some_output_file per sub-job, e.g.

# module loads, etc.
my_application > some_output_file.$ID
      

If your program reads from or writes to directories with fixed filenames, bear this in mind; you may need to create directories or change directory (the Linux cd command) as required.
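
If the application insists on fixed filenames, one possible approach (the directory naming is purely an example) is to give each sub-job its own working directory:

# module loads, etc.
mkdir -p run_$ID && cd run_$ID    # one working directory per sub-job
my_application > some_output_file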

Limitations

Note that this is not true task farming: it assumes that the group of tasks within an srun will all take roughly the same amount of time, so if a long job is grouped with a number of short ones it may waste CPU time.

For example, if 11 of the 12 tasks in an srun take one hour, one takes 10 hours, and you set --time=10:0:0, then this array job element will waste 11*9 = 99 core-hours of your allocation. A workaround is to run with --time=1:0:0, record which jobs complete within this time specification, then gather the remaining ones into a new parameters.txt and rerun them with a longer time specification. However, if the run times are truly random this will not work.

True task farming

A task farming utility is provided. See mpi_task_farmer.

Advanced Usage

In theory it is possible to do combined OpenMP and MPI too, but it is not recommended. Advice on how to do this may be added later.

You can, in theory, combine this approach with array jobs. It requires shell arithmetic over the array job ID and the local ID of a sub-job on a node, so it is not recommended unless you know how to do that. Advice on how to do this will be added later.