Slurm overview

The Discovery cluster uses Slurm to schedule and manage jobs.

Common user commands in Slurm include:

Command   Usage                        Action
sbatch    sbatch <job script>          Submit a batch job to the queue
squeue    squeue                       Show the status of Slurm batch jobs
srun      srun <job script>            Run an interactive job
sinfo     sinfo                        Show information about partitions
scontrol  scontrol show job <JOBID>    Check the status of a running or idle job
scancel   scancel <JOBID>              Cancel a job

Batch jobs

To run a job in batch mode, first prepare a job script that specifies the application you want to launch and the resources required to run it. Then, use the sbatch command to submit your job script to Slurm.

For complete documentation about the sbatch command and its options, see the sbatch manual page via: man sbatch

Example submit script:

Slurm job scripts most commonly have at least one executable line preceded by a list of options that specify the resources and attributes needed to run your job (for example, wall-clock time, the number of nodes and processors, and filenames for job output and errors).

  • A job script for running a batch job on Discovery may look similar to the following:
    #!/bin/bash
    
    # Name of the job
    #SBATCH --job-name=my_first_slurm_job
    
    # Number of compute nodes
    #SBATCH --nodes=1
    
    # Number of tasks per node
    #SBATCH --ntasks-per-node=1
    
    # Number of CPUs per task
    #SBATCH --cpus-per-task=1
    
    # Request memory
    #SBATCH --mem=8G
    
    # Walltime (job duration)
    #SBATCH --time=00:15:00
    
    # Standard output and error files (%j is replaced by the job ID)
    #SBATCH -o filename_%j.txt
    #SBATCH -e filename_%j.err
    
    # Email notifications (comma-separated options: BEGIN,END,FAIL)
    #SBATCH --mail-type=FAIL
    
    module load module_name
    ./my_program arg1 arg2
    

    In the above example:

    • The first line indicates that the script should be read using the Bash command interpreter.
    • The next lines are #SBATCH directives used to pass options to the sbatch command:
      • --job-name specifies a name for the job allocation. The specified name appears along with the job ID number when you query running jobs on the system.
      • -o filename_%j.txt and -e filename_%j.err instruct Slurm to connect the job's standard output and standard error, respectively, to the specified files, where %j is automatically replaced by the job ID.
      • --mail-type=<type> directs Slurm to send job-related email when an event of the specified type(s) occurs; valid type values include BEGIN, END, FAIL, and ALL.
      • --nodes=1 requests one node be allocated to this job.
      • --ntasks-per-node=1 specifies that one task should be launched per node.
      • --cpus-per-task=1 specifies that one CPU should be allocated per task.
      • --mem=8G requests 8 GB of memory.
      • --time=00:15:00 requests 15 minutes.
    • The last two lines are the executable lines that the job will run. In this case, the module command loads a specified module before launching the specified binary (my_program) with the specified arguments (arg1 arg2). In your script, replace my_program and the arguments with your program's name and any needed arguments.
       
  • A job script for running a batch job on the GPU nodes should contain --partition gpuq and the --gres flag to indicate the type of GPU (k80 or v100) and the number of GPUs to be allocated for the job. For example, to request two K80 GPUs:
    #!/bin/bash
    
    #SBATCH -J job_name
    #SBATCH --partition gpuq
    #SBATCH --gres=gpu:k80:2
    #SBATCH -o filename_%j.txt
    #SBATCH -e filename_%j.err
    #SBATCH --nodes=1
    #SBATCH --ntasks-per-node=1
    #SBATCH --cpus-per-task=1
    #SBATCH --time=02:00:00
    
    module load module_name
    ./my_program my_program_arguments
    

    In your script, replace my_program and my_program_arguments with your program's name and any needed arguments.

Depending on the resources needed to run your executable lines, you may need to include other sbatch options in your job script. Here are a few other useful options:

Option                         Action
--begin=YYYY-MM-DDTHH:MM:SS    Defer allocation of your job until the specified date and time, after which the job is eligible to execute. For example, to defer allocation until 10:30pm June 14, 2021, use --begin=2021-06-14T22:30:00.
--no-requeue                   Specify that the job is not rerunnable. Setting this option prevents the job from being requeued after it has been interrupted, for example, by a scheduled downtime or preemption by a higher-priority job.
--export=ALL                   Export all environment variables in the sbatch command's environment to the batch job.
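These options go in the #SBATCH directive block like any others. A minimal sketch of a deferred, non-requeueable job script (the job name, date, and final lines are placeholders):

```shell
#!/bin/bash

# Hypothetical job script combining the options above.
#SBATCH --job-name=deferred_job
#SBATCH --begin=2021-06-14T22:30:00
#SBATCH --no-requeue
#SBATCH --export=ALL
#SBATCH --time=00:15:00

# Placeholder workload; replace with your own commands.
msg="job started"
echo "$msg"
```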


Submit your job script

To submit your job script (for example, my_job.script), use the sbatch command. If the command runs successfully, it returns a job ID on standard output; for example:

$ sbatch my_job.script
Submitted batch job 4311
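If you want to use the job ID in later commands such as squeue or scancel, you can capture it in a shell variable. sbatch's --parsable option prints just the ID; alternatively, the ID is the fourth field of the confirmation line. A minimal sketch using the example output above:

```shell
# Confirmation line as printed by sbatch (copied from the example above).
msg="Submitted batch job 4311"

# The job ID is the fourth whitespace-separated field.
jobid=$(echo "$msg" | awk '{print $4}')
echo "$jobid"    # prints 4311
```

With a recent Slurm, jobid=$(sbatch --parsable my_job.script) achieves the same in one step.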


MPI jobs

To run an MPI job, add #SBATCH directives to your script for requesting the required resources and add the srun command as an executable line for launching your application. For example, a job script for running an MPI job that launches 96 tasks across two nodes in the general partition on Discovery could look similar to the following:

#!/bin/bash
  
#SBATCH -J mpi_job
#SBATCH -o mpi_%j.txt
#SBATCH -e mpi_%j.err
#SBATCH --mail-type=END,FAIL
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=48
#SBATCH --time=00:30:00

cd /directory/with/stuff
srun my_program my_program_arguments

In your script, replace my_program and my_program_arguments with your program's name and any needed arguments.

Note:

If your application was compiled using a version of OpenMPI configured with --with-pmi (for example, openmpi/gnu/4.0.1 or openmpi/intel/4.0.1), you can use srun to launch it from your job script. If your application was compiled using a version of OpenMPI that was not configured with --with-pmi (for example, openmpi/gnu/2.1.0 or openmpi/intel/2.1.0), you can use mpirun to launch it from your job script.
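One way to check how a loaded OpenMPI module was configured is to inspect the output of the ompi_info command, which ships with OpenMPI (the exact output varies by installation):

$ ompi_info | grep -i pmi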

OpenMP and hybrid OpenMP-MPI jobs

To run an OpenMP or hybrid OpenMP-MPI job, use the srun command and add the necessary #SBATCH directives as in the previous example, but also add an executable line that sets the OMP_NUM_THREADS environment variable to indicate the number of threads that should be used for parallel regions. For example, a job script for running a hybrid OpenMP-MPI job that launches 16 tasks across two nodes in the standard partition on Discovery could look similar to the following:

#!/bin/bash

#SBATCH -J hybrid_job
#SBATCH -o hybrid_%j.txt
#SBATCH -e hybrid_%j.err
#SBATCH --mail-type=ALL
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=16
#SBATCH --time=00:05:00
  
export OMP_NUM_THREADS=2
cd /directory/with/stuff
srun my_program my_program_arguments

In your script, replace my_program and my_program_arguments with your program's name and any needed arguments.
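Rather than hard-coding the thread count, you can derive OMP_NUM_THREADS from the SLURM_CPUS_PER_TASK environment variable, which Slurm sets inside the job when --cpus-per-task is given. A minimal sketch (the first assignment only simulates the value Slurm would set; inside a real job, omit it):

```shell
# Simulate the variable Slurm sets when --cpus-per-task is given;
# inside a real job this line is unnecessary.
SLURM_CPUS_PER_TASK=8

# Use the Slurm-provided value, falling back to 1 thread if it is unset.
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK:-1}
echo "OMP_NUM_THREADS=${OMP_NUM_THREADS}"
```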

You also can bind tasks to CPUs with the srun command's --cpu-bind option. For example, to modify the previous example so that it binds tasks to sockets, add the --cpu-bind=sockets option to the srun command:

#!/bin/bash
  
#SBATCH -J hybrid_job
#SBATCH -o hybrid_%j.txt
#SBATCH -e hybrid_%j.err
#SBATCH --mail-type=ALL
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=16
#SBATCH --cpus-per-task=8
#SBATCH --time=00:05:00

export OMP_NUM_THREADS=2
cd /directory/with/stuff
srun --cpu-bind=sockets my_program my_program_arguments

In your script, replace my_program and my_program_arguments with your program's name and any needed arguments.

Supported binding options include --cpu-bind=mask_cpu:<list>, which binds by setting CPU masks on tasks as indicated in the specified list. To view all available CPU bind options, on the Discovery command line, enter:

$ srun --cpu-bind=help

Interactive jobs

To request resources for an interactive job, use the srun command with the --pty option.

For example,

$ srun --pty /bin/bash
$ hostname
p04.hpcc.dartmouth.edu
$

Jobs submitted with srun --pty /bin/bash are assigned the cluster default values of 1 CPU and 1024 MB of memory. The account must also be specified, or the job will not run. If additional resources are required, request them as options to the srun command. The following example job is assigned 2 nodes, each with 4 CPUs and 4 GB of memory:

$ srun --nodes=2 --ntasks-per-node=4 --mem-per-cpu=1GB --cpus-per-task=1 --pty /bin/bash
[q06 ~]$ 
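Because an account must also be specified, include the --account option in your srun command (the account name my_lab is hypothetical; substitute your own):

$ srun --account=my_lab --pty /bin/bash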

When the requested resources are allocated to your job, you will be placed at a command prompt on one of the cluster's compute nodes, where you can begin executing your code interactively.

Note:

When you are finished with your interactive session, on the command line, enter exit to free the allocated resources.

For complete documentation about the srun command, see the srun manual page via: man srun

Monitor or delete your job

To monitor the status of jobs in a Slurm partition, use the squeue command. Some useful squeue options include:

-a                     Display information for all jobs.
-j <jobid>             Display information for the specified job ID.
-j <jobid> -o %all     Display all information fields (with a vertical bar separating each field) for the specified job ID.
-l                     Display information in long format.
-n <job_name>          Display information for the specified job name.
-p <partition_name>    Display jobs in the specified partition.
-t <state_list>        Display jobs that have the specified state(s). Valid job states include PENDING, RUNNING, SUSPENDED, COMPLETED, CANCELLED, FAILED, TIMEOUT, NODE_FAIL, PREEMPTED, BOOT_FAIL, DEADLINE, OUT_OF_MEMORY, COMPLETING, CONFIGURING, RESIZING, REVOKED, and SPECIAL_EXIT.
-u <username>          Display jobs owned by the specified user.
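Options can be combined; for example, to display long-format information about your own running jobs, enter:

$ squeue -u username -t RUNNING -l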

For complete documentation about the squeue command, see the squeue manual page via: man squeue

To delete your pending or running job, use the scancel command with your job's job ID; for example, to delete your job that has a job ID of 4632, on the command line, enter:

$ scancel 4632

Alternatively:

  • To cancel a job named my_first_job, enter:
    $ scancel -n my_first_job
    
  • To cancel a job owned by username, enter:
    $ scancel -u username
    

For complete documentation about the scancel command, see the scancel manual page via: man scancel

View partition and compute node information

To view information about the nodes and partitions that Slurm manages, use the sinfo command.

By default, sinfo (without any options) displays:

  • All partition names
  • Availability of each partition
  • Maximum wall time allowed for jobs in each partition
  • Number of compute nodes in each partition
  • State of the compute nodes in each partition
  • Names of the compute nodes in each partition

To display node-specific information, use sinfo -N, which will list:

  • All node names
  • Partition to which each node belongs
  • State of each node

To display additional node-specific information, use sinfo -lN, which adds the following fields to the previous output:

  • Number of cores per node
  • Number of sockets per node, cores per socket, and threads per core
  • Size of memory per node in megabytes
     
To customize the output, pass a format string to sinfo's -o option. Commonly used format specifications include:

Specification   Field displayed
%<#>P           Partition name (set field width to # characters)
%<#>N           List of node names (set field width to # characters)
%<#>c           Number of cores per node (set field width to # characters)
%<#>m           Size of memory per node in megabytes (set field width to # characters)
%<#>l           Maximum wall time allowed (set field width to # characters)
%<#>s           Maximum number of nodes allowed per job (set field width to # characters)
%<#>G           Generic resource associated with a node (set field width to # characters)

For example:

$ sinfo -No "%10P %8N  %4c  %7m  %12l %10G"

The resulting output looks similar to this:

PARTITION  NODELIST  CPUS  MEMORY   TIMELIMIT    GRES
gpuq       g08       16    128640   infinite     gpu:k80:4
gpuq       g10       16    128640   infinite     gpu:k80:4
gpuq       g11       16    128640   infinite     gpu:k80:4
bigmem     k25       16    64132    infinite     (null)
bigmem     k26       16    64132    infinite     (null)
bigmem     k27       16    64132    infinite     (null)
bigmem     k28       16    64132    infinite     (null)
bigmem     k29       16    64132    infinite     (null)
bigmem     k30       16    64132    infinite     (null)
bigmem     k31       16    64132    infinite     (null)
bigmem     k32       16    64132    infinite     (null)
bigmem     k33       16    64132    infinite     (null)
bigmem     k34       16    64132    infinite     (null)
bigmem     k35       16    64132    infinite     (null)
bigmem     k36       16    64132    infinite     (null)
bigmem     k37       16    64132    infinite     (null)
bigmem     k38       16    64132    infinite     (null)
bigmem     k39       16    64132    infinite     (null)
bigmem     k40       16    64132    infinite     (null)
bigmem     k41       16    64132    infinite     (null)

For complete documentation about the sinfo command, see the sinfo manual page via: man sinfo

Credit https://kb.iu.edu/d/awrz


Details

Article ID: 132625
Created: Fri 5/21/21 12:48 PM
Modified: Wed 6/23/21 6:52 PM