Scheduling Jobs

The Batch System

  • The batch system used on Discovery is Slurm.
  • Users log in via ssh to one of the submit nodes and submit jobs to be run on the compute nodes by writing a script file that describes the job.
  • They submit the job to the cluster using the sbatch command.

There are four primary partitions on the cluster:

  • standard – This is the main queue for the cluster. It does not need to be specified as it is used by default.
  • testq – This is a queue set up to run on our test nodes, with resource limits.
  • bigmem – This is a queue set up to run on our compute nodes that have 8GB of memory per core.
  • gpuq – This is a queue set up to run GPU-related jobs on the two production GPU nodes.

Users specify the amount of time and the number of processors required by their jobs.
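
For example, a non-default partition, the walltime, and the processor count can be requested with directives in the job script. This is a minimal sketch; the values below are placeholders to adjust for your own job:

#SBATCH --partition=bigmem     # run in the bigmem partition instead of the default standard
#SBATCH --time=02:00:00        # requested walltime (hh:mm:ss)
#SBATCH --ntasks=4             # number of processors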

Managing and Monitoring Your Jobs

Some useful commands:

  • sbatch <job script> – submit a batch job to the queue
  • squeue – show the status of Slurm batch jobs
  • scancel JOBID – cancel a job
  • sinfo – show information about partitions
  • scontrol show job JOBID – check the status of a running or idle job
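
A typical session with these commands might look like the following sketch; the script name myjob.sh and job ID 123456 are placeholders:

[john@discovery7 ~]$ sbatch myjob.sh
Submitted batch job 123456
[john@discovery7 ~]$ squeue -u $USER            # list only your own jobs
[john@discovery7 ~]$ scontrol show job 123456   # detailed status of the job
[john@discovery7 ~]$ scancel 123456             # cancel the job if it is no longer needed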

The default length of any job submitted to the queue is currently one hour, and the default maximum number of processors per user is set to a value based on their user status.

Information on Submitting Jobs to the Queue

  • Jobs that run longer than thirty days will be terminated by the scheduler.
    • These parameters are subject to change as we become more familiar with users' needs.
  • It is important for users to specify the resources required by their jobs.
  • In the current configuration, the walltime and the number of nodes are the two parameters that matter.
  • If you don’t specify the walltime, the system default of one hour will be assumed and your job may end early.
  • See the Single Processor Job Example for further details; a minimal sketch follows this list.
  • Scripts initiated by sbatch will have the environment that exists when you run the sbatch command. sbatch scripts do not source .bashrc or .bash_profile the way an interactive login shell does.
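
As a minimal sketch of a single-processor job script (not the full Single Processor Job Example; the program name and walltime are placeholders, and the account is assumed as in the srun examples below):

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=1                # one processor
#SBATCH --time=04:00:00           # without this, the one-hour default applies
#SBATCH --account=rc              # account, assumed here as in the srun examples below

# Because sbatch scripts do not source .bashrc or .bash_profile, set up any
# environment variables or paths your program needs here.

./my_serial_program               # placeholder for your executable

Submit the script with sbatch <script name>; Slurm will print the job ID it assigns.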

Information for Multiprocessor Jobs

  • For multiprocessor jobs, it is important to specify the number of nodes and processors required and to select nodes that are of the same architecture.
  • The nodes are divided into cells. The nodes in each cell are homogeneous, with similar chip vendor and speed as well as similar disk and memory sizes.
     
  • See the Sample parallel job scripts for examples of how to submit parallel jobs; a sketch also follows this list.
  • Parallel programs that need to communicate between processes will run more efficiently if all of the processes are in the same group.
  • You can specify which group of nodes to use by adding the group to the #SBATCH directive where you specify the number of nodes and processors.
  • For example:
    • #SBATCH --nodes=2
    • #SBATCH --ntasks-per-node=4
    • #SBATCH --nodelist=k[01-58]
    • This example specifies that the job will run on our cell k nodes, requesting 4 processors per node.
    • Before you submit your job, use the sinfo command to see which nodes are currently running jobs so you can select a cell that has free nodes.
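
Putting the directives above into a complete script, a parallel job submission might look like the following sketch (the executable name is a placeholder, and the walltime and account are assumptions):

#!/bin/bash
#SBATCH --nodes=2                 # two nodes from the same cell
#SBATCH --ntasks-per-node=4       # 4 processors per node, 8 tasks in total
#SBATCH --nodelist=k[01-58]       # restrict the job to the cell k nodes
#SBATCH --time=08:00:00           # requested walltime
#SBATCH --account=rc              # account, as in the srun examples below

srun ./my_mpi_program             # srun launches the tasks on the allocated nodes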
       

Interactive Jobs

An interactive job is a job that returns a command line prompt (instead of running a script) when the job runs. Interactive jobs are useful when debugging or interacting with an application. The srun command is used to submit an interactive job to Slurm. When the job starts, a command line prompt will appear on one of the compute nodes assigned to the job. From here commands can be executed using the resources allocated on the local node.

[john@discovery7 ~]$ srun --account=rc --pty /bin/bash
[john@p04 ~]$ hostname
p04.hpcc.dartmouth.edu
[john@p04 ~]$

Jobs submitted with srun --pty /bin/bash will be assigned the cluster default values of 1 CPU and 1024MB of memory. The account must also be specified; the job will not run otherwise. If additional resources are required, they can be requested as options to the srun command. The following example job is assigned 2 nodes, each with 4 tasks (1 CPU per task) and 4GB of memory:

[john@discovery7 ~]$ srun --nodes=2 --ntasks-per-node=4 --mem-per-cpu=1GB --cpus-per-task=1 --account=rc --pty /bin/bash
[john@q06 ~]$ 
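
When you are finished, exit the shell to end the interactive job and release the allocated resources:

[john@q06 ~]$ exit
[john@discovery7 ~]$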

 
