Sample python lab (Walltime example)

In this lab we will create a simple Python script called invert_matrix.py and submit it to the cluster. We will also explore what happens when a job runs out of walltime.

For the purpose of this lab, we will use a conda environment, made available through modules, that already has the necessary packages installed. If you want to use Python outside of this lab, we strongly encourage you to visit:

https://services.dartmouth.edu/TDClient/1806/Portal/KB/ArticleDet?ID=72888

The above KB article will take you through creating a conda environment so that you can manage your own Python packages.

To get started, open a new file in your favorite editor and call it invert_matrix.py. Once created, paste in the following python code.

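import sys

import numpy as np

# Invert random square matrices of increasing size, from 2x2 up to 2000x2000.
for i in range(2, 2001):
    x = np.random.rand(i, i)
    y = np.linalg.inv(x)
    # x times its inverse should be very close to the identity matrix,
    # so the mean of the difference should be very close to zero.
    z = np.dot(x, y)
    e = np.eye(i)
    r = z - e
    m = r.mean()
    # Report progress every 50 sizes, flushing so the output appears right away.
    if i % 50 == 0:
        print("i,mean", i, m)
        sys.stdout.flush()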

Save the file, and test that the program works by issuing:

python invert_matrix.py
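
As a quick sanity check of what the script computes, the sketch below applies the same steps to one small, fixed matrix. The helper name mean_residual and the 2x2 example matrix are illustrative assumptions and are not part of the lab script.

```python
import numpy as np


def mean_residual(x):
    """Return the mean of (x @ inv(x) - I), the quantity the lab script prints."""
    y = np.linalg.inv(x)
    r = np.dot(x, y) - np.eye(x.shape[0])
    return r.mean()


# Hypothetical, well-conditioned 2x2 matrix chosen only for illustration.
x = np.array([[4.0, 1.0], [2.0, 3.0]])
m = mean_residual(x)
print("mean residual:", m)
```

A result within floating-point rounding error of zero suggests the inversion worked. The script in this lab prints the same quantity for much larger random matrices.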

Next we will want to estimate how long the job will take to complete. One way to do that is to run the script interactively, either by sshing directly to compute node x01 or by requesting a Slurm interactive job via srun. Let's request a Slurm interactive job to estimate how much walltime our job will need.

$ srun --account=rc --cpus-per-task=8 --pty /bin/bash
$ module load python
$ export OMP_NUM_THREADS=8
$ time python invert_matrix.py
i,mean 50 -3.163405787045661e-17
i,mean 100 6.142970237536317e-17
i,mean 150 4.5949741833585534e-18
i,mean 200 3.893461968391442e-17
i,mean 250 -3.146382346137727e-17
i,mean 300 7.28979443833719e-17
i,mean 350 -2.1393282891571857e-17
i,mean 400 -1.0057995742663563e-16
i,mean 450 3.739472179051286e-17
i,mean 500 1.0962310076481413e-17
...
i,mean 2000 2.71473510528421e-16

real 4m36.904s
user 4m27.038s
sys 0m8.851s

Above you can see that I am using srun to create an interactive session, asking for 8 cores on a compute node. Once the command executes, I am on a new node, p04.

For this example, I will be using the time command. More information can be found about the time command by issuing "man time". 
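
The time command reports three values: real is the elapsed wall-clock time, user is the CPU time spent in the program itself, and sys is the CPU time spent in the kernel on its behalf. The walltime limit applies to the real value. The sketch below measures the same three quantities from inside Python, as a rough illustration only. The helper timed and the stand-in workload busy_work are hypothetical names.

```python
import os
import time


def timed(func):
    """Run func() and return (real, user, sys) seconds, similar to the time command."""
    cpu_start = os.times()
    wall_start = time.perf_counter()
    func()
    wall_end = time.perf_counter()
    cpu_end = os.times()
    return (
        wall_end - wall_start,
        cpu_end.user - cpu_start.user,
        cpu_end.system - cpu_start.system,
    )


def busy_work():
    # Hypothetical stand-in for a real workload.
    return sum(k * k for k in range(200_000))


real, user, system = timed(busy_work)
print(f"real {real:.3f}s  user {user:.3f}s  sys {system:.3f}s")
```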

After the interactive session has started, I go through the steps I would normally take to run my code, but I place the time command in front of the python command before pressing Enter:

$ time python invert_matrix.py

When the script finishes, time reports how long the run took. In the above example, the "real" line shows that it took 4 minutes and 36 seconds to complete. I now know that I should request at least 5 minutes of walltime.
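
Turning the measured time into a walltime request can be sketched as follows. The helper walltime_request and its 20% safety margin are assumptions made for illustration; choose a margin that suits how much your own run times vary.

```python
import math
import re


def walltime_request(real, margin=0.2):
    """Turn a time value such as '4m36.904s' into an HH:MM:SS request.

    margin is an assumed safety factor (20% by default). The padded
    time is rounded up to a whole minute.
    """
    match = re.fullmatch(r"(\d+)m([\d.]+)s", real)
    if match is None:
        raise ValueError(f"unrecognized time format: {real!r}")
    seconds = int(match.group(1)) * 60 + float(match.group(2))
    minutes = math.ceil(seconds * (1 + margin) / 60)
    hours, minutes = divmod(minutes, 60)
    return f"{hours:02d}:{minutes:02d}:00"


print(walltime_request("4m36.904s"))  # prints 00:06:00
```

With margin=0 the same call returns 00:05:00, which matches the 5-minute minimum estimated above.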

Now let's try a batch example, and deliberately not give it enough walltime, to see what happens.

Create a batch script to submit to the cluster. Copy the example below into a new file called sample_python_lab.sh, in the same directory as your invert_matrix.py file.

#!/bin/bash -l
# Name of the cluster account
#SBATCH --account=rc
# How long the job should run for
#SBATCH --time=00:02:00
# Number of CPU cores, in this case 8 cores
#SBATCH --ntasks-per-node=8
# Number of compute nodes to use, in this case 1
#SBATCH --nodes=1
# Names of the output files to be created. If not specified, stdout and stderr are written to the same file
#SBATCH --output=%x.%j.out
#SBATCH --error=%x.%j.err
# The code you want to run in your job
module load python
export OMP_NUM_THREADS=8
python invert_matrix.py

Once you have your submit script written, submit the job via the sbatch command.

$ sbatch sample_python_lab.sh
Submitted batch job 4156

Upon submitting, you receive a job number. In the example above, my job number is 4156.

As soon as the job starts, you will notice two output files created within the directory. In our case they are:

sample_python_lab.sh.4156.err
sample_python_lab.sh.4156.out

$ ls -l
total 136
-rw-r--r-- 1  rc-users  218 May 14 16:20 invert_matrix.py
-rw-r--r-- 1  rc-users  568 May 14 18:44 sample_python_lab.sh
-rw-r--r-- 1  rc-users  107 May 14 18:44 sample_python_lab.sh.4156.err
-rw-r--r-- 1  rc-users 1309 May 14 18:46 sample_python_lab.sh.4156.out
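
The file names follow from the --output and --error patterns in the batch script: %x is replaced by the job name and %j by the job ID. The sketch below mimics that substitution for these two placeholders only. The function expand_pattern is a hypothetical helper, not part of Slurm, and Slurm supports more placeholders than are handled here.

```python
def expand_pattern(pattern, job_name, job_id):
    """Expand the %x (job name) and %j (job ID) placeholders in a file name pattern."""
    return pattern.replace("%x", job_name).replace("%j", str(job_id))


# With no --job-name given, Slurm uses the batch script name as the job name.
print(expand_pattern("%x.%j.out", "sample_python_lab.sh", 4156))
print(expand_pattern("%x.%j.err", "sample_python_lab.sh", 4156))
```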

Now that we see the .err and .out files, go ahead and take a look at the .out file using the cat command:

$ cat sample_python_lab.sh.4156.out
i,mean 50 -9.642611316148498e-17
i,mean 100 1.87926695251823e-16
i,mean 150 1.8993939654816242e-16
i,mean 200 2.946282755137268e-17
...
i,mean 1950 -1.7521916145004967e-17

It looks like the job did not run to completion. If it had, we would expect the output to end with a line for i = 2000, like the one from the interactive run:

i,mean 2000 2.71473510528421e-16
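
For longer outputs, the same check can be scripted. The sketch below is a minimal example, assuming the "i,mean <size> <mean>" line format printed by invert_matrix.py; the helper name last_reported_size is hypothetical.

```python
def last_reported_size(lines):
    """Return the matrix size from the last 'i,mean <size> <mean>' line, or None."""
    last = None
    for line in lines:
        parts = line.split()
        if len(parts) == 3 and parts[0] == "i,mean":
            last = int(parts[1])
    return last


# Lines copied from the sample .out file above.
sample = [
    "i,mean 50 -9.642611316148498e-17",
    "i,mean 100 1.87926695251823e-16",
    "i,mean 1950 -1.7521916145004967e-17",
]
last = last_reported_size(sample)
print("last size reported:", last, "| finished:", last == 2000)
```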

Next, take a look at the error file to see if it has any clues as to why our job aborted.

$ cat sample_python_lab.sh.4156.err
slurmstepd: error: *** JOB 4156 ON p04 CANCELLED AT 2021-05-14T18:46:49 DUE TO TIME LIMIT ***

After looking at the .err file, it is clear that our job ran out of walltime, as indicated by the message above. In this case we requested 2 minutes of walltime, but should have requested at least 5 minutes.
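
When checking many jobs, the error file can be scanned for this message automatically. The sketch below is based only on the single message shown above; the regular expression and the helper find_time_limit are assumptions, and other Slurm versions or configurations may word the message differently.

```python
import re

# Pattern modelled on the one message shown in this lab.
TIME_LIMIT = re.compile(r"JOB (\d+) ON (\S+) CANCELLED AT (\S+) DUE TO TIME LIMIT")


def find_time_limit(text):
    """Return (job_id, node, timestamp) if text reports a time-limit cancellation."""
    match = TIME_LIMIT.search(text)
    if match is None:
        return None
    return int(match.group(1)), match.group(2), match.group(3)


err = (
    "slurmstepd: error: *** JOB 4156 ON p04 CANCELLED AT "
    "2021-05-14T18:46:49 DUE TO TIME LIMIT ***"
)
print(find_time_limit(err))
```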