Users submit jobs to Slurm for scheduling by means of a job script: a plain-text file containing Slurm-specific directives alongside ordinary shell commands.
The script below launches a job that runs the my_exe program with 32 MPI processes (16 on each of 2 nodes):

```shell
#!/bin/csh
#SBATCH --account mscfcons            # charged account
#SBATCH --time 30                     # 30 minute time limit
#SBATCH --nodes 2                     # 2 nodes
#SBATCH --ntasks-per-node 16          # 16 processes on each node
#SBATCH --job-name my_job_name        # job name in queue (squeue)
#SBATCH --error my_job_name-%j.err    # stderr file: job_name-job_id.err
#SBATCH --output my_job_name-%j.out   # stdout file
#SBATCH --mail-user email@example.com # email user
#SBATCH --mail-type END               # when job ends

module purge                          # remove the default module set
module load intel/16.1.150
module load impi/126.96.36.199
mpirun -n 32 ./my_exe
```
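The rank count handed to mpirun follows directly from the node directives; a quick check of the arithmetic:

```shell
# Total MPI ranks equal nodes * ntasks-per-node from the directives
# above: 2 nodes with 16 tasks each.
nodes=2
ntasks_per_node=16
echo $(( nodes * ntasks_per_node ))   # prints 32, matching mpirun -n 32
```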
After purging the modules, the script should load the compiler, MPI library, and math library that were specified when my_exe was compiled. The order of the load instructions is important. #SBATCH lines must come before any shell script commands, and your preferred shell should be the first line of your batch script.
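Taken together, a minimal skeleton honoring both rules might look like this (the account name is a placeholder, not a real allocation):

```shell
#!/bin/bash
# Preferred shell on the first line; all #SBATCH directives appear
# before the first shell command.
#SBATCH --account my_account   # placeholder allocation name
#SBATCH --time 10
#SBATCH --nodes 1

echo "shell commands start here, after all #SBATCH lines"
```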
Options passed to Slurm are referred to as directives. They can be specified in a submission script, as shown above, or on the command line when the job is submitted. The directives are the same in both cases, but when placed in a job script each line must carry the #SBATCH prefix. Some common directives and their descriptions:
- -A, --account=<account>
Charge resources used by this job to the specified account. The account is an arbitrary string. The account name may be changed after job submission using the scontrol command.
- -d, --dependency=<dependency_list>
Defer the start of this job until the specified dependencies have been satisfied (for example, --dependency=afterok:<job_id>).
- -D, --workdir=<directory>
Set the working directory of the batch script to directory before it is executed. The path can be specified as full path or relative path to the directory where the command is executed.
- -e, --error=<filename pattern>
Instruct Slurm to connect the batch script’s standard error directly to the file name specified in the “filename pattern”. By default both standard output and standard error are directed to the same file. The default file name is “slurm-%j.out”, where the “%j” is replaced by the job ID.
- -J, --job-name=<jobname>
Specify a name for the job allocation. The specified name will appear along with the job id number when querying running jobs on the system. The default is the name of the batch script, or just “sbatch” if the script is read on sbatch’s standard input.
- --mail-type=<type>
Notify user by email when certain event types occur. Some type values are NONE, BEGIN, END, FAIL, and TIME_LIMIT_90 (reached 90 percent of the time limit). Multiple type values may be specified in a comma-separated list.
- --mail-user=<user>
User to receive email notification of state changes as defined by --mail-type.
- -N, --nodes=<minnodes[-maxnodes]>
Request that a minimum of minnodes nodes be allocated to this job. A maximum node count may also be specified with maxnodes. If only one number is specified, this is used as both the minimum and maximum node count.
- --ntasks-per-node=<ntasks>
Request that ntasks be invoked on each node. Meant to be used with the --nodes option.
- -o, --output=<filename pattern>
Instruct Slurm to connect the batch script’s standard output directly to the file name specified in the “filename pattern”. By default both standard output and standard error are directed to the same file. The default file name is “slurm-%j.out”, where the “%j” is replaced by the job ID.
- -p, --partition=<partition_names>
Request a specific partition for the resource allocation. If not specified, the default behavior is to allow the slurm controller to select the default partition as designated by the system administrator. If the job can use more than one partition, specify their names in a comma separate list and the one offering earliest initiation will be used with no regard given to the partition name ordering.
- -t, --time=<time>
Set a limit on the total run time of the job allocation. If the requested time limit exceeds the partition’s time limit, the job will be left in a PENDING state (possibly indefinitely). The default time limit is the partition’s default time limit. When the time limit is reached, each task in each job step is sent SIGTERM followed by SIGKILL. Acceptable time formats include “minutes”, “minutes:seconds”, “hours:minutes:seconds”, “days-hours”, “days-hours:minutes” and “days-hours:minutes:seconds”.
Run this job on the NVIDIA Tesla GPGPU nodes. Note: Maximum time limit 2 hours.
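The --time formats above all reduce to a number of minutes. As a rough illustration (the function name is ours, not part of Slurm; seconds are rounded up), a plain-bash conversion might look like:

```shell
#!/bin/bash
# Sketch only: convert the Slurm time formats listed above into
# whole minutes, rounding seconds up.
slurm_time_to_minutes() {
  local spec=$1 days=0 h=0 m=0 s=0 rest
  if [[ $spec == *-* ]]; then       # days-hours[:minutes[:seconds]]
    days=${spec%%-*}
    rest=${spec#*-}
    IFS=: read -r h m s <<< "$rest"
    m=${m:-0}; s=${s:-0}
  else                              # minutes[:seconds] or hours:minutes:seconds
    local IFS=:
    set -- $spec
    case $# in
      1) m=$1 ;;                    # minutes
      2) m=$1; s=$2 ;;              # minutes:seconds
      3) h=$1; m=$2; s=$3 ;;        # hours:minutes:seconds
    esac
  fi
  # 10# forces base-10 so leading zeros (e.g. "08") are not read as octal.
  echo $(( 10#$days * 1440 + 10#$h * 60 + 10#$m + (10#$s + 59) / 60 ))
}

slurm_time_to_minutes 30        # -> 30
slurm_time_to_minutes 1:30:00   # -> 90
slurm_time_to_minutes 2-0       # -> 2880
```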
All available directives can be listed on the command line using
sbatch --help, with more complete descriptions on the command-line using
man sbatch, or online at https://slurm.schedmd.com/sbatch.html.
Submission scripts are used with the
sbatch command to submit jobs. A
successful submission results in the job’s ID being returned:
$ sbatch myjobscript
226783
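Because directives are also accepted on the command line, the values in a script can be overridden at submission time; for example (script name as above, flag values illustrative):

```shell
# Command-line directives take precedence over the script's #SBATCH
# lines; here the time limit and node count are overridden for this
# one submission.
sbatch --time=60 --nodes=4 myjobscript
```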
Queues and Queue Limits
Job queue default values limit the time a single job with a particular number of nodes is permitted to run. The system automatically categorizes your job based on the resources it requests.

Nodes in 1 Job   Notes
--------------   -----
128 - 460
16 - 127         higher priority than smaller jobs
1 - 15           jobs will be used to backfill around the larger queues
1 - 16           lower-priority jobs requesting longer run times, with a limit of 16 nodes/job
Changes to the contents of the script after the job has been submitted do not affect the current job.
The default working directory is the directory you submit your script from. If you need to run in another directory, your job script must cd to that directory or specify it with the -D, --workdir directive.
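For instance, a job that must run from a scratch directory (the path here is hypothetical) can change into it explicitly before doing any work:

```shell
#!/bin/bash
#SBATCH --nodes 1
# Hypothetical run directory; exit early if it is missing.
cd /scratch/my_run || exit 1
./my_exe
```

Equivalently, add `#SBATCH --workdir /scratch/my_run` and omit the cd.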
Job Policy Constraints
The total number of jobs a user can have running depends on overall system activity. When the system is busy with many users:
- Maximum number of jobs in the Running state: 20
- Maximum number of nodes per job (except by special arrangement): 460 (7,360 processor cores)
- Minimum number of nodes per job: 1 (for test or interactive purposes only)
Jobs must be run through the batch queue rather than on the login node. Jobs found running on the login node will be terminated by the system administrator, and the user will be sent an e-mail message. Jobs in the batch queue that have been suspended for more than 12 hours will be deleted.
Jobs larger than 460 nodes can be run when needed. We ask that you first contact your Science Point-of-Contact (your PI should know who that is) to make arrangements. Or you may send an e-mail to firstname.lastname@example.org with details on what is required.
For communication-intensive codes, using nodes on different top-level switches results in a fairly dramatic performance decrease. For this reason the scheduler is set to preferentially allocate all of a job's nodes within one switch. That means if you have a 20-node job and only 7 nodes are available on each of the 3 top-level switches (21 available nodes in total), your job will not start until 20 nodes are free within a single switch. This may be overridden for benchmarking purposes, or after discussion with MSC staff to ensure that the application's use of the interconnect is topology-aware and/or light enough not to cause congestion on the Tahoma or Boreal interconnect.
Running Interactive Jobs
Boreal supports interactive jobs. The partition in which the job runs is determined by the time the user has requested.
If your application requires an X session (usually a graphical application), make sure you used the -Y option on your ssh command when logging in (this enables X11 tunneling):
ssh -Y <name>@boreal.emsl.pnl.gov
To start an interactive job, use the srun command:
srun -A <allocation> -t <time> -N <nodes> --pty /bin/bash
- -A is the project (allocation) for which the job is run.
- -t is the number of minutes.
- -N is the number of nodes you want.
- /bin/bash is the command you want to run; typically it is your shell.
Alternatively, you can use the long options:
srun --account=<allocation> --time=<time> --nodes=<nodes> --pty /bin/bash
As with all jobs, an interactive job will wait in the queue until there is a set of available nodes for it.