Submitting Jobs

Most of the information about Linux processes in this section comes from Dr. Bill Miller III at A. T. Still University.

There are a couple of different ways to run/execute commands and calculations. If you execute a command straight from the command line on your local computer, that is called running a command (or job/calculation) interactively. Interactive jobs run immediately after you execute them, regardless of the other processes already running on the computer that may compete for the available resources. If you run a job on a cluster (or supercomputer), you are likely running the command through a queue, where you have to share resources with other users, and a queue scheduler (which is a script/program itself, not a real person, so don’t try to bribe the scheduler; he doesn’t need your money) distributes calculations to free compute nodes that have enough available resources to run the program. This section of the manual, though, will focus on running jobs interactively on a local computer/workstation.

Scripts and calculations expected to require limited resources or very little time can simply be run from the command line without any real considerations. However, when calculations are expected to take long periods of time (i.e. more than a few minutes), I have a few suggestions that might help.

Halting and Canceling Interactive Processes

If you are running a job interactively (i.e. the command prompt is unresponsive while the command is running), you can press ctrl+C at any time to cancel/kill the process. This terminates the job so it cannot run any longer.

Additionally, you have the option to halt or suspend an interactive process while it is running using ctrl+Z. In other words, pressing ctrl+Z while a command is running temporarily suspends the process. It is halted, but not killed. This allows you to use the Terminal’s command prompt again while the job is suspended. At any point, you can resume the job by typing fg to bring the command back to the foreground.
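
For example, suspending and resuming a long-running job might look something like this in bash (the script name here is hypothetical, and the exact job-control messages vary a bit by shell):

$ ./long_calculation.sh
^Z
[1]+  Stopped                 ./long_calculation.sh
$ fg
./long_calculation.sh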

Running in Background vs Foreground, and nohup

A typical, short calculation is run in the foreground. This means that when you execute the command, the command prompt in your Terminal becomes unresponsive until the process has completed. For long processes, I suggest running commands in the background. This allows the job to keep running in the background (i.e. without you noticing any difference) while you use that Terminal’s command prompt to do other things.

To run a job in the background, simply append an ampersand (&) to the end of the command. So if I were going to run a Gaussian 16 QM calculation in the foreground, the command would simply be

g16 structure.com

But if I wanted to run that same calculation in the background, the command would look like this

g16 structure.com &

When you press enter to run this command, a line like the following will be printed to the screen, showing the job number in brackets followed by the process ID (PID), letting you know the job is running (the PID here is just a placeholder)

[1] 12345

At this point (i.e. after this previous line is printed to the screen), if you press enter again your cursor will be back on the command prompt and you will be able to use your Terminal window to do more unix-y (yeah, that’s a word) things.

Occasionally, I accidentally execute a command without adding the ampersand at the end when I really wanted to run the job in the background. So instead of running in the background, the calculation is running in the foreground. In this situation, I press ctrl+Z to suspend the job. This means the job is simply waiting for you to let it know it can continue running. If you want the job to resume running in the foreground (as it was originally), you could type fg (short for foreground). But if you are like me and wanted to run the job in the background, simply type bg (short for background; see how they did that?) and press enter, and the job will begin running again, but now in the background (and you will get a notification line like the one mentioned above when I explained how to run g16 in the background).
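
Put together, that rescue might look something like this (the ^Z is what the terminal echoes when you press ctrl+Z):

$ g16 structure.com
^Z
[1]+  Stopped                 g16 structure.com
$ bg
[1]+ g16 structure.com &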

An extremely helpful program when running processes in the background is the nohup command. nohup can be added at the beginning of any command, and essentially what it does is ensure that your job continues running even if you log out of your computer. This is especially helpful if you are using a public or lab computer that other users access, or if you are working remotely (i.e. ssh’d into another computer), because using nohup means you can exit (or log out of) the remote workstation and your process/command will continue to run. I use this frequently if I am running a command remotely, just in case the connection between my computer and the remote computer is broken. If you do not use nohup, when the connection is severed, your command will be terminated. Continuing with the Gaussian 16 theme, if we wanted to initiate a g16 calculation on a remote computer using nohup, the command would look something like this

nohup g16 structure.com &

Notice that in addition to using nohup at the beginning of the command, we also made the process run in the background (using the &). Running a nohup command in the foreground largely defeats the purpose: your Terminal’s command prompt is tied up until the job finishes, so you would have to keep that session open anyway.
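
One more detail worth knowing: if you do not redirect output yourself, nohup appends anything the command would have printed to the screen to a file called nohup.out in the current directory. If you prefer to choose the file name, redirect explicitly (the script and log names here are just examples):

nohup ./long_calculation.sh > calculation.log 2>&1 &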

Listing and Killing Jobs Running in Background

I have already described to you how to run commands in the background. As long as you are still using the same Terminal window/tab you used to execute the background commands, you can type the command jobs to see what processes are running in the background.
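
For example, with two calculations running in the background, the output of jobs might look something like this (the exact formatting varies a bit by shell):

[1]-  Running                 g16 structure1.com &
[2]+  Running                 g16 structure2.com &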

Two processes are distinguished by the [1] or [2] shown at the beginning of each line. If for some reason I wanted to kill the first job (which I could also do using the kill command with the corresponding PID number, as described below), I could simply type

kill %1

and press enter. Similarly, to kill the second job I would just type

kill %2

In the event I am running a job in the background, I find using jobs and kill % much more convenient than determining the PID number from top or ps and using the kill -9 command.

The kill command is used to terminate processes that you are running on your computer. The general syntax for the kill command is

kill -9 PID

The -9 is added to smother the process so it has no chance of survival (technically, it sends the SIGKILL signal, which cannot be caught or ignored). The PID is a number that identifies each running process. You can obtain the PID of any process using either the ps or top commands. This should only be used on the local linux machines, as SLURM has its own way to kill/cancel a job.
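
A typical sequence, then, is to look up the PID with ps and kill it (the PID shown here is just a placeholder):

ps aux | grep g16
kill -9 12345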

Using a SLURM Queueing Manager

SLURM is what we use on ACME and on many high-performance computing clusters. It is a workload manager that implements a queueing system for job submission, making sure that all users can run jobs even if nodes are not currently available. Here are a few helpful commands for using SLURM:

Submit SLURM Jobs

On clusters, you often cannot run jobs/processes on the head node; instead, you submit them to compute nodes meant to handle these calculations. This means that you can't run jobs the way you can on local linux machines. Rather than just running a command, it's necessary to submit it to the queue.

To do this, you need to first make a submission script, such as this script to perform a CREST conformational search on ACME:

#!/bin/bash
#SBATCH -J crest_job
#SBATCH -p normal
#SBATCH -t 12:00:00
#SBATCH --export=NONE
#SBATCH --ntasks-per-node 16
#SBATCH --mem=96G
#SBATCH --output=%x.slurm_%J.out

# XTB setup
export XTBHOME=/opt/apps/xtb
export PATH=$PATH:$XTBHOME

# Base name of the input structure; the .xyz extension is added where needed
xtbjob=crest_search_input

# Scratch directory on the compute node's local disk
Scratchpath="/tmp/$SLURM_JOB_ID"
mkdir "$Scratchpath"

Homepath=$(pwd)

# Copy the input to scratch and run from there
cp "$xtbjob.xyz" "$Scratchpath"
cd "$Scratchpath"

# Add useful info to top of output file
echo "Job Start Time: $(date)" > "$Homepath/$xtbjob.out"
echo "SLURM Job ID: $SLURM_JOB_ID" >> "$Homepath/$xtbjob.out"
echo "SLURM Job Name: $SLURM_JOB_NAME" >> "$Homepath/$xtbjob.out"

# Run the CREST conformational search (implicit water, 16 threads)
export PATH=$PATH:$XTBHOME/bin
$XTBHOME/crest "$Scratchpath/$xtbjob.xyz" --alpb water -T 16 -niceprint >> "$Homepath/$xtbjob.out"

# Remove intermediate files and copy the results back home
rm -r METADYN* NORMMD* MRMSD wbo cregen_* coord*
mv crest_conformers.xyz "$Homepath/$xtbjob.confs.xyz"
mv crest_best.xyz "$Homepath/$xtbjob.best.xyz"
mv struc.xyz "$Homepath/$xtbjob.struc.xyz"
rm ./crest*

It is important to note the section at the top: all of the #SBATCH lines. The different flags tell the scheduler how to allocate resources to you. -J sets the name of your job, here "crest_job". -p sets the partition you are going to submit to, which can differ between clusters. On ACME, we have a "normal" partition, a "short" partition, a "long" partition, and a "debug" partition, all with different maximum wall times and priorities in the queue. -t sets the maximum wall time; in this case, your job will run no longer than 12 hours (12:00:00). There are also lines that specify the number of processors you want to use (--ntasks-per-node 16) and the memory you are allocating to the job (--mem=96G). There are more options you can specify, but these are the most important.

Once you have made a file like this, you can submit it to the queue with

sbatch crest_submission_script.sh

Then your job will be submitted to the queue and run when it is your turn.
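
When the submission succeeds, sbatch confirms it by printing the Job ID it assigned (the number here is just an example):

Submitted batch job 123456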

Hint

There is a very easy way to create the submission scripts and submit jobs on our cluster ACME: gsub. This command creates a submission script and automatically adds your job to the queue with the specifications you listed. You run this command with gsub input.com and can change specifications with different flags. To see what the options are, you can run gsub -h to get a list of the different ways to adjust the script.

Checking on Jobs

With the queueing system, it's possible that your jobs don't start immediately. You might want to check on them in the queue to see if they've started already. You can do this with

squeue

which will show you all the jobs in the queue. Often, you are only interested in your own jobs, which you can list with

squeue -u <USER>
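
The default output looks something like this (the job, user, and node names are placeholders):

 JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
123456    normal crest_jo   jsmith  R    1:23:45      1 node007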

These commands are easy to add as aliases in your .bashrc. Here are the ones that I have set up:

alias squ='squeue --format="%.7A %.35j %.10u %.7C %.10M %.15l %.20R"'
alias q='squeue --user=`whoami` --format="%.7A %.35j %.10u %.7C %.10M %.15l %.20R"'

The extra --format string just expands the display to include the maximum wall time and the number of processors requested, as well as a larger space allotted for the job title (in order, the fields are the job ID, job name, user, CPU count, elapsed time, time limit, and the reason/node list).

It may also be helpful to look into the Slurm Job Tracking developed within the group.

Cancel SLURM Jobs

Information for this section comes from Stack Overflow.

Using SLURM for job scheduling/queueing can be a really helpful tool for keeping track of jobs and sharing resources. However, sometimes you might make a mistake in a large batch of job submissions and want/need to stop them. Here are a few commands to help with this.

First, maybe you have one job you want to cancel. This is easily solved with the following command:

scancel <JOB ID>

You can find the Job ID for your job with the squeue command under "JOBID". Each job has a distinct Job ID, so you only have to worry about cancelling yours.

Another thing that may happen is that you submitted a lot of jobs with some major flaw, such as an incorrect basis set. If you want to cancel all of your jobs, use:

scancel -u <USERNAME>

This will cancel all jobs in the SLURM queue that are associated with your account.

In the event that you submitted several jobs, then submitted several more, but want to only keep the new submissions, you can cancel a range of Job IDs at a time. For example, if you started jobs 1000-1010, then started 1015-1030 without canceling the original jobs, there is still hope! Cancel the first batch with:

scancel {1000..1010}

This cancels all jobs with Job IDs between and including 1000 and 1010. This can also be helpful if you have too many jobs started and need to stop some to help with organization.
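
Note that the curly-brace range is bash brace expansion, not a scancel feature: the shell expands it into a list of individual Job IDs before scancel ever runs. You can preview the expansion by prefixing the command with echo, which prints the command that would be executed:

echo scancel {1000..1010}
scancel 1000 1001 1002 1003 1004 1005 1006 1007 1008 1009 1010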