I recently found myself in the following quandary: I wanted to set off a list of jobs on a shared cluster, for which I had just been allocated compute. However, I hadn’t quite finished putting together the list of jobs.
I could either submit the first jobs, write up a new submission with the remaining ones, and have those run once compute had been newly allocated, much later.
Or, I could try to be very clever, and use the time that the first jobs were running to finish putting together the list of jobs to run in the same allocation.
Here’s my little solution.
How It Works
Start by enumerating the commands you need to run in a plain text file, `tasks.txt`:

    [ ] gcc -o program myprogram.c
    [ ] python path/to/script.py arg=1
    [ ] python path/to/script.py arg=2

The syntax is very simple:
- tasks that have yet to be run are marked with an empty box: `[ ]`
- running tasks are marked with a dash: `[-]`
- completed tasks are crossed off: `[x]`
- failed tasks are indicated thus: `[!]`
Then, set off one (or more) runners which do the following:
- pick the first un-executed job (`[ ]`) in the list,
- mark it as running (`[-]`) and annotate the line with the runner’s PID,
- once the task has finished, inspect the status code and fill in `[x]` or `[!]` appropriately, removing the PID.
In other words, half-way through execution, the `tasks.txt` file might look like this:

    [x] gcc -o program myprogram.c
    [!] python -c "assert False"
    [-] python path/to/script.py arg=1 [29328]
    [-] python path/to/script.py arg=2 [31030]
    [ ] python path/to/script.py arg=3
    [ ] python path/to/script.py arg=4

On the top line, the `gcc` command ran successfully, as indicated by `[x]`.
In the second line, the `assert False` job predictably failed, and was thus marked with a `[!]`. While the runner(s) haven’t reached the end of the list yet, we could still try to fix the script, then replace the `[!]` with `[ ]` by editing `tasks.txt` directly, and the runner(s) will try the failed task again once they’re done with their current task.
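When several tasks failed for the same reason, resetting them one by one gets tedious. A minimal sketch of a bulk reset follows; the helper name `retry_failed` is made up for illustration, and it assumes GNU `sed` (for `-i`).

```shell
#!/bin/bash
# Hypothetical helper: flip every failed task ([!]) back to pending ([ ]).
# Assumes GNU sed, whose -i flag edits the file in place.
retry_failed() {
    local jobs_file="${1:-tasks.txt}"
    # Only touch lines that start with the failed marker
    sed -i 's/^\[!\] /[ ] /' "$jobs_file"
}
```

Calling `retry_failed tasks.txt` leaves `[x]`, `[-]` and `[ ]` lines untouched. Since it rewrites the whole file, it is best run when no runner is about to update its own line.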
The first two `script.py` tasks are being executed, as indicated by the `[-]` markings. In this example, there are two different runners, launched separately.
Each runner leaves its PID in square brackets at the end of the line to keep track of who’s running what. Note that the multi-worker setup assumes that `tasks.txt` is on a shared file system (e.g. NFS) which updates reasonably quickly; adding a locking mechanism is left as an exercise for the reader ;)
Each runner will update its corresponding line once its task finishes.
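For anyone who would rather not start the locking exercise from scratch, one possible starting point is `flock(1)` from util-linux. The sketch below claims the first pending task while holding an exclusive lock, so two runners cannot grab the same line. The names `claim_task` and `LOCK_FILE` are illustrative. Whether `flock` behaves correctly over NFS depends on the NFS version and mount options, so it is worth testing on the cluster in question.

```shell
#!/bin/bash
# Sketch only: claim_task and LOCK_FILE are illustrative names.
JOBS_FILE="${JOBS_FILE:-tasks.txt}"
LOCK_FILE="${LOCK_FILE:-$JOBS_FILE.lock}"

claim_task() {
    (
        # Block until this runner holds an exclusive lock on fd 9
        flock -x 9
        line_num=$(grep -n -m 1 '^\[ \]' "$JOBS_FILE" | cut -d: -f1)
        # Nothing left to claim: report failure to the caller
        [ -z "$line_num" ] && exit 1
        # Flip [ ] to [-] and append the runner's PID ($$ is the parent
        # shell's PID even inside this subshell)
        sed -i "${line_num}s/^\[ \] \(.*\)\$/[-] \1 [$$]/" "$JOBS_FILE"
        # Print the claimed command, without marker and PID
        sed -n "${line_num}p" "$JOBS_FILE" \
            | sed -e 's/^\[-\] //' -e 's/ \[[0-9]*\]$//'
    ) 9>"$LOCK_FILE"
}
```

A runner loop could then use `cmd=$(claim_task) && eval "$cmd"`. Because the command text is captured from the matched line rather than pasted into the `sed` expression, this variant of the claim step does not care which characters the command contains.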
Finally, and this is the whole point of the exercise, you can add more tasks to the queue as long as the runners are still running and there is still time left on your compute reservation.
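Adding tasks is just appending lines in the same format. As a rough sketch, a small helper can generate one pending entry per argument value; `add_tasks` is a hypothetical name, and the `python path/to/script.py arg=...` command is simply the example used in this post.

```shell
#!/bin/bash
# Hypothetical helper: append one pending task per argument value.
add_tasks() {
    local jobs_file="$1"
    shift
    local arg
    for arg in "$@"; do
        printf '[ ] python path/to/script.py arg=%s\n' "$arg"
    done >> "$jobs_file"   # a single append for the whole batch
}
```

For example, `add_tasks tasks.txt 5 6 7` queues three more runs. One caveat: GNU `sed -i` replaces the file rather than editing it in place, so an append that lands at the very moment a runner updates its line can be lost. A quick look at the file afterwards is cheap insurance.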
The Goods
Here’s the code. Copy it into some file (e.g. `runner.sh`) and make it executable.

    #!/bin/bash
    JOBS_FILE="tasks.txt"

    while true; do
        # Find the line number of the first un-executed job
        LINE_NUM=$(grep -n -m 1 "^\[ \]" "$JOBS_FILE" | cut -d: -f1)

        # Stop once there are no un-executed jobs left
        if [ -z "$LINE_NUM" ]; then
            break
        fi

        # Extract the command and append this runner's PID
        JOB_LINE=$(sed -n "${LINE_NUM}p" "$JOBS_FILE")
        JOB_COMMAND=$(echo "$JOB_LINE" | sed -e 's/^\[ \] //')
        EXECUTING_JOB_LINE="[-] $JOB_COMMAND [$$]"

        # Replace the line with the executing status
        sed -i "${LINE_NUM}s#.*#$EXECUTING_JOB_LINE#" "$JOBS_FILE"

        # Execute the command
        eval "$JOB_COMMAND"
        STATUS=$?

        # Update the job status based on the exit code. Match only the
        # trailing " [PID]" marker, so that a line which merely contains
        # the same digits elsewhere is left alone.
        if [ "$STATUS" -eq 0 ]; then
            # Mark the job as completed
            sed -i "/ \[$$\]\$/s#.*#[x] $JOB_COMMAND#" "$JOBS_FILE"
        else
            # Mark the job as failed
            sed -i "/ \[$$\]\$/s#.*#[!] $JOB_COMMAND#" "$JOBS_FILE"
        fi
    done

    echo "All jobs are completed."

Now, you can run `./runner.sh` on one or more workers, which will make progress through the job queue, each taking the first un-executed (`[ ]`) command anywhere in the list.
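On a single machine, starting several runners can be as simple as backgrounding them. The sketch below uses a hypothetical `launch_runners` helper. It staggers the starts by a configurable delay (one second by default, an arbitrary choice) because two runners that start at the same instant would likely pick the same first task.

```shell
#!/bin/bash
# Hypothetical helper: start N runners in the background, then wait.
# Usage: launch_runners [count] [runner command] [stagger seconds]
launch_runners() {
    local n="${1:-2}" runner="${2:-./runner.sh}" stagger="${3:-1}" i
    for ((i = 0; i < n; i++)); do
        "$runner" &
        # Give each runner time to mark its task before the next starts
        sleep "$stagger"
    done
    # Return only once every runner has exited
    wait
}
```

`launch_runners 4` would start four copies of `./runner.sh` and return once the queue is drained.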
You can now add new entries to the `tasks.txt` queue, re-order them, delete some, and even re-try failed ones. Lovely.
Note: Don't use `#` in your commands
As a parting note, do notice that I’ve chosen to use `#` as the separator character in the `sed` commands above.
This means that you shouldn’t use the `#` character in your commands in the task queue.
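A quick pre-flight check can catch offending lines before a runner trips over them. The sketch below uses a hypothetical `check_tasks` helper that flags `#`, and also `&` and backslashes, since those two have a special meaning in the replacement part of a `sed` substitution and would be mangled in the same way.

```shell
#!/bin/bash
# Hypothetical helper: list task lines containing characters that the
# sed-based status updates cannot handle. Returns 1 if any are found.
check_tasks() {
    local jobs_file="${1:-tasks.txt}"
    # -F treats each pattern as a fixed string, -n prints line numbers
    if grep -n -F -e '#' -e '&' -e '\' "$jobs_file"; then
        echo "warning: the lines above contain #, & or a backslash" >&2
        return 1
    fi
    return 0
}
```

Running `check_tasks tasks.txt` prints each problematic line prefixed with its line number, and stays silent when the file is clean.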