Fragment


A Job Queue in 20 Lines of Bash

A simple job queue for scheduling, re-ordering, deleting and re-trying long-running tasks, using plain text and bash.

February 1, 2024

London, UK


I recently found myself in the following quandary: I wanted to set off a list of jobs on a shared cluster, for which I had just been allocated compute. However, I hadn’t quite finished putting together the list of jobs.

I could either submit the first jobs now, write up a new submission with the remaining ones, and have those run much later, once compute had been allocated afresh.

Or, I could try to be very clever, and use the time that the first jobs were running to finish putting together the list of jobs to run in the same allocation.

Here’s my little solution.

Start by enumerating the commands you need to run, in a plain text file tasks.txt:

[ ] gcc -o program myprogram.c
[ ] python path/to/script.py arg=1
[ ] python path/to/script.py arg=2

The syntax is very simple:

  • tasks that have yet to be run are marked with an empty box [ ]
  • running tasks are marked with a dash [-]
  • completed tasks are crossed off [x]
  • failed tasks are indicated thus: [!]

Then, set off one (or more) runners which do the following:

  1. pick the first un-executed job ([ ]) in the list,
  2. mark it as running ([-]) and annotate the line with the runner’s PID
  3. once the task has finished, inspect the status code and fill in [x] or [!] appropriately, removing the PID

In other words, half-way through execution, the tasks.txt file might look like this:

[x] gcc -o program myprogram.c
[!] python -c "assert False"
[-] python path/to/script.py arg=1 [29328]
[-] python path/to/script.py arg=2 [31030]
[ ] python path/to/script.py arg=3
[ ] python path/to/script.py arg=4

On the top line, the gcc command ran successfully, as indicated by [x].

On the second line, the assert False job predictably failed, and was thus marked with a [!]. As long as the runner(s) haven’t reached the end of the list, we can fix the script, replace the [!] with [ ] by editing tasks.txt directly, and the runner(s) will re-try the failed task once they’re done with their current ones.
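
For instance, to re-try everything that has failed in one go, a one-liner along these lines (GNU sed assumed for the in-place -i) does the trick:

# Flip every failed task ([!]) back to pending ([ ])
sed -i 's/^\[!\]/[ ]/' tasks.txt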

The two script.py jobs are currently being executed, as indicated by the [-] markings. In this example, two different runners were launched separately. Each runner leaves its PID in square brackets at the end of the line to keep track of who’s running what. Note that the multi-runner setup assumes that tasks.txt lives on a shared file system (e.g. an NFS mount) that propagates updates reasonably quickly (adding a locking mechanism is left as an exercise for the reader ;)). Each runner will update its own line once its job finishes.

Finally—as the whole point of the exercise—you can add more tasks to the queue as long as the runners are still running and there is still time left on your compute reservation.
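
Adding a task is just appending a line to the file. For example (arg=5 here is a made-up next job):

# Queue another job while the runners are busy
echo '[ ] python path/to/script.py arg=5' >> tasks.txt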

Here’s the code. Copy it into some file (e.g. runner.sh) and make it executable.

#!/bin/bash

JOBS_FILE="tasks.txt"

while true; do
    # Find the line number of the first un-executed job
    LINE_NUM=$(grep -n -m 1 "^\[ \]" "$JOBS_FILE" | cut -d: -f1)

    # Check if there are any un-executed jobs left
    if [ -z "$LINE_NUM" ]; then
        break
    fi

    # Extract the command and append this runner's PID
    JOB_LINE=$(sed -n "${LINE_NUM}p" "$JOBS_FILE")
    JOB_COMMAND=$(echo "$JOB_LINE" | sed -e 's/^\[ \] //')
    EXECUTING_JOB_LINE="[-] $JOB_COMMAND [$$]"

    # Replace the line with the executing status
    sed -i "${LINE_NUM}s#.*#$EXECUTING_JOB_LINE#" "$JOBS_FILE"

    # Execute the command
    eval "$JOB_COMMAND"
    STATUS=$?

    # Update the job status based on the exit code, matching the line tagged with this runner's PID
    if [ $STATUS -eq 0 ]; then
        # Mark the job as completed, dropping the PID
        sed -i "/\[$$\]$/s#.*#[x] $JOB_COMMAND#" "$JOBS_FILE"
    else
        # Mark the job as failed, dropping the PID
        sed -i "/\[$$\]$/s#.*#[!] $JOB_COMMAND#" "$JOBS_FILE"
    fi
done

echo "All jobs are completed."

Now you can run ./runner.sh on one or more workers, and they will make progress through the job queue, each repeatedly claiming the first un-executed ([ ]) command, wherever it sits in the list.
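
For example, to keep two runners going on a single worker, something like the following would do; the log file names are arbitrary, and backgrounding with nohup is just one way to keep them alive:

chmod +x runner.sh
nohup ./runner.sh > runner1.log 2>&1 &   # first runner
nohup ./runner.sh > runner2.log 2>&1 &   # second runner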

You can now add new entries to the tasks.txt queue, re-order them, delete some, and even re-try failed ones. Lovely.
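
And if you ever do want the locking mechanism left as an exercise above, say to rule out two runners claiming the same line at once, here is one possible sketch. It assumes your shared file system honours flock (support over NFS varies), and the claim_next_job helper and tasks.txt.lock file are names I’ve made up for illustration:

# Sketch only: a claim_next_job helper that wraps the "find and claim" step
# of runner.sh in an exclusive flock. JOBS_FILE is the variable from runner.sh.
LOCK_FILE="tasks.txt.lock"

claim_next_job() {
    (
        flock -x 200  # exclusive lock on fd 200; released when the subshell exits
        LINE_NUM=$(grep -n -m 1 "^\[ \]" "$JOBS_FILE" | cut -d: -f1)
        [ -z "$LINE_NUM" ] && exit 1            # queue is empty
        JOB_COMMAND=$(sed -n "${LINE_NUM}p" "$JOBS_FILE" | sed -e 's/^\[ \] //')
        sed -i "${LINE_NUM}s#.*#[-] $JOB_COMMAND [$$]#" "$JOBS_FILE"
        echo "$JOB_COMMAND"                     # hand the claimed command to the caller
    ) 200>"$LOCK_FILE"
}

# In the main loop, the first few steps would then collapse to:
#   JOB_COMMAND=$(claim_next_job) || break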

Note: Don't use `#` in your commands

As a parting note, do notice that I’ve chosen to use # as the separator character in the sed commands above.

This means that you shouldn’t use the # character in your commands in the task queue.
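
If a command really must contain a #, one workaround is simply to pick a different delimiter for those three sed substitutions, for example | (assuming, in turn, that none of your commands contains a literal pipe):

# The "claim" substitution from runner.sh, with '|' as the sed delimiter instead of '#'
sed -i "${LINE_NUM}s|.*|$EXECUTING_JOB_LINE|" "$JOBS_FILE"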