Faster downloading in shell scripts

Multithreading! Bash scripts! Have there ever been words with so much promise and despair? This is a little trick used in Aurphan to make downloading a bunch of little things go several times faster.

Of course this can be used for a lot more than just downloading. It lets you multithread any bash function, something not possible with external tools.

Case Study 1 - Aurphan

Background: there is a function rpc_thread(). It uses curl to send a request to the AUR. It parses the reply and might echo a line and touch a temp file (named $temp). We need to call this function a lot. Waiting for a reply takes between 100 and 300 ms. With several hundred requests, prodding the AUR takes a full minute.
$packages contains a list of strings, each one to be processed in parallel.
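
The exact body of rpc_thread() does not matter much here, but it is roughly something like this (a sketch only; the real parsing in Aurphan differs):

rpc_thread() {
    # Sketch: ask the AUR about one package and note any hit.
    local reply
    reply=$(curl -fGs --data-urlencode "arg=$1" \
        "http://aur.archlinux.org/rpc.php?type=info")
    if ! grep -Fq ':"No result found"' <<< "$reply"; then
        echo "$1"         # report the package on stdout
        touch "$temp"     # flag that something was found
    fi
}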

First, we need to gracefully handle exits. This means waiting for threads to finish and cleaning up any temp files.

trap "wait; rm $temp" TERM EXIT

So if the script is terminated (^C) it will wait for the child threads to join and clean up the temp file. Pressing ^C again during the pause will abort this. It also runs on a normal exit, saving the trouble of writing a cleanup() function at every possible exit point.

Second, the threading code. It is pretty simple.

for n in $packages; do
    rpc_thread "$n" &
    while (( $(jobs | wc -l) >= 8 )); do
        sleep 0.1
        jobs > /dev/null
    done
done
wait

This uses & to fork a thread and jobs to count the number of threads currently running. If the number is 8 or above, it sleeps. Since jobs is a bash builtin, this only works if your shebang points at bash, e.g. #!/bin/bash (since the /usr merge, #!/usr/bin/bash works as well). It runs six to eight times faster. Aurphan is a really simple example though: it does not have to do anything with the output of rpc_thread(). In other words, we have a parallel map but no reduce.
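
Because the worker here is just a bash function, the same pattern generalizes. A sketch of a reusable wrapper (the name parallel_map and the hard-coded limit of 8 are only for illustration):

parallel_map() {
    # Run a bash function over every argument, at most 8 jobs at a time.
    local fn=$1
    shift
    local item
    for item in "$@"; do
        "$fn" "$item" &
        while (( $(jobs | wc -l) >= 8 )); do
            sleep 0.1
            jobs > /dev/null
        done
    done
    wait
}

parallel_map rpc_thread $packages    # same behavior as the loop above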

Case Study 2 - Packer

How about grafting this code into a complicated bash script?
Packer has the same problem. This part takes 68 seconds on my laptop:

for ((i=0; i<$total; i++)); do 
    aurbar "$((i+1))" "$total" "$bartype"
    pkg="${packages[i]%% *}"
    ver="${packages[i]##* }"
    if isignored "$pkg"; then
        checkignores+=("${packages[i]}")
    elif aurversionisnewer "$pkg" "$ver"; then
        newpackages+=("$pkg")
    fi
done

Here there is a C-style loop over an array of package names. The array index is used to drive the aurbar() progress bar. The function aurversionisnewer() calls curl, processes the result and saves it to a temp file. Here is that function:

aurversionisnewer() {
    if ! [[ -f "$tmpdir/$1.info" ]]; then
        curl -fGs --data-urlencode "arg=$1" "http://aur.archlinux.org/rpc.php?type=info" > "$tmpdir/$1.info"
    fi
    unset aurversion
    if ! grep -Fq ':"No result found"' "$tmpdir/$1.info"; then
        aurversion="$(cut -d '"' -f 18 "$tmpdir/$1.info")"
        if [[ "$(LC_ALL=C vercmp "$aurversion" "$2")" -gt 0  ]]; then
            return 0
        else
            return 1
        fi
    else
        return 1
    fi
}

To make it more parallel-friendly, refactor. The curl call becomes its own function, and aurversionisnewer() becomes a thin wrapper for the other code in Packer which still uses the non-threaded version.

rpcinfo_bg() {
    if ! [[ -f "$tmpdir/$1.info" ]]; then
        curl -fGs --data-urlencode "arg=$1" "http://aur.archlinux.org/rpc.php?type=info" > "$tmpdir/$1.info"
    fi
}

versionisnewer() {
    unset aurversion
    if ! grep -Fq ':"No result found"' "$tmpdir/$1.info"; then
        aurversion="$(cut -d '"' -f 18 "$tmpdir/$1.info")"
        if [[ "$(LC_ALL=C vercmp "$aurversion" "$2")" -gt 0  ]]; then
            return 0
        else
            return 1
        fi
    else
        return 1
    fi
}

aurversionisnewer() {
    rpcinfo_bg "$1"
    versionisnewer "$1" "$2"
}

Now we tackle the main loop. The idea here is to break it into two halves: a slow parallel download and a fast sequential calculation.

for ((i=0; i<$total; i++)); do 
    aurbar "$((i+1))" "$total" "$bartype"
    pkg="${packages[i]%% *}"
    rpcinfo_bg "$pkg" &
    while (( $(jobs | wc -l) >= 8 )); do
        sleep 0.1
        jobs > /dev/null
    done
done
wait
for ((i=0; i<$total; i++)); do
    pkg="${packages[i]%% *}"
    ver="${packages[i]##* }"
    if isignored "$pkg"; then
        checkignores+=("${packages[i]}")
    elif versionisnewer "$pkg" "$ver"; then
        newpackages+=("$pkg")
    fi
done

The code is now slightly bulkier, but it only takes 16 seconds. Four times faster.

Avoiding leaks

Let's take a closer look at jobs > /dev/null. There is a "feature" in jobs where it carries state between calls, which makes working with jobs a little bit strange. Whenever a thread finishes, it leaves a note with jobs. The note is removed when you run jobs and see the output. However, the finished messages are not removed if you run jobs in a subshell or look at the running processes with jobs -r. So $(jobs | wc -l) accumulates cruft, and jobs > /dev/null clears it out. The cruft adds up slowly: if it is never cleared, bash will leak about 86KB for every 1000 jobs. It also makes spawning a new thread O(N) instead of O(1), causing slowdowns over time.
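
To see this behavior for yourself, something like the following works (timings are only illustrative):

sleep 0.1 &
sleep 0.2                  # give the background job time to finish
echo "$(jobs | wc -l)"     # 1 - the finished job is still listed
echo "$(jobs | wc -l)"     # still 1 - a subshell's jobs does not clear the note
jobs > /dev/null           # reporting in the current shell removes it
echo "$(jobs | wc -l)"     # 0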

There are a few downsides to this method. It is non-portable, needing Bash and GNU sleep (for the fractional sleep). You have to use temp files for communication; named pipes seem to lock up. On the plus side, threads can echo to the output, and it looks like echo is atomic.
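
For the record, the temp file communication looks something like this (a sketch with made-up names, reusing the parallel_map wrapper sketched earlier):

tmpdir=$(mktemp -d)
trap "wait; rm -rf $tmpdir" TERM EXIT

lookup() {
    # Made-up worker: each job writes its result to its own file.
    some_slow_command "$1" > "$tmpdir/$1.out"
}

parallel_map lookup $items    # parallel map
cat "$tmpdir"/*.out           # sequential reduce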

Why not xargs?

The normal method of doing parallel shell scripts is to use xargs or parallel. Both have the same problem: functions are not supported. Despite the many oddities of Bash's functions, they are still preferred to writing a multitude of tiny xargs-callable scripts.
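
For comparison, the xargs route means moving the worker into its own script. A sketch (worker.sh is hypothetical and would contain what rpc_thread() does now):

printf '%s\n' $packages | xargs -n 1 -P 8 ./worker.sh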

Update: Bash functions have been working in GNU Parallel since release 20110822.