7AM November 23, 2010: Faster downloading in shell scripts.
2PM November 23, 2010: Why not xargs?


Multithreading! Bash scripts! Have there ever been words with so much promise and despair? This is a little trick used in Aurphan to make downloading a bunch of little things go several times faster.

Case Study 1 - Aurphan

Background: there is a function rpc_thread(). It uses curl to send a request to the AUR. If the reply is positive, a line is echoed and a temp file (named $temp) is touched. We need to call this function a lot. Waiting for a reply takes between 100 and 300 ms. With several hundred requests, prodding the AUR takes a full minute.
($packages contains a list of strings, each one to be processed in parallel.)

First, we need to gracefully handle exits. This means waiting for threads to finish and cleaning up any temp files.

trap "wait; rm $temp" TERM EXIT

So if the script is terminated (^C) it will wait for the child threads to join and clean up the temp file. Pressing ^C again during the pause will abort this. It also runs on a normal exit, saving the trouble of writing a cleanup() function at every possible exit point.
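Put together, a minimal self-contained sketch of the cleanup pattern looks like this. (The worker function here is a hypothetical stand-in for rpc_thread.)

```shell
#!/usr/bin/env bash
# Sketch: temp file + exit trap + background workers.
temp=$(mktemp)
trap "wait; rm -f $temp" TERM EXIT

worker() { sleep 0.1; echo hit >> "$temp"; }

worker &
worker &
wait                      # the EXIT trap would also wait, then clean up
echo "replies: $(grep -c hit "$temp")"
```

Because the trap fires on EXIT as well as TERM, the temp file is removed whether the script finishes normally or is interrupted.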

Second, the threading code. It is pretty simple.

for n in $packages; do
    rpc_thread "$n" &
    while (( $(jobs | wc -l) >= 8 )); do
        sleep 0.1
        jobs > /dev/null
    done
done
wait

This uses & to fork a thread and jobs to count the number of threads currently running. If the number is 8 or above, it sleeps. Since jobs is a bash builtin, this will only work if your shebang is #!/usr/bin/bash. jobs also carries state between calls, making it a little bit strange. Whenever a thread finishes, it leaves a note with jobs. The note is removed when you run jobs and see the output. However, the notes are not removed if you run jobs in a subshell. So $(jobs | wc -l) accumulates cruft, and jobs > /dev/null clears it out.

With this change Aurphan runs six to eight times faster. It is a really simple example though: it does not have to do anything with the output of rpc_thread(). In other words, we have a parallel map but no reduce.
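The throttle can be watched in isolation with a dummy worker (a hypothetical stand-in; the loop and jobs bookkeeping are the same as above):

```shell
#!/usr/bin/env bash
# Demo of the jobs-based throttle: never more than 4 dummy workers at once.
worker() { sleep 0.2; }

for n in $(seq 20); do
    worker &
    while (( $(jobs | wc -l) >= 4 )); do
        sleep 0.1
        jobs > /dev/null   # reap the "Done" notices, or the count only grows
    done
done
wait
echo "all 20 done"
```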

Case Study 2 - Packer

How about grafting this code into a complicated bash script?
Packer has the same problem. This part takes 68 seconds on my laptop:

for ((i=0; i<$total; i++)); do 
    aurbar "$((i+1))" "$total" "$bartype"
    pkg="${packages[i]%% *}"
    ver="${packages[i]##* }"
    if isignored "$pkg"; then
        checkignores+=("${packages[i]}")
    elif aurversionisnewer "$pkg" "$ver"; then
        newpackages+=("$pkg")
    fi
done

Here is a C-style loop over an array of package names. The array index drives the aurbar() progress bar. The function aurversionisnewer() calls curl, processes the result, and saves it to a temp file. Here is that function:

aurversionisnewer() {
    if ! [[ -f "$tmpdir/$1.info" ]]; then
        curl -fGs --data-urlencode "arg=$1" "http://aur.archlinux.org/rpc.php?type=info" > "$tmpdir/$1.info"
    fi
    unset aurversion
    if ! grep -Fq ':"No result found"' "$tmpdir/$1.info"; then
        aurversion="$(cut -d '"' -f 18 "$tmpdir/$1.info")"
        if [[ "$(LC_ALL=C vercmp "$aurversion" "$2")" -gt 0  ]]; then
            return 0
        else
            return 1
        fi
    else
        return 1
    fi
}

To make it more parallel-friendly, refactor: the curl call becomes its own function, and aurversionisnewer() becomes a thin wrapper for the other code in Packer that still uses the non-threaded version.

rpcinfo_bg() {
    if ! [[ -f "$tmpdir/$1.info" ]]; then
        curl -fGs --data-urlencode "arg=$1" "http://aur.archlinux.org/rpc.php?type=info" > "$tmpdir/$1.info"
    fi
}

versionisnewer() {
    unset aurversion
    if ! grep -Fq ':"No result found"' "$tmpdir/$1.info"; then
        aurversion="$(cut -d '"' -f 18 "$tmpdir/$1.info")"
        if [[ "$(LC_ALL=C vercmp "$aurversion" "$2")" -gt 0  ]]; then
            return 0
        else
            return 1
        fi
    else
        return 1
    fi
}

aurversionisnewer() {
    rpcinfo_bg "$1"
    versionisnewer "$1" "$2"
}

Now we tackle the main loop. The idea here is to break it into two halves: a slow parallel download and a fast sequential calculation.

for ((i=0; i<$total; i++)); do 
    aurbar "$((i+1))" "$total" "$bartype"
    pkg="${packages[i]%% *}"
    rpcinfo_bg "$pkg" &
    while (( $(jobs | wc -l) >= 8 )); do
        sleep 0.1
        jobs > /dev/null
    done
done
wait
for ((i=0; i<$total; i++)); do
    pkg="${packages[i]%% *}"
    ver="${packages[i]##* }"
    if isignored "$pkg"; then
        checkignores+=("${packages[i]}")
    elif versionisnewer "$pkg" "$ver"; then
        newpackages+=("$pkg")
    fi
done

The code is now slightly bulkier, but only takes 16 seconds: four times faster.
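The same two-phase pattern in miniature: a parallel "map" that caches results in temp files, then a sequential "reduce" over the cache. (fetch_bg is a hypothetical stand-in for rpcinfo_bg, with the network call replaced by a plain echo.)

```shell
#!/usr/bin/env bash
tmpdir=$(mktemp -d)
trap "wait; rm -rf $tmpdir" TERM EXIT

fetch_bg() {    # cache one result per item, like rpcinfo_bg does
    [[ -f "$tmpdir/$1.info" ]] || echo "data for $1" > "$tmpdir/$1.info"
}

packages="foo bar baz"
for n in $packages; do           # phase 1: slow, parallel
    fetch_bg "$n" &
    while (( $(jobs | wc -l) >= 8 )); do
        sleep 0.1
        jobs > /dev/null
    done
done
wait
for n in $packages; do           # phase 2: fast, sequential
    cat "$tmpdir/$n.info"
done
```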

There are a few downsides to this method. It is non-portable, needing Bash and GNU sleep. You have to use temp files for communication. Named pipes seem to lock up. On the plus side threads can echo to the output, and it looks like echo is atomic.
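The temp-file requirement is easy to demonstrate: a & job runs in a subshell, so variables set there never make it back to the parent. (This is a hypothetical demo, not code from Aurphan or Packer.)

```shell
#!/usr/bin/env bash
result=""
set_result() { result="set in child"; }

set_result &
wait
echo "via variable:  '$result'"          # still empty in the parent

tmp=$(mktemp)
( echo "set in child" > "$tmp" ) &
wait
echo "via temp file: '$(cat "$tmp")'"    # the file survives the subshell
rm -f "$tmp"
```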

The normal method of doing parallel shell scripts is to use xargs or parallel. Both have the same problem: functions are not supported. Despite the many oddities of Bash's functions, they are still preferred to writing a multitude of tiny xargs-callable scripts.
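For comparison, the closest xargs gets to calling a function (assuming bash and GNU xargs) is exporting it and re-invoking bash, which is exactly the sort of contortion the approach above avoids:

```shell
#!/usr/bin/env bash
# Sketch of the xargs route: xargs can only run external commands, so the
# function must be exported and bash spawned again for every batch.
rpc_thread() { echo "processed $1"; }
export -f rpc_thread

# -P 8 runs up to 8 in parallel; sort makes the demo output deterministic
printf '%s\n' foo bar baz | xargs -P 8 -n 1 bash -c 'rpc_thread "$0"' | sort
```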