And that is what I've got now. The code is tiny, just 240 lines. It loads a 512x512 image and crunches away. Porting the original algorithm from Python-that-looked-like-C to NumPy produced an 8x speedup. The rest of the code still needs a huge amount of polishing.
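
To give a feel for what that kind of port looks like, here is a minimal sketch. It is not the actual algorithm from my code: the 4-neighbour average, the function names, and the random stand-in image are all made up for illustration. The first function is the C-style version with explicit loops, the second does the same work with NumPy slicing.

```python
import numpy as np


def neighbour_mean_loops(img):
    """C-style Python: visit every interior pixel with explicit loops."""
    h, w = img.shape
    out = np.zeros_like(img)
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            # Average of the four direct neighbours (hypothetical operation).
            out[y, x] = (img[y - 1, x] + img[y + 1, x]
                         + img[y, x - 1] + img[y, x + 1]) / 4.0
    return out


def neighbour_mean_numpy(img):
    """Same computation, expressed as whole-array slices."""
    out = np.zeros_like(img)
    # Each slice is the image shifted by one pixel in one direction.
    out[1:-1, 1:-1] = (img[:-2, 1:-1] + img[2:, 1:-1]
                       + img[1:-1, :-2] + img[1:-1, 2:]) / 4.0
    return out


# Random stand-in for the 512x512 input image (assumption: 2-D float array).
rng = np.random.default_rng(0)
img = rng.random((512, 512))
result = neighbour_mean_numpy(img)
```

Both functions leave the border pixels at zero and should agree on the interior, so the loop version makes a handy reference when checking the vectorized one. How much faster the sliced version runs depends on the operation and the machine.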

