## Description
Part of the gap observed in #119 is due to `webp-animation` using multi-threading. `libwebp` also has a multi-threading option for still images, but right now `image-webp` is single-threaded.
I think there are some opportunities for easy wins from parallelization.
## Decoding frames in parallel
Each frame is contained in its own `ANMF` chunk, which specifies its length, so we should be able to read the raw bytes and spawn the decoding onto a thread pool quite easily.
Complication: this may dramatically increase memory usage on machines with many cores, or whenever decoding outpaces compositing and uncomposited frames pile up. We will need some sort of backpressure mechanism to prevent that, for example only allowing at most X frames or at most Y megapixels to be in flight at once. We know all the frame dimensions up front, so this isn't hard to enforce, but it does add complexity compared to just slapping `.par_iter()` on it.
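Below is a minimal sketch of one possible backpressure scheme, assuming a rayon-style thread pool for the decode work. `RawFrame`, `DecodedFrame`, `decode_frame`, `composite`, and `MAX_IN_FLIGHT` are hypothetical stand-ins, not `image-webp` API; a megapixel budget could be enforced the same way by sizing the token supply in pixels rather than frames.

```rust
use std::collections::BTreeMap;
use std::sync::mpsc::{channel, sync_channel};
use std::thread;

// Hypothetical stand-ins for image-webp internals.
struct RawFrame { index: usize, data: Vec<u8> }
struct DecodedFrame { index: usize, pixels: Vec<u8> }

fn decode_frame(raw: RawFrame) -> DecodedFrame {
    // Placeholder for the real per-frame decode.
    DecodedFrame { index: raw.index, pixels: raw.data }
}

fn composite(_frame: &DecodedFrame, _canvas: &mut [u8]) {
    // Placeholder for the sequential compositing step.
}

fn decode_animation(frames: Vec<RawFrame>, canvas: &mut [u8]) {
    const MAX_IN_FLIGHT: usize = 4; // the "at most X frames" budget
    let total = frames.len();

    // Tokens enforce the backpressure: a decode may only start after
    // taking a token, and the compositor returns the token once the
    // frame has been consumed.
    let (token_tx, token_rx) = sync_channel::<()>(MAX_IN_FLIGHT);
    for _ in 0..MAX_IN_FLIGHT {
        token_tx.send(()).unwrap();
    }
    let (done_tx, done_rx) = channel::<DecodedFrame>();

    thread::scope(|s| {
        // Producer: hands frames to the pool, blocking on `recv`
        // whenever MAX_IN_FLIGHT frames are already in flight.
        s.spawn(move || {
            for raw in frames {
                token_rx.recv().unwrap();
                let done_tx = done_tx.clone();
                rayon::spawn(move || {
                    let _ = done_tx.send(decode_frame(raw));
                });
            }
        });

        // Consumer: decodes finish out of order, so buffer them and
        // composite strictly in sequence, returning one token per frame.
        let mut pending = BTreeMap::new();
        let mut next = 0;
        while next < total {
            let frame = done_rx.recv().unwrap();
            pending.insert(frame.index, frame);
            while let Some(frame) = pending.remove(&next) {
                composite(&frame, canvas);
                let _ = token_tx.send(()); // producer may have exited already
                next += 1;
            }
        }
    });
}
```

The `BTreeMap` buffer absorbs out-of-order completions, and the token channel is what keeps that buffer (plus the in-flight decodes) bounded.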
## Compositing frames in parallel
In the general case compositing is inherently sequential: it requires the previous frame to be fully composited before the next one can start. However, each row in the image can be processed independently, so we should be able to just use `.par_chunks_mut()` on the buffer and composite each row separately.
The profile in #119 shows that we're spending a lot of time just doing alpha blending, so this could translate to significant performance gains.
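For illustration, here is what the row-parallel blend could look like with rayon's `par_chunks_mut`, assuming RGBA8 canvas and frame buffers of identical dimensions. `composite_over` and `blend_pixel` are hypothetical helpers, and the blend shown is a plain non-premultiplied "source over" operator, not necessarily the exact arithmetic the decoder uses:

```rust
use rayon::prelude::*;

fn composite_over(canvas: &mut [u8], frame: &[u8], width: usize) {
    let row_len = width * 4; // 4 bytes per RGBA8 pixel

    canvas
        .par_chunks_mut(row_len) // one canvas row per parallel task...
        .zip(frame.par_chunks(row_len)) // ...paired with the matching frame row
        .for_each(|(canvas_row, frame_row)| {
            let pixels = canvas_row.chunks_exact_mut(4).zip(frame_row.chunks_exact(4));
            for (dst, src) in pixels {
                blend_pixel(dst, src);
            }
        });
}

fn blend_pixel(dst: &mut [u8], src: &[u8]) {
    // Non-premultiplied "source over" with integer arithmetic.
    let sa = src[3] as u32;
    let da = dst[3] as u32;
    let oa = sa + da * (255 - sa) / 255;
    if oa == 0 {
        dst.copy_from_slice(&[0, 0, 0, 0]);
        return;
    }
    for c in 0..3 {
        let sc = src[c] as u32;
        let dc = dst[c] as u32;
        dst[c] = ((sc * sa + dc * da * (255 - sa) / 255) / oa) as u8;
    }
    dst[3] = oa as u8;
}
```

Since each task owns a disjoint `&mut` row of the canvas, no locking is needed, and rayon's work stealing only splits the rows across threads when the pool actually has idle workers.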
## Still images?
`libwebp` also has optional multi-threading support for still images. I benchmarked `dwebp` against `dwebp -mt` on a lossy image and saw only a 20% performance improvement. The single-threaded bottleneck in #71 also suggests we wouldn't get much out of single-image parallelism. I don't think it's worth the trouble for static images.