Performance vs OpenEXR
How the pure-C11 TinyEXR v3 codec stacks up against the reference OpenEXR library, codec by codec, for encode and decode — single-threaded and multi-threaded.
Setup & method
- CPU: AMD Ryzen 9 3950X (16C/32T, Zen2), avx2+f16c. Machine idle.
- Image: asakusa.exr, 660×440, 4× HALF. (Small; see the caveat in Multi-threading about chunk-count-limited scaling.)
- I/O: fully in memory (StdOSStream/StdISStream on the OpenEXR side); each library loads the same source independently. OpenEXR 4.0-dev, gcc -O3.
- Throughput is megapixels/second (higher is better); sizes are the compressed payload. Numbers vary a few % run-to-run.
- Both libraries are pinned to the same thread count (Imf::setGlobalThreadCount(n) / exr_set_num_threads(n)).
- Single-thread decode: TinyEXR wins big on the cheap codecs — 3.4× on uncompressed and 2.5× on RLE. On the DEFLATE family / PIZ / HTJ2K, OpenEXR leads (≈1.2× ZIP, ≈1.8× PXR24, ≈2–2.7× ZIPS/PIZ/HTJ2K) because of its libdeflate backend and tuned PIZ/JPH.
- Single-thread encode: TinyEXR wins on RLE and is within ~10 % of OpenEXR on PIZ/B44; OpenEXR is ≈1.5× faster on ZIP/ZIPS, ≈1.8× on PXR24, and ≈3× on HTJ2K.
- libdeflate (opt-in): with the same backend, TinyEXR matches or beats OpenEXR on the deflate family — e.g. ZIP decode 1.4× single-thread.
- Multi-threaded: TinyEXR's opt-in parallel path scales ~5–9× to 16 threads; at 16T it out-decodes OpenEXR on RLE/ZIP/ZIPS/B44 (in-tree), and with libdeflate it leads the whole deflate family by a wide margin.
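The throughput and ratio figures above reduce to two small formulas. A minimal sketch in C, assuming that "megapixels" counts width × height pixels regardless of channel count (the source does not spell this out); the function names are illustrative and are not part of TinyEXR or the benchmark harness:

```c
#include <assert.h>

/* Throughput in megapixels per second for one encode or decode pass.
 * Assumption: a "pixel" is one (x, y) location, independent of channel count. */
double megapixels_per_second(unsigned width, unsigned height, double seconds)
{
    assert(seconds > 0.0);
    return ((double)width * (double)height) / 1e6 / seconds;
}

/* Ratio of two throughputs: values above 1.0 mean `ours` is faster. */
double speedup(double ours_mps, double theirs_mps)
{
    assert(theirs_mps > 0.0);
    return ours_mps / theirs_mps;
}
```

For example, `speedup(2699.0, 789.0)` evaluates to about 3.42, which is the ~3.4× quoted for uncompressed decode.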
On single-thread decode, none (uncompressed) is off the chart on purpose:
TinyEXR 2699 vs OpenEXR 789 MP/s (~3.4×), because TinyEXR has no thread-pool or
framebuffer-copy overhead on this path. TinyEXR also leads RLE (230 vs 93,
~2.5×). On the compressed codecs OpenEXR is ahead (ZIP ~1.2×, PXR24 ~1.8×,
ZIPS ~2.1×, PIZ ~2.7×, HTJ2K ~2–2.5×), a lead that comes mostly from its
libdeflate inflate (and tuned PIZ/JPH). A TinyEXR ZIP-decode profile is ~95 %
inflate; the predictor and de-interleave passes are already vectorized.
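For readers unfamiliar with what runs after inflate in a ZIP/ZIPS block, here is a scalar sketch of the byte-delta predictor and the byte interleave as the EXR ZIP scheme is commonly described, together with their encode-side inverses so the round trip can be checked. It is not TinyEXR's vectorized code, and the function names are hypothetical:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encode side: move even-indexed bytes to the first half of dst and
 * odd-indexed bytes to the second half. */
void zip_split_bytes(const uint8_t *src, size_t n, uint8_t *dst)
{
    assert(src != NULL && dst != NULL);
    uint8_t *lo = dst;
    uint8_t *hi = dst + (n + 1) / 2;
    size_t i = 0;
    while (i < n) {
        *lo++ = src[i++];
        if (i < n) *hi++ = src[i++];
    }
}

/* Encode side: replace each byte by (current - previous + 128) mod 256. */
void zip_apply_predictor(uint8_t *buf, size_t n)
{
    assert(buf != NULL);
    if (n == 0) return;
    uint8_t prev = buf[0];
    for (size_t i = 1; i < n; ++i) {
        uint8_t cur = buf[i];
        buf[i] = (uint8_t)(cur - prev + 128);
        prev = cur;
    }
}

/* Decode side: undo the predictor in place with a running sum. */
void zip_undo_predictor(uint8_t *buf, size_t n)
{
    assert(buf != NULL);
    for (size_t i = 1; i < n; ++i)
        buf[i] = (uint8_t)(buf[i - 1] + buf[i] - 128);
}

/* Decode side: merge the two halves back into the original byte order. */
void zip_interleave(const uint8_t *src, size_t n, uint8_t *dst)
{
    assert(src != NULL && dst != NULL);
    const uint8_t *lo = src;
    const uint8_t *hi = src + (n + 1) / 2;
    size_t i = 0;
    while (i < n) {
        dst[i++] = *lo++;
        if (i < n) dst[i++] = *hi++;
    }
}
```

The running sum in `zip_undo_predictor` carries a dependency from each byte to the next, which is why a vectorized version needs a prefix-sum formulation rather than a plain element-wise loop.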
On single-thread encode, TinyEXR beats OpenEXR on RLE (55.1 vs 46.3 MP/s) and
is within ~10 % on PIZ and B44. On ZIP/ZIPS it is ~1.5× behind and on PXR24
~1.8×: its in-tree, dependency-free LZ77 encoder is fast but not
libdeflate-level. HTJ2K encode is the widest gap (~3×; OpenEXR uses the separate
JPH/OpenJPH encoder). TinyEXR's HTJ2K paths recently gained an AVX2 forward 5/3
wavelet, an unstuffed-buffer entropy reader, and a clz-builtin fast path in the
per-sample prepare (encode +~39 %, decode +~18 % vs the pre-SIMD baseline); the
remaining gap is OpenJPH's fully-SIMD entropy coder.
Sizes are essentially identical for the lossless/standard codecs, and the formats are interoperable (a TinyEXR file decodes in OpenEXR and vice versa). Only ZIP and HTJ2K differ by more than ~1 %, and that comes from encoder tuning, not the format:

| codec | TinyEXR KB | OpenEXR KB |
|---|---|---|
| none | 2276 | 2276 |
| rle | 1644 | 1644 |
| zips | 1205 | 1212 |
| zip | 1155 | 1070 |
| piz | 742 | 742 |
| pxr24 | 1158 | 1163 |
| b44 | 993 | 993 |
| htj2k256 | 1132 | 1016 |
| htj2k32 | 1160 | 1042 |

OpenEXR's deflate speed comes from libdeflate. TinyEXR vendors libdeflate 1.25 (MIT) as an optional, off-by-default backend for ZIP/ZIPS/PXR24 (the in-tree codec stays the default and the only freestanding path):

    make bench-compare LIBDEFLATE=1   # -DEXR_USE_LIBDEFLATE, level 4 (= OpenEXR)

Same backend ⇒ byte-identical sizes, and TinyEXR meets or beats OpenEXR: ZIP decode 80.8 vs 58.8 MP/s (1.37×), ZIPS 61.4 vs 46.4 (1.32×), PXR24 at parity. Encode reaches parity too (ZIP 15.3 vs 14.0, PXR24 16.4 vs 15.8). Both call libdeflate's inflate; TinyEXR's edge is its SSE2/AVX2 predictor and lower per-block overhead.

In-tree default, single-threaded (all values in MP/s):

| codec | TinyEXR enc | OpenEXR enc | TinyEXR dec | OpenEXR dec |
|---|---|---|---|---|
| none | 102.3 | 85.3 | 2699 | 789 |
| rle | 55.1 | 46.3 | 230 | 92.6 |
| zips | 10.3 | 15.5 | 23.2 | 48.9 |
| zip | 9.6 | 14.7 | 50.4 | 61.9 |
| piz | 23.6 | 25.4 | 24.6 | 67.5 |
| pxr24 | 9.2 | 16.4 | 48.0 | 88.0 |
| b44 | 31.7 | 34.8 | 145 | 178 |
| htj2k256 | 11.0 | 33.2 | 24.0 | 59.6 |
| htj2k32 | 11.6 | 31.6 | 27.1 | 55.0 |

LIBDEFLATE=1, deflate family only (MP/s):

| codec | TinyEXR enc | OpenEXR enc | TinyEXR dec | OpenEXR dec |
|---|---|---|---|---|
| zip | 15.3 | 14.0 | 80.8 | 58.8 |
| zips | 16.2 | 14.8 | 61.4 | 46.4 |
| pxr24 | 16.4 | 15.8 | 83.6 | 83.8 |

TinyEXR supports per-block parallel encode and decode via portable C11
<threads.h> and a small ephemeral worker pool. It is opt-in (build
THREADS=1 / -DEXR_USE_THREADS; default and freestanding builds stay
single-threaded); the count is set at runtime:

    exr_set_num_threads(16);   /* 0/1 = serial (default) */

It covers scanline and single-level tiled parts on the in-memory load/save paths; deep, mipmap/ripmap, and the streaming APIs remain single-threaded. Encode stays byte-deterministic and decode bit-identical regardless of thread count (unit-tested).
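
The general shape of per-block parallelism with C11 threads can be sketched as below. Workers pull block indices from a shared atomic counter, and each block writes only its own output slot, which is one way to keep results independent of the thread count. This is an illustrative sketch and not TinyEXR's worker pool: `run_blocks`, `block_fn`, and `square_block` are hypothetical names, and only standard `<threads.h>` and `<stdatomic.h>` calls are used.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <threads.h>

typedef void (*block_fn)(unsigned block, void *user);

typedef struct {
    atomic_uint next;   /* next unclaimed block index */
    unsigned count;     /* total number of blocks */
    block_fn fn;        /* per-block work */
    void *user;         /* shared, caller-owned state */
} block_job;

/* Each worker claims blocks until none are left. */
static int worker(void *arg)
{
    block_job *job = arg;
    for (;;) {
        unsigned b = atomic_fetch_add(&job->next, 1u);
        if (b >= job->count) break;
        job->fn(b, job->user);
    }
    return 0;
}

/* Process `count` blocks on up to `num_threads` threads (0 or 1 = serial).
 * Threads are created per call and joined before returning.
 * Returns the number of threads that took part, including the caller. */
int run_blocks(unsigned count, unsigned num_threads, block_fn fn, void *user)
{
    enum { MAX_WORKERS = 64 };
    thrd_t helpers[MAX_WORKERS];
    unsigned started = 0;
    block_job job;

    assert(fn != NULL);
    atomic_init(&job.next, 0u);
    job.count = count;
    job.fn = fn;
    job.user = user;

    if (num_threads > MAX_WORKERS) num_threads = MAX_WORKERS;
    for (unsigned i = 1; i < num_threads; ++i) {
        if (thrd_create(&helpers[started], worker, &job) != thrd_success) break;
        ++started;
    }
    worker(&job);                       /* the calling thread also works */
    for (unsigned i = 0; i < started; ++i) thrd_join(helpers[i], NULL);
    return (int)started + 1;
}

/* Example block function: writes only to its own slot of the output array. */
void square_block(unsigned block, void *user)
{
    ((unsigned *)user)[block] = block * block;
}
```

Because no block reads or writes another block's slot, calling `run_blocks(28, 16, square_block, out)` and `run_blocks(28, 1, square_block, out)` fill `out` with the same values.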
TinyEXR decode scales ~5× (ZIP, 28 blocks) to ~8.8× (ZIPS, 440 blocks) at
16 threads. Scaling is bounded by the number of chunks: asakusa.exr is small,
so ZIP (16 lines/block ⇒ 28 blocks) saturates earlier than ZIPS (1 line/block ⇒
440 blocks). Larger images would scale further. (none decode does not benefit
— it is memory-bound with no per-block work, so thread overhead dominates.)
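
The block counts above, and the ceiling they put on scaling, follow from simple integer arithmetic. A minimal sketch, assuming every block costs the same and threading has no overhead (so it gives an upper bound, not a prediction); the function names are illustrative:

```c
#include <assert.h>

/* Number of compressed blocks for a scanline image: ceil(height / lines). */
unsigned block_count(unsigned height, unsigned lines_per_block)
{
    assert(lines_per_block > 0);
    return (height + lines_per_block - 1) / lines_per_block;
}

/* Ideal speedup when `blocks` equal-cost blocks are spread over `threads`
 * workers: the slowest worker handles ceil(blocks / threads) blocks. */
double ideal_speedup(unsigned blocks, unsigned threads)
{
    assert(blocks > 0 && threads > 0);
    unsigned rounds = (blocks + threads - 1) / threads;
    return (double)blocks / (double)rounds;
}
```

For the 440-line test image, ZIP gives 28 blocks and an ideal bound of 14× at 16 threads, while ZIPS gives 440 blocks and a bound of about 15.7×. The measured ~5× and ~8.8× sit below both bounds, as expected once per-block overhead and uneven block cost are included.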
Both libraries at EXR_THREADS=16, in-tree TinyEXR build. TinyEXR out-decodes
OpenEXR on RLE (456 vs 150), ZIP (251 vs 226), ZIPS (204 vs 174), B44 (362 vs
319); OpenEXR still leads PIZ (208 vs 110) and edges PXR24 (270 vs 242).
With LIBDEFLATE=1 at 16 threads TinyEXR leads the deflate family decisively:

| codec | TinyEXR dec | OpenEXR dec | TinyEXR enc | OpenEXR enc |
|---|---|---|---|---|
| zip | 339.6 | 226.6 | 102.3 | 88.0 |
| zips | 358.5 | 151.1 | 153.1 | 82.2 |
| pxr24 | 341.7 | 285.0 | 105.7 | 96.9 |

(MP/s. OpenEXR's own per-block scaling is included — this is a like-for-like 16-thread comparison.)
Reproduce:

    make bench-compare THREADS=1                 # single thread (default)
    EXR_THREADS=16 ./build/bench_compare         # 16 threads, both libs
    make bench-compare THREADS=1 LIBDEFLATE=1    # + libdeflate backend

Still future work: parallelize the deep and mipmap/ripmap paths; add ARM/NEON throughput numbers (the NEON path is correctness-verified under qemu but not benchmarked here); sweep larger images and channel counts.

Charts: doc/perf-decode.svg, perf-encode.svg, perf-libdeflate-decode.svg,
perf-mt-scaling.svg, perf-mt-compare.svg. Harness:
benchmark/bench_compare.cpp (make bench-compare [THREADS=1] [LIBDEFLATE=1]).