Performance vs OpenEXR

Syoyo Fujita edited this page Jun 8, 2026 · 3 revisions

TinyEXR vs OpenEXR — performance

How the pure-C11 TinyEXR v3 codec stacks up against the reference OpenEXR library, codec by codec, for encode and decode — single-threaded and multi-threaded.

Setup & method

  • CPU: AMD Ryzen 9 3950X (16C/32T, Zen2), avx2+f16c. Machine idle.
  • Image: asakusa.exr, 660×440, 4 HALF (16-bit float) channels. (Small; see the caveat in Multi-threading about chunk-count-limited scaling.)
  • I/O: fully in memory (StdOSStream/StdISStream on the OpenEXR side); each library loads the same source independently. OpenEXR 4.0-dev, gcc -O3.
  • Throughput is megapixels/second (higher is better); sizes are the compressed payload. Numbers vary a few % run-to-run.
  • Both libraries are pinned to the same thread count (Imf::setGlobalThreadCount(n) / exr_set_num_threads(n)).
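
To make the throughput unit concrete, here is a minimal sketch of the conversion between wall time and MP/s. It assumes a megapixel is 10^6 pixels and that pixels are counted as width × height regardless of channel count; the function names are hypothetical and are not part of the benchmark harness.

```c
#include <assert.h>
#include <stddef.h>

/* Megapixels per second for one full-image encode or decode.
 * Assumption: 1 MP = 1e6 pixels, counted as width * height. */
double mpix_per_sec(size_t width, size_t height, double seconds)
{
    assert(seconds > 0.0);
    return ((double)width * (double)height) / 1.0e6 / seconds;
}

/* Inverse: wall time in milliseconds implied by a throughput figure. */
double ms_per_image(size_t width, size_t height, double mpix_s)
{
    assert(mpix_s > 0.0);
    return ((double)width * (double)height) / 1.0e6 / mpix_s * 1.0e3;
}
```

Under these assumptions the 660×440 test image is 0.2904 MP, so a figure such as 2699 MP/s corresponds to roughly 0.11 ms per decode. At that scale fixed per-call overhead is a visible part of the measurement.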

TL;DR

  • Single-thread decode: TinyEXR wins big on the cheap codecs — 3.4× on uncompressed and 2.5× on RLE. On the DEFLATE family / PIZ / HTJ2K, OpenEXR leads (≈1.2× ZIP, ≈1.8× PXR24, ≈2–2.7× ZIPS/PIZ/HTJ2K) because of its libdeflate backend and tuned PIZ/JPH.
  • Single-thread encode: TinyEXR wins on RLE (≈1.2×) and is within ≈10 % of OpenEXR on PIZ/B44; OpenEXR is ≈1.5× ahead on ZIP/ZIPS, ≈1.8× on PXR24, and ≈2.7–3× on HTJ2K.
  • libdeflate (opt-in): with the same backend, TinyEXR matches or beats OpenEXR on the deflate family — e.g. ZIP decode 1.4× single-thread.
  • Multi-threaded: TinyEXR's opt-in parallel path scales decode from ~5× (ZIP) to ~9× (ZIPS) at 16 threads; at 16T it out-decodes OpenEXR on RLE/ZIP/ZIPS/B44 (in-tree), and with libdeflate it leads the whole deflate family by a wide margin.

Single thread

Decode

Decode throughput, single thread

none (uncompressed) is deliberately off the chart: TinyEXR reaches 2699 MP/s vs OpenEXR's 789 (~3.4×), because it has no thread-pool or framebuffer-copy overhead. TinyEXR also leads on RLE (230 vs 93, ~2.5×). On the compressed codecs OpenEXR is ahead (ZIP ~1.2×, PXR24 ~1.8×, ZIPS ~2.1×, PIZ ~2.7×, HTJ2K ~2–2.5×), a lead dominated by its libdeflate inflate and its tuned PIZ/JPH paths. A TinyEXR ZIP-decode profile is ~95 % inflate; the predictor and de-interleave passes are already vectorized.
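
For readers unfamiliar with the two non-inflate passes, the following scalar sketch shows the byte-level predictor and interleave that the OpenEXR ZIP family applies around deflate. It is an illustration of the format's steps, not TinyEXR's vectorized code; the function names are hypothetical, and the encoder-side functions are included only so the round trip can be checked.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encoder, step 1: even-indexed bytes go to the first half of dst,
 * odd-indexed bytes to the second half. */
void zip_interleave(const uint8_t *src, uint8_t *dst, size_t n)
{
    assert(src != dst);                 /* not an in-place transform */
    uint8_t *t1 = dst;
    uint8_t *t2 = dst + (n + 1) / 2;
    for (size_t i = 0; i < n; ++i) {
        if (i & 1) *t2++ = src[i];
        else       *t1++ = src[i];
    }
}

/* Encoder, step 2: replace each byte by its difference from the previous
 * original byte, biased by 128 (arithmetic is modulo 256). */
void zip_predict_encode(uint8_t *buf, size_t n)
{
    if (n == 0) return;
    uint8_t prev = buf[0];
    for (size_t i = 1; i < n; ++i) {
        uint8_t cur = buf[i];
        buf[i] = (uint8_t)(cur - prev + 128);
        prev = cur;
    }
}

/* Decoder, step 1 (after inflate): undo the predictor with a running sum. */
void zip_predict_decode(uint8_t *buf, size_t n)
{
    for (size_t i = 1; i < n; ++i)
        buf[i] = (uint8_t)(buf[i - 1] + buf[i] - 128);
}

/* Decoder, step 2: merge the two halves back into alternating bytes. */
void zip_deinterleave(const uint8_t *src, uint8_t *dst, size_t n)
{
    assert(src != dst);                 /* not an in-place transform */
    const uint8_t *t1 = src;
    const uint8_t *t2 = src + (n + 1) / 2;
    for (size_t i = 0; i < n; ++i)
        dst[i] = (i & 1) ? *t2++ : *t1++;
}
```

The predictor decode has a serial dependency (each byte needs the previous result), which is why a SIMD version typically uses a prefix-sum formulation rather than this loop.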

Encode

Encode throughput, single thread

TinyEXR leads OpenEXR on RLE (~1.2×) and is within ~10 % on PIZ and B44. It trails by ~1.5× on ZIP/ZIPS and by ~1.8× on PXR24: its in-tree, dependency-free LZ77 encoder is fast but not libdeflate-level. HTJ2K encode is the widest gap (~3×), where OpenEXR uses the separate JPH/OpenJPH encoder. TinyEXR's HTJ2K paths recently gained an AVX2 forward 5/3 wavelet, an unstuffed-buffer entropy reader, and a clz-builtin fast path in the per-sample prepare step (encode ~39 % faster and decode ~18 % faster than the pre-SIMD baseline); the remaining gap is OpenJPH's fully SIMD entropy coder.

Compression size

Sizes are essentially identical for the lossless/standard codecs, and the formats are interoperable (a TinyEXR file decodes in OpenEXR and vice versa). Only ZIP and HTJ2K differ materially (ZIPS and PXR24 differ by under 1 %), and the difference comes from encoder tuning, not the format:

codec      TinyEXR (KB)   OpenEXR (KB)
none       2276           2276
rle        1644           1644
zips       1205           1212
zip        1155           1070
piz         742            742
pxr24      1158           1163
b44         993            993
htj2k256   1132           1016
htj2k32    1160           1042

Optional: the libdeflate backend

OpenEXR's deflate speed comes from libdeflate. TinyEXR vendors libdeflate 1.25 (MIT) as an optional, off-by-default backend for ZIP/ZIPS/PXR24 (the in-tree codec stays the default and the only freestanding path):

make bench-compare LIBDEFLATE=1     # -DEXR_USE_LIBDEFLATE, level 4 (= OpenEXR)

Decode with libdeflate backend, single thread

Same backend ⇒ byte-identical sizes, and TinyEXR meets or beats OpenEXR: ZIP decode 80.8 vs 58.8 MP/s (1.37×), ZIPS 61.4 vs 46.4 (1.32×), PXR24 at parity. Encode reaches parity too (ZIP 15.3 vs 14.0, PXR24 16.4 vs 15.8). Both call libdeflate's inflate; TinyEXR's edge is its SSE2/AVX2 predictor and lower per-block overhead.

Full single-thread numbers

In-tree default:

codec      TinyEXR enc   OpenEXR enc   TinyEXR dec   OpenEXR dec   (MP/s)
none       102.3         85.3          2699          789
rle         55.1         46.3           230           92.6
zips        10.3         15.5            23.2         48.9
zip          9.6         14.7            50.4         61.9
piz         23.6         25.4            24.6         67.5
pxr24        9.2         16.4            48.0         88.0
b44         31.7         34.8           145          178
htj2k256    11.0         33.2            24.0         59.6
htj2k32     11.6         31.6            27.1         55.0

LIBDEFLATE=1 (deflate family):

codec      TinyEXR enc   OpenEXR enc   TinyEXR dec   OpenEXR dec   (MP/s)
zip         15.3         14.0            80.8         58.8
zips        16.2         14.8            61.4         46.4
pxr24       16.4         15.8            83.6         83.8

Multi-threading

TinyEXR supports per-block parallel encode and decode via portable C11 <threads.h> and a small ephemeral worker pool. It is opt-in (build THREADS=1 / -DEXR_USE_THREADS; default and freestanding builds stay single-threaded); the count is set at runtime:

exr_set_num_threads(16);   /* 0/1 = serial (default) */

It covers scanline and single-level tiled parts on the in-memory load/save paths; deep, mipmap/ripmap, and the streaming APIs remain single-threaded. Encode stays byte-deterministic and decode bit-identical regardless of thread count (unit-tested).
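
One way to see why output can be independent of thread count is that each block's result depends only on that block's input and is stored in a slot keyed by the block index. The sketch below simulates workers finishing in different orders without using real threads. It is an illustration under that assumption, not TinyEXR's worker pool: all names are hypothetical, the "codec" is a stand-in XOR, and blocks are fixed-size for simplicity (real compressed blocks vary in size).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum { NBLOCKS = 4, BLOCK_BYTES = 8 };

/* Stand-in for a block codec: the output depends only on the block's own
 * input bytes and index, never on shared mutable state. */
static void encode_block(const uint8_t *in, size_t block, uint8_t *out)
{
    for (size_t i = 0; i < BLOCK_BYTES; ++i)
        out[i] = (uint8_t)(in[block * BLOCK_BYTES + i] ^ (uint8_t)(0x5A + block));
}

/* Encode the blocks in the given visiting order (simulating whichever
 * order workers happen to run in), writing each result into the slot
 * owned by its block index. */
void encode_in_order(const uint8_t *in, const size_t order[NBLOCKS],
                     uint8_t out[NBLOCKS * BLOCK_BYTES])
{
    for (size_t k = 0; k < NBLOCKS; ++k) {
        size_t b = order[k];
        assert(b < NBLOCKS);
        encode_block(in, b, out + b * BLOCK_BYTES);
    }
}

/* Returns 1 when the two buffers hold identical bytes. */
int buffers_equal(const uint8_t *a, const uint8_t *b, size_t n)
{
    return memcmp(a, b, n) == 0;
}
```

Calling encode_in_order with the orders {0, 1, 2, 3} and {2, 0, 3, 1} on the same input fills both output buffers with identical bytes, because no block reads another block's result.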

Scaling

TinyEXR decode scaling by thread count

TinyEXR decode scales ~5× (ZIP, 28 blocks) to ~8.8× (ZIPS, 440 blocks) at 16 threads. Scaling is bounded by the number of chunks: asakusa.exr is small, so ZIP (16 lines/block ⇒ 28 blocks) saturates earlier than ZIPS (1 line/block ⇒ 440 blocks). Larger images would scale further. (none decode does not benefit — it is memory-bound with no per-block work, so thread overhead dominates.)
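
The block counts above follow from a ceiling division, and a simple load-balance model gives an upper bound on the speedup they allow. The sketch below is that model only: it assumes equal-cost blocks and no threading overhead, and the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Number of scanline blocks: ceil(height / lines_per_block). */
size_t block_count(size_t height, size_t lines_per_block)
{
    assert(lines_per_block > 0);
    return (height + lines_per_block - 1) / lines_per_block;
}

/* Idealized speedup bound, assuming equal-cost blocks and zero overhead:
 * the busiest thread processes ceil(blocks / threads) blocks. */
double ideal_speedup(size_t blocks, size_t threads)
{
    assert(blocks > 0 && threads > 0);
    size_t rounds = (blocks + threads - 1) / threads;
    return (double)blocks / (double)rounds;
}
```

For the 440-line test image this gives 28 blocks for ZIP and 440 for ZIPS. At 16 threads the model's bound is 14× for ZIP (two rounds, the second only partly filled) and about 15.7× for ZIPS. The measured ~5× and ~8.8× are lower because the model ignores thread start-up, uneven block costs, and any serial work, but the ordering between the two codecs matches.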

TinyEXR vs OpenEXR at 16 threads

Decode throughput at 16 threads

Both libraries run at EXR_THREADS=16 with the in-tree TinyEXR build (figures in MP/s). TinyEXR out-decodes OpenEXR on RLE (456 vs 150), ZIP (251 vs 226), ZIPS (204 vs 174), and B44 (362 vs 319); OpenEXR still leads PIZ (208 vs 110) and edges PXR24 (270 vs 242).

With LIBDEFLATE=1 at 16 threads TinyEXR leads the deflate family decisively:

codec      TinyEXR dec   OpenEXR dec   TinyEXR enc   OpenEXR enc
zip        339.6         226.6         102.3         88.0
zips       358.5         151.1         153.1         82.2
pxr24      341.7         285.0         105.7         96.9

(MP/s. OpenEXR's own per-block scaling is included — this is a like-for-like 16-thread comparison.)

Reproduce:

make bench-compare THREADS=1                         # single thread (default)
EXR_THREADS=16 ./build/bench_compare                 # 16 threads, both libs
make bench-compare THREADS=1 LIBDEFLATE=1            # + libdeflate backend

Still future work: parallelize the deep and mipmap/ripmap paths; add ARM/NEON throughput numbers (the NEON path is correctness-verified under qemu but not benchmarked here); sweep larger images and channel counts.


Charts: doc/perf-decode.svg, perf-encode.svg, perf-libdeflate-decode.svg, perf-mt-scaling.svg, perf-mt-compare.svg. Harness: benchmark/bench_compare.cpp (make bench-compare [THREADS=1] [LIBDEFLATE=1]).