jsColorEngine docs: ← Project README · Bench · Performance · Roadmap · Examples · API: Profile · Transform · Loader
Deep Dive: ← Index · Architecture · LUT modes · JIT inspection · WASM kernels
Status: proof of concept — paused after v1.3 framing review (Apr 2026), no harm in current shape. The code lives at
src/Transform.js(Transform.compile()) andbench/compile_poc/. It currently handles the sRGB → CMYK chain end-to-end (6 of the 6 stages have JS emitters, including the two diagnostic markers) and is the measurement vehicle for the larger v1.5 effort planned in Roadmap § v1.5 — N-channel float inputs + compiled non-LUT pipeline +toModule(). This page is the post-mortem from the POC: what the compile target looks like, why we built it, what it taught us, and why it's on hold while v1.3 (16-bit I/O + lcms compat) shipped and v1.4 (ICCImagehelper + browser samples) lands as the next showcase release.Why on hold, not abandoned. The POC validated everything we needed it to: the speedup (1.7× bit-exact, up to 5.4× with the default LUT-gamma + hot-loop), the correctness path (bit-exact to f64 on demand), the measurement methodology (NOP-differential + profiler + instrumentation triangulated to the same bottleneck), and the architecture (per-stage emitters, opt-in diagnostic modes). Mapped onto browser-bench numbers,
compile({ useGammaLUT: false })lands the bit-exact tier at ~12 MPx/s (≈2× lcms-wasm NOOPT, ≈2× our own no-LUT runtime, bit-exact toMath.pow), and the default (LUT-gamma at lcms-equivalent ~32-bit precision, plushotLoop) lands at ~36 MPx/s — within ~1.5× of LUT-mode lcms. v1.3 (the 16-bit I/O work and the.it8-based lcms compat harness) is a stricter dependency for the project's credibility than another speed multiplier — and that work needs the engine as-is as its baseline. v1.3 has now shipped; compile work resumes as v1.5, scheduled after the v1.4ICCImagehelper + browser samples showcase release (see Should we ship this as default? below for the full assessment).
transform.compile()produces a single straight-line JavaScript function for one specific profile chain. No per-pixel option checks, no method dispatch, noFunction.prototype.call, no intermediate result arrays — just decoded inputs in registers, arithmetic, and a single typed-array allocation for the return.- Measured on sRGB → GRACoL CMYK, no-LUT pipeline:
- default
compile()(withuseGammaLUT: true, the new default — matches lcms's standard fast path, ~32-bit gamma precision): ~11 MPx/s vs ~2.3 MPx/s fort.forward()— ~5× faster. compile({ useGammaLUT: false })(bit-exact f64 toMath.pow): ~4 MPx/s vs ~2.3 MPx/s — ~1.7× faster, bit-exact.compile({ hotLoop })(default LUT + amortised loop): ~12 MPx/s, ~5.4× over runtime.
- default
- We confirmed why it's faster three different ways
(NOP-differential bench, in-line
hrtime.bigintinstrumentation, V8 CPU profiler). All three identified the same hot stage: the sRGB inverse-gammaMath.pow(x, 2.4)lives in thestage_Gamma_Inverseblock and accounts for ~65–70 % of the compiled body's wall time. - The compile-time architecture — emitters per stage, factory-scope
state, named per-stage functions opt-in — lays the groundwork for
three more deliverables we haven't shipped yet:
getSource()for build-time precompile (CSP-safe),toModule({...})for zero-runtime-dep standalone transforms, and the WASM emission path covered in the v1.5 roadmap.
new Transform({...}).create(src, dst, intent) builds a pipeline
array of stage objects. Each stage is roughly:
{
funct: someStageFunction, // runtime path, called per-pixel
stageData: {...}, // weights, tables, constants
stageName: 'stage_Gamma_Inverse',
inputEncoding: 0, // device / PCSv2 / PCSv4 / PCSXYZ
outputEncoding: 0
}The runtime path is a tight for (i = 0; i < pipeline.length; i++)
loop in Transform.forward() that calls
pipeline[i].funct.call(this, result, pipeline[i].stageData, pipeline[i])
on every pixel. That's flexible — you can rewire stages at runtime,
inspect result per stage, swap profiles — and it's fast enough for
typical use, especially once lutMode kicks in and most of the
pipeline collapses into a single LUT lookup.
What it's not is "what V8 wants to see if you want it to make truly fast machine code." Per pixel the runtime walker pays:
- a property load
pipeline[i].funct - a method dispatch via
Function.prototype.call - shape checks (
this,result,stageData) every time - an intermediate array result that gets spread or aliased between stages
- per-stage option re-reads (
this.clipRGBinPipeline,this.absoluteAdaptation, etc.) that are constant over the whole bulk transform
compile() collapses all of that. After create() has decided the
exact stage list, options, and constants for this specific
transform, we walk the pipeline once at compile time and emit a
single function whose body is the concatenation of each stage's JS
source, with all options resolved to literals.
The shape:
"use strict";
var r = input[0], g = input[1], b = input[2];
var X = 0, Y = 0, Z = 0;
var pcsL = 0, pcsa = 0, pcsb = 0;
var d0 = 0, d1 = 0, d2 = 0, d3 = 0, d4 = 0, d5 = 0, d6 = 0, d7 = 0;
// ----- stage 0 : stage_Gamma_Inverse -----
r = r <= 0.04045 ? r / 12.92 : Math.pow((r + 0.055) / 1.055, 2.4);
g = g <= 0.04045 ? g / 12.92 : Math.pow((g + 0.055) / 1.055, 2.4);
b = b <= 0.04045 ? b / 12.92 : Math.pow((b + 0.055) / 1.055, 2.4);
// ----- stage 1 : stage_matrix_rgb -----
X = r * 0.4360747 + g * 0.3850649 + b * 0.1430804;
Y = r * 0.2225045 + g * 0.7168786 + b * 0.0606169;
Z = r * 0.0139322 + g * 0.0971045 + b * 0.7141733;
// ----- stage 2 : stage_PCSXYZ_to_PCSv2 ----- (Lab encode, X/Y/Z normalised, cbrt'd, packed)
// ----- stage 3 : stage_curve_v2 ----- (3 × 1D-LUT lookups against PCSv2)
// ----- stage 4 : trilinearInterp3D ----- (the 33³ destination CLUT)
// ----- stage 5 : stage_curve_v2 ----- (4 × 1D-LUT lookups against device CMYK)
return [d0, d1, d2, d3];Important points about the code shape:
- No closures, no
this, no method dispatch in the body. The outernew Function('store', 'input', src)is.bind(null, store)-ed so the callable surface isfn(input). Inside the body, all stage data lives onstore.sN_*properties — typed arrays already preallocated byattachStore_js_*helpers at compile time. - Constants are baked.
0.4360747is the literal sRGB→XYZ matrix cell.0.04045is the sRGB cutoff. The ICC profile's TRC threshold isn't read fromstage.stageDataper pixel — it's emitted as a literal once at compile time. - The named slots (
r, g, b, X, Y, Z, pcsL, pcsa, pcsb, d0..d7) are the variable basis. Every stage assumes its inputs are already in the right slots and writes its outputs into the next set. There's no return-value passing, no destructuring, no array indexing. - The CLUT lives once on
store, referenced bystore.s4_table,store.s4_strideY,store.s4_strideZetc. The trilinear stage emits direct reads_t[corner_offset]against that table — V8 sees a Float64Array load with a constant offset, which is the fastest pattern. - All of this is stored at module-load time, not per pixel. The
expensive part of
compile()(parsing, optimising, machine-code emission via TurboFan) happens once. After that, every call is ~250 ns of arithmetic.
Bench scripts: bench/compile_poc/bench_compiled.js (the headline
number), bench/compile_poc/bench_body_variants.js (NOP-differential
analysis of the compiled body), bench/compile_poc/bench_profilable.js
bench/compile_poc/profile_run.js(V8 profiler attribution).
====== bench (sRGB → GRACoL CMYK, 500 000 px, no-LUT path) ======
best runtime : 216.6 ms (2.31 MPx/s) — t.forward() per-pixel walker
best compiled : 123.8 ms (4.04 MPx/s) — fn(input) emitted body
best speedup : 1.75x
Δmax between the two over 200 random pixels is 0.0 — bit-exact
to the f64 reference. The compiled function isn't trading accuracy
for speed; the math is identical, the savings are entirely from
removing dispatch and per-pixel option lookups.
The compiled function takes ~250 ns per pixel. We measured the breakdown three ways:
bench_body_variants.js rewrites the emitted source with parts
stubbed out, recompiles, benches, and attributes the difference
back to what was removed:
| Variant | MPx/s | Δ vs baseline |
|---|---|---|
| baseline (full body) | 4.04 | — |
Math.pow → 1 (kill all Math.pow) |
11.0 | -64 % body time vanishes |
| trilinear → constant 0.5 | 4.69 | -14 % body time |
| matrix block stripped | 4.10 | -1 % body time |
So Math.pow accounts for ~64 % of the body and the trilinear corner
loads about 14 %. Matrix and curves are noise (each <2 %).
The profiler can only see one function — and compile() produces
exactly one — so the default mode shows everything attributed to
compiled_fn. To get per-stage attribution, compile({ profilable: true })
lifts each stage into its own named function expression at factory
scope (closed over the shared state vars), then has the per-pixel
function call them in order:
function compiledTransform(input) {
r = input[0]; g = input[1]; b = input[2];
_s0_stage_Gamma_Inverse();
_s1_stage_matrix_rgb();
_s2_stage_PCSXYZ_to_PCSv2();
_s3_stage_curve_v2();
_s4_trilinearInterp3D();
_s5_stage_curve_v2();
return [d0, d1, d2, d3];
}Cost of named-function mode: ~45 % throughput hit (4.04 → 2.85 MPx/s) from per-stage call overhead. Acceptable for a profiler-only mode; bit-exact to the monolithic version (Δmax over 200 pixels = 0).
Run node bench/compile_poc/profile_run.js --pixels 5000000. It
spawns the inner bench under node --prof, processes the resulting
tick log, and prints a clean per-stage table:
===== per-stage V8 profiler attribution =====
total samples in run: 1926 (1 sample ≈ 1ms by default)
idx stage self self% +builtin incl incl%
0 stage_Gamma_Inverse 45 2.34% 500 545 28.30% ← THE hot stage
3 stage_curve_v2 82 4.26% 0 82 4.26%
4 trilinearInterp3D 144 7.48% 0 144 7.48%
─────
total compiled-stage ticks: 771
self ticks are samples taken with the stage as the topmost JS
frame. +builtin is the column we wrote a custom --prof-process
parser for: it walks the bottom-up profile, finds samples taken in
C++ (V8 builtins like Math.pow / Math.cbrt / GC), and charges
them to the JS stage that was on the stack at the time.
Of the 771 ticks attributable to compiled-stage code,
stage_Gamma_Inverse (where the three Math.pow(x, 2.4) calls
live) accounts for 545 ticks = 70.7 %. The matrix and curve
stages are too cheap to register samples at all.
compile({ instrument: true }) wraps each stage body with
process.hrtime.bigint() taps that accumulate per-stage ns into
store._instTime[]. The numbers look like a per-stage breakdown
but are not trustworthy in absolute terms:
- Every
hrtime.bigint()call is 50–100 ns of overhead — the same order as the work being measured. - The taps prevent V8 from optimising across stage boundaries (no cross-stage CSE, no dead-store elimination, no fused load-store), so the cost distribution shifts under measurement.
Transform.instrumentReport(compiled) is still useful for relative
comparisons (which stage shifted between two runs?) but the absolute
numbers will mislead anyone who treats them as the truth. The
profiler is the truth-teller; instrumentation is a sanity check.
| Method | gamma_inverse share of compiled body |
|---|---|
| NOP-differential | ~64 % (Math.pow(x, 2.4)) |
| In-line instrumentation | unreliable (timer perturbation) |
| V8 CPU profiler | ~71 % (545 / 771 stage ticks) |
Two independent measurements (differential + profiler) triangulate
to the same hot stage. The two compile options below were designed
against this finding — useGammaLUT reclaims the gamma ticks
(default-on for lcms parity); hotLoop removes the per-pixel
allocation/call overhead that sits above the body cost.
This deserves its own callout because the framing matters. After
implementing useGammaLUT from first principles (we saw Math.pow
in the profiler, replaced it with an LUT, measured the win), we
went and read the Little CMS source to check whether they do the
same thing. They do. At the same scale, with the same accuracy
classification, as the default fast path. We didn't invent this;
we independently rediscovered the standard CMS curve-optimization.
The receipts (lcms2-2.18, included for offline reference at
bench/lcms_c/lcms2-2.18/):
src/cmsopt.c line 418 — the canonical magic number:
#define PRELINEARIZATION_POINTS 4096src/cmsopt.c line 1346 — the per-pixel evaluator after the
optimizer has tabulated the parametric curve:
void FastEvaluateCurves8(... In[], ... Out[], ... D) {
Curves16Data* Data = (Curves16Data*) D;
for (i=0; i < Data->nCurves; i++) {
x = (In[i] >> 8); // 8-bit input → 256-entry table
Out[i] = Data->Curves[i][x]; // single load, no pow()
}
}pow() is called once per table cell at build time, never per
pixel. Two table sizes ship: 256 entries for u8 input, 65536 for
u16. The non-LUT pow() path only fires when the curve genuinely
can't be tabulated (rare).
plugins/fast_float/src/fast_float_curves.c line 27 — the
float-pipeline plugin (lcms's explicit speed plugin) uses the same
shape with Float32 storage:
typedef struct {
cmsFloat32Number CurveR[MAX_NODES_IN_CURVE]; // 4097 entries
cmsFloat32Number CurveG[MAX_NODES_IN_CURVE];
cmsFloat32Number CurveB[MAX_NODES_IN_CURVE];
...
} CurvesFloatData;…and crucially, line 393 — their own accuracy classification:
// Create linearization tables with a reasonable number of entries.
// Precision is about 32 bits.That comment is the citation that lets us stop calling our LUT "lossy by design" and start calling it what it is: a 32-bit-precision implementation of the gamma stage, the same way lcms's fast_float plugin describes its own. "About 32 bits" is the industry-consensus ceiling for this class of optimization, and it's well above what 8-bit or 16-bit perceptual workflows can resolve.
What this changes for our defaults. We flipped useGammaLUT to
default-true (lcms parity). Anyone who needs bit-exact Math.pow
(measurement-grade work, oracle generation, bit-for-bit cross-checks)
explicitly opts out:
t.compile({ useGammaLUT: false }); // bit-exact f64, no LUTBeing more conservative than the entire CMS industry doesn't help anyone — it just leaves a 2.8× speedup on the floor for users who mostly don't care about the 5th decimal place of a Lab triple. The conservative-by-default opt-out for measurement work is one short property setting; the speed-by-default for everyone else is free.
Two compile flags act on the bottleneck. useGammaLUT is now
default-on (per the prior-art reading above); hotLoop is opt-in
because it changes the function signature.
const compiled = t.compile({
useGammaLUT: true, // DEFAULT — 4096-entry LUT replaces Math.pow in stage_Gamma_Inverse
// (lcms-equivalent ~32-bit precision; opt out with `false` for bit-exact)
hotLoop: true, // wrap body in for(_i…); fn(input, output, n) instead of fn(pixel) → array
});useGammaLUT's LUT is built once in attachStore_js_stage_Gamma_Inverse
and parked on store.s{idx}_gammaLut; the emitter then writes
r = _gl[(_r * 4095) | 0] instead of Math.pow(...). hotLoop is
purely structural — same numeric output as the single-pixel form,
just amortises the call/alloc overhead across all pixels in one outer
loop.
Measured on the same sRGB → GRACoL chain
(bench/compile_poc/bench_gammalut_hotloop.js,
500 000 random pixels, best of 5):
| Mode | ms | MPx/s | vs runtime | vs bit-exact compile |
|---|---|---|---|---|
runtime forward() |
215.7 | 2.32 | 1.00× | — |
compile({ useGammaLUT: false }) |
126.7 | 3.95 | 1.70× | 1.00× |
compile() (default-LUT) |
45.0 | 11.12 | 4.80× | 2.82× |
compile({ hotLoop, useGammaLUT: false }) |
117.9 | 4.24 | 1.83× | 1.07× |
compile({ hotLoop }) (default-LUT + hot) |
40.1 | 12.46 | 5.37× | 3.16× |
Three things to read off this table:
- The default is now the fast path.
compile()with no options delivers ~5× over runtime, matching lcms's "build a curve LUT once then load per pixel" optimisation. This is the path most callers will hit without thinking about it. - Bit-exact still costs only what it should. Opting out of the
LUT (back to
Math.pow) costs the difference between 11.12 and 3.95 MPx/s — that's the gamma stage's true cost surfacing. For measurement work where that cost is acceptable, the opt-out is one property. hotLoopadds ~1.09× on top of the default. Without the LUT the body cost so dominates that removing the per-pixel array allocation barely registers (~1.07×); with the LUT the allocation/call overhead becomes a meaningful fraction again and the additional ~9 % materialises.
The combined compile({ hotLoop }) mode is what an image-pixel
pipeline (fn(input, output, pixelCount) over a typed-array buffer)
should use. The single-pixel form remains the right call for
one-off colour lookups, test code, named-colour resolution, etc.
For measurement-grade work, prepend useGammaLUT: false.
A third change in this round: compile() now defaults to
strict: true, throwing if any pipeline stage lacks an
emit_<target>_<stageName> function. The previous behaviour —
silent fallback to a runtime call — produced wrong output for
non-trivial encoding boundaries (the CMYK → CMYK case
documented in the next section) and gave us a misleading
"compiles fine" success that masked the real coverage gap. Pass
{ strict: false } to opt back in to the runtime fallback for
chains you've audited yourself.
CMYK → CMYK is the natural "is the speedup gamma-specific?" test —
no sRGB inverse-gamma at the front, just curves and CLUTs. We
tried it (bench/compile_poc/bench_compiled_cmyk2cmyk.js,
GRACoL → SWOP) and the bench refuses to run. Two known gaps
in the POC surface immediately:
- The compile-time preamble is hardcoded for 3-channel RGB
input. Every emitted function starts with
var r = input[0], g = input[1], b = input[2];. CMYK input needs a 4-channel start (var d0 = input[0], d1 = input[1], d2 = input[2], d3 = input[3];) and the right output-side variable basis. tetrahedralInterp4Dandstage_PCSv2_to_PCSv4have no JS emitter yet. They fall through to the runtime stub (_compile_emit_runtime_fallback), which works for individual stages but produces wrong output when the chain depends on them — the fallback can't bridge the variable-basis change at the runtime boundary.
The CMYK→CMYK chain looks like:
stage_curve_v2 > tetrahedralInterp4D > stage_curve_v2
> stage_PCSv2_to_PCSv4
> stage_curve_v2 > trilinearInterp3D > stage_curve_v2
Five of the seven stages already have emitters (curves and 3D trilinear are reused from the RGB→CMYK path). To finish the LUT-only test we need:
emit_js_stage_tetrahedralInterp4D(4D version of the existing trilinear emitter; lifts the K-plane setup into the body and reuses the C/M/Y interp pattern from the WASM kernel).emit_js_stage_PCSv2_to_PCSv4(a small Lab encoding/scale conversion — handful of multiply-adds).- A multi-channel input preamble in
compile()driven by the first stage'sinputEncodingand channel count.
Until those land we don't have a real CMYK→CMYK number. The bench
script aborts with a useful explanation rather than reporting a
bogus "1.98×" speedup — see the abort path in
bench_compiled_cmyk2cmyk.js.
Why this matters: RGB → CMYK is currently 1.75× because the
runtime walker has 6 stages of dispatch and the compiled function
removes all of them; the fact that one stage (gamma) dominates means
compiling helps a lot but the absolute ceiling is bounded by
Math.pow. CMYK → CMYK should look different — no Math.pow in
the body, much higher relative cost in the trilinear/tetrahedral
stages, and likely a larger speedup in the 2.0–2.5× band because
dispatch overhead is a higher share of the runtime baseline. We
expect to confirm that in the next pass.
The runtime walker is a general-purpose execution model. It can
handle any pipeline shape, any combination of stages, any options
flipped at any time. The price of that generality is that V8 sees
nothing more than "a for loop over an array of unknown function
pointers" — no inlining, no constant folding, no specialisation.
Every per-pixel call goes through:
forward(in)
└─ for (i = 0; i < pipeline.length; i++)
└─ pipeline[i].funct.call(this, result, stageData, stage)
└─ a generic stage function with its own option checks
compile() flips the trade-off. We give up runtime flexibility (you
can't swap a stage out of a compiled function — you'd have to
recompile) and in return V8 sees:
fn(input)
└─ a single function body with every constant baked, every option
resolved, every stage inlined, every typed-array access at a
fixed stride. TurboFan inlines everything, hoists invariants,
register-allocates the named slots, and emits a tight x64
machine-code body.
That's the whole point: trade per-call flexibility for per-call
speed, by specialising at create time once we know exactly what
this transform does. The runtime walker stays exactly as it is and
remains the default — compile() is opt-in and additive.
The roadmap covers the v1.5 plan in detail (Roadmap § Inspection / distribution story). The four use cases that fall out of having a single emitted function per transform are:
const t = new Transform({...}).create(src, dst, intent);
const compiled = t.compile({ target: 'js' });
const cmyk = compiled.fn([r, g, b]);Same Transform API, monolithic compiled body, ~1.75× faster on no-LUT chains. This is what you'd use in any long-running converter where parse + warm-up cost (~10–100 ms) is amortised across millions of pixels.
new Function() is blocked by strict Content Security Policy
(some browser extensions, many enterprise sites). The compile output
is just a JS string — expose it directly:
const src = compiled.source; // already returned by compile()
// write to a file at build time, ship as a static moduleThe user precompiles at build time via their bundler, ships the emitted JS as a normal source file, and pays zero runtime CSP cost. The engine becomes a build-time tool for that transform; the runtime dependency disappears. This is the quietly important one — most of the people who care about CSP also care about bundle size, and a 50–80 KB single-purpose transform module beats a 192 KB general-purpose engine for "convert these specific colours, never anything else."
The emitted code is human-readable JavaScript with structural comments per stage:
// ----- stage 0 : stage_Gamma_Inverse -----
// sRGB inverse-gamma (IEC 61966-2-1): linear segment under 0.04045,
// gamma 2.4 above. Per-channel, in-place.
r = r <= 0.04045 ? r / 12.92 : Math.pow((r + 0.055) / 1.055, 2.4);
g = g <= 0.04045 ? g / 12.92 : Math.pow((g + 0.055) / 1.055, 2.4);
b = b <= 0.04045 ? b / 12.92 : Math.pow((b + 0.055) / 1.055, 2.4);new Function() strips the comments at parse time (zero runtime
cost), so we emit them generously. A colour bug becomes "read 60
lines of self-documenting JavaScript" instead of "step through the
stage walker for an hour." Matrix rows are annotated with their
source/destination whitepoint, curve lines with their ICC tag
origin, CLUT references with their grid size — the artefact is a
teaching tool and the running code.
The big one for distribution. compile() already produces the
function body and has a reference to the store (curves, CLUTs,
constants). toModule({ name: 'srgb_to_gracol' }) would emit the
store data as inline typed-array constructors at the top of the
file plus the function body, producing a standalone JS module:
// srgb_to_gracol.js — generated by jsColorEngine
'use strict';
// 33³ destination CLUT, baked at module load (~17 KB Uint16)
const _s4_table = new Uint16Array([0x1234, 0x5678, ...]);
const _s3_curve_table = new Float64Array([0.0, 0.012, ...]);
// ... other store data ...
module.exports = function srgb_to_gracol(input) {
let r = input[0], g = input[1], b = input[2];
// ... emitted body ...
return [d0, d1, d2, d3];
};A typical RGB→CMYK module lands around ~50–80 KB unminified (~25–40 KB gzipped, dominated by the 17⁴×4 u16 CLUT). That's smaller than the 192 KB engine for one specific transform — no ICC parser, no Loader, no dispatcher, no bytes spent on transform shapes you don't use. Bit-exact to the engine output.
This unlocks deployment shapes the engine can't:
- CDN-hosted point conversions — drop a
<script src="srgb_to_gracol.js">on a marketing page that needs one specific colour conversion; no engine, no profiles, no parser. - Edge runtimes that don't allow
new Function()(Cloudflare Workers, some Deno deploy modes) — the module is just JavaScript. - Pre-baked LUT distribution for known profile pairs — print
shops with three production presses publish three modules; their
customers
importthe one they need.
Engineering-wise the work to ship this is small (the function body
is already in compiled.source; the store serialisation is a
typed-array → new Uint16Array([...]) template). The interesting
design decisions are about format: CommonJS vs ESM, embedded
binary vs base64, single file vs split bundle. v1.5 territory.
If you're picking this up, the things that surprised us:
- Intermediate
let _r = rsnapshot variables don't cost anything. TurboFan's SSA form treats them as register copies and elides them at machine-code time. We tried both styles (bench_body_variants.jsstrip-matrix-snapshot variant) — no measurable difference. Write the emitter in whichever style is clearer. - Don't try to fuse expressions across stages. We tested a "trilinear-fully-fused" variant that inlined every CLUT load into a single mega-expression. It was slower than the unfused version because V8 lost CSE opportunities — the same CLUT cell was loaded multiple times. Trust the JIT to do CSE on named temps; don't pre-fuse.
Math.pow(x, 1.0/3.0) → Math.cbrt(x)is a free win.Math.cbrtis a TurboFan intrinsic that lowers to a small polynomial;Math.powfalls through to the libmpowshim. We swapped this in the Lab-encode stage (emit_js_stage_PCSXYZ_to_PCSv2) for ~35 % time reduction in that stage's body.Math.pow(x, 2.4)has no equivalent intrinsic. sRGB inverse gamma is the obvious next target. Options:- 256-entry LUT keyed on
(x * 255) | 0— exact for u8 inputs, needs interpolation for float, but our gamma-decode chain almost always runs on u8 anyway. - Polynomial approximation (
fma(x, fma(x, ...), ...)) — keeps full float accuracy at the cost of ~12 multiply-adds vs the LUT's 1 load. - Neither has been benched yet; expected to land alongside the v1.5 emitter expansion.
- 256-entry LUT keyed on
- The diagnostic stages (
stage_history,stage_debug) are already routed through a single emitter (emit_js_stage_debug/emit_js_stage_history) that produces a clean source comment and zero runtime code. They're safe to leave in the pipeline at compile time — they'll show up in the emitted source as// ----- stage N : Start -----with no per-pixel cost. - The
profilablemode is the right thing to reach for whenMath.pow-style hot spots aren't obvious from inspection. Don't bother withinstrument: truefor absolute numbers — it perturbs the JIT too much. Use it only to compare two configurations under the same instrumentation.
Captured during the v1.3 / v1.5 framing review (Apr 2026). Re-read
this when picking compile work back up as v1.5 (after the v1.4
ICCImage helper + samples showcase release).
Applying the compile_poc speedup ratios (1.78× / 4.92× / 5.36×)
to the no-LUT runtime baseline measured by the
browser bench (6.7 MPx/s for jsce no-LUT (f64)
on RGB→GRACoL CMYK), the tier picture looks like this:
jsce int-wasm-simd 171 MPx/s speed-tier (image bulk work) — SIMD CLUT
jsce int-wasm-scalar 80 MPx/s speed-tier
jsce int 57 MPx/s speed-tier
jsce float (33³ LUT) 54 MPx/s speed-tier
lcms-wasm default 52 MPx/s speed-tier
─────────────────────────────────────────────────────────────────────────────────
jsce compile() default+hot ~36 MPx/s [NEW: LUT-gamma (lcms-equivalent ~32-bit), ~5× over runtime]
jsce compile() default ~33 MPx/s [NEW: LUT-gamma (lcms-equivalent ~32-bit), ~5× over runtime]
jsce compile({ useGammaLUT:false }) ~12 MPx/s [NEW: bit-exact f64, ~1.8× over runtime — opt-in]
jsce no-LUT runtime 6.7 MPx/s accuracy-tier (today, baseline)
lcms-wasm NOOPT 6.1 MPx/s accuracy-tier (lcms equivalent)
Caveat — extrapolation, not direct measurement. The compile_poc bench harness has a tighter per-pixel call envelope than the browser bench, so the ratios are reproducible (they fall out of the per-stage cost analysis) but the absolute MPx/s numbers for the compile rows above are projections, accurate within ~±25 %. Direct browser-bench rows for compile() will land alongside v1.5 when the API is stable.
Three things to read off the corrected stack:
- Bit-exact compile is already a real win on the accuracy tier.
compile({ useGammaLUT: false })lands at ~12 MPx/s, bit-exact toMath.pow, ~2× over both lcms-wasm NOOPT and our own no-LUT runtime walker. No accuracy trade-offs. This is the path measurement-grade callers should pick; the speedup is structural (codegen vs interpreted dispatch). - Default compile (LUT-gamma) punches into the LUT tier.
~33 MPx/s vs lcms-wasm default's 52 MPx/s and
jsce float's 54 MPx/s — within ~1.6× of the LUT speed tier on the accuracy path, with the same lcms-equivalent ~32-bit gamma precision (see Prior art section). This was the surprise: the curve-tabulated codegen is no longer in a different league from LUT-mode lcms; it's in the same conversation, using the same optimisation lcms uses. compile({ hotLoop })punches in further. ~36 MPx/s — within ~1.5× of LUT-mode lcms. Same numerical output as the default single-pixel form, just amortises the call/alloc overhead.- The SIMD-CLUT speed tier (171 MPx/s) stays untouched. Image
bulk work that already lives on
int-wasm-simdhas nothing to gain from compile. Compile is for callers who picked the no-LUT accuracy path specifically — colour measurement, ΔE reporting, single-colour lookups, named-colour resolution, gamut-boundary checks.
The tier rewrite: compile is not a LUT-killer, but it is a
no-LUT-tier promotion strong enough that "compile() ON" should
become the default for the no-LUT path the moment v1.5 ships.
People who deliberately picked the no-LUT pipeline get a free
~5× boot at lcms-equivalent precision (free in the sense lcms
itself defaults to this trade-off); measurement work flips one
flag for full bit-exact Math.pow.
- Big win on the slowest tier. ~5× over runtime walker by
default (lcms-equivalent precision), or 1.7× with bit-exact
Math.powopt-in. - Side benefits are unique to this approach, not nice-to-haves:
getSource()→ CSP-safe build-time precompile. Works in browser extensions, locked-down enterprise sites, Cloudflare Workers. Nothing else in the JS color-management space offers this.toModule()→ standalone, dep-free, bit-exact transform module (~50–80 KB for one transform, vs ~192 KB for the full engine). No competitor (lcms, color.js, culori) can do this. Reframes jsColorEngine as both a runtime and a build-time tool.- Debug-by-reading:
console.log(compiled.source)shows the actual math, comments and all. Makes the engine teachable.
- Compile cost is one-time and small (~1 ms). Amortised over thousands of pixels it's free; over millions it's invisible.
- Validated a measurement methodology — three independent methods (NOP-diff, profiler, in-line instrumentation) agreed on the bottleneck. That triangulation pattern is reusable beyond this work.
-
Stage emitter coverage is the real bill. Quick inventory of stages in
src/Transform.js:- Input adapters (~5):
stage_cmsLab_to_LabD50,stage_Int_to_Device,stage_RGB_to_Device,stage_Lab_to_PCS_v2/v4, the encoding decoders - Core math (~7):
Gamma_Inverse✅,matrix_rgb✅,PCSXYZ_to_PCSv2✅,PCSv2_to_PCSv4,chromaticAdaptation,absoluteAdaptation_in/out - CLUT (~3):
curve_v2✅,trilinearInterp3D✅,tetrahedralInterp4D - Intent/BPC (~3–5): perceptual intent table, BPC, gamut mapping markers
- Output adapters (~5): the inverses of the input set
- Diagnostic (2):
stage_debug✅,stage_history✅
Roughly 25–30 emitters, 6 ship today. At ~30 lines avg, that's ~750 lines added to
Transform.js(~6 % growth). Each emitter needs a bit-exactness test against the runtime stage. - Input adapters (~5):
-
Maintenance coupling. Changing a
stage_*functnow means changing the matchingemit_*to keep parity. Mitigated by a CI test that comparescompile()output tot.forward()for every supported chain on every PR. -
Compile() doesn't beat the LUT path. People doing bulk image work will stay on
'int-wasm-simd'regardless. Compile is the accuracy-tier speedup, not a universal one. -
Dist size. Adding emitters to
Transform.jsgrows the engine bundle. Splitting into a separate@jscolorengine/compilesub-package is possible but makes dist messy. Pragmatic answer: keep it in core; the engine is still under 200 KB.
- Pure JavaScript, no native deps → compile() preserves this; output is JS.
- Accuracy-first → compile() defaults to lcms-equivalent
~32-bit gamma precision (same as lcms's own default fast path —
see Prior art);
{ useGammaLUT: false }is the one-flag opt-in for full bit-exactMath.pow. The accuracy-first ethos is preserved by making the bit-exact opt-in trivial and well-named, not by leaving 2.8× on the table for everyone. - No magic / inspect the source →
getSource()literally exposes the math. - Smaller than lcms-wasm →
toModule()is the most extreme expression of this. - No telemetry, no upload → unchanged.
The one positioning shift: toModule() reframes jsColorEngine as
also a build-time tool, not just a runtime library. That's
evolution, not contradiction — but the README will need a small
update when v1.5 ships.
- v1.3 stays scoped: 16-bit I/O (
int16JS kernels — landed, WASM SIMD u16 still pending),.it8lcms compat harness,lutGridSize. The float-WASM tier originally pencilled in here has since been dropped from the roadmap — the v1.3 int16 measurements showed u16 LUT cells are the accuracy ceiling for every shipping ICC v2/v4 profile, so an f32 wasm kernel would not unlock a meaningful precision tier above u16. See the DROPPED block in Roadmap.md for the full reasoning. Compile is NOT pulled into v1.3. v1.3's compat harness needs the engine as-is as its baseline; introducing a new code path mid-stream would muddle the regression signal. - v1.5 reframes: flip the headline from "WASM emit + 200 MPx/s
ceiling" to "compile() stable +
toModule()+ coverage matrix". Defer the WASM emit target to a later release — the JS-emit + LUT + hot-loop combination already gets us to ~5× over runtime, and the bottleneck isMath.pow, which WASM doesn't make materially faster (the LUT does). v1.4 (ICCImagehelper + samples) lands ahead of this so the v1.3 perf story has runnable showcases by the time the compile work resumes. toModule()is the marquee feature. More important than the speedup. It's the one thing competitors can't match.
Yes worth shipping. The corrected tier numbers turn out stronger than the original framing suggested:
- Bit-exact
compile({ useGammaLUT: false })at ~12 MPx/s already doubles both lcms-wasm NOOPT and our own no-LUT runtime walker. No accuracy trade-off, just removed dispatch overhead. This is the path measurement-grade callers should pick — same numerics, better dispatch. - Default
compile()at ~33 MPx/s is the lcms-parity path. Same gamma-LUT trick lcms uses by default at the same scale and precision (see Prior art). Anyone who picked the no-LUT pipeline for accuracy reasons gets a ~5× boot at lcms-equivalent precision for free. compile({ hotLoop })at ~36 MPx/s lands within ~1.5× of LUT-mode lcms (52 MPx/s) andjsce float(54 MPx/s) — the no-LUT accuracy tier is no longer in a different league from LUT-mode lcms; it's in the same conversation, and at the same precision lcms ships by default.- The SIMD-CLUT speed tier (171 MPx/s) is untouched. Bulk image
work stays where it is. Compile doesn't compete with
int-wasm-simd, it complements the accuracy tier nothing else accelerated.
Direction for v1.5 (when we resume after the v1.4 ICCImage helper +
samples showcase release):
- Stage-emitter coverage to all ~25 stages (the real bill).
compile()becomes the default for the no-LUT path the moment coverage is complete — anyone using no-LUT mode gets the ~5× boot for free at lcms-equivalent ~32-bit gamma precision.useGammaLUT: falseis the documented one-line opt-in for full bit-exactMath.pow(measurement / oracle work).hotLoopstays opt-in (it changes the function signature fromfn(pixel)tofn(input, output, n)).getSource()andtoModule()ship as the marquee distribution features.
The current POC is paused, not abandoned — committed at the v1.3-pivot checkpoint so the measurement record stands.
# Headline: monolithic compiled vs runtime walker
node bench/compile_poc/bench_compiled.js --pixels 500000
# Differential analysis: which stage dominates the body?
node bench/compile_poc/bench_body_variants.js --pixels 500000
# V8 CPU profiler: per-stage attribution from real samples
node bench/compile_poc/profile_run.js --pixels 5000000
# (planned) CMYK→CMYK LUT-only chain — currently aborts on missing emitters
node bench/compile_poc/bench_compiled_cmyk2cmyk.jsAll four run from a fresh checkout in a few seconds each. profile_run.js
takes ~30 s for the 5M-pixel default because the V8 tick log needs
the bigger sample size for stable attribution.
- Roadmap § v1.5 — N-channel float inputs + compiled non-LUT pipeline +
toModule()— the production plan this POC is informing - Architecture — pipeline model, runtime walker
- JIT inspection — TurboFan-emitted x64 for the related kernel work; same JIT, same techniques
- Performance — overall throughput numbers across all kernel modes