jsColorEngine docs: ← Project README · Bench · Performance · Roadmap · Examples · API: Profile · Transform · Loader
Deep Dive: ← Index · LUT modes · JIT inspection · WASM kernels · Compiled pipeline
Everything jsColorEngine does is built around one class:
Transform. You give it a source profile and a
destination profile (plus a rendering intent, optional custom stages, and some
flags) and it assembles an optimised pipeline between them. Once built, the
pipeline converts one colour at a time — or a whole image — as fast as the
host JavaScript engine will let us.
srcProfile ──┐
│ new Transform({...}).create(src, dst, intent)
dstProfile ──┤ ──────────────────────────────────────────────► [pipeline]
│ ~μs/call accuracy
intent ──────┘ ~MPx/s image
There are two very different execution paths a Transform can take once it's
built — transform() for one colour at a time, and transformArray() for
bulk pixel data — and picking the right one is the single biggest performance
lever in the library. We'll walk through both.
A pipeline is an ordered list of stages. Each stage is a small function that takes the previous stage's output and applies one well-defined transformation: decode a tone-reproduction curve, multiply by a 3×3 matrix, interpolate a LUT, encode back out to a device space, and so on. For a typical RGB → CMYK conversion the pipeline might look like:
input: int8 RGB (raw bytes, 0..255 per channel)
└─► decode TRC ← sRGB curves, RGB to linear light
└─► matrix + CAT ← 3x3 to PCS XYZ, chromatic adaptation if needed
└─► XYZ → Lab ← PCS normalisation
└─► Lab → CMYK (LUT) ← BToA tag, 4-input tetrahedral interp
└─► encode output ← u8 clamp + pack
output: int8 CMYK
Stages are stored as small tagged records — input encoding, function, output encoding, state bag — and the pipeline carries them in order. The build-time pipeline optimiser inspects adjacent stage pairs and collapses redundancy: encode-then-decode of the same curve becomes a no-op and is deleted, PCS-version conversions that cancel out are dropped, matrix + matrix pairs could be premultiplied (future work).
-
Building is offline. Execution is hot. Pipeline construction runs once per
create()call — its speed doesn't matter. Pipeline execution runs per pixel. Its speed is everything. The implementation aggressively trades build-time cost for per-pixel speed: baking LUTs, unrolling interpolators, compiling WASM modules, pre-computing constants. -
Accuracy path and image path are different code. A 32×32 colour-picker swatch has hundreds of pixels, each rendered exactly once, and wants the last 0.05 dE of accuracy. A 4 K image has 8 million pixels, all rendered the same way, and wants every extra fixed-point saturation instruction the JIT can give us. Jamming those two workloads through the same hot loop would either cripple the fast path (allocations per pixel, dispatch per stage) or wreck the accuracy path (LUT quantisation when you don't want it). So we split them.
| Accuracy path | Image path | |
|---|---|---|
| API | transform.transform(obj) / transformArray(objs) with dataFormat: 'object' or 'objectFloat' |
transform.transformArray(typedArray, ...) with dataFormat: 'int8' and buildLut: true |
| Per pixel | Walks the entire pipeline, stage by stage, in f64 | Single n-D LUT interpolation |
| Typical cost | ~µs/pixel (micro seconds) | ~ns/pixel (nano seconds) |
| Allocations | One small object per pixel, fine for thousands | Zero in the hot loop (typed arrays, pre-baked LUT) |
| Accuracy | Full f64, bit-for-bit deterministic | LUT-quantised (typically ≪ 1 dE from the float path on u8 output) |
| When to use | Colour pickers, dE calculators, swatch libraries, prepress analysis | Images, video frames, canvas, batch ICC conversion |
These are selected at construction time via dataFormat and buildLut. You
do not switch between them at run time — pick at create() and stick with it.
// DON'T do this.
for (let i = 0; i < pixelCount; i++) {
out[i] = transform.transform(pixels[i]);
}That call bypasses the LUT entirely, allocates ~6 arrays per pixel, and
dispatches every stage via .call(this, ...). On a 4 MP image you're
roughly 30× slower than transformArrayViaLUT and you will GC-thrash
the host. The accuracy path is correct; it is not a fast loop. If you
have > ~10 k pixels, always set { buildLut: true, dataFormat: 'int8' }
and call transformArray.
For the image path, { buildLut: true } asks the engine to collapse the
entire pipeline into a single lookup table at create() time. The LUT is
an N-dimensional grid:
- 1D (for 1-channel inputs — greyscale) — 256-point typical
- 3D (for RGB input) — 33×33×33 typical, 49³ possible at
lutGridPoints3D: 49 - 4D (for CMYK input) — 17⁴ typical, 33⁴ possible at
lutGridPoints4D: 33
Each grid cell stores the output of the full pipeline at that input coordinate. At run time we find the enclosing tetrahedron, fetch its corners, and do a barycentric blend. This is ~10-100 ops depending on input dimensionality, versus 100-1000+ for walking the full pipeline.
The cost of pre-baking is paid once in create() — typically 2-20 ms for
3D, 10-30 ms for 4D, depending on grid size and host. That's a fair
trade on any workload bigger than a few thousand pixels.
The image path uses tetrahedral interpolation regardless of input
channel count. For 3 channels it splits the enclosing cube into 6
tetrahedra based on the input weights; for 4 channels it splits the
hypercube into 24. The tetrahedral scheme is both faster and more
accurate than trilinear for device-space LUTs — fewer grid fetches,
no "stripe" artefacts at the cube boundaries. LittleCMS uses the same
scheme for the same reasons; see the
lessons from reading lcms2's cmsintrp.c section
in the Performance doc.
(The one exception: for 3-channel PCS-to-device LUTs, addStageLUT()
automatically falls back to trilinear. That matches LittleCMS,
Photoshop, and SampleICC behaviour for compatibility with profile
manufacturers' expectations on those specific tags. Set
interpolation3D: 'tetrahedral' in the constructor to override.)
When the image path is active, the inner loop is selected by lutMode.
Each mode trades safety, accuracy, and speed differently. All four
produce bit-identical output for 8-bit input within their respective
kernel families (verified across a 6-configuration matrix in
bench/wasm_poc/).
lutMode |
CLUT storage | Math | Where the code lives |
|---|---|---|---|
'float' (default) |
Float64Array |
f64 tetrahedral interp | src/Transform.js *_loop functions |
'int' |
Uint16Array (Q0.16) |
int32 via Math.imul |
src/Transform.js, same unrolled shape as float |
'int-wasm-scalar' |
Uint16Array |
int32 in WASM | src/wasm/tetra3d_nch.wat, tetra4d_nch.wat |
'int-wasm-simd' |
Uint16Array |
v128 SIMD in WASM | src/wasm/tetra3d_simd.wat, tetra4d_simd.wat |
Each mode silently falls back to the previous one if it can't service the
LUT shape (e.g. SIMD kernels only support 3- and 4-channel output) or if
the host lacks the required capability (no WASM → demote to 'int', no
v128 → demote to 'int-wasm-scalar'). The selected kernel is reported in
transform.lutMode after create().
For the why of each mode's performance characteristic, see the separate LUT modes page.
Near the top of src/Transform.js there's a 200-line
header block with the authoritative usage guide, dataformat reference, and
performance caveats — written for the person who's about to modify the hot
loop. It's the closest thing this library has to a design doc. If you're
here because you're considering a PR against the kernel, read that before
you touch anything.
Key callouts from that header that matter architecturally:
- Pipeline construction runs once per
create()— its speed is irrelevant. Pipeline EXECUTION is per pixel — its speed is critical. - The hot-path interpolators look weird on purpose. Long inline arithmetic, duplicated code paths, very few helpers or named temps. The JIT compilers inline aggressively and reuse registers across folded expressions; introducing a helper call or a named temp can force a memory round-trip that measurably slows the loop. Don't "tidy" them without re-benchmarking. The JIT inspection page walks you through the V8 assembly that justifies this.
- Bounds checks are deliberately omitted in the inner loops. Passing out-of-range or wrong-length input is undefined behaviour (garbage out, no exception). Validation belongs at the API boundary, not the hot loop.
- Custom stages are baked into the LUT when
buildLut: true, so per-pixel custom effects cost zero at run time. That's the recommended way to apply per-image effects (ink limiting, saturation tweaks, greyscale conversion) without sacrificing LUT-path speed.
- LUT modes — what each
lutModeactually does at the instruction level - JIT inspection — why the scalar JS kernel hits the speeds it does
- WASM kernels — how the
.watfiles are laid out and the SIMD channel-parallel design - Performance — measured numbers across modes
- API: Transform — the constructor options reference