Perf optimizations #164

Merged
cmdcolin merged 11 commits into main from perf-optimizations on Apr 27, 2026
@cmdcolin

Modest performance improvements for CRAM long and short reads. This PR also investigated lazily parsing tags, but that didn't produce any speed improvement (yet...)

This PR is largely generated by Claude Code.

I also asked whether multi-threading with SharedArrayBuffer could help, but Claude Code said probably not (yet...)

Key optimizations

  • Batch ITF8 pre-decoding via WASM: Pre-decodes variable-length integers in bulk from
    external codec blocks, replacing per-call overhead with a single pass. Biggest win for long
    reads where ExternalCodec.decode dominated CPU time.
  • Bound decoder closure caching: getCodecForDataSeries results are now cached and reused
    in the hot decode loop instead of being re-resolved per record.
  • Eliminated intermediate object allocations: decodeRecord() now constructs CramRecord
    directly instead of returning a temporary 17-property plain object that was immediately
    destructured and GC'd. Also eliminated the mateToUse intermediate object. Removes ~81k
    transient objects per 54k-record slice.
  • Node strip-types compatibility: Migrated from tsx to node --experimental-strip-types,
    removed CommonJS shims.
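
To illustrate the bound-closure idea from the list above, here is a minimal sketch. The `SliceContext` shape and both function names are hypothetical and are not the library's real API; the point is only that the block and cursor lookups move out of the per-record path and into slice setup.

```typescript
// Hypothetical types for illustration; not the library's real structures.
type Cursor = { pos: number }

interface SliceContext {
  blocksByContentId: Record<number, Uint8Array>
  cursors: Map<number, Cursor>
}

// Unbound path: every call repeats the Record lookup and the Map lookup.
function readByteUnbound(ctx: SliceContext, contentId: number): number {
  const block = ctx.blocksByContentId[contentId]
  const cursor = ctx.cursors.get(contentId)
  if (!block || !cursor) {
    throw new Error(`no external block for content id ${contentId}`)
  }
  const value = block[cursor.pos]
  if (value === undefined) {
    throw new RangeError('read past end of block')
  }
  cursor.pos++
  return value
}

// Bound path: resolve the block and cursor once, then return a closure
// that only touches the captured buffer and cursor.
function bindByteReader(ctx: SliceContext, contentId: number): () => number {
  const block = ctx.blocksByContentId[contentId]
  const cursor = ctx.cursors.get(contentId)
  if (!block || !cursor) {
    throw new Error(`no external block for content id ${contentId}`)
  }
  return () => {
    const value = block[cursor.pos]
    if (value === undefined) {
      throw new RangeError('read past end of block')
    }
    cursor.pos++
    return value
  }
}
```

In a decode loop, the closure would be created once per slice and called once per record, so the per-record cost is a single indexed read plus a cursor increment.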

Benchmark results (p50, 40 iterations)

   | File | Records | master | optimized | Speedup |
   |------|---------|--------|-----------|---------|
   | Short reads (2.5MB) | 54k | 281ms | 215ms | **1.30x** |
   | Long reads (1.5MB) | 1k | 91ms | 60ms | **1.52x** |
   | 400x short reads (14MB) | 800k | 7,521ms | 6,713ms | **1.12x** |
   | 400x long reads (70MB) | 2k | 4,287ms | 2,874ms | **1.49x** |
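
For reference, the Speedup column is the master time divided by the optimized time. The sketch below recomputes it from the table; the `median` helper is an assumption about how a p50 could be taken over the 40 iterations, not the benchmark script's actual code. Because the millisecond values shown are already rounded, recomputed ratios can differ from the table in the last digit (281 / 215 is about 1.307).

```typescript
// Median (p50) of a set of timings. Assumed definition: middle value,
// or the mean of the two middle values for an even sample count.
function median(samples: readonly number[]): number {
  if (samples.length === 0) {
    throw new RangeError('median of an empty sample set')
  }
  const sorted = [...samples].sort((a, b) => a - b)
  const mid = Math.floor(sorted.length / 2)
  const upper = sorted[mid] as number
  if (sorted.length % 2 === 1) {
    return upper
  }
  const lower = sorted[mid - 1] as number
  return (lower + upper) / 2
}

// Speedup as used in the table: baseline time over optimized time.
function speedup(masterMs: number, optimizedMs: number): number {
  return masterMs / optimizedMs
}

// p50 values copied from the table above.
const benchmarkRows = [
  { file: 'Short reads (2.5MB)', masterMs: 281, optimizedMs: 215 },
  { file: 'Long reads (1.5MB)', masterMs: 91, optimizedMs: 60 },
  { file: '400x short reads (14MB)', masterMs: 7521, optimizedMs: 6713 },
  { file: '400x long reads (70MB)', masterMs: 4287, optimizedMs: 2874 },
]

for (const row of benchmarkRows) {
  const ratio = speedup(row.masterMs, row.optimizedMs)
  console.log(`${row.file}: ${ratio.toFixed(2)}x`)
}
```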

GC / memory impact

  • Heap per record: 902 → 854 bytes (~5% reduction in retained heap)
  • Eliminated ~81k transient intermediate objects per slice decode
  • Long reads benefit most (~1.5x) because ExternalCodec.decode was a larger fraction of CPU time

cmdcolin and others added 5 commits March 21, 2026 13:28
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Two optimizations targeting ExternalCodec.decode, which was 10-28% of CPU:

1. Batch ITF8 pre-decode: Before the record decode loop, decode all ITF8
   values from external int blocks into Int32Arrays in a tight loop. During
   record decoding, reading a pre-decoded int is just values[index++]
   instead of branchy ITF8 parsing with per-call cursor/block lookups.

2. Bound decode closures: For each data series, create a closure at slice
   setup time that captures the resolved content buffer and cursor directly.
   This eliminates per-call codec cache lookup, blocksByContentId Record
   lookup, cursors.externalBlocks.getCursor() Map lookup, and dataType
   branching.

Also adds batch_itf8_decode to the htscodecs WASM module (C implementation)
for potential future use, though the pure JS batch approach proved faster
due to avoiding WASM memory copy overhead.

Benchmarks (p50, 40 iterations):
- Short reads (54k records): ~1.4x faster
- Long reads (37 records): ~1.4-1.7x faster
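
A minimal sketch of the batch pre-decode described in item 1, assuming the standard ITF8 layout from the CRAM specification (the count of leading 1 bits in the first byte gives the number of extra bytes). The function names are hypothetical and this is not the code from the PR.

```typescript
// Byte length of an ITF8 value, determined by the leading bits of byte 0.
function itf8Length(b0: number): number {
  if (b0 < 0x80) return 1
  if (b0 < 0xc0) return 2
  if (b0 < 0xe0) return 3
  if (b0 < 0xf0) return 4
  return 5
}

// Decode every ITF8 value in `buf` up front into an Int32Array.
function batchDecodeItf8(buf: Uint8Array): Int32Array {
  // Worst case is one value per byte, so this is always large enough.
  const out = new Int32Array(buf.length)
  let pos = 0
  let count = 0
  while (pos < buf.length) {
    const b0 = buf[pos] as number
    const len = itf8Length(b0)
    if (pos + len > buf.length) {
      throw new RangeError('truncated ITF8 value')
    }
    let value: number
    if (len === 1) {
      value = b0
    } else if (len === 2) {
      value = ((b0 << 8) | (buf[pos + 1] as number)) & 0x3fff
    } else if (len === 3) {
      value =
        ((b0 << 16) | ((buf[pos + 1] as number) << 8) | (buf[pos + 2] as number)) &
        0x1fffff
    } else if (len === 4) {
      value =
        ((b0 << 24) |
          ((buf[pos + 1] as number) << 16) |
          ((buf[pos + 2] as number) << 8) |
          (buf[pos + 3] as number)) &
        0x0fffffff
    } else {
      // Five-byte form: only the low 4 bits of the last byte are used.
      value =
        ((b0 & 0x0f) << 28) |
        ((buf[pos + 1] as number) << 20) |
        ((buf[pos + 2] as number) << 12) |
        ((buf[pos + 3] as number) << 4) |
        ((buf[pos + 4] as number) & 0x0f)
    }
    out[count++] = value
    pos += len
  }
  return out.subarray(0, count)
}
```

During record decoding, a reader over the result then reduces to `values[index++]`, which is the access pattern the commit message describes.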

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Have decodeRecord() construct CramRecord directly instead of returning
a temporary plain object that gets immediately destructured and GC'd.
Also eliminates the mateToUse intermediate object by building the mate
record in its final shape. Removes ~81k transient objects per 54k-record
slice decode.
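
The before/after shape of this change can be sketched as follows. `SketchRecord` and its three fields are hypothetical stand-ins for CramRecord, chosen only to keep the example short; the real record has many more properties.

```typescript
// Hypothetical stand-in for the real record class.
class SketchRecord {
  flags: number
  alignmentStart: number
  readLength: number

  constructor(flags: number, alignmentStart: number, readLength: number) {
    this.flags = flags
    this.alignmentStart = alignmentStart
    this.readLength = readLength
  }
}

// Before: decode into a temporary plain object, destructure it, then
// build the record. The temporary becomes garbage immediately.
function decodeViaTemporary(next: () => number): SketchRecord {
  const tmp = {
    flags: next(),
    alignmentStart: next(),
    readLength: next(),
  }
  const { flags, alignmentStart, readLength } = tmp
  return new SketchRecord(flags, alignmentStart, readLength)
}

// After: decode into locals and construct the record directly, so there
// is one allocation per record instead of two.
function decodeDirect(next: () => number): SketchRecord {
  const flags = next()
  const alignmentStart = next()
  const readLength = next()
  return new SketchRecord(flags, alignmentStart, readLength)
}
```

Both functions read the fields in the same order, so the change affects only how many objects are allocated, not what is decoded.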

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
# Conflicts:
#	eslint.config.mjs
#	package.json
#	scripts/analyze-profile.ts
#	scripts/bench-large.ts
#	scripts/profile-compare.ts
#	scripts/profile-cpu-branch.ts
#	scripts/profile-cpu.ts
#	src/craiIndex.ts
#	src/cramFile/codecs/byteArrayLength.ts
#	src/cramFile/codecs/external.ts
#	src/cramFile/container/index.ts
#	src/cramFile/file.ts
#	src/cramFile/record.ts
#	src/cramFile/sectionParsers.ts
#	src/cramFile/slice/decodeRecord.ts
#	src/cramFile/slice/index.ts
#	yarn.lock
@cmdcolin cmdcolin force-pushed the perf-optimizations branch from 3a8257f to a61813c Compare April 27, 2026 16:54

cmdcolin and others added 5 commits April 27, 2026 13:26
Replaces the string-keyed decodeDataSeries indirection with a fixed-shape
object literal holding all 28 data-series decoders. Hot call sites in
decodeRecord and decodeReadFeatures become direct property accesses (bd.FC(),
bd.BF()) so V8 inline-caches them. Read-feature schemas now hold pre-resolved
decoder references rather than string keys, and the inner FC/FP loop fetches
its decoders into locals.
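
A rough sketch of the difference between the string-keyed indirection and the fixed-shape object. Only three of the 28 series are shown, and `makeStringKeyedDecoder`, `makeBoundDecoders`, and `sumFeaturePositions` are hypothetical names, not functions from this repository.

```typescript
type IntDecoder = () => number
type SeriesName = 'BF' | 'FC' | 'FP'

// String-keyed indirection: each call does a keyed lookup before decoding.
function makeStringKeyedDecoder(
  decoders: Map<string, IntDecoder>,
): (name: string) => number {
  return name => {
    const decode = decoders.get(name)
    if (!decode) {
      throw new Error(`no decoder for data series ${name}`)
    }
    return decode()
  }
}

// Fixed-shape literal: every property is resolved eagerly, and the object
// always has the same keys in the same order, which keeps property access
// sites monomorphic for the engine's inline caches.
function makeBoundDecoders(resolve: (name: SeriesName) => IntDecoder) {
  return {
    BF: resolve('BF'),
    FC: resolve('FC'),
    FP: resolve('FP'),
  }
}

// Inner loop with the decoders fetched into locals first.
function sumFeaturePositions(
  bd: ReturnType<typeof makeBoundDecoders>,
  featureCount: number,
): number {
  const FC = bd.FC
  const FP = bd.FP
  let total = 0
  for (let i = 0; i < featureCount; i++) {
    FC() // feature code; ignored in this sketch
    total += FP()
  }
  return total
}
```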

HuffmanIntCodec.buildCaches now no-ops on empty codeBooks instead of throwing
RangeError on Math.max(...[]); this is required so the bd literal can call
getCodecForDataSeries for every series eagerly without try/catch.
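
One detail worth spelling out: `Math.max(...[])` itself returns `-Infinity` rather than throwing, so the RangeError presumably comes from a later step that uses that value as a size. The sketch below reproduces that failure mode with a hypothetical table builder; it is an assumption about the mechanism, not the actual buildCaches code.

```typescript
// Unguarded version: with no code lengths, Math.max yields -Infinity and
// `new Array(-Infinity)` throws "RangeError: Invalid array length".
function countByLengthUnguarded(bitLengths: number[]): number[] {
  const maxLen = Math.max(...bitLengths)
  const counts = new Array<number>(maxLen + 1).fill(0)
  for (const len of bitLengths) {
    counts[len] = (counts[len] ?? 0) + 1
  }
  return counts
}

// Guarded version: an empty code book is a no-op, so the builder can be
// called eagerly for every series without a try/catch.
function countByLength(bitLengths: number[]): number[] {
  if (bitLengths.length === 0) {
    return []
  }
  return countByLengthUnguarded(bitLengths)
}
```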

~22% faster on long-read decoding (decodeReadFeatures was 16% of CPU);
modest gain on short reads.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Simplified exports and removed redundant types declarations
- Standardized build scripts to use pnpm consistently
- Added main field for backwards compatibility
- Removed redundant module field
- Standardized tsconfig with strict TypeScript and es2022 target
- Fixed type errors and infrastructure issues where applicable

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
Required to import JavaScript files generated by WASM build

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
Modern fork with better performance and fewer dependencies.
Updates eslint config to use import-x rules.

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
Better type safety for array/object access.

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
@cmdcolin force-pushed the perf-optimizations branch 3 times, most recently from 0897349 to 39bf0ad on April 27, 2026 19:40
- huffman: fix crash when inner loop reaches last code (bounds check was
  after array access); remove dead commented-out method; nest early-return
  in buildCaches into if block; use ?? -1 instead of ! for bitCodeToValue
  lookup; remove spurious inner braces in _decode
- decodeRecord: fold lengthOnRef computation into decodeReadFeatures return
  value, eliminating the second pass over read features; fix push(...spread)
  in getAllMatedRecords; hoist duplicate `content` variable in bind(); extract
  decodeQualityScores/decodeReadBases helpers; use Uint8Array+decodeLatin1
  in decodeReadBases fallback; remove dead RFFn alias; fix stale comment
- index.ts: inline ByteArrayStopCodec decode in bind() fast path; deduplicate
  tag decoder subarray body via readTagLen closure; fix indentation
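
As background on the push(...spread) item: spreading a large array into `push` passes every element as a separate call argument, which can exceed the engine's argument or stack limit on big inputs. The helper below is one common alternative; `appendAll` is a hypothetical name, and the commit message does not say which form the actual fix took.

```typescript
// Append every element of `source` to `target` without spreading, so the
// number of elements is not bounded by the engine's argument limit.
function appendAll<T>(target: T[], source: readonly T[]): void {
  for (let i = 0; i < source.length; i++) {
    target.push(source[i] as T)
  }
}

// Usage: merge several batches of records into one array.
const merged: number[] = []
const batches = [
  [1, 2, 3],
  [4, 5],
]
for (const batch of batches) {
  appendAll(merged, batch)
}
```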

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@cmdcolin cmdcolin force-pushed the perf-optimizations branch from 39bf0ad to ebd003f Compare April 27, 2026 19:47
@cmdcolin cmdcolin merged commit ecb7cb6 into main Apr 27, 2026
1 check passed
@cmdcolin cmdcolin deleted the perf-optimizations branch April 27, 2026 19:48