Conversation
Two optimizations targeting ExternalCodec.decode, which was 10-28% of CPU:

1. Batch ITF8 pre-decode: Before the record decode loop, decode all ITF8 values from external int blocks into Int32Arrays in a tight loop. During record decoding, reading a pre-decoded int is just values[index++] instead of branchy ITF8 parsing with per-call cursor/block lookups.

2. Bound decode closures: For each data series, create a closure at slice setup time that captures the resolved content buffer and cursor directly. This eliminates the per-call codec cache lookup, the blocksByContentId Record lookup, the cursors.externalBlocks.getCursor() Map lookup, and dataType branching.

Also adds batch_itf8_decode to the htscodecs WASM module (C implementation) for potential future use, though the pure JS batch approach proved faster because it avoids WASM memory-copy overhead.

Benchmarks (p50, 40 iterations):
- Short reads (54k records): ~1.4x faster
- Long reads (37 records): ~1.4-1.7x faster

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
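The batch pre-decode step can be sketched as follows. This is a minimal sketch, not the PR's actual code: `decodeItf8Block` and its wiring are hypothetical, but the ITF8 bit layout follows the CRAM specification (leading bits of the first byte select a 1-5 byte encoding).

```typescript
// Decode every ITF8-encoded int in `buf` into a flat Int32Array in one
// tight pass, so the record decode loop can read values[index++] with no
// branching or cursor bookkeeping per call.
function decodeItf8Block(buf: Uint8Array): Int32Array {
  const out = new Int32Array(buf.length) // worst case: one value per byte
  let n = 0
  let i = 0
  while (i < buf.length) {
    const b0 = buf[i]
    if (b0 < 0x80) {
      out[n++] = b0 // 1 byte, 7 bits
      i += 1
    } else if (b0 < 0xc0) {
      out[n++] = ((b0 << 8) | buf[i + 1]) & 0x3fff // 2 bytes, 14 bits
      i += 2
    } else if (b0 < 0xe0) {
      out[n++] = ((b0 << 16) | (buf[i + 1] << 8) | buf[i + 2]) & 0x1fffff // 3 bytes
      i += 3
    } else if (b0 < 0xf0) {
      out[n++] =
        ((b0 << 24) | (buf[i + 1] << 16) | (buf[i + 2] << 8) | buf[i + 3]) &
        0x0fffffff // 4 bytes
      i += 4
    } else {
      // 5-byte form: 4 bits from the first and last bytes, 8 from each middle byte
      out[n++] =
        ((b0 & 0x0f) << 28) |
        (buf[i + 1] << 20) |
        (buf[i + 2] << 12) |
        (buf[i + 3] << 4) |
        (buf[i + 4] & 0x0f)
      i += 5
    }
  }
  return out.subarray(0, n)
}
```

The record loop then holds the returned array and a plain integer index, so each read is a single typed-array access.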
Have decodeRecord() construct CramRecord directly instead of returning a temporary plain object that gets immediately destructured and GC'd. Also eliminates the mateToUse intermediate object by building the mate record in its final shape.

Removes ~81k transient objects per 54k-record slice decode.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
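The shape of this change can be illustrated with a reduced example. The `CramRecord` here is a stand-in with far fewer fields than the real record, and both function bodies are placeholders:

```typescript
// Reduced stand-in for the real CramRecord, which has many more fields.
class CramRecord {
  constructor(
    public flags: number,
    public readLength: number,
    public alignmentStart: number,
  ) {}
}

// Before: decode built a throwaway plain object per record, which the
// caller destructured into a CramRecord and left for the GC.
function decodeRecordOld() {
  return { flags: 0, readLength: 100, alignmentStart: 1234 } // transient
}

// After: construct the final record directly -- one allocation per record.
function decodeRecordNew(): CramRecord {
  return new CramRecord(0, 100, 1234)
}
```

With two such intermediates removed (the record object and the mate object), a 54k-record slice saves on the order of the ~81k transient allocations mentioned above.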
# Conflicts:
#   eslint.config.mjs
#   package.json
#   scripts/analyze-profile.ts
#   scripts/bench-large.ts
#   scripts/profile-compare.ts
#   scripts/profile-cpu-branch.ts
#   scripts/profile-cpu.ts
#   src/craiIndex.ts
#   src/cramFile/codecs/byteArrayLength.ts
#   src/cramFile/codecs/external.ts
#   src/cramFile/container/index.ts
#   src/cramFile/file.ts
#   src/cramFile/record.ts
#   src/cramFile/sectionParsers.ts
#   src/cramFile/slice/decodeRecord.ts
#   src/cramFile/slice/index.ts
#   yarn.lock
Force-pushed from 3a8257f to a61813c.
Replaces the string-keyed decodeDataSeries indirection with a fixed-shape object literal holding all 28 data-series decoders. Hot call sites in decodeRecord and decodeReadFeatures become direct property accesses (bd.FC(), bd.BF()) so V8 inline-caches them. Read-feature schemas now hold pre-resolved decoder references rather than string keys, and the inner FC/FP loop fetches its decoders into locals.

HuffmanIntCodec.buildCaches now no-ops on empty codeBooks instead of throwing RangeError on Math.max(...[]); this is required so the bd literal can call getCodecForDataSeries for every series eagerly without try/catch.

~22% faster on long-read decoding (decodeReadFeatures was 16% of CPU); modest gain on short reads.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
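The fixed-shape table idea, reduced to a sketch. The series names come from the CRAM data-series codes mentioned above, but `buildBoundDecoders` and the decoder bodies are placeholders, not the PR's code:

```typescript
// A fixed-shape object literal with one decoder function per data series.
// Every slice produces an object with the same hidden class, so call sites
// like bd.BF() stay monomorphic and V8 can inline-cache the property loads.
function buildBoundDecoders(read: (series: string) => number) {
  // Placeholder closures; the real table covers all 28 data series, and each
  // closure captures its resolved codec, content buffer, and cursor.
  return {
    BF: () => read('BF'), // BAM bit flags
    FC: () => read('FC'), // read-feature code
    FP: () => read('FP'), // read-feature position
  }
}

let i = 0
const bd = buildBoundDecoders(() => i++)

// Direct property accesses -- no string-keyed lookup per record:
const flags = bd.BF()
const featureCode = bd.FC()
```

Contrast with the old path, where each read went through a string key (`decodeDataSeries('BF')`), forcing a dictionary-style lookup on every record.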
- Simplified exports and removed redundant types declarations
- Standardized build scripts to use pnpm consistently
- Added main field for backwards compatibility
- Removed redundant module field
- Standardized tsconfig with strict TypeScript and es2022 target
- Fixed type errors and infrastructure issues where applicable

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
Required to import JavaScript files generated by WASM build Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
Modern fork with better performance and fewer dependencies. Updates eslint config to use import-x rules. Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
Better type safety for array/object access. Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
Force-pushed from 0897349 to 39bf0ad.
- huffman: fix crash when the inner loop reaches the last code (the bounds check came after the array access); remove a dead commented-out method; nest the early return in buildCaches into an if block; use ?? -1 instead of ! for the bitCodeToValue lookup; remove spurious inner braces in _decode
- decodeRecord: fold the lengthOnRef computation into the decodeReadFeatures return value, eliminating the second pass over read features; fix push(...spread) in getAllMatedRecords; hoist the duplicate `content` variable in bind(); extract decodeQualityScores/decodeReadBases helpers; use Uint8Array + decodeLatin1 in the decodeReadBases fallback; remove the dead RFFn alias; fix a stale comment
- index.ts: inline the ByteArrayStopCodec decode in the bind() fast path; deduplicate the tag-decoder subarray body via a readTagLen closure; fix indentation

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
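The `?? -1` change replaces a non-null assertion on a Map lookup with an explicit miss sentinel. A reduced sketch (the map contents here are invented for illustration):

```typescript
// bitCodeToValue maps a Huffman bit pattern to its decoded symbol.
// A miss used to be forced through with a non-null assertion (get(code)!),
// which silently yields undefined at runtime; ?? -1 makes the sentinel explicit.
const bitCodeToValue = new Map<number, number>([[0b10, 7]])

function lookup(code: number): number {
  return bitCodeToValue.get(code) ?? -1
}
```

The assertion form type-checks but lies about the runtime value on a miss; the sentinel keeps the return type an honest `number`.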
Force-pushed from 39bf0ad to ebd003f.
Modest performance improvements for CRAM long- and short-read decoding. I also investigated lazily parsing tags, but it didn't yield any speed improvement (yet...).

This PR is largely Claude Code generated.

I also investigated whether multi-threading with SharedArrayBuffer could help, but the conclusion was probably not (yet...).
Key optimizations

- Batch ITF8 pre-decode of external codec blocks, replacing per-call overhead with a single pass. Biggest win for long reads, where ExternalCodec.decode dominated CPU time.
- getCodecForDataSeries results are now cached and reused in the hot decode loop instead of being re-resolved per record.
- decodeRecord() now constructs CramRecord directly instead of returning a temporary 17-property plain object that was immediately destructured and GC'd. Also eliminated the mateToUse intermediate object. Removes ~81k transient objects per 54k-record slice.
- Moved from tsx to node --experimental-strip-types; removed CommonJS shims.
Benchmark results (p50, 40 iterations)
GC / memory impact