
Commit 39bf0ad

cmdcolin and claude committed
Further simplifications and correctness fixes in decode hot path
- huffman: fix crash when inner loop reaches last code (bounds check was after
  array access); remove dead commented-out method; nest early-return in
  buildCaches into if block; use `?? -1` instead of `!` for bitCodeToValue
  lookup; remove spurious inner braces in _decode
- decodeRecord: fold lengthOnRef computation into decodeReadFeatures return
  value, eliminating the second pass over read features; fix push(...spread) in
  getAllMatedRecords; hoist duplicate `content` variable in bind(); extract
  decodeQualityScores/decodeReadBases helpers; use Uint8Array+decodeLatin1 in
  decodeReadBases fallback; remove dead RFFn alias; fix stale comment
- index.ts: inline ByteArrayStopCodec decode in bind() fast path; deduplicate
  tag decoder subarray body via readTagLen closure; fix indentation

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
1 parent 35bc443 commit 39bf0ad

7 files changed: 154 additions & 166 deletions


README.md

Lines changed: 44 additions & 29 deletions
@@ -3,7 +3,8 @@
 [![NPM version](https://img.shields.io/npm/v/@gmod/cram.svg?style=flat-square)](https://npmjs.org/package/@gmod/cram)
 [![Build Status](https://img.shields.io/github/actions/workflow/status/GMOD/cram-js/push.yml?branch=main)](https://github.com/GMOD/cram-js/actions?query=branch%3Amain+workflow%3APush+)
 
-Read CRAM files with pure JS, works in node or the browser. Supports CRAM 2.x and 3.x, `.crai` indexes, and bzip2/lzma codecs.
+Read CRAM files with pure JS, works in node or the browser. Supports CRAM 2.x
+and 3.x, `.crai` indexes, and bzip2/lzma codecs.
 
 ## Install
 
@@ -53,7 +54,11 @@ samHeader
 })
 
 // Fetch records for a range (1-based, closed coordinates)
-const records = await indexedFile.getRecordsForRange(nameToId['chr1'], 10000, 20000)
+const records = await indexedFile.getRecordsForRange(
+  nameToId['chr1'],
+  10000,
+  20000,
+)
 
 for (const record of records) {
   console.log(record.readName, record.alignmentStart, record.mappingQuality)
@@ -66,25 +71,27 @@ for (const record of records) {
 }
 ```
 
-See the [example directory](./example) for browser usage with `<script>` tag and the bundled `cram-bundle.js`.
+See the [example directory](./example) for browser usage with `<script>` tag and
+the bundled `cram-bundle.js`.
 
 ## API
 
 ### `IndexedCramFile`
 
 ```js
 new IndexedCramFile({
-  cramPath, // local path
-  cramUrl, // remote URL
-  cramFilehandle, // generic-filehandle2 compatible handle
-  index, // CraiIndex instance (or any object with getEntriesForRange)
-  seqFetch, // async (seqId, start, end) => string
+  cramPath,         // local path
+  cramUrl,          // remote URL
+  cramFilehandle,   // generic-filehandle2 compatible handle
+  index,            // CraiIndex instance (or any object with getEntriesForRange)
+  seqFetch,         // async (seqId, start, end) => string
   checkSequenceMD5, // default true; set false to avoid large reference fetches
-  cacheSize, // max cached records, default 20000
+  cacheSize,        // max cached records, default 20000
 })
 ```
 
-- `getRecordsForRange(seqId, start, end, opts?)` → `Promise<CramRecord[]>` — 1-based closed coords. `opts`: `{ viewAsPairs, pairAcrossChr, maxInsertSize }`
+- `getRecordsForRange(seqId, start, end, opts?)` → `Promise<CramRecord[]>` —
+  1-based closed coords. `opts`: `{ viewAsPairs, pairAcrossChr, maxInsertSize }`
 - `hasDataForReferenceSequence(seqId)` → `Promise<boolean>`
 
 ### `CraiIndex`
@@ -93,29 +100,33 @@ Takes `{ path, url, filehandle }` — one of the three is required.
 
 ### `CramRecord`
 
-| Field | Description |
-|---|---|
-| `readName` | read name |
-| `sequenceId` | numeric reference ID |
-| `alignmentStart` | 1-based start |
-| `qualityScores` | `Int8Array` of per-base quality scores |
-| `readFeatures` | array of read features (see below) |
-| `tags` | auxiliary tags object |
+| Field            | Description                            |
+| ---------------- | -------------------------------------- |
+| `readName`       | read name                              |
+| `sequenceId`     | numeric reference ID                   |
+| `alignmentStart` | 1-based start                          |
+| `qualityScores`  | `Int8Array` of per-base quality scores |
+| `readFeatures`   | array of read features (see below)     |
+| `tags`           | auxiliary tags object                  |
 
-Flag methods (return `boolean`): `isPaired`, `isProperlyPaired`, `isSegmentUnmapped`, `isMateUnmapped`, `isReverseComplemented`, `isMateReverseComplemented`, `isRead1`, `isRead2`, `isSecondary`, `isFailedQc`, `isDuplicate`, `isSupplementary`
+Flag methods (return `boolean`): `isPaired`, `isProperlyPaired`,
+`isSegmentUnmapped`, `isMateUnmapped`, `isReverseComplemented`,
+`isMateReverseComplemented`, `isRead1`, `isRead2`, `isSecondary`, `isFailedQc`,
+`isDuplicate`, `isSupplementary`
 
-`getReadBases()` — returns the read sequence string. Requires `seqFetch` and is populated automatically by `getRecordsForRange`.
+`getReadBases()` — returns the read sequence string. Requires `seqFetch` and is
+populated automatically by `getRecordsForRange`.
 
 ### ReadFeatures
 
 Each entry in `record.readFeatures`:
 
-| Field | Description |
-|---|---|
-| `code` | feature type — one of `bqBXIDiQNSPH` (see CRAM spec §8) |
-| `pos` | read position (1-based) |
-| `refPos` | reference position (1-based) |
-| `ref` / `sub` | reference and substituted base (code `X` only) |
+| Field         | Description                                             |
+| ------------- | ------------------------------------------------------- |
+| `code`        | feature type — one of `bqBXIDiQNSPH` (see CRAM spec §8) |
+| `pos`         | read position (1-based)                                 |
+| `refPos`      | reference position (1-based)                            |
+| `ref` / `sub` | reference and substituted base (code `X` only)          |
 
 ### Error classes
@@ -125,19 +136,23 @@ Each entry in `record.readFeatures`:
 
 ## Publishing
 
-Push a git tag to trigger a release via GitHub Actions and [npm trusted publishing](https://docs.npmjs.com/generating-provenance-statements).
+Push a git tag to trigger a release via GitHub Actions and
+[npm trusted publishing](https://docs.npmjs.com/generating-provenance-statements).
 
 ## Academic Use
 
-Written with [NHGRI](http://genome.gov) funding as part of [JBrowse](http://jbrowse.org). If you use this in a publication, please cite the most recent JBrowse paper at [jbrowse.org](http://jbrowse.org).
+Written with [NHGRI](http://genome.gov) funding as part of
+[JBrowse](http://jbrowse.org). If you use this in a publication, please cite the
+most recent JBrowse paper at [jbrowse.org](http://jbrowse.org).
 
 ## License
 
 MIT © [Robert Buels](https://github.com/rbuels)
 
 ## Publishing
 
-[Trusted publishing](https://docs.npmjs.com/about-trusted-publishing) via GitHub Actions.
+[Trusted publishing](https://docs.npmjs.com/about-trusted-publishing) via GitHub
+Actions.
 
 ```bash
 npm version patch # or minor/major

src/craiIndex.ts

Lines changed: 1 addition & 1 deletion
@@ -25,7 +25,7 @@ function addRecordToIndex(index: ParsedIndex, record: number[]) {
     index[s] = []
   }
 
-  index[s]!.push({
+  index[s].push({
    start: start!,
    span: span!,
    containerStart: containerStart!,
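The dropped `!` is safe because the guard immediately above guarantees the entry exists. A minimal standalone sketch of the pattern, with a hypothetical `ParsedIndex` shape standing in for the real one:

```typescript
// Hypothetical shape standing in for cram-js's ParsedIndex
type ParsedIndex = Record<string, { start: number; span: number }[]>

const index: ParsedIndex = {}

function addRecordToIndex(seqId: string, start: number, span: number) {
  // After this guard the entry is known to be an array, so no
  // non-null assertion is needed on the push below
  if (!index[seqId]) {
    index[seqId] = []
  }
  index[seqId].push({ start, span })
}

addRecordToIndex('0', 100, 50)
addRecordToIndex('0', 200, 25)
```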

src/cramFile/codecs/huffman.ts

Lines changed: 17 additions & 26 deletions
@@ -152,14 +152,12 @@ export default class HuffmanIntCodec extends CramCodec<
     this.sortedValuesByBitCode = this.sortedCodes.map(c => c.value)
     this.sortedBitCodes = this.sortedCodes.map(c => c.bitCode)
     this.sortedBitLengthsByBitCode = this.sortedCodes.map(c => c.bitLength)
-    if (this.sortedBitCodes.length === 0) {
-      return
-    }
-    const maxBitCode = Math.max(...this.sortedBitCodes)
-
-    this.bitCodeToValue = new Array(maxBitCode + 1).fill(-1)
-    for (let i = 0; i < this.sortedBitCodes.length; i += 1) {
-      this.bitCodeToValue[this.sortedCodes[i]!.bitCode] = i
+    if (this.sortedBitCodes.length > 0) {
+      const maxBitCode = Math.max(...this.sortedBitCodes)
+      this.bitCodeToValue = new Array(maxBitCode + 1).fill(-1)
+      for (let i = 0; i < this.sortedBitCodes.length; i += 1) {
+        this.bitCodeToValue[this.sortedCodes[i]!.bitCode] = i
+      }
     }
   }
 
@@ -172,10 +170,6 @@ export default class HuffmanIntCodec extends CramCodec<
     return this._decode(slice, coreDataBlock, cursors.coreBlock)
   }
 
-  // _decodeNull() {
-  //   return -1
-  // }
-
   // the special case for zero-length codes
   _decodeZeroLengthCode() {
     return this.sortedCodes[0]!.value
@@ -194,20 +188,17 @@ export default class HuffmanIntCodec extends CramCodec<
         bits |= getBitsInline(input, coreCursor, bitsToRead)
       }
       prevLen = length
-      {
-        const index = this.bitCodeToValue[bits]!
-        if (index > -1 && this.sortedBitLengthsByBitCode[index] === length) {
-          return this.sortedValuesByBitCode[index]!
-        }
-
-        for (
-          let j = i;
-          this.sortedCodes[j + 1]!.bitLength! === length &&
-          j < this.sortedCodes.length;
-          j += 1
-        ) {
-          i += 1
-        }
+      const index = this.bitCodeToValue[bits] ?? -1
+      if (index > -1 && this.sortedBitLengthsByBitCode[index] === length) {
+        return this.sortedValuesByBitCode[index]!
+      }
+      for (
+        let j = i;
+        j + 1 < this.sortedCodes.length &&
+        this.sortedCodes[j + 1]!.bitLength === length;
+        j += 1
+      ) {
+        i += 1
       }
     }
     throw new CramMalformedError('Huffman symbol not found.')
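The `?? -1` change matters because `bits` can exceed the lookup table's length, where indexing returns `undefined` at runtime; the old non-null assertion only silenced the compiler. A standalone sketch with made-up table contents:

```typescript
// Hypothetical lookup table: index = bit code, value = symbol index, -1 = unused
const bitCodeToValue: number[] = [2, -1, 0, 1]

function lookupSymbol(bits: number): number {
  // Out-of-range reads yield undefined at runtime even though the element
  // type is number; `?? -1` collapses that case into the -1 sentinel the
  // caller already checks for
  return bitCodeToValue[bits] ?? -1
}
```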

src/cramFile/slice/decodeRecord.ts

Lines changed: 26 additions & 50 deletions
@@ -52,16 +52,9 @@ export interface BoundDecoders {
   TN(): number | undefined
 }
 
-/**
- * parse a BAM tag's array value from a binary buffer
- * @private
- */
-// Uses DataView instead of typed arrays (e.g. new Int32Array(buffer.buffer))
-// because the buffer may be a subarray of a larger ArrayBuffer. Typed array
-// constructors like Int32Array interpret .buffer as the entire underlying
-// ArrayBuffer starting at byte 0, ignoring the subarray's byteOffset. This
-// caused silent data corruption when reading tag values. DataView with explicit
-// byteOffset reads from the correct position within the parent buffer.
+// Uses DataView rather than typed arrays because the buffer is a subarray of a
+// larger ArrayBuffer. Int32Array(buffer.buffer) would start at byte 0 of the
+// parent, ignoring buffer.byteOffset, causing silent data corruption.
 function parseTagValueArray(buffer: Uint8Array) {
   const arrayType = String.fromCharCode(buffer[0]!)

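The retained comment can be demonstrated directly: a typed-array view built from a subarray's `.buffer` ignores its `byteOffset`, while a `DataView` constructed with that offset reads the intended bytes. A standalone sketch:

```typescript
// Parent buffer: 8 bytes, with the little-endian int32 value 42 at byte 4
const parent = new Uint8Array(8)
new DataView(parent.buffer).setInt32(4, 42, true)

const sub = parent.subarray(4) // byteOffset = 4, byteLength = 4

// Wrong: sub.buffer is the whole parent ArrayBuffer, so this reads bytes 0-3
const wrong = new Int32Array(sub.buffer, 0, 1)[0]

// Right: DataView with the subarray's explicit byteOffset reads bytes 4-7
const right = new DataView(sub.buffer, sub.byteOffset, sub.byteLength).getInt32(
  0,
  true,
)
```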
@@ -148,14 +141,13 @@ function parseTagData(tagType: string, buffer: Uint8Array) {
   throw new CramMalformedError(`Unrecognized tag type ${tagType}`)
 }
 
-// Read-feature schema: a charCode-indexed array of [letter, fn] tuples where
-// fn() decodes the feature's data, fully transformed
-// (character → fromCharCode, string → decodeLatin1, numArray → Array.from,
-// number → identity, B → [base, qualityScore]). Built once per slice; the
-// inner loop becomes a charCode lookup + monomorphic call.
+// Read-feature schema: a charCode-indexed array of [code, fn] tuples where
+// fn() decodes and transforms the feature's data (character → fromCharCode,
+// string → decodeLatin1, numArray → Array.from, number → identity,
+// B → [base, qualityScore]). Built once per slice; the inner loop becomes
+// a charCode lookup + monomorphic call with no per-feature allocation.
 type RFData = string | number | number[] | [string, number]
-type RFFn = () => RFData
-export type RFEntry = readonly [code: string, fn: RFFn]
+export type RFEntry = readonly [code: string, fn: () => RFData]
 
 export function buildRFSchema(
   bd: BoundDecoders,
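The schema comment describes a charCode-indexed dispatch table. A stripped-down sketch of the idea, with stand-in decoder functions in place of the real bound decoders:

```typescript
type Entry = readonly [code: string, fn: () => string | number]

// charCode-indexed array: one array lookup + one monomorphic call per
// feature, instead of dispatching on single-character strings
const schema: (Entry | undefined)[] = new Array(128)
schema['D'.charCodeAt(0)] = ['D', () => 3] // stand-in: deletion length
schema['X'.charCodeAt(0)] = ['X', () => 1] // stand-in: substitution code

function decodeFeature(charCode: number) {
  const entry = schema[charCode]
  if (!entry) {
    throw new Error(`unknown read feature code ${charCode}`)
  }
  const [code, fn] = entry
  return { code, data: fn() }
}
```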
@@ -165,7 +157,7 @@ export function buildRFSchema(
   const arr: (RFEntry | undefined)[] = new Array(128)
   arr['B'.charCodeAt(0)] = [
     'B',
-    () => [String.fromCharCode(bd.BA()!), bd.QS()!],
+    () => [String.fromCharCode(bd.BA()!), bd.QS()!] as [string, number],
   ]
   arr['X'.charCodeAt(0)] = ['X', () => bd.BS()!]
   arr['D'.charCodeAt(0)] = ['D', () => bd.DL()!]
@@ -186,14 +178,10 @@ function decodeReadFeatures(
   readFeatureCount: number,
   bd: BoundDecoders,
   schema: (RFEntry | undefined)[],
-) {
-  // Track the running offset between ref and read coordinates so that
-  // refPos = readPos + refOffset. Deletions advance ref past consumed
-  // ref bases (offset goes up); insertions advance read past consumed
-  // read bases (offset goes down). This mirrors CIGAR consume-ref vs
-  // consume-read semantics.
+): [ReadFeature[], number] {
   let readPos = 0
-  let refOffset = alignmentStart - 1
+  let refDelta = 0
+  const base = alignmentStart - 1
   const readFeatures: ReadFeature[] = new Array(readFeatureCount)
   const decodeFC = bd.FC
   const decodeFP = bd.FP
@@ -215,19 +203,19 @@
     readFeatures[i] = {
       code,
       pos: readPos,
-      refPos: readPos + refOffset,
+      refPos: readPos + base + refDelta,
       data,
     } as ReadFeature
 
     if (code === 'D' || code === 'N') {
-      refOffset += data as number
+      refDelta += data as number
     } else if (code === 'I' || code === 'S') {
-      refOffset -= (data as string).length
+      refDelta -= (data as string).length
     } else if (code === 'i') {
-      refOffset -= 1
+      refDelta -= 1
     }
   }
-  return readFeatures
+  return [readFeatures, refDelta]
 }
 
 export type BulkByteRawDecoder = (
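Returning `refDelta` is what lets the caller fold the old second pass away: deletions and skips consume reference bases (delta up), insertions and soft clips consume read bases (delta down), so `lengthOnRef = readLength + refDelta`, mirroring CIGAR consume-ref vs consume-read semantics. A standalone sketch with made-up features:

```typescript
// Hypothetical read features: D = deletion, I = insertion, i = single-base insert
const features: { code: string; data: number | string }[] = [
  { code: 'D', data: 3 },    // consumes 3 extra reference bases
  { code: 'I', data: 'AC' }, // consumes 2 read bases, no reference bases
  { code: 'i', data: 'G' },  // consumes 1 read base
]

let refDelta = 0
for (const { code, data } of features) {
  if (code === 'D' || code === 'N') {
    refDelta += data as number
  } else if (code === 'I' || code === 'S') {
    refDelta -= (data as string).length
  } else if (code === 'i') {
    refDelta -= 1
  }
}

const readLength = 100
const lengthOnRef = readLength + refDelta // 100 + 3 - 2 - 1
```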
@@ -260,11 +248,11 @@ function decodeReadBases(
   if (raw) {
     return decodeLatin1(raw)
   }
-  let s = ''
+  const buf = new Uint8Array(readLength)
   for (let i = 0; i < readLength; i++) {
-    s += String.fromCharCode(decodeBA()!)
+    buf[i] = decodeBA()!
   }
-  return s
+  return decodeLatin1(buf)
 }
 
 export type BoundTagDecoders = Record<
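The fallback change replaces per-character string concatenation with one byte buffer and a single latin1 decode. Assuming `decodeLatin1` is roughly `TextDecoder('latin1').decode`, a standalone sketch with a stand-in per-base decoder:

```typescript
const latin1 = new TextDecoder('latin1')

// Stand-in for a bound per-base decoder returning one char code per call
function makeDecoder(bases: string) {
  let i = 0
  return () => bases.charCodeAt(i++)
}

function decodeReadBases(decodeBA: () => number, readLength: number): string {
  // One allocation + one decode instead of readLength string concatenations
  const buf = new Uint8Array(readLength)
  for (let i = 0; i < readLength; i++) {
    buf[i] = decodeBA()
  }
  return latin1.decode(buf)
}
```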
@@ -304,7 +292,7 @@ export default function decodeRecord(
     : sliceHeader.parsedContent.refSeqId
 
   const readLength = bd.RL()!
-  // if APDelta, will calculate the true start in a second pass
+  // if APDelta, AP is a delta from the previous record's alignmentStart
   let alignmentStart = bd.AP()!
   if (compressionScheme.APdelta) {
     alignmentStart = alignmentStart + cursors.lastAlignmentStart
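The corrected comment describes plain delta coding: with `APdelta` set, each record's `AP` value is an offset from the previous record's start, accumulated through a cursor. A minimal sketch of that decoding step, with a stand-in cursors object:

```typescript
// Stand-in for the slice's cursor state
const cursors = { lastAlignmentStart: 0 }

function decodeAlignmentStart(ap: number, apDelta: boolean): number {
  let alignmentStart = ap
  if (apDelta) {
    // AP is relative to the previous record's alignmentStart
    alignmentStart += cursors.lastAlignmentStart
  }
  cursors.lastAlignmentStart = alignmentStart
  return alignmentStart
}
```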
@@ -393,28 +381,16 @@
   if (!BamFlagsDecoder.isSegmentUnmapped(flags)) {
     // reading read features
     const readFeatureCount = bd.FN()!
+    lengthOnRef = readLength
     if (readFeatureCount) {
-      readFeatures = decodeReadFeatures(
+      const [features, refDelta] = decodeReadFeatures(
         alignmentStart,
         readFeatureCount,
         bd,
         rfSchema,
       )
-    }
-
-    // compute the read's true span on the reference sequence, and the end
-    // coordinate of the alignment on the reference
-    lengthOnRef = readLength
-    if (readFeatures) {
-      for (const { code, data } of readFeatures) {
-        if (code === 'D' || code === 'N') {
-          lengthOnRef += data
-        } else if (code === 'I' || code === 'S') {
-          lengthOnRef = lengthOnRef - data.length
-        } else if (code === 'i') {
-          lengthOnRef = lengthOnRef - 1
-        }
-      }
+      readFeatures = features
+      lengthOnRef += refDelta
     }
     if (Number.isNaN(lengthOnRef)) {
       console.warn(
