Skip to content

feat(xml): add parseXmlRecords#7111

Open
tomas-zijdemans wants to merge 8 commits into
denoland:mainfrom
tomas-zijdemans:record-stream
Open

feat(xml): add parseXmlRecords#7111
tomas-zijdemans wants to merge 8 commits into
denoland:mainfrom
tomas-zijdemans:record-stream

Conversation

@tomas-zijdemans

@tomas-zijdemans tomas-zijdemans commented Apr 25, 2026

Copy link
Copy Markdown
Contributor

Adds parseXmlRecords: an async generator that adapts SAX-style XML event callbacks into a typed AsyncGenerator<T> of records. Mirrors the function shape of its sibling parseXmlStream, but yields records as they're parsed so callers can iterate large feeds with per-record backpressure without buffering the whole document.

for await (const item of parseXmlRecords<Item>(source, (emit) => ({
  onEndElement(name) { if (name === "item") emit(buildItem()); },
  // ...build state in the other callbacks
}))) {
  // process item
}

For pipeThrough composition, wrap with ReadableStream.from(parseXmlRecords(...)).

@github-actions github-actions Bot added the xml label Apr 25, 2026
@codecov

codecov Bot commented Apr 25, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 98.11321% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 94.58%. Comparing base (f0c9f14) to head (7c985dc).

Files with missing lines Patch % Lines
xml/parse_records.ts 96.29% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@           Coverage Diff           @@
##             main    #7111   +/-   ##
=======================================
  Coverage   94.57%   94.58%           
=======================================
  Files         636      639    +3     
  Lines       52138    52173   +35     
  Branches     9399     9404    +5     
=======================================
+ Hits        49311    49348   +37     
+ Misses       2249     2247    -2     
  Partials      578      578           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@bartlomieju bartlomieju left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The implementation is correct and well-tested — SAX init mirrors parseXmlStream, TransformStream auto-handles errors in transform/flush, and the start try/catch guards the factory. A few concerns:

Thin wrapper over parseXmlStream

The user still writes the exact same SAX callback boilerplate (tracking insideItem, accumulating text, emitting on end element). The only difference is emit(record) vs items.push(record). The real value is stream composability (pipeThrough), but users can already achieve this with a small wrapper around parseXmlStream:

new ReadableStream({
  async start(controller) {
    await parseXmlStream(source, {
      onEndElement(name) { if (name === "item") controller.enqueue(item); }
    });
    controller.close();
  }
});

Is the convenience of XmlRecordStream enough to justify a new public API surface? The createCallbacks factory pattern is also unusual — most TransformStream subclasses in std take declarative options. This one takes a callback factory that returns callbacks, so the user is essentially writing a SAX handler with an extra indirection layer rather than getting a higher-level abstraction.

Duplicated initialization logic

Lines 93–106 of record_stream.ts duplicate the option extraction and XmlTokenizer/XmlEventParser construction from parseXmlStream (lines 67–71 of parse_stream.ts). If defaults change in one place, they'd need to change in both.

Backpressure test may be flaky

The test "pauses upstream pulls while downstream is blocked" uses setTimeout with 5ms and 30ms delays to detect backpressure behavior. This is timing-sensitive and may flake on slow CI runners.

Chunk-level backpressure limitation

As noted in the JSDoc, all records from a single input chunk are enqueued synchronously. If one chunk produces many records, they all buffer at once. This is inherent to the synchronous SAX model but worth a note in user-facing docs since "stream" implies per-record backpressure.

@tomas-zijdemans

Copy link
Copy Markdown
Contributor Author

Thin wrapper over parseXmlStream / unusual factory pattern

Yep. You're right that the class shape was the wrong abstraction here, and the createCallbacks factory only existed to thread emit through a class constructor. I've replaced XmlRecordStream with parseXmlRecords, an async generator that mirrors parseXmlStream's shape (function, not class). This gives a substantially smaller surface.

For users who want pipeThrough composition, one-line: ReadableStream.from(parseXmlRecords(...)).

The SAX boilerplate is unchanged, that's fundamental to event-driven parsing. But the goal here isn't to abstract that away, it's to give a small idiomatic adapter that turns "stream of XML chunks" into "stream of records" with per-record backpressure.

Duplicated initialization logic

Good catch. Extracted to _pipeline.ts (createXmlPipeline) and used by both parseXmlStream and parseXmlRecords.

Backpressure test may be flaky

Agreed. The test no longer uses fixed setTimeout delays.

Chunk-level backpressure limitation

The async generator fixes the consumer-side concern: records are yielded one-by-one, so a slow consumer applies backpressure per record. JSDoc on parseXmlRecords now states this explicitly so callers know to size input chunks accordingly.

@tomas-zijdemans tomas-zijdemans changed the title feat(xml): add XmlRecordStream feat(xml): add parseXmlRecords Apr 28, 2026

@fibibot fibibot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All four prior concerns addressed: createXmlPipeline in _pipeline.ts dedups the tokenizer/parser init shared with parseXmlStream; the class+factory shape is replaced with an async generator that gives per-record backpressure (for await ... yield); JSDoc on parseXmlRecords calls out the chunk-level buffer ceiling explicitly; and the new tests no longer rely on setTimeout thresholds — the early-break test instruments pull count, and per-record yielding is asserted by setTimeout(0) between iterations rather than racing a deadline.

The fail-fast comment in parse_records.ts is load-bearing — it explains why the try { … } finally { drain } pattern is unsafe (iter.return() swallowing the pending exception). Worth keeping. Skipping tokenizer.finalize / parser.finalize on consumer break is fine; both are validation-only (unclosed-element / no-root checks) with no resource cleanup. CI green, xml is 0.1.1 so landing in mod.ts is allowed.

  • nit: test at xml/parse_records_test.ts:484 is named "…rejects with the parse error even if the consumer breaks early when the next chunk would throw" but the body has no assertRejects — it asserts received === ["1"] after a clean break. The behaviour the test documents (early break stops iteration, malformed chunk never processed) is the opposite of what the name says. Rename to e.g. "consumer break before a later malformed chunk yields cleanly without parsing it".

@fibibot

fibibot commented May 14, 2026

Copy link
Copy Markdown

@bartlomieju this is ready to merge

@bartlomieju

Copy link
Copy Markdown
Member

Solid addition — the fail-fast contract is well-documented and the test coverage of error paths, chunk boundaries, and consumer-break is thorough. A few things to address:

  • Missing parseXmlRecordsFromBytes. parse_stream.ts ships both parseXmlStream(AsyncIterable<string>) and parseXmlStreamFromBytes(AsyncIterable<Uint8Array>). The new module only ships the string variant; users with a ReadableStream<Uint8Array> will need to bolt on TextDecoderStream themselves. Either add the bytes variant for symmetry, or note in the JSDoc that bytes consumers should pipe through TextDecoderStream first (with a snippet).
  • Lead with the type-inference win. Users don't have to specify T — TypeScript infers it from emit({...}) inside the factory. The current example uses <string> which suggests the type parameter is required; a short example without it would make this discoverable.
  • Concrete memory-bound example. "Peak memory is bounded by the records produced by a single chunk" is accurate but abstract. A line like "a 1 MB chunk containing 1000 items will buffer 1000 records before the consumer sees the first" makes the trade-off vs. true per-record streaming explicit.
  • Test name is misleading. "parseXmlRecords() rejects with the parse error even if the consumer breaks early when the next chunk would throw" actually asserts the opposite — when the consumer breaks before the malformed chunk is reached, the error is not surfaced. Suggest "... suppresses parse errors when the consumer breaks before the malformed chunk" to match the assertion.
  • Comment about try { … } finally { drain } is correct in spirit but slightly hand-wavy on the spec mechanics. Consider trimming to one sentence ("a drain-then-throw approach interacts poorly with consumer break, so we propagate immediately and drop the chunk's pending records") to avoid relitigating ECMAScript semantics in source.
  • Nit: buffer.length = 0 after the final drain is unnecessary — the generator is about to return.

The _pipeline.ts extraction is a clean lift-and-shift and the per-record backpressure / early-termination tests are exactly what I wanted to see. Ready to go once the FromBytes asymmetry is resolved (either added, or documented with a workaround snippet).

@tomas-zijdemans

Copy link
Copy Markdown
Contributor Author

Thanks for the feedback @bartlomieju

  • parseXmlRecordsFromBytes — thin wrapper, hoist decodeAsyncIterable to a shared private module, update parse_stream.ts to import from there.
  • JSDoc example — replace with record-shaped (emit({id, title})), no explicit .
  • Memory-bound docs — add concrete "1 MB / 1000 items" sentence.
  • Test rename — "parseXmlRecords() suppresses parse errors when the consumer breaks before the malformed chunk".
  • Fail-fast comment — trim to one sentence, keep "Fail-fast contract:" label.
  • Delete buffer.length = 0 after final drain.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants