[levanter] Defer token boundary rendering by dlwh · Pull Request #5023 · marin-community/marin

dlwh · 2026-04-21T23:28:13Z

Move token-boundary rendering for perplexity-gap literal examples out of the document scan hot path. Long-document reports now keep compact token span metadata while scoring and render boundaries only for literals that survive into the final report. Adds regression coverage for the lazy rendering contract.

Token boundary rendering for literal examples was happening while scanning every short segment, making long-document gap reports CPU-bound. Keep compact token span metadata in the hot path and render boundaries only for literals that survive into the final report.

claude · 2026-04-21T23:28:28Z

Claude finished @dlwh's task in 2m 21s —— View job

Code review

Reviewed the lazy boundary rendering refactor in lib/levanter/src/levanter/analysis/perplexity_gap.py. The refactor cleanly defers the expensive render_token_boundaries work from the per-segment hot path to the post-selection path in _top_literal_rows, and the new regression test (test_gap_report_builder_renders_literal_boundaries_only_for_reported_literals) verifies the contract end-to-end by monkeypatching render_token_boundaries and asserting the call list.

A few things I verified that looked like candidate issues but turned out to be fine:

Tie-break behavior preserved. The old path kept the first-seen example on ties (candidate.abs_delta_bits > current.abs_delta_bits); the new early-return (abs_delta_bits <= current.abs_delta_bits) keeps the same semantics.
np.searchsorted(..., side="right") on byte_ends correctly skips tokens whose byte_end == segment_byte_start (non-overlapping) and correctly positions the loop at the first token whose byte_end > segment_byte_start. The subsequent if token_end <= segment_byte_start: continue guard is redundant but harmless. Relies on byte_ends being monotonically non-decreasing after the >=0 filter, which tokenize_text_with_byte_spans guarantees (per-part HF offsets are monotonic and char_offset is cumulative across parts; BOS/EOS sentinels are filtered out).
Memory profile. literal_examples still carries one entry per distinct literal (unchanged), but each entry is now compact span metadata instead of the rendered strings — a net reduction. The rendering workload is capped at 2 * top_k_literals calls.
_RenderedLiteralExample/_tokenized_document_from_boundary_spans indirection. render_token_boundaries only reads byte_starts/byte_ends, so the placeholder token_ids=zeros(...) and num_bytes=0 are safe. Minor: you could pass arrays directly to render_token_boundaries instead of rebuilding a TokenizedDocument, but the current shape keeps the public rendering signature stable.

No issues found. Checked for bugs and CLAUDE.md / lib/levanter/AGENTS.md compliance.
• Branch

dlwh added the agent-generated Created by automation/agent label Apr 21, 2026

dlwh mentioned this pull request Apr 22, 2026

[data] Add a raw technical-text tranche beyond GitHub #4961

Open

dlwh merged commit 9ec4ccd into main Apr 22, 2026
38 checks passed

dlwh deleted the codex/perplexity-gap-lazy-boundaries branch April 22, 2026 17:11

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[levanter] Defer token boundary rendering#5023

[levanter] Defer token boundary rendering#5023
dlwh merged 1 commit intomainfrom
codex/perplexity-gap-lazy-boundaries

dlwh commented Apr 21, 2026

Uh oh!

claude Bot commented Apr 21, 2026 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

dlwh commented Apr 21, 2026

Uh oh!

claude Bot commented Apr 21, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Code review

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

claude Bot commented Apr 21, 2026 •

edited

Loading