Skip to content

[levanter] Defer token boundary rendering#5023

Merged
dlwh merged 1 commit intomainfrom
codex/perplexity-gap-lazy-boundaries
Apr 22, 2026
Merged

[levanter] Defer token boundary rendering#5023
dlwh merged 1 commit intomainfrom
codex/perplexity-gap-lazy-boundaries

Conversation

@dlwh
Copy link
Copy Markdown
Member

@dlwh dlwh commented Apr 21, 2026

Move token-boundary rendering for perplexity-gap literal examples out of the document scan hot path. Long-document reports now keep compact token span metadata while scoring and render boundaries only for literals that survive into the final report. Adds regression coverage for the lazy rendering contract.

Token boundary rendering for literal examples was happening while scanning every short segment, making long-document gap reports CPU-bound. Keep compact token span metadata in the hot path and render boundaries only for literals that survive into the final report.
@dlwh dlwh added the agent-generated Created by automation/agent label Apr 21, 2026
@claude
Copy link
Copy Markdown
Contributor

claude Bot commented Apr 21, 2026

Claude finished @dlwh's task in 2m 21s —— View job


Code review

Reviewed the lazy boundary rendering refactor in lib/levanter/src/levanter/analysis/perplexity_gap.py. The refactor cleanly defers the expensive render_token_boundaries work from the per-segment hot path to the post-selection path in _top_literal_rows, and the new regression test (test_gap_report_builder_renders_literal_boundaries_only_for_reported_literals) verifies the contract end-to-end by monkeypatching render_token_boundaries and asserting the call list.

A few things I verified that looked like candidate issues but turned out to be fine:

  • Tie-break behavior preserved. The old path kept the first-seen example on ties (candidate.abs_delta_bits > current.abs_delta_bits); the new early-return (abs_delta_bits <= current.abs_delta_bits) keeps the same semantics.
  • np.searchsorted(..., side="right") on byte_ends correctly skips tokens whose byte_end == segment_byte_start (non-overlapping) and correctly positions the loop at the first token whose byte_end > segment_byte_start. The subsequent if token_end <= segment_byte_start: continue guard is redundant but harmless. Relies on byte_ends being monotonically non-decreasing after the >=0 filter, which tokenize_text_with_byte_spans guarantees (per-part HF offsets are monotonic and char_offset is cumulative across parts; BOS/EOS sentinels are filtered out).
  • Memory profile. literal_examples still carries one entry per distinct literal (unchanged), but each entry is now compact span metadata instead of the rendered strings — a net reduction. The rendering workload is capped at 2 * top_k_literals calls.
  • _RenderedLiteralExample/_tokenized_document_from_boundary_spans indirection. render_token_boundaries only reads byte_starts/byte_ends, so the placeholder token_ids=zeros(...) and num_bytes=0 are safe. Minor: you could pass arrays directly to render_token_boundaries instead of rebuilding a TokenizedDocument, but the current shape keeps the public rendering signature stable.

No issues found. Checked for bugs and CLAUDE.md / lib/levanter/AGENTS.md compliance.
Branch

@dlwh dlwh merged commit 9ec4ccd into main Apr 22, 2026
38 checks passed
@dlwh dlwh deleted the codex/perplexity-gap-lazy-boundaries branch April 22, 2026 17:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

agent-generated Created by automation/agent

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant