feat: add support for long-context documents#179

Open
eurekayuan wants to merge 3 commits into main from feature/long-context
Conversation

@eurekayuan

Summary

Several stages embedded the whole document in a single prompt and hit DataDesigner's 512K (MAX_RENDERED_LEN) render cap, failing outright on long inputs. Every such stage is now windowed: each chunked generator renders its own per-window prompt and calls the model directly, bypassing the cap. Stages keep a single-call fast path when the rendered prompt already fits, so short-document behavior is unchanged.

Per-stage windowing

  • Detection (chunked_detection.py, new): Overlapping fixed-size character windows; each window is a raw text slice sent to the detector. Per-window offsets are rebased to global, boundary-touching spans are dropped, and overlaps are resolved (resolve_overlaps).
  • Validation (chunked_validation.py): Not a text window — batches candidate entities (≤100 per call), each with a ±500-character excerpt. Calls run in parallel across the validator pool with round-robin + failover. Decisions are merged per row; the row is dropped only if every pool member fails.
  • Augmentation (chunked_augmentation.py): Overlapping character windows over tagged text plus seed JSON. A window dynamically shrinks if its rendered prompt exceeds the cap. Outputs are unioned and deduped by (value, label).
  • Latent (chunked_latent.py): Same mechanism as augmentation (rewrite mode only); deduped by (label, value).
  • Substitution map (chunked_replace.py): Abutting newline-aligned windows, no overlap. Each chunk carries the accumulated replacement map and a rolling summary, proposing replacements only for new entities so mappings stay consistent across chunks.
  • Rewrite generation (chunked_rewrite.py): Abutting newline-aligned windows, no overlap. Runs sequentially, passing a continuity preamble and rolling summary between chunks; rewritten parts are stitched.
  • Final judge (chunked_final_judge.py, new): Splits original and rewritten text into N positionally-paired slices, scores each, and aggregates per-dimension by minimum. Rubric scales are embedded in the prompt with structured output. Replaces the non-windowed LLMJudgeColumnConfig.
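
As an illustration of the detection bullet above, here is a minimal sketch of overlapping fixed-size windows with offset rebasing. The names `Span`, `iter_overlapping_windows`, and `rebase_spans` are hypothetical and are not taken from chunked_detection.py; the rule for what counts as "boundary-touching" is also an assumption.

```python
from typing import Iterator, NamedTuple


class Span(NamedTuple):
    start: int
    end: int
    label: str


def iter_overlapping_windows(text_len: int, window: int, overlap: int) -> Iterator[tuple[int, int]]:
    """Yield (start, end) character ranges; consecutive ranges share `overlap` chars."""
    if not 0 <= overlap < window:
        raise ValueError("overlap must be non-negative and smaller than window")
    step = window - overlap
    start = 0
    while True:
        end = min(start + window, text_len)
        yield start, end
        if end >= text_len:
            break
        start += step


def rebase_spans(spans: list[Span], win_start: int, win_end: int, text_len: int) -> list[Span]:
    """Shift window-local spans to global offsets, dropping boundary-touching ones.

    Assumption: a span touching an interior window edge may be truncated, and
    the neighbouring window sees it whole inside the overlap region.
    """
    win_len = win_end - win_start
    kept = []
    for span in spans:
        touches_left = span.start == 0 and win_start > 0
        touches_right = span.end == win_len and win_end < text_len
        if touches_left or touches_right:
            continue
        kept.append(Span(span.start + win_start, span.end + win_start, span.label))
    return kept
```

A span only survives this filter in some window if it is shorter than the overlap, which matches the bot's later note that the logic is sound for spans shorter than overlap_chars.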

Parallel processing

  • Stateless stages (detection, validation, augmentation, latent, judge) dispatch windows in parallel (bounded ThreadPoolExecutor; the per-alias rate limit still governs real in-flight calls) and merge afterward.
  • Stateful stages (substitution-map, rewrite generation) stay sequential to thread the map / rolling summary across seams for consistency.
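
A rough sketch of the stateless dispatch pattern described above, using only the standard library. `dispatch_windows` and `call_window` are hypothetical names, and the real generators may bound the pool and merge results differently.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, TypeVar

T = TypeVar("T")


def dispatch_windows(
    windows: list[tuple[int, int]],
    call_window: Callable[[int, int], T],
    max_workers: int = 4,
) -> list[T]:
    """Run one call per window on a bounded pool and return results in window order."""
    if not windows:
        return []
    workers = max(1, min(max_workers, len(windows)))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Executor.map preserves input order, so the merge step sees results
        # in document order even when calls finish out of order.
        return list(pool.map(lambda w: call_window(*w), windows))
```

The stateful stages cannot use this shape because each chunk's prompt depends on the map and summary produced by the previous chunk.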

Window sizing

  • detection_window_max_render_chars (default 128 KiB, clamped to at most NDD's render cap) is the single knob; it is threaded into detection, augmentation, latent, substitution-map, rewrite, and judge.
  • detection_window_safety_margin_chars (8K) leaves headroom for prompt scaffolding; detection_window_overlap_chars (1K) sets the overlap for the overlapping stages; a 4K floor prevents pathological shrinking.
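
One plausible way these knobs could combine into a per-window text budget is sketched below. The formula, the function name `effective_window_chars`, and the reading of "8K" and "4K" as 8,000 and 4,000 characters are all assumptions; the PR description does not spell out the exact arithmetic.

```python
def effective_window_chars(
    max_render_chars: int,
    render_cap: int,
    safety_margin_chars: int,
    floor_chars: int,
) -> int:
    """Clamp the configured size to the render cap, reserve headroom, apply the floor."""
    budget = min(max_render_chars, render_cap) - safety_margin_chars
    return max(budget, floor_chars)


# Illustrative values only: 128 KiB default, 512 KiB cap, 8,000 margin, 4,000 floor.
default_budget = effective_window_chars(128 * 1024, 512 * 1024, 8_000, 4_000)
```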

Fault tolerance & failure tracking

  • Augmentation, latent, and the final judge are resilient to a single bad window: a window whose call fails is logged and skipped rather than dropping the whole record. Skipped-window counts are surfaced in trace_dataframe (COL_AUGMENTATION_FAILED_WINDOWS / COL_LATENT_FAILED_WINDOWS); the judge degrades to defaults if all windows fail.
  • Detection windows re-raise (the record fails) to preserve detection completeness, and validation relies on pool failover.
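
The skip-and-count behavior for augmentation and latent windows could look roughly like the sketch below. `run_tolerant` and `call_window` are hypothetical names, and the dedup key simply follows the (value, label) convention mentioned in the summary.

```python
import logging
from typing import Callable, Iterable

logger = logging.getLogger(__name__)

Entity = tuple[str, str]  # (value, label)


def run_tolerant(
    windows: Iterable[str],
    call_window: Callable[[str], list[Entity]],
) -> tuple[list[Entity], int]:
    """Union entities across windows, skipping failed windows and counting them."""
    merged: list[Entity] = []
    seen: set[Entity] = set()
    failed = 0
    for idx, window in enumerate(windows):
        try:
            entities = call_window(window)
        except Exception:
            # A single bad window is logged and skipped, not fatal to the record.
            logger.warning("window %d failed; skipping", idx, exc_info=True)
            failed += 1
            continue
        for entity in entities:
            if entity not in seen:
                seen.add(entity)
                merged.append(entity)
    return merged, failed
```

The returned count is the kind of value that would be surfaced in a failed-windows trace column. A detection stage that must preserve completeness would let the exception propagate instead of catching it.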

Observability

Per-window debug logging across all chunked stages: window ranges/sizes, rendered length vs cap, shrink events, rolling-summary contents, and per-stage entity/replacement/window counts.

Type of Change

  • Bug fix
  • New feature
  • Breaking change
  • Documentation update
  • Refactoring

Testing

  • make test passes locally
  • make check passes locally (format + lint + typecheck + lock-check)
  • Added/updated tests for changes

Documentation

  • If docs changed: make docs-build passes locally

@eurekayuan eurekayuan requested review from a team as code owners June 3, 2026 18:42
@github-actions

github-actions Bot commented Jun 3, 2026

Contributor

All contributors have signed the DCO ✍️ ✅
Posted by the DCO Assistant Lite bot.

@eurekayuan eurekayuan changed the title Add support for handling long-context docs feat/long-context Jun 3, 2026
@eurekayuan eurekayuan changed the title feat/long-context feat: add support for long-context documents Jun 3, 2026
@eurekayuan

Author

I have read the DCO document and I hereby sign the DCO.

@greptile-apps

greptile-apps Bot commented Jun 3, 2026

Contributor

Greptile Summary

This PR adds a comprehensive windowing layer across every long-document bottleneck in the anonymizer pipeline, replacing single-call LLMStructuredColumnConfig/LLMTextColumnConfig usages with custom generators that render prompts directly and dispatch window-sized LLM calls, bypassing DataDesigner's 512K render cap.

  • Stateless stages (detection, augmentation, latent, final judge) use overlapping character windows dispatched in parallel via bounded ThreadPoolExecutor; augmentation, latent, and judge tolerate per-window failures gracefully, while detection re-raises to preserve recall completeness.
  • Stateful stages (substitute-map, rewrite generation) use non-overlapping newline-aligned windows processed sequentially, threading a rolling summary and accumulated replacement map across chunk seams.
  • A single detection_window_max_render_chars config knob (default 128 KiB, clamped to NDD's cap) controls window sizing across all stages.

Confidence Score: 4/5

Safe to merge with minor fixes; all correctness issues are limited to long-document paths and affect output cosmetics or serialization consistency rather than data loss or pipeline crashes.

The windowing design is sound and well-tested for core paths. The most visible issue is the stitching in chunked_rewrite: because iter_boundary_windows aligns cuts to newlines, each chunk already ends with a newline, and models naturally mirror that — so every chunk boundary in a long anonymized document will have a spurious blank line. The fast-path cap measurement is also off (preamble overhead counted but not sent), silently routing near-cap documents into the chunked path. In qa_generation.py both paths call .model_dump() without mode='json', inconsistently with every other windowed generator in the PR.

chunked_rewrite.py (stitching and cap measurement) and qa_generation.py (model_dump mode and private import of _compile_template)

Important Files Changed

Filename | Overview
src/anonymizer/engine/rewrite/chunked_rewrite.py | Sequential chunked rewrite with rolling-summary continuity. Fast path measures prompt with preamble but sends without it. "\n".join(...) stitching introduces double newlines at chunk boundaries.
src/anonymizer/engine/rewrite/qa_generation.py | Adds windowed meaning-unit extraction and quality-QA batching. Imports private _compile_template from chunked_steps. Fast and batched paths use .model_dump() without mode="json" inconsistently.
src/anonymizer/engine/rewrite/chunked_steps.py | Generic boundary-windowed step runner. _compile_template is a private function imported by qa_generation.py, creating a brittle cross-module coupling.
src/anonymizer/engine/rewrite/domain_classification.py | Uses first-window-only windowing. _first_output would raise IndexError on empty outputs list, unreachable in practice but unguarded.
src/anonymizer/engine/detection/chunked_detection.py | New windowed seed detection: rebases offsets, drops boundary spans, resolves overlaps. Logic is sound for spans shorter than overlap_chars.
src/anonymizer/engine/detection/chunked_augmentation.py | New windowed augmentation: overlapping windows, shrink-on-overflow, parallel dispatch, dedup merge. Well-structured with fast path.
src/anonymizer/engine/windowing.py | New shared boundary-windowing utility. next_window_end always makes forward progress. Logic is correct.
src/anonymizer/config/anonymizer_config.py | Adds three window-sizing config fields. Imports NDD cap with graceful fallback. Single source of truth for defaults.

Reviews (1): Last reviewed commit: "format scripts"

_clip(summary),
)

stitched = "\n".join(part for part in rewritten_parts if part)

Contributor

P2 Chunk boundaries are aligned to newlines by iter_boundary_windows, so each tagged[start:end] slice already ends with "\n". When the LLM mirrors that structure in its output (natural for paragraph-aware models), every rewritten_chunk also ends with "\n", and "\n".join(...) then inserts a second newline, producing a blank line at every chunk boundary in the final anonymized document. Joining with "" is sufficient because the delimiter is already part of each chunk.

Suggested change
- stitched = "\n".join(part for part in rewritten_parts if part)
+ stitched = "".join(part for part in rewritten_parts if part)

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
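
A small standalone demonstration of the effect described in this comment, assuming each rewritten chunk keeps its trailing newline. The sample chunks and the helper name `stitch` are made up for illustration.

```python
def stitch(parts: list[str]) -> str:
    """Concatenate rewritten chunks whose trailing newline is already present."""
    return "".join(part for part in parts if part)


chunks = ["First paragraph.\n", "Second paragraph.\n"]

# Joining on "\n" adds a second newline at the seam, i.e. a blank line.
joined_with_newline = "\n".join(chunks)

# Joining on "" keeps exactly the newline each chunk already carries.
joined_plain = stitch(chunks)
```

If a model drops the trailing newline from a chunk, plain concatenation would fuse two lines, so a production version may want to normalize chunk endings before joining.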

Comment on lines +172 to +181
# Fast path: the full single-call rewrite prompt fits under the cap.
single_rendered = _render_chunk_prompt(template=params.single_call_prompt_template, chunk_row=row, summary="")
if len(single_rendered) <= cap:
    logger.debug("rewrite: single-call fast path (rendered=%d chars <= cap=%d)", len(single_rendered), cap)
    text = _rewrite_chunk(
        facade=facade,
        prompt=_compile_template(params.single_call_prompt_template).render(**row),
        system_prompt=params.system_prompt,
        purpose="rewrite-generation",
    )

Contributor

P2 The fast path measures single_rendered as _render_chunk_prompt(..., summary=""), which prepends the ~270-char continuity preamble, but then the actual LLM call omits that preamble. This means a document whose body-only prompt falls in (cap - 270, cap] chars will be routed into the chunked path unnecessarily. Measure with just the body to match what is actually sent.

Suggested change
- # Fast path: the full single-call rewrite prompt fits under the cap.
- single_rendered = _render_chunk_prompt(template=params.single_call_prompt_template, chunk_row=row, summary="")
- if len(single_rendered) <= cap:
-     logger.debug("rewrite: single-call fast path (rendered=%d chars <= cap=%d)", len(single_rendered), cap)
-     text = _rewrite_chunk(
-         facade=facade,
-         prompt=_compile_template(params.single_call_prompt_template).render(**row),
-         system_prompt=params.system_prompt,
-         purpose="rewrite-generation",
-     )
+ # Fast path: measure body-only prompt (no continuity preamble) since that is what is sent.
+ single_rendered = _compile_template(params.single_call_prompt_template).render(**row)
+ if len(single_rendered) <= cap:
+     logger.debug("rewrite: single-call fast path (rendered=%d chars <= cap=%d)", len(single_rendered), cap)
+     text = _rewrite_chunk(
+         facade=facade,
+         prompt=single_rendered,
+         system_prompt=params.system_prompt,
+         purpose="rewrite-generation",
+     )

Comment on lines 27 to 29
)
from anonymizer.engine.ndd.model_loader import resolve_model_alias
from anonymizer.engine.prompt_utils import substitute_placeholders

Contributor

P2 Private symbol imported across module boundary. _compile_template is module-private (underscore-prefixed) in chunked_steps.py. Importing it here creates a hidden coupling: if the function is renamed or inlined, qa_generation.py breaks without any clear contract. Consider exposing it as a public helper in chunked_steps.py or defining a local copy with its own lru_cache in this module.

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Comment on lines +372 to +384
    row[COL_QUALITY_QA] = _generate(full_rendered, "quality-qa-generation").model_dump()
    return row

units = json.loads(row.get(COL_MEANING_UNITS_SERIALIZED) or "[]")
base_len = len(compiled.render(**{**row, COL_MEANING_UNITS_SERIALIZED: "[]"}))
batches = _batch_units_by_size(units, base_len, max_render_chars - safety_margin_chars)
items: list[dict[str, Any]] = []
for batch_idx, batch in enumerate(batches):
    rendered = compiled.render(**{**row, COL_MEANING_UNITS_SERIALIZED: json.dumps(batch, ensure_ascii=False)})
    out = _generate(rendered, f"quality-qa-generation-batch-{batch_idx}")
    for item in out.items:
        items.append({**item.model_dump(mode="json"), "id": len(items) + 1})
row[COL_QUALITY_QA] = QualityQAPairsSchema.model_validate({"items": items}).model_dump()

Contributor

P2 The fast path stores the result via .model_dump() (no mode="json"), while every other windowed generator in this PR consistently uses .model_dump(mode="json"). Without mode="json", Pydantic returns native Python objects rather than JSON-serializable equivalents, which can cause downstream serialization failures. The batched path has the same inconsistency.

Suggested change
-     row[COL_QUALITY_QA] = _generate(full_rendered, "quality-qa-generation").model_dump()
-     return row
- units = json.loads(row.get(COL_MEANING_UNITS_SERIALIZED) or "[]")
- base_len = len(compiled.render(**{**row, COL_MEANING_UNITS_SERIALIZED: "[]"}))
- batches = _batch_units_by_size(units, base_len, max_render_chars - safety_margin_chars)
- items: list[dict[str, Any]] = []
- for batch_idx, batch in enumerate(batches):
-     rendered = compiled.render(**{**row, COL_MEANING_UNITS_SERIALIZED: json.dumps(batch, ensure_ascii=False)})
-     out = _generate(rendered, f"quality-qa-generation-batch-{batch_idx}")
-     for item in out.items:
-         items.append({**item.model_dump(mode="json"), "id": len(items) + 1})
- row[COL_QUALITY_QA] = QualityQAPairsSchema.model_validate({"items": items}).model_dump()
+     row[COL_QUALITY_QA] = _generate(full_rendered, "quality-qa-generation").model_dump(mode="json")
+     return row
+ units = json.loads(row.get(COL_MEANING_UNITS_SERIALIZED) or "[]")
+ base_len = len(compiled.render(**{**row, COL_MEANING_UNITS_SERIALIZED: "[]"}))
+ batches = _batch_units_by_size(units, base_len, max_render_chars - safety_margin_chars)
+ items: list[dict[str, Any]] = []
+ for batch_idx, batch in enumerate(batches):
+     rendered = compiled.render(**{**row, COL_MEANING_UNITS_SERIALIZED: json.dumps(batch, ensure_ascii=False)})
+     out = _generate(rendered, f"quality-qa-generation-batch-{batch_idx}")
+     for item in out.items:
+         items.append({**item.model_dump(mode="json"), "id": len(items) + 1})
+ row[COL_QUALITY_QA] = QualityQAPairsSchema.model_validate({"items": items}).model_dump(mode="json")

Comment on lines +27 to +30
_DEFAULT_MAX_RENDER_CHARS: int = _DetectConfig.model_fields["detection_window_max_render_chars"].default
_DEFAULT_SAFETY_MARGIN_CHARS: int = _DetectConfig.model_fields["detection_window_safety_margin_chars"].default


Contributor

P2 Unguarded index on potentially empty list. _first_output calls outputs[0] without checking length. In run_windowed_step with first_only=True, if iter_boundary_windows returns an empty list, outputs is empty and this raises IndexError. The fast path makes this unreachable today, but a defensive guard would make the failure mode explicit.
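
A defensive guard along the lines suggested here might look like the sketch below. The public name `first_output` and the choice of ValueError are assumptions; the actual _first_output signature is not shown in this thread.

```python
from typing import Sequence, TypeVar

T = TypeVar("T")


def first_output(outputs: Sequence[T]) -> T:
    """Return the first window's output, failing with an explicit message if there is none."""
    if not outputs:
        raise ValueError("windowed step produced no outputs; expected at least one window")
    return outputs[0]
```

Raising a descriptive error keeps the failure mode explicit if the fast path ever stops shielding this call.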

@andreatgretel andreatgretel left a comment

Collaborator

Thanks for taking this on. This is a substantial first PR, and the overall direction makes sense: split long records into bounded windows, carry forward the state needed for consistency, and keep the replacement map explicit.

I left a few comments on edge cases I think are worth tightening before merge. The main themes are:

  • thread the user-supplied window sizing through every windowed stage
  • make per-window failures local where possible, instead of dropping the whole record
  • validate overlap settings early so a bad config cannot explode into thousands of model calls
  • avoid silently accepting empty rewrite chunks as successful output

The tests and docs coverage are in good shape, and I think the feature is close. These changes should make it more reliable on real long documents.

),
*self._qa_wf.columns(selected_models=selected_models),
*self._rewrite_gen_wf.columns(
window_max_render_chars=window_max_render_chars,

Collaborator

this only threads the user-supplied window cap into rewrite generation. domain classification, sensitivity disposition, QA generation, and final judge still build their window params from module defaults, so a user who lowers Detect.detection_window_max_render_chars still gets ~128k prompts in those stages. Could pass the same kwargs through those columns() calls and _run_final_judge too?

if params.first_only:
    windows = windows[:1]
outputs = []
for start, end in windows:

Collaborator

Claude Code caught this one: once this takes the windowed path, a single transient model error or a chunk that legitimately has no meaning units can drop the whole record. Could wrap each window call, skip/log failed windows, and handle the all-failed case explicitly?

"prompt scaffolding and tags when sizing augmentation/latent windows."
),
)
detection_window_overlap_chars: int = Field(

Collaborator

suggestion: can we validate that detection_window_overlap_chars is smaller than the effective window size? Right now overlap == window is accepted and the planners advance one character at a time. My smoke test turned a 20k-char row into 16,001 windows.
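
To make the blow-up concrete, here is a sketch of an early validation check plus a window counter. It assumes the planner clamps its step to at least one character and that the effective window in the smoke test was 4,000 characters; neither detail is stated in this comment, but together they reproduce the 16,001 figure. The function names are hypothetical.

```python
def validate_overlap(window_chars: int, overlap_chars: int) -> None:
    """Reject overlap settings that would leave the planner with no real stride."""
    if overlap_chars < 0:
        raise ValueError("overlap must be non-negative")
    if overlap_chars >= window_chars:
        raise ValueError(
            f"overlap ({overlap_chars}) must be smaller than the effective window ({window_chars})"
        )


def count_windows(text_len: int, window_chars: int, overlap_chars: int) -> int:
    """Count windows for a planner whose step is clamped to at least 1 (assumed)."""
    if text_len <= window_chars:
        return 1
    step = max(1, window_chars - overlap_chars)
    # Ceiling division on the remaining length, plus the first window.
    return -(-(text_len - window_chars) // step) + 1
```

With a 1,000-character overlap the same 20,000-character row needs only 7 windows, so running the check at config-load time is cheap insurance.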

_clip(summary),
)

stitched = "\n".join(part for part in rewritten_parts if part)

Collaborator

separate from the newline-stitching comment already here: filtering with if part also hides an empty rewrite chunk. If one window returns {"rewritten_text": ""}, that section disappears with no failed-window count or review signal. Maybe count/flag empty chunks instead of treating them as successful output?
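
One way to surface empty chunks instead of silently filtering them, sketched with a hypothetical helper `stitch_with_empty_report`. It joins on "" per the earlier stitching comment and treats whitespace-only output as empty, which is an assumption about what should count.

```python
def stitch_with_empty_report(parts: list[str]) -> tuple[str, list[int]]:
    """Stitch rewritten chunks and report the indices of chunks that came back empty."""
    empty_indices = [idx for idx, part in enumerate(parts) if not part.strip()]
    stitched = "".join(parts)
    return stitched, empty_indices
```

The caller could record the length of the returned index list in a failed-window style column, or route the record to review, rather than treating the stitched text as a clean success.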
