harness: Add extract/chunk observability artifacts#1585

Open
jioffe502 wants to merge 1 commit into NVIDIA:main from
jioffe502:jioffe/harness-observability-rebuild

Conversation

@jioffe502
Collaborator

@jioffe502 jioffe502 commented Mar 11, 2026

Description

This adds harness-visible extract and chunk artifact dumps with the smallest viable wiring on top of current retriever behavior. It keeps LanceDB/recall semantics intact and avoids broader schema/runtime changes.

  • Adds harness config + command plumbing for extract/chunk artifact paths and records them in results.json (including ingest_errors.json path)
  • Adds a lightweight JSONL snapshot helper and wires dumps at two seams only: post-extract and pre-embed
  • Expands focused harness/batch tests for config validation, artifact path reporting, and snapshot stage wiring
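
For reviewers who want a mental model of the snapshot helper, here is a minimal sketch of a JSONL shard writer with an optional durable mirror. It is not the code in this PR: the function name `write_jsonl_snapshot`, its parameters, and the shard naming scheme are all assumptions made for illustration.

```python
import json
import shutil
from pathlib import Path


def write_jsonl_snapshot(records, out_dir, stage, shard_index=0, mirror_dir=None):
    """Write records as one JSONL shard; optionally copy it to a mirror dir.

    Hypothetical helper: name, signature, and file layout are assumptions.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    shard = out_dir / f"{stage}-{shard_index:05d}.jsonl"

    with shard.open("w", encoding="utf-8") as f:
        for record in records:
            # default=str stringifies values json cannot encode (e.g. Path).
            f.write(json.dumps(record, ensure_ascii=False, default=str) + "\n")

    if mirror_dir is not None:
        # Mirror under a directory with the same name as the source dir.
        mirror = Path(mirror_dir) / out_dir.name
        mirror.mkdir(parents=True, exist_ok=True)
        shutil.copy2(shard, mirror / shard.name)

    return shard
```

Calling such a helper once after extraction and once before embedding would match the two seams described above, with `mirror_dir` left as `None` when no archive root is configured.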

Observability artifacts (extracts/chunks)

This change adds optional harness observability artifacts for semantic inspection and debugging:

  • run-wide JSONL shards for extracts (extract_artifacts_dir) and pre-embed chunks (chunk_manifest_dir)
  • ingest_errors_file path captured in results.json
  • optional durable mirror paths when observability_archive_dir is configured

Usage (harness config):

  • write_extract_artifacts: true
  • write_chunk_manifest: true
  • observability_archive_dir: null (or durable root path)

This is intentionally minimal wiring on top of existing behavior; no LanceDB schema migration or broader runtime changes are introduced.

How to run and inspect artifacts

source .retriever/bin/activate
# standard run (uses harness config defaults)
retriever harness run --dataset jp20 --preset single_gpu
# optional: force observability via overrides
retriever harness run --dataset jp20 --preset single_gpu \
  --override write_extract_artifacts=true \
  --override write_chunk_manifest=true \
  --override observability_archive_dir=/path/to/archive
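
The `--override key=value` flags above carry booleans, nulls, and paths as plain strings. How the harness actually decodes them is not shown in this description; the sketch below is one plausible approach, assuming JSON-style scalars with a fallback to raw strings. The function name `parse_overrides` is hypothetical.

```python
import json


def parse_overrides(pairs):
    """Turn ["key=value", ...] into a dict, decoding JSON-like scalars.

    Hypothetical sketch; not the harness's actual override parser.
    """
    overrides = {}
    for pair in pairs:
        key, sep, raw = pair.partition("=")
        if not sep or not key:
            raise ValueError(f"expected key=value, got {pair!r}")
        try:
            # Decodes true/false/null/numbers written in JSON syntax.
            value = json.loads(raw)
        except json.JSONDecodeError:
            # Anything else (for example a filesystem path) stays a string.
            value = raw
        overrides[key] = value
    return overrides
```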

Artifacts are written under:

nemo_retriever/artifacts/<run_name>_<timestamp>/

Check paths in results.json:

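import json
from pathlib import Path

# Replace <run_dir> with the actual run directory name.
results_path = Path("nemo_retriever/artifacts/<run_dir>/results.json")
with results_path.open() as f:
    results = json.load(f)

# Print the artifact locations recorded by the harness.
for key in ("extract_artifacts_dir", "chunk_manifest_dir", "ingest_errors_file"):
    print(results["artifacts"][key])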

Then inspect shards:

ls <extract_artifacts_dir>/*.jsonl | head
ls <chunk_manifest_dir>/*.jsonl | head
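
Beyond listing the shards, it can help to peek at a few records. The sketch below assumes only that each shard is a `*.jsonl` file with one JSON object per line; the record fields are not specified here, so it prints nothing field-specific. The names `iter_jsonl_records` and `preview` are illustrative, not part of the harness.

```python
import json
from itertools import islice
from pathlib import Path


def iter_jsonl_records(shard_dir):
    """Yield (shard_name, record) for each non-blank line of each *.jsonl shard."""
    for shard in sorted(Path(shard_dir).glob("*.jsonl")):
        with shard.open(encoding="utf-8") as f:
            for line in f:
                if line.strip():
                    yield shard.name, json.loads(line)


def preview(shard_dir, limit=3):
    """Return the first `limit` (shard_name, record) pairs as a list."""
    return list(islice(iter_jsonl_records(shard_dir), limit))
```

For example, `preview("<extract_artifacts_dir>")` returns up to three records without loading every shard into memory, since the generator reads lazily.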

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.
  • If adjusting docker-compose.yaml environment variables, have you ensured those are mirrored in the Helm values.yaml file?

- wire extract/chunk dump paths through harness command and results
- add lightweight JSONL snapshot helper with optional durable mirror
- expand focused harness and batch tests for observability contract

Signed-off-by: Jacob Ioffe <jioffe@nvidia.com>
@jioffe502 jioffe502 requested a review from a team as a code owner March 11, 2026 21:30
@jioffe502 jioffe502 requested a review from drobison00 March 11, 2026 21:30