harness: Add extract/chunk observability artifacts by jioffe502 · Pull Request #1585 · NVIDIA/NeMo-Retriever

jioffe502 · 2026-03-11T21:30:15Z

Description

This adds harness-visible extract and chunk artifact dumps with the smallest viable wiring on top of current retriever behavior. It keeps LanceDB/recall semantics intact and avoids broader schema/runtime changes.

- Adds harness config + command plumbing for extract/chunk artifact paths and records them in results.json (including ingest_errors.json path)
- Adds a lightweight JSONL snapshot helper and wires dumps at two seams only: post-extract and pre-embed
- Expands focused harness/batch tests for config validation, artifact path reporting, and snapshot stage wiring
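
For readers who want a picture of what a lightweight JSONL snapshot helper with an optional durable mirror could look like, here is a minimal sketch. It is not the code in this PR: the function name `write_jsonl_shard`, its parameters, and the shard naming are all hypothetical, and the only assumption taken from the description is that each stage dumps records as one JSON object per line and may copy the result to a second location.

```python
import json
import shutil
from pathlib import Path
from typing import Iterable, Mapping, Optional


def write_jsonl_shard(
    records: Iterable[Mapping],
    out_dir: Path,
    shard_name: str,
    mirror_dir: Optional[Path] = None,
) -> Path:
    """Write records as one JSON object per line; optionally copy the shard to mirror_dir.

    Hypothetical helper for illustration only; names and layout are assumptions.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    shard_path = out_dir / f"{shard_name}.jsonl"
    with shard_path.open("w", encoding="utf-8") as fh:
        for record in records:
            # default=str keeps the dump from failing on non-JSON values such as paths.
            fh.write(json.dumps(record, ensure_ascii=False, default=str) + "\n")
    if mirror_dir is not None:
        # Durable mirror: a plain file copy into a second directory.
        mirror_dir.mkdir(parents=True, exist_ok=True)
        shutil.copy2(shard_path, mirror_dir / shard_path.name)
    return shard_path
```

A caller at the post-extract seam might invoke it as `write_jsonl_shard(rows, Path(extract_dir), "post_extract-00000")`; the shard name here is made up for the example.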

Observability artifacts (extracts/chunks)

This change adds optional harness observability artifacts for semantic inspection and debugging:

- run-wide JSONL shards for extracts (extract_artifacts_dir) and pre-embed chunks (chunk_manifest_dir)
- ingest_errors_file path captured in results.json
- optional durable mirror paths when observability_archive_dir is configured

Usage (harness config):

write_extract_artifacts: true
write_chunk_manifest: true
observability_archive_dir: null  # or a durable root path
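
The three keys above could be validated along the following lines. This is only a sketch of the kind of check the config validation tests might exercise: the `ObservabilityConfig` class is hypothetical, and the defaults shown (both flags off, no archive directory) are assumptions rather than the harness's documented defaults.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ObservabilityConfig:
    """Hypothetical container for the three observability keys."""

    # Key names come from the PR description; the default values are assumed.
    write_extract_artifacts: bool = False
    write_chunk_manifest: bool = False
    observability_archive_dir: Optional[str] = None

    def __post_init__(self) -> None:
        # Reject truthy strings such as "yes" so a YAML typo fails loudly.
        for name in ("write_extract_artifacts", "write_chunk_manifest"):
            if not isinstance(getattr(self, name), bool):
                raise TypeError(f"{name} must be a boolean")
        archive = self.observability_archive_dir
        if archive is not None and (not isinstance(archive, str) or not archive.strip()):
            raise ValueError("observability_archive_dir must be null or a non-empty path")
```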

This is intentionally minimal wiring on top of existing behavior; no LanceDB schema migration or broader runtime changes are introduced.

How to run and inspect artifacts

source .retriever/bin/activate
# standard run (uses harness config defaults)
retriever harness run --dataset jp20 --preset single_gpu
# optional: force observability via overrides
retriever harness run --dataset jp20 --preset single_gpu \
  --override write_extract_artifacts=true \
  --override write_chunk_manifest=true \
  --override observability_archive_dir=/path/to/archive

Artifacts are written under:

nemo_retriever/artifacts/<run_name>_<timestamp>/

Check paths in results.json:

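import json

# Replace <run_dir> with the actual run directory name.
path = "nemo_retriever/artifacts/<run_dir>/results.json"
with open(path, encoding="utf-8") as f:
    results = json.load(f)

# Print the recorded artifact locations.
for key in ("extract_artifacts_dir", "chunk_manifest_dir", "ingest_errors_file"):
    print(results["artifacts"][key])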

Then inspect shards:

ls <extract_artifacts_dir>/*.jsonl | head
ls <chunk_manifest_dir>/*.jsonl | head
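
Beyond listing the files, a quick way to sanity-check the shards is to count records and see which top-level keys appear. The sketch below assumes only that each shard is a JSONL file (one JSON value per line); the function name `summarize_shards` is hypothetical, and no particular record schema is assumed.

```python
import json
from collections import Counter
from pathlib import Path


def summarize_shards(shard_dir: str) -> dict:
    """Count records per shard and tally the top-level keys seen across all records."""
    per_shard = {}
    keys = Counter()
    for shard in sorted(Path(shard_dir).glob("*.jsonl")):
        count = 0
        with shard.open(encoding="utf-8") as fh:
            for line in fh:
                if not line.strip():
                    continue  # tolerate blank lines
                record = json.loads(line)
                count += 1
                if isinstance(record, dict):
                    keys.update(record.keys())
        per_shard[shard.name] = count
    return {"records_per_shard": per_shard, "key_counts": dict(keys)}
```

Calling `summarize_shards("<extract_artifacts_dir>")` with the path printed from results.json returns a dictionary that can be printed or compared across runs.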

Checklist

I am familiar with the Contributing Guidelines.
New or existing tests cover these changes.
The documentation is up to date with these changes.
If adjusting docker-compose.yaml environment variables, have you ensured those are mimicked in the Helm values.yaml file?

- wire extract/chunk dump paths through harness command and results
- add lightweight JSONL snapshot helper with optional durable mirror
- expand focused harness and batch tests for observability contract

Signed-off-by: Jacob Ioffe <jioffe@nvidia.com>

jioffe502 requested a review from a team as a code owner March 11, 2026 21:30

jioffe502 requested a review from drobison00 March 11, 2026 21:30

jioffe502 wants to merge 1 commit into NVIDIA:main from jioffe502:jioffe/harness-observability-rebuild
