
feat(rerank): add cross-encoder reranking recipe#139

Draft
oliverholworthy wants to merge 20 commits into main from oholworthy/rerank-recipe-v1

Conversation


@oliverholworthy oliverholworthy commented Apr 10, 2026

Summary

  • Add a `nemotron rerank` recipe for fine-tuning cross-encoder reranking models
  • 4-stage pipeline: finetune → eval → export → deploy
  • Consumes training data from embed prep stage directly (same {query, pos_doc[], neg_doc[]} format via nemo-automodel's model_type: cross_encoder)
  • Uses TrainCrossEncoderRecipe + NeMoAutoModelCrossEncoder + CrossEncoderCollator from nemo-automodel
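
For reference, a minimal sketch of the shared training-record format (the field values here are made up; real records come from the embed prep stage):

```python
import json

# Hypothetical training record in the shared {query, pos_doc[], neg_doc[]}
# format; the same shape feeds both biencoder and cross-encoder training.
record = {
    "query": "what is the capital of australia",
    "pos_doc": [
        "Canberra is the capital city of Australia.",
    ],
    "neg_doc": [
        "Sydney is the most populous city in Australia.",
        "Australia is a country in the Southern Hemisphere.",
    ],
}

# Serialized as one JSON object per line (JSONL).
line = json.dumps(record)
assert sorted(json.loads(line)) == ["neg_doc", "pos_doc", "query"]
```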

Stages

| Stage | Command | Description |
| --- | --- | --- |
| finetune | `nemotron rerank finetune` | Fine-tune cross-encoder with classification loss |
| eval | `nemotron rerank eval` | First-stage retrieval + cross-encoder re-ranking evaluation (BEIR) |
| export | `nemotron rerank export` | Export to ONNX/TensorRT via nemo-export reranker adapter |
| deploy | `nemotron rerank deploy` | Deploy NIM reranker container |

Key design decisions

  • No separate SDG or data prep stage — reuses embed prep output
  • Pins nemo-automodel at commit 3a3f6858 (includes cross-encoder support)
  • Default base model: nvidia/llama-nemotron-rerank-1b-v2
  • NIM image: nvcr.io/nim/nvidia/llama-nemotron-rerank-1b-v2:1.10.0
  • Monkeypatches create_bidirectional_mask during ONNX export to work around transformers masking_utils tracing incompatibility
  • Eval uses SentenceTransformer for first-stage retrieval (trust_remote_code support)
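
The monkeypatching mentioned above follows the standard temporary-attribute-swap pattern. The sketch below is generic: the stand-in module and both mask functions are illustrative, not the actual `transformers.masking_utils` code.

```python
import contextlib
import types

@contextlib.contextmanager
def patched(module, name, replacement):
    """Temporarily replace `module.<name>`, restoring the original on exit."""
    original = getattr(module, name)
    setattr(module, name, replacement)
    try:
        yield
    finally:
        setattr(module, name, original)

# Stand-in for transformers.masking_utils; illustrative only.
masking_utils = types.SimpleNamespace(
    create_bidirectional_mask=lambda *a, **k: "data-dependent mask"
)

with patched(masking_utils, "create_bidirectional_mask",
             lambda *a, **k: "trace-friendly mask"):
    # The ONNX export would run inside this block with the patched mask.
    assert masking_utils.create_bidirectional_mask() == "trace-friendly mask"

# Original behaviour is restored once the export finishes.
assert masking_utils.create_bidirectional_mask() == "data-dependent mask"
```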

Test plan

  • `nemotron rerank --help` shows all commands
  • `nemotron rerank finetune` trains successfully
  • `nemotron rerank eval` produces nDCG/Recall metrics comparing base vs. finetuned
  • `nemotron rerank export` produces an ONNX model with the correct logits output shape
  • `nemotron rerank deploy` launches the NIM container
  • `nemotron rerank run` executes the end-to-end pipeline
  • Remote execution works via `--run`

Add a new `nemotron rerank` recipe for fine-tuning cross-encoder
reranking models, following the same stage-based pattern as the
embed recipe.

Stages:
- finetune: Fine-tune using TrainCrossEncoderRecipe from nemo-automodel
- eval: Evaluate reranking quality via BEIR (first-stage retrieval + re-rank)
- export: Export to ONNX/TensorRT using nemo-export reranker adapter
- deploy: Deploy NIM reranker container

The recipe consumes training data from the embed prep stage directly —
the same {query, pos_doc[], neg_doc[]} format works for both biencoder
and cross-encoder training via nemo-automodel's model_type parameter.

Signed-off-by: Oliver Holworthy <1216955+oliverholworthy@users.noreply.github.com>
- Monkeypatch create_bidirectional_mask during ONNX export to avoid
  transformers masking_utils tracing incompatibility
- Cap lr_warmup_steps to total_steps-1 for small datasets
- Add eval_nim code path with NIM reranker /v1/ranking API support
- Use SentenceTransformer for first-stage retrieval (trust_remote_code)
- Pass trust_remote_code=True to CrossEncoder in eval
- Rename model from llama-3.2-nv-rerankqa-1b-v2 to
  llama-nemotron-rerank-1b-v2 across all configs and code
- Pin NIM image to nvcr.io/nim/nvidia/llama-nemotron-rerank-1b-v2:1.10.0
- Pin onnx<1.20 and add ml_dtypes compat, add UV environments constraint
- Set trust_remote_code in crossencoder_base.yaml for model and tokenizer
- Pin transformers>=5.3.0,<5.4.0 for nemo-automodel compat in finetune

Signed-off-by: Oliver Holworthy <1216955+oliverholworthy@users.noreply.github.com>
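
The `lr_warmup_steps` cap above can be sketched as follows (the helper name is hypothetical, not the recipe's actual code):

```python
def capped_warmup_steps(lr_warmup_steps: int, total_steps: int) -> int:
    # Cap warmup so at least one non-warmup training step remains;
    # guards small datasets where total_steps can be tiny.
    return min(lr_warmup_steps, max(total_steps - 1, 0))

print(capped_warmup_steps(100, 10))  # → 9
print(capped_warmup_steps(5, 100))  # → 5
```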
@oliverholworthy oliverholworthy self-assigned this Apr 10, 2026
Add sdg and prep subcommands to the rerank recipe that delegate to the
embed implementations. This lets users run the full pipeline end-to-end
with `nemotron rerank run` without needing to know about the embed recipe.

The pipeline now runs: sdg → prep → finetune → eval (→ export → deploy).

Signed-off-by: Oliver Holworthy <1216955+oliverholworthy@users.noreply.github.com>
- Add sdg and prep subcommands delegating to embed implementations,
  so users can run the full pipeline with `nemotron rerank run`
- Pipeline now runs: sdg → prep → finetune → eval (→ export → deploy)
- Update nemo-automodel to 897ebedf
- Use transformers 5.5.x via uv override-dependencies

Signed-off-by: Oliver Holworthy <1216955+oliverholworthy@users.noreply.github.com>
- Renumber rerank stages to match 6-stage layout:
  stage0_sdg, stage1_prep, stage2_finetune, stage3_eval,
  stage4_export, stage5_deploy
- Add rerank-specific config for sdg and prep stages that write to
  output/rerank/ instead of output/embed/
- Create proper CLI commands for sdg/prep that use embed scripts with
  rerank config directories
- Fix retriever-sdg deduplication to use data-designer 0.5.3+ API
  (generate_text_embeddings instead of removed _router.embedding)
- Bump data-designer minimum to >=0.5.3
- Update all output paths to use new stage numbering

Signed-off-by: Oliver Holworthy <1216955+oliverholworthy@users.noreply.github.com>
Add NIM vs finetuned metrics comparison in eval stage output,
with accuracy threshold checks matching embed recipe pattern.

Signed-off-by: Oliver Holworthy <1216955+oliverholworthy@users.noreply.github.com>
@oliverholworthy oliverholworthy force-pushed the oholworthy/rerank-recipe-v1 branch from e0137e4 to bc23241 on April 13, 2026 at 11:36
oliverholworthy and others added 13 commits April 13, 2026 16:29
Local evaluation was feeding raw (query, passage) pairs to the
cross-encoder without the "question:{query} \n \n passage:{passage}"
template used during training and by NIM internally. This caused local
scores to underreport by ~10% NDCG, making it impossible to compare
fine-tuned checkpoints against NIM baselines.

Replace sentence_transformers CrossEncoder with direct
AutoModelForSequenceClassification + AutoTokenizer so we control
input formatting and apply the prompt template consistently.

Signed-off-by: Oliver Holworthy <1216955+oliverholworthy@users.noreply.github.com>
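
A minimal sketch of applying that template to each (query, passage) pair before tokenization (the helper name is made up; the actual eval code feeds these strings through `AutoTokenizer` into `AutoModelForSequenceClassification`):

```python
def format_pair(query: str, passage: str) -> str:
    # Template from the commit message, used during training and by NIM
    # internally; the exact spacing and newlines matter for score parity.
    return f"question:{query} \n \n passage:{passage}"

text = format_pair("what is BEIR", "BEIR is a retrieval benchmark.")
print(text)
```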
Use torch.distributed.run with --nproc_per_node=gpu so training
automatically uses all available GPUs (works correctly with 1 GPU too).

Signed-off-by: Oliver Holworthy <1216955+oliverholworthy@users.noreply.github.com>
Detect whether the input file is a JSON array or JSONL (one object per
line) by peeking at the first character, so both formats are handled.

Signed-off-by: Oliver Holworthy <1216955+oliverholworthy@users.noreply.github.com>
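
The detection logic can be sketched like this (the function name is hypothetical):

```python
import json
import tempfile

def load_records(path):
    # Peek at the first non-whitespace character: '[' means a JSON array,
    # anything else is treated as JSONL (one object per line).
    with open(path, encoding="utf-8") as f:
        ch = f.read(1)
        while ch and ch.isspace():
            ch = f.read(1)
        f.seek(0)
        if ch == "[":
            return json.load(f)
        return [json.loads(line) for line in f if line.strip()]

# Both formats yield the same records.
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
    f.write('[{"query": "a"}, {"query": "b"}]')
    array_path = f.name
with tempfile.NamedTemporaryFile("w", suffix=".jsonl", delete=False) as f:
    f.write('{"query": "a"}\n{"query": "b"}\n')
    jsonl_path = f.name

assert load_records(array_path) == load_records(jsonl_path)
```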
Add retrieval_batch_size (default 32) for first-stage retrieval encoding,
keeping batch_size (128) for reranker scoring. The embedding model needs
a smaller batch size due to longer sequence processing.

Signed-off-by: Oliver Holworthy <1216955+oliverholworthy@users.noreply.github.com>
Torch was being locked to cu130 wheels in uv.lock, causing GPU to not
be used on CUDA 12.x systems. Instead, exclude torch from dependency
resolution and supply it explicitly via `--with torch` in the CLI
commands, with UV_TORCH_BACKEND=auto to resolve the correct CUDA variant.

Applies to rerank (finetune, eval, export, prep) and embed (prep) stages.

Signed-off-by: Oliver Holworthy <1216955+oliverholworthy@users.noreply.github.com>
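
An illustrative sketch of how such an invocation might be assembled (the `uv run --with torch` command and `UV_TORCH_BACKEND=auto` come from the commit message; the wrapper code and `stage_script.py` name are hypothetical):

```python
import os

# Supply torch explicitly rather than locking it in uv.lock, letting uv
# resolve the CUDA variant that matches the local driver at run time.
env = dict(os.environ, UV_TORCH_BACKEND="auto")
cmd = ["uv", "run", "--with", "torch", "python", "stage_script.py"]

# subprocess.run(cmd, env=env, check=True)  # left commented: illustrative
print(" ".join(cmd))
```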
Use SentenceTransformer multi-process pool for first-stage retrieval
encoding and DataParallel for cross-encoder reranking when multiple
GPUs are available.

Signed-off-by: Oliver Holworthy <1216955+oliverholworthy@users.noreply.github.com>
Show scoring progress with batch count and size during evaluation.

Signed-off-by: Oliver Holworthy <1216955+oliverholworthy@users.noreply.github.com>
Reduce top_k from 100 to 10 and k_values from [1,5,10,100] to
[1,5,10]. This gives ~10x speedup on the cross-encoder scoring
step since fewer candidates are re-ranked per query.

Signed-off-by: Oliver Holworthy <1216955+oliverholworthy@users.noreply.github.com>
Load cross-encoder with torch_dtype=bfloat16 and padding_side="left"
to align with nemo-retriever-research evaluation defaults. Reduces
memory usage and matches the reference implementation's tokenization.

Signed-off-by: Oliver Holworthy <1216955+oliverholworthy@users.noreply.github.com>
sentence-transformers now handles multi-GPU automatically in encode().
The explicit multi-process pool and encode_multi_process calls are
deprecated and no longer needed.

Signed-off-by: Oliver Holworthy <1216955+oliverholworthy@users.noreply.github.com>
…ntainers

When torch is already importable (e.g., inside an NVIDIA container),
create a venv with --system-site-packages and exclude torch from UV
resolution. This avoids the CUDA version mismatch where UV's
torch-backend=auto detects the kernel driver CUDA version (via
nvidia-smi) but the container's libcuda.so is a different version.

When torch is NOT importable (bare machine), fall back to the existing
uv run --with torch approach with UV_TORCH_BACKEND=auto.

Consolidates duplicated _execute_uv_local logic from 10 CLI commands
into nemotron.kit.uv_local.execute_uv_local.

Signed-off-by: Oliver Holworthy <1216955+oliverholworthy@users.noreply.github.com>
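
The branch logic can be sketched as follows (the helper and flag lists are hypothetical simplifications of `nemotron.kit.uv_local.execute_uv_local`):

```python
import importlib.util

def torch_already_importable() -> bool:
    # Inside an NVIDIA container, torch ships with the system site-packages;
    # probing with find_spec avoids paying the full import cost.
    return importlib.util.find_spec("torch") is not None

if torch_already_importable():
    # Reuse the container's torch: build the venv with system site-packages
    # and keep torch out of UV's dependency resolution.
    flags = ["--system-site-packages"]
else:
    # Bare machine: fall back to supplying torch explicitly.
    flags = ["--with", "torch"]
```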
* add rerank recipe readme

Signed-off-by: Steve Han <sthan@nvidia.com>

* fix wrong repo url

Signed-off-by: Steve Han <sthan@nvidia.com>

---------

Signed-off-by: Steve Han <sthan@nvidia.com>