
Llama3 vLLM tests and benchmarking scripts #1501

Draft
gagank1 wants to merge 3 commits into main from gkaushik/llama3-vllm

Conversation

@gagank1
Collaborator

@gagank1 gagank1 commented Mar 9, 2026

Description

Adds golden value tests, inference scripts, and benchmark scripts to validate Llama3 inference with vLLM. Also adds hooks for CI.

Usage

python export_llama3.py
pytest -s -v tests/test_vllm.py

More usage details are in the README.

Type of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Refactor
  • Documentation update
  • Other (please describe):

CI Pipeline Configuration

Configure CI behavior by applying the relevant labels. By default, only basic unit tests are run.

  • ciflow:skip - Skip all CI tests for this PR
  • ciflow:notebooks - Run Jupyter notebooks execution tests for bionemo2
  • ciflow:slow - Run slow single GPU integration tests marked as @pytest.mark.slow for bionemo2
  • ciflow:all - Run all tests (unit tests, slow tests, and notebooks) for bionemo2. This label can be used to enforce running all bionemo2 tests.
  • ciflow:all-recipes - Run tests for all recipes (under bionemo-recipes). This label can be used to enforce running tests for all recipes.

Unit tests marked as @pytest.mark.multi_gpu or @pytest.mark.distributed are not run in the PR pipeline.

For more details, see CONTRIBUTING.

Note

By default, only basic unit tests are run. Add the appropriate labels to enable additional test coverage.

Authorizing CI Runs

We use copy-pr-bot to manage authorization of CI
runs on NVIDIA's compute resources.

  • If a pull request is opened by a trusted user and contains only trusted changes, the pull request's code will
    automatically be copied to a pull-request/ prefixed branch in the source repository (e.g. pull-request/123)
  • If a pull request is opened by an untrusted user or contains untrusted changes, an NVIDIA org member must leave an
    /ok to test comment on the pull request to trigger CI. This will need to be done for each new commit.

Triggering Code Rabbit AI Review

To trigger a code review from CodeRabbit, comment on the pull request with one of its review commands. See https://docs.coderabbit.ai/reference/review-commands for the full list of commands.

Pre-submit Checklist

  • I have tested these changes locally
  • I have updated the documentation accordingly
  • I have added/updated tests as needed
  • All existing tests pass successfully

Summary by CodeRabbit

Release Notes

  • New Features

    • Added Llama-3 vLLM inference support with model round-trip conversion between HuggingFace and Transformer Engine formats.
    • Introduced benchmarking tools to compare HuggingFace native vs. vLLM inference performance across batch sizes and prompt lengths.
    • Added Docker containerization with configurable vLLM installation.
  • Documentation

    • Added comprehensive README documenting the workflow, installation methods, and benchmarking commands.
  • Tests

    • Introduced test suite validating greedy text generation, log-probability similarity, and token prediction overlap between inference paths.
  • Chores

    • Updated CI workflows and pre-commit configuration to support new recipe structure.

gagank1 added 2 commits March 9, 2026 17:04
Signed-off-by: Gagan Kaushik <gkaushik@nvidia.com>
Signed-off-by: Gagan Kaushik <gkaushik@nvidia.com>
@gagank1 gagank1 self-assigned this Mar 9, 2026
@copy-pr-bot

copy-pr-bot bot commented Mar 9, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@coderabbitai
Contributor

coderabbitai bot commented Mar 9, 2026

📝 Walkthrough

This PR introduces a complete Llama-3 vLLM inference recipe, including containerization, conversion utilities between HuggingFace and Transformer Engine formats, benchmarking tools, example scripts, and comprehensive tests. Updates include workflow configuration and file synchronization checks.

Changes

Cohort / File(s) Summary
Workflow & Configuration
.github/workflows/unit-tests-recipes.yml, .pre-commit-config.yaml
Updated GitHub Actions workflow to include vLLM inference subdirectories in test collection. Added exclude patterns to pre-commit configuration for vLLM-specific Python files to skip linting/formatting.
Container & Installation
bionemo-recipes/recipes/vllm_inference/llama3/Dockerfile, bionemo-recipes/recipes/vllm_inference/llama3/install_vllm.sh, bionemo-recipes/recipes/vllm_inference/llama3/.ci_build.sh
Added Dockerfile using NVIDIA PyTorch base with optional VLLM installation via build-time arguments. Added shell scripts for automated vLLM installation with CUDA architecture detection and dependency management via pip and uv.
Documentation & Requirements
bionemo-recipes/recipes/vllm_inference/llama3/README.md, bionemo-recipes/recipes/vllm_inference/llama3/requirements.txt
Created README documenting the Llama-3 vLLM workflow, conversion steps, installation methods, and benchmarking instructions. Added requirements.txt specifying Python dependencies including transformers, transformer_engine, torch, and lm-eval.
Transformer Engine Model Implementation
bionemo-recipes/recipes/vllm_inference/llama3/modeling_llama_te.py
Implemented Transformer Engine-optimized Llama-3 model suite with NVLlamaConfig, NVLlamaPreTrainedModel, NVLlamaModel, NVLlamaForCausalLM, and auxiliary head classes. Added support for THD/BSHD input formats, rotary embedding handling, KV-caching, and HuggingFace generation compatibility via HFInferenceParams. Includes padding/unpacking utilities and automatic model registration via AUTO_MAP.
State Dict Transformations
bionemo-recipes/recipes/vllm_inference/llama3/state.py
Created comprehensive state-dict transformation framework with TransformCTX, StateDictTransform, and TransformFns utilities. Supports bidirectional parameter mapping, QKV/FC merging and splitting, multi-source transformations, dtype preservation, and validation. Main entry point apply_transforms handles renaming, per-key transforms, shape validation, and weight tying for HF ↔ TE conversions.
Format Conversion
bionemo-recipes/recipes/vllm_inference/llama3/convert.py
Added bidirectional conversion functions convert_llama_hf_to_te and convert_llama_te_to_hf using state transformation framework. Manages parameter name mappings, handles tied word embeddings, preserves rotary embedding frequencies, and applies QKV/FC reformatting between formats.
Checkpoint Export
bionemo-recipes/recipes/vllm_inference/llama3/export_llama3.py
Introduced checkpoint export workflow with create_te_checkpoint (loads HF Llama-3, converts to TE, patches config) and convert_te_to_vllm (converts TE to HF for vLLM serving). Provides CLI interface with optional TE checkpoint path and output directory specification.
Benchmarking Framework
bionemo-recipes/recipes/vllm_inference/llama3/benchmark_common.py, bionemo-recipes/recipes/vllm_inference/llama3/benchmark_hf.py, bionemo-recipes/recipes/vllm_inference/llama3/benchmark_vllm.py
Added shared benchmarking utilities (BenchmarkConfig, BenchmarkResult, prompt generation, timing/metrics computation, CSV export) via benchmark_common.py. Created grid-sweep benchmarks for HuggingFace native and vLLM inference measuring TTFT, E2E latency, and throughput across batch sizes, prompt lengths, and output lengths.
Example Scripts
bionemo-recipes/recipes/vllm_inference/llama3/llama3_transformers_inference_example.py, bionemo-recipes/recipes/vllm_inference/llama3/llama3_vllm_inference_example.py
Added smoke-test example scripts for HF native inference and vLLM inference demonstrating model loading, tokenization, generation, and output decoding for quick validation.
Tests & Test Configuration
bionemo-recipes/recipes/vllm_inference/llama3/tests/conftest.py, bionemo-recipes/recipes/vllm_inference/llama3/tests/test_vllm.py
Added pytest configuration with parent directory path setup. Implemented comprehensive vLLM validation tests including greedy-text matching against HF reference, per-token log-probability similarity comparison, and top-K token overlap validation across generation steps with automatic skip when vLLM unavailable.
CI Script Updates
ci/scripts/check_copied_files.py
Extended SOURCE_TO_DESTINATION_MAP to include mappings for vLLM inference recipe files (state.py, modeling_llama_te.py, convert.py, requirements.txt) sourced from bionemo-recipes/models/llama3 to maintain synchronization checks across the codebase.

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant Checkpoint as Checkpoint<br/>(HF)
    participant ConvertHF2TE as ConvertHF to TE
    participant TEModel as TE Model<br/>(checkpoint)
    participant ConvertTE2HF as ConvertTE to HF
    participant vLLMModel as vLLM Model<br/>(HF format)
    participant Inference as Inference<br/>Engine

    User->>ConvertHF2TE: load HF Llama-3
    ConvertHF2TE->>ConvertHF2TE: create TE config
    ConvertHF2TE->>ConvertHF2TE: initialize TE model on meta device
    ConvertHF2TE->>ConvertHF2TE: apply state transforms (QKV merge, FC1)
    ConvertHF2TE->>TEModel: save TE checkpoint
    
    User->>ConvertTE2HF: load TE checkpoint
    ConvertTE2HF->>ConvertTE2HF: extract TE config
    ConvertTE2HF->>ConvertTE2HF: create HF model on meta device
    ConvertTE2HF->>ConvertTE2HF: apply reverse transforms (QKV split, FC1)
    ConvertTE2HF->>ConvertTE2HF: copy rotary embeddings
    ConvertTE2HF->>vLLMModel: save HF checkpoint
    
    User->>Inference: start vLLM with HF checkpoint
    Inference->>vLLMModel: load weights
    Inference->>User: ready for inference
sequenceDiagram
    participant User
    participant BenchConfig as Config<br/>(batch, prompt_len, output_len)
    participant ModelLoader as Model Loader
    participant Tokenizer as Tokenizer
    participant PromptGen as Prompt Generation
    participant Warmup as Warmup Phase
    participant Measurement as Timing Measurement
    participant Metrics as Metrics Computation
    participant Output as Results Output

    User->>BenchConfig: specify sweep parameters
    BenchConfig->>ModelLoader: load model & tokenizer
    ModelLoader->>Tokenizer: initialize tokenizer
    
    loop over (batch_size, prompt_len, output_len)
        BenchConfig->>PromptGen: build_prompts(batch_size, prompt_len)
        PromptGen->>Warmup: send warmup batch
        Warmup->>ModelLoader: forward pass (discard)
        
        Warmup->>Measurement: measure TTFT & E2E
        Measurement->>ModelLoader: generate with timing hooks
        ModelLoader->>Measurement: return latencies
        
        Measurement->>Metrics: compute_metrics(e2e, ttft, batch, output_len)
        Metrics->>Output: accumulate result
    end
    
    Output->>Output: print_results(table)
    Output->>User: write_csv(results, path)

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

Poem

🐰 A vLLM feast for Llama so fine,
State dicts dance through transforms divine,
HF to TE, then back once more,
Benchmarks blazing through each core,
Inference blooms in formats bright!

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
  • Title check (✅ Passed): The title clearly and specifically describes the main changes: adding tests and benchmarking scripts for Llama3 vLLM inference.
  • Description check (✅ Passed): The PR description follows the template with all required sections completed: detailed description, usage examples, type of changes marked, and pre-submit checklist items checked.
  • Docstring Coverage (✅ Passed): Docstring coverage is 81.03%, which is sufficient; the required threshold is 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.

Signed-off-by: Gagan Kaushik <gkaushik@nvidia.com>
@gagank1 gagank1 marked this pull request as ready for review March 9, 2026 21:41
@gagank1
Collaborator Author

gagank1 commented Mar 9, 2026

/ok to test 6a3905c

@gagank1
Collaborator Author

gagank1 commented Mar 9, 2026

@coderabbitai review

@coderabbitai
Contributor

coderabbitai bot commented Mar 9, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 12

🧹 Nitpick comments (7)
bionemo-recipes/recipes/vllm_inference/llama3/llama3_transformers_inference_example.py (1)

22-36: Consider wrapping execution in an if __name__ == "__main__": guard.

The script executes GPU operations at import time. While acceptable for a smoke-test script, adding a main guard prevents accidental execution when imported.

♻️ Proposed refactor
-CHECKPOINT = "./llama3_hf_roundtrip_checkpoint"
-
-model = AutoModelForCausalLM.from_pretrained(CHECKPOINT, torch_dtype=torch.bfloat16)
-model.to("cuda")
-
-tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
-tokenizer.pad_token = tokenizer.eos_token
-
-inputs = tokenizer("The quick brown fox", return_tensors="pt")
-inputs = {k: v.to("cuda") for k, v in inputs.items()}
-
-with torch.no_grad():
-    output_ids = model.generate(**inputs, max_new_tokens=16, use_cache=False)
-
-print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
+CHECKPOINT = "./llama3_hf_roundtrip_checkpoint"
+
+
+def main():
+    model = AutoModelForCausalLM.from_pretrained(CHECKPOINT, torch_dtype=torch.bfloat16)
+    model.to("cuda")
+
+    tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
+    tokenizer.pad_token = tokenizer.eos_token
+
+    inputs = tokenizer("The quick brown fox", return_tensors="pt")
+    inputs = {k: v.to("cuda") for k, v in inputs.items()}
+
+    with torch.no_grad():
+        output_ids = model.generate(**inputs, max_new_tokens=16, use_cache=False)
+
+    print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
+
+
+if __name__ == "__main__":
+    main()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@bionemo-recipes/recipes/vllm_inference/llama3/llama3_transformers_inference_example.py`
around lines 22 - 36, The script currently runs GPU/model/tokenizer setup and
generation at import time (CHECKPOINT, AutoModelForCausalLM.from_pretrained,
model.to, tokenizer, inputs creation and model.generate); wrap this execution in
a main guard by moving the procedural code into a main() function (or an inline
block) and surrounding the call with if __name__ == "__main__": so importing the
module doesn't trigger GPU operations automatically.
bionemo-recipes/recipes/vllm_inference/llama3/requirements.txt (1)

1-5: Consider pinning dependency versions for reproducibility.

All dependencies except torchao are unpinned. This can lead to inconsistent CI results if upstream packages release breaking changes. Consider adding version constraints (e.g., transformers>=4.40,<5.0) for critical dependencies.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bionemo-recipes/recipes/vllm_inference/llama3/requirements.txt` around lines
1 - 5, The requirements file currently has unpinned packages (lm-eval, torch,
transformer_engine[pytorch], transformers) which can cause non-reproducible
builds; update requirements.txt to add explicit version constraints for these
entries (for example use ranges like transformers>=4.40,<5.0 and
semantically-safe ranges for torch, lm-eval and transformer_engine[pytorch])
while keeping the existing torchao!=0.14.0 constraint—ensure each package line
in requirements.txt (lm-eval, torch, transformer_engine[pytorch], transformers)
is replaced with a pinned or range-pinned spec to enforce reproducible CI
installs.
bionemo-recipes/recipes/vllm_inference/llama3/Dockerfile (1)

1-24: Consider adding a non-root user for improved security posture.

The container runs as root by default. While acceptable for CI/development, adding a non-root user would improve security for production deployments.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bionemo-recipes/recipes/vllm_inference/llama3/Dockerfile` around lines 1 -
24, Add a non-root user and switch to it after installing system-level deps:
create a user/group (e.g., bionemo) and home directory, chown /workspace and
/workspace/bionemo (and any pip cache or build dirs used during RUN steps) to
that user, and add a final USER bionemo directive so subsequent operations run
unprivileged; ensure the vllm build steps that require root (the RUN block
guarded by ARG INSTALL_VLLM) still run as root by keeping them before the USER
switch and preserve required environment variables (INSTALL_VLLM,
TORCH_CUDA_ARCH_LIST, MAX_JOBS) and WORKDIR adjustments so folders used by
functions like the pip installs and git clone (vllm) are owned by the new user
after chown.
bionemo-recipes/recipes/vllm_inference/llama3/.ci_build.sh (1)

1-3: Add error handling for safer CI execution.

The script lacks set -e or similar error handling. If pip install fails, the script will continue to ./install_vllm.sh, potentially masking the failure.

Note: The PIP_CONSTRAINT= syntax (shellcheck SC1007) is actually correct here—it sets PIP_CONSTRAINT to an empty string for the pip command only.

♻️ Proposed fix
-#!/bin/bash -x
-PIP_CONSTRAINT= pip install -r requirements.txt
+#!/bin/bash -xe
+PIP_CONSTRAINT='' pip install -r requirements.txt
 ./install_vllm.sh
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bionemo-recipes/recipes/vllm_inference/llama3/.ci_build.sh` around lines 1 -
3, The CI script .ci_build.sh currently lacks fail-fast error handling so
failures in the pip install step (PIP_CONSTRAINT= pip install -r
requirements.txt) won’t stop execution and may mask errors before
./install_vllm.sh runs; add a fail-fast header (e.g. set -euo pipefail or at
least set -e) immediately after the shebang in .ci_build.sh to ensure the script
exits on errors, preserve the existing PIP_CONSTRAINT= pip install -r
requirements.txt syntax, and keep the subsequent ./install_vllm.sh invocation
unchanged.
bionemo-recipes/recipes/vllm_inference/llama3/export_llama3.py (1)

82-101: convert_te_to_vllm() still returns a partially assembled checkpoint.

This helper stops before tokenizer files are written, so every caller has to remember a second save_pretrained() step. That leakage is already duplicated here and in tests/test_vllm.py.

♻️ Suggested API cleanup
-def convert_te_to_vllm(te_checkpoint: Path, output_dir: Path) -> Path:
+def convert_te_to_vllm(
+    te_checkpoint: Path,
+    output_dir: Path,
+    tokenizer_source: str | Path | None = None,
+) -> Path:
     """Convert a TE checkpoint to standard HF format for vLLM serving."""
     print(f"  Loading TE checkpoint: {te_checkpoint}")
     model_te = NVLlamaForCausalLM.from_pretrained(te_checkpoint)
@@
     output_dir.mkdir(parents=True, exist_ok=True)
     model_hf.save_pretrained(output_dir)
     del model_hf
+
+    if tokenizer_source is not None:
+        tokenizer = AutoTokenizer.from_pretrained(tokenizer_source)
+        tokenizer.save_pretrained(output_dir)
 
     print(f"  vLLM-ready checkpoint saved to {output_dir}")
     return output_dir
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bionemo-recipes/recipes/vllm_inference/llama3/export_llama3.py` around lines
82 - 101, convert_te_to_vllm currently returns a checkpoint missing tokenizer
files so callers must run a second save; fix by ensuring tokenizer artifacts are
copied or saved into output_dir before returning: load the tokenizer from
te_checkpoint (e.g., AutoTokenizer.from_pretrained(te_checkpoint) or equivalent)
and call tokenizer.save_pretrained(output_dir), or if original tokenizer files
exist under te_checkpoint simply copy them into output_dir, then return
output_dir; update convert_te_to_vllm to perform this step (references:
convert_te_to_vllm, model_hf, te_checkpoint, output_dir).
bionemo-recipes/recipes/vllm_inference/llama3/benchmark_hf.py (1)

26-39: Regroup the imports to match isort.

benchmark_common is a local module, so it should live in its own first-party block instead of between third-party imports.

♻️ Suggested import order
 import argparse
 import itertools
 
 import torch
+from transformers import AutoModelForCausalLM, AutoTokenizer
+
 from benchmark_common import (
     add_common_args,
     build_prompts,
     compute_metrics,
     median_timing,
     parse_config,
     print_results,
     write_csv,
 )
-from transformers import AutoModelForCausalLM, AutoTokenizer

As per coding guidelines, **/*.py: Follow import sorting configuration as per isort with 2 lines after imports configuration.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bionemo-recipes/recipes/vllm_inference/llama3/benchmark_hf.py` around lines
26 - 39, The import block places the local module benchmark_common between
third-party imports; move the "from benchmark_common import (...)" line into the
first-party import block (i.e., group it after the third-party imports like
torch and transformers) so imports follow stdlib, third-party, first-party order
and obey the project's isort spacing rules; keep the same imported symbols
(add_common_args, build_prompts, compute_metrics, median_timing, parse_config,
print_results, write_csv) and run isort/formatting to verify spacing.
bionemo-recipes/recipes/vllm_inference/llama3/state.py (1)

279-285: Don't mutate NumPy's global print options from the transform hot path.

np.set_printoptions() is process-wide, so every transform call changes formatting for unrelated logs and tests. If the goal is shorter debug output, format the specific arrays at log time instead of touching global state here.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bionemo-recipes/recipes/vllm_inference/llama3/state.py` around lines 279 -
285, In __call__ of the transformer (state.py) remove the process-wide
np.set_printoptions(threshold=10) call because it mutates global NumPy state on
every transform; instead, when you need compact debug output, format the
specific arrays at log time (e.g., use numpy.array2string or an ephemeral
formatting helper) before logging or return, leaving the transform method and
__call__ behavior unchanged otherwise; update references in __call__ and any
logging sites that relied on the global printoptions.
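The log-time alternative to the global np.set_printoptions() call can be sketched as follows (a minimal sketch; `arr` stands in for whatever array is being logged):

```python
import numpy as np

arr = np.arange(100)  # stand-in for an array being logged

# Format only this array compactly; the process-wide print options
# (used by unrelated logs and tests) are left untouched.
compact = np.array2string(arr, threshold=10)
print(compact)
```

np.array2string accepts the same summarization knobs as np.set_printoptions but scopes them to a single call, which is the behavior the comment asks for.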
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@bionemo-recipes/recipes/vllm_inference/llama3/benchmark_common.py`:
- Around line 189-199: In parse_config, validate that repeats is > 0 and that
every value in batch_sizes, prompt_lens, and output_lens parses to an integer >
0 before constructing the BenchmarkConfig; if any check fails, raise a clear
exception (e.g., ValueError) with a descriptive message so the CLI fails fast.
Locate parse_config and the use of args.repeats, args.batch_sizes,
args.prompt_lens, args.output_lens and perform the checks on the parsed lists
(after splitting) and the repeats value, returning the BenchmarkConfig only if
all validations pass.
- Around line 74-93: The helper currently decodes token_ids then lets vLLM
re-tokenize prompt_str, which can diverge; instead build prompt_str from the
exact token ids so vLLM and HF see identical tokens: after producing token_ids
(token_ids variable in build_prompts), convert those ids to tokens with
tokenizer.convert_ids_to_tokens(token_ids) and then turn that token sequence
into a string using tokenizer.convert_tokens_to_string(...) (or an equivalent
join that preserves token boundaries), then replicate that prompt_str into
prompt_strings and keep input_ids built from token_ids; also ensure
encode/decode calls continue to use add_special_tokens=False so special-token
handling stays consistent.
- Around line 96-116: compute_metrics currently uses the requested output_len
rather than the actual number of tokens produced, which inflates TPOT/throughput
when generation ends early; modify compute_metrics (and its callers) to accept
the actual output token count (e.g., actual_output_len or
output_tokens_per_example) instead of the configured output_len, compute
total_output_tokens = batch_size * actual_output_len, and compute tpot_ms using
max(actual_output_len - 1, 1) with safe zero-division guards; update the
benchmark scripts that call _generate / model.generate / engine.generate to
extract the real generated token counts from the generate return values and pass
those counts into compute_metrics, keeping BenchmarkResult fields consistent
(batch_size, prompt_len, output_len -> actual output length, e2e_ms, ttft_ms,
tpot_ms, throughput_tok_s).
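The fail-fast validation described in the first benchmark_common comment could be sketched as below (helper name and message wording are hypothetical, not taken from the recipe):

```python
def parse_int_list(raw: str, name: str) -> list:
    """Parse a comma-separated CLI value into positive ints, failing fast."""
    try:
        values = [int(v) for v in raw.split(",")]
    except ValueError as err:
        raise ValueError(f"--{name} must be comma-separated integers, got {raw!r}") from err
    if any(v <= 0 for v in values):
        raise ValueError(f"--{name} values must be > 0, got {values!r}")
    return values
```

parse_config would call this for batch_sizes, prompt_lens, and output_lens (and check repeats > 0) before constructing the BenchmarkConfig.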

In `@bionemo-recipes/recipes/vllm_inference/llama3/benchmark_vllm.py`:
- Around line 28-43: DEFAULT_MODEL currently points to
"./llama3_hf_roundtrip_checkpoint" which resolves from the process CWD; change
it to resolve relative to the script file so the default uses the recipe
directory. Import pathlib and set DEFAULT_MODEL = str(Path(__file__).parent /
"llama3_hf_roundtrip_checkpoint"), updating any code that expects a string path;
reference DEFAULT_MODEL and __file__ to locate where to change.
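The __file__-relative default the comment asks for, as a minimal sketch (directory name taken from the review comment):

```python
from pathlib import Path

# Resolve the default checkpoint relative to this script rather than
# the process CWD, so the default works from any working directory.
DEFAULT_MODEL = str(Path(__file__).parent / "llama3_hf_roundtrip_checkpoint")
```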

In `@bionemo-recipes/recipes/vllm_inference/llama3/Dockerfile`:
- Around line 11-20: ARG UV_BREAK_SYSTEM_PACKAGES is defined but not exported,
so the subsequent invocations of the wrapper command "uv pip" won’t see it as an
environment variable; update the Dockerfile so the build-time ARG is passed to
the environment before running the vllm install block (for example, set ENV
UV_BREAK_SYSTEM_PACKAGES=$UV_BREAK_SYSTEM_PACKAGES or export it inline before
the "uv pip" commands) so that the "uv pip install -r requirements/build.txt
--system" and "uv pip install --no-build-isolation -e . --system" invocations
receive the flag; ensure this change occurs inside the same conditional that
checks INSTALL_VLLM and before any use of "uv".

In `@bionemo-recipes/recipes/vllm_inference/llama3/modeling_llama_te.py`:
- Around line 206-235: The forward() path assumes input_ids is always present
when computing batch_size and padded_seq_len which crashes for
inputs_embeds-only calls; update the logic in modeling_llama_te.forward to
derive batch_size and padded_seq_len from whichever input is populated (use
inputs_embeds.size(0)/size(1) when input_ids is None, falling back to
input_ids), perform that determination before the THD squeeze branch that
modifies hidden_states, and likewise compute lengths using the available tensor
rather than input_ids.shape when computing past_key_values lengths; adjust
references to input_ids.size(1) and input_ids.shape[0] to use the selected
source so inputs_embeds-only calls no longer fail.
- Around line 237-241: The warning condition is inverted: change the check in
the block that computes te_rope_emb (within the torch.autocast context) to warn
when the rotary embedding is NOT float32. Locate the computation of te_rope_emb
via self.rotary_emb(max_seq_len=self.config.max_position_embeddings) and replace
the current if te_rope_emb.dtype == torch.float32 warning with an if
te_rope_emb.dtype != torch.float32 warning so the warning fires on the
incorrect/mismatched dtype rather than the correct hot path.
- Around line 63-80: The method init_empty_weights unconditionally references
self.model but NVLlamaModel stores embed_tokens and rotary_emb on self directly,
causing AttributeError; change the code to pick the real model holder (e.g.,
target = getattr(self, "model", self)) and then operate on target.embed_tokens
and target.rotary_emb (call target.embed_tokens.to_empty(device="cuda");
target.embed_tokens.apply(self._init_weights); set target.rotary_emb.inv_freq =
LlamaRotaryEmbedding(config=target.config).inv_freq.to("cuda")), then call
self.tie_weights() as before so weight tying works for both the original wrapper
and NVLlamaModel.
- Around line 263-275: The returned BaseModelOutputWithPast mixes THD and BSHD
layouts when should_pack_inputs is true because only the final hidden_states is
run through _pad_input; fix by converting every entry in all_hidden_states back
to BSHD before returning (use the same _pad_input(indices, batch_size,
padded_seq_len) call on each element), e.g. map/transform the tuple
all_hidden_states to a new tuple of padded tensors when should_pack_inputs and
output_hidden_states are true so last_hidden_state and all entries in
hidden_states share the same BSHD layout.
- Around line 391-416: Remove the process-wide
torch._dynamo.config.capture_scalar_outputs assignment from module scope (do not
set torch._dynamo.config.capture_scalar_outputs = True at import); instead
either (A) move that configuration to the benchmark/test/export entrypoint so it
runs only when those scripts execute, or (B) make the local functions robust to
the default config by refactoring _unpad_input/_pad_input to avoid depending on
capture_scalar_outputs (or wrap compilation calls in a local
torch._dynamo.config.patch/context that sets capture_scalar_outputs only for the
duration of compilation). Locate the module-level assignment and delete it, then
choose one of the two options: update the entrypoint(s) used for
benchmarking/tests to enable capture_scalar_outputs there, or change
_unpad_input/_pad_input so they don’t require the global toggle and, if
necessary, enable the config inside a narrow scoped patch around the call to
torch.compile.
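The input-source selection described in the first modeling_llama_te comment can be sketched as a small helper (the helper name is hypothetical; the real forward() would use these values in place of reading input_ids.shape directly):

```python
import torch


def infer_batch_and_seq(input_ids=None, inputs_embeds=None):
    """Derive (batch_size, padded_seq_len) from whichever input is populated."""
    source = input_ids if input_ids is not None else inputs_embeds
    if source is None:
        raise ValueError("either input_ids or inputs_embeds must be provided")
    # For input_ids this is (batch, seq); for inputs_embeds the trailing
    # hidden dimension is simply ignored.
    return source.size(0), source.size(1)
```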

In `@bionemo-recipes/recipes/vllm_inference/llama3/state.py`:
- Around line 574-584: The helper merge functions (e.g., merge_qkv_concat) still
reference Megatron-only fields num_query_groups and kv_channels; update them to
use the new KV-head fields used by TransformFns.merge_qkv: read
num_key_value_heads from ctx.target.config and compute head_size as
ctx.target.config.hidden_size // ctx.target.config.num_attention_heads, then use
those values in the qkv.split sizes (head_num * head_size, num_key_value_heads *
head_size, num_key_value_heads * head_size) and any other merge helpers in the
same area (the block around lines 623–667) so they match
TransformFns.merge_qkv's expectations and avoid AttributeError on NVLlamaConfig.
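The split-size computation the state.py comment describes can be sketched as follows, using the HF-style config fields instead of the Megatron-only ones (function name hypothetical):

```python
def qkv_split_sizes(hidden_size: int, num_attention_heads: int, num_key_value_heads: int):
    """Split sizes for a fused QKV tensor, derived from HF-style config fields.

    head_size comes from hidden_size // num_attention_heads rather than the
    Megatron-only kv_channels, and num_key_value_heads replaces num_query_groups.
    """
    head_size = hidden_size // num_attention_heads
    return (
        num_attention_heads * head_size,   # query
        num_key_value_heads * head_size,   # key
        num_key_value_heads * head_size,   # value
    )
```

For Llama-3-8B-style dimensions (hidden size 4096, 32 attention heads, 8 KV heads) this yields (4096, 1024, 1024).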

In `@bionemo-recipes/recipes/vllm_inference/llama3/tests/test_vllm.py`:
- Around line 34-41: The test module currently imports export_llama3 and
performs .to("cuda") in fixtures unconditionally, causing import/collection
failures when TransformerEngine or CUDA are missing; wrap or guard those imports
and GPU usage: detect TE availability (e.g., try/except import export_llama3 to
set _TE_AVAILABLE) and CUDA availability (torch.cuda.is_available() to set
_CUDA_AVAILABLE), only import or call
HF_MODEL_ID/convert_te_to_vllm/create_te_checkpoint when both _VLLM_AVAILABLE
and _TE_AVAILABLE and _CUDA_AVAILABLE are true, and modify the
hf_reference_outputs and vllm_checkpoint fixtures to skip or return None when
GPU/TE are unavailable (or use pytest.importorskip/pytest.skip at fixture start)
to prevent test collection errors.
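The collection-guard pattern described in the test_vllm comment can be sketched at module top as below (a minimal sketch of the skipif approach; the real module would additionally gate the export_llama3 import on these flags):

```python
import importlib.util

import pytest


# Detect prerequisites without importing them eagerly.
_TE_AVAILABLE = importlib.util.find_spec("transformer_engine") is not None

try:
    import torch
    _CUDA_AVAILABLE = torch.cuda.is_available()
except ImportError:
    _CUDA_AVAILABLE = False

# Skip the whole module when prerequisites are missing, so pytest
# collection never fails on machines without TE or a GPU.
pytestmark = pytest.mark.skipif(
    not (_TE_AVAILABLE and _CUDA_AVAILABLE),
    reason="requires TransformerEngine and a CUDA device",
)
```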

---

Nitpick comments:
In `@bionemo-recipes/recipes/vllm_inference/llama3/.ci_build.sh`:
- Around line 1-3: The CI script .ci_build.sh currently lacks fail-fast error
handling so failures in the pip install step (PIP_CONSTRAINT= pip install -r
requirements.txt) won’t stop execution and may mask errors before
./install_vllm.sh runs; add a fail-fast header (e.g. set -euo pipefail or at
least set -e) immediately after the shebang in .ci_build.sh to ensure the script
exits on errors, preserve the existing PIP_CONSTRAINT= pip install -r
requirements.txt syntax, and keep the subsequent ./install_vllm.sh invocation
unchanged.
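The fail-fast header is a one-line change; a sketch of the resulting script (the install steps are kept as comments here to avoid running them):

```shell
#!/bin/bash
# Fail-fast header: -e exits on the first failing command, -u treats unset
# variables as errors, and pipefail surfaces failures inside pipelines.
set -euo pipefail

# Existing steps stay byte-identical below the header:
#   PIP_CONSTRAINT= pip install -r requirements.txt
#   ./install_vllm.sh
echo "fail-fast enabled"
```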

In `@bionemo-recipes/recipes/vllm_inference/llama3/benchmark_hf.py`:
- Around line 26-39: The import block places the local module benchmark_common
between third-party imports; move the "from benchmark_common import (...)" line
into the first-party import block (i.e., group it after the third-party imports
like torch and transformers) so imports follow stdlib, third-party, first-party
order and obey the project's isort spacing rules; keep the same imported symbols
(add_common_args, build_prompts, compute_metrics, median_timing, parse_config,
print_results, write_csv) and run isort/formatting to verify spacing.

In `@bionemo-recipes/recipes/vllm_inference/llama3/Dockerfile`:
- Around line 1-24: Add a non-root user and switch to it after installing
system-level deps: create a user/group (e.g., bionemo) and home directory, chown
/workspace and /workspace/bionemo (and any pip cache or build dirs used during
RUN steps) to that user, and add a final USER bionemo directive so subsequent
operations run unprivileged; ensure the vllm build steps that require root (the
RUN block guarded by ARG INSTALL_VLLM) still run as root by keeping them before
the USER switch and preserve required environment variables (INSTALL_VLLM,
TORCH_CUDA_ARCH_LIST, MAX_JOBS) and WORKDIR adjustments so folders used by
functions like the pip installs and git clone (vllm) are owned by the new user
after chown.
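A sketch of the tail of the Dockerfile under those constraints (the user name `bionemo` and the `/workspace` paths are taken from this comment, not verified against the image):

```dockerfile
# Root-only steps (system deps, the INSTALL_VLLM build block) stay above this point.
RUN groupadd --system bionemo \
    && useradd --system --gid bionemo --create-home bionemo \
    && chown -R bionemo:bionemo /workspace
USER bionemo
WORKDIR /workspace/bionemo
```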

In `@bionemo-recipes/recipes/vllm_inference/llama3/export_llama3.py`:
- Around line 82-101: convert_te_to_vllm currently returns a checkpoint missing
tokenizer files so callers must run a second save; fix by ensuring tokenizer
artifacts are copied or saved into output_dir before returning: load the
tokenizer from te_checkpoint (e.g., AutoTokenizer.from_pretrained(te_checkpoint)
or equivalent) and call tokenizer.save_pretrained(output_dir), or if original
tokenizer files exist under te_checkpoint simply copy them into output_dir, then
return output_dir; update convert_te_to_vllm to perform this step (references:
convert_te_to_vllm, model_hf, te_checkpoint, output_dir).
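A minimal sketch of the copy-based option (the file names are the usual HF tokenizer artifacts and may differ per checkpoint; the alternative is simply `AutoTokenizer.from_pretrained(te_checkpoint).save_pretrained(output_dir)`):

```python
import shutil
from pathlib import Path

# Usual HF tokenizer artifacts; adjust to what the TE checkpoint actually contains.
TOKENIZER_FILES = ("tokenizer.json", "tokenizer_config.json", "special_tokens_map.json")


def copy_tokenizer_files(te_checkpoint, output_dir):
    """Copy tokenizer artifacts next to the converted weights so the returned
    checkpoint loads without a second save step."""
    src, dst = Path(te_checkpoint), Path(output_dir)
    dst.mkdir(parents=True, exist_ok=True)
    for name in TOKENIZER_FILES:
        if (src / name).is_file():
            shutil.copy2(src / name, dst / name)
    return dst
```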

In
`@bionemo-recipes/recipes/vllm_inference/llama3/llama3_transformers_inference_example.py`:
- Around line 22-36: The script currently runs GPU/model/tokenizer setup and
generation at import time (CHECKPOINT, AutoModelForCausalLM.from_pretrained,
model.to, tokenizer, inputs creation and model.generate); wrap this execution in
a main guard by moving the procedural code into a main() function (or an inline
block) and surrounding the call with if __name__ == "__main__": so importing the
module doesn't trigger GPU operations automatically.
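The shape of that refactor is standard; a sketch with the GPU-touching body reduced to comments:

```python
def main() -> None:
    """Entry point; model/tokenizer setup and generation live here so that
    importing the module never touches the GPU."""
    # Placeholder for the real body:
    #   model = AutoModelForCausalLM.from_pretrained(CHECKPOINT).to("cuda")
    #   inputs = tokenizer(...); model.generate(...)
    print("running llama3 inference example")


if __name__ == "__main__":
    main()
```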

In `@bionemo-recipes/recipes/vllm_inference/llama3/requirements.txt`:
- Around line 1-5: The requirements file currently has unpinned packages
(lm-eval, torch, transformer_engine[pytorch], transformers) which can cause
non-reproducible builds; update requirements.txt to add explicit version
constraints for these entries (for example use ranges like
transformers>=4.40,<5.0 and semantically-safe ranges for torch, lm-eval and
transformer_engine[pytorch]) while keeping the existing torchao!=0.14.0
constraint—ensure each package line in requirements.txt (lm-eval, torch,
transformer_engine[pytorch], transformers) is replaced with a pinned or
range-pinned spec to enforce reproducible CI installs.

In `@bionemo-recipes/recipes/vllm_inference/llama3/state.py`:
- Around line 279-285: In __call__ of the transformer (state.py) remove the
process-wide np.set_printoptions(threshold=10) call because it mutates global
NumPy state on every transform; instead, when you need compact debug output,
format the specific arrays at log time (e.g., use numpy.array2string or an
ephemeral formatting helper) before logging or return, leaving the transform
method and __call__ behavior unchanged otherwise; update references in __call__
and any logging sites that relied on the global printoptions.
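A sketch of the per-call formatting helper; unlike `np.set_printoptions`, nothing global changes:

```python
import numpy as np


def brief(arr, threshold: int = 10) -> str:
    """Compact one-off formatting for a log line; leaves NumPy's global
    print options untouched."""
    return np.array2string(arr, threshold=threshold)


summary = brief(np.arange(100))  # summarized with "..." instead of all 100 entries
```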

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 78172a79-8067-41b6-a36a-bdf36eaaa681

📥 Commits

Reviewing files that changed from the base of the PR and between 470e10d and 6a3905c.

📒 Files selected for processing (19)
  • .github/workflows/unit-tests-recipes.yml
  • .pre-commit-config.yaml
  • bionemo-recipes/recipes/vllm_inference/llama3/.ci_build.sh
  • bionemo-recipes/recipes/vllm_inference/llama3/Dockerfile
  • bionemo-recipes/recipes/vllm_inference/llama3/README.md
  • bionemo-recipes/recipes/vllm_inference/llama3/benchmark_common.py
  • bionemo-recipes/recipes/vllm_inference/llama3/benchmark_hf.py
  • bionemo-recipes/recipes/vllm_inference/llama3/benchmark_vllm.py
  • bionemo-recipes/recipes/vllm_inference/llama3/convert.py
  • bionemo-recipes/recipes/vllm_inference/llama3/export_llama3.py
  • bionemo-recipes/recipes/vllm_inference/llama3/install_vllm.sh
  • bionemo-recipes/recipes/vllm_inference/llama3/llama3_transformers_inference_example.py
  • bionemo-recipes/recipes/vllm_inference/llama3/llama3_vllm_inference_example.py
  • bionemo-recipes/recipes/vllm_inference/llama3/modeling_llama_te.py
  • bionemo-recipes/recipes/vllm_inference/llama3/requirements.txt
  • bionemo-recipes/recipes/vllm_inference/llama3/state.py
  • bionemo-recipes/recipes/vllm_inference/llama3/tests/conftest.py
  • bionemo-recipes/recipes/vllm_inference/llama3/tests/test_vllm.py
  • ci/scripts/check_copied_files.py

Comment on lines +74 to +93
def build_prompts(
tokenizer: PreTrainedTokenizerBase,
batch_size: int,
prompt_length: int,
) -> tuple[list[str], torch.Tensor]:
"""Generate deterministic synthetic prompts of exactly *prompt_length* tokens.

Returns:
A tuple of (prompt_strings, input_ids_tensor).
*prompt_strings* is a list[str] of length *batch_size* (for vLLM).
*input_ids_tensor* is a (batch_size, prompt_length) int64 tensor (for HF).
Both represent byte-identical inputs.
"""
repeated = STOCK_TEXT * ((prompt_length // 10) + 2)
token_ids = tokenizer.encode(repeated, add_special_tokens=False)[:prompt_length]
prompt_str = tokenizer.decode(token_ids)

prompt_strings = [prompt_str] * batch_size
input_ids = torch.tensor([token_ids] * batch_size, dtype=torch.long)
return prompt_strings, input_ids

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# First, find the related benchmark files
fd -t f "benchmark_" bionemo-recipes/recipes/vllm_inference/llama3/ 2>/dev/null | head -20

Repository: NVIDIA/bionemo-framework

Length of output: 259


🏁 Script executed:

# Search for usage of build_prompts function
rg "build_prompts" bionemo-recipes/recipes/vllm_inference/llama3/ -A 3 -B 1



🏁 Script executed:

# Get the full context of benchmark_common.py to understand STOCK_TEXT and surrounding code
cat -n bionemo-recipes/recipes/vllm_inference/llama3/benchmark_common.py | head -100



🏁 Script executed:

# View how build_prompts is used in benchmark_vllm.py
rg -A 10 "prompts, _ = build_prompts" bionemo-recipes/recipes/vllm_inference/llama3/benchmark_vllm.py



🏁 Script executed:

# View how build_prompts is used in benchmark_hf.py
rg -A 10 "_, input_ids = build_prompts" bionemo-recipes/recipes/vllm_inference/llama3/benchmark_hf.py



🏁 Script executed:

# Check if same tokenizer is initialized in both files
rg "tokenizer.*=" bionemo-recipes/recipes/vllm_inference/llama3/benchmark_hf.py bionemo-recipes/recipes/vllm_inference/llama3/benchmark_vllm.py | head -20



The "identical inputs" guarantee is not enforced here.

HF consumes input_ids directly, while the vLLM path re-tokenizes prompt_str. Even with the same tokenizer model, the decode/re-encode cycle can introduce drift through special-token handling, whitespace normalization, or subword tokenization differences, meaning the two backends may benchmark different prompt tokens while this helper reports them as equivalent.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bionemo-recipes/recipes/vllm_inference/llama3/benchmark_common.py` around
lines 74 - 93, The helper currently decodes token_ids then lets vLLM re-tokenize
prompt_str, which can diverge; instead build prompt_str from the exact token ids
so vLLM and HF see identical tokens: after producing token_ids (token_ids
variable in build_prompts), convert those ids to tokens with
tokenizer.convert_ids_to_tokens(token_ids) and then turn that token sequence
into a string using tokenizer.convert_tokens_to_string(...) (or an equivalent
join that preserves token boundaries), then replicate that prompt_str into
prompt_strings and keep input_ids built from token_ids; also ensure
encode/decode calls continue to use add_special_tokens=False so special-token
handling stays consistent.
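The failure mode and the fix can be seen with a toy byte-level tokenizer (`ByteTok` is hypothetical; real tokenizers drift through normalization and subword merges rather than a `strip()`):

```python
class ByteTok:
    """Toy stand-in for a tokenizer whose decode() normalizes whitespace."""

    def encode(self, text, add_special_tokens=False):
        return list(text.encode("utf-8"))

    def decode(self, ids):
        # Mimics the lossy normalization a real decode can apply.
        return bytes(ids).decode("utf-8").strip()

    def convert_ids_to_tokens(self, ids):
        return [chr(i) for i in ids]

    def convert_tokens_to_string(self, tokens):
        return "".join(tokens)


tok = ByteTok()
token_ids = tok.encode("  two leading spaces")[:8]

# decode -> re-encode (what build_prompts does today): the ids can change
drifted = tok.encode(tok.decode(token_ids))

# tokens -> string (the suggested fix): round-trips to the same ids
faithful = tok.encode(tok.convert_tokens_to_string(tok.convert_ids_to_tokens(token_ids)))
```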

Comment on lines +96 to +116
def compute_metrics(
e2e_seconds: float,
ttft_seconds: float,
batch_size: int,
output_len: int,
) -> BenchmarkResult:
"""Derive TPOT and throughput from raw wall-clock timings."""
e2e_ms = e2e_seconds * 1000.0
ttft_ms = ttft_seconds * 1000.0
tpot_ms = ((e2e_seconds - ttft_seconds) / max(output_len - 1, 1)) * 1000.0
total_output_tokens = batch_size * output_len
throughput = total_output_tokens / e2e_seconds if e2e_seconds > 0 else 0.0
return BenchmarkResult(
batch_size=batch_size,
prompt_len=0,
output_len=output_len,
e2e_ms=e2e_ms,
ttft_ms=ttft_ms,
tpot_ms=tpot_ms,
throughput_tok_s=throughput,
)

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# First, let's check if the file exists and examine the relevant section
fd -p "benchmark_common.py" -t f



🏁 Script executed:

# Find the file and read the relevant section
find . -name "benchmark_common.py" -type f 2>/dev/null | head -5



🏁 Script executed:

# Search for the function in the codebase
rg "def compute_metrics" -A 30



🏁 Script executed:

# Also search for where compute_metrics is called
rg "compute_metrics\(" -B 3 -A 3



🏁 Script executed:

# Get more context around the compute_metrics calls
rg "compute_metrics\(" -B 20 -A 5 "bionemo-recipes/recipes/vllm_inference/llama3/"



🏁 Script executed:

# Check the full benchmark_vllm.py file to understand the generation flow
wc -l ./bionemo-recipes/recipes/vllm_inference/llama3/benchmark_vllm.py



🏁 Script executed:

# Read the relevant section of benchmark_vllm.py
sed -n '1,200p' ./bionemo-recipes/recipes/vllm_inference/llama3/benchmark_vllm.py



🏁 Script executed:

# Also check benchmark_hf.py
wc -l ./bionemo-recipes/recipes/vllm_inference/llama3/benchmark_hf.py



🏁 Script executed:

# Read benchmark_hf.py to understand the generation flow
sed -n '1,200p' ./bionemo-recipes/recipes/vllm_inference/llama3/benchmark_hf.py



🏁 Script executed:

# Check if there's any code that processes the output of generate() to get actual token counts
rg "\.generate\(" -A 10 "bionemo-recipes/recipes/vllm_inference/llama3/benchmark"



🏁 Script executed:

# Check the entire benchmark_common.py file to see full context
cat ./bionemo-recipes/recipes/vllm_inference/llama3/benchmark_common.py



Pass actual token counts to compute_metrics instead of requested lengths.

compute_metrics() uses the requested output_len to compute TPOT and throughput, but language models may generate fewer tokens when they produce end-of-sequence (EOS). This inflates both metrics whenever generation stops early. The _generate() calls in both benchmark scripts discard the generation outputs and only time the execution, so actual token counts are never captured. To fix this, extract the actual number of tokens generated from the model.generate() and engine.generate() return values and pass those counts to compute_metrics() instead of the configured output_len.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bionemo-recipes/recipes/vllm_inference/llama3/benchmark_common.py` around
lines 96 - 116, compute_metrics currently uses the requested output_len rather
than the actual number of tokens produced, which inflates TPOT/throughput when
generation ends early; modify compute_metrics (and its callers) to accept the
actual output token count (e.g., actual_output_len or output_tokens_per_example)
instead of the configured output_len, compute total_output_tokens = batch_size *
actual_output_len, and compute tpot_ms using max(actual_output_len - 1, 1) with
safe zero-division guards; update the benchmark scripts that call _generate /
model.generate / engine.generate to extract the real generated token counts from
the generate return values and pass those counts into compute_metrics, keeping
BenchmarkResult fields consistent (batch_size, prompt_len, output_len -> actual
output length, e2e_ms, ttft_ms, tpot_ms, throughput_tok_s).
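A hedged sketch of the adjusted helper — the `actual_output_lens` signature is hypothetical; the recipe's current function takes a single configured `output_len`:

```python
def compute_metrics_actual(e2e_seconds, ttft_seconds, actual_output_lens):
    """Derive TPOT and throughput from tokens actually generated, so early
    EOS stops no longer inflate the numbers."""
    total_tokens = sum(actual_output_lens)
    mean_len = total_tokens / max(len(actual_output_lens), 1)
    tpot_ms = ((e2e_seconds - ttft_seconds) / max(mean_len - 1, 1)) * 1000.0
    throughput = total_tokens / e2e_seconds if e2e_seconds > 0 else 0.0
    return {"tpot_ms": tpot_ms, "throughput_tok_s": throughput}


# Per-request counts: vLLM exposes len(out.outputs[0].token_ids); for HF,
# subtract the prompt length from each generated row's length.
```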

Comment on lines +189 to +199
def parse_config(args: argparse.Namespace) -> BenchmarkConfig:
"""Convert parsed CLI args into a BenchmarkConfig."""
return BenchmarkConfig(
model=args.model,
batch_sizes=[int(x) for x in args.batch_sizes.split(",")],
prompt_lens=[int(x) for x in args.prompt_lens.split(",")],
output_lens=[int(x) for x in args.output_lens.split(",")],
warmup=args.warmup,
repeats=args.repeats,
csv_path=args.csv_path,
)

⚠️ Potential issue | 🟡 Minor

Reject non-positive sweep parameters up front.

--repeats 0 currently falls through to statistics.median([]), which raises an opaque StatisticsError, and zero or negative batch sizes or token lengths fail later with far less actionable errors. Validate them here so the CLI fails fast.

🛠️ Suggested validation
 def parse_config(args: argparse.Namespace) -> BenchmarkConfig:
     """Convert parsed CLI args into a BenchmarkConfig."""
+    def _parse_positive_csv(flag_name: str, raw: str) -> list[int]:
+        values = [int(x) for x in raw.split(",")]
+        if any(value <= 0 for value in values):
+            raise ValueError(f"{flag_name} must contain only positive integers")
+        return values
+
+    if args.warmup < 0:
+        raise ValueError("--warmup must be >= 0")
+    if args.repeats <= 0:
+        raise ValueError("--repeats must be > 0")
+
     return BenchmarkConfig(
         model=args.model,
-        batch_sizes=[int(x) for x in args.batch_sizes.split(",")],
-        prompt_lens=[int(x) for x in args.prompt_lens.split(",")],
-        output_lens=[int(x) for x in args.output_lens.split(",")],
+        batch_sizes=_parse_positive_csv("--batch-sizes", args.batch_sizes),
+        prompt_lens=_parse_positive_csv("--prompt-lens", args.prompt_lens),
+        output_lens=_parse_positive_csv("--output-lens", args.output_lens),
         warmup=args.warmup,
         repeats=args.repeats,
         csv_path=args.csv_path,
     )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bionemo-recipes/recipes/vllm_inference/llama3/benchmark_common.py` around
lines 189 - 199, In parse_config, validate that repeats is > 0 and that every
value in batch_sizes, prompt_lens, and output_lens parses to an integer > 0
before constructing the BenchmarkConfig; if any check fails, raise a clear
exception (e.g., ValueError) with a descriptive message so the CLI fails fast.
Locate parse_config and the use of args.repeats, args.batch_sizes,
args.prompt_lens, args.output_lens and perform the checks on the parsed lists
(after splitting) and the repeats value, returning the BenchmarkConfig only if
all validations pass.

Comment on lines +28 to +43
import argparse
import itertools

from benchmark_common import (
add_common_args,
build_prompts,
compute_metrics,
median_timing,
parse_config,
print_results,
write_csv,
)
from vllm import LLM, SamplingParams


DEFAULT_MODEL = "./llama3_hf_roundtrip_checkpoint"

⚠️ Potential issue | 🟡 Minor

Resolve the default checkpoint from __file__, not the current working directory.

export_llama3.py writes its default output under the recipe directory, but this default points at ./llama3_hf_roundtrip_checkpoint. Running the benchmark from repo root with an explicit script path will miss the exported checkpoint unless --model is passed.

🛠️ Suggested fix
 import argparse
 import itertools
+from pathlib import Path
 
+from vllm import LLM, SamplingParams
+
 from benchmark_common import (
     add_common_args,
     build_prompts,
     compute_metrics,
     median_timing,
     parse_config,
     print_results,
     write_csv,
 )
-from vllm import LLM, SamplingParams
 
-
-DEFAULT_MODEL = "./llama3_hf_roundtrip_checkpoint"
+DEFAULT_MODEL = str(Path(__file__).resolve().parent / "llama3_hf_roundtrip_checkpoint")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bionemo-recipes/recipes/vllm_inference/llama3/benchmark_vllm.py` around lines
28 - 43, DEFAULT_MODEL currently points to "./llama3_hf_roundtrip_checkpoint"
which resolves from the process CWD; change it to resolve relative to the script
file so the default uses the recipe directory. Import pathlib and set
DEFAULT_MODEL = str(Path(__file__).parent / "llama3_hf_roundtrip_checkpoint"),
updating any code that expects a string path; reference DEFAULT_MODEL and
__file__ to locate where to change.

Comment on lines +11 to +20
ARG UV_BREAK_SYSTEM_PACKAGES=1
RUN if [ "$INSTALL_VLLM" = "true" ]; then \
if [ -z "$TORCH_CUDA_ARCH_LIST" ]; then \
echo "ERROR: TORCH_CUDA_ARCH_LIST must be set when INSTALL_VLLM=true" && exit 1; \
fi && \
git clone --branch v0.15.1 --depth 1 https://github.com/vllm-project/vllm.git && \
cd vllm && \
python use_existing_torch.py && \
uv pip install -r requirements/build.txt --system && \
uv pip install --no-build-isolation -e . --system && \

⚠️ Potential issue | 🟡 Minor

UV_BREAK_SYSTEM_PACKAGES ARG may not be visible to uv commands.

ARG values are only available during build as build-time variables, but uv reads UV_BREAK_SYSTEM_PACKAGES from the environment. The ARG needs to be exported as an ENV or set inline for the uv commands.

🐛 Proposed fix
 ARG UV_BREAK_SYSTEM_PACKAGES=1
+ENV UV_BREAK_SYSTEM_PACKAGES=${UV_BREAK_SYSTEM_PACKAGES}
 RUN if [ "$INSTALL_VLLM" = "true" ]; then \
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bionemo-recipes/recipes/vllm_inference/llama3/Dockerfile` around lines 11 -
20, ARG UV_BREAK_SYSTEM_PACKAGES is defined but not exported, so the subsequent
invocations of the wrapper command "uv pip" won’t see it as an environment
variable; update the Dockerfile so the build-time ARG is passed to the
environment before running the vllm install block (for example, set ENV
UV_BREAK_SYSTEM_PACKAGES=$UV_BREAK_SYSTEM_PACKAGES or export it inline before
the "uv pip" commands) so that the "uv pip install -r requirements/build.txt
--system" and "uv pip install --no-build-isolation -e . --system" invocations
receive the flag; ensure this change occurs inside the same conditional that
checks INSTALL_VLLM and before any use of "uv".

Comment on lines +237 to +241
# Ensure that rotary embeddings are computed with at a higher precision
with torch.autocast(device_type="cuda", enabled=False):
te_rope_emb = self.rotary_emb(max_seq_len=self.config.max_position_embeddings)
if te_rope_emb.dtype == torch.float32:
warnings.warn("Rotary embeddings should be in float32 for optimal performance.", UserWarning)

⚠️ Potential issue | 🟡 Minor

Invert this warning condition.

Line 240 warns when the rotary embedding is already float32, so the hot path emits a false warning on the optimal dtype and never flags the real mismatch case.

Suggested fix
-            if te_rope_emb.dtype == torch.float32:
+            if te_rope_emb.dtype != torch.float32:
                 warnings.warn("Rotary embeddings should be in float32 for optimal performance.", UserWarning)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bionemo-recipes/recipes/vllm_inference/llama3/modeling_llama_te.py` around
lines 237 - 241, The warning condition is inverted: change the check in the
block that computes te_rope_emb (within the torch.autocast context) to warn when
the rotary embedding is NOT float32. Locate the computation of te_rope_emb via
self.rotary_emb(max_seq_len=self.config.max_position_embeddings) and replace the
current if te_rope_emb.dtype == torch.float32 warning with an if
te_rope_emb.dtype != torch.float32 warning so the warning fires on the
incorrect/mismatched dtype rather than the correct hot path.

Comment on lines +263 to +275
# add hidden states from the last decoder layer. Note that these will be in THD format; we could possibly pad
# these with the same _pad_input call as below if we wanted them returned in BSHD format.
if output_hidden_states:
all_hidden_states = (*all_hidden_states, hidden_states)

if should_pack_inputs:
# If we've converted BSHD to THD for our TE layers, we need to convert back to BSHD for the output.
hidden_states = _pad_input(hidden_states, indices, batch_size, padded_seq_len)

return BaseModelOutputWithPast(
last_hidden_state=hidden_states,
past_key_values=past_key_values,
hidden_states=all_hidden_states if output_hidden_states else None,

⚠️ Potential issue | 🟠 Major

Re-pad hidden_states before returning them.

When should_pack_inputs is true, last_hidden_state is converted back to BSHD at Line 270, but every entry accumulated in hidden_states stays THD. That mixes layouts inside one output object and breaks callers that index hidden states by batch/sequence.

Suggested fix
         if should_pack_inputs:
             # If we've converted BSHD to THD for our TE layers, we need to convert back to BSHD for the output.
+            if output_hidden_states:
+                all_hidden_states = tuple(
+                    _pad_input(state, indices, batch_size, padded_seq_len) for state in all_hidden_states
+                )
             hidden_states = _pad_input(hidden_states, indices, batch_size, padded_seq_len)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bionemo-recipes/recipes/vllm_inference/llama3/modeling_llama_te.py` around
lines 263 - 275, The returned BaseModelOutputWithPast mixes THD and BSHD layouts
when should_pack_inputs is true because only the final hidden_states is run
through _pad_input; fix by converting every entry in all_hidden_states back to
BSHD before returning (use the same _pad_input(indices, batch_size,
padded_seq_len) call on each element), e.g. map/transform the tuple
all_hidden_states to a new tuple of padded tensors when should_pack_inputs and
output_hidden_states are true so last_hidden_state and all entries in
hidden_states share the same BSHD layout.

Comment on lines +391 to +416
torch._dynamo.config.capture_scalar_outputs = True


@torch.compile
def _pad_input(hidden_states, indices, batch, seqlen):
"""Convert a THD tensor to a BSHD equivalent tensor.

Adapted from huggingface/transformers/modeling_flash_attention_utils.py

Arguments:
hidden_states: (total_nnz, ...), where total_nnz = number of tokens in selected in attention_mask.
indices: (total_nnz), the indices that represent the non-masked tokens of the original padded input sequence.
batch: int, batch size for the padded sequence.
seqlen: int, maximum sequence length for the padded sequence.

Return:
hidden_states: (batch, seqlen, ...)
"""
dim = hidden_states.shape[1:]
output = torch.zeros((batch * seqlen), *dim, device=hidden_states.device, dtype=hidden_states.dtype)
output[indices] = hidden_states
return output.view(batch, seqlen, *dim)


@torch.compile
def _unpad_input(hidden_states, attention_mask, unused_mask=None):

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Verify this is a module-scope TorchDynamo config mutation and see who imports the TE Llama module.
rg -n --type=py '^\s*torch\._dynamo\.config\.capture_scalar_outputs\s*='
rg -nP --type=py 'from\s+.*modeling_llama_te\s+import|import\s+.*modeling_llama_te'



Avoid a process-wide TorchDynamo toggle in module scope.

Setting torch._dynamo.config.capture_scalar_outputs = True at import time changes compilation behavior for every other model in the same process that imports this module. This file is imported by at least 11 files across tests, training, and export scripts, making the global configuration impact widespread. Move this setting to the benchmark/test entrypoint, or refactor _unpad_input() so it doesn't require a global configuration change.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bionemo-recipes/recipes/vllm_inference/llama3/modeling_llama_te.py` around
lines 391 - 416, Remove the process-wide
torch._dynamo.config.capture_scalar_outputs assignment from module scope (do not
set torch._dynamo.config.capture_scalar_outputs = True at import); instead
either (A) move that configuration to the benchmark/test/export entrypoint so it
runs only when those scripts execute, or (B) make the local functions robust to
the default config by refactoring _unpad_input/_pad_input to avoid depending on
capture_scalar_outputs (or wrap compilation calls in a local
torch._dynamo.config.patch/context that sets capture_scalar_outputs only for the
duration of compilation). Locate the module-level assignment and delete it, then
choose one of the two options: update the entrypoint(s) used for
benchmarking/tests to enable capture_scalar_outputs there, or change
_unpad_input/_pad_input so they don’t require the global toggle and, if
necessary, enable the config inside a narrow scoped patch around the call to
torch.compile.
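The "narrow scoped patch" option follows a save/restore shape; this stdlib sketch mimics what `torch._dynamo.config.patch(capture_scalar_outputs=True)` provides (the SimpleNamespace config here is a stand-in, not Dynamo's real config object):

```python
from contextlib import contextmanager
from types import SimpleNamespace


@contextmanager
def scoped_flag(config, name, value):
    """Set config.<name> for the duration of the with-block, then restore it."""
    previous = getattr(config, name)
    setattr(config, name, value)
    try:
        yield config
    finally:
        setattr(config, name, previous)


# Stand-in for torch._dynamo.config; the flag is only flipped around compile.
dynamo_config = SimpleNamespace(capture_scalar_outputs=False)

with scoped_flag(dynamo_config, "capture_scalar_outputs", True):
    compiled_under_flag = dynamo_config.capture_scalar_outputs  # True here
    # compiled_fn = torch.compile(_unpad_input)  # would compile with the flag set

restored = dynamo_config.capture_scalar_outputs  # back to False
```

This keeps the rest of the process — every other model importing `modeling_llama_te` — on the default Dynamo configuration.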

Comment on lines +574 to +584
def merge_qkv_concat(ctx: TransformCTX, qkv: torch.Tensor):
"""Merge naively concatenated q, k, v to interleave-concatenated qkv.

Example: import HF qkv to layer linear_qkv
"""
megatron_config = ctx.target.config
head_num = megatron_config.num_attention_heads
num_query_groups = megatron_config.num_query_groups
head_size = megatron_config.kv_channels
q, k, v = qkv.split([head_num * head_size, num_query_groups * head_size, num_query_groups * head_size], dim=0)
return TransformFns.merge_qkv(ctx, q, k, v)

⚠️ Potential issue | 🟠 Major

Use the new KV-head field names in these merge helpers.

This block still reads Megatron-only num_query_groups / kv_channels, while TransformFns.merge_qkv() just below already uses num_key_value_heads plus hidden_size // num_attention_heads. The new NVLlamaConfig in bionemo-recipes/recipes/vllm_inference/llama3/modeling_llama_te.py Lines 45-52 follows that newer shape, so these import paths will raise AttributeError as soon as they target the new TE model.

Suggested fix
     def merge_qkv_concat(ctx: TransformCTX, qkv: torch.Tensor):
         """Merge naively concatenated q, k, v to interleave-concatenated qkv.

         Example: import HF qkv to layer linear_qkv
         """
         megatron_config = ctx.target.config
         head_num = megatron_config.num_attention_heads
-        num_query_groups = megatron_config.num_query_groups
-        head_size = megatron_config.kv_channels
-        q, k, v = qkv.split([head_num * head_size, num_query_groups * head_size, num_query_groups * head_size], dim=0)
+        num_key_value_heads = megatron_config.num_key_value_heads
+        head_size = megatron_config.hidden_size // head_num
+        q, k, v = qkv.split(
+            [head_num * head_size, num_key_value_heads * head_size, num_key_value_heads * head_size], dim=0
+        )
         return TransformFns.merge_qkv(ctx, q, k, v)
@@
     def merge_qkv_bias_concat(ctx: TransformCTX, qkv_bias: torch.Tensor):
@@
         megatron_config = ctx.target.config
         head_num = megatron_config.num_attention_heads
-        num_query_groups = megatron_config.num_query_groups
-        head_size = megatron_config.kv_channels
+        num_key_value_heads = megatron_config.num_key_value_heads
+        head_size = megatron_config.hidden_size // head_num
         qb, kb, vb = qkv_bias.split(
-            [head_num * head_size, num_query_groups * head_size, num_query_groups * head_size], dim=0
+            [head_num * head_size, num_key_value_heads * head_size, num_key_value_heads * head_size], dim=0
         )
         return TransformFns.merge_qkv_bias(ctx, qb, kb, vb)
@@
     def merge_qkv_bias(ctx: TransformCTX, qb: torch.Tensor, kb: torch.Tensor, vb: torch.Tensor):
@@
         megatron_config = ctx.target.config

         head_num = megatron_config.num_attention_heads
-        num_query_groups = megatron_config.num_query_groups
-        heads_per_group = head_num // num_query_groups
-        head_size = megatron_config.kv_channels
+        num_key_value_heads = megatron_config.num_key_value_heads
+        heads_per_group = head_num // num_key_value_heads
+        head_size = megatron_config.hidden_size // head_num

         new_q_tensor_shape = (head_num, head_size)
-        new_kv_tensor_shape = (num_query_groups, head_size)
+        new_kv_tensor_shape = (num_key_value_heads, head_size)
@@
-        for i in range(num_query_groups):
+        for i in range(num_key_value_heads):
             qkv_bias = torch.cat((qkv_bias, qb[i * heads_per_group : (i + 1) * heads_per_group, :]))
             qkv_bias = torch.cat((qkv_bias, kb[i : i + 1, :]))
             qkv_bias = torch.cat((qkv_bias, vb[i : i + 1, :]))
         qkv_bias = qkv_bias.reshape(
             [
-                head_size * (head_num + 2 * num_query_groups),
+                head_size * (head_num + 2 * num_key_value_heads),
             ]
         )
         return qkv_bias

Also applies to: 623-667

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bionemo-recipes/recipes/vllm_inference/llama3/state.py` around lines 574 -
584, The helper merge functions (e.g., merge_qkv_concat) still reference
Megatron-only fields num_query_groups and kv_channels; update them to use the
new KV-head fields used by TransformFns.merge_qkv: read num_key_value_heads from
ctx.target.config and compute head_size as ctx.target.config.hidden_size //
ctx.target.config.num_attention_heads, then use those values in the qkv.split
sizes (head_num * head_size, num_key_value_heads * head_size,
num_key_value_heads * head_size) and any other merge helpers in the same area
(the block around lines 623–667) so they match TransformFns.merge_qkv's
expectations and avoid AttributeError on NVLlamaConfig.
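For reference, the split sizes and interleaved row order that merge helpers like these produce for a GQA checkpoint can be sketched in plain Python. This is an illustrative sketch, not code from the PR; the config numbers below are the usual Llama3-8B values, and both function names are hypothetical:

```python
def gqa_split_sizes(num_attention_heads: int, num_key_value_heads: int, hidden_size: int):
    """Row counts for splitting a naively concatenated [q; k; v] weight."""
    head_size = hidden_size // num_attention_heads
    return (
        num_attention_heads * head_size,   # q rows
        num_key_value_heads * head_size,   # k rows
        num_key_value_heads * head_size,   # v rows
    )


def interleaved_row_order(num_attention_heads: int, num_key_value_heads: int):
    """Per-group interleave order a fused linear_qkv expects: all q heads of a
    KV group, then that group's k head, then its v head."""
    heads_per_group = num_attention_heads // num_key_value_heads
    order = []
    for g in range(num_key_value_heads):
        order += [("q", g * heads_per_group + i) for i in range(heads_per_group)]
        order += [("k", g), ("v", g)]
    return order


# Llama3-8B-style shapes: 32 query heads, 8 KV heads, hidden size 4096
print(gqa_split_sizes(32, 8, 4096))  # → (4096, 1024, 1024)

# Tiny example: 4 query heads sharing 2 KV heads
print(interleaved_row_order(4, 2))
# → [('q', 0), ('q', 1), ('k', 0), ('v', 0), ('q', 2), ('q', 3), ('k', 1), ('v', 1)]
```

With num_query_groups/kv_channels absent from the new config, computing head_size from hidden_size keeps the split arithmetic identical while reading only fields NVLlamaConfig actually defines.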

Comment on lines +34 to +41
try:
from vllm import LLM, SamplingParams

_VLLM_AVAILABLE = True
except ImportError:
_VLLM_AVAILABLE = False

from export_llama3 import HF_MODEL_ID, convert_te_to_vllm, create_te_checkpoint

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Scripts executed against NVIDIA/bionemo-framework:

cat -n bionemo-recipes/recipes/vllm_inference/llama3/tests/test_vllm.py | head -60
cat -n bionemo-recipes/recipes/vllm_inference/llama3/tests/test_vllm.py | sed -n '52,100p'
cat -n bionemo-recipes/recipes/vllm_inference/llama3/tests/test_vllm.py | sed -n '100,160p'

# Check if there are any other environment guards or CUDA checks in the file
cat -n bionemo-recipes/recipes/vllm_inference/llama3/tests/test_vllm.py | sed -n '1,200p' | grep -i "cuda\|skip\|torch.cuda"

# Check if export_llama3 is defined in the same directory
find bionemo-recipes/recipes/vllm_inference/llama3 -name "export_llama3.py" -o -name "export_llama3" -type d

cat -n bionemo-recipes/recipes/vllm_inference/llama3/export_llama3.py | head -80
cat -n bionemo-recipes/recipes/vllm_inference/llama3/export_llama3.py | sed -n '30,100p'

# Check if convert.py and modeling_llama_te.py exist in the same directory
ls -la bionemo-recipes/recipes/vllm_inference/llama3/ | grep -E "(convert|modeling)"

# Check if there are conftest.py files that might provide fixtures or setup
find bionemo-recipes/recipes/vllm_inference/llama3 -name "conftest.py"

cat -n bionemo-recipes/recipes/vllm_inference/llama3/tests/conftest.py
wc -l bionemo-recipes/recipes/vllm_inference/llama3/tests/conftest.py


Module-level environment guards are needed for GPU and export_llama3 dependencies.

The conditional vllm import (lines 34–39) only guards that package. export_llama3 is still imported unconditionally at line 41, so on environments missing its TransformerEngine dependencies the module fails to import during pytest collection, even when vllm is unavailable. The hf_reference_outputs and vllm_checkpoint fixtures likewise call .to("cuda") and model-loading functions unconditionally, so they error at setup on machines without CUDA. Module-level skips avoid both failure modes before any test runs.

Suggested guard pattern
 try:
     from vllm import LLM, SamplingParams
 
     _VLLM_AVAILABLE = True
 except ImportError:
     _VLLM_AVAILABLE = False
 
-from export_llama3 import HF_MODEL_ID, convert_te_to_vllm, create_te_checkpoint
+if not _VLLM_AVAILABLE:
+    pytest.skip("requires vllm", allow_module_level=True)
+if not torch.cuda.is_available():
+    pytest.skip("requires CUDA", allow_module_level=True)
+
+try:
+    from export_llama3 import HF_MODEL_ID, convert_te_to_vllm, create_te_checkpoint
+except ImportError as exc:
+    pytest.skip(f"requires export_llama3 dependencies: {exc}", allow_module_level=True)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bionemo-recipes/recipes/vllm_inference/llama3/tests/test_vllm.py` around
lines 34 - 41, The test module currently imports export_llama3 and performs
.to("cuda") in fixtures unconditionally, causing import/collection failures when
TransformerEngine or CUDA are missing; wrap or guard those imports and GPU
usage: detect TE availability (e.g., try/except import export_llama3 to set
_TE_AVAILABLE) and CUDA availability (torch.cuda.is_available() to set
_CUDA_AVAILABLE), only import or call
HF_MODEL_ID/convert_te_to_vllm/create_te_checkpoint when both _VLLM_AVAILABLE
and _TE_AVAILABLE and _CUDA_AVAILABLE are true, and modify the
hf_reference_outputs and vllm_checkpoint fixtures to skip or return None when
GPU/TE are unavailable (or use pytest.importorskip/pytest.skip at fixture start)
to prevent test collection errors.
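A minimal, stdlib-only sketch of the module-level guard idea: probe each optional dependency once, then skip the whole module before anything heavy is imported. The probe helper and module names here are illustrative, not from the PR; the pytest.skip calls are shown only as comments since they would abort this snippet when run outside a test module:

```python
import importlib.util


def probe(*modules: str) -> dict:
    """Map each module name to whether it is importable, without importing it."""
    return {m: importlib.util.find_spec(m) is not None for m in modules}


avail = probe("vllm", "torch")

# In a test module this would run at import time, before any fixtures:
#
#     import pytest
#     if not avail["vllm"]:
#         pytest.skip("requires vllm", allow_module_level=True)
#     if not avail["torch"]:
#         pytest.skip("requires torch/CUDA", allow_module_level=True)

print(probe("sys"))  # stdlib module, always importable → {'sys': True}
```

pytest.importorskip("vllm") is a one-line alternative for the import probes, though a separate torch.cuda.is_available() check is still needed for the GPU guard.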

@gagank1 gagank1 marked this pull request as draft March 9, 2026 23:34
@codecov

codecov bot commented Mar 10, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 76.45%. Comparing base (470e10d) to head (6a3905c).
⚠️ Report is 1 commit behind head on main.
✅ All tests successful. No failed tests found.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #1501   +/-   ##
=======================================
  Coverage   76.45%   76.45%           
=======================================
  Files         102      102           
  Lines        7952     7952           
=======================================
  Hits         6080     6080           
  Misses       1872     1872           
