Conversation
Signed-off-by: Gagan Kaushik <gkaushik@nvidia.com>
📝 Walkthrough

This PR introduces a complete Llama-3 vLLM inference recipe, including containerization, conversion utilities between HuggingFace and Transformer Engine formats, benchmarking tools, example scripts, and comprehensive tests. Updates include workflow configuration and file-synchronization checks.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant User
    participant Checkpoint as Checkpoint<br/>(HF)
    participant ConvertHF2TE as ConvertHF to TE
    participant TEModel as TE Model<br/>(checkpoint)
    participant ConvertTE2HF as ConvertTE to HF
    participant vLLMModel as vLLM Model<br/>(HF format)
    participant Inference as Inference<br/>Engine
    User->>ConvertHF2TE: load HF Llama-3
    ConvertHF2TE->>ConvertHF2TE: create TE config
    ConvertHF2TE->>ConvertHF2TE: initialize TE model on meta device
    ConvertHF2TE->>ConvertHF2TE: apply state transforms (QKV merge, FC1)
    ConvertHF2TE->>TEModel: save TE checkpoint
    User->>ConvertTE2HF: load TE checkpoint
    ConvertTE2HF->>ConvertTE2HF: extract TE config
    ConvertTE2HF->>ConvertTE2HF: create HF model on meta device
    ConvertTE2HF->>ConvertTE2HF: apply reverse transforms (QKV split, FC1)
    ConvertTE2HF->>ConvertTE2HF: copy rotary embeddings
    ConvertTE2HF->>vLLMModel: save HF checkpoint
    User->>Inference: start vLLM with HF checkpoint
    Inference->>vLLMModel: load weights
    Inference->>User: ready for inference
```

```mermaid
sequenceDiagram
    participant User
    participant BenchConfig as Config<br/>(batch, prompt_len, output_len)
    participant ModelLoader as Model Loader
    participant Tokenizer as Tokenizer
    participant PromptGen as Prompt Generation
    participant Warmup as Warmup Phase
    participant Measurement as Timing Measurement
    participant Metrics as Metrics Computation
    participant Output as Results Output
    User->>BenchConfig: specify sweep parameters
    BenchConfig->>ModelLoader: load model & tokenizer
    ModelLoader->>Tokenizer: initialize tokenizer
    loop over (batch_size, prompt_len, output_len)
        BenchConfig->>PromptGen: build_prompts(batch_size, prompt_len)
        PromptGen->>Warmup: send warmup batch
        Warmup->>ModelLoader: forward pass (discard)
        Warmup->>Measurement: measure TTFT & E2E
        Measurement->>ModelLoader: generate with timing hooks
        ModelLoader->>Measurement: return latencies
        Measurement->>Metrics: compute_metrics(e2e, ttft, batch, output_len)
        Metrics->>Output: accumulate result
    end
    Output->>Output: print_results(table)
    Output->>User: write_csv(results, path)
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~75 minutes
🚥 Pre-merge checks: ✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)

Comments:
/ok to test 6a3905c
@coderabbitai review
✅ Actions performed: Review triggered.
Actionable comments posted: 12
🧹 Nitpick comments (7)
bionemo-recipes/recipes/vllm_inference/llama3/llama3_transformers_inference_example.py (1)

22-36: Consider wrapping execution in an `if __name__ == "__main__":` guard.

The script executes GPU operations at import time. While acceptable for a smoke-test script, adding a main guard prevents accidental execution when imported.

♻️ Proposed refactor

```diff
-CHECKPOINT = "./llama3_hf_roundtrip_checkpoint"
-
-model = AutoModelForCausalLM.from_pretrained(CHECKPOINT, torch_dtype=torch.bfloat16)
-model.to("cuda")
-
-tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
-tokenizer.pad_token = tokenizer.eos_token
-
-inputs = tokenizer("The quick brown fox", return_tensors="pt")
-inputs = {k: v.to("cuda") for k, v in inputs.items()}
-
-with torch.no_grad():
-    output_ids = model.generate(**inputs, max_new_tokens=16, use_cache=False)
-
-print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
+CHECKPOINT = "./llama3_hf_roundtrip_checkpoint"
+
+
+def main():
+    model = AutoModelForCausalLM.from_pretrained(CHECKPOINT, torch_dtype=torch.bfloat16)
+    model.to("cuda")
+
+    tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
+    tokenizer.pad_token = tokenizer.eos_token
+
+    inputs = tokenizer("The quick brown fox", return_tensors="pt")
+    inputs = {k: v.to("cuda") for k, v in inputs.items()}
+
+    with torch.no_grad():
+        output_ids = model.generate(**inputs, max_new_tokens=16, use_cache=False)
+
+    print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
+
+
+if __name__ == "__main__":
+    main()
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@bionemo-recipes/recipes/vllm_inference/llama3/llama3_transformers_inference_example.py` around lines 22-36: the script currently runs GPU/model/tokenizer setup and generation at import time (CHECKPOINT, AutoModelForCausalLM.from_pretrained, model.to, tokenizer, inputs creation, and model.generate); wrap this execution in a main guard by moving the procedural code into a main() function and surrounding the call with if __name__ == "__main__": so importing the module doesn't trigger GPU operations automatically.

bionemo-recipes/recipes/vllm_inference/llama3/requirements.txt (1)
1-5: Consider pinning dependency versions for reproducibility.

All dependencies except `torchao` are unpinned. This can lead to inconsistent CI results if upstream packages release breaking changes. Consider adding version constraints (e.g., `transformers>=4.40,<5.0`) for critical dependencies.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@bionemo-recipes/recipes/vllm_inference/llama3/requirements.txt` around lines 1-5: the requirements file currently has unpinned packages (lm-eval, torch, transformer_engine[pytorch], transformers) which can cause non-reproducible builds; update requirements.txt to add explicit version constraints for these entries (for example, ranges like transformers>=4.40,<5.0 and semantically safe ranges for torch, lm-eval, and transformer_engine[pytorch]) while keeping the existing torchao!=0.14.0 constraint; ensure each package line (lm-eval, torch, transformer_engine[pytorch], transformers) is replaced with a pinned or range-pinned spec to enforce reproducible CI installs.

bionemo-recipes/recipes/vllm_inference/llama3/Dockerfile (1)
1-24: Consider adding a non-root user for improved security posture.

The container runs as root by default. While acceptable for CI/development, adding a non-root user would improve security for production deployments.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@bionemo-recipes/recipes/vllm_inference/llama3/Dockerfile` around lines 1-24: add a non-root user and switch to it after installing system-level deps: create a user/group (e.g., bionemo) and home directory, chown /workspace and /workspace/bionemo (and any pip cache or build dirs used during RUN steps) to that user, and add a final USER bionemo directive so subsequent operations run unprivileged; ensure the vllm build steps that require root (the RUN block guarded by ARG INSTALL_VLLM) still run as root by keeping them before the USER switch, and preserve required environment variables (INSTALL_VLLM, TORCH_CUDA_ARCH_LIST, MAX_JOBS) and WORKDIR adjustments so folders used by the pip installs and git clone (vllm) are owned by the new user after chown.

bionemo-recipes/recipes/vllm_inference/llama3/.ci_build.sh (1)
1-3: Add error handling for safer CI execution.

The script lacks `set -e` or similar error handling. If `pip install` fails, the script will continue to `./install_vllm.sh`, potentially masking the failure.

Note: the `PIP_CONSTRAINT=` syntax (shellcheck SC1007) is actually correct here; it sets `PIP_CONSTRAINT` to an empty string for the pip command only.

♻️ Proposed fix

```diff
-#!/bin/bash -x
-PIP_CONSTRAINT= pip install -r requirements.txt
+#!/bin/bash -xe
+PIP_CONSTRAINT='' pip install -r requirements.txt
 ./install_vllm.sh
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@bionemo-recipes/recipes/vllm_inference/llama3/.ci_build.sh` around lines 1-3: the CI script currently lacks fail-fast error handling, so failures in the pip install step (PIP_CONSTRAINT= pip install -r requirements.txt) won't stop execution and may mask errors before ./install_vllm.sh runs; add a fail-fast header (e.g., set -euo pipefail or at least set -e) immediately after the shebang so the script exits on errors, preserve the existing PIP_CONSTRAINT= pip install -r requirements.txt syntax, and keep the subsequent ./install_vllm.sh invocation unchanged.

bionemo-recipes/recipes/vllm_inference/llama3/export_llama3.py (1)
82-101: `convert_te_to_vllm()` still returns a partially assembled checkpoint.

This helper stops before tokenizer files are written, so every caller has to remember a second `save_pretrained()` step. That leakage is already duplicated here and in `tests/test_vllm.py`.

♻️ Suggested API cleanup

```diff
-def convert_te_to_vllm(te_checkpoint: Path, output_dir: Path) -> Path:
+def convert_te_to_vllm(
+    te_checkpoint: Path,
+    output_dir: Path,
+    tokenizer_source: str | Path | None = None,
+) -> Path:
     """Convert a TE checkpoint to standard HF format for vLLM serving."""
     print(f"  Loading TE checkpoint: {te_checkpoint}")
     model_te = NVLlamaForCausalLM.from_pretrained(te_checkpoint)
@@
     output_dir.mkdir(parents=True, exist_ok=True)
     model_hf.save_pretrained(output_dir)
     del model_hf
+
+    if tokenizer_source is not None:
+        tokenizer = AutoTokenizer.from_pretrained(tokenizer_source)
+        tokenizer.save_pretrained(output_dir)
     print(f"  vLLM-ready checkpoint saved to {output_dir}")
     return output_dir
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@bionemo-recipes/recipes/vllm_inference/llama3/export_llama3.py` around lines 82-101: convert_te_to_vllm currently returns a checkpoint missing tokenizer files, so callers must run a second save; fix by ensuring tokenizer artifacts are copied or saved into output_dir before returning: load the tokenizer from te_checkpoint (e.g., AutoTokenizer.from_pretrained(te_checkpoint) or equivalent) and call tokenizer.save_pretrained(output_dir), or if original tokenizer files exist under te_checkpoint simply copy them into output_dir, then return output_dir (references: convert_te_to_vllm, model_hf, te_checkpoint, output_dir).

bionemo-recipes/recipes/vllm_inference/llama3/benchmark_hf.py (1)
26-39: Regroup the imports to match isort.

`benchmark_common` is a local module, so it should live in its own first-party block instead of between third-party imports.

♻️ Suggested import order

```diff
 import argparse
 import itertools

 import torch
+from transformers import AutoModelForCausalLM, AutoTokenizer
+
 from benchmark_common import (
     add_common_args,
     build_prompts,
     compute_metrics,
     median_timing,
     parse_config,
     print_results,
     write_csv,
 )
-from transformers import AutoModelForCausalLM, AutoTokenizer
```

As per coding guidelines, `**/*.py`: Follow import sorting configuration as per isort with 2 lines after imports configuration.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@bionemo-recipes/recipes/vllm_inference/llama3/benchmark_hf.py` around lines 26-39: the import block places the local module benchmark_common between third-party imports; move the "from benchmark_common import (...)" line into the first-party import block (i.e., group it after the third-party imports like torch and transformers) so imports follow stdlib, third-party, first-party order and obey the project's isort spacing rules; keep the same imported symbols (add_common_args, build_prompts, compute_metrics, median_timing, parse_config, print_results, write_csv) and run isort/formatting to verify spacing.

bionemo-recipes/recipes/vllm_inference/llama3/state.py (1)
279-285: Don't mutate NumPy's global print options from the transform hot path.

`np.set_printoptions()` is process-wide, so every transform call changes formatting for unrelated logs and tests. If the goal is shorter debug output, format the specific arrays at log time instead of touching global state here.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@bionemo-recipes/recipes/vllm_inference/llama3/state.py` around lines 279-285: in __call__ of the transformer, remove the process-wide np.set_printoptions(threshold=10) call because it mutates global NumPy state on every transform; instead, when you need compact debug output, format the specific arrays at log time (e.g., use numpy.array2string or an ephemeral formatting helper) before logging, leaving the transform method and __call__ behavior otherwise unchanged; update references in __call__ and any logging sites that relied on the global printoptions.
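As a hedged illustration of the suggested fix (assuming NumPy is available; `summarize` is an illustrative helper name, not the recipe's code), a single array can be formatted compactly at log time without mutating global state:

```python
import numpy as np


def summarize(arr: np.ndarray) -> str:
    # Compact just this array for a log line; global print options stay untouched.
    return np.array2string(arr, threshold=10)


weights = np.arange(100)
print(summarize(weights))                  # abbreviated form like "[ 0  1  2 ... 97 98 99]"
print(np.get_printoptions()["threshold"])  # process-wide default is unchanged
```

`np.printoptions(...)` as a context manager is an equally local alternative when several arrays share one log statement.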
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@bionemo-recipes/recipes/vllm_inference/llama3/benchmark_common.py`:
- Around line 189-199: In parse_config, validate that repeats is > 0 and that
every value in batch_sizes, prompt_lens, and output_lens parses to an integer >
0 before constructing the BenchmarkConfig; if any check fails, raise a clear
exception (e.g., ValueError) with a descriptive message so the CLI fails fast.
Locate parse_config and the use of args.repeats, args.batch_sizes,
args.prompt_lens, args.output_lens and perform the checks on the parsed lists
(after splitting) and the repeats value, returning the BenchmarkConfig only if
all validations pass.
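A minimal sketch of the fail-fast validation described above; `validate_sweep` and its signature are illustrative stand-ins, not the recipe's actual `parse_config`:

```python
def validate_sweep(
    repeats: int,
    batch_sizes: list[int],
    prompt_lens: list[int],
    output_lens: list[int],
) -> None:
    # Fail fast with a descriptive error instead of crashing mid-benchmark.
    if repeats <= 0:
        raise ValueError(f"repeats must be > 0, got {repeats}")
    for name, values in [
        ("batch_sizes", batch_sizes),
        ("prompt_lens", prompt_lens),
        ("output_lens", output_lens),
    ]:
        if not values:
            raise ValueError(f"{name} must not be empty")
        for v in values:
            if v <= 0:
                raise ValueError(f"{name} entries must be > 0, got {v}")
```

Calling this on the parsed lists before constructing the config makes the CLI reject `--batch-sizes 0,4` immediately rather than deep inside a generate loop.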
- Around line 74-93: The helper currently decodes token_ids then lets vLLM
re-tokenize prompt_str, which can diverge; instead build prompt_str from the
exact token ids so vLLM and HF see identical tokens: after producing token_ids
(token_ids variable in build_prompts), convert those ids to tokens with
tokenizer.convert_ids_to_tokens(token_ids) and then turn that token sequence
into a string using tokenizer.convert_tokens_to_string(...) (or an equivalent
join that preserves token boundaries), then replicate that prompt_str into
prompt_strings and keep input_ids built from token_ids; also ensure
encode/decode calls continue to use add_special_tokens=False so special-token
handling stays consistent.
- Around line 96-116: compute_metrics currently uses the requested output_len
rather than the actual number of tokens produced, which inflates TPOT/throughput
when generation ends early; modify compute_metrics (and its callers) to accept
the actual output token count (e.g., actual_output_len or
output_tokens_per_example) instead of the configured output_len, compute
total_output_tokens = batch_size * actual_output_len, and compute tpot_ms using
max(actual_output_len - 1, 1) with safe zero-division guards; update the
benchmark scripts that call _generate / model.generate / engine.generate to
extract the real generated token counts from the generate return values and pass
those counts into compute_metrics, keeping BenchmarkResult fields consistent
(batch_size, prompt_len, output_len -> actual output length, e2e_ms, ttft_ms,
tpot_ms, throughput_tok_s).
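A hedged sketch of the corrected metric computation, using the actual generated token count rather than the requested one; the `BenchmarkResult` fields mirror those named in the comment, but the code is illustrative, not the recipe's implementation:

```python
from dataclasses import dataclass


@dataclass
class BenchmarkResult:
    batch_size: int
    prompt_len: int
    output_len: int
    e2e_ms: float
    ttft_ms: float
    tpot_ms: float
    throughput_tok_s: float


def compute_metrics(
    e2e_seconds: float,
    ttft_seconds: float,
    batch_size: int,
    actual_output_len: int,
) -> BenchmarkResult:
    # Use tokens actually produced, so early EOS doesn't inflate TPOT/throughput.
    e2e_ms = e2e_seconds * 1000.0
    ttft_ms = ttft_seconds * 1000.0
    tpot_ms = ((e2e_seconds - ttft_seconds) / max(actual_output_len - 1, 1)) * 1000.0
    total_output_tokens = batch_size * actual_output_len
    throughput = total_output_tokens / e2e_seconds if e2e_seconds > 0 else 0.0
    return BenchmarkResult(
        batch_size, 0, actual_output_len, e2e_ms, ttft_ms, tpot_ms, throughput
    )
```

The callers would pass the real count extracted from the `generate` return value (e.g., generated sequence length minus prompt length) instead of the configured `output_len`.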
In `@bionemo-recipes/recipes/vllm_inference/llama3/benchmark_vllm.py`:
- Around line 28-43: DEFAULT_MODEL currently points to
"./llama3_hf_roundtrip_checkpoint" which resolves from the process CWD; change
it to resolve relative to the script file so the default uses the recipe
directory. Import pathlib and set DEFAULT_MODEL = str(Path(__file__).parent /
"llama3_hf_roundtrip_checkpoint"), updating any code that expects a string path;
reference DEFAULT_MODEL and __file__ to locate where to change.
In `@bionemo-recipes/recipes/vllm_inference/llama3/Dockerfile`:
- Around line 11-20: ARG UV_BREAK_SYSTEM_PACKAGES is defined but not exported,
so the subsequent invocations of the wrapper command "uv pip" won’t see it as an
environment variable; update the Dockerfile so the build-time ARG is passed to
the environment before running the vllm install block (for example, set ENV
UV_BREAK_SYSTEM_PACKAGES=$UV_BREAK_SYSTEM_PACKAGES or export it inline before
the "uv pip" commands) so that the "uv pip install -r requirements/build.txt
--system" and "uv pip install --no-build-isolation -e . --system" invocations
receive the flag; ensure this change occurs inside the same conditional that
checks INSTALL_VLLM and before any use of "uv".
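Following the guidance above, a hedged Dockerfile sketch (layout and default values are illustrative; only the quoted `uv pip` commands come from the review) of promoting the build-time ARG to an ENV before the conditional install block:

```dockerfile
ARG INSTALL_VLLM=false
ARG UV_BREAK_SYSTEM_PACKAGES=1
# Promote the build-time ARG to an ENV so the `uv pip` child processes see it.
ENV UV_BREAK_SYSTEM_PACKAGES=${UV_BREAK_SYSTEM_PACKAGES}

RUN if [ "$INSTALL_VLLM" = "true" ]; then \
        uv pip install -r requirements/build.txt --system && \
        uv pip install --no-build-isolation -e . --system ; \
    fi
```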
In `@bionemo-recipes/recipes/vllm_inference/llama3/modeling_llama_te.py`:
- Around line 206-235: The forward() path assumes input_ids is always present
when computing batch_size and padded_seq_len which crashes for
inputs_embeds-only calls; update the logic in modeling_llama_te.forward to
derive batch_size and padded_seq_len from whichever input is populated (use
inputs_embeds.size(0)/size(1) when input_ids is None, falling back to
input_ids), perform that determination before the THD squeeze branch that
modifies hidden_states, and likewise compute lengths using the available tensor
rather than input_ids.shape when computing past_key_values lengths; adjust
references to input_ids.size(1) and input_ids.shape[0] to use the selected
source so inputs_embeds-only calls no longer fail.
- Around line 237-241: The warning condition is inverted: change the check in
the block that computes te_rope_emb (within the torch.autocast context) to warn
when the rotary embedding is NOT float32. Locate the computation of te_rope_emb
via self.rotary_emb(max_seq_len=self.config.max_position_embeddings) and replace
the current if te_rope_emb.dtype == torch.float32 warning with an if
te_rope_emb.dtype != torch.float32 warning so the warning fires on the
incorrect/mismatched dtype rather than the correct hot path.
- Around line 63-80: The method init_empty_weights unconditionally references
self.model but NVLlamaModel stores embed_tokens and rotary_emb on self directly,
causing AttributeError; change the code to pick the real model holder (e.g.,
target = getattr(self, "model", self)) and then operate on target.embed_tokens
and target.rotary_emb (call target.embed_tokens.to_empty(device="cuda");
target.embed_tokens.apply(self._init_weights); set target.rotary_emb.inv_freq =
LlamaRotaryEmbedding(config=target.config).inv_freq.to("cuda")), then call
self.tie_weights() as before so weight tying works for both the original wrapper
and NVLlamaModel.
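The `getattr(self, "model", self)` dispatch suggested above can be sketched with plain objects (purely illustrative; the real classes are `NVLlamaForCausalLM` and `NVLlamaModel`):

```python
class Inner:
    def __init__(self):
        self.embed_tokens = "embeddings"


class Wrapper:
    """CausalLM-style wrapper: submodules hang off self.model."""

    def __init__(self):
        self.model = Inner()


class Bare:
    """NVLlamaModel-style class: submodules live on self directly."""

    def __init__(self):
        self.embed_tokens = "embeddings"


def holder(obj):
    # Resolve whichever object actually owns embed_tokens / rotary_emb.
    return getattr(obj, "model", obj)


assert holder(Wrapper()).embed_tokens == "embeddings"
assert holder(Bare()).embed_tokens == "embeddings"
```

With the target resolved once, the same `to_empty` / `apply(self._init_weights)` calls work for both shapes of the class hierarchy.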
- Around line 263-275: The returned BaseModelOutputWithPast mixes THD and BSHD
layouts when should_pack_inputs is true because only the final hidden_states is
run through _pad_input; fix by converting every entry in all_hidden_states back
to BSHD before returning (use the same _pad_input(indices, batch_size,
padded_seq_len) call on each element), e.g. map/transform the tuple
all_hidden_states to a new tuple of padded tensors when should_pack_inputs and
output_hidden_states are true so last_hidden_state and all entries in
hidden_states share the same BSHD layout.
- Around line 391-416: Remove the process-wide
torch._dynamo.config.capture_scalar_outputs assignment from module scope (do not
set torch._dynamo.config.capture_scalar_outputs = True at import); instead
either (A) move that configuration to the benchmark/test/export entrypoint so it
runs only when those scripts execute, or (B) make the local functions robust to
the default config by refactoring _unpad_input/_pad_input to avoid depending on
capture_scalar_outputs (or wrap compilation calls in a local
torch._dynamo.config.patch/context that sets capture_scalar_outputs only for the
duration of compilation). Locate the module-level assignment and delete it, then
choose one of the two options: update the entrypoint(s) used for
benchmarking/tests to enable capture_scalar_outputs there, or change
_unpad_input/_pad_input so they don’t require the global toggle and, if
necessary, enable the config inside a narrow scoped patch around the call to
torch.compile.
In `@bionemo-recipes/recipes/vllm_inference/llama3/state.py`:
- Around line 574-584: The helper merge functions (e.g., merge_qkv_concat) still
reference Megatron-only fields num_query_groups and kv_channels; update them to
use the new KV-head fields used by TransformFns.merge_qkv: read
num_key_value_heads from ctx.target.config and compute head_size as
ctx.target.config.hidden_size // ctx.target.config.num_attention_heads, then use
those values in the qkv.split sizes (head_num * head_size, num_key_value_heads *
head_size, num_key_value_heads * head_size) and any other merge helpers in the
same area (the block around lines 623–667) so they match
TransformFns.merge_qkv's expectations and avoid AttributeError on NVLlamaConfig.
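The split sizes described above reduce to simple arithmetic over HF-style config fields; the Llama-3-8B numbers below are only an example, and the helper name is illustrative:

```python
def qkv_split_sizes(
    hidden_size: int,
    num_attention_heads: int,
    num_key_value_heads: int,
) -> tuple[int, int, int]:
    # head_size derived from hidden_size, not a Megatron-only kv_channels field.
    head_size = hidden_size // num_attention_heads
    return (
        num_attention_heads * head_size,   # Q rows
        num_key_value_heads * head_size,   # K rows (GQA: fewer KV heads)
        num_key_value_heads * head_size,   # V rows
    )


# Llama-3-8B: hidden 4096, 32 attention heads, 8 KV heads.
print(qkv_split_sizes(4096, 32, 8))  # (4096, 1024, 1024)
```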
In `@bionemo-recipes/recipes/vllm_inference/llama3/tests/test_vllm.py`:
- Around line 34-41: The test module currently imports export_llama3 and
performs .to("cuda") in fixtures unconditionally, causing import/collection
failures when TransformerEngine or CUDA are missing; wrap or guard those imports
and GPU usage: detect TE availability (e.g., try/except import export_llama3 to
set _TE_AVAILABLE) and CUDA availability (torch.cuda.is_available() to set
_CUDA_AVAILABLE), only import or call
HF_MODEL_ID/convert_te_to_vllm/create_te_checkpoint when both _VLLM_AVAILABLE
and _TE_AVAILABLE and _CUDA_AVAILABLE are true, and modify the
hf_reference_outputs and vllm_checkpoint fixtures to skip or return None when
GPU/TE are unavailable (or use pytest.importorskip/pytest.skip at fixture start)
to prevent test collection errors.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 78172a79-8067-41b6-a36a-bdf36eaaa681
📒 Files selected for processing (19)
- .github/workflows/unit-tests-recipes.yml
- .pre-commit-config.yaml
- bionemo-recipes/recipes/vllm_inference/llama3/.ci_build.sh
- bionemo-recipes/recipes/vllm_inference/llama3/Dockerfile
- bionemo-recipes/recipes/vllm_inference/llama3/README.md
- bionemo-recipes/recipes/vllm_inference/llama3/benchmark_common.py
- bionemo-recipes/recipes/vllm_inference/llama3/benchmark_hf.py
- bionemo-recipes/recipes/vllm_inference/llama3/benchmark_vllm.py
- bionemo-recipes/recipes/vllm_inference/llama3/convert.py
- bionemo-recipes/recipes/vllm_inference/llama3/export_llama3.py
- bionemo-recipes/recipes/vllm_inference/llama3/install_vllm.sh
- bionemo-recipes/recipes/vllm_inference/llama3/llama3_transformers_inference_example.py
- bionemo-recipes/recipes/vllm_inference/llama3/llama3_vllm_inference_example.py
- bionemo-recipes/recipes/vllm_inference/llama3/modeling_llama_te.py
- bionemo-recipes/recipes/vllm_inference/llama3/requirements.txt
- bionemo-recipes/recipes/vllm_inference/llama3/state.py
- bionemo-recipes/recipes/vllm_inference/llama3/tests/conftest.py
- bionemo-recipes/recipes/vllm_inference/llama3/tests/test_vllm.py
- ci/scripts/check_copied_files.py
```python
def build_prompts(
    tokenizer: PreTrainedTokenizerBase,
    batch_size: int,
    prompt_length: int,
) -> tuple[list[str], torch.Tensor]:
    """Generate deterministic synthetic prompts of exactly *prompt_length* tokens.

    Returns:
        A tuple of (prompt_strings, input_ids_tensor).
        *prompt_strings* is a list[str] of length *batch_size* (for vLLM).
        *input_ids_tensor* is a (batch_size, prompt_length) int64 tensor (for HF).
        Both represent byte-identical inputs.
    """
    repeated = STOCK_TEXT * ((prompt_length // 10) + 2)
    token_ids = tokenizer.encode(repeated, add_special_tokens=False)[:prompt_length]
    prompt_str = tokenizer.decode(token_ids)

    prompt_strings = [prompt_str] * batch_size
    input_ids = torch.tensor([token_ids] * batch_size, dtype=torch.long)
    return prompt_strings, input_ids
```
The "identical inputs" guarantee is not enforced here.
HF consumes input_ids directly, while the vLLM path re-tokenizes prompt_str. Even with the same tokenizer model, the decode/re-encode cycle can introduce drift through special-token handling, whitespace normalization, or subword tokenization differences, meaning the two backends may benchmark different prompt tokens while this helper reports them as equivalent.
```python
def compute_metrics(
    e2e_seconds: float,
    ttft_seconds: float,
    batch_size: int,
    output_len: int,
) -> BenchmarkResult:
    """Derive TPOT and throughput from raw wall-clock timings."""
    e2e_ms = e2e_seconds * 1000.0
    ttft_ms = ttft_seconds * 1000.0
    tpot_ms = ((e2e_seconds - ttft_seconds) / max(output_len - 1, 1)) * 1000.0
    total_output_tokens = batch_size * output_len
    throughput = total_output_tokens / e2e_seconds if e2e_seconds > 0 else 0.0
    return BenchmarkResult(
        batch_size=batch_size,
        prompt_len=0,
        output_len=output_len,
        e2e_ms=e2e_ms,
        ttft_ms=ttft_ms,
        tpot_ms=tpot_ms,
        throughput_tok_s=throughput,
    )
```
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Scripts executed:

```shell
# First, let's check if the file exists and examine the relevant section
fd -p "benchmark_common.py" -t f
# Find the file and read the relevant section
find . -name "benchmark_common.py" -type f 2>/dev/null | head -5
# Search for the function in the codebase
rg "def compute_metrics" -A 30
# Also search for where compute_metrics is called
rg "compute_metrics\(" -B 3 -A 3
# Get more context around the compute_metrics calls
rg "compute_metrics\(" -B 20 -A 5 "bionemo-recipes/recipes/vllm_inference/llama3/"
# Check the full benchmark_vllm.py file to understand the generation flow
wc -l ./bionemo-recipes/recipes/vllm_inference/llama3/benchmark_vllm.py
# Read the relevant section of benchmark_vllm.py
sed -n '1,200p' ./bionemo-recipes/recipes/vllm_inference/llama3/benchmark_vllm.py
# Also check benchmark_hf.py
wc -l ./bionemo-recipes/recipes/vllm_inference/llama3/benchmark_hf.py
# Read benchmark_hf.py to understand the generation flow
sed -n '1,200p' ./bionemo-recipes/recipes/vllm_inference/llama3/benchmark_hf.py
# Check if there's any code that processes the output of generate() to get actual token counts
rg "\.generate\(" -A 10 "bionemo-recipes/recipes/vllm_inference/llama3/benchmark"
# Check the entire benchmark_common.py file to see full context
cat ./bionemo-recipes/recipes/vllm_inference/llama3/benchmark_common.py
```
Pass actual token counts to compute_metrics instead of requested lengths.
compute_metrics() uses the requested output_len to compute TPOT and throughput, but language models may generate fewer tokens when they produce end-of-sequence (EOS). This inflates both metrics whenever generation stops early. The _generate() calls in both benchmark scripts discard the generation outputs and only time the execution, so actual token counts are never captured. To fix this, extract the actual number of tokens generated from the model.generate() and engine.generate() return values and pass those counts to compute_metrics() instead of the configured output_len.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@bionemo-recipes/recipes/vllm_inference/llama3/benchmark_common.py` around
lines 96 - 116, compute_metrics currently uses the requested output_len rather
than the actual number of tokens produced, which inflates TPOT/throughput when
generation ends early; modify compute_metrics (and its callers) to accept the
actual output token count (e.g., actual_output_len or output_tokens_per_example)
instead of the configured output_len, compute total_output_tokens = batch_size *
actual_output_len, and compute tpot_ms using max(actual_output_len - 1, 1) with
safe zero-division guards; update the benchmark scripts that call _generate /
model.generate / engine.generate to extract the real generated token counts from
the generate return values and pass those counts into compute_metrics, keeping
BenchmarkResult fields consistent (batch_size, prompt_len, output_len -> actual
output length, e2e_ms, ttft_ms, tpot_ms, throughput_tok_s).
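The change the prompt describes can be sketched in plain Python. `BenchmarkResult` is trimmed here to the fields used by the calculation, and `actual_output_len` is the hypothetical new parameter name suggested above; this is a sketch of the reviewer's suggestion, not the recipe's implementation.

```python
from dataclasses import dataclass


@dataclass
class BenchmarkResult:
    """Trimmed stand-in for the recipe's result record."""
    batch_size: int
    output_len: int
    e2e_ms: float
    ttft_ms: float
    tpot_ms: float
    throughput_tok_s: float


def compute_metrics(e2e_seconds, ttft_seconds, batch_size, actual_output_len):
    """Derive TPOT and throughput from the tokens actually generated, not the requested count."""
    tpot_ms = ((e2e_seconds - ttft_seconds) / max(actual_output_len - 1, 1)) * 1000.0
    total_output_tokens = batch_size * actual_output_len
    throughput = total_output_tokens / e2e_seconds if e2e_seconds > 0 else 0.0
    return BenchmarkResult(
        batch_size=batch_size,
        output_len=actual_output_len,
        e2e_ms=e2e_seconds * 1000.0,
        ttft_ms=ttft_seconds * 1000.0,
        tpot_ms=tpot_ms,
        throughput_tok_s=throughput,
    )


# Generation hit EOS at 10 tokens even though 128 were requested:
r = compute_metrics(e2e_seconds=2.0, ttft_seconds=0.2, batch_size=4, actual_output_len=10)
print(round(r.tpot_ms), round(r.throughput_tok_s))  # 200 20
```

Passing the requested 128 here instead would report a 12x higher throughput for the same wall-clock run, which is exactly the inflation the finding describes.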
```python
def parse_config(args: argparse.Namespace) -> BenchmarkConfig:
    """Convert parsed CLI args into a BenchmarkConfig."""
    return BenchmarkConfig(
        model=args.model,
        batch_sizes=[int(x) for x in args.batch_sizes.split(",")],
        prompt_lens=[int(x) for x in args.prompt_lens.split(",")],
        output_lens=[int(x) for x in args.output_lens.split(",")],
        warmup=args.warmup,
        repeats=args.repeats,
        csv_path=args.csv_path,
    )
```
Reject non-positive sweep parameters up front.
--repeats 0 currently falls through to statistics.median([]), and zero/negative batch or token lengths fail later with much less actionable errors. Validate them here so the CLI fails fast.
🛠️ Suggested validation

```diff
 def parse_config(args: argparse.Namespace) -> BenchmarkConfig:
     """Convert parsed CLI args into a BenchmarkConfig."""
+    def _parse_positive_csv(flag_name: str, raw: str) -> list[int]:
+        values = [int(x) for x in raw.split(",")]
+        if any(value <= 0 for value in values):
+            raise ValueError(f"{flag_name} must contain only positive integers")
+        return values
+
+    if args.warmup < 0:
+        raise ValueError("--warmup must be >= 0")
+    if args.repeats <= 0:
+        raise ValueError("--repeats must be > 0")
+
     return BenchmarkConfig(
         model=args.model,
-        batch_sizes=[int(x) for x in args.batch_sizes.split(",")],
-        prompt_lens=[int(x) for x in args.prompt_lens.split(",")],
-        output_lens=[int(x) for x in args.output_lens.split(",")],
+        batch_sizes=_parse_positive_csv("--batch-sizes", args.batch_sizes),
+        prompt_lens=_parse_positive_csv("--prompt-lens", args.prompt_lens),
+        output_lens=_parse_positive_csv("--output-lens", args.output_lens),
         warmup=args.warmup,
         repeats=args.repeats,
         csv_path=args.csv_path,
     )
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@bionemo-recipes/recipes/vllm_inference/llama3/benchmark_common.py` around
lines 189 - 199, In parse_config, validate that repeats is > 0 and that every
value in batch_sizes, prompt_lens, and output_lens parses to an integer > 0
before constructing the BenchmarkConfig; if any check fails, raise a clear
exception (e.g., ValueError) with a descriptive message so the CLI fails fast.
Locate parse_config and the use of args.repeats, args.batch_sizes,
args.prompt_lens, args.output_lens and perform the checks on the parsed lists
(after splitting) and the repeats value, returning the BenchmarkConfig only if
all validations pass.
```python
import argparse
import itertools

from benchmark_common import (
    add_common_args,
    build_prompts,
    compute_metrics,
    median_timing,
    parse_config,
    print_results,
    write_csv,
)
from vllm import LLM, SamplingParams


DEFAULT_MODEL = "./llama3_hf_roundtrip_checkpoint"
```
Resolve the default checkpoint from __file__, not the current working directory.
export_llama3.py writes its default output under the recipe directory, but this default points at ./llama3_hf_roundtrip_checkpoint. Running the benchmark from repo root with an explicit script path will miss the exported checkpoint unless --model is passed.
🛠️ Suggested fix

```diff
 import argparse
 import itertools
+from pathlib import Path
+
+from vllm import LLM, SamplingParams
+
 from benchmark_common import (
     add_common_args,
     build_prompts,
     compute_metrics,
     median_timing,
     parse_config,
     print_results,
     write_csv,
 )
-from vllm import LLM, SamplingParams
-
-DEFAULT_MODEL = "./llama3_hf_roundtrip_checkpoint"
+DEFAULT_MODEL = str(Path(__file__).resolve().parent / "llama3_hf_roundtrip_checkpoint")
```

🤖 Prompt for AI Agents
+DEFAULT_MODEL = str(Path(__file__).resolve().parent / "llama3_hf_roundtrip_checkpoint")🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@bionemo-recipes/recipes/vllm_inference/llama3/benchmark_vllm.py` around lines
28 - 43, DEFAULT_MODEL currently points to "./llama3_hf_roundtrip_checkpoint"
which resolves from the process CWD; change it to resolve relative to the script
file so the default uses the recipe directory. Import pathlib and set
DEFAULT_MODEL = str(Path(__file__).parent / "llama3_hf_roundtrip_checkpoint"),
updating any code that expects a string path; reference DEFAULT_MODEL and
__file__ to locate where to change.
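The `__file__`-relative default can be checked in isolation. This sketch only assumes the checkpoint directory name used in the recipe; the point is that the resulting path is absolute and therefore independent of the process working directory.

```python
from pathlib import Path

# Resolving against __file__ anchors the default to the script's own directory,
# so launching the benchmark from the repo root (or anywhere else) still finds it.
DEFAULT_MODEL = str(Path(__file__).resolve().parent / "llama3_hf_roundtrip_checkpoint")

print(Path(DEFAULT_MODEL).is_absolute())  # True
```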
```dockerfile
ARG UV_BREAK_SYSTEM_PACKAGES=1
RUN if [ "$INSTALL_VLLM" = "true" ]; then \
        if [ -z "$TORCH_CUDA_ARCH_LIST" ]; then \
            echo "ERROR: TORCH_CUDA_ARCH_LIST must be set when INSTALL_VLLM=true" && exit 1; \
        fi && \
        git clone --branch v0.15.1 --depth 1 https://github.com/vllm-project/vllm.git && \
        cd vllm && \
        python use_existing_torch.py && \
        uv pip install -r requirements/build.txt --system && \
        uv pip install --no-build-isolation -e . --system && \
```
UV_BREAK_SYSTEM_PACKAGES ARG may not be visible to uv commands.
ARG values are only available during build as build-time variables, but uv reads UV_BREAK_SYSTEM_PACKAGES from the environment. The ARG needs to be exported as an ENV or set inline for the uv commands.
🐛 Proposed fix

```diff
 ARG UV_BREAK_SYSTEM_PACKAGES=1
+ENV UV_BREAK_SYSTEM_PACKAGES=${UV_BREAK_SYSTEM_PACKAGES}
 RUN if [ "$INSTALL_VLLM" = "true" ]; then \
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@bionemo-recipes/recipes/vllm_inference/llama3/Dockerfile` around lines 11 -
20, ARG UV_BREAK_SYSTEM_PACKAGES is defined but not exported, so the subsequent
invocations of the wrapper command "uv pip" won’t see it as an environment
variable; update the Dockerfile so the build-time ARG is passed to the
environment before running the vllm install block (for example, set ENV
UV_BREAK_SYSTEM_PACKAGES=$UV_BREAK_SYSTEM_PACKAGES or export it inline before
the "uv pip" commands) so that the "uv pip install -r requirements/build.txt
--system" and "uv pip install --no-build-isolation -e . --system" invocations
receive the flag; ensure this change occurs inside the same conditional that
checks INSTALL_VLLM and before any use of "uv".
```python
# Ensure that rotary embeddings are computed with at a higher precision
with torch.autocast(device_type="cuda", enabled=False):
    te_rope_emb = self.rotary_emb(max_seq_len=self.config.max_position_embeddings)
    if te_rope_emb.dtype == torch.float32:
        warnings.warn("Rotary embeddings should be in float32 for optimal performance.", UserWarning)
```
Invert this warning condition.
Line 240 warns when the rotary embedding is already float32, so the hot path emits a false warning on the optimal dtype and never flags the real mismatch case.
Suggested fix

```diff
-    if te_rope_emb.dtype == torch.float32:
+    if te_rope_emb.dtype != torch.float32:
         warnings.warn("Rotary embeddings should be in float32 for optimal performance.", UserWarning)
```

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

```python
# Ensure that rotary embeddings are computed with at a higher precision
with torch.autocast(device_type="cuda", enabled=False):
    te_rope_emb = self.rotary_emb(max_seq_len=self.config.max_position_embeddings)
    if te_rope_emb.dtype != torch.float32:
        warnings.warn("Rotary embeddings should be in float32 for optimal performance.", UserWarning)
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@bionemo-recipes/recipes/vllm_inference/llama3/modeling_llama_te.py` around
lines 237 - 241, The warning condition is inverted: change the check in the
block that computes te_rope_emb (within the torch.autocast context) to warn when
the rotary embedding is NOT float32. Locate the computation of te_rope_emb via
self.rotary_emb(max_seq_len=self.config.max_position_embeddings) and replace the
current if te_rope_emb.dtype == torch.float32 warning with an if
te_rope_emb.dtype != torch.float32 warning so the warning fires on the
incorrect/mismatched dtype rather than the correct hot path.
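The corrected guard can be exercised without torch by standing in a dtype name string for the real `torch.dtype`; `check_rope_dtype` is a hypothetical helper invented for this sketch.

```python
import warnings


def check_rope_dtype(dtype_name: str) -> None:
    """Warn only when the rotary embedding is NOT float32 -- the corrected condition."""
    if dtype_name != "float32":
        warnings.warn("Rotary embeddings should be in float32 for optimal performance.", UserWarning)


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    check_rope_dtype("float32")   # optimal path: silent
    check_rope_dtype("bfloat16")  # mismatch: warns
print(len(caught))  # 1
```

With the original inverted condition, the count would be 1 for the `float32` call and 0 for the mismatch, which is the false-warning behavior the review flags.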
```python
# add hidden states from the last decoder layer. Note that these will be in THD format; we could possibly pad
# these with the same _pad_input call as below if we wanted them returned in BSHD format.
if output_hidden_states:
    all_hidden_states = (*all_hidden_states, hidden_states)

if should_pack_inputs:
    # If we've converted BSHD to THD for our TE layers, we need to convert back to BSHD for the output.
    hidden_states = _pad_input(hidden_states, indices, batch_size, padded_seq_len)

return BaseModelOutputWithPast(
    last_hidden_state=hidden_states,
    past_key_values=past_key_values,
    hidden_states=all_hidden_states if output_hidden_states else None,
```
Re-pad hidden_states before returning them.
When should_pack_inputs is true, last_hidden_state is converted back to BSHD at Line 270, but every entry accumulated in hidden_states stays THD. That mixes layouts inside one output object and breaks callers that index hidden states by batch/sequence.
Suggested fix

```diff
 if should_pack_inputs:
     # If we've converted BSHD to THD for our TE layers, we need to convert back to BSHD for the output.
+    if output_hidden_states:
+        all_hidden_states = tuple(
+            _pad_input(state, indices, batch_size, padded_seq_len) for state in all_hidden_states
+        )
     hidden_states = _pad_input(hidden_states, indices, batch_size, padded_seq_len)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@bionemo-recipes/recipes/vllm_inference/llama3/modeling_llama_te.py` around
lines 263 - 275, The returned BaseModelOutputWithPast mixes THD and BSHD layouts
when should_pack_inputs is true because only the final hidden_states is run
through _pad_input; fix by converting every entry in all_hidden_states back to
BSHD before returning (use the same _pad_input(indices, batch_size,
padded_seq_len) call on each element), e.g. map/transform the tuple
all_hidden_states to a new tuple of padded tensors when should_pack_inputs and
output_hidden_states are true so last_hidden_state and all entries in
hidden_states share the same BSHD layout.
```python
torch._dynamo.config.capture_scalar_outputs = True


@torch.compile
def _pad_input(hidden_states, indices, batch, seqlen):
    """Convert a THD tensor to a BSHD equivalent tensor.

    Adapted from huggingface/transformers/modeling_flash_attention_utils.py

    Arguments:
        hidden_states: (total_nnz, ...), where total_nnz = number of tokens in selected in attention_mask.
        indices: (total_nnz), the indices that represent the non-masked tokens of the original padded input sequence.
        batch: int, batch size for the padded sequence.
        seqlen: int, maximum sequence length for the padded sequence.

    Return:
        hidden_states: (batch, seqlen, ...)
    """
    dim = hidden_states.shape[1:]
    output = torch.zeros((batch * seqlen), *dim, device=hidden_states.device, dtype=hidden_states.dtype)
    output[indices] = hidden_states
    return output.view(batch, seqlen, *dim)


@torch.compile
def _unpad_input(hidden_states, attention_mask, unused_mask=None):
```
🧩 Analysis chain
🏁 Script executed:

```shell
#!/bin/bash
# Verify this is a module-scope TorchDynamo config mutation and see who imports the TE Llama module.
rg -n --type=py '^\s*torch\._dynamo\.config\.capture_scalar_outputs\s*='
rg -nP --type=py 'from\s+.*modeling_llama_te\s+import|import\s+.*modeling_llama_te'
```
Avoid a process-wide TorchDynamo toggle in module scope.
Setting torch._dynamo.config.capture_scalar_outputs = True at import time changes compilation behavior for every other model in the same process that imports this module. This file is imported by at least 11 files across tests, training, and export scripts, making the global configuration impact widespread. Move this setting to the benchmark/test entrypoint, or refactor _unpad_input() so it doesn't require a global configuration change.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@bionemo-recipes/recipes/vllm_inference/llama3/modeling_llama_te.py` around
lines 391 - 416, Remove the process-wide
torch._dynamo.config.capture_scalar_outputs assignment from module scope (do not
set torch._dynamo.config.capture_scalar_outputs = True at import); instead
either (A) move that configuration to the benchmark/test/export entrypoint so it
runs only when those scripts execute, or (B) make the local functions robust to
the default config by refactoring _unpad_input/_pad_input to avoid depending on
capture_scalar_outputs (or wrap compilation calls in a local
torch._dynamo.config.patch/context that sets capture_scalar_outputs only for the
duration of compilation). Locate the module-level assignment and delete it, then
choose one of the two options: update the entrypoint(s) used for
benchmarking/tests to enable capture_scalar_outputs there, or change
_unpad_input/_pad_input so they don’t require the global toggle and, if
necessary, enable the config inside a narrow scoped patch around the call to
torch.compile.
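The scatter-then-reshape trick behind `_pad_input` is easy to see in plain Python, with scalars standing in for hidden-state rows (toy sketch, no torch required).

```python
def pad_input(packed, indices, batch, seqlen):
    """Scatter packed (THD) tokens back into a zero-filled (batch, seqlen) grid."""
    flat = [0] * (batch * seqlen)
    for idx, value in zip(indices, packed):
        flat[idx] = value  # place each real token at its original flat position
    # reshape the flat buffer into batch rows of length seqlen
    return [flat[b * seqlen : (b + 1) * seqlen] for b in range(batch)]


# Two sequences of lengths 3 and 2, packed without padding:
packed = [11, 12, 13, 21, 22]
indices = [0, 1, 2, 4, 5]  # positions of the real tokens in the flat padded layout
print(pad_input(packed, indices, batch=2, seqlen=4))
# [[11, 12, 13, 0], [21, 22, 0, 0]]
```

The torch version does the same thing with `output[indices] = hidden_states` followed by `view(batch, seqlen, ...)`; `capture_scalar_outputs` only matters because the unpad side derives scalar lengths from the attention mask inside the compiled region.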
```python
def merge_qkv_concat(ctx: TransformCTX, qkv: torch.Tensor):
    """Merge naively concatenated q, k, v to interleave-concatenated qkv.

    Example: import HF qkv to layer linear_qkv
    """
    megatron_config = ctx.target.config
    head_num = megatron_config.num_attention_heads
    num_query_groups = megatron_config.num_query_groups
    head_size = megatron_config.kv_channels
    q, k, v = qkv.split([head_num * head_size, num_query_groups * head_size, num_query_groups * head_size], dim=0)
    return TransformFns.merge_qkv(ctx, q, k, v)
```
Use the new KV-head field names in these merge helpers.
This block still reads Megatron-only num_query_groups / kv_channels, while TransformFns.merge_qkv() just below already uses num_key_value_heads plus hidden_size // num_attention_heads. The new NVLlamaConfig in bionemo-recipes/recipes/vllm_inference/llama3/modeling_llama_te.py Lines 45-52 follows that newer shape, so these import paths will raise AttributeError as soon as they target the new TE model.
Suggested fix
```diff
 def merge_qkv_concat(ctx: TransformCTX, qkv: torch.Tensor):
     """Merge naively concatenated q, k, v to interleave-concatenated qkv.

     Example: import HF qkv to layer linear_qkv
     """
     megatron_config = ctx.target.config
     head_num = megatron_config.num_attention_heads
-    num_query_groups = megatron_config.num_query_groups
-    head_size = megatron_config.kv_channels
-    q, k, v = qkv.split([head_num * head_size, num_query_groups * head_size, num_query_groups * head_size], dim=0)
+    num_key_value_heads = megatron_config.num_key_value_heads
+    head_size = megatron_config.hidden_size // head_num
+    q, k, v = qkv.split(
+        [head_num * head_size, num_key_value_heads * head_size, num_key_value_heads * head_size], dim=0
+    )
     return TransformFns.merge_qkv(ctx, q, k, v)
@@
 def merge_qkv_bias_concat(ctx: TransformCTX, qkv_bias: torch.Tensor):
@@
     megatron_config = ctx.target.config
     head_num = megatron_config.num_attention_heads
-    num_query_groups = megatron_config.num_query_groups
-    head_size = megatron_config.kv_channels
+    num_key_value_heads = megatron_config.num_key_value_heads
+    head_size = megatron_config.hidden_size // head_num
     qb, kb, vb = qkv_bias.split(
-        [head_num * head_size, num_query_groups * head_size, num_query_groups * head_size], dim=0
+        [head_num * head_size, num_key_value_heads * head_size, num_key_value_heads * head_size], dim=0
     )
     return TransformFns.merge_qkv_bias(ctx, qb, kb, vb)
@@
 def merge_qkv_bias(ctx: TransformCTX, qb: torch.Tensor, kb: torch.Tensor, vb: torch.Tensor):
@@
     megatron_config = ctx.target.config
     head_num = megatron_config.num_attention_heads
-    num_query_groups = megatron_config.num_query_groups
-    heads_per_group = head_num // num_query_groups
-    head_size = megatron_config.kv_channels
+    num_key_value_heads = megatron_config.num_key_value_heads
+    heads_per_group = head_num // num_key_value_heads
+    head_size = megatron_config.hidden_size // head_num
     new_q_tensor_shape = (head_num, head_size)
-    new_kv_tensor_shape = (num_query_groups, head_size)
+    new_kv_tensor_shape = (num_key_value_heads, head_size)
@@
-    for i in range(num_query_groups):
+    for i in range(num_key_value_heads):
         qkv_bias = torch.cat((qkv_bias, qb[i * heads_per_group : (i + 1) * heads_per_group, :]))
         qkv_bias = torch.cat((qkv_bias, kb[i : i + 1, :]))
         qkv_bias = torch.cat((qkv_bias, vb[i : i + 1, :]))
     qkv_bias = qkv_bias.reshape(
         [
-            head_size * (head_num + 2 * num_query_groups),
+            head_size * (head_num + 2 * num_key_value_heads),
         ]
     )
     return qkv_bias
```

Also applies to: 623-667
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@bionemo-recipes/recipes/vllm_inference/llama3/state.py` around lines 574 -
584, The helper merge functions (e.g., merge_qkv_concat) still reference
Megatron-only fields num_query_groups and kv_channels; update them to use the
new KV-head fields used by TransformFns.merge_qkv: read num_key_value_heads from
ctx.target.config and compute head_size as ctx.target.config.hidden_size //
ctx.target.config.num_attention_heads, then use those values in the qkv.split
sizes (head_num * head_size, num_key_value_heads * head_size,
num_key_value_heads * head_size) and any other merge helpers in the same area
(the block around lines 623–667) so they match TransformFns.merge_qkv's
expectations and avoid AttributeError on NVLlamaConfig.
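The grouped `[q…q, k, v]` interleaving these helpers feed into can be illustrated with labeled heads. This is a toy sketch (one string per head row, lists instead of tensors), not the recipe's `TransformFns.merge_qkv`.

```python
def merge_qkv(q_heads, k_heads, v_heads, num_attention_heads, num_key_value_heads):
    """Interleave per-head Q rows with their shared K/V heads, one group per KV head."""
    heads_per_group = num_attention_heads // num_key_value_heads
    merged = []
    for g in range(num_key_value_heads):
        # all query heads that attend to this KV group, then its K and V heads
        merged += q_heads[g * heads_per_group : (g + 1) * heads_per_group]
        merged.append(k_heads[g])
        merged.append(v_heads[g])
    return merged


# 4 query heads sharing 2 KV heads -> groups of [q, q, k, v]
print(merge_qkv(["q0", "q1", "q2", "q3"], ["k0", "k1"], ["v0", "v1"], 4, 2))
# ['q0', 'q1', 'k0', 'v0', 'q2', 'q3', 'k1', 'v1']
```

The split sizes in the suggested fix (`head_num * head_size`, then `num_key_value_heads * head_size` twice) are what carve a naively concatenated HF tensor into the `q_heads`/`k_heads`/`v_heads` inputs of this interleave.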
```python
try:
    from vllm import LLM, SamplingParams

    _VLLM_AVAILABLE = True
except ImportError:
    _VLLM_AVAILABLE = False

from export_llama3 import HF_MODEL_ID, convert_te_to_vllm, create_te_checkpoint
```
🧩 Analysis chain
🏁 Scripts executed:

```shell
cat -n bionemo-recipes/recipes/vllm_inference/llama3/tests/test_vllm.py | head -60
cat -n bionemo-recipes/recipes/vllm_inference/llama3/tests/test_vllm.py | sed -n '52,100p'
cat -n bionemo-recipes/recipes/vllm_inference/llama3/tests/test_vllm.py | sed -n '100,160p'
# Check if there are any other environment guards or CUDA checks in the file
cat -n bionemo-recipes/recipes/vllm_inference/llama3/tests/test_vllm.py | sed -n '1,200p' | grep -i "cuda\|skip\|torch.cuda"
# Check if export_llama3 is defined in the same directory
find bionemo-recipes/recipes/vllm_inference/llama3 -name "export_llama3.py" -o -name "export_llama3" -type d
cat -n bionemo-recipes/recipes/vllm_inference/llama3/export_llama3.py | head -80
cat -n bionemo-recipes/recipes/vllm_inference/llama3/export_llama3.py | sed -n '30,100p'
# Check if convert.py and modeling_llama_te.py exist in the same directory
ls -la bionemo-recipes/recipes/vllm_inference/llama3/ | grep -E "(convert|modeling)"
# Check if there are conftest.py files that might provide fixtures or setup
find bionemo-recipes/recipes/vllm_inference/llama3 -name "conftest.py"
cat -n bionemo-recipes/recipes/vllm_inference/llama3/tests/conftest.py
wc -l bionemo-recipes/recipes/vllm_inference/llama3/tests/conftest.py
```
Module-level environment guards are needed for GPU and export_llama3 dependencies.
The conditional vllm import (lines 34–39) only checks for that package. However, export_llama3 is imported unconditionally at line 41, and both the hf_reference_outputs and vllm_checkpoint fixtures unconditionally call .to("cuda") and model loading functions. During test collection, pytest evaluates fixtures before @pytest.mark.skipif filters are applied, so test discovery will fail on environments without CUDA or missing TransformerEngine dependencies, even if vllm is unavailable.
Suggested guard pattern

```diff
 try:
     from vllm import LLM, SamplingParams

     _VLLM_AVAILABLE = True
 except ImportError:
     _VLLM_AVAILABLE = False

-from export_llama3 import HF_MODEL_ID, convert_te_to_vllm, create_te_checkpoint
+if not _VLLM_AVAILABLE:
+    pytest.skip("requires vllm", allow_module_level=True)
+if not torch.cuda.is_available():
+    pytest.skip("requires CUDA", allow_module_level=True)
+
+try:
+    from export_llama3 import HF_MODEL_ID, convert_te_to_vllm, create_te_checkpoint
+except ImportError as exc:
+    pytest.skip(f"requires export_llama3 dependencies: {exc}", allow_module_level=True)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@bionemo-recipes/recipes/vllm_inference/llama3/tests/test_vllm.py` around
lines 34 - 41, The test module currently imports export_llama3 and performs
.to("cuda") in fixtures unconditionally, causing import/collection failures when
TransformerEngine or CUDA are missing; wrap or guard those imports and GPU
usage: detect TE availability (e.g., try/except import export_llama3 to set
_TE_AVAILABLE) and CUDA availability (torch.cuda.is_available() to set
_CUDA_AVAILABLE), only import or call
HF_MODEL_ID/convert_te_to_vllm/create_te_checkpoint when both _VLLM_AVAILABLE
and _TE_AVAILABLE and _CUDA_AVAILABLE are true, and modify the
hf_reference_outputs and vllm_checkpoint fixtures to skip or return None when
GPU/TE are unavailable (or use pytest.importorskip/pytest.skip at fixture start)
to prevent test collection errors.
Codecov Report
✅ All modified and coverable lines are covered by tests.
Additional details and impacted files

```
@@           Coverage Diff           @@
##             main    #1501   +/-   ##
=======================================
  Coverage   76.45%   76.45%
=======================================
  Files         102      102
  Lines        7952     7952
=======================================
  Hits         6080     6080
  Misses       1872     1872
```
Description
Adds golden value tests, inference scripts, and benchmark scripts to validate Llama3 inference with vLLM. Also adds hooks for CI.
Usage
```shell
python export_llama3.py
pytest -s -v tests/test_vllm.py
```

More usage in the readme
Type of changes
CI Pipeline Configuration
Configure CI behavior by applying the relevant labels. By default, only basic unit tests are run.
Unit tests marked as `@pytest.mark.multi_gpu` or `@pytest.mark.distributed` are not run in the PR pipeline. For more details, see CONTRIBUTING.
Note
By default, only basic unit tests are run. Add appropriate labels to enable an additional test coverage.
Authorizing CI Runs
We use copy-pr-bot to manage authorization of CI
runs on NVIDIA's compute resources.
Pull requests will automatically be copied to a `pull-request/` prefixed branch in the source repository (e.g. `pull-request/123`)
An `/ok to test` comment on the pull request is required to trigger CI. This will need to be done for each new commit.
Triggering Code Rabbit AI Review
To trigger a code review from code rabbit, comment on a pull request with one of these commands:
See https://docs.coderabbit.ai/reference/review-commands for a full list of commands.
Pre-submit Checklist
Summary by CodeRabbit
Release Notes
New Features
Documentation
Tests
Chores