Conversation

@raulchen
Contributor

Summary

  • Add last_token_logits_only parameter to Llama3 and Qwen3 models to avoid materializing the full [B, T, V] logits tensor during prefill
  • When prompt_logprobs=False (the common case), only compute logits for the last token
  • Add parametrized tests for both models verifying output shape and generation equivalence

Motivation

During prefill, only the last token's logits are needed to start decoding. Computing logits for all prompt tokens requires a
[B, T, V] matmul where V (vocab size) is typically 32K-128K. This is wasteful when prompt_logprobs is not requested.
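
A minimal sketch of the idea, assuming a typical decoder forward pass (the names below are illustrative, not the PR's actual code):

    # Hypothetical helper showing where the savings come from; lm_head is
    # the vocabulary projection (hidden size H -> vocab size V).
    def compute_logits(hidden_states, lm_head, last_token_logits_only=False):
        # hidden_states: [B, T, H]
        if last_token_logits_only:
            # Slice before the projection: the matmul becomes [B, 1, H] @ [H, V]
            # instead of [B, T, H] @ [H, V], so the full [B, T, V] tensor is
            # never materialized.
            hidden_states = hidden_states[:, -1:, :]
        return lm_head(hidden_states)  # [B, 1, V] or [B, T, V]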

This optimization benefits:

  • Standard inference/chat (most common)
  • RL training (only generation logprobs needed, not prompt logprobs)

Test plan

  • test_last_token_logits_only[llama3] - verifies output shape and generation equivalence
  • test_last_token_logits_only[qwen3] - verifies output shape and generation equivalence (see the sketch after this list)
  • Existing generator tests pass
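
A rough sketch of what the parametrized test could look like (pytest; build_model_and_inputs is a hypothetical helper, and the real test in test_models_common.py may differ):

    import pytest
    import torch

    @pytest.mark.parametrize("model_name", ["llama3", "qwen3"])
    def test_last_token_logits_only(model_name):
        model, input_ids = build_model_and_inputs(model_name)  # hypothetical helper
        full = model(input_ids)                               # [B, T, V]
        last = model(input_ids, last_token_logits_only=True)  # [B, 1, V]
        assert last.shape == (full.shape[0], 1, full.shape[-1])
        # Last-token logits must match the final position of the full logits.
        torch.testing.assert_close(last[:, 0], full[:, -1])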

raulchen and others added 3 commits January 14, 2026 12:00
During prefill, only the last token's logits are needed to start
decoding. This optimization avoids materializing the full [B, T, V]
logits tensor when prompt_logprobs is not requested.

Add parametrized test in test_models_common.py that verifies
both llama3 and qwen3 models produce correct output shape and
matching logits when using last_token_logits_only=True. Also
tests generation equivalence with and without prompt_logprobs.

Co-Authored-By: Claude Opus 4.5 <[email protected]>

@gemini-code-assist bot left a comment

Code Review

This pull request introduces a valuable optimization to skip the computation of full logits during the prefill stage when prompt log probabilities are not required. The changes are implemented by adding a last_token_logits_only parameter to the Llama3 and Qwen3 models and leveraging it in the generator's prefill logic. The implementation is clean and the logic is sound. A new parametrized test file is added, which thoroughly validates the new functionality by checking output shapes, value equivalence, and generation consistency. I have one suggestion to improve the clarity of the new test code.

Comment on lines 37 to 41
        base_config = (
            AutoConfig.from_pretrained(model_name)
            if config_cls == Llama3Config
            else PretrainedConfig.from_pretrained(model_name)
        )

Severity: medium

The conditional logic for loading the base configuration is a bit confusing and contains unreachable code. Both Llama3Config and Qwen3Config are aliases for ModelConfig, so the condition config_cls == Llama3Config will be true for both parameterized test cases, making the else branch dead code.

To simplify and improve clarity, you can consistently use AutoConfig.from_pretrained(model_name), which is the standard approach for loading configurations from the HuggingFace Hub.

        base_config = AutoConfig.from_pretrained(model_name)
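
For context, AutoConfig.from_pretrained dispatches to the appropriate config class based on the model_type field in the checkpoint's config.json, so a single call covers both the Llama and Qwen test cases.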

@raulchen
Contributor Author

/gemini review


@gemini-code-assist bot left a comment

Code Review

This pull request introduces a valuable optimization to skip computing full logits during prefill when they are not needed, controlled by a new last_token_logits_only parameter. The implementation is clean and correctly applied to both Llama3 and Qwen3 models. The logic is correctly wired into the generator utility, activating the optimization when prompt_logprobs is false. The addition of a new parameterized test in test_models_common.py is excellent, as it thoroughly verifies the correctness of the change by checking output shapes, value equivalence, and end-to-end generation consistency. I have one suggestion to improve the new test for better failure diagnostics.
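
For reference, the wiring described here amounts to something like the following in the prefill path (illustrative names, not the exact generator code):

    # Skip the full [B, T, V] projection unless prompt logprobs were requested.
    logits = model(
        prompt_tokens,
        last_token_logits_only=not sampling_params.prompt_logprobs,
    )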

raulchen and others added 2 commits January 14, 2026 14:44
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
@pcmoritz added the tx label Jan 15, 2026