
fix(model): replace INVALID_TOKEN sentinel with real first-token decode in prefill for whisper models #593

Open

rebel-eunji wants to merge 6 commits into dev from fix/whisper

Conversation

@rebel-eunji
Collaborator

🚀 Summary of Changes

Problem

When a Whisper model is served through `vllm serve` and a client POSTs to `/v1/audio/transcriptions`, vLLM raises the following error:

torch._dynamo.exc.TorchRuntimeError: Dynamo failed to run FX node with fake tensors:
call_method masked_fill_(*(FakeTensor(..., size=(1, 51966)),
                           FakeTensor(..., size=(1, 51866), dtype=torch.bool), -inf), **{}):
got RuntimeError('Attempting to broadcast a dimension of length 51866 at -1!
                  Mismatching argument at index 1 had torch.Size([1, 51866]);
                  but expected shape should be broadcastable to [1, 51966]')

from user code:
  File ".../vllm_rbln/v1/sample/rbln_sampler.py", line 275, in forward
    logits = self.apply_logits_processors(...)
  File ".../vllm/v1/sample/sampler.py", line 288, in apply_logits_processors
    logits.masked_fill_(sampling_metadata.allowed_token_ids_mask, float("-inf"))
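
For reference, the underlying shape mismatch can be reproduced in plain eager PyTorch (sizes taken from the traceback; in the PR it surfaces inside torch.compile's fake-tensor tracing):

```python
import torch

# Sampler logits with 51966 columns vs. an allowed_token_ids_mask built for the
# 51866-token Whisper vocabulary: masked_fill_ cannot broadcast the mask over
# the last dimension and raises a RuntimeError.
logits = torch.zeros(1, 51966)
mask = torch.zeros(1, 51866, dtype=torch.bool)
logits.masked_fill_(mask, float("-inf"))  # RuntimeError: mask not broadcastable
```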

Solution

Replace the INVALID_TOKEN sentinel trick in the Whisper prefill path: prefill now runs both the encoder and the first decoder step, seeded with decoder_start_token_id, so real first-token logits are returned to the sampler instead of a placeholder (see the sketch below).
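
A minimal sketch of the intended prefill flow, with illustrative function and attribute names (not the actual plugin code):

```python
import torch

def prefill(self, input_features: torch.Tensor) -> torch.Tensor:
    # Run the encoder once over the audio features for this request.
    encoder_states = self.encoder(input_features)

    # Previously a dummy INVALID_TOKEN was returned at this point. Instead,
    # run the first decoder step from decoder_start_token_id so the sampler
    # receives real logits for the first generated token.
    start_ids = torch.tensor([[self.config.decoder_start_token_id]])
    logits = self.decoder(start_ids, encoder_hidden_states=encoder_states)
    return logits
```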


📌 Related Issues / Tickets

  • Resolves #
  • Related to #

✅ Type of Change

  • 🚀 Release (release)
  • ✨ Feature (feature)
  • 🧠 Model support (model)
  • 🧬 Core engine changes (core)
  • 🛠 Bug fix (fix)
  • ⚙️ Performance improvement (perf)
  • 🔁 Refactor or code cleanup (refactor)
  • 📄 Documentation (docs)
  • ❓ Other (other): please describe

🧪 How to Test

  1. Run ...
  2. Verify output: ...
  3. Edge case tested: ...
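
As an illustrative smoke test (model name, port, and audio file are placeholders, not part of this PR), serving a Whisper model and POSTing to the transcription endpoint should now return a transcription instead of the broadcast error above:

```python
import requests

# Assumes a server started with something like `vllm serve openai/whisper-large-v3`.
with open("sample.wav", "rb") as f:
    resp = requests.post(
        "http://localhost:8000/v1/audio/transcriptions",
        files={"file": ("sample.wav", f, "audio/wav")},
        data={"model": "openai/whisper-large-v3"},
    )
print(resp.status_code, resp.json())
```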

📸 Screenshots / Logs (if applicable)


📋 Checklist

  • PR title follows Conventional Commits format
  • This PR is linked to an existing issue
  • The test method is described, and the expected result is clearly stated
  • Relevant documentation has been updated (if applicable)

💬 Notes


@rebel-eunji rebel-eunji changed the title fix: run both encoder and decoder in prefill step fix(model): replace INVALID_TOKEN sentinel with real first-token decode in prefill for whisper models May 7, 2026
@rebel-eunji rebel-eunji self-assigned this May 7, 2026
@rebel-eunji rebel-eunji added bug Something isn't working optimum Optimum based implementation labels May 7, 2026
@rebel-eunji rebel-eunji changed the title fix(model): replace INVALID_TOKEN sentinel with real first-token decode in prefill for whisper models fix(model): replace INVALID_TOKEN sentinel with real first-token decode in prefill for whisper models (WIP) May 7, 2026
@rebel-eunji rebel-eunji marked this pull request as draft May 7, 2026 16:14
@codecov

codecov Bot commented May 7, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.


@rebel-eunji rebel-eunji changed the title fix(model): replace INVALID_TOKEN sentinel with real first-token decode in prefill for whisper models (WIP) fix(model): replace INVALID_TOKEN sentinel with real first-token decode in prefill for whisper models May 8, 2026
@rebel-eunji rebel-eunji requested a review from rebel-jonghewk May 8, 2026 01:50
@rebel-eunji rebel-eunji marked this pull request as ready for review May 8, 2026 01:54
