refactor(model): sync whisper with SupportsTranscription interface (WIP) #668

Open
rebel-eunji wants to merge 18 commits into dev-0.22.0-optimum from eunji/refactor-whisper

Conversation

@rebel-eunji

Collaborator

🚀 Summary of Changes

What does this PR do? What feature, fix, or improvement does it bring?


📌 Related Issues / Tickets

  • Resolves #
  • Related to #

✅ Type of Change

  • 🚀 Release (release)
  • ✨ Feature (feature)
  • 🧠 Model support (model)
  • 🧬 Core engine changes (core)
  • 🛠 Bug fix (fix)
  • ⚙️ Performance improvement (perf)
  • 🔁 Refactor or code cleanup (refactor)
  • 📄 Documentation (docs)
  • ❓ Other (other): please describe

🧪 How to Test

  1. Run ...
  2. Verify output: ...
  3. Edge case tested: ...

📸 Screenshots / Logs (if applicable)


📋 Checklist

  • PR title follows Conventional Commits format
  • This PR is linked to an existing issue
  • The test method is described, and the expected result is clearly stated
  • Relevant documentation has been updated (if applicable)

💬 Notes


rebel-eunji and others added 18 commits June 7, 2026 13:22
- scheduler: accept hash_block_size from EngineCore and assert it equals
  block_size; set enable_return_routed_experts read by inherited
  update_from_output; drop dead num_cached_tokens fix-up (field removed
  from Request in 0.22)
- kv cache manager/coordinator: plumb max_num_batched_tokens down to
  get_manager_for_kv_cache_spec (new required args); replace use_eagle
  flag with eagle_group_ids (empty set + assert, eagle unsupported)
- model runner: InputBatch is_spec_decode -> num_spec_tokens; rework
  _get_prompt_logprobs_dict to per-request in_progress_prompt_logprobs_cpu
- input batch: upload renamed _make_prompt_token_ids_cpu_tensor result
  to device (caller-side upload in 0.22)
- worker: return CompilationTimes from compile_or_warm_up_model
  (executor unconditionally reads .language_model/.encoder)

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

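
The two assertion-style guards in the commit above (hash_block_size must equal block_size; eagle_group_ids must be empty because EAGLE is unsupported) can be sketched roughly as follows. This is a minimal illustration, not the PR's code: `SketchCacheSettings` and `check_cache_settings` are hypothetical names, and the reading of `eagle_group_ids` as a set of KV cache group IDs is an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class SketchCacheSettings:
    """Hypothetical container for illustration; not a vLLM or vllm-rbln class."""

    block_size: int
    hash_block_size: int
    # Assumed meaning: IDs of KV cache groups that would use EAGLE.
    # The commit passes an empty set because EAGLE is unsupported.
    eagle_group_ids: set = field(default_factory=set)


def check_cache_settings(settings: SketchCacheSettings) -> None:
    """Fail fast on configurations this backend cannot honor."""
    assert settings.hash_block_size == settings.block_size, (
        f"hash_block_size ({settings.hash_block_size}) must equal "
        f"block_size ({settings.block_size})"
    )
    assert not settings.eagle_group_ids, "EAGLE groups are not supported"


# A matching configuration passes silently.
check_cache_settings(SketchCacheSettings(block_size=16, hash_block_size=16))
```

Plain `assert` mirrors the wording of the commit; note that assertions are skipped when Python runs with `-O`, so a production guard might raise an explicit exception instead.
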
vLLM 0.18's ECConnectorModelRunnerMixin.get_finished_ec_transfers() was
removed in 0.22; its logic now lives inside the maybe_get_ec_connector_output
context manager (get_finished + clear on exit). sample_tokens() still called
the old method, raising AttributeError.

Drive the full EC lifecycle through the upstream context manager in
execute_model(): bind metadata + load caches on entry, poll finished
transfers into ec_connector_output on exit. Carry the result via
ExecuteModelState and forward it to ModelRunnerOutput in sample_tokens(),
matching upstream gpu_model_runner. Preserves RBLN's non-blocking decode /
blocking prefill cache load.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

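
The lifecycle described in the commit above (bind metadata and load caches on entry, collect finished transfers and clear on exit, then carry the result from execute_model() to sample_tokens()) can be sketched with a plain Python context manager. Everything here is a hypothetical stand-in: `SketchConnector`, `SketchECOutput`, `ec_lifecycle`, and the two `*_sketch` functions are illustrative names, not the vLLM or vllm-rbln API.

```python
from contextlib import contextmanager
from dataclasses import dataclass, field


@dataclass
class SketchECOutput:
    """Hypothetical result object filled in when the context exits."""

    finished_sending: set = field(default_factory=set)
    finished_recving: set = field(default_factory=set)


class SketchConnector:
    """Hypothetical connector that records the order of lifecycle calls."""

    def __init__(self):
        self.calls = []
        self._pending = {"req-1"}

    def bind_metadata(self, metadata):
        self.calls.append("bind")

    def load_caches(self):
        self.calls.append("load")

    def get_finished(self):
        self.calls.append("get_finished")
        done, self._pending = self._pending, set()
        return done, set()

    def clear_metadata(self):
        self.calls.append("clear")


@contextmanager
def ec_lifecycle(connector, metadata):
    output = SketchECOutput()
    connector.bind_metadata(metadata)  # on entry: bind metadata
    connector.load_caches()            # on entry: start cache loads
    try:
        yield output
    finally:
        # On exit: poll finished transfers into the output, then clear.
        output.finished_sending, output.finished_recving = connector.get_finished()
        connector.clear_metadata()


def execute_model_sketch(connector, metadata):
    with ec_lifecycle(connector, metadata) as ec_output:
        pass  # the forward pass would run here
    # Carry the result in a state object for the later sampling step.
    return {"ec_connector_output": ec_output}


def sample_tokens_sketch(state):
    # Forward the carried result instead of polling the connector again.
    return state["ec_connector_output"]
```

Keeping the poll inside the context manager's exit path means the sampling step never needs a separate "get finished transfers" call, which is the failure mode the commit describes.
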
vLLM 0.22's cleanup_dist_env_and_memory() calls torch.accelerator.empty_cache()
for any non-CPU platform. RBLN is non-CPU (OOT) but runs on a CPU-only torch
build with no torch accelerator, so torch 2.11 raises "Cannot access accelerator
device when none is available" and the EngineCore dies during shutdown cleanup.

Wrap torch.accelerator.empty_cache() to swallow that one case (matching its
documented no-op contract); other errors propagate. Applied via register_ops()
at plugin load.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

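
The wrapper described in the commit above (swallow only the "no accelerator" error, let everything else propagate) might look roughly like this. It is a sketch under assumptions: `tolerate_missing_accelerator` is a hypothetical name, matching on the message text and on `RuntimeError` as the exception type are assumptions, and the block avoids importing torch so it stays self-contained.

```python
import functools

# Message quoted in the commit description.
NO_ACCELERATOR_MSG = "Cannot access accelerator device when none is available"


def tolerate_missing_accelerator(fn):
    """Wrap fn so the 'no accelerator' RuntimeError becomes a no-op."""

    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        try:
            return fn(*args, **kwargs)
        except RuntimeError as exc:  # assumed exception type
            if NO_ACCELERATOR_MSG in str(exc):
                return None  # treat as a no-op, like an empty cache
            raise  # any other error still propagates

    return wrapper


# Hypothetical application at plugin load (requires torch, so left as a comment):
# torch.accelerator.empty_cache = tolerate_missing_accelerator(
#     torch.accelerator.empty_cache
# )
```

Filtering on the specific message keeps the patch narrow: an unrelated `RuntimeError` raised during cleanup is still surfaced rather than silently hidden.
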
@rebel-eunji rebel-eunji self-assigned this Jun 10, 2026
@rebel-eunji rebel-eunji force-pushed the dev-0.22.0-optimum branch 3 times, most recently from bab39ce to 8502914, June 11, 2026 01:55
2 participants