refactor(model): sync whisper with SupportsTranscription interface (WIP) #668
Open
rebel-eunji wants to merge 18 commits into
- scheduler: accept hash_block_size from EngineCore and assert it equals block_size; set enable_return_routed_experts read by inherited update_from_output; drop dead num_cached_tokens fix-up (field removed from Request in 0.22)
- kv cache manager/coordinator: plumb max_num_batched_tokens down to get_manager_for_kv_cache_spec (new required args); replace use_eagle flag with eagle_group_ids (empty set + assert, eagle unsupported)
- model runner: InputBatch is_spec_decode -> num_spec_tokens; rework _get_prompt_logprobs_dict to per-request in_progress_prompt_logprobs_cpu
- input batch: upload renamed _make_prompt_token_ids_cpu_tensor result to device (caller-side upload in 0.22)
- worker: return CompilationTimes from compile_or_warm_up_model (executor unconditionally reads .language_model/.encoder)

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
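The two guards named in the scheduler and kv cache manager bullets can be pictured roughly as follows. This is a minimal sketch, not the PR's code: the function names `check_hash_block_size` and `check_eagle_group_ids` are hypothetical, and only the identifiers `hash_block_size`, `block_size`, and `eagle_group_ids` come from the commit message.

```python
from typing import FrozenSet, Iterable, Optional


def check_hash_block_size(block_size: int, hash_block_size: int) -> int:
    """Hypothetical helper: accept hash_block_size only if it equals block_size."""
    assert hash_block_size == block_size, (
        f"hash_block_size ({hash_block_size}) must equal block_size ({block_size})"
    )
    return hash_block_size


def check_eagle_group_ids(
    eagle_group_ids: Optional[Iterable[int]] = None,
) -> FrozenSet[int]:
    """Hypothetical helper: eagle is unsupported, so the id set must be empty."""
    ids = frozenset(eagle_group_ids or ())
    assert not ids, "eagle is unsupported; eagle_group_ids must be empty"
    return ids
```

Both helpers rely on plain `assert`, so they are skipped when Python runs with `-O`; that matches the "assert" wording in the commit message but is a design choice worth keeping in mind.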
vLLM 0.18's ECConnectorModelRunnerMixin.get_finished_ec_transfers() was removed in 0.22; its logic now lives inside the maybe_get_ec_connector_output context manager (get_finished + clear on exit). sample_tokens() still called the old method, raising AttributeError.

Drive the full EC lifecycle through the upstream context manager in execute_model(): bind metadata + load caches on entry, poll finished transfers into ec_connector_output on exit. Carry the result via ExecuteModelState and forward it to ModelRunnerOutput in sample_tokens(), matching upstream gpu_model_runner. Preserves RBLN's non-blocking decode / blocking prefill cache load.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
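The lifecycle described in this commit (bind and load on entry, poll and clear on exit, result carried from execute_model() to sample_tokens()) can be illustrated with a self-contained sketch. Everything here is a stand-in: `FakeConnector`, its method names, `ECOutputSketch`, `ExecuteStateSketch`, and the `*_sketch` functions are hypothetical and do not reproduce vLLM's actual signatures.

```python
from contextlib import contextmanager
from dataclasses import dataclass
from typing import Iterator, Optional, Set


@dataclass
class ECOutputSketch:
    """Hypothetical stand-in for the connector output object."""
    finished_sending: Optional[Set[str]] = None
    finished_recving: Optional[Set[str]] = None


class FakeConnector:
    """Hypothetical connector that records the order of lifecycle calls."""

    def __init__(self) -> None:
        self.calls = []

    def bind_metadata(self, metadata) -> None:
        self.calls.append("bind")

    def start_load_caches(self) -> None:
        self.calls.append("load")

    def get_finished(self):
        self.calls.append("get_finished")
        return {"req-0"}, None

    def clear_metadata(self) -> None:
        self.calls.append("clear")


@contextmanager
def ec_connector_output_sketch(connector, metadata) -> Iterator[ECOutputSketch]:
    output = ECOutputSketch()
    # Entry: bind metadata and start loading caches.
    connector.bind_metadata(metadata)
    connector.start_load_caches()
    try:
        yield output
    finally:
        # Exit: poll finished transfers into the output, then clear metadata.
        output.finished_sending, output.finished_recving = connector.get_finished()
        connector.clear_metadata()


@dataclass
class ExecuteStateSketch:
    """Hypothetical state carried from execute_model to sample_tokens."""
    ec_connector_output: ECOutputSketch


def execute_model_sketch(connector, metadata) -> ExecuteStateSketch:
    with ec_connector_output_sketch(connector, metadata) as ec_out:
        pass  # the forward pass would run here
    return ExecuteStateSketch(ec_connector_output=ec_out)


def sample_tokens_sketch(state: ExecuteStateSketch) -> dict:
    # Forward the carried result instead of calling a removed polling method.
    return {"ec_connector_output": state.ec_connector_output}
```

Because the polling happens in the context manager's exit path, the sampling step only reads the stored result and never needs a separate "get finished transfers" call.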
vLLM 0.22's cleanup_dist_env_and_memory() calls torch.accelerator.empty_cache() for any non-CPU platform. RBLN is non-CPU (OOT) but runs on a CPU-only torch build with no torch accelerator, so torch 2.11 raises "Cannot access accelerator device when none is available" and the EngineCore dies during shutdown cleanup.

Wrap torch.accelerator.empty_cache() to swallow that one case (matching its documented no-op contract); other errors propagate. Applied via register_ops() at plugin load.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
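A wrapper of the kind this commit describes might look like the sketch below. It is an illustration under assumptions, not the PR's implementation: the decorator name `tolerate_missing_accelerator` is hypothetical, and the sketch assumes the error surfaces as a `RuntimeError` whose text contains the quoted message.

```python
import functools

# Message quoted in the commit; matching on it is an assumption of this sketch.
_NO_ACCELERATOR_MSG = "Cannot access accelerator device when none is available"


def tolerate_missing_accelerator(fn):
    """Hypothetical wrapper: turn the 'no accelerator' error into a no-op."""

    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        try:
            return fn(*args, **kwargs)
        except RuntimeError as exc:
            if _NO_ACCELERATOR_MSG in str(exc):
                return None  # swallow only this one case
            raise  # every other error propagates unchanged

    return wrapper
```

At plugin load, such a wrapper would presumably be applied by rebinding `torch.accelerator.empty_cache` to its wrapped form. Matching on the message text keeps the patch narrow, at the cost of depending on the exact wording torch uses.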
bab39ce to 8502914
🚀 Summary of Changes
📌 Related Issues / Tickets
✅ Type of Change
- release
- feature
- model
- core
- fix
- perf
- refactor
- docs
- other: please describe
🧪 How to Test
📸 Screenshots / Logs (if applicable)
📋 Checklist
💬 Notes