-
Notifications
You must be signed in to change notification settings - Fork 638
test: Add request migration graceful shutdown E2E test #3585
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
2320c0e
to
c1c30b1
Compare
Signed-off-by: Jacky <[email protected]>
Signed-off-by: Jacky <[email protected]>
Signed-off-by: Jacky <[email protected]>
c1c30b1
to
d5d4c15
Compare
WalkthroughRefactors fault-tolerance documentation toward migration-focused tests. Removes the prior end-to-end migration test, adds a dedicated vLLM migration test module and a shared utilities module under tests/fault_tolerance/migration. Tests cover worker kill and graceful shutdown, with and without migration enabled. Introduces process wrappers, request orchestration, log polling, and migration verification. Changes
Sequence Diagram(s)sequenceDiagram
autonumber
actor Client
participant Test as Test Harness
participant Frontend
participant WA as Worker A
participant WB as Worker B
Test->>Frontend: Start Dynamo Frontend
Test->>WA: Start Worker A (vLLM)
Test->>WB: Start Worker B (vLLM)
Note over Test,Frontend: Begin long-running completion request
Client->>Frontend: POST /completions
Frontend->>WA: Route request (round-robin)
par Log polling
Test->>WA: Poll logs to detect receipt
Test->>WB: Poll logs
end
Note over Test,WA: Identify handling worker
rect rgba(200,230,255,0.3)
Test-->>WA: Trigger kill or graceful shutdown
end
alt Migration enabled
Frontend->>WB: Migrate stream
WB-->>Frontend: Continue tokens
Frontend-->>Client: 200 OK + completed response
Test->>Frontend: Verify migration log
else Migration disabled
Frontend-->>Client: Error / failed request
Test-->>Frontend: Assert no migration log
end
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Poem
Pre-merge checks❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🧹 Nitpick comments (7)
tests/fault_tolerance/migration/utils.py (4)
4-11
: Use sys.executable instead of hardcoded pythonPrevents PATH/env mismatches in CI and across OSes; aligns worker/frontend interpreter.
import logging +import sys import shutil import threading import time @@ - command = ["python", "-m", "dynamo.frontend", "--router-mode", "round-robin"] + command = [sys.executable, "-m", "dynamo.frontend", "--router-mode", "round-robin"]Also applies to: 22-26
194-212
: Use public API for logs; avoid blind exceptionsRead via read_logs() and assert content; don’t access _log_path or catch broad Exception.
def verify_migration_occurred(frontend_process: DynamoFrontendProcess) -> None: @@ - log_path = frontend_process._log_path - try: - with open(log_path, "r") as f: - log_content = f.read() - except Exception as e: - pytest.fail(f"Could not read frontend log file {log_path}: {e}") + log_content = frontend_process.read_logs() + assert log_content, f"Frontend logs empty or unavailable: {frontend_process.log_path}" assert ( "Stream disconnected... recreating stream..." in log_content ), "'Stream disconnected... recreating stream...' message not found in logs" assert ( "Cannot recreate stream: " not in log_content ), "'Cannot recreate stream: ...' error found in logs"
69-83
: Log full traceback on request failuresUse logger.exception in except blocks for better diagnostics; matches Ruff TRY400 hint.
except requests.exceptions.Timeout: - logger.error(f"Request timed out after {timeout} seconds") + logger.exception(f"Request timed out after {timeout} seconds") raise except requests.exceptions.RequestException as e: - logger.error(f"Request failed with error: {e}") + logger.exception(f"Request failed with error: {e}") raise
164-171
: Consider a higher join timeout for long generations8192 tokens may exceed 240s in some environments; consider 300s to reduce flakes.
- request_thread.join(timeout=240) + request_thread.join(timeout=300)tests/fault_tolerance/migration/test_vllm.py (3)
4-9
: Use sys.executable for worker commandAvoids relying on “python3” being in PATH; ensures same interpreter as pytest.
import logging import os +import sys import shutil @@ - command = [ - "python3", + command = [ + sys.executable, "-m", "dynamo.vllm", "--model", FAULT_TOLERANCE_MODEL_NAME,Also applies to: 33-47
101-107
: Silence ARG001 (unused fixture args) or switch to usefixturesFixtures are used for side effects; either mark module with usefixtures and drop params, or keep params and silence Ruff.
Option A (preferred): module-level usefixtures
@@ logger = logging.getLogger(__name__) +# Apply side-effect fixtures to all tests in this module +pytestmark = pytest.mark.usefixtures("runtime_services", "predownload_models", "set_ucx_tls_no_mm") @@ -def test_request_migration_vllm_worker_failure( - request, runtime_services, predownload_models, set_ucx_tls_no_mm -): +def test_request_migration_vllm_worker_failure(request): @@ -def test_request_migration_vllm_graceful_shutdown( - request, runtime_services, predownload_models, set_ucx_tls_no_mm -): +def test_request_migration_vllm_graceful_shutdown(request): @@ -def test_no_request_migration_vllm_worker_failure( - request, runtime_services, predownload_models, set_ucx_tls_no_mm -): +def test_no_request_migration_vllm_worker_failure(request): @@ -def test_no_request_migration_vllm_graceful_shutdown( - request, runtime_services, predownload_models, set_ucx_tls_no_mm -): +def test_no_request_migration_vllm_graceful_shutdown(request):Option B: keep params and silence Ruff per function
-def test_request_migration_vllm_worker_failure( - request, runtime_services, predownload_models, set_ucx_tls_no_mm -): +def test_request_migration_vllm_worker_failure( + request, runtime_services, predownload_models, set_ucx_tls_no_mm # noqa: ARG001 +):(Apply similarly to other tests.)
Also applies to: 151-157, 206-212, 270-276
48-66
: Nit: deterministic worker health port derivationUsing worker_id[-1] assumes numeric suffix; it’s fine here (“worker1/2”), but consider validating or deriving port from an explicit int to prevent silent misconfig if ids change.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
tests/fault_tolerance/README.md
(1 hunks)tests/fault_tolerance/migration/test_vllm.py
(1 hunks)tests/fault_tolerance/migration/utils.py
(1 hunks)tests/fault_tolerance/test_request_migration.py
(0 hunks)
💤 Files with no reviewable changes (1)
- tests/fault_tolerance/test_request_migration.py
🧰 Additional context used
🧬 Code graph analysis (2)
tests/fault_tolerance/migration/test_vllm.py (3)
tests/utils/managed_process.py (2)
ManagedProcess
(71-568)terminate_process_tree
(45-67)tests/utils/payloads.py (1)
check_models_api
(232-243)tests/fault_tolerance/migration/utils.py (5)
DynamoFrontendProcess
(19-40)determine_request_receiving_worker
(91-151)start_completion_request
(43-88)validate_completion_response
(154-191)verify_migration_occurred
(194-212)
tests/fault_tolerance/migration/utils.py (1)
tests/utils/managed_process.py (2)
ManagedProcess
(71-568)log_path
(98-100)
🪛 Ruff (0.14.0)
tests/fault_tolerance/migration/test_vllm.py
106-106: Unused function argument: runtime_services
(ARG001)
106-106: Unused function argument: predownload_models
(ARG001)
106-106: Unused function argument: set_ucx_tls_no_mm
(ARG001)
156-156: Unused function argument: runtime_services
(ARG001)
156-156: Unused function argument: predownload_models
(ARG001)
156-156: Unused function argument: set_ucx_tls_no_mm
(ARG001)
207-207: Unused function argument: runtime_services
(ARG001)
207-207: Unused function argument: predownload_models
(ARG001)
207-207: Unused function argument: set_ucx_tls_no_mm
(ARG001)
271-271: Unused function argument: runtime_services
(ARG001)
271-271: Unused function argument: predownload_models
(ARG001)
271-271: Unused function argument: set_ucx_tls_no_mm
(ARG001)
tests/fault_tolerance/migration/utils.py
79-79: Use logging.exception
instead of logging.error
Replace with exception
(TRY400)
82-82: Use logging.exception
instead of logging.error
Replace with exception
(TRY400)
122-122: Do not catch blind exception: Exception
(BLE001)
205-205: Do not catch blind exception: Exception
(BLE001)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (6)
- GitHub Check: trtllm (arm64)
- GitHub Check: trtllm (amd64)
- GitHub Check: vllm (arm64)
- GitHub Check: sglang
- GitHub Check: vllm (amd64)
- GitHub Check: Build and Test - dynamo
🔇 Additional comments (1)
tests/fault_tolerance/README.md (1)
7-12
: README test references are correctAll referenced tests in
tests/fault_tolerance/README.md
exist with matching names and paths.
Signed-off-by: Jacky <[email protected]>
Signed-off-by: Jacky <[email protected]>
ab1e218
to
b4969b2
Compare
Overview:
Add request migration graceful shutdown E2E test.
Details:
Where should the reviewer start?
Start with the README.md, and then the request migration tests, and the utils.py.
Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)
N/A
Summary by CodeRabbit
Documentation
Tests