kthui
Contributor

@kthui kthui commented Oct 13, 2025

Overview:

Add an E2E test for request migration during graceful shutdown.

Details:

  • Refactored request migration test structure, allowing other backends in the future.
  • Updated E2E test workflow to only use one request, skipping the worker probe request.
  • Added graceful shutdown test case on vLLM.
  • Updated test README.md, describing each case for request handling E2E tests.

Where should the reviewer start?

Start with README.md, then the request migration tests, then utils.py.

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

N/A

Summary by CodeRabbit

  • Documentation

    • Reorganized fault-tolerance guide into a focused “Migration Tests” section with clearer prerequisites, commands, and troubleshooting. Updated terminology to align with migration workflows.
  • Tests

    • Added end-to-end migration scenarios covering worker failure and graceful shutdown, plus variants with migration disabled to confirm expected failures.
    • Improved test utilities for launching requests, detecting which worker handled them, validating responses, and verifying migration via logs.
    • Removed legacy migration test in favor of the streamlined, migration-centric suite.

@kthui kthui force-pushed the jacky-ft-migrate-graceful-test branch from c1c30b1 to d5d4c15 Compare October 13, 2025 19:16
@kthui kthui marked this pull request as ready for review October 13, 2025 22:13
@kthui kthui requested review from a team as code owners October 13, 2025 22:13
Contributor

coderabbitai bot commented Oct 13, 2025

Walkthrough

Refactors fault-tolerance documentation toward migration-focused tests. Removes the prior end-to-end migration test, adds a dedicated vLLM migration test module and a shared utilities module under tests/fault_tolerance/migration. Tests cover worker kill and graceful shutdown, with and without migration enabled. Introduces process wrappers, request orchestration, log polling, and migration verification.
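
The log-polling step mentioned in the walkthrough can be sketched roughly as follows. This is a minimal illustration, not the PR's implementation: the function name `find_receiving_worker`, the marker string, and the one-log-file-per-worker layout are all assumptions made for the example, and the real helper in utils.py may differ.

```python
import time
from pathlib import Path
from typing import Dict, Optional


def find_receiving_worker(
    log_paths: Dict[str, Path],
    marker: str,
    timeout_s: float = 5.0,
    poll_interval_s: float = 0.1,
) -> Optional[str]:
    """Poll each worker's log until one contains `marker` or the timeout expires.

    `log_paths` maps a (hypothetical) worker name to its log file.
    Returns the name of the first worker whose log contains the marker,
    or None if no log contains it before the deadline.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        for name, path in log_paths.items():
            try:
                text = path.read_text(errors="replace")
            except FileNotFoundError:
                continue  # the worker may not have created its log yet
            if marker in text:
                return name
        if time.monotonic() >= deadline:
            return None
        time.sleep(poll_interval_s)
```

Returning None on timeout, rather than raising, leaves it to the calling test to decide whether an undetected worker is a failure.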

Changes

| Cohort / File(s) | Summary of Changes |
| --- | --- |
| Docs restructure: tests/fault_tolerance/README.md | Rewrites README to center on migration tests, renaming test cases, updating run commands, prerequisites, scopes, and troubleshooting to a migration-specific workflow. |
| Migration test utilities: tests/fault_tolerance/migration/utils.py | Adds DynamoFrontendProcess and helpers to start long-running requests, detect recipient worker via logs, validate completion responses, and verify migration via frontend log inspection. |
| vLLM migration tests: tests/fault_tolerance/migration/test_vllm.py | Introduces DynamoWorkerProcess and four tests covering worker failure vs. graceful shutdown, with migration enabled vs. disabled (migration_limit=0). Manages processes, selects handling worker, triggers failure/shutdown, and asserts outcomes. |
| Legacy test removal: tests/fault_tolerance/test_request_migration.py | Deletes previous comprehensive migration E2E test and its embedded helpers and process wrappers, superseded by the new migration-focused modules. |

Sequence Diagram(s)

sequenceDiagram
  autonumber
  actor Client
  participant Test as Test Harness
  participant Frontend
  participant WA as Worker A
  participant WB as Worker B

  Test->>Frontend: Start Dynamo Frontend
  Test->>WA: Start Worker A (vLLM)
  Test->>WB: Start Worker B (vLLM)
  Note over Test,Frontend: Begin long-running completion request
  Client->>Frontend: POST /completions
  Frontend->>WA: Route request (round-robin)
  par Log polling
    Test->>WA: Poll logs to detect receipt
    Test->>WB: Poll logs
  end
  Note over Test,WA: Identify handling worker

  rect rgba(200,230,255,0.3)
  Test-->>WA: Trigger kill or graceful shutdown
  end

  alt Migration enabled
    Frontend->>WB: Migrate stream
    WB-->>Frontend: Continue tokens
    Frontend-->>Client: 200 OK + completed response
    Test->>Frontend: Verify migration log
  else Migration disabled
    Frontend-->>Client: Error / failed request
    Test-->>Frontend: Assert no migration log
  end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Poem

I thump my paws at twilight’s verge,
Two workers hum, then one must purge—
A hop, a swap, the streams converge,
Logs whisper, tokens re-emerge.
When limits say “no,” retries diverge—
Still, bunny’s proud of this testy surge. 🐇✨

Pre-merge checks

❌ Failed checks (1 warning)
| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 71.43% which is insufficient. The required threshold is 80.00%. | You can run @coderabbitai generate docstrings to improve docstring coverage. |
✅ Passed checks (2 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Title Check | ✅ Passed | The title clearly and concisely summarizes the primary change by indicating that a new end-to-end test for request migration under graceful shutdown has been added, matching the core update in this pull request. |
| Description Check | ✅ Passed | The pull request description follows the repository template by providing an overview, detailed change list, reviewer guidance on where to start, and a related issues section, covering all required sections with appropriate content. |

Comment @coderabbitai help to get the list of available commands and usage tips.

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (7)
tests/fault_tolerance/migration/utils.py (4)

4-11: Use sys.executable instead of hardcoded python

Prevents PATH/env mismatches in CI and across OSes; aligns worker/frontend interpreter.

 import logging
 import shutil
+import sys
 import threading
 import time
@@
-        command = ["python", "-m", "dynamo.frontend", "--router-mode", "round-robin"]
+        command = [sys.executable, "-m", "dynamo.frontend", "--router-mode", "round-robin"]

Also applies to: 22-26
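
To illustrate why `sys.executable` is the safer choice, the sketch below launches a child process with it and returns the version string the child reports. The helper name `child_python_version` is hypothetical and exists only for this example. It shows that the child runs under the same interpreter as the parent, which a bare `"python"` looked up on PATH does not guarantee.

```python
import subprocess
import sys


def child_python_version() -> str:
    """Run a child Python process via sys.executable and return its sys.version."""
    result = subprocess.run(
        # sys.executable is the absolute path of the running interpreter,
        # so the child does not depend on what "python" resolves to on PATH.
        [sys.executable, "-c", "import sys; print(sys.version)"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()
```

In a test harness, this means the spawned frontend and workers see the same virtual environment and installed packages as pytest itself.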


194-212: Use public API for logs; avoid blind exceptions

Read via read_logs() and assert content; don’t access _log_path or catch broad Exception.

 def verify_migration_occurred(frontend_process: DynamoFrontendProcess) -> None:
@@
-    log_path = frontend_process._log_path
-    try:
-        with open(log_path, "r") as f:
-            log_content = f.read()
-    except Exception as e:
-        pytest.fail(f"Could not read frontend log file {log_path}: {e}")
+    log_content = frontend_process.read_logs()
+    assert log_content, f"Frontend logs empty or unavailable: {frontend_process.log_path}"
     assert (
         "Stream disconnected... recreating stream..." in log_content
     ), "'Stream disconnected... recreating stream...' message not found in logs"
     assert (
         "Cannot recreate stream: " not in log_content
     ), "'Cannot recreate stream: ...' error found in logs"

69-83: Log full traceback on request failures

Use logger.exception in except blocks for better diagnostics; matches Ruff TRY400 hint.

         except requests.exceptions.Timeout:
-            logger.error(f"Request timed out after {timeout} seconds")
+            logger.exception(f"Request timed out after {timeout} seconds")
             raise
         except requests.exceptions.RequestException as e:
-            logger.error(f"Request failed with error: {e}")
+            logger.exception(f"Request failed with error: {e}")
             raise
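
The difference between the two calls can be seen in a small standalone sketch. The logger name and the helper `capture_log` are hypothetical and used only for this demonstration: inside an `except` block, `logger.exception` logs at ERROR level and appends the active traceback, while `logger.error` records only the message.

```python
import io
import logging


def capture_log(use_exception: bool) -> str:
    """Log a caught error and return the formatted handler output."""
    stream = io.StringIO()
    logger = logging.getLogger(f"demo.capture.{use_exception}")
    logger.setLevel(logging.ERROR)
    logger.propagate = False  # keep the demo output out of the root logger
    handler = logging.StreamHandler(stream)
    logger.addHandler(handler)
    try:
        try:
            raise TimeoutError("request timed out")
        except TimeoutError as e:
            if use_exception:
                # Equivalent to logger.error(..., exc_info=True)
                logger.exception("Request failed with error: %s", e)
            else:
                logger.error("Request failed with error: %s", e)
    finally:
        logger.removeHandler(handler)
    return stream.getvalue()
```

Comparing `capture_log(True)` with `capture_log(False)` shows the traceback only in the first result.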

164-171: Consider a higher join timeout for long generations

8192 tokens may exceed 240s in some environments; consider 300s to reduce flakes.

-    request_thread.join(timeout=240)
+    request_thread.join(timeout=300)
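
Whichever timeout is chosen, note that `Thread.join(timeout=...)` returns `None` whether or not the thread finished, so the caller has to check `is_alive()` to tell a completed request from one that is still running. The helper below is a hypothetical sketch of that check, not code from the PR.

```python
import threading


def join_or_report(thread: threading.Thread, timeout_s: float) -> bool:
    """Wait up to `timeout_s` seconds for `thread`; return True if it finished."""
    thread.join(timeout=timeout_s)
    # join() gives no indication of a timeout, so inspect the thread state.
    return not thread.is_alive()
```

A test could then fail with an explicit message when this returns False, instead of going on to validate a response that never arrived.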
tests/fault_tolerance/migration/test_vllm.py (3)

4-9: Use sys.executable for worker command

Avoids relying on “python3” being in PATH; ensures same interpreter as pytest.

 import logging
 import os
 import shutil
+import sys
@@
-        command = [
-            "python3",
+        command = [
+            sys.executable,
             "-m",
             "dynamo.vllm",
             "--model",
             FAULT_TOLERANCE_MODEL_NAME,

Also applies to: 33-47


101-107: Silence ARG001 (unused fixture args) or switch to usefixtures

Fixtures are used for side effects; either mark module with usefixtures and drop params, or keep params and silence Ruff.

Option A (preferred): module-level usefixtures

@@
 logger = logging.getLogger(__name__)
 
+# Apply side-effect fixtures to all tests in this module
+pytestmark = pytest.mark.usefixtures("runtime_services", "predownload_models", "set_ucx_tls_no_mm")
@@
-def test_request_migration_vllm_worker_failure(
-    request, runtime_services, predownload_models, set_ucx_tls_no_mm
-):
+def test_request_migration_vllm_worker_failure(request):
@@
-def test_request_migration_vllm_graceful_shutdown(
-    request, runtime_services, predownload_models, set_ucx_tls_no_mm
-):
+def test_request_migration_vllm_graceful_shutdown(request):
@@
-def test_no_request_migration_vllm_worker_failure(
-    request, runtime_services, predownload_models, set_ucx_tls_no_mm
-):
+def test_no_request_migration_vllm_worker_failure(request):
@@
-def test_no_request_migration_vllm_graceful_shutdown(
-    request, runtime_services, predownload_models, set_ucx_tls_no_mm
-):
+def test_no_request_migration_vllm_graceful_shutdown(request):

Option B: keep params and silence Ruff per function

-def test_request_migration_vllm_worker_failure(
-    request, runtime_services, predownload_models, set_ucx_tls_no_mm
-):
+def test_request_migration_vllm_worker_failure(
+    request, runtime_services, predownload_models, set_ucx_tls_no_mm  # noqa: ARG001
+):

(Apply similarly to other tests.)

Also applies to: 151-157, 206-212, 270-276


48-66: Nit: deterministic worker health port derivation

Using worker_id[-1] assumes numeric suffix; it’s fine here (“worker1/2”), but consider validating or deriving port from an explicit int to prevent silent misconfig if ids change.
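
One possible way to make the derivation explicit is sketched below. The function name `health_port_for`, the default `base_port`, and the "base plus index" scheme are assumptions for illustration and are not taken from the PR. The sketch parses the full trailing integer and raises on ids without one, whereas `worker_id[-1]` would silently read "worker12" as 2.

```python
import re


def health_port_for(worker_id: str, base_port: int = 8080) -> int:
    """Derive a health-check port from the trailing integer of `worker_id`.

    `base_port` and the base-plus-index scheme are hypothetical.
    Raises ValueError if the id has no numeric suffix.
    """
    match = re.search(r"(\d+)$", worker_id)
    if match is None:
        raise ValueError(f"worker_id {worker_id!r} has no numeric suffix")
    return base_port + int(match.group(1))
```

With the default base, `health_port_for("worker1")` gives 8081 and `health_port_for("worker12")` gives 8092.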

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8dd104d and ab1e218.

📒 Files selected for processing (4)
  • tests/fault_tolerance/README.md (1 hunks)
  • tests/fault_tolerance/migration/test_vllm.py (1 hunks)
  • tests/fault_tolerance/migration/utils.py (1 hunks)
  • tests/fault_tolerance/test_request_migration.py (0 hunks)
💤 Files with no reviewable changes (1)
  • tests/fault_tolerance/test_request_migration.py
🧰 Additional context used
🧬 Code graph analysis (2)
tests/fault_tolerance/migration/test_vllm.py (3)
tests/utils/managed_process.py (2)
  • ManagedProcess (71-568)
  • terminate_process_tree (45-67)
tests/utils/payloads.py (1)
  • check_models_api (232-243)
tests/fault_tolerance/migration/utils.py (5)
  • DynamoFrontendProcess (19-40)
  • determine_request_receiving_worker (91-151)
  • start_completion_request (43-88)
  • validate_completion_response (154-191)
  • verify_migration_occurred (194-212)
tests/fault_tolerance/migration/utils.py (1)
tests/utils/managed_process.py (2)
  • ManagedProcess (71-568)
  • log_path (98-100)
🪛 Ruff (0.14.0)
tests/fault_tolerance/migration/test_vllm.py

106-106: Unused function argument: runtime_services (ARG001)
106-106: Unused function argument: predownload_models (ARG001)
106-106: Unused function argument: set_ucx_tls_no_mm (ARG001)
156-156: Unused function argument: runtime_services (ARG001)
156-156: Unused function argument: predownload_models (ARG001)
156-156: Unused function argument: set_ucx_tls_no_mm (ARG001)
207-207: Unused function argument: runtime_services (ARG001)
207-207: Unused function argument: predownload_models (ARG001)
207-207: Unused function argument: set_ucx_tls_no_mm (ARG001)
271-271: Unused function argument: runtime_services (ARG001)
271-271: Unused function argument: predownload_models (ARG001)
271-271: Unused function argument: set_ucx_tls_no_mm (ARG001)

tests/fault_tolerance/migration/utils.py

79-79: Use logging.exception instead of logging.error. Replace with exception. (TRY400)
82-82: Use logging.exception instead of logging.error. Replace with exception. (TRY400)
122-122: Do not catch blind exception: Exception (BLE001)
205-205: Do not catch blind exception: Exception (BLE001)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (6)
  • GitHub Check: trtllm (arm64)
  • GitHub Check: trtllm (amd64)
  • GitHub Check: vllm (arm64)
  • GitHub Check: sglang
  • GitHub Check: vllm (amd64)
  • GitHub Check: Build and Test - dynamo
🔇 Additional comments (1)
tests/fault_tolerance/README.md (1)

7-12: README test references are correct

All referenced tests in tests/fault_tolerance/README.md exist with matching names and paths.
