test: Add request migration graceful shutdown E2E test #3585

kthui · 2025-10-13T17:29:39Z

Overview:

Add request migration graceful shutdown E2E test.

Details:

Refactored request migration test structure, allowing other backends in the future.
Updated E2E test workflow to only use one request, skipping the worker probe request.
Added graceful shutdown test case on vLLM.
Updated test README.md, describing each case for request handling E2E tests.

Where should the reviewer start?

Start with README.md, then the request migration tests, then utils.py.

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

N/A

Summary by CodeRabbit

Documentation
- Reorganized fault-tolerance guide into a focused “Migration Tests” section with clearer prerequisites, commands, and troubleshooting. Updated terminology to align with migration workflows.
Tests
- Added end-to-end migration scenarios covering worker failure and graceful shutdown, plus variants with migration disabled to confirm expected failures.
- Improved test utilities for launching requests, detecting which worker handled them, validating responses, and verifying migration via logs.
- Removed legacy migration test in favor of the streamlined, migration-centric suite.

Signed-off-by: Jacky <[email protected]>

coderabbitai · 2025-10-13T22:21:57Z

Walkthrough

Refactors fault-tolerance documentation toward migration-focused tests. Removes the prior end-to-end migration test, adds a dedicated vLLM migration test module and a shared utilities module under tests/fault_tolerance/migration. Tests cover worker kill and graceful shutdown, with and without migration enabled. Introduces process wrappers, request orchestration, log polling, and migration verification.
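
As a rough illustration of the log-polling step described in the walkthrough, the sketch below scans a set of worker log files until one of them contains a marker string. It is not the code in `utils.py`: the helper name `find_receiving_worker`, its parameters, and the marker text used in the usage note are all assumptions made for this example.

```python
import time
from pathlib import Path
from typing import Dict, Optional


def find_receiving_worker(
    log_paths: Dict[str, Path],
    marker: str,
    timeout: float = 5.0,
    poll_interval: float = 0.1,
) -> Optional[str]:
    """Return the name of the first worker whose log contains `marker`.

    Hypothetical helper: polls every log file until the marker shows up
    or the timeout expires, in which case None is returned.
    """
    deadline = time.monotonic() + timeout
    while True:
        for name, path in log_paths.items():
            # A worker may not have created its log file yet.
            if path.exists() and marker in path.read_text(errors="replace"):
                return name
        if time.monotonic() >= deadline:
            return None
        time.sleep(poll_interval)
```

A test could call this with something like `find_receiving_worker({"worker1": path1, "worker2": path2}, "received request")` and then kill or shut down whichever worker is returned. Returning `None` on timeout leaves the decision to fail the test with the caller.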

Changes

| Cohort / File(s) | Summary of Changes |
|---|---|
| Docs restructure: `tests/fault_tolerance/README.md` | Rewrites README to center on migration tests, renaming test cases, updating run commands, prerequisites, scopes, and troubleshooting to a migration-specific workflow. |
| Migration test utilities: `tests/fault_tolerance/migration/utils.py` | Adds `DynamoFrontendProcess` and helpers to start long-running requests, detect recipient worker via logs, validate completion responses, and verify migration via frontend log inspection. |
| vLLM migration tests: `tests/fault_tolerance/migration/test_vllm.py` | Introduces `DynamoWorkerProcess` and four tests covering worker failure vs. graceful shutdown, with migration enabled vs. disabled (`migration_limit=0`). Manages processes, selects handling worker, triggers failure/shutdown, and asserts outcomes. |
| Legacy test removal: `tests/fault_tolerance/test_request_migration.py` | Deletes previous comprehensive migration E2E test and its embedded helpers and process wrappers, superseded by the new migration-focused modules. |

Sequence Diagram(s)

sequenceDiagram
  autonumber
  actor Client
  participant Test as Test Harness
  participant Frontend
  participant WA as Worker A
  participant WB as Worker B

  Test->>Frontend: Start Dynamo Frontend
  Test->>WA: Start Worker A (vLLM)
  Test->>WB: Start Worker B (vLLM)
  Note over Test,Frontend: Begin long-running completion request
  Client->>Frontend: POST /completions
  Frontend->>WA: Route request (round-robin)
  par Log polling
    Test->>WA: Poll logs to detect receipt
    Test->>WB: Poll logs
  end
  Note over Test,WA: Identify handling worker

  rect rgba(200,230,255,0.3)
  Test-->>WA: Trigger kill or graceful shutdown
  end

  alt Migration enabled
    Frontend->>WB: Migrate stream
    WB-->>Frontend: Continue tokens
    Frontend-->>Client: 200 OK + completed response
    Test->>Frontend: Verify migration log
  else Migration disabled
    Frontend-->>Client: Error / failed request
    Test-->>Frontend: Assert no migration log
  end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Poem

I thump my paws at twilight’s verge,
Two workers hum, then one must purge—
A hop, a swap, the streams converge,
Logs whisper, tokens re-emerge.
When limits say “no,” retries diverge—
Still, bunny’s proud of this testy surge. 🐇✨

Pre-merge checks

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 71.43%, which is below the required threshold of 80.00%. | You can run `@coderabbitai generate docstrings` to improve docstring coverage. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title Check | ✅ Passed | The title clearly and concisely summarizes the primary change by indicating that a new end-to-end test for request migration under graceful shutdown has been added, matching the core update in this pull request. |
| Description Check | ✅ Passed | The pull request description follows the repository template by providing an overview, detailed change list, reviewer guidance on where to start, and a related issues section, covering all required sections with appropriate content. |


coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (7)

tests/fault_tolerance/migration/utils.py (4)

4-11: Use sys.executable instead of hardcoded python

Prevents PATH/env mismatches in CI and across OSes; aligns worker/frontend interpreter.

 import logging
+import sys
 import shutil
 import threading
 import time
@@
-        command = ["python", "-m", "dynamo.frontend", "--router-mode", "round-robin"]
+        command = [sys.executable, "-m", "dynamo.frontend", "--router-mode", "round-robin"]

Also applies to: 22-26
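
To make the effect of this suggestion concrete, here is a minimal sketch that launches a child process through `sys.executable` and asks the child which interpreter it is running under. The one-line child program is a stand-in chosen for this example; the real tests launch `dynamo.frontend` and `dynamo.vllm` instead.

```python
import subprocess
import sys

# Build the command around the interpreter that is running this script,
# rather than whatever "python" happens to resolve to on PATH.
command = [sys.executable, "-c", "import sys; print(sys.executable)"]

# check=True raises CalledProcessError if the child exits with a non-zero status.
result = subprocess.run(command, capture_output=True, text=True, check=True)
child_interpreter = result.stdout.strip()

print(f"parent interpreter: {sys.executable}")
print(f"child interpreter:  {child_interpreter}")
```

In a virtual environment, a bare `"python"` may resolve to a different installation than the one running pytest, which is the mismatch the comment is warning about.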

194-212: Use public API for logs; avoid blind exceptions

Read via read_logs() and assert content; don’t access _log_path or catch broad Exception.

 def verify_migration_occurred(frontend_process: DynamoFrontendProcess) -> None:
@@
-    log_path = frontend_process._log_path
-    try:
-        with open(log_path, "r") as f:
-            log_content = f.read()
-    except Exception as e:
-        pytest.fail(f"Could not read frontend log file {log_path}: {e}")
+    log_content = frontend_process.read_logs()
+    assert log_content, f"Frontend logs empty or unavailable: {frontend_process.log_path}"
     assert (
         "Stream disconnected... recreating stream..." in log_content
     ), "'Stream disconnected... recreating stream...' message not found in logs"
     assert (
         "Cannot recreate stream: " not in log_content
     ), "'Cannot recreate stream: ...' error found in logs"
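
One way to keep these log checks easy to unit test is to separate them from file access. The sketch below is an assumption about how that could be organized, not the structure of `utils.py`: the function `migration_problems` is hypothetical, while the two marker strings are the ones quoted in the diff above.

```python
from typing import List

MIGRATION_MARKER = "Stream disconnected... recreating stream..."
FAILURE_MARKER = "Cannot recreate stream: "


def migration_problems(log_content: str) -> List[str]:
    """Return the reasons why a frontend log does not show a clean migration.

    An empty list means the migration marker is present and no
    stream-recreation failure was logged.
    """
    problems: List[str] = []
    if not log_content:
        problems.append("frontend log is empty or unavailable")
        return problems
    if MIGRATION_MARKER not in log_content:
        problems.append("migration marker not found")
    if FAILURE_MARKER in log_content:
        problems.append("stream recreation failure found")
    return problems
```

A test could then assert `migration_problems(frontend_process.read_logs()) == []` for the migration-enabled cases, and check for the "migration marker not found" entry when migration is disabled.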

69-83: Log full traceback on request failures

Use logger.exception in except blocks for better diagnostics; matches Ruff TRY400 hint.

         except requests.exceptions.Timeout:
-            logger.error(f"Request timed out after {timeout} seconds")
+            logger.exception(f"Request timed out after {timeout} seconds")
             raise
         except requests.exceptions.RequestException as e:
-            logger.error(f"Request failed with error: {e}")
+            logger.exception(f"Request failed with error: {e}")
             raise
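
The difference between the two calls is easiest to see in the captured output. The following self-contained sketch uses a simulated `TimeoutError` rather than a real HTTP request, and the logger name `migration_sketch` is an arbitrary choice for this example.

```python
import io
import logging

# Capture log output in memory so it can be inspected afterwards.
stream = io.StringIO()
logger = logging.getLogger("migration_sketch")
logger.setLevel(logging.DEBUG)
logger.propagate = False
logger.addHandler(logging.StreamHandler(stream))


def risky_request() -> None:
    """Stand-in for a request that fails."""
    raise TimeoutError("simulated timeout")


try:
    risky_request()
except TimeoutError:
    # logger.exception logs at ERROR level and appends the active traceback;
    # logger.error with the same message would record only the message.
    logger.exception("Request timed out")

output = stream.getvalue()
print(output)
```

`logger.exception` is meant to be called from inside an `except` block, which matches the two call sites flagged here.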

164-171: Consider a higher join timeout for long generations

8192 tokens may exceed 240s in some environments; consider 300s to reduce flakes.

-    request_thread.join(timeout=240)
+    request_thread.join(timeout=300)
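
Whatever timeout is chosen, `Thread.join(timeout=...)` returns `None` whether or not the thread finished, so the caller has to check `is_alive()` afterwards. The review does not show whether the test already does this; the helper `wait_for_thread` below is a hypothetical sketch of the pattern, with short sleeps standing in for the long-running request.

```python
import threading
import time


def wait_for_thread(thread: threading.Thread, timeout: float) -> bool:
    """Join `thread` for up to `timeout` seconds; return True if it finished."""
    thread.join(timeout=timeout)
    # join() returns None either way, so is_alive() is the only way to tell
    # a finished thread from a wait that timed out.
    return not thread.is_alive()


# Daemon threads so a still-running worker does not block interpreter exit.
fast = threading.Thread(target=time.sleep, args=(0.01,), daemon=True)
slow = threading.Thread(target=time.sleep, args=(1.0,), daemon=True)
fast.start()
slow.start()

fast_done = wait_for_thread(fast, timeout=0.5)
slow_done = wait_for_thread(slow, timeout=0.05)
print(fast_done, slow_done)
```

Failing the test explicitly when the helper returns `False` gives a clearer error than letting a later assertion trip over a missing response.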

tests/fault_tolerance/migration/test_vllm.py (3)

4-9: Use sys.executable for worker command

Avoids relying on “python3” being in PATH; ensures same interpreter as pytest.

 import logging
 import os
+import sys
 import shutil
@@
-        command = [
-            "python3",
+        command = [
+            sys.executable,
             "-m",
             "dynamo.vllm",
             "--model",
             FAULT_TOLERANCE_MODEL_NAME,

Also applies to: 33-47

101-107: Silence ARG001 (unused fixture args) or switch to usefixtures

Fixtures are used for side effects; either mark module with usefixtures and drop params, or keep params and silence Ruff.

Option A (preferred): module-level usefixtures

@@
 logger = logging.getLogger(__name__)
 
+# Apply side-effect fixtures to all tests in this module
+pytestmark = pytest.mark.usefixtures("runtime_services", "predownload_models", "set_ucx_tls_no_mm")
@@
-def test_request_migration_vllm_worker_failure(
-    request, runtime_services, predownload_models, set_ucx_tls_no_mm
-):
+def test_request_migration_vllm_worker_failure(request):
@@
-def test_request_migration_vllm_graceful_shutdown(
-    request, runtime_services, predownload_models, set_ucx_tls_no_mm
-):
+def test_request_migration_vllm_graceful_shutdown(request):
@@
-def test_no_request_migration_vllm_worker_failure(
-    request, runtime_services, predownload_models, set_ucx_tls_no_mm
-):
+def test_no_request_migration_vllm_worker_failure(request):
@@
-def test_no_request_migration_vllm_graceful_shutdown(
-    request, runtime_services, predownload_models, set_ucx_tls_no_mm
-):
+def test_no_request_migration_vllm_graceful_shutdown(request):

Option B: keep params and silence Ruff per function

-def test_request_migration_vllm_worker_failure(
-    request, runtime_services, predownload_models, set_ucx_tls_no_mm
-):
+def test_request_migration_vllm_worker_failure(
+    request, runtime_services, predownload_models, set_ucx_tls_no_mm  # noqa: ARG001
+):

(Apply similarly to other tests.)

Also applies to: 151-157, 206-212, 270-276

48-66: Nit: deterministic worker health port derivation

Using worker_id[-1] assumes numeric suffix; it’s fine here (“worker1/2”), but consider validating or deriving port from an explicit int to prevent silent misconfig if ids change.
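
A validated derivation might look like the sketch below. The base port value and the function name `health_port_for` are assumptions for illustration; the review does not show the port arithmetic used in `test_vllm.py`.

```python
import re

BASE_HEALTH_PORT = 9000  # hypothetical base; the real value is not shown in the review


def health_port_for(worker_id: str, base_port: int = BASE_HEALTH_PORT) -> int:
    """Derive a port from the numeric suffix of an id such as "worker2".

    Raises ValueError instead of silently producing a wrong port when the
    id does not end in digits.
    """
    match = re.search(r"(\d+)$", worker_id)
    if match is None:
        raise ValueError(f"worker id {worker_id!r} has no numeric suffix")
    return base_port + int(match.group(1))
```

Unlike `worker_id[-1]`, matching the whole trailing digit run also handles ids with more than one digit, such as "worker12".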

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8dd104d and ab1e218.

📒 Files selected for processing (4)

tests/fault_tolerance/README.md (1 hunks)
tests/fault_tolerance/migration/test_vllm.py (1 hunks)
tests/fault_tolerance/migration/utils.py (1 hunks)
tests/fault_tolerance/test_request_migration.py (0 hunks)

💤 Files with no reviewable changes (1)

tests/fault_tolerance/test_request_migration.py

🧰 Additional context used

🧬 Code graph analysis (2)

tests/fault_tolerance/migration/test_vllm.py (3)

tests/utils/managed_process.py (2)

ManagedProcess (71-568)

terminate_process_tree (45-67)

tests/utils/payloads.py (1)

check_models_api (232-243)

tests/fault_tolerance/migration/utils.py (5)

DynamoFrontendProcess (19-40)

determine_request_receiving_worker (91-151)

start_completion_request (43-88)

validate_completion_response (154-191)

verify_migration_occurred (194-212)

tests/fault_tolerance/migration/utils.py (1)

tests/utils/managed_process.py (2)

ManagedProcess (71-568)

log_path (98-100)

🪛 Ruff (0.14.0)

tests/fault_tolerance/migration/test_vllm.py

106-106: Unused function argument: runtime_services

(ARG001)

106-106: Unused function argument: predownload_models

(ARG001)

106-106: Unused function argument: set_ucx_tls_no_mm

(ARG001)

156-156: Unused function argument: runtime_services

(ARG001)

156-156: Unused function argument: predownload_models

(ARG001)

156-156: Unused function argument: set_ucx_tls_no_mm

(ARG001)

207-207: Unused function argument: runtime_services

(ARG001)

207-207: Unused function argument: predownload_models

(ARG001)

207-207: Unused function argument: set_ucx_tls_no_mm

(ARG001)

271-271: Unused function argument: runtime_services

(ARG001)

271-271: Unused function argument: predownload_models

(ARG001)

271-271: Unused function argument: set_ucx_tls_no_mm

(ARG001)

tests/fault_tolerance/migration/utils.py

79-79: Use logging.exception instead of logging.error

Replace with exception

(TRY400)

82-82: Use logging.exception instead of logging.error

Replace with exception

(TRY400)

122-122: Do not catch blind exception: Exception

(BLE001)

205-205: Do not catch blind exception: Exception

(BLE001)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (6)

GitHub Check: trtllm (arm64)
GitHub Check: trtllm (amd64)
GitHub Check: vllm (arm64)
GitHub Check: sglang
GitHub Check: vllm (amd64)
GitHub Check: Build and Test - dynamo

🔇 Additional comments (1)

tests/fault_tolerance/README.md (1)

7-12: README test references are correct

All referenced tests in tests/fault_tolerance/README.md exist with matching names and paths.

tests/fault_tolerance/migration/utils.py

Signed-off-by: Jacky <[email protected]>

kthui self-assigned this Oct 13, 2025

pull-request-size bot added the size/XXL label Oct 13, 2025

github-actions bot added the test label Oct 13, 2025

kthui force-pushed the jacky-ft-migrate-graceful-test branch from 2320c0e to c1c30b1 Compare October 13, 2025 19:07

copy-pr-bot bot temporarily deployed to GITLAB October 13, 2025 19:07 Inactive

copy-pr-bot bot temporarily deployed to GITLAB October 13, 2025 19:10 Inactive

kthui added 3 commits October 13, 2025 12:16

test: Add request migration on graceful shutdown test and some refactors

3d28959

Signed-off-by: Jacky <[email protected]>

test: Simplify request migration test to send only one request

2d6f332

Signed-off-by: Jacky <[email protected]>

docs: Update fault tolerance test README

d5d4c15

Signed-off-by: Jacky <[email protected]>

kthui force-pushed the jacky-ft-migrate-graceful-test branch from c1c30b1 to d5d4c15 Compare October 13, 2025 19:16

copy-pr-bot bot temporarily deployed to GITLAB October 13, 2025 19:16 Inactive

copy-pr-bot bot temporarily deployed to GITLAB October 13, 2025 19:21 Inactive

copy-pr-bot bot temporarily deployed to GITLAB October 13, 2025 22:12 Inactive

kthui marked this pull request as ready for review October 13, 2025 22:13

kthui requested review from a team as code owners October 13, 2025 22:13

copy-pr-bot bot temporarily deployed to GITLAB October 13, 2025 22:13 Inactive

coderabbitai bot reviewed Oct 13, 2025


kthui added 2 commits October 13, 2025 15:46

test: Add migration disabled test cases

0b0dbba

Signed-off-by: Jacky <[email protected]>

refactor: Use the new process.log_path() instead of process._log_path

b4969b2

Signed-off-by: Jacky <[email protected]>

kthui force-pushed the jacky-ft-migrate-graceful-test branch from ab1e218 to b4969b2 Compare October 13, 2025 22:47

copy-pr-bot bot temporarily deployed to GITLAB October 13, 2025 22:47 Inactive

copy-pr-bot bot temporarily deployed to GITLAB October 13, 2025 22:49 Inactive
