[VIRT] Add allowWorkloadDisruption (AWD) migration tests #4228

Open
SamAlber wants to merge 1 commit into RedHatQE:main from SamAlber:add-awd-feature-test
@SamAlber SamAlber commented Mar 20, 2026

Contributor

Depends on: RedHatQE/openshift-python-wrapper#2676

Short description:

Add test coverage for the allowWorkloadDisruption MigrationPolicy parameter, verifying that migration completes via Paused mode when allowWorkloadDisruption=true and allowPostCopy=false under bandwidth-constrained conditions.

What this PR does / why we need it:

KubeVirt introduced AllowWorkloadDisruption in the MigrationPolicySpec. This PR consolidates PostCopy and Paused migration tests into two parametrized classes split by OS (RHEL and Windows). A single module-scoped MigrationPolicy is dynamically patched via ResourceEditor to toggle PostCopy/Paused mode per test class.

Renamed from test_post_copy_migration.py to test_workload_disruption_migration.py.

PostCopy vs Paused migration modes:

Both modes solve the same problem: a live migration that cannot converge because the guest is dirtying pages faster than they can be transferred. They handle it differently:

PostCopy (allowPostCopy=true): After completionTimeoutPerGiB expires without convergence, the VM is immediately switched to the target node and begins running there. Any pages not yet transferred are faulted in on demand from the source. There is no second timeout: the VM is already running on the target, so the migration always succeeds.

Trade-off: if the source node goes down before all pages are transferred, the VM crashes.

Paused (allowWorkloadDisruption=true, allowPostCopy=false): After completionTimeoutPerGiB expires without convergence, the source VM is frozen (paused). With the guest paused, no new dirty pages are generated, so the remaining pages can be transferred to the target. The VM then resumes on the target. There is a second completionTimeoutPerGiB window to complete the post-pause transfer; if it expires, the migration fails.

Trade-off: the VM experiences downtime while paused.
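The convergence problem that both modes address can be illustrated with a toy model of iterative pre-copy. This is only a sketch under simplifying assumptions (constant bandwidth, constant dirty rate, a hypothetical 1 MiB convergence threshold); it is not how QEMU or KubeVirt actually decide convergence, and the function name is made up for illustration.

```python
def precopy_converges(mem_mib, bandwidth_mib_s, dirty_mib_s, timeout_s, threshold_mib=1.0):
    """Toy model: each round copies what is left, while the guest dirties more pages."""
    remaining = float(mem_mib)
    elapsed = 0.0
    while remaining > threshold_mib:
        round_time = remaining / bandwidth_mib_s
        elapsed += round_time
        if elapsed > timeout_s:
            return False  # timeout fires before the copy catches up
        # Pages dirtied during this round must be re-sent in the next one.
        remaining = min(float(mem_mib), dirty_mib_s * round_time)
    return True


# 4 GiB guest at 70 MiB/s, with different dirty rates and timeouts.
print(precopy_converges(4096, 70, dirty_mib_s=0, timeout_s=50))
print(precopy_converges(4096, 70, dirty_mib_s=0, timeout_s=100))
print(precopy_converges(4096, 70, dirty_mib_s=100, timeout_s=1000))
print(precopy_converges(4096, 70, dirty_mib_s=35, timeout_s=1000))
```

In this model, a dirty rate at or above the bandwidth never converges no matter how long the timeout is, which is the situation PostCopy and Paused modes are meant to resolve.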

Migration Policy Parameters

bandwidth_per_migration: 70Mi
completionTimeoutPerGiB: 10
allow_auto_converge: False

Memory pressure: 2GB (Python bytearray on RHEL, PowerShell on Windows)
VM memory_guest: 4Gi (6Gi after memory hotplug)
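For orientation, the parameters above could map onto a MigrationPolicy manifest roughly as follows for the Paused case. This is a sketch, not the fixture from the PR: the policy name is hypothetical, the apiVersion is assumed to be `migrations.kubevirt.io/v1alpha1`, and the selector label is taken from the `WORKLOAD_DISRUPTION_VM_LABEL` constant mentioned in the review walkthrough.

```python
import json

# Label taken from the walkthrough; used here to select the VMs under test.
WORKLOAD_DISRUPTION_VM_LABEL = {"workload-disruption-vm": "true"}


def paused_mode_policy(name="awd-migration-policy"):  # name is hypothetical
    """Build a MigrationPolicy manifest (as a dict) for Paused mode."""
    return {
        "apiVersion": "migrations.kubevirt.io/v1alpha1",  # assumed API version
        "kind": "MigrationPolicy",
        "metadata": {"name": name},
        "spec": {
            "allowAutoConverge": False,
            "allowPostCopy": False,
            "allowWorkloadDisruption": True,
            "bandwidthPerMigration": "70Mi",
            "completionTimeoutPerGiB": 10,
            "selectors": {
                "virtualMachineInstanceSelector": dict(WORKLOAD_DISRUPTION_VM_LABEL),
            },
        },
    }


print(json.dumps(paused_mode_policy(), indent=2))
```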

KubeVirt calculates the pre-copy timeout using:

completionTimeout = completionTimeoutPerGiB × (mem_gi + migrated_src_disks_gi)

Where mem_gi is derived from the VM's memory_guest, converted from binary GiB to decimal GB and rounded up to a whole number.

allow_auto_converge=False disables QEMU's vCPU throttling mechanism, which would otherwise help pre-copy migration converge. This ensures the completion timeout fires and the migration switches to PostCopy/Paused mode under memory pressure.

With memory_guest=4Gi:

4Gi = 4,294,967,296 bytes / 10⁹ = 4.29, which rounds up to 5 decimal GB

pre-copy window = 10 × 5 = 50 seconds

total timeout (Paused second window) = 2 × 50 = 100 seconds

At 70 MiB/s, the initial bulk copy of 4Gi takes 4096/70 ≈ 58.5 seconds — longer than the 50-second timeout. Pre-copy physically cannot complete even without dirty pages, guaranteeing mode transition regardless of guest dirty rate.

After memory hotplug to 6Gi (7 decimal GB), the pre-copy window grows to 10 × 7 = 70 seconds, but the bulk copy also grows to 6144 / 70 ≈ 87.8 seconds — still exceeds the timeout.
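The arithmetic above can be reproduced with a short script. It is a sketch that follows the description in this PR (binary GiB converted to decimal GB and rounded up, no migrated disks); the helper names are made up and the real KubeVirt calculation may differ in detail.

```python
import math

GIB = 1024 ** 3


def precopy_window_s(memory_guest_gib, timeout_per_gib=10):
    """completionTimeoutPerGiB x memory in decimal GB, rounded up (disks ignored)."""
    decimal_gb = math.ceil(memory_guest_gib * GIB / 10 ** 9)
    return timeout_per_gib * decimal_gb


def bulk_copy_s(memory_guest_gib, bandwidth_mib_s=70):
    """Time for one full pass over guest memory at the given bandwidth."""
    return memory_guest_gib * 1024 / bandwidth_mib_s


for mem in (4, 6):
    window = precopy_window_s(mem)
    copy_time = bulk_copy_s(mem)
    print(f"{mem}Gi: window={window}s, bulk copy={copy_time:.1f}s, "
          f"pre-copy can finish: {copy_time <= window}")
```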

Why 2GB stress: it fits safely in a 4Gi VM (2GB stress + ~1GB OS overhead = 75% utilization) with no OOM risk, and its dirty rate far exceeds the 70 MiB/s copy bandwidth.

Why completion_timeout_per_gb=10:

The value 10 satisfies both constraints:

Short enough to guarantee mode transition - the bulk copy time alone (58.5s for 4Gi, 87.8s for 6Gi) exceeds the timeout window (50s, 70s), making convergence impossible regardless of OS or memory stress implementation speed.

Long enough for Paused mode to succeed - in the second window (VM frozen, dirty rate = 0), total timeout = 2 × pre-copy window: 4Gi at 70 MiB/s = 58.5s < 100s (41.5s margin), and 6Gi at 70 MiB/s = 87.8s < 140s (52.2s margin).
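The two constraints can also be checked for a range of candidate values. The sketch below uses the same simplified model as the description (70 MiB/s bandwidth, memory rounded up to decimal GB, total timeout equal to twice the pre-copy window); the function name and the candidate range are illustrative assumptions.

```python
import math

GIB = 1024 ** 3


def valid_timeouts(mem_sizes_gib=(4, 6), bandwidth_mib_s=70, candidates=range(1, 31)):
    """Return the completionTimeoutPerGiB values that satisfy both constraints."""
    valid = []
    for per_gib in candidates:
        ok = True
        for mem in mem_sizes_gib:
            window = per_gib * math.ceil(mem * GIB / 10 ** 9)
            bulk_copy = mem * 1024 / bandwidth_mib_s
            short_enough = window < bulk_copy     # pre-copy cannot finish in time
            long_enough = bulk_copy < 2 * window  # paused transfer fits total timeout
            ok = ok and short_enough and long_enough
        if ok:
            valid.append(per_gib)
    return valid


print(valid_timeouts())
```

Under this model, the values 7 through 11 satisfy both constraints for 4Gi and 6Gi guests, so 10 sits inside the valid range with some margin on each side.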

The 2GB stress workload also keeps pages continuously dirtied, which reinforces the pre-copy failure.

Tests cover (for both PostCopy and Paused modes):

Migration mode verification
Node drain with migration mode assertion
CPU hotplug after migration (RHEL and Windows)
Memory hotplug after migration (RHEL only)
Background process survival across migrations (PID check)

jira-ticket:

https://redhat.atlassian.net/browse/CNV-81275

Summary by CodeRabbit

  • Tests

    • Added workload-disruption migration tests for RHEL and Windows covering migration modes, node drains, CPU/memory hotplug, and background-process PID stability checks.
    • Removed the legacy post-copy migration test module and its associated fixtures.
  • Refactor

    • Renamed the hotplug VM fixture for clarity and updated test call sites.
  • Chores

    • Replaced the migration policy VM label with a new workload-disruption label across test suites.

@openshift-virtualization-qe-bot

Report bugs in Issues

Welcome! 🎉

This pull request will be automatically processed with the following features:

🔄 Automatic Actions

  • Reviewer Assignment: Reviewers are automatically assigned based on the OWNERS file in the repository root
  • Size Labeling: PR size labels (XS, S, M, L, XL, XXL) are automatically applied based on changes
  • Issue Creation: A tracking issue is created for this PR and will be closed when the PR is merged or closed
  • Branch Labeling: Branch-specific labels are applied to track the target branch
  • Auto-verification: Auto-verified users have their PRs automatically marked as verified
  • Labels: Enabled categories: branch, can-be-merged, cherry-pick, has-conflicts, hold, needs-rebase, size, verified, wip

📋 Available Commands

PR Status Management

  • /wip - Mark PR as work in progress (adds WIP: prefix to title)
  • /wip cancel - Remove work in progress status
  • /hold - Block PR merging (approvers only)
  • /hold cancel - Unblock PR merging
  • /verified - Mark PR as verified
  • /verified cancel - Remove verification status
  • /reprocess - Trigger complete PR workflow reprocessing (useful if webhook failed or configuration changed)
  • /regenerate-welcome - Regenerate this welcome message

Review & Approval

  • /lgtm - Approve changes (looks good to me)
  • /approve - Approve PR (approvers only)
  • /assign-reviewers - Assign reviewers based on OWNERS file
  • /assign-reviewer @username - Assign specific reviewer
  • /check-can-merge - Check if PR meets merge requirements

Testing & Validation

  • /retest tox - Run Python test suite with tox
  • /retest build-container - Rebuild and test container image
  • /retest verify-bugs-are-open - verify-bugs-are-open
  • /retest all - Run all available tests

Container Operations

  • /build-and-push-container - Build and push container image (tagged with PR number)
    • Supports additional build arguments: /build-and-push-container --build-arg KEY=value

Cherry-pick Operations

  • /cherry-pick <branch> - Schedule cherry-pick to target branch when PR is merged
    • Multiple branches: /cherry-pick branch1 branch2 branch3

Label Management

  • /<label-name> - Add a label to the PR
  • /<label-name> cancel - Remove a label from the PR

✅ Merge Requirements

This PR will be automatically approved when the following conditions are met:

  1. Approval: /approve from at least one approver
  2. LGTM Count: Minimum 2 /lgtm from reviewers
  3. Status Checks: All required status checks must pass
  4. No Blockers: No wip, hold, has-conflicts labels and PR must be mergeable (no conflicts)
  5. Verified: PR must be marked as verified

📊 Review Process

Approvers and Reviewers

Approvers:

  • dshchedr
  • vsibirsk

Reviewers:

  • SamAlber
  • SiboWang1997
  • akri3i
  • dshchedr
  • jerry7z
  • kbidarkar
  • vsibirsk
Available Labels
  • hold
  • verified
  • wip
  • lgtm
  • approve

💡 Tips

  • WIP Status: Use /wip when your PR is not ready for review
  • Verification: The verified label is automatically removed on each new commit
  • Cherry-picking: Cherry-pick labels are processed when the PR is merged
  • Container Builds: Container images are automatically tagged with the PR number
  • Permission Levels: Some commands require approver permissions
  • Auto-verified Users: Certain users have automatic verification and merge privileges

For more information, please refer to the project documentation or contact the maintainers.

@coderabbitai

coderabbitai Bot commented Mar 20, 2026

Contributor

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough

Replaces a post-copy VM label with a workload-disruption label, renames a hotplug fixture, removes the post-copy migration test module, adds a new workload-disruption migration test module, and updates tests/fixtures to the new names and label.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Labels & config: tests/virt/constants.py, tests/virt/upgrade/conftest.py | Replaced VM_LABEL = {"post-copy-vm":"true"} with WORKLOAD_DISRUPTION_VM_LABEL = {"workload-disruption-vm":"true"}; updated MigrationPolicy and VM test fixtures to use the new label. |
| New workload-disruption migration tests: tests/virt/node/migration_and_maintenance/test_workload_disruption_migration.py | Added a ~255-line module implementing AWD workload-disruption migration E2E tests, PID-stability and migration-mode assertions, MigrationPolicy fixture using WORKLOAD_DISRUPTION_VM_LABEL, node-drain helpers, and parameterized RHEL/Windows test classes. |
| Removed post-copy migration tests: tests/virt/node/migration_and_maintenance/test_post_copy_migration.py | Removed entire post-copy migration test module and its fixtures/helpers (PID-assertion, MigrationPolicy, migration/drain fixtures, and TestPostCopyMigration class). |
| Fixture rename & uses updated: tests/virt/node/conftest.py, tests/virt/node/hotplug/test_cpu_memory_hotplug.py, other test files... | Renamed class-scoped fixture hotplugged_vm → vm_with_hotplug_support; updated all dependent fixtures, parametrizations, test signatures, helper calls, and multi-line call formatting where applicable. |
| Test adjustments: tests/virt/node/hotplug/test_cpu_memory_hotplug.py | Switched all hotplug-related test signatures and snapshots to accept vm_with_hotplug_support; renamed test methods to match fixture rename and updated Windows xfail check to use new fixture name. |

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title clearly and concisely describes the primary change: adding test coverage for allowWorkloadDisruption (AWD) migration functionality, directly matching the substantial new test module introduced in the PR. |
| Description check | ✅ Passed | Pull request description is comprehensive and well-structured, covering all required sections with detailed technical context. |

coderabbitai Bot previously approved these changes Mar 20, 2026
Comment thread tests/virt/node/migration_and_maintenance/test_workload_disruption_migration.py Outdated
Comment thread tests/virt/node/migration_and_maintenance/test_workload_disruption_migration.py Outdated
@SamAlber

Contributor Author

/wip

@openshift-virtualization-qe-bot changed the title from "[VIRT] Add allowWorkloadDisruption (AWD) migration test" to "WIP: [VIRT] Add allowWorkloadDisruption (AWD) migration test" on Mar 20, 2026
@SamAlber force-pushed the add-awd-feature-test branch from 55f7c92 to 2354fe5 on March 24, 2026 23:25
@github-actions

This PR is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.

@SamAlber

Contributor Author

/wip

@SamAlber

SamAlber commented Jun 1, 2026

Contributor Author

/build-and-push-container

@openshift-virtualization-qe-bot

D/S test tox -e verify-tc-requirement-polarion failed: cnv-tests-tox-executor/28709

@openshift-virtualization-qe-bot-4

Failed to build and push quay.io/openshift-cnv/openshift-virtualization-tests:pr-4228

@SamAlber

SamAlber commented Jun 1, 2026

Contributor Author

/build-and-push-container

@openshift-virtualization-qe-bot-5

New container for quay.io/openshift-cnv/openshift-virtualization-tests:pr-4228 published

@openshift-virtualization-qe-bot

D/S test tox -e verify-tc-requirement-polarion failed: cnv-tests-tox-executor/28771

@openshift-virtualization-qe-bot

D/S test tox -e verify-tc-requirement-polarion failed: cnv-tests-tox-executor/29330

@openshift-virtualization-qe-bot

D/S test tox -e verify-tc-requirement-polarion failed: cnv-tests-tox-executor/29357

@openshift-virtualization-qe-bot

D/S test tox -e verify-tc-requirement-polarion failed: cnv-tests-tox-executor/29363

@SamAlber

Contributor Author

/build-and-push-container

@openshift-virtualization-qe-bot

New container for quay.io/openshift-cnv/openshift-virtualization-tests:pr-4228 published

@SamAlber

Contributor Author

/wip cancel

@SamAlber

Contributor Author

/build-and-push-container

@openshift-virtualization-qe-bot-6

New container for quay.io/openshift-cnv/openshift-virtualization-tests:pr-4228 published

@SamAlber

Contributor Author

/build-and-push-container

@openshift-virtualization-qe-bot

New container for quay.io/openshift-cnv/openshift-virtualization-tests:pr-4228 published

Comment thread tests/virt/node/migration_and_maintenance/test_workload_disruption_migration.py Outdated
Comment thread tests/virt/node/migration_and_maintenance/utils.py Outdated
Comment thread tests/virt/node/migration_and_maintenance/utils.py Outdated
@SamAlber

Contributor Author

/build-and-push-container

@openshift-virtualization-qe-bot-6

New container for quay.io/openshift-cnv/openshift-virtualization-tests:pr-4228 published

Comment thread tests/virt/node/migration_and_maintenance/test_workload_disruption_migration.py Outdated
Comment thread tests/virt/node/migration_and_maintenance/test_workload_disruption_migration.py Outdated
Comment thread tests/virt/node/migration_and_maintenance/test_workload_disruption_migration.py Outdated
Comment thread tests/virt/node/migration_and_maintenance/test_workload_disruption_migration.py Outdated
Comment thread tests/virt/node/migration_and_maintenance/utils.py
Comment thread tests/virt/node/migration_and_maintenance/utils.py Outdated
Comment thread tests/virt/upgrade/conftest.py
@SamAlber

Contributor Author

Thanks for the review @dshchedr.
I addressed all review comments, ready for another look.

@openshift-virtualization-qe-bot

D/S test tox -e verify-tc-requirement-polarion failed: cnv-tests-tox-executor/29794

@SamAlber

Contributor Author

/build-and-push-container

@openshift-virtualization-qe-bot-3

Contributor

/retest all

Auto-triggered: Files in this PR were modified by merged PR #5318.

Overlapping files

tests/virt/node/conftest.py
tests/virt/node/hotplug/test_cpu_memory_hotplug.py
tests/virt/node/migration_and_maintenance/test_post_copy_migration.py
tests/virt/upgrade/conftest.py

Comment thread tests/virt/node/hotplug/test_cpu_memory_hotplug.py Outdated
Comment thread tests/virt/node/migration_and_maintenance/test_workload_disruption_migration.py Outdated
@openshift-virtualization-qe-bot

D/S test tox -e verify-tc-requirement-polarion failed: cnv-tests-tox-executor/29939

Comment thread tests/virt/node/conftest.py Outdated
     )
     yield
-    clean_up_migration_jobs(client=admin_client, vm=hotplugged_vm)
+    clean_up_migration_jobs(client=admin_client, vm=vm_with_hotplug_support)

Collaborator

How about putting clean_up_migration under try/finally here?

@SamAlber SamAlber Jul 4, 2026

Contributor Author

Moved the cleanup to the vm_with_hotplug_support fixture teardown instead, guarded by is_jira_open. This avoids scattering try/finally across every fixture that triggers a migration.
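The teardown pattern described in this reply can be sketched as a generator-style fixture. This is an illustration rather than the code from the PR: `is_jira_open` and `clean_up_migration_jobs` are replaced by hypothetical stand-ins so the example runs on its own, and in the test suite the generator would be registered as a class-scoped pytest fixture.

```python
OPEN_JIRAS = {"CNV-92094"}  # hypothetical stand-in for the real Jira lookup
cleaned = []


def is_jira_open(jira_id):
    """Stand-in: the real helper would query Jira for the ticket status."""
    return jira_id in OPEN_JIRAS


def clean_up_migration_jobs(client, vm):
    """Stand-in: records the call instead of deleting migration objects."""
    cleaned.append(vm)


def vm_with_hotplug_support(admin_client, vm="vm-under-test"):
    # Setup would create and start the VM; here a placeholder name is yielded.
    yield vm
    # Teardown: the workaround runs once, and only while the bug is still open.
    if is_jira_open("CNV-92094"):
        clean_up_migration_jobs(client=admin_client, vm=vm)


fixture = vm_with_hotplug_support(admin_client=None)
vm = next(fixture)   # setup phase
next(fixture, None)  # teardown phase
print(cleaned)
```

Because the guard lives in one teardown, the cleanup stops running as soon as the ticket is closed, without touching the individual migration fixtures.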

Consolidate PostCopy and Paused migration tests into two parametrized
classes split by OS (RHEL and Windows).

Renamed from test_post_copy_migration.py to test_workload_disruption_migration.py.

A single module-scoped MigrationPolicy is dynamically patched via
ResourceEditor to toggle PostCopy/Paused mode per test class.

Memory pressure uses built-in tools (Python on RHEL,
PowerShell on Windows).

Tests cover:
- Migration mode verification (PostCopy and Paused)
- Node drain with migration mode assertion
- CPU hotplug after migration (RHEL and Windows)
- Memory hotplug after migration (RHEL and Windows)
- Background process survival across migrations (PID check)

Consolidate VMIM cleanup for CNV-92094 workaround: move
clean_up_migration_jobs to the vm_with_hotplug_support fixture
teardown, guarded by is_jira_open("CNV-92094"). Removes scattered
try/finally and per-fixture cleanup calls — cleanup runs once at
class teardown and auto-disables when the bug is fixed.

Signed-off-by: Samuel Albershtein <salbersh@redhat.com>
Co-authored-by: AI (Claude) <noreply@anthropic.com>
10 participants