[VIRT] Add allowWorkloadDisruption (AWD) migration tests #4228
Walkthrough: Replaces a post-copy VM label with a workload-disruption label, renames a hotplug fixture, removes the post-copy migration test module, adds a new workload-disruption migration test module, and updates tests/fixtures to the new names and label.
Estimated code review effort: 4 (Complex), ~45 minutes. Pre-merge checks: 2 passed, 1 failed (1 warning).
a2fde65 to cd257da
cd257da to edb1853
edb1853 to 55f7c92
|
/wip |
55f7c92 to 2354fe5
|
This PR is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 7 days. |
|
/wip |
|
/build-and-push-container |
|
D/S test |
|
Failed to build and push quay.io/openshift-cnv/openshift-virtualization-tests:pr-4228 |
|
/build-and-push-container |
|
New container for quay.io/openshift-cnv/openshift-virtualization-tests:pr-4228 published |
|
D/S test |
|
D/S test |
|
D/S test |
|
D/S test |
|
/build-and-push-container |
|
New container for quay.io/openshift-cnv/openshift-virtualization-tests:pr-4228 published |
|
/wip cancel |
|
/build-and-push-container |
|
New container for quay.io/openshift-cnv/openshift-virtualization-tests:pr-4228 published |
|
/build-and-push-container |
|
New container for quay.io/openshift-cnv/openshift-virtualization-tests:pr-4228 published |
|
/build-and-push-container |
|
New container for quay.io/openshift-cnv/openshift-virtualization-tests:pr-4228 published |
|
Thanks for the review @dshchedr. |
|
D/S test |
|
/build-and-push-container |
|
/retest all
Auto-triggered: Files in this PR were modified by merged PR #5318. Overlapping files: tests/virt/node/conftest.py |
|
D/S test |
      )
      yield
-     clean_up_migration_jobs(client=admin_client, vm=hotplugged_vm)
+     clean_up_migration_jobs(client=admin_client, vm=vm_with_hotplug_support)
How about putting clean_up_migration_jobs under try/finally here?
Moved the cleanup to the vm_with_hotplug_support fixture teardown instead, guarded by is_jira_open. This avoids scattering try/finally across every fixture that triggers a migration.
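For readers following the thread, here is a minimal sketch of the teardown-guard pattern described in the reply. It is not the code from this PR: is_jira_open and clean_up_migration_jobs are hypothetical stand-ins that only record what happened, and the fixture is written as a plain generator so the sketch runs without pytest or a cluster. In the real suite the function would presumably carry a class-scoped pytest fixture decorator.

```python
# Hypothetical stand-ins; the real helpers live in the test repository.
OPEN_JIRAS = {"CNV-92094"}  # assumption: tickets currently considered open
cleanup_calls = []          # records every cleanup invocation for inspection


def is_jira_open(jira_id):
    """Stand-in: the real helper would query the ticket status."""
    return jira_id in OPEN_JIRAS


def clean_up_migration_jobs(client, vm):
    """Stand-in: the real helper would delete leftover migration objects."""
    cleanup_calls.append((client, vm))


def vm_with_hotplug_support(admin_client, vm):
    """Fixture-style generator: setup before the yield, teardown after it."""
    yield vm
    # Teardown runs once; the workaround switches itself off when the bug closes.
    if is_jira_open("CNV-92094"):
        clean_up_migration_jobs(client=admin_client, vm=vm)


def use_fixture(fixture_gen):
    """Drive a generator fixture the way a test runner would."""
    value = next(fixture_gen)  # setup
    for _ in fixture_gen:      # exhausting the generator runs the teardown
        pass
    return value
```

Because the guard sits in one teardown, tests that trigger migrations need no try/finally of their own.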
Consolidate PostCopy and Paused migration tests into two parametrized
classes split by OS (RHEL and Windows).
Renamed from test_post_copy_migration.py to test_workload_disruption_migration.py.
A single module-scoped MigrationPolicy is dynamically patched via
ResourceEditor to toggle PostCopy/Paused mode per test class.
Memory pressure uses built-in tools (Python on RHEL,
PowerShell on Windows).
Tests cover:
- Migration mode verification (PostCopy and Paused)
- Node drain with migration mode assertion
- CPU hotplug after migration (RHEL and Windows)
- Memory hotplug after migration (RHEL and Windows)
- Background process survival across migrations (PID check)
Consolidate VMIM cleanup for CNV-92094 workaround: move
clean_up_migration_jobs to the vm_with_hotplug_support fixture
teardown, guarded by is_jira_open("CNV-92094"). Removes scattered
try/finally and per-fixture cleanup calls — cleanup runs once at
class teardown and auto-disables when the bug is fixed.
Signed-off-by: Samuel Albershtein <salbersh@redhat.com>
Co-authored-by: AI (Claude) <noreply@anthropic.com>
Depends on: RedHatQE/openshift-python-wrapper#2676
Short description:
Add test coverage for the allowWorkloadDisruption MigrationPolicy parameter, verifying that migration completes via Paused mode when allowWorkloadDisruption=true and allowPostCopy=false under bandwidth-constrained conditions.
What this PR does / why we need it:
KubeVirt introduced AllowWorkloadDisruption in the MigrationPolicySpec. This PR consolidates PostCopy and Paused migration tests into two parametrized classes split by OS (RHEL and Windows). A single module-scoped MigrationPolicy is dynamically patched via ResourceEditor to toggle PostCopy/Paused mode per test class.
Renamed from test_post_copy_migration.py to test_workload_disruption_migration.py.
PostCopy vs Paused migration modes:
Both modes solve the same problem: a live migration that cannot converge because the guest is dirtying pages faster than they can be transferred. They handle it differently:
PostCopy (allowPostCopy=true): After completionTimeoutPerGiB expires without convergence, the VM is immediately switched to the target node and begins running there. Any pages not yet transferred are faulted in on demand from the source. There is no second timeout: the VM is already running on the target, so migration always succeeds.
Trade-off: if the source node goes down before all pages are transferred, the VM crashes.
Paused (allowWorkloadDisruption=true, allowPostCopy=false): After completionTimeoutPerGiB expires without convergence, the source VM is frozen (paused). With the guest paused, no new dirty pages are generated, so the remaining pages are transferred to the target. The VM then resumes on the target. There is a second completionTimeoutPerGiB window to complete the post-pause transfer; if it expires, the migration fails.
Trade-off: the VM experiences downtime while paused.
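As a rough illustration of how the two flags map to a mode, the sketch below encodes the rules stated above as plain dictionaries. The names MODE_PATCHES and expected_mode are hypothetical and not taken from the test module. Treating allowPostCopy=true as taking precedence when both flags are set is an assumption, as is returning None when neither flag is set.

```python
# Hypothetical spec patches per mode, using the MigrationPolicy field names
# mentioned in this description. Dicts of this shape could be handed to a
# patching helper such as ResourceEditor; that call is omitted here because
# it needs a live cluster.
MODE_PATCHES = {
    "PostCopy": {"spec": {"allowPostCopy": True}},
    "Paused": {"spec": {"allowPostCopy": False, "allowWorkloadDisruption": True}},
}


def expected_mode(spec):
    """Return the fallback mode a non-converging migration should end up in.

    Assumption: allowPostCopy wins if both flags are true; None means the
    policy allows neither fallback.
    """
    if spec.get("allowPostCopy"):
        return "PostCopy"
    if spec.get("allowWorkloadDisruption"):
        return "Paused"
    return None
```

A test class parametrized by mode could apply MODE_PATCHES[mode] and then assert that the observed migration mode equals expected_mode of the patched spec.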
Migration Policy Parameters
KubeVirt calculates the pre-copy timeout using:
completionTimeout = completionTimeoutPerGiB × (mem_gi + migrated_src_disks_gi)
Where mem_gi is derived from the VM's memory_guest converted from binary GiB to decimal GB.
allow_auto_converge=False disables QEMU's vCPU throttling mechanism that would otherwise help pre-copy migration converge, ensuring the completion timeout fires and migration switches to PostCopy/Paused mode under memory pressure.
With memory_guest=4Gi and completion_timeout_per_gb=10:
4Gi = 4,294,967,296 bytes / 10⁹ = 4.29, which rounds up to 5 decimal GB
pre-copy window = 10 × 5 = 50 seconds
total timeout (Paused second window) = 2 × 50 = 100 seconds
At 70 MiB/s, the initial bulk copy of 4Gi takes 4096 / 70 ≈ 58.5 seconds, which is longer than the 50-second timeout. Pre-copy physically cannot complete even without dirty pages, guaranteeing mode transition regardless of guest dirty rate.
After memory hotplug to 6Gi (7 decimal GB), the pre-copy window grows to 10 × 7 = 70 seconds, but the bulk copy also grows to 6144 / 70 ≈ 87.8 seconds, which still exceeds the timeout.
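The arithmetic above can be reproduced with a small sketch. The function names are hypothetical. It assumes the GiB-to-GB conversion rounds up to a whole number (as the 4.29 to 5 step suggests), that no disks are migrated (migrated_src_disks_gi = 0), and that the bandwidth limit is 70 MiB/s.

```python
import math

GIB_BYTES = 2**30   # one binary gibibyte
GB_BYTES = 10**9    # one decimal gigabyte


def decimal_gb_rounded_up(mem_gib):
    """Convert binary GiB to decimal GB, rounded up (assumed rounding rule)."""
    return math.ceil(mem_gib * GIB_BYTES / GB_BYTES)


def pre_copy_window_s(mem_gib, timeout_per_gb):
    """Pre-copy timeout in seconds, assuming no migrated disks."""
    return timeout_per_gb * decimal_gb_rounded_up(mem_gib)


def bulk_copy_s(mem_gib, bandwidth_mib_s):
    """Time to copy all guest memory once at the given bandwidth."""
    return mem_gib * 1024 / bandwidth_mib_s


for mem in (4, 6):
    window = pre_copy_window_s(mem, timeout_per_gb=10)
    copy_time = bulk_copy_s(mem, bandwidth_mib_s=70)
    print(f"{mem}Gi: window={window}s, bulk copy={copy_time:.1f}s")
```

In both cases the bulk copy takes longer than the pre-copy window, which is what forces the switch to PostCopy or Paused mode.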
Why 2GB stress: it fits safely in the 4Gi VM (2GB stress + ~1GB OS overhead = 75% utilization) with no OOM risk, and it dirties pages continuously at a rate far above the 70 MiB/s copy bandwidth, which reinforces pre-copy failure.
Why completion_timeout_per_gb=10:
The value 10 satisfies both constraints:
- Short enough to guarantee mode transition: the bulk copy time alone (58.5s for 4Gi, 87.8s for 6Gi) exceeds the pre-copy window (50s, 70s), making convergence impossible regardless of OS or memory stress implementation speed.
- Long enough for Paused mode to succeed: with the VM frozen (dirty rate = 0), the total timeout is 2 × the pre-copy window. 4Gi at 70 MiB/s takes 58.5s < 100s (41.5s margin), and 6Gi at 70 MiB/s takes 87.8s < 140s (52.2s margin).
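The two constraints can be checked mechanically. The sketch below is illustrative only, with hypothetical function names. It assumes round-up GiB-to-GB conversion, no migrated disks, a 70 MiB/s bandwidth limit, and a Paused-mode total timeout of twice the pre-copy window.

```python
import math


def timing(mem_gib, timeout_per_gb, bandwidth_mib_s=70):
    """Return (pre_copy_window, total_timeout, bulk_copy_time) in seconds."""
    decimal_gb = math.ceil(mem_gib * 2**30 / 10**9)  # assumed round-up
    pre_copy = timeout_per_gb * decimal_gb
    bulk = mem_gib * 1024 / bandwidth_mib_s
    return pre_copy, 2 * pre_copy, bulk


def satisfies_both(timeout_per_gb, mem_sizes_gib=(4, 6)):
    """True if pre-copy must time out AND a paused transfer still fits."""
    for mem in mem_sizes_gib:
        pre_copy, total, bulk = timing(mem, timeout_per_gb)
        if not (pre_copy < bulk < total):
            return False
    return True


def paused_margin_s(mem_gib, timeout_per_gb):
    """Seconds left in the total timeout after one full bulk copy."""
    _, total, bulk = timing(mem_gib, timeout_per_gb)
    return total - bulk
```

Under these assumptions a value of 10 passes for both 4Gi and 6Gi, while 15 fails the first constraint for 4Gi because a 75-second window would exceed the 58.5-second bulk copy.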
Tests cover (for both PostCopy and Paused modes):
- Migration mode verification
- Node drain with migration mode assertion
- CPU hotplug after migration (RHEL and Windows)
- Memory hotplug after migration (RHEL only)
- Background process survival across migrations (PID check)
jira-ticket:
https://redhat.atlassian.net/browse/CNV-81275