Fix migration #6243

Closed
zbb88888 wants to merge 1 commit into kubeovn:master from zbb88888:fix-migration
zbb88888 (Collaborator) commented Feb 1, 2026

Pull Request

What type of this PR

Examples of user facing changes:

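Simplify the code:

  • Uniformly require migrationStateValid = true before processing. This is not too late: creating a pod and waiting for it to reach Running takes roughly 5-10s, while a VM takes longer to start and run normally, probably more than 30s. Waiting for KubeVirt to finish syncing the VMI migration src and dst nodes before setting the LSP options is therefore early enough.
  • Remove the MigrationScheduling fallback logic.
  • Remove the subsequent !migrationStateValid checks.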

  • Bug fixes

Updated description of PR #6242

## fix: ensure kubevirt migration uses current migration state

This PR fixes KubeVirt VM migration failures (Issue #6220) by adding MigrationUID validation to prevent stale state usage.

### Problem

When VMs migrate consecutively (e.g., A→B→A), the controller could use stale `MigrationState` from a previous migration:

1. **Missing MigrationUID check**: The code used `vmi.Status.MigrationState` without verifying it belongs to the current migration
2. **Stale node info**: This caused `sourceNode == targetNode` (both showing old values), making the controller skip `SetLogicalSwitchPortMigrateOptions`
3. **Migration timeout**: Without proper LSP options, the new pod couldn't reach network ready state

### Root Cause

```go
// BEFORE: No UID validation - could use state from previous migration
if vmi.Status.MigrationState != nil {
    srcNodeName = vmi.Status.MigrationState.SourceNode
    targetNodeName = vmi.Status.MigrationState.TargetNode
}
```

### Solution

1. **Add MigrationUID validation**: Only use `vmi.Status.MigrationState` if `MigrationUID` matches the current `vmiMigration.UID`
2. **Remove early return on Completed**: The previous `Completed` check could skip cleanup for the `MigrationSucceeded`/`MigrationFailed` phases
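
For reviewers, the UID check described in the first item might look roughly like the sketch below. It is a minimal, self-contained illustration and not the code in this PR: the `MigrationState` and `VMI` types are simplified stand-ins for the KubeVirt API types, and the helper name `migrationNodes` is hypothetical.

```go
package main

import "fmt"

// MigrationState is a simplified stand-in for the KubeVirt migration state;
// only the fields discussed in this PR are modelled (assumption).
type MigrationState struct {
	MigrationUID string
	SourceNode   string
	TargetNode   string
}

// VMI is a simplified stand-in; MigrationState mirrors vmi.Status.MigrationState.
type VMI struct {
	MigrationState *MigrationState
}

// migrationNodes (hypothetical helper) returns the source and target nodes
// only when the recorded state belongs to the migration identified by
// migrationUID. A nil or stale state yields valid == false.
func migrationNodes(vmi *VMI, migrationUID string) (src, dst string, valid bool) {
	state := vmi.MigrationState
	if state == nil || state.MigrationUID != migrationUID {
		return "", "", false
	}
	return state.SourceNode, state.TargetNode, true
}

func main() {
	// State left over from an earlier A→B migration.
	vmi := &VMI{MigrationState: &MigrationState{
		MigrationUID: "uid-1",
		SourceNode:   "node-a",
		TargetNode:   "node-b",
	}}

	// A new migration (uid-2) must not reuse the old nodes.
	_, _, valid := migrationNodes(vmi, "uid-2")
	fmt.Println("stale state accepted:", valid)

	// The matching migration gets its own nodes back.
	src, dst, valid := migrationNodes(vmi, "uid-1")
	fmt.Println(src, dst, valid)
}
```

With a check of this shape, a caller would wait (for example by requeueing) while `valid` is false instead of acting on nodes from a previous migration.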

Related

Which issue(s) this PR fixes

#6220

gemini-code-assist (Contributor) commented

Summary of Changes

Hello @zbb88888, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request resolves a critical bug in the KubeVirt VM migration process by ensuring the controller consistently uses the most current migration state. The changes prevent issues arising from stale data, which previously caused network misconfigurations and migration failures during consecutive VM movements. By introducing robust validation and refining the handling of migration phases, the reliability and stability of VM migrations are significantly improved.

Highlights

  • Fix KubeVirt VM Migration Failures: Addresses an issue where consecutive VM migrations (e.g., A->B->A) could fail due to the controller using stale MigrationState data from a previous migration, leading to incorrect network configurations and timeouts.
  • MigrationUID Validation: Implemented a critical check to validate vmi.Status.MigrationState by ensuring its MigrationUID matches the UID of the current VirtualMachineInstanceMigration. This prevents the use of outdated migration information.
  • Refined Migration Phase Handling: Removed an early return for Completed migration states to ensure that necessary cleanup and reset logic for MigrationSucceeded and MigrationFailed phases is always executed, even if the Completed flag is set prematurely.
  • Improved Logging and Error Handling: Enhanced logging to clearly distinguish between nil and stale MigrationState for better debugging. A new skip condition was added for MigrationSucceeded or MigrationFailed phases if the VMI migration state is stale or nil, as correct node information is essential for reset operations.
  • Comprehensive Unit Tests: Added new unit tests in pkg/controller/kubevirt_test.go to cover the MigrationUID validation logic, phase-based skip logic, and end-to-end migration scenarios, including the problematic A->B->A sequence.
gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request correctly addresses a bug in KubeVirt VM migration where stale MigrationState could be used, leading to failures. The core fix, which validates the MigrationUID, is sound and effectively resolves the issue. The associated changes, such as removing the early return for the Completed state and improving the handling of terminal phases, are logical improvements. The addition of comprehensive unit tests is excellent and provides good coverage for the new logic. I have one minor suggestion to improve code readability.

coveralls commented Feb 1, 2026

Pull Request Test Coverage Report for Build 21565852316

Details

  • 3 of 93 (3.23%) changed or added relevant lines in 2 files are covered.
  • 5 unchanged lines in 2 files lost coverage.
  • Overall coverage decreased (-0.003%) to 22.93%

| Changes Missing Coverage | Covered Lines | Changed/Added Lines | % |
| --- | --- | --- | --- |
| pkg/ovs/ovn-nb-logical_switch_port.go | 3 | 17 | 17.65% |
| pkg/controller/kubevirt.go | 0 | 76 | 0.0% |

| Files with Coverage Reduction | New Missed Lines | % |
| --- | --- | --- |
| pkg/ovs/ovn-nb-logical_router_route.go | 2 | 74.6% |
| pkg/controller/kubevirt.go | 3 | 0.0% |

| Totals | |
| --- | --- |
| Change from base Build 21538844522 | -0.003% |
| Covered Lines | 12332 |
| Relevant Lines | 53782 |

💛 - Coveralls

zbb88888 force-pushed the fix-migration branch 3 times, most recently from ae85c9b to dab53fc, on February 1, 2026 09:05

zbb88888 (Collaborator, Author) commented Feb 1, 2026

@gemini review

zbb88888 marked this pull request as ready for review February 1, 2026 09:19
dosubot bot added the size:L label (This PR changes 100-499 lines, ignoring generated files.) Feb 1, 2026
dosubot bot commented Feb 1, 2026

Related Documentation

No published documentation to review for changes on this repository.

dosubot bot added the bug label (Something isn't working) Feb 1, 2026
dosubot bot added the size:XL label (This PR changes 500-999 lines, ignoring generated files.) and removed the size:L label Feb 1, 2026
dosubot bot added the size:L label (This PR changes 100-499 lines, ignoring generated files.) and removed the size:XL label Feb 1, 2026
zbb88888 force-pushed the fix-migration branch 3 times, most recently from d6bdb2f to 8607dba, on February 1, 2026 13:02
zbb88888 marked this pull request as draft February 1, 2026 13:03
zbb88888 (Collaborator, Author) commented Feb 1, 2026

/gemini review

gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request addresses a critical bug in KubeVirt VM migration handling by introducing MigrationUID validation to prevent stale migration states, simplifying logic, and enhancing logging. However, a significant issue identified is the failure to remove an early return condition for completed migrations. This oversight prevents crucial cleanup operations, particularly for OVN migration options, which could lead to persistent networking issues for migrated VMs.

Comment on lines 143 to 146

  if vmiMigration.Status.MigrationState.Completed {
-     klog.V(3).Infof("VirtualMachineInstanceMigration %s migration state is completed, skipping", key)
+     klog.V(3).Infof("VirtualMachineInstanceMigration %s (UID: %s) migration state is already completed, skipping",
+         key, migrationUID)
      return nil

security-medium medium

The early return when vmiMigration.Status.MigrationState.Completed is true prevents crucial migration cleanup logic from executing. This specifically skips ResetLogicalSwitchPortMigrateOptions, leaving the VM port in migration mode in OVN indefinitely, which can lead to networking instability, duplicate packets, or loss of connectivity for the VM. This behavior contradicts the PR description, which states that the early return on Completed should be removed. Furthermore, relying on vmiMigration.Status.MigrationState here is less reliable than vmi.Status.MigrationState, which is the authoritative source of truth.
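
To make the concern concrete, here is a minimal sketch of phase-driven handling that does not consult the `Completed` flag at all, so terminal phases always reach the reset step. It is an illustration under assumptions, not the controller's actual code: the `phase` and `action` types, their constants, and `decideAction` are hypothetical, and the mapping of the scheduling phase to `SetLogicalSwitchPortMigrateOptions` and of the terminal phases to `ResetLogicalSwitchPortMigrateOptions` is inferred from this thread.

```go
package main

import "fmt"

// phase is a local stand-in for the migration phase (assumption).
type phase string

const (
	phaseScheduling phase = "Scheduling"
	phaseSucceeded  phase = "Succeeded"
	phaseFailed     phase = "Failed"
)

// action names what the controller would do next (hypothetical).
type action string

const (
	actionWait  action = "wait for valid migration state"
	actionSet   action = "set LSP migrate options"
	actionReset action = "reset LSP migrate options"
	actionNone  action = "nothing to do"
)

// decideAction picks the next step from the phase and from whether the VMI
// migration state passed the MigrationUID check. It never returns early on
// a "completed" flag, so Succeeded and Failed always lead to a reset.
func decideAction(p phase, stateValid bool) action {
	if !stateValid {
		// Node information is required for both set and reset.
		return actionWait
	}
	switch p {
	case phaseScheduling:
		return actionSet
	case phaseSucceeded, phaseFailed:
		return actionReset
	default:
		return actionNone
	}
}

func main() {
	fmt.Println(decideAction(phaseScheduling, true))
	fmt.Println(decideAction(phaseSucceeded, true))
	fmt.Println(decideAction(phaseFailed, false))
}
```

Keeping the decision in one pure function like this would also make the phase-based skip logic straightforward to cover with table-driven unit tests.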

zbb88888 force-pushed the fix-migration branch 2 times, most recently from 27aeae9 to 14c9ecb, on February 1, 2026 15:51

Add MigrationUID validation to prevent stale state usage during
consecutive VM migrations (e.g., A→B→A).

Changes:
1. Add MigrationUID check: Only use vmi.Status.MigrationState if
   MigrationUID matches current vmiMigration.UID
2. Simplify MigrationScheduling: Wait for valid state instead of
   using Pod/vmi.Status.NodeName fallback
3. Add unit tests covering UID validation and migration scenarios

The root cause was that vmi.Status.MigrationState could contain stale
info from a previous migration, causing incorrect node detection and
skipping SetLogicalSwitchPortMigrateOptions.

Fixes: kubeovn#6220
Signed-off-by: zbb88888 <jmdxjsjgcxy@gmail.com>
zbb88888 closed this Feb 1, 2026

Labels

bug (Something isn't working), size:L (This PR changes 100-499 lines, ignoring generated files.)

2 participants