MTV-4150 | Fix disk transfer completed when importer pod is pending#4286
MTV-4150 | Fix disk transfer completed when importer pod is pending#4286gwencasey96 wants to merge 1 commit into
Conversation
61122cb to
e4c04d7
Compare
|
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #4286 +/- ##
==========================================
- Coverage 15.45% 11.09% -4.37%
==========================================
Files 112 523 +411
Lines 23377 62973 +39596
==========================================
+ Hits 3613 6984 +3371
- Misses 19479 55390 +35911
- Partials 285 599 +314
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
|
Where does the DataVolume get set to Succeeded? From the bug, it looks like the phase is just blank: The patch is kind of hard to read, but I am more wondering if it fixes the wrong symptom. If you have a controller log or something that shows the DataVolume itself is Succeeded can you attach it to the Jira ticket? |
|
d61e92e to
060e621
Compare
📝 WalkthroughWalkthroughAdds ChangesDataVolume Completion Gating
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 3
🧹 Nitpick comments (3)
pkg/controller/plan/migration.go (1)
1740-1743: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low valueAvoid the
path.Join-based identity key in these new log lines.The new log statements throughout
verifyDataVolumeCompletion(e.g., Lines 1742, 1756, 1766, 1775, 1792, 1807, 1819, 1837, 1848) use"dv", path.Join(dv.Namespace, dv.Name). Prefer separate key/value pairs for resource identity, e.g."dv.name", dv.Name, "dv.namespace", dv.Namespace.Based on learnings: log resource name and namespace as separate key/value pairs rather than combining them into a single identity string key using
path.Join, using dot-prefixed keys (e.g.dv.name,dv.namespace) to distinguish resource types; avoid introducing new log lines that use the olderpath.Join-based identity pattern.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/controller/plan/migration.go` around lines 1740 - 1743, Throughout the verifyDataVolumeCompletion function, replace all instances of the pattern "dv", path.Join(dv.Namespace, dv.Name) with separate key/value pairs for better log structure. Instead use "dv.name", dv.Name, "dv.namespace", dv.Namespace as consecutive key/value pairs in each log statement. This change should be applied to all log lines in verifyDataVolumeCompletion that currently use the path.Join-based identity pattern (appearing at approximately lines 1742, 1756, 1766, 1775, 1792, 1807, 1819, 1837, and 1848).Source: Learnings
pkg/controller/plan/migration_test.go (2)
456-465: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick winThis test re-implements the production condition instead of exercising it.
Lines 459-461 copy the exact gating expression from
updateCopyProgressrather than calling the real code path. The assertion can only ever pass and will not catch a regression if the production condition at Line 1896 ofmigration.gochanges. Drive the actual logic (e.g., invokeupdateCopyProgresswith aPendingPopulationDataVolume on a warm plan and assert the resulting task stays pending) so the test guards real behavior.As per path instructions: tests should provide coverage on critical logic paths with specific, meaningful assertions.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/controller/plan/migration_test.go` around lines 456 - 465, The test is duplicating the production gating condition from the updateCopyProgress method instead of invoking it. Replace the manual condition check (lines 459-461 that set dvPhase) by directly calling the actual updateCopyProgress method with a Migration configured with a warm plan and a PendingPopulation DataVolume, then assert that the phase remains in the expected state to verify the real code path behavior rather than re-implementing its logic.Source: Path instructions
293-431: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick winBind Gomega to each subtest and consider covering the remaining branches.
g := gomega.NewGomegaWithT(t)at Line 293 captures the parentt; failures inside thet.Runclosures are attributed to the parent rather than the failing subtest. Useg := gomega.NewWithT(t)inside the closure (Line 409). Also consider table cases for the currently-uncovered branches ofverifyDataVolumeCompletion: emptyClaimName, prime-PVCNotFound(completion allowed), and the "no importer pod found" path.As per path instructions: assertions must provide informative messages on failure, and tests should strive for high coverage on critical logic paths.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/controller/plan/migration_test.go` around lines 293 - 431, The Gomega instance `g` initialized at line 293 captures the parent test `t`, causing assertion failures in subtests to be misattributed to the parent test rather than the failing subtest. Inside the subtest closure (the t.Run lambda around line 409), create a new Gomega instance using `gomega.NewWithT(t)` bound to the current subtest instead of using the parent-scoped instance. Additionally, add new test cases to the existing test table to cover the currently uncovered branches of `verifyDataVolumeCompletion`: a case with an empty `ClaimName`, a case where the prime-PVC is not found but completion is allowed, and a case for the missing importer pod scenario.Source: Path instructions
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@pkg/controller/plan/migration.go`:
- Around line 1784-1822: The error handling for transient failures is
inconsistent between the main PVC check (which returns false to block
completion) and the importer pod check via r.kubevirt.GetImporterPod(*primePVC)
(which returns true to allow completion). When GetImporterPod encounters a
transient error (kErr is not nil), change the return statement to return false
instead of true, so the reconciliation will retry on the next cycle rather than
prematurely marking the task as complete, maintaining consistency with the main
PVC error handling strategy.
- Around line 2046-2058: The default case in the phase switch statement is
incorrectly reporting cdi.PendingPopulation as an unknown or empty phase. Add an
explicit case for cdi.PendingPopulation before the default case that increments
pending, sets task.Phase to api.StepPending, and sets task.Reason to an accurate
message indicating that the DataVolume is in PendingPopulation phase (e.g.,
"DataVolume is in PendingPopulation phase"). Keep the existing default case for
handling truly unknown or empty phases with the current misleading message only
for those cases.
- Around line 1896-1917: The issue is that promoting PendingPopulation to
cdi.Succeeded for cold-conversion migrations creates a deadlock because
verifyDataVolumeCompletion immediately checks if the PVC is bound, which fails
with WaitForFirstConsumer storage classes. To fix this, track the original
DataVolume phase before the promotion at line 1896, then either skip
verifyDataVolumeCompletion for the promoted case or modify the verification
logic in verifyDataVolumeCompletion to relax the bound-check requirement when
the original phase was PendingPopulation, allowing the conversion pod to be
scheduled and trigger PVC binding.
---
Nitpick comments:
In `@pkg/controller/plan/migration_test.go`:
- Around line 456-465: The test is duplicating the production gating condition
from the updateCopyProgress method instead of invoking it. Replace the manual
condition check (lines 459-461 that set dvPhase) by directly calling the actual
updateCopyProgress method with a Migration configured with a warm plan and a
PendingPopulation DataVolume, then assert that the phase remains in the expected
state to verify the real code path behavior rather than re-implementing its
logic.
- Around line 293-431: The Gomega instance `g` initialized at line 293 captures
the parent test `t`, causing assertion failures in subtests to be misattributed
to the parent test rather than the failing subtest. Inside the subtest closure
(the t.Run lambda around line 409), create a new Gomega instance using
`gomega.NewWithT(t)` bound to the current subtest instead of using the
parent-scoped instance. Additionally, add new test cases to the existing test
table to cover the currently uncovered branches of `verifyDataVolumeCompletion`:
a case with an empty `ClaimName`, a case where the prime-PVC is not found but
completion is allowed, and a case for the missing importer pod scenario.
In `@pkg/controller/plan/migration.go`:
- Around line 1740-1743: Throughout the verifyDataVolumeCompletion function,
replace all instances of the pattern "dv", path.Join(dv.Namespace, dv.Name) with
separate key/value pairs for better log structure. Instead use "dv.name",
dv.Name, "dv.namespace", dv.Namespace as consecutive key/value pairs in each log
statement. This change should be applied to all log lines in
verifyDataVolumeCompletion that currently use the path.Join-based identity
pattern (appearing at approximately lines 1742, 1756, 1766, 1775, 1792, 1807,
1819, 1837, and 1848).
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 4515a35d-7d0b-40fb-a9b6-d80cf05eac21
📒 Files selected for processing (2)
pkg/controller/plan/migration.gopkg/controller/plan/migration_test.go
| err = r.Destination.Client.Get(context.TODO(), types.NamespacedName{ | ||
| Namespace: r.Plan.Spec.TargetNamespace, | ||
| Name: fmt.Sprintf("prime-%s", pvc.UID), | ||
| }, primePVC) | ||
| if err != nil { | ||
| if !k8serr.IsNotFound(err) { | ||
| r.Log.Error(err, "Could not get prime PVC for DataVolume to verify completion", | ||
| "vm", vm.String(), | ||
| "dv", path.Join(dv.Namespace, dv.Name), | ||
| "dvPhase", dv.Status.Phase) | ||
| return false, fmt.Sprintf("Could not get prime PVC: %v", err) | ||
| } | ||
| // Prime PVC doesn't exist - that's okay, importer may not be using it | ||
| r.Log.V(4).Info("Prime PVC does not exist, skipping prime PVC check", | ||
| "vm", vm.String(), | ||
| "dv", path.Join(dv.Namespace, dv.Name)) | ||
| return true, "" | ||
| } | ||
|
|
||
| // Prime PVC exists - check if it's bound | ||
| if primePVC.Status.Phase != core.ClaimBound { | ||
| r.Log.Info("DataVolume reports Succeeded but prime PVC is not bound", | ||
| "vm", vm.String(), | ||
| "dv", path.Join(dv.Namespace, dv.Name), | ||
| "dvPhase", dv.Status.Phase, | ||
| "primePVCPhase", primePVC.Status.Phase, | ||
| "podPhaseAnnotation", podPhaseAnnotation) | ||
| return false, fmt.Sprintf("Prime PVC is not bound (phase: %s)", primePVC.Status.Phase) | ||
| } | ||
|
|
||
| // Check if importer pod exists and is not pending | ||
| importer, found, kErr := r.kubevirt.GetImporterPod(*primePVC) | ||
| if kErr != nil { | ||
| r.Log.Error(kErr, "Could not get CDI importer pod for DataVolume to verify completion", | ||
| "vm", vm.String(), | ||
| "dv", path.Join(dv.Namespace, dv.Name), | ||
| "dvPhase", dv.Status.Phase) | ||
| // Don't block completion if we can't check the pod - this is a transient issue | ||
| return true, "" |
There was a problem hiding this comment.
🩺 Stability & Availability | 🟡 Minor | ⚡ Quick win
Transient-error handling is asymmetric and biases toward premature completion.
A transient error fetching the main PVC blocks completion (Line 1759), which is safe, but transient errors for the prime PVC (other than NotFound) and the importer pod return true (Lines 1822), allowing the task to be marked complete despite the readiness check failing to run. Since the whole point of this gate is to avoid premature completion, consider returning false (stay pending and retry on the next reconcile) on these transient failures as well.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@pkg/controller/plan/migration.go` around lines 1784 - 1822, The error
handling for transient failures is inconsistent between the main PVC check
(which returns false to block completion) and the importer pod check via
r.kubevirt.GetImporterPod(*primePVC) (which returns true to allow completion).
When GetImporterPod encounters a transient error (kErr is not nil), change the
return statement to return false instead of true, so the reconciliation will
retry on the next cycle rather than prematurely marking the task as complete,
maintaining consistency with the main PVC error handling strategy.
| if dv.Status.Phase == cdi.PendingPopulation && r.Source.Provider.RequiresConversion() && !r.Plan.IsWarm() { | ||
| // Cold conversion migrations: the conversion pod is the first consumer of | ||
| // the PVCs, so PendingPopulation can be treated as Succeeded. Warm migrations | ||
| // still require the importer to run — do not promote the phase here. | ||
| dv.Status.Phase = cdi.Succeeded | ||
| } | ||
| conditions := dv.Conditions() | ||
| switch dv.Status.Phase { | ||
| case cdi.Succeeded: | ||
| completed++ | ||
| r.setTaskCompleted(task) | ||
| canMarkCompleted, reason := r.verifyDataVolumeCompletion(dv.DataVolume, vm) | ||
| if canMarkCompleted { | ||
| completed++ | ||
| r.setTaskCompleted(task) | ||
| } else { | ||
| pending++ | ||
| task.Phase = api.StepPending | ||
| if reason != "" { | ||
| task.Reason = fmt.Sprintf("DataVolume shows %s but verification failed: %s", dv.Status.Phase, reason) | ||
| } else { | ||
| task.Reason = fmt.Sprintf("DataVolume shows %s but PVC is not bound or importer pod is pending", dv.Status.Phase) | ||
| } | ||
| } |
There was a problem hiding this comment.
🩺 Stability & Availability | 🟠 Major
🧩 Analysis chain
🌐 Web query:
With CDI populators and a WaitForFirstConsumer storage class, is the target PVC bound while the DataVolume phase is PendingPopulation, or does it remain Pending until the first consumer pod is scheduled?
💡 Result:
When using Containerized Data Importer (CDI) with a storage class configured for WaitForFirstConsumer (WFFC) volume binding mode, the target PersistentVolumeClaim (PVC) remains in a Pending state—waiting for a consumer pod to be scheduled—until a pod is created that references the PVC [1][2][3]. Specifically, the DataVolume (DV) phase behavior is as follows: 1. Binding Delay: When the storage class has volumeBindingMode: WaitForFirstConsumer, CDI will not schedule its internal worker pods (used for import, clone, or upload) until the PVC is bound [1][4]. The PVC will remain Pending until a workload (e.g., a KubeVirt Virtual Machine) is scheduled, which triggers the actual binding of the PVC [1][5]. 2. DataVolume Phase: If CDI is using the newer populator mechanism, and the storage class is set to WaitForFirstConsumer without a pre-selected node or immediate binding request, the DataVolume will enter the PendingPopulation phase [6][7]. In this state, the PVC remains unbound (in a Pending state) specifically because the volume binding is delayed by the WFFC storage class policy [7]. 3. Workload Interaction: To move the PVC out of this Pending state, the workload controller (such as KubeVirt) must detect the PendingPopulation (or WaitForFirstConsumer) phase and trigger the creation of a consumer pod (often an ephemeral dummy pod) to satisfy the scheduling requirements [5][8]. Once the PVC is bound to a node, the actual population process completes, and the DataVolume phase transitions accordingly [6]. In summary, the PVC remains unbound (Pending) until the first consumer pod is scheduled [1][3]. The PendingPopulation phase on the DataVolume indicates that CDI is effectively paused or awaiting this binding to proceed with population [6][7].
Citations:
- 1: https://github.com/kubevirt/containerized-data-importer/blob/420f86a0a68a96188338f33b043c0078a1704d3f/doc/waitforfirstconsumer-storage-handling.md
- 2: WaitForFirstConsumer handling for DataVolumes kubevirt/containerized-data-importer#1242
- 3: Add new dv phase WaitForFirstConsumer kubevirt/containerized-data-importer#1247
- 4: cannot upload to DataVolume in WaitForFirstConsumer state, make sure the PVC is Bound kubevirt/containerized-data-importer#2645
- 5: Adds a special handling for DataVolumes in WaitForFirstConsumer state. kubevirt/kubevirt#4025
- 6: https://github.com/kubevirt/containerized-data-importer/blob/main/doc/cdi-populators.md
- 7: https://github.com/kubevirt/containerized-data-importer/blob/420f86a0a68a96188338f33b043c0078a1704d3f/pkg/controller/datavolume/controller-base.go
- 8: Document new kubevirt handling of WaitForFirstConsumer DataVolumes kubevirt/kubevirt#4663
🏁 Script executed:
# First, let's check the file size and examine verifyDataVolumeCompletion function around Line 1772
cd /tmp && find . -name "migration.go" -path "*/controller/plan/*" 2>/dev/null | head -1 | xargs wc -lRepository: kubev2v/forklift
Length of output: 156
🏁 Script executed:
# Find and examine the verifyDataVolumeCompletion function
rg -A 50 "func.*verifyDataVolumeCompletion" pkg/controller/plan/migration.goRepository: kubev2v/forklift
Length of output: 2070
🏁 Script executed:
# Also check the exact area around Line 1772 to see the PVC bound check
sed -n '1750,1800p' pkg/controller/plan/migration.go | cat -nRepository: kubev2v/forklift
Length of output: 2425
🏁 Script executed:
# Check around Line 1896 for the cold conversion context
sed -n '1875,1920p' pkg/controller/plan/migration.go | cat -nRepository: kubev2v/forklift
Length of output: 2061
Promoted PendingPopulation creates a deadlock in cold-conversion migrations with WaitForFirstConsumer storage classes.
At line 1896, the code unconditionally promotes PendingPopulation DataVolumes to cdi.Succeeded for cold-conversion migrations. However, verifyDataVolumeCompletion (which runs immediately after) enforces that the main PVC must be Bound before marking the task complete. With WaitForFirstConsumer storage classes, the PVC intentionally remains Pending until a consumer pod is scheduled—which is exactly what the conversion pod should trigger. By promoting PendingPopulation to Succeeded while the PVC stays Pending, the verification check fails, the task reverts to Pending, and the conversion pod is never created, causing the migration to stall.
Fix: Either skip verifyDataVolumeCompletion for the promoted PendingPopulation case by tracking the original phase, or relax the bound-check requirement when the original DataVolume phase was PendingPopulation.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@pkg/controller/plan/migration.go` around lines 1896 - 1917, The issue is that
promoting PendingPopulation to cdi.Succeeded for cold-conversion migrations
creates a deadlock because verifyDataVolumeCompletion immediately checks if the
PVC is bound, which fails with WaitForFirstConsumer storage classes. To fix
this, track the original DataVolume phase before the promotion at line 1896,
then either skip verifyDataVolumeCompletion for the promoted case or modify the
verification logic in verifyDataVolumeCompletion to relax the bound-check
requirement when the original phase was PendingPopulation, allowing the
conversion pod to be scheduled and trigger PVC binding.
| default: | ||
| // Handle unknown/empty phases - treat as pending | ||
| log.Info( | ||
| "DataVolume has unknown or empty phase, treating as pending.", | ||
| "vm", | ||
| vm.String(), | ||
| "dv", | ||
| path.Join(dv.Namespace, dv.Name), | ||
| "phase", | ||
| dv.Status.Phase) | ||
| pending++ | ||
| task.Phase = api.StepPending | ||
| task.Reason = fmt.Sprintf("DataVolume phase is unknown or empty: %s", dv.Status.Phase) |
There was a problem hiding this comment.
🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win
default reason is misleading for known phases like PendingPopulation.
With warm migrations no longer promoting PendingPopulation (Line 1896) and no explicit cdi.PendingPopulation switch case, that phase now falls into this default branch and reports "DataVolume phase is unknown or empty: PendingPopulation". Treating it as pending is correct, but the message wrongly implies an unknown/empty phase. Consider an explicit cdi.PendingPopulation case with an accurate reason.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@pkg/controller/plan/migration.go` around lines 2046 - 2058, The default case
in the phase switch statement is incorrectly reporting cdi.PendingPopulation as
an unknown or empty phase. Add an explicit case for cdi.PendingPopulation before
the default case that increments pending, sets task.Phase to api.StepPending,
and sets task.Reason to an accurate message indicating that the DataVolume is in
PendingPopulation phase (e.g., "DataVolume is in PendingPopulation phase"). Keep
the existing default case for handling truly unknown or empty phases with the
current misleading message only for those cases.
Verify PVC binding and importer pod state before marking disk transfer complete. Do not promote PendingPopulation to Succeeded on warm migrations. Restore separate Paused checkpoint handling. Ref: https://redhat.atlassian.net/browse/MTV-4150 Resolves: MTV-4150 Co-authored-by: Cursor <cursoragent@cursor.com> Co-authored-by: Cursor <cursoragent@cursor.com> Signed-off-by: Gwen Casey <gcasey@redhat.com>
060e621 to
8cb1f1c
Compare
|



Resolves: MTV-4150
The Disk Transfer step was being marked as completed when a DataVolume reached Succeeded phase, even if the PVCs were not bound or the importer pod was still pending. This could happen when CSI drivers fail to provision volumes (e.g., nfs.csi.k8s.io issues).
This fix adds validation before marking a DataVolume as completed:
If any of these checks fail, the task is marked as pending instead of completed, preventing Disk Transfer from showing as completed prematurely.
Also adds a default case to handle unknown/empty DataVolume phases.