|
| 1 | +# Fix: Call WaitGroup.Done() once only when PVB changes to final status + Update Go to 1.23.0 |
| 2 | + |
| 3 | +## Summary |
| 4 | + |
| 5 | +This PR includes two important updates: |
| 6 | + |
| 7 | +1. **PVB WaitGroup Fix**: Adapts upstream fix [vmware-tanzu/velero#8940](https://github.com/vmware-tanzu/velero/pull/8940) (commit 2eb97fa8b187f9ed0aeb49f216565eddf93a0b08) to prevent calling `WaitGroup.Done()` multiple times for the same PodVolumeBackup, which causes "negative WaitGroup counter" panic errors. |
| 8 | + |
| 9 | +2. **Go Version Update**: Updates Go version from 1.22.5 to 1.23.0 across all Dockerfiles and CI workflows, aligning with upstream commit [874388bd3e34bbcfe955f624b8e3579d959132fc](https://github.com/vmware-tanzu/velero/commit/874388bd3e34bbcfe955f624b8e3579d959132fc). This update was missed in previous CVE-related PRs. |
| 10 | + |
| 11 | +## Problem |
| 12 | +When the event handler receives multiple update events for a PodVolumeBackup that's already in a final status (Completed or Failed), it could call `WaitGroup.Done()` multiple times, leading to a panic. |
| 13 | + |
| 14 | +## Implementation Differences from Upstream |
| 15 | + |
| 16 | +### Upstream Version (Velero main branch) |
| 17 | +- Uses a `pvbIndexer` field (`cache.Indexer`) in the `backupper` struct |
| 18 | +- Tracks PVB states through the indexer infrastructure |
| 19 | +- Checks if PVB already exists in final status before calling `Done()` |
| 20 | + |
| 21 | +### Our Version (OADP 1.4) |
| 22 | +- **Does not have the `pvbIndexer` infrastructure** that exists upstream |
| 23 | +- Implements a simpler solution using `sync.Map` to track processed PVBs |
| 24 | +- Adds a new field `processedPVBs sync.Map` to the `backupper` struct |
| 25 | +- Uses `LoadOrStore` to atomically check and mark PVBs as processed |
| 26 | + |
| 27 | +## Why Direct Cherry-pick Wasn't Possible |
| 28 | +The upstream commit relies on the `pvbIndexer` infrastructure that was introduced in later versions of Velero but is not present in the OADP 1.4 branch. A direct cherry-pick would fail due to: |
| 29 | +1. Missing `pvbIndexer` field in the `backupper` struct |
| 30 | +2. Missing indexer initialization code |
| 31 | +3. Different architectural approach to tracking PVB states |
| 32 | + |
| 33 | +## Adapted Solution |
| 34 | +```go |
| 35 | +// Added to backupper struct: |
| 36 | +processedPVBs sync.Map // tracks PVBs that have already been processed |
| 37 | + |
| 38 | +// In the event handler: |
| 39 | +pvbKey := pvb.Namespace + "/" + pvb.Name |
| 40 | +if _, alreadyProcessed := b.processedPVBs.LoadOrStore(pvbKey, true); !alreadyProcessed { |
| 41 | + b.result = append(b.result, pvb) |
| 42 | + b.wg.Done() |
| 43 | +} |
| 44 | +``` |
| 45 | + |
| 46 | +This achieves the same goal as upstream - preventing multiple `Done()` calls for the same PVB - but using a simpler approach suitable for OADP 1.4's codebase. |
| 47 | + |
| 48 | +## Testing |
| 49 | +- All existing tests pass |
| 50 | +- The fix prevents the panic while maintaining the same functional behavior |
| 51 | +- Tested with `go test ./pkg/podvolume/... -v -count=1` |
| 52 | + |
| 53 | +## Files Updated for Go 1.23.0 |
| 54 | + |
| 55 | +- `Dockerfile` (2 occurrences) |
| 56 | +- `hack/build-image/Dockerfile` |
| 57 | +- `.github/workflows/pr-ci-check.yml` |
| 58 | +- `.github/workflows/push.yml` |
| 59 | +- `.github/workflows/e2e-test-kind.yaml` (2 occurrences) |
| 60 | +- `.github/workflows/crds-verify-kind.yaml` |
| 61 | +- `Tiltfile` |
| 62 | + |
| 63 | +## Related Issues |
| 64 | + |
| 65 | +- Fixes the "negative WaitGroup counter" panic in PodVolumeBackup processing |
| 66 | +- Based on upstream PR: https://github.com/vmware-tanzu/velero/pull/8940 |
| 67 | +- Go update based on upstream commit: https://github.com/vmware-tanzu/velero/commit/874388bd3e34bbcfe955f624b8e3579d959132fc |
0 commit comments