Skip to content

MTV-5829 | Add retry backoff and auth detection to Hyper-V collector#7230

Merged
yaacov merged 1 commit into
kubev2v:mainfrom
Hazanel:hyperV_collector_retry
Jun 23, 2026
Merged

MTV-5829 | Add retry backoff and auth detection to Hyper-V collector#7230
yaacov merged 1 commit into
kubev2v:mainfrom
Hazanel:hyperV_collector_retry

Conversation

@Hazanel

@Hazanel Hazanel commented Jun 22, 2026

Copy link
Copy Markdown
Contributor

The Hyper-V collector retried failed runs with no delay, spamming the host with WinRM requests when credentials were incorrect or basic auth was not enabled.

  • Add RetryInterval (5s) sleep on run() failure in Start() loop, matching the pattern used by oVirt, OpenStack, and OVA collectors.

  • Introduce ErrUnauthorized sentinel error in the driver package, wrapping WinRM HTTP 401/403 responses so callers use errors.Is() instead of fragile string matching.

  • Update Test() to execute a real WinRM command (IsAlive) and return http.StatusUnauthorized on auth failure, triggering ConnectionAuthFailed in the provider controller.

  • Propagate auth errors from SMB prefix discovery in Connect()

Ref: https://redhat.atlassian.net/browse/MTV-5829
Resolves: MTV-5829

@Hazanel

Hazanel commented Jun 22, 2026

Copy link
Copy Markdown
Contributor Author

Hey @coderabbitai - please take a look

@coderabbitai

coderabbitai Bot commented Jun 22, 2026

Copy link
Copy Markdown

Tip

For best results, initiate chat on the files or code changes.

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
gh pr view 7230 --repo kubev2v/forklift --json title,body,files,additions,deletions

Length of output: 1587


🏁 Script executed:

#!/bin/bash
gh pr diff 7230 --repo kubev2v/forklift

Length of output: 8965


🏁 Script executed:

#!/bin/bash
# Get full content of the modified files for deeper review
cat pkg/lib/hyperv/driver/winrm.go
echo "---FILE_SEPARATOR---"
cat pkg/controller/provider/container/hyperv/client.go
echo "---FILE_SEPARATOR---"
cat pkg/controller/provider/container/hyperv/collector.go

Length of output: 40031


🏁 Script executed:

#!/bin/bash
# Check Go version in go.mod
head -5 go.mod

# Check RetryInterval constant exists and value
grep -n "RetryInterval" pkg/controller/provider/container/hyperv/collector.go

# Check how the provider controller uses Test() return values (to understand what happens with status=0)
rg -n "Test()" pkg/controller/provider/ --type=go -A5 -B5 | head -80

# Check how ConnectionAuthFailed is determined
rg -n "ConnectionAuthFailed\|StatusUnauthorized\|\.Test(" pkg/controller/provider/ --type=go -C3 | head -60

Length of output: 6324


🏁 Script executed:

#!/bin/bash
# How provider controller uses Test() results
rg -n "ConnectionAuthFailed" pkg/controller/provider/ --type=go -C5

echo "==="

# How .Test() is called
rg -n "\.Test\b" pkg/controller/provider/ --type=go -C5

echo "==="

# How StatusUnauthorized is checked in provider controller
rg -n "StatusUnauthorized" pkg/controller/provider/ --type=go -C5

Length of output: 8061



@Hazanel Thanks for sharing — here is the review.


Overall Assessment

The PR is well-structured and addresses the core problem cleanly. The ErrUnauthorized sentinel, regex extraction from WinRM error strings, the retry sleep, and the tests are all solid. A few things caught my eye:


🔴 Test() doesn't propagate ErrUnauthorized from Connect()

File: pkg/controller/provider/container/hyperv/collector.go, Lines 168–179

The provider controller in validation.go only sets ConnectionAuthFailed (and stops retrying) when Test() returns status == http.StatusUnauthorized. Currently Test() checks for ErrUnauthorized only in the IsAlive() error, but not in the Connect() error:

func (r *Collector) Test() (status int, err error) {
    if err = r.client.Connect(r.provider); err != nil {
        return  // ← status stays 0, even if err wraps ErrUnauthorized
    }
    ...
}

Trigger path: when an SMB URL is configured and the Hyper-V host returns 401/403 on Get-SmbShare:

discoverSMBWindowsPrefix()
  → ExecuteCommand()
    → ExecuteCommandWithTimeout()
      → WrapCommandError()     ← wraps error as ErrUnauthorized ✓
  → fmt.Errorf("Get-SmbShare failed: %w", err)
client.Connect()
  → errors.Is(pErr, ErrUnauthorized) == true ✓
  → returns fmt.Errorf("SMB discovery auth failed: %w", pErr)
Test()
  → err != nil, but status == 0 ✗ → controller keeps retrying

Suggested fix:

func (r *Collector) Test() (status int, err error) {
    if err = r.client.Connect(r.provider); err != nil {
        if errors.Is(err, driver.ErrUnauthorized) {
            status = http.StatusUnauthorized
        }
        return
    }
    if _, err = r.client.driver.IsAlive(); err != nil {
        if errors.Is(err, driver.ErrUnauthorized) {
            status = http.StatusUnauthorized
        }
        return
    }
    r.log.Info("Connected to Hyper-V host via WinRM/HTTPS.")
    return
}

🟡 httpStatus() doesn't traverse errors.Join branches

File: pkg/lib/hyperv/driver/errors.go, Lines 19–27

The function walks the chain with errors.Unwrap(), which only follows a single-parent chain. If a future caller wraps WinRM errors with errors.Join() (or any multi-error type that implements Unwrap() []error), the traversal will stop at that node and miss the 401/403 status code.

This is not a problem today — all WinRM errors are linearly wrapped — but it's a subtle trap. A more defensive implementation using errors.As with a custom target, or inspecting Unwrap() []error, would be more robust. At minimum, a comment noting the single-chain assumption would help future readers.


🟢 Everything else looks good

Change Assessment
RetryInterval = 5s in Start() loop ✓ Matches oVirt / OpenStack / OVA pattern
WrapCommandError applied in ExecuteCommandWithTimeout ✓ All WinRM commands benefit automatically
SMB non-auth errors downgraded from Error to Info ✓ Reduces log noise for transient mount failures
winrm.go log → "WinRM client initialized." ✓ Accurate — winrm.NewClient creates no socket
"Connected" log moved to after IsAlive() in Test() ✓ Now reflects actual confirmed connectivity
fmt.Errorf("%w: %w", ErrUnauthorized, err) ✓ Go 1.20+ multi-error wrapping; Go 1.25 in go.mod
errors_test.go coverage of httpStatus + WrapCommandError ✓ Good table-driven tests covering all branches

@coderabbitai

coderabbitai Bot commented Jun 22, 2026

Copy link
Copy Markdown

Review Change Stack

📝 Walkthrough

Walkthrough

A new ErrUnauthorized sentinel and WrapCommandError helper are added to the Hyper-V driver to detect WinRM 401/403 HTTP status codes in error chains. ExecuteCommandWithTimeout now wraps command errors through this helper. Client.Connect propagates SMB discovery auth failures as errors. Collector.Test calls IsAlive and returns HTTP 401 on auth failures; Collector.Start gains a continuous retry loop on run errors.

Changes

Hyper-V WinRM Auth Error Propagation

Layer / File(s) Summary
WinRM error utilities and ExecuteCommandWithTimeout wiring
pkg/lib/hyperv/driver/errors.go, pkg/lib/hyperv/driver/winrm.go, pkg/lib/hyperv/driver/errors_test.go
Defines ErrUnauthorized, winrmHTTPStatus regexp, httpStatus chain walker, and WrapCommandError; updates Connect() log message; wraps command execution errors via WrapCommandError; adds table-driven tests for httpStatus and WrapCommandError covering nil, unrelated, 401/403/500, and nested-wrapped cases.
Client SMB auth error and Collector Test/Start updates
pkg/controller/provider/container/hyperv/client.go, pkg/controller/provider/container/hyperv/collector.go
client.go returns a wrapped error on ErrUnauthorized from SMB discovery and logs info for other failures; collector.go rewrites Test() to call IsAlive() and return http.StatusUnauthorized on auth failure, adds a continuous retry loop to Start() on run() error, and removes the deferred "Run failed." log from run().

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐇 A WinRM error hops down the chain,
With status four-oh-one it's plain!
WrapCommandError catches the fall,
The collector retries after all.
No more "Run failed" lost in the log —
Just a sentinel error, snug as a log. 🪵

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately summarizes the main changes: adding retry backoff and auth detection to the Hyper-V collector, which are the core objectives of this PR.
Description check ✅ Passed The description is clearly related to the changeset, detailing the problem being solved (retry spamming) and the four main solutions implemented across the modified files.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (2)
pkg/controller/provider/container/hyperv/client.go (1)

87-87: 🧹 Nitpick | 🔵 Trivial | ⚡ Quick win

Include the discovery error in the reconnect log context.

Line 87 logs a generic message without pErr, which makes repeated reconnect failures harder to diagnose in production.

Suggested patch
-			r.Log.Info("SMB Windows prefix not yet discovered, will attempt on next reconnect")
+			r.Log.Info("SMB Windows prefix not yet discovered, will attempt on next reconnect", "error", pErr)

As per coding guidelines, "Observability: ... Logs should provide context, not just messages."

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/controller/provider/container/hyperv/client.go` at line 87, The log
statement using r.Log.Info with message "SMB Windows prefix not yet discovered,
will attempt on next reconnect" is missing the pErr error variable that contains
the discovery error details. Modify this log call to include pErr as a field or
in the message itself to provide the actual error context needed for diagnosing
reconnect failures. This ensures the log output contains actionable error
information rather than just a generic status message.

Source: Coding guidelines

pkg/controller/provider/container/hyperv/collector.go (1)

167-178: 🧹 Nitpick | 🔵 Trivial | 🏗️ Heavy lift

Add unit tests for the new Test() and retry-loop branches.

This PR introduces new control-flow that should be locked with tests: unauthorized from Connect(), unauthorized from IsAlive(), and retry behavior in Start().

As per coding guidelines, "coverage: Make sure that the code has unit tests."

Also applies to: 206-212

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/controller/provider/container/hyperv/collector.go` around lines 167 -
178, The Test() and Start() methods now contain new control-flow branches
handling unauthorized errors and retry behavior that lack unit test coverage.
Add unit tests to cover: the unauthorized error path from the r.client.Connect()
call in the Test() method, the unauthorized error path from the
r.client.driver.IsAlive() call in the Test() method where it checks
errors.Is(err, driver.ErrUnauthorized) and returns http.StatusUnauthorized, and
the retry loop behavior in the Start() method (lines 206-212). Ensure each test
case verifies the correct status code and error handling for the unauthorized
scenarios, and that the retry logic executes as expected.

Source: Coding guidelines

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@pkg/controller/provider/container/hyperv/collector.go`:
- Around line 209-212: The time.Sleep(RetryInterval) call in the error handling
block of the r.run(&ctx) failure case is not cancellation-aware, meaning it will
block for the full interval even after the context is cancelled during shutdown.
Replace this blocking sleep with a context-aware sleep mechanism that respects
the context's Done channel, allowing the retry loop to be interrupted
immediately upon cancellation instead of waiting for the full retry interval to
elapse.
- Around line 169-170: In the Test() method, the Connect() call at line 169
returns immediately on error without checking if the error is a
driver.ErrUnauthorized. Instead of returning the raw error, add a check after
the Connect() call to detect if the returned error is driver.ErrUnauthorized,
and when it is, map it to an http.StatusUnauthorized response before returning.
This ensures that unauthorized SMB failures from the Connect() method are
properly converted to HTTP 401 status codes for the caller.

---

Nitpick comments:
In `@pkg/controller/provider/container/hyperv/client.go`:
- Line 87: The log statement using r.Log.Info with message "SMB Windows prefix
not yet discovered, will attempt on next reconnect" is missing the pErr error
variable that contains the discovery error details. Modify this log call to
include pErr as a field or in the message itself to provide the actual error
context needed for diagnosing reconnect failures. This ensures the log output
contains actionable error information rather than just a generic status message.

In `@pkg/controller/provider/container/hyperv/collector.go`:
- Around line 167-178: The Test() and Start() methods now contain new
control-flow branches handling unauthorized errors and retry behavior that lack
unit test coverage. Add unit tests to cover: the unauthorized error path from
the r.client.Connect() call in the Test() method, the unauthorized error path
from the r.client.driver.IsAlive() call in the Test() method where it checks
errors.Is(err, driver.ErrUnauthorized) and returns http.StatusUnauthorized, and
the retry loop behavior in the Start() method (lines 206-212). Ensure each test
case verifies the correct status code and error handling for the unauthorized
scenarios, and that the retry logic executes as expected.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: eda45e4d-66c0-4128-be01-457d042c0f6c

📥 Commits

Reviewing files that changed from the base of the PR and between 47491b2 and ba8ff2b.

📒 Files selected for processing (5)
  • pkg/controller/provider/container/hyperv/client.go
  • pkg/controller/provider/container/hyperv/collector.go
  • pkg/lib/hyperv/driver/errors.go
  • pkg/lib/hyperv/driver/errors_test.go
  • pkg/lib/hyperv/driver/winrm.go

Comment thread pkg/controller/provider/container/hyperv/collector.go
Comment thread pkg/controller/provider/container/hyperv/collector.go
The Hyper-V collector retried failed runs with no delay, spamming the
host with WinRM requests when credentials were incorrect or basic auth
was not enabled.

- Add RetryInterval (5s) sleep on run() failure in Start() loop,
  matching the pattern used by oVirt, OpenStack, and OVA collectors.

- Introduce ErrUnauthorized sentinel error in the driver package,
  wrapping WinRM HTTP 401/403 responses so callers use errors.Is()
  instead of fragile string matching.

- Update Test() to execute a real WinRM command (IsAlive) and return
  http.StatusUnauthorized on auth failure, triggering
  ConnectionAuthFailed in the provider controller.

- Propagate auth errors from SMB prefix discovery in Connect()

Ref: https://redhat.atlassian.net/browse/MTV-5829
Resolves: MTV-5829
Signed-off-by: Elad Hazan <ehazan@redhat.com>
@Hazanel Hazanel force-pushed the hyperV_collector_retry branch from ba8ff2b to 24784da Compare June 22, 2026 14:58
@Hazanel

Hazanel commented Jun 22, 2026

Copy link
Copy Markdown
Contributor Author

Thanks @coderabbitai - please follow-up

@coderabbitai

coderabbitai Bot commented Jun 22, 2026

Copy link
Copy Markdown

Tip

For best results, initiate chat on the files or code changes.

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
echo "=== collector.go Test() method ==="
rg -n "func.*Test\(\)" pkg/controller/provider/container/hyperv/collector.go -A 20

echo "=== errors.go ==="
cat pkg/lib/hyperv/driver/errors.go

Length of output: 1988



@Hazanel Both flagged items are already resolved in the current branch — looks like the PR was updated after my initial review:

🔴 Test()ErrUnauthorized propagation from Connect() ✅ Fixed

The fix I suggested is already in place (Lines 169–173):

if err = r.client.Connect(r.provider); err != nil {
    if errors.Is(err, driver.ErrUnauthorized) {
        status = http.StatusUnauthorized
    }
    return
}

Both the Connect() and IsAlive() paths now correctly return http.StatusUnauthorized, which will trigger ConnectionAuthFailed and stop the controller from retrying indefinitely.


🟡 httpStatus()errors.Join traversal concern ✅ Withdrawn

My concern was based on the AI-generated summary describing errors.Unwrap traversal — but the actual implementation doesn't use errors.Unwrap at all. It calls err.Error() directly and applies the regex to the full error string. Since Go's fmt.Errorf("%w", ...) includes wrapped errors in the .Error() output, this approach naturally handles nested status codes without needing manual chain traversal. The concern doesn't apply.


The PR looks good to me. The implementation is clean, well-tested, and follows established patterns in the codebase. 🐇

@codecov-commenter

Copy link
Copy Markdown

⚠️ Please install the 'codecov app svg image' to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

❌ Patch coverage is 38.88889% with 22 lines in your changes missing coverage. Please review.
✅ Project coverage is 11.03%. Comparing base (f1fe5d0) to head (24784da).
⚠️ Report is 2728 commits behind head on main.

Files with missing lines Patch % Lines
.../controller/provider/container/hyperv/collector.go 0.00% 17 Missing ⚠️
pkg/controller/provider/container/hyperv/client.go 0.00% 3 Missing ⚠️
pkg/lib/hyperv/driver/winrm.go 0.00% 2 Missing ⚠️
❗ Your organization needs to install the Codecov GitHub app to enable full functionality.
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #7230      +/-   ##
==========================================
- Coverage   15.45%   11.03%   -4.42%     
==========================================
  Files         112      524     +412     
  Lines       23377    62892   +39515     
==========================================
+ Hits         3613     6943    +3330     
- Misses      19479    55354   +35875     
- Partials      285      595     +310     
Flag Coverage Δ
unittests 11.03% <38.88%> (-4.42%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Comment thread pkg/controller/provider/container/hyperv/collector.go
Comment thread pkg/lib/hyperv/driver/errors.go
@yaacov yaacov merged commit b19d8f2 into kubev2v:main Jun 23, 2026
18 checks passed
@Hazanel

Hazanel commented Jun 23, 2026

Copy link
Copy Markdown
Contributor Author

/backport release-2.12

@github-actions

Copy link
Copy Markdown

🔄 Starting backport of PR #7230 to release-2.12
🚀 Live mode
View run

@sonarqubecloud

Copy link
Copy Markdown

@github-actions

Copy link
Copy Markdown

✅ PR #7230 backported to release-2.12.

@Hazanel Hazanel deleted the hyperV_collector_retry branch June 23, 2026 05:59
Hazanel added a commit that referenced this pull request Jun 23, 2026
…er-V collector (#7232)

**Backport:** #7230

The Hyper-V collector retried failed runs with no delay, spamming the
host with WinRM requests when credentials were incorrect or basic auth
was not enabled.

- Add RetryInterval (5s) sleep on run() failure in Start() loop,
matching the pattern used by oVirt, OpenStack, and OVA collectors.

- Introduce ErrUnauthorized sentinel error in the driver package,
wrapping WinRM HTTP 401/403 responses so callers use errors.Is() instead
of fragile string matching.

- Update Test() to execute a real WinRM command (IsAlive) and return
http.StatusUnauthorized on auth failure, triggering ConnectionAuthFailed
in the provider controller.

- Propagate auth errors from SMB prefix discovery in Connect()

Ref: https://redhat.atlassian.net/browse/MTV-5829
Resolves: MTV-5829

Signed-off-by: Elad Hazan <ehazan@redhat.com>
Co-authored-by: Elad Hazan <ehazan@redhat.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants