fix: stabilize upgrade cmd #1646

Merged
yottahmd merged 7 commits into main from upgrade-bugfix
Feb 8, 2026
@yottahmd yottahmd commented Feb 8, 2026

Summary by CodeRabbit

  • Bug Fixes
    • Improved upgrade download reliability with retry logic and exponential backoff.
    • Enhanced progress tracking consistency when retries occur.
    • Strengthened Windows binary replacement with atomic-like operations.
    • Refined backup/restore mechanism with timestamped backups to prevent overwrites.
    • Added input validation for version identifiers.
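
As a rough illustration of the version-identifier validation listed above, the following sketch rejects tags that could escape a download path or contain control characters. The function name `validateTag` and the exact rules are assumptions made for this example; the PR's own `ValidateVersionTag` may apply different or additional checks.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
	"unicode"
)

// validateTag is a hypothetical validator (not the PR's ValidateVersionTag).
// It rejects empty tags, path separators, ".." sequences, and control characters.
func validateTag(tag string) error {
	if tag == "" {
		return errors.New("version tag is empty")
	}
	if strings.Contains(tag, "..") || strings.ContainsAny(tag, `/\`) {
		return fmt.Errorf("version tag %q contains path traversal characters", tag)
	}
	for _, r := range tag {
		if unicode.IsControl(r) {
			return fmt.Errorf("version tag %q contains a control character", tag)
		}
	}
	return nil
}

func main() {
	for _, tag := range []string{"v1.2.3", "../etc/passwd", "v1.0\n"} {
		fmt.Printf("%q -> %v\n", tag, validateTag(tag))
	}
}
```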

coderabbitai bot commented Feb 8, 2026

Walkthrough

This PR introduces a pluggable cache store abstraction for upgrade checking, replacing direct file-based caching with an injectable CacheStore interface. It adds a file-based implementation, implements retry policies with exponential backoff for network operations, enhances error handling during download and installation, and integrates the store throughout the upgrade flow.
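
A minimal sketch of what such a pluggable store might look like is given below. The method signatures are inferred from how `Load()` and `Save()` are called in the tests quoted later in this review, and the struct only carries the fields those tests reference. The `memoryStore` type is hypothetical and exists purely to show that callers depend on the interface rather than on file I/O.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// UpgradeCheckCache mirrors only the fields mentioned in this review;
// the real struct may contain more.
type UpgradeCheckCache struct {
	LastCheck       time.Time
	LatestVersion   string
	CurrentVersion  string
	UpdateAvailable bool
}

// CacheStore is the assumed shape of the injectable interface.
type CacheStore interface {
	Load() (*UpgradeCheckCache, error)
	Save(cache *UpgradeCheckCache) error
}

// memoryStore is a hypothetical in-memory implementation, e.g. for tests.
type memoryStore struct {
	mu    sync.Mutex
	cache *UpgradeCheckCache
}

// Load returns a copy of the stored cache, or nil if nothing was saved.
func (m *memoryStore) Load() (*UpgradeCheckCache, error) {
	m.mu.Lock()
	defer m.mu.Unlock()
	if m.cache == nil {
		return nil, nil
	}
	c := *m.cache
	return &c, nil
}

// Save stores a copy so later caller mutations do not leak into the store.
func (m *memoryStore) Save(cache *UpgradeCheckCache) error {
	m.mu.Lock()
	defer m.mu.Unlock()
	if cache == nil {
		m.cache = nil
		return nil
	}
	c := *cache
	m.cache = &c
	return nil
}

func main() {
	var store CacheStore = &memoryStore{}
	_ = store.Save(&UpgradeCheckCache{
		LastCheck:       time.Now(),
		LatestVersion:   "v1.2.3",
		CurrentVersion:  "v1.2.0",
		UpdateAvailable: true,
	})
	loaded, _ := store.Load()
	fmt.Println(loaded.LatestVersion, loaded.UpdateAvailable)
}
```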

Changes

| Cohort | File(s) | Summary |
|---|---|---|
| Cache Store Abstraction | internal/upgrade/store.go | Introduces CacheStore interface with Load() and Save() methods for pluggable cache persistence. |
| File-based Cache Implementation | internal/persis/fileupgradecheck/store.go, internal/persis/fileupgradecheck/store_test.go | Implements thread-safe Store with atomic JSON write/read; validates directory structure; handles missing/invalid cache files gracefully. |
| Upgrade Cache with Store Integration | internal/upgrade/cache.go, internal/upgrade/upgrade.go | Refactors CheckAndUpdateCache and GetCachedUpdateInfo to accept CacheStore parameter; removes internal file I/O logic; adds LastCheck timestamp on cache save; adds SpecificVersionRequest field to Result struct and internal backup mechanism. |
| Retry Policy & Error Classification | internal/upgrade/retry.go | Introduces exponential backoff retry policy with jitter; defines httpError and nonRetriableError types; classifies HTTP status codes and network errors for retry decisions. |
| Network Operations with Retry | internal/upgrade/download.go, internal/upgrade/github.go | Wraps download and GitHub API calls in retry loops; adds best-effort HEAD requests for content length; implements atomic file moves; uses URL escaping for version tags; distinguishes retriable vs non-retriable errors. |
| Installation & Validation | internal/upgrade/install.go, internal/upgrade/version.go | Enhances Windows binary replacement with atomic two-step process using temp files; adds timestamped backups to avoid overwrites; introduces ValidateVersionTag for path traversal and control character detection. |
| Integration Points | internal/cmd/upgrade.go, internal/service/frontend/server.go | Wires upgradeStore through upgrade flow; updates UpgradeWithReleaseInfo call sites with store parameter; updates getUpdateInfo signature and async cache update logic; adds error handling for store creation. |
| Test Coverage | internal/upgrade/upgrade_test.go | Updates test signatures for CheckAndUpdateCache and GetCachedUpdateInfo; adds mock CacheStore implementation; expands tests for retry behavior, error handling, and GitHub/download operations. |

Sequence Diagram

sequenceDiagram
    actor User
    participant Frontend as Frontend Server
    participant UpgradeFlow as Upgrade Flow
    participant CacheStore as CacheStore
    participant GitHub as GitHub API
    participant Download as Download Manager

    User->>Frontend: Check for updates
    Frontend->>CacheStore: Load()
    CacheStore-->>Frontend: cached info or nil
    
    alt Cache valid
        Frontend-->>User: Return cached update info
    else Cache expired or missing
        Frontend->>UpgradeFlow: CheckAndUpdateCache(store, version)
        UpgradeFlow->>CacheStore: Load()
        CacheStore-->>UpgradeFlow: nil or stale cache
        
        loop Retry with exponential backoff
            UpgradeFlow->>GitHub: GetLatestRelease()
            GitHub-->>UpgradeFlow: Release info or retriable error
        end
        
        UpgradeFlow->>CacheStore: Save(cache with LastCheck)
        CacheStore-->>UpgradeFlow: ✓
        UpgradeFlow-->>Frontend: Update available
        Frontend-->>User: Display update available
    end
    
    opt User triggers upgrade
        User->>UpgradeFlow: Download + Install
        loop Retry download on server errors
            UpgradeFlow->>Download: GET binary with retry policy
            Download->>GitHub: HEAD for content-length
            GitHub-->>Download: Response with size
            Download-->>UpgradeFlow: Binary or retriable error
        end
        UpgradeFlow->>CacheStore: Save(cache with LastCheck)
        CacheStore-->>UpgradeFlow: ✓
        UpgradeFlow-->>User: Upgrade complete
    end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

  • feat: self-upgrade command #1623: Refactors upgrade caching into a pluggable CacheStore abstraction by adding fileupgradecheck.Store, updating UpgradeWithReleaseInfo and cache function signatures, and rewiring frontend caching logic — directly related at code level as it implements the same cache store pattern and integration points.

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 51.16% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title 'fix: stabilize upgrade cmd' is concise and directly relates to the primary objective of stabilizing the upgrade command functionality, though it's somewhat abbreviated. |

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3

🤖 Fix all issues with AI agents
In `@internal/upgrade/upgrade_test.go`:
- Around line 1621-1626: The HTTP handler passed to httptest.NewServer declares
an unused parameter named `r` which the linter flags; update the handler
signature to use the blank identifier (e.g., change `func(w http.ResponseWriter,
r *http.Request)` to use `_ *http.Request`) or explicitly reference it (e.g., `_
= r`) so the `server` handler in the test no longer has an unused `r` parameter.
- Around line 1593-1602: The HTTP test server handler passed to
httptest.NewServer uses an unused parameter named r which the revive linter
flags; update the anonymous handler func signature in the server variable (the
http.HandlerFunc passed to httptest.NewServer) to replace the unused parameter
name r with the blank identifier _ so the function becomes func(w
http.ResponseWriter, _ *http.Request) and the rest of the logic (incrementing
attempts, returning StatusServiceUnavailable, and encoding release) remains
unchanged.

In `@internal/upgrade/upgrade.go`:
- Around line 341-351: When VerifyBinary(execPath, info.Release.TagName) fails
and you attempt to restore using copyFile(restoreSrc, execPath), capture the
copyFile error into a named variable (e.g., restoreErr) and include its details
when returning the final error instead of only wrapping the original
verification error; update the return paths around the VerifyBinary failure to
return fmt.Errorf("upgrade verification failed (restored backup): %v: %w",
restoreErr, err) or similar so both restoreErr and err are visible, referencing
VerifyBinary, copyFile, execPath, result.BackupPath, and internalBackupPath to
locate the logic.
🧹 Nitpick comments (10)
internal/upgrade/install.go (1)

215-253: Windows replacement logic is now significantly more robust.

The two-rename approach correctly minimizes the vulnerable window compared to a full copy. Rollback on line 246 is appropriately best-effort.

One observation: Lines 216–232 are nearly identical to replaceUnixBinary lines 184–201 (create temp → copy → chmod). Consider extracting a small helper like prepareTempBinary(src, dir string, perm os.FileMode) (string, error) to eliminate the duplication.

♻️ Optional: extract shared prep logic
// prepareTempBinary copies src into a new temp file in dir and applies perm.
// On success it returns the temp file path; on failure it cleans up.
func prepareTempBinary(src, dir string, perm os.FileMode) (string, error) {
	tempFile, err := os.CreateTemp(dir, "dagu-new-*")
	if err != nil {
		return "", fmt.Errorf("failed to create temp file: %w", err)
	}
	tempPath := tempFile.Name()
	_ = tempFile.Close()

	if err := copyFile(src, tempPath); err != nil {
		_ = os.Remove(tempPath)
		return "", err
	}

	if err := os.Chmod(tempPath, perm); err != nil {
		_ = os.Remove(tempPath)
		return "", fmt.Errorf("failed to set permissions: %w", err)
	}

	return tempPath, nil
}

Then both replaceUnixBinary and replaceWindowsBinary would call:

tempPath, err := prepareTempBinary(src, filepath.Dir(target), perm)
if err != nil {
    return err
}
internal/persis/fileupgradecheck/store_test.go (2)

1-10: Consider using stretchr/testify/require for assertions.

The test file uses manual if err != nil { t.Fatalf(...) } patterns throughout. This makes tests verbose and less readable compared to require.NoError(t, err) / require.Equal(t, ...).

Example refactor for TestSaveAndLoad
 import (
-	"os"
-	"path/filepath"
 	"testing"
 	"time"
 
 	"github.com/dagu-org/dagu/internal/upgrade"
+	"github.com/stretchr/testify/require"
 )
 func TestSaveAndLoad(t *testing.T) {
 	tmpDir := t.TempDir()
 	store, err := New(tmpDir)
-	if err != nil {
-		t.Fatalf("New() error: %v", err)
-	}
+	require.NoError(t, err)

 	cache := &upgrade.UpgradeCheckCache{...}
-	if err := store.Save(cache); err != nil {
-		t.Fatalf("Save() error: %v", err)
-	}
+	require.NoError(t, store.Save(cache))

 	loaded, err := store.Load()
-	if err != nil {
-		t.Fatalf("Load() error: %v", err)
-	}
-	if loaded == nil {
-		t.Fatal("Load() returned nil after save")
-	}
-	if loaded.LatestVersion != cache.LatestVersion {
-		t.Errorf(...)
-	}
+	require.NoError(t, err)
+	require.NotNil(t, loaded)
+	require.Equal(t, cache.LatestVersion, loaded.LatestVersion)
+	require.Equal(t, cache.CurrentVersion, loaded.CurrentVersion)
+	require.Equal(t, cache.UpdateAvailable, loaded.UpdateAvailable)
+	require.True(t, loaded.LastCheck.Equal(cache.LastCheck))
 }

As per coding guidelines, **/*_test.go: "Use stretchr/testify/require for assertions and shared fixtures from internal/test instead of duplicating mocks".


116-138: TestSaveAtomicWrite only verifies file existence, not atomicity.

The test name suggests it validates atomic write semantics, but it only checks os.Stat on the final path. Consider either renaming to TestSaveCreatesFile or enhancing it to verify atomicity (e.g., confirm no partial writes by checking content validity, or verifying that a concurrent reader never sees a truncated file).

internal/upgrade/retry.go (2)

50-60: 5xx retry range is limited to 500–504; status codes ≥ 505 won't be retried.

Codes like 502/503/504 are the common transient ones, so this is likely intentional. Just flagging that a broader 5xx gateway error (e.g., 507, 520–529 from CDNs) would be treated as non-retriable since it doesn't match the 500 <= code <= 504 check.


62-75: The doc comment says "other → non-retriable httpError" but classifyResponse always returns a plain *httpError.

The non-retriability for 4xx is an emergent property of isRetriableError, not of the error type returned here. The comment is slightly misleading — consider rewording it to clarify that retry eligibility is determined by isRetriableError, not by this function alone.

internal/upgrade/download.go (3)

63-123: Double-close of tempFile: explicit close on Line 106 followed by deferred close on Line 71.

On the success path, tempFile.Close() is called explicitly at Line 106, then the defer at Line 71 calls it again. The second close returns an error that's silently discarded, so this is functionally safe — but it's a code smell. Consider setting tempFile to nil after the explicit close or restructuring to avoid the double close.

Suggested cleanup
 	defer func() {
-		_ = tempFile.Close()
-		if _, statErr := os.Stat(tempPath); statErr == nil {
+		if tempFile != nil {
+			_ = tempFile.Close()
+		}
+		if _, statErr := os.Stat(tempPath); statErr == nil {
 			_ = os.Remove(tempPath)
 		}
 	}()

And after explicit close:

 	if err := tempFile.Close(); err != nil {
 		return &nonRetriableError{err: fmt.Errorf("failed to close temp file: %w", err)}
 	}
+	tempFile = nil

50-50: SetTimeout(0) disables all HTTP timeouts — a single attempt can hang indefinitely.

While the comment indicates this is intentional for large downloads, consider setting a generous connection/TLS timeout (e.g., 30s) separate from the overall transfer timeout. With SetTimeout(0) and no context deadline, a stalled TCP connection during the TLS handshake or DNS resolution could block the retry loop forever.

Resty supports SetTransport to configure DialContext timeouts independently of the read/write deadline, which would allow large transfers while still bounding the initial connection phase.


88-95: Non-200 success codes (e.g., 206 Partial Content) are treated as errors.

code != 200 rejects all non-200 responses. For a fresh full download this is fine, but if future logic adds range requests, 206 would be incorrectly treated as a failure. Low risk given the current usage, just noting it.

internal/upgrade/cache.go (2)

38-38: store.Load() error is silently discarded — consider logging it.

If Load fails due to a persistent issue (e.g., filesystem permissions), every call will bypass the cache and hit GitHub, with no diagnostic breadcrumb. A debug-level log would help troubleshoot without changing the graceful-degradation behavior.


68-68: store.Save() error is silently discarded — same concern as Load.

A failed Save means the next startup will re-fetch from GitHub. Logging the error at debug/warn level would surface persistent storage problems without changing control flow.
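
Both cache.go comments point at the same pattern, sketched below: keep the graceful fallback but log the store error instead of dropping it. The types `cacheEntry`, `cacheStore`, and `brokenStore` and the helpers `loadCached` and `saveCached` are hypothetical, and the standard `log` package stands in for whatever logger the project uses.

```go
package main

import (
	"errors"
	"log"
)

// cacheEntry and cacheStore are simplified stand-ins for the PR's types.
type cacheEntry struct{ LatestVersion string }

type cacheStore interface {
	Load() (*cacheEntry, error)
	Save(entry *cacheEntry) error
}

// brokenStore simulates a persistent storage problem.
type brokenStore struct{}

func (brokenStore) Load() (*cacheEntry, error) { return nil, errors.New("permission denied") }
func (brokenStore) Save(*cacheEntry) error     { return errors.New("permission denied") }

// loadCached returns the cached entry, or nil on failure. The failure is
// logged so repeated cache misses leave a trace.
func loadCached(store cacheStore, logf func(format string, args ...any)) *cacheEntry {
	cached, err := store.Load()
	if err != nil {
		logf("upgrade cache: load failed, falling back to a fresh check: %v", err)
		return nil
	}
	return cached
}

// saveCached persists the entry and logs a failure without changing control flow.
func saveCached(store cacheStore, entry *cacheEntry, logf func(format string, args ...any)) {
	if err := store.Save(entry); err != nil {
		logf("upgrade cache: save failed: %v", err)
	}
}

func main() {
	cached := loadCached(brokenStore{}, log.Printf)
	saveCached(brokenStore{}, &cacheEntry{LatestVersion: "v1.2.3"}, log.Printf)
	log.Printf("cached entry present: %t", cached != nil)
}
```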

@yottahmd yottahmd changed the title from "fix: stability upgrade cmd function" to "fix: stabilize upgrade cmd" Feb 8, 2026
@yottahmd yottahmd merged commit c62a7e8 into main Feb 8, 2026
5 checks passed
@yottahmd yottahmd deleted the upgrade-bugfix branch February 8, 2026 07:22
codecov bot commented Feb 8, 2026

Codecov Report

❌ Patch coverage is 59.45946% with 75 lines in your changes missing coverage. Please review.
✅ Project coverage is 69.91%. Comparing base (8a67644) to head (ac65732).
⚠️ Report is 3 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| internal/upgrade/upgrade.go | 16.66% | 20 Missing ⚠️ |
| internal/upgrade/github.go | 64.86% | 7 Missing and 6 partials ⚠️ |
| internal/upgrade/install.go | 33.33% | 8 Missing and 4 partials ⚠️ |
| internal/persis/fileupgradecheck/store.go | 62.96% | 3 Missing and 7 partials ⚠️ |
| internal/upgrade/download.go | 70.58% | 5 Missing and 5 partials ⚠️ |
| internal/cmd/upgrade.go | 0.00% | 7 Missing ⚠️ |
| internal/upgrade/retry.go | 91.66% | 2 Missing ⚠️ |
| internal/upgrade/cache.go | 80.00% | 1 Missing ⚠️ |

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #1646      +/-   ##
==========================================
+ Coverage   69.86%   69.91%   +0.05%     
==========================================
  Files         333      335       +2     
  Lines       37405    37440      +35     
==========================================
+ Hits        26133    26178      +45     
+ Misses       9198     9184      -14     
- Partials     2074     2078       +4     

| Files with missing lines | Coverage Δ |
|---|---|
| internal/upgrade/version.go | 100.00% <100.00%> (ø) |
| internal/upgrade/cache.go | 50.00% <80.00%> (-4.67%) ⬇️ |
| internal/upgrade/retry.go | 91.66% <91.66%> (ø) |
| internal/cmd/upgrade.go | 19.82% <0.00%> (-0.90%) ⬇️ |
| internal/persis/fileupgradecheck/store.go | 62.96% <62.96%> (ø) |
| internal/upgrade/download.go | 81.81% <70.58%> (-2.46%) ⬇️ |
| internal/upgrade/install.go | 60.94% <33.33%> (+1.20%) ⬆️ |
| internal/upgrade/github.go | 74.22% <64.86%> (+8.24%) ⬆️ |
| internal/upgrade/upgrade.go | 29.58% <16.66%> (+0.39%) ⬆️ |

... and 7 files with indirect coverage changes


Δ = absolute <relative> (impact), ø = not affected, ? = missing data