fix: stabilize upgrade cmd by yottahmd · Pull Request #1646 · dagu-org/dagu

yottahmd · 2026-02-08T06:53:09Z

Summary by CodeRabbit

Bug Fixes
- Improved upgrade download reliability with retry logic and exponential backoff.
- Enhanced progress tracking consistency when retries occur.
- Strengthened Windows binary replacement with atomic-like operations.
- Refined backup/restore mechanism with timestamped backups to prevent overwrites.
- Added input validation for version identifiers.
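
The version-identifier validation in the last item can be pictured with a minimal sketch. The function name `validateVersionTag` and the exact rules below are assumptions for illustration; the PR's own `ValidateVersionTag` may check more or differ in detail.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
	"unicode"
)

// validateVersionTag is a hypothetical validator: it rejects empty tags,
// path-like tags, and tags that contain control characters.
func validateVersionTag(tag string) error {
	if tag == "" {
		return errors.New("version tag is empty")
	}
	// Reject anything that could escape a directory when used in a path.
	if strings.Contains(tag, "..") || strings.ContainsAny(tag, `/\`) {
		return fmt.Errorf("version tag %q contains path characters", tag)
	}
	for _, r := range tag {
		if unicode.IsControl(r) {
			return fmt.Errorf("version tag %q contains a control character", tag)
		}
	}
	return nil
}

func main() {
	for _, tag := range []string{"v1.30.0", "../../etc/passwd", "v1.0.0\n"} {
		fmt.Printf("%q -> %v\n", tag, validateVersionTag(tag))
	}
}
```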

coderabbitai · 2026-02-08T06:53:40Z

Walkthrough

This PR introduces a pluggable cache store abstraction for upgrade checking, replacing direct file-based caching with an injectable CacheStore interface. It adds a file-based implementation, implements retry policies with exponential backoff for network operations, enhances error handling during download and installation, and integrates the store throughout the upgrade flow.
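
As a rough illustration of the abstraction described above, the sketch below defines an interface with `Load()` and `Save()` plus an in-memory implementation of the kind a test mock might use. The struct fields are the ones named elsewhere in this review; the method signatures, the `memoryStore` type, and the "nil means no cache" convention are assumptions, not the PR's actual code.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// UpgradeCheckCache lists only the fields mentioned in this review;
// the real struct may carry more.
type UpgradeCheckCache struct {
	LastCheck       time.Time
	LatestVersion   string
	CurrentVersion  string
	UpdateAvailable bool
}

// CacheStore is the assumed shape of the interface.
// Load is assumed to return (nil, nil) when nothing has been cached yet.
type CacheStore interface {
	Load() (*UpgradeCheckCache, error)
	Save(cache *UpgradeCheckCache) error
}

// memoryStore is a hypothetical in-memory CacheStore, e.g. for tests.
type memoryStore struct {
	mu    sync.Mutex
	cache *UpgradeCheckCache
}

func (s *memoryStore) Load() (*UpgradeCheckCache, error) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.cache == nil {
		return nil, nil
	}
	c := *s.cache // return a copy so callers cannot mutate stored state
	return &c, nil
}

func (s *memoryStore) Save(cache *UpgradeCheckCache) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	if cache == nil {
		s.cache = nil
		return nil
	}
	c := *cache
	s.cache = &c
	return nil
}

func main() {
	var store CacheStore = &memoryStore{}
	_ = store.Save(&UpgradeCheckCache{LatestVersion: "v1.2.3", LastCheck: time.Now()})
	got, _ := store.Load()
	fmt.Println(got.LatestVersion)
}
```

Because callers depend only on the interface, a file-backed store and a test double can be swapped without touching the upgrade logic.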

Changes

| Cohort | File(s) | Summary |
|---|---|---|
| Cache Store Abstraction | `internal/upgrade/store.go` | Introduces CacheStore interface with Load() and Save() methods for pluggable cache persistence. |
| File-based Cache Implementation | `internal/persis/fileupgradecheck/store.go`, `internal/persis/fileupgradecheck/store_test.go` | Implements thread-safe Store with atomic JSON write/read; validates directory structure; handles missing/invalid cache files gracefully. |
| Upgrade Cache with Store Integration | `internal/upgrade/cache.go`, `internal/upgrade/upgrade.go` | Refactors CheckAndUpdateCache and GetCachedUpdateInfo to accept CacheStore parameter; removes internal file I/O logic; adds LastCheck timestamp on cache save; adds SpecificVersionRequest field to Result struct and internal backup mechanism. |
| Retry Policy & Error Classification | `internal/upgrade/retry.go` | Introduces exponential backoff retry policy with jitter; defines httpError and nonRetriableError types; classifies HTTP status codes and network errors for retry decisions. |
| Network Operations with Retry | `internal/upgrade/download.go`, `internal/upgrade/github.go` | Wraps download and GitHub API calls in retry loops; adds best-effort HEAD requests for content length; implements atomic file moves; uses URL escaping for version tags; distinguishes retriable vs non-retriable errors. |
| Installation & Validation | `internal/upgrade/install.go`, `internal/upgrade/version.go` | Enhances Windows binary replacement with atomic two-step process using temp files; adds timestamped backups to avoid overwrites; introduces ValidateVersionTag for path traversal and control character detection. |
| Integration Points | `internal/cmd/upgrade.go`, `internal/service/frontend/server.go` | Wires upgradeStore through upgrade flow; updates UpgradeWithReleaseInfo call sites with store parameter; updates getUpdateInfo signature and async cache update logic; adds error handling for store creation. |
| Test Coverage | `internal/upgrade/upgrade_test.go` | Updates test signatures for CheckAndUpdateCache and GetCachedUpdateInfo; adds mock CacheStore implementation; expands tests for retry behavior, error handling, and GitHub/download operations. |

Sequence Diagram

sequenceDiagram
    actor User
    participant Frontend as Frontend Server
    participant UpgradeFlow as Upgrade Flow
    participant CacheStore as CacheStore
    participant GitHub as GitHub API
    participant Download as Download Manager

    User->>Frontend: Check for updates
    Frontend->>CacheStore: Load()
    CacheStore-->>Frontend: cached info or nil
    
    alt Cache valid
        Frontend-->>User: Return cached update info
    else Cache expired or missing
        Frontend->>UpgradeFlow: CheckAndUpdateCache(store, version)
        UpgradeFlow->>CacheStore: Load()
        CacheStore-->>UpgradeFlow: nil or stale cache
        
        loop Retry with exponential backoff
            UpgradeFlow->>GitHub: GetLatestRelease()
            GitHub-->>UpgradeFlow: Release info or retriable error
        end
        
        UpgradeFlow->>CacheStore: Save(cache with LastCheck)
        CacheStore-->>UpgradeFlow: ✓
        UpgradeFlow-->>Frontend: Update available
        Frontend-->>User: Display update available
    end
    
    opt User triggers upgrade
        User->>UpgradeFlow: Download + Install
        loop Retry download on server errors
            UpgradeFlow->>Download: GET binary with retry policy
            Download->>GitHub: HEAD for content-length
            GitHub-->>Download: Response with size
            Download-->>UpgradeFlow: Binary or retriable error
        end
        UpgradeFlow->>CacheStore: Save(cache with LastCheck)
        CacheStore-->>UpgradeFlow: ✓
        UpgradeFlow-->>User: Upgrade complete
    end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

feat: self-upgrade command #1623: Refactors upgrade caching into a pluggable CacheStore abstraction by adding fileupgradecheck.Store, updating UpgradeWithReleaseInfo and cache function signatures, and rewiring frontend caching logic — directly related at code level as it implements the same cache store pattern and integration points.

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 51.16% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title 'fix: stabilize upgrade cmd' is concise and directly relates to the primary objective of stabilizing the upgrade command functionality, though it's somewhat abbreviated. |


coderabbitai

Actionable comments posted: 3

🤖 Fix all issues with AI agents

In `@internal/upgrade/upgrade_test.go`:
- Around line 1621-1626: The HTTP handler passed to httptest.NewServer declares
an unused parameter named `r` which the linter flags; update the handler
signature to use the blank identifier (e.g., change `func(w http.ResponseWriter,
r *http.Request)` to use `_ *http.Request`) or explicitly reference it (e.g., `_
= r`) so the `server` handler in the test no longer has an unused `r` parameter.
- Around line 1593-1602: The HTTP test server handler passed to
httptest.NewServer uses an unused parameter named r which the revive linter
flags; update the anonymous handler func signature in the server variable (the
http.HandlerFunc passed to httptest.NewServer) to replace the unused parameter
name r with the blank identifier _ so the function becomes func(w
http.ResponseWriter, _ *http.Request) and the rest of the logic (incrementing
attempts, returning StatusServiceUnavailable, and encoding release) remains
unchanged.

In `@internal/upgrade/upgrade.go`:
- Around line 341-351: When VerifyBinary(execPath, info.Release.TagName) fails
and you attempt to restore using copyFile(restoreSrc, execPath), capture the
copyFile error into a named variable (e.g., restoreErr) and include its details
when returning the final error instead of only wrapping the original
verification error; update the return paths around the VerifyBinary failure to
return fmt.Errorf("upgrade verification failed (restored backup): %v: %w",
restoreErr, err) or similar so both restoreErr and err are visible, referencing
VerifyBinary, copyFile, execPath, result.BackupPath, and internalBackupPath to
locate the logic.
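
One way to keep both failures visible is sketched below. It is only an illustration: `verifyOrRestore`, its callback parameters, and the sentinel errors are hypothetical stand-ins for the `VerifyBinary` and `copyFile` calls, and the two `%w` verbs in a single `fmt.Errorf` require Go 1.20 or later. The message also distinguishes a successful restore from a failed one, which a single format string cannot do.

```go
package main

import (
	"errors"
	"fmt"
)

// Example sentinel errors used only by this sketch.
var (
	errChecksum = errors.New("checksum mismatch")
	errDiskFull = errors.New("disk full")
)

// verifyOrRestore is hypothetical: verify stands in for VerifyBinary and
// restore for the copyFile call that puts the backup back in place.
func verifyOrRestore(verify, restore func() error) error {
	verifyErr := verify()
	if verifyErr == nil {
		return nil
	}
	if restoreErr := restore(); restoreErr != nil {
		// Wrap both errors so errors.Is / errors.As can find either one.
		return fmt.Errorf("upgrade verification failed: %w; restore from backup also failed: %w",
			verifyErr, restoreErr)
	}
	return fmt.Errorf("upgrade verification failed (restored backup): %w", verifyErr)
}

func main() {
	err := verifyOrRestore(
		func() error { return errChecksum },
		func() error { return errDiskFull },
	)
	fmt.Println(err)
	fmt.Println(errors.Is(err, errChecksum), errors.Is(err, errDiskFull))
}
```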

🧹 Nitpick comments (10)

internal/upgrade/install.go (1)
215-253: Windows replacement logic is now significantly more robust.

The two-rename approach correctly minimizes the vulnerable window compared to a full copy. Rollback on line 246 is appropriately best-effort.
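
For readers unfamiliar with the pattern, a two-rename swap with best-effort rollback might look like the sketch below. The function name `swapBinary`, the `.old` suffix, and the demo file names are assumptions for illustration and are not taken from `install.go`.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// swapBinary is a hypothetical two-rename replacement: move the current
// binary aside, move the new one into place, and roll back if that fails.
func swapBinary(newPath, target string) error {
	oldPath := target + ".old"
	_ = os.Remove(oldPath) // best-effort cleanup of a leftover from an earlier run

	if err := os.Rename(target, oldPath); err != nil {
		return fmt.Errorf("failed to move current binary aside: %w", err)
	}
	if err := os.Rename(newPath, target); err != nil {
		_ = os.Rename(oldPath, target) // best-effort rollback
		return fmt.Errorf("failed to move new binary into place: %w", err)
	}
	// Removal may fail while the old binary is still running; that is tolerated here.
	_ = os.Remove(oldPath)
	return nil
}

// demo runs the swap in a temp directory and returns the target's final content.
func demo() (string, error) {
	dir, err := os.MkdirTemp("", "swap-demo-*")
	if err != nil {
		return "", err
	}
	defer os.RemoveAll(dir)

	target := filepath.Join(dir, "dagu.exe")
	newPath := filepath.Join(dir, "dagu-new")
	if err := os.WriteFile(target, []byte("old"), 0o755); err != nil {
		return "", err
	}
	if err := os.WriteFile(newPath, []byte("new"), 0o755); err != nil {
		return "", err
	}
	if err := swapBinary(newPath, target); err != nil {
		return "", err
	}
	data, err := os.ReadFile(target)
	return string(data), err
}

func main() {
	fmt.Println(demo())
}
```

The target path is missing only between the two renames, which is much shorter than the time a full copy would take.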

One observation: Lines 216–232 are nearly identical to replaceUnixBinary lines 184–201 (create temp → copy → chmod). Consider extracting a small helper like prepareTempBinary(src, dir string, perm os.FileMode) (string, error) to eliminate the duplication.

♻️ Optional: extract shared prep logic
// prepareTempBinary copies src into a new temp file in dir and applies perm.
// On success it returns the temp file path; on failure it cleans up.
func prepareTempBinary(src, dir string, perm os.FileMode) (string, error) {
	tempFile, err := os.CreateTemp(dir, "dagu-new-*")
	if err != nil {
		return "", fmt.Errorf("failed to create temp file: %w", err)
	}
	tempPath := tempFile.Name()
	_ = tempFile.Close()

	if err := copyFile(src, tempPath); err != nil {
		_ = os.Remove(tempPath)
		return "", err
	}

	if err := os.Chmod(tempPath, perm); err != nil {
		_ = os.Remove(tempPath)
		return "", fmt.Errorf("failed to set permissions: %w", err)
	}

	return tempPath, nil
}
Then both replaceUnixBinary and replaceWindowsBinary would call:
tempPath, err := prepareTempBinary(src, filepath.Dir(target), perm)
if err != nil {
    return err
}

internal/persis/fileupgradecheck/store_test.go (2)
1-10: Consider using stretchr/testify/require for assertions.

The test file uses manual if err != nil { t.Fatalf(...) } patterns throughout. This makes tests verbose and less readable compared to require.NoError(t, err) / require.Equal(t, ...).
Example refactor for TestSaveAndLoad
 import (
-	"os"
-	"path/filepath"
 	"testing"
 	"time"
 
 	"github.com/dagu-org/dagu/internal/upgrade"
+	"github.com/stretchr/testify/require"
 )
 func TestSaveAndLoad(t *testing.T) {
 	tmpDir := t.TempDir()
 	store, err := New(tmpDir)
-	if err != nil {
-		t.Fatalf("New() error: %v", err)
-	}
+	require.NoError(t, err)

 	cache := &upgrade.UpgradeCheckCache{...}
-	if err := store.Save(cache); err != nil {
-		t.Fatalf("Save() error: %v", err)
-	}
+	require.NoError(t, store.Save(cache))

 	loaded, err := store.Load()
-	if err != nil {
-		t.Fatalf("Load() error: %v", err)
-	}
-	if loaded == nil {
-		t.Fatal("Load() returned nil after save")
-	}
-	if loaded.LatestVersion != cache.LatestVersion {
-		t.Errorf(...)
-	}
+	require.NoError(t, err)
+	require.NotNil(t, loaded)
+	require.Equal(t, cache.LatestVersion, loaded.LatestVersion)
+	require.Equal(t, cache.CurrentVersion, loaded.CurrentVersion)
+	require.Equal(t, cache.UpdateAvailable, loaded.UpdateAvailable)
+	require.True(t, loaded.LastCheck.Equal(cache.LastCheck))
 }
As per coding guidelines, **/*_test.go: "Use stretchr/testify/require for assertions and shared fixtures from internal/test instead of duplicating mocks".

116-138: TestSaveAtomicWrite only verifies file existence, not atomicity.

The test name suggests it validates atomic write semantics, but it only checks os.Stat on the final path. Consider either renaming to TestSaveCreatesFile or enhancing it to verify atomicity (e.g., confirm no partial writes by checking content validity, or verifying that a concurrent reader never sees a truncated file).
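
A stronger check could assert on content validity and on leftover temp files, along the lines of the sketch below. `writeJSONAtomic` and `checkAtomicSave` are hypothetical helpers, not the store's real API, and the check covers only the final on-disk state; it does not exercise a concurrent reader.

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
	"path/filepath"
)

// writeJSONAtomic (hypothetical) writes v to a temp file in the target
// directory and then renames it over path.
func writeJSONAtomic(path string, v any) error {
	data, err := json.Marshal(v)
	if err != nil {
		return err
	}
	tmp, err := os.CreateTemp(filepath.Dir(path), ".cache-*.tmp")
	if err != nil {
		return err
	}
	tmpPath := tmp.Name()
	defer os.Remove(tmpPath) // fails harmlessly once the rename has succeeded

	if _, err := tmp.Write(data); err != nil {
		_ = tmp.Close()
		return err
	}
	if err := tmp.Close(); err != nil {
		return err
	}
	return os.Rename(tmpPath, path)
}

// checkAtomicSave overwrites the same file several times, then reports whether
// the final content decodes correctly and how many directory entries remain.
func checkAtomicSave() (bool, int, error) {
	dir, err := os.MkdirTemp("", "atomic-save-*")
	if err != nil {
		return false, 0, err
	}
	defer os.RemoveAll(dir)

	path := filepath.Join(dir, "cache.json")
	for i := 0; i < 3; i++ {
		if err := writeJSONAtomic(path, map[string]int{"n": i}); err != nil {
			return false, 0, err
		}
	}
	data, err := os.ReadFile(path)
	if err != nil {
		return false, 0, err
	}
	var decoded map[string]int
	valid := json.Unmarshal(data, &decoded) == nil && decoded["n"] == 2
	entries, err := os.ReadDir(dir)
	if err != nil {
		return false, 0, err
	}
	return valid, len(entries), nil
}

func main() {
	valid, entries, err := checkAtomicSave()
	fmt.Println(valid, entries, err)
}
```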

internal/upgrade/retry.go (2)

50-60: 5xx retry range is limited to 500–504; status codes ≥ 505 won't be retried.

Codes like 502/503/504 are the common transient ones, so this is likely intentional. Just flagging that a broader 5xx gateway error (e.g., 507, 520–529 from CDNs) would be treated as non-retriable since it doesn't match the 500 <= code <= 504 check.

62-75: The doc comment says "other → non-retriable httpError" but classifyResponse always returns a plain *httpError.

The non-retriability for 4xx is an emergent property of isRetriableError, not of the error type returned here. The comment is slightly misleading — consider rewording it to clarify that retry eligibility is determined by isRetriableError, not by this function alone.
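
To make the two points above concrete, here is a minimal sketch of a status check limited to 500–504 and an exponential backoff with jitter. The names `retriableStatus` and `backoffDelay`, the 25% jitter bound, and the cap are assumptions for illustration; `retry.go` may use different values and structure.

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// retriableStatus mirrors the 500 <= code <= 504 range described in the
// review (an assumption about the real check). 505 and above are not retried.
func retriableStatus(code int) bool {
	return code >= 500 && code <= 504
}

// backoffDelay doubles base once per attempt, caps the result at max, and adds
// up to 25% random jitter so that concurrent clients do not retry in lockstep.
// base is assumed to be positive.
func backoffDelay(attempt int, base, max time.Duration) time.Duration {
	d := base
	for i := 0; i < attempt && d < max; i++ {
		d *= 2
	}
	if d > max {
		d = max
	}
	jitter := time.Duration(rand.Int63n(int64(d)/4 + 1))
	return d + jitter
}

func main() {
	for _, code := range []int{404, 500, 503, 507} {
		fmt.Println(code, retriableStatus(code))
	}
	for attempt := 0; attempt < 5; attempt++ {
		fmt.Println(attempt, backoffDelay(attempt, 200*time.Millisecond, 2*time.Second))
	}
}
```

Widening the range to every 5xx code would be a one-line change in `retriableStatus` if CDN-specific codes turn out to matter.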

internal/upgrade/download.go (3)
63-123: Double-close of tempFile: explicit close on Line 106 followed by deferred close on Line 71.

On the success path, tempFile.Close() is called explicitly at Line 106, then the defer at Line 71 calls it again. The second close returns an error that's silently discarded, so this is functionally safe — but it's a code smell. Consider setting tempFile to nil after the explicit close or restructuring to avoid the double close.
Suggested cleanup
 	defer func() {
-		_ = tempFile.Close()
-		if _, statErr := os.Stat(tempPath); statErr == nil {
+		if tempFile != nil {
+			_ = tempFile.Close()
+		}
+		if _, statErr := os.Stat(tempPath); statErr == nil {
 			_ = os.Remove(tempPath)
 		}
 	}()
And after explicit close:
 	if err := tempFile.Close(); err != nil {
 		return &nonRetriableError{err: fmt.Errorf("failed to close temp file: %w", err)}
 	}
+	tempFile = nil

50-50: SetTimeout(0) disables all HTTP timeouts — a single attempt can hang indefinitely.

While the comment indicates this is intentional for large downloads, consider setting a generous connection/TLS timeout (e.g., 30s) separate from the overall transfer timeout. With SetTimeout(0) and no context deadline, a stalled TCP connection during the TLS handshake or DNS resolution could block the retry loop forever.

Resty supports SetTransport to configure DialContext timeouts independently of the read/write deadline, which would allow large transfers while still bounding the initial connection phase.
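
A transport along these lines bounds the connection phase while leaving the body transfer unlimited. It is a sketch using only `net/http`; the function name `newDownloadTransport` and the 30s/60s values are illustrative assumptions, and the resulting transport would be what gets handed to the client's transport setter.

```go
package main

import (
	"fmt"
	"net"
	"net/http"
	"time"
)

// newDownloadTransport (hypothetical) bounds dialing, the TLS handshake, and
// the wait for response headers, without limiting how long the body may take.
func newDownloadTransport() *http.Transport {
	dialer := &net.Dialer{
		Timeout:   30 * time.Second, // covers name resolution and TCP connect
		KeepAlive: 30 * time.Second,
	}
	return &http.Transport{
		Proxy:                 http.ProxyFromEnvironment,
		DialContext:           dialer.DialContext,
		TLSHandshakeTimeout:   30 * time.Second,
		ResponseHeaderTimeout: 60 * time.Second,
	}
}

func main() {
	// Client.Timeout stays at zero, so a large download is not cut off mid-transfer.
	client := &http.Client{Transport: newDownloadTransport()}
	fmt.Println("overall timeout:", client.Timeout)
}
```

A body that stalls mid-transfer is still unbounded with this setup; a context deadline or an idle-read watchdog would be needed to cover that case.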

88-95: Non-200 success codes (e.g., 206 Partial Content) are treated as errors.

code != 200 rejects all non-200 responses. For a fresh full download this is fine, but if future logic adds range requests, 206 would be incorrectly treated as a failure. Low risk given the current usage, just noting it.

internal/upgrade/cache.go (2)

38-38: store.Load() error is silently discarded — consider logging it.

If Load fails due to a persistent issue (e.g., filesystem permissions), every call will bypass the cache and hit GitHub, with no diagnostic breadcrumb. A debug-level log would help troubleshoot without changing the graceful-degradation behavior.

68-68: store.Save() error is silently discarded — same concern as Load.

A failed Save means the next startup will re-fetch from GitHub. Logging the error at debug/warn level would surface persistent storage problems without changing control flow.
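
Both nitpicks come down to the same pattern: log the error, then continue as before. The sketch below uses the standard `log/slog` package (Go 1.21+) purely for illustration; the project may route logs through its own logger, and `loadCached`, `failingStore`, and the trimmed-down `cache` type are hypothetical.

```go
package main

import (
	"errors"
	"log/slog"
	"os"
)

// cache and store are trimmed-down stand-ins for the real types.
type cache struct{ LatestVersion string }

type store interface {
	Load() (*cache, error)
}

// failingStore simulates a persistent storage problem.
type failingStore struct{}

func (failingStore) Load() (*cache, error) {
	return nil, errors.New("permission denied")
}

// newDebugLogger returns a logger that emits debug-level records to stderr.
func newDebugLogger() *slog.Logger {
	return slog.New(slog.NewTextHandler(os.Stderr, &slog.HandlerOptions{Level: slog.LevelDebug}))
}

// loadCached keeps the graceful-degradation behavior (nil on failure) but
// leaves a diagnostic breadcrumb instead of discarding the error.
func loadCached(s store, logger *slog.Logger) *cache {
	c, err := s.Load()
	if err != nil {
		logger.Debug("upgrade cache load failed; falling back to a fresh check", "error", err)
		return nil
	}
	return c
}

func main() {
	_ = loadCached(failingStore{}, newDebugLogger())
}
```

The same shape applies to the `Save` path, where a warn-level record may be more appropriate since the failure repeats on every run.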


codecov · 2026-02-08T07:22:50Z

Codecov Report

❌ Patch coverage is 59.45946% with 75 lines in your changes missing coverage. Please review.
✅ Project coverage is 69.91%. Comparing base (8a67644) to head (ac65732).
⚠️ Report is 3 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| internal/upgrade/upgrade.go | 16.66% | 20 Missing ⚠️ |
| internal/upgrade/github.go | 64.86% | 7 Missing and 6 partials ⚠️ |
| internal/upgrade/install.go | 33.33% | 8 Missing and 4 partials ⚠️ |
| internal/persis/fileupgradecheck/store.go | 62.96% | 3 Missing and 7 partials ⚠️ |
| internal/upgrade/download.go | 70.58% | 5 Missing and 5 partials ⚠️ |
| internal/cmd/upgrade.go | 0.00% | 7 Missing ⚠️ |
| internal/upgrade/retry.go | 91.66% | 2 Missing ⚠️ |
| internal/upgrade/cache.go | 80.00% | 1 Missing ⚠️ |

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #1646      +/-   ##
==========================================
+ Coverage   69.86%   69.91%   +0.05%     
==========================================
  Files         333      335       +2     
  Lines       37405    37440      +35     
==========================================
+ Hits        26133    26178      +45     
+ Misses       9198     9184      -14     
- Partials     2074     2078       +4

| Files with missing lines | Coverage Δ | |
|---|---|---|
| internal/upgrade/version.go | `100.00% <100.00%> (ø)` | |
| internal/upgrade/cache.go | `50.00% <80.00%> (-4.67%)` | ⬇️ |
| internal/upgrade/retry.go | `91.66% <91.66%> (ø)` | |
| internal/cmd/upgrade.go | `19.82% <0.00%> (-0.90%)` | ⬇️ |
| internal/persis/fileupgradecheck/store.go | `62.96% <62.96%> (ø)` | |
| internal/upgrade/download.go | `81.81% <70.58%> (-2.46%)` | ⬇️ |
| internal/upgrade/install.go | `60.94% <33.33%> (+1.20%)` | ⬆️ |
| internal/upgrade/github.go | `74.22% <64.86%> (+8.24%)` | ⬆️ |
| internal/upgrade/upgrade.go | `29.58% <16.66%> (+0.39%)` | ⬆️ |

... and 7 files with indirect coverage changes

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

yottahmd added 5 commits February 8, 2026 12:27

fix bug

3c532e0

fix bug

4389c5e

file store refactor

5c94127

address remaining issues

411dcea

simplify code

fbce381

coderabbitai bot reviewed Feb 8, 2026

internal/upgrade/upgrade_test.go (resolved)

internal/upgrade/upgrade_test.go (resolved)

internal/upgrade/upgrade.go (resolved)

yottahmd added 2 commits February 8, 2026 16:05

refactor

8a4e554

address feedback

ac65732

yottahmd changed the title ~~fix: stability upgrade cmd function~~ fix: stabilize upgrade cmd Feb 8, 2026

yottahmd merged commit c62a7e8 into main Feb 8, 2026
5 checks passed

yottahmd deleted the upgrade-bugfix branch February 8, 2026 07:22
