📝 Walkthrough

This PR implements a catch-up scheduling system for replaying missed DAG runs during scheduler downtime. It adds catch-up policy types, extends schedule configurations with per-entry catch-up settings, introduces a CatchupEngine for generating and dispatching missed candidates, propagates scheduled-time through CLI commands and runtime layers, and persists per-DAG watermarks for state tracking.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Scheduler
    participant CatchupEngine
    participant DAGStateStore
    participant DAGExecutor
    participant QueueStore
    Scheduler->>CatchupEngine: Run(dags)
    CatchupEngine->>DAGStateStore: LoadAll(dags)
    DAGStateStore-->>CatchupEngine: per-DAG states
    CatchupEngine->>CatchupEngine: generateCandidates(policy, window)
    CatchupEngine->>CatchupEngine: applyPolicy(Off/Latest/All)
    CatchupEngine->>CatchupEngine: enforceGlobalCap
    CatchupEngine->>CatchupEngine: enforcePerDAGCap
    loop For each candidate
        CatchupEngine->>CatchupEngine: isDuplicate(check recent)
        alt Not duplicate
            CatchupEngine->>DAGExecutor: dispatchCandidate
            DAGExecutor->>QueueStore: Enqueue(dagRun, scheduledTime)
            QueueStore-->>DAGExecutor: success
            DAGExecutor-->>CatchupEngine: dispatched
        else Duplicate
            CatchupEngine-->>CatchupEngine: skip
        end
    end
    CatchupEngine->>DAGStateStore: SaveAll(dags, newWatermark)
    DAGStateStore-->>CatchupEngine: saved
    CatchupEngine-->>Scheduler: CatchupResult{Dispatched, Skipped}
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~75 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)
Actionable comments posted: 4
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
rfcs/004-schedule-catchup.md (1)
282-286: ⚠️ Potential issue | 🟠 Major

`DAGU_IS_CATCHUP` value documented as `"true"` but implementation uses `"1"`.

See the corresponding issue flagged in `internal/runtime/agent/agent.go` line 417. The RFC documents the value as `"true"`, which should be the authoritative specification. Ensure the code is updated to match.

internal/service/scheduler/dag_executor.go (1)
94-105: ⚠️ Potential issue | 🟡 Minor

`scheduledTime` is accepted but not propagated to the distributed execution task.

When `ExecuteDAG` takes the distributed path (lines 94-105), `scheduledTime` is not passed to `CreateTask`. In contrast, the local execution path explicitly passes it via `formatScheduledTime(scheduledTime)` (line 117).

For the main flows, `scheduledTime` is preserved elsewhere: for distributed START, it's persisted in the queue entry before dispatch; for RETRY, it's embedded in `previousStatus` (as JSON). However, this inconsistency creates confusion and means the parameter is unused in the distributed code path. If `ExecuteDAG` is called directly for distributed execution outside the normal `HandleJob`/queue flows, the scheduled time is effectively lost.

Consider whether `CreateTask` should accept a `scheduledTime` option for consistency, or document why the distributed path intentionally omits it.
🤖 Fix all issues with AI agents
In `@internal/core/dag.go`:
- Around line 657-661: UnmarshalJSON currently parses alias.CatchupWindow with
time.ParseDuration which doesn't accept day units, causing mismatch with
buildSchedulerFromEntries (which uses duration.Parse); update the UnmarshalJSON
logic that sets s.CatchupWindow (when alias.CatchupWindow != "") to use the same
duration.Parse function as the spec builder (and keep the existing error
wrapping message), ensuring alias.CatchupWindow, s.CatchupWindow, and
UnmarshalJSON behave consistently with buildSchedulerFromEntries.
In `@internal/runtime/agent/agent.go`:
- Around line 410-421: The env var value for catch-up is being set to "1" which
contradicts the RFC; change the value to "true" where
extraEnvs[exec.EnvKeyIsCatchup] is assigned in the scheduled-time injection
block (inside the a.scheduledTime conditional that uses runtime.GetDAGContext,
rCtx.EnvScope.WithEntries, eval.EnvSourceDAGEnv and runtime.WithDAGContext) so
the map entry uses "true" when a.triggerType == core.TriggerTypeCatchUp, keeping
all other logic intact.
In `@internal/service/scheduler/catchup.go`:
- Around line 83-91: The first-run branch uses c.clock() twice causing
inconsistent timestamps: replace the second call to c.clock() inside the
hasAnyWatermark == false branch with the previously captured variable catchupTo
so the saved watermarks and the catchup reference use the same timestamp; update
the SaveAll invocation (called via c.dagStateStore.SaveAll) to pass catchupTo
instead of c.clock(), leaving hasAnyWatermark, catchupTo and result.Duration
logic unchanged.
- Around line 113-145: The loop currently breaks on dispatch failure but
unconditionally calls c.dagStateStore.SaveAll(dags, catchupTo), advancing every
DAG watermark; change this so only DAGs that completed catch-up are advanced to
catchupTo: track successful DAGs inside the dispatch loop (e.g., a map keyed by
cand.dag.Name or dag ID updated when you increment result.Dispatched and after
successful per-DAG Save(cand.dag, dagState{LastTick: cand.scheduledTime})), and
after the loop call SaveAll only for that subset (or iterate and call Save(dag,
dagState{LastTick: catchupTo}) for each successful DAG). Ensure failing DAGs are
left at their last saved state (you already Save per-dispatch in
dispatchCandidate flow) and do not include them in the final SaveAll(catchupTo)
advance.
🧹 Nitpick comments (14)
internal/cmd/enqueue.go (1)
68-70: Silently discarding the error from `ctx.StringParam` is inconsistent with other parameters in this function.

Lines 35–37 and 49–51 both check `err` from `ctx.StringParam`. While `scheduled-time` is optional and the flag is registered (so it should always succeed), swallowing the error silently makes debugging harder if the param name ever gets out of sync. The same pattern appears in `start.go` line 104.

Suggested fix:

```diff
-	scheduledTime, _ := ctx.StringParam("scheduled-time")
-
-	return enqueueDAGRun(ctx, dag, runID, triggerType, scheduledTime)
+	scheduledTime, err := ctx.StringParam("scheduled-time")
+	if err != nil {
+		return fmt.Errorf("failed to get scheduled-time: %w", err)
+	}
+
+	return enqueueDAGRun(ctx, dag, runID, triggerType, scheduledTime)
```

internal/cmd/start.go (2)
104: Same error-discarding pattern as `enqueue.go`.

See the comment on `enqueue.go` line 68 — same concern applies here. Consider handling the error consistently.
335: Function signatures are growing long — consider an options struct.

`handleSubDAGRun` now takes 9 parameters and `executeDAGRun` takes 8. Each new cross-cutting concern (triggerType, scheduledTime) adds another parameter to every function in the call chain. An options struct would make future additions non-breaking and improve readability.

Not blocking for this PR since the pattern is pre-existing, but worth considering as a follow-up.
Also applies to: 371-371
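To make the options-struct suggestion concrete, here is a minimal sketch; `RunOptions` and this `executeDAGRun` signature are hypothetical, not Dagu's real API:

```go
package main

import (
	"fmt"
	"time"
)

// RunOptions bundles the cross-cutting parameters that would otherwise keep
// widening the executeDAGRun/handleSubDAGRun signatures (hypothetical names).
type RunOptions struct {
	TriggerType   string
	ScheduledTime time.Time
	Quiet         bool
}

// executeDAGRun is a sketch: adding a future field to RunOptions does not
// break existing call sites, unlike appending a positional parameter.
func executeDAGRun(dagName string, opts RunOptions) string {
	return fmt.Sprintf("run %s trigger=%s scheduled=%s",
		dagName, opts.TriggerType, opts.ScheduledTime.Format(time.RFC3339))
}

func main() {
	out := executeDAGRun("etl-daily", RunOptions{
		TriggerType:   "catchup",
		ScheduledTime: time.Date(2024, 1, 1, 3, 0, 0, 0, time.UTC),
	})
	fmt.Println(out)
}
```

Call sites only name the fields they care about, so a tenth concern becomes a one-line struct change rather than an eight-signature refactor.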
rfcs/004-schedule-catchup.md (1)
368-392: Add language identifiers to fenced code blocks.

The log output and CLI output code blocks lack language identifiers, which triggers markdownlint MD040 warnings. Use `text` or `log` for log output blocks, and `text` for CLI output blocks. For example:

````diff
-```
+```text
 level=INFO msg="Catch-up started" ...
````

Also applies to: 419-438, 522-560
internal/cmn/config/loader_test.go (1)
487-489: Consider adding a test case that exercises explicit YAML/env overrides for the catchup fields.

The current YAML test doesn't set `maxGlobalCatchupRuns`, `maxCatchupRunsPerDAG`, or `catchupRateLimit` in the YAML input, so it only tests defaults. A dedicated test case (or extending the YAML block) would confirm that user-provided values override the defaults correctly.

internal/service/scheduler/queue_processor.go (1)
382-387: Silently discarding `ParseTime` error may hide issues.

If `status.ScheduledTime` is non-empty but malformed, a zero `time.Time` will silently propagate to `ExecuteDAG`. Consider logging at debug level on parse failure so operators can diagnose mismatches.

Suggested change:

```diff
 go func() {
-	scheduledTime, _ := stringutil.ParseTime(status.ScheduledTime)
+	scheduledTime, err := stringutil.ParseTime(status.ScheduledTime)
+	if err != nil && status.ScheduledTime != "" {
+		logger.Debug(ctx, "Failed to parse scheduledTime, using zero value", tag.Error(err))
+	}
 	if err := p.dagExecutor.ExecuteDAG(ctx, dag, coordinatorv1.Operation_OPERATION_RETRY, runID, status, status.TriggerType, scheduledTime); err != nil {
```

internal/core/catchup_policy_test.go (2)
28-31: Consider wrapping table iterations in `t.Run` for better failure diagnostics.

When a table case fails, the test output won't indicate which row. Using `t.Run(tt.want, ...)` (or a name field) would pinpoint failures. This is a minor nit for a small table.

50-58: Same `t.Run` suggestion applies to the parse test loop.

internal/service/scheduler/catchup_test.go (2)
226-271: Global cap test relies on deterministic map iteration order.

`dags` is a `map[string]*core.DAG`, and Go maps have non-deterministic iteration order. With `MaxGlobalCatchupRuns = 3` and two DAGs each generating 24 candidates, the test asserts exactly 3 total candidates — but the per-DAG distribution depends on which DAG is visited first by `generateCandidates`. If the implementation iterates the map and accumulates candidates until the global cap is hit, the result count (3) is stable, but if you ever want to assert which DAG's candidates appear, this will be flaky.

The current assertion (`assert.Equal(t, 3, len(candidates))`) is fine as long as only the total count matters. Just noting this for awareness in case future assertions on candidate contents are added.
31-65: Consider extracting the mock to a shared test helper or using a code generator.

The `catchupMockDAGRunStore` duplicates mock boilerplate for the `exec.DAGRunStore` interface. If this interface grows, this mock will need manual updates.

Based on learnings: "Integration fixtures should live in `internal/test` for use in test files."

internal/service/scheduler/entryreader.go (1)
211-218: `Registry()` is read-only but acquires an exclusive lock.

The `entryReaderImpl.lock` is a `sync.Mutex`, so `Registry()` blocks all other operations (including other `Registry()` callers). If this becomes a hot path (e.g., called every catchup tick), consider upgrading to `sync.RWMutex` so reads can proceed concurrently.

internal/service/scheduler/scheduler.go (1)
339-347: Per-tick file I/O for watermark persistence — consider the write volume.

`SaveAll` writes one JSON file per catch-up-enabled DAG on every scheduler tick (once per minute). With many catch-up-enabled DAGs this creates sustained filesystem churn. For most deployments this is fine, but consider batching into a single file or only writing when the set of catch-up DAGs changes, if this becomes a bottleneck.

internal/cmn/config/loader.go (1)
732-734: Missing environment variable bindings for new catchup config fields.

The other scheduler settings (`scheduler.port`, `lockStaleThreshold`, etc.) have corresponding `DAGU_SCHEDULER_*` env var bindings in the `envBindings` slice (lines 1208-1211). The three new fields (`MaxGlobalCatchupRuns`, `MaxCatchupRunsPerDAG`, `CatchupRateLimit`) lack env var bindings, making them config-file-only. If environment-based configuration is expected to be supported (as it is for sibling fields), add bindings.

Proposed env bindings (to add near line 1211):

```diff
 {key: "scheduler.zombieDetectionInterval", env: "SCHEDULER_ZOMBIE_DETECTION_INTERVAL"},
+{key: "scheduler.maxGlobalCatchupRuns", env: "SCHEDULER_MAX_GLOBAL_CATCHUP_RUNS"},
+{key: "scheduler.maxCatchupRunsPerDAG", env: "SCHEDULER_MAX_CATCHUP_RUNS_PER_DAG"},
+{key: "scheduler.catchupRateLimit", env: "SCHEDULER_CATCHUP_RATE_LIMIT"},
```

internal/service/scheduler/catchup.go (1)
321-336: Duplicate check scans only the last 50 attempts — could miss duplicates with high catch-up volume.

If a DAG has more than 50 recent runs (e.g., from a previous catch-up or frequent schedule), older runs won't be checked. With the default per-DAG cap of 20, this is unlikely to be hit in practice. Consider making the limit configurable or at least matching it to `MaxCatchupRunsPerDAG`.
```go
if alias.CatchupWindow != "" {
	s.CatchupWindow, err = time.ParseDuration(alias.CatchupWindow)
	if err != nil {
		return fmt.Errorf("invalid catchupWindow %q: %w", alias.CatchupWindow, err)
	}
```
Inconsistent duration parsing: `time.ParseDuration` here vs `duration.Parse` in spec builder.

`UnmarshalJSON` uses `time.ParseDuration`, which doesn't support day-unit durations (e.g., "2d12h"), but `buildSchedulerFromEntries` in `internal/core/spec/schedule.go` (line 54) uses `duration.Parse`, which does. A user who sets `catchupWindow: "2d"` in YAML will succeed at build time, but the value will fail to round-trip through JSON serialization/deserialization.
Proposed fix:

```diff
+	"github.com/dagu-org/dagu/internal/cmn/duration"
 ...
 if alias.CatchupWindow != "" {
-	s.CatchupWindow, err = time.ParseDuration(alias.CatchupWindow)
+	s.CatchupWindow, err = duration.Parse(alias.CatchupWindow)
 	if err != nil {
 		return fmt.Errorf("invalid catchupWindow %q: %w", alias.CatchupWindow, err)
 	}
 }
```
🤖 Prompt for AI Agents
In `@internal/core/dag.go` around lines 657 - 661, UnmarshalJSON currently parses
alias.CatchupWindow with time.ParseDuration which doesn't accept day units,
causing mismatch with buildSchedulerFromEntries (which uses duration.Parse);
update the UnmarshalJSON logic that sets s.CatchupWindow (when
alias.CatchupWindow != "") to use the same duration.Parse function as the spec
builder (and keep the existing error wrapping message), ensuring
alias.CatchupWindow, s.CatchupWindow, and UnmarshalJSON behave consistently with
buildSchedulerFromEntries.
```go
// Inject scheduled time and catch-up environment variables
if a.scheduledTime != "" {
	rCtx := runtime.GetDAGContext(ctx)
	extraEnvs := map[string]string{
		exec.EnvKeyScheduledTime: a.scheduledTime,
	}
	if a.triggerType == core.TriggerTypeCatchUp {
		extraEnvs[exec.EnvKeyIsCatchup] = "1"
	}
	rCtx.EnvScope = rCtx.EnvScope.WithEntries(extraEnvs, eval.EnvSourceDAGEnv)
	ctx = runtime.WithDAGContext(ctx, rCtx)
}
```
`DAGU_IS_CATCHUP` value `"1"` contradicts the RFC spec, which says `"true"`.

The RFC (line 285) documents `DAGU_IS_CATCHUP` as `"true"`, and the walkthrough example (line 544) shows `DAGU_IS_CATCHUP=true`. However, line 417 sets it to `"1"`. DAG authors reading the RFC would write `if [ "$DAGU_IS_CATCHUP" = "true" ]`, which would fail.

Either update the code to use `"true"` or update the RFC — recommend aligning the code with the RFC since it's the public contract.
Suggested fix:

```diff
 if a.triggerType == core.TriggerTypeCatchUp {
-	extraEnvs[exec.EnvKeyIsCatchup] = "1"
+	extraEnvs[exec.EnvKeyIsCatchup] = "true"
 }
```
🤖 Prompt for AI Agents
In `@internal/runtime/agent/agent.go` around lines 410 - 421, The env var value
for catch-up is being set to "1" which contradicts the RFC; change the value to
"true" where extraEnvs[exec.EnvKeyIsCatchup] is assigned in the scheduled-time
injection block (inside the a.scheduledTime conditional that uses
runtime.GetDAGContext, rCtx.EnvScope.WithEntries, eval.EnvSourceDAGEnv and
runtime.WithDAGContext) so the map entry uses "true" when a.triggerType ==
core.TriggerTypeCatchUp, keeping all other logic intact.
```go
if !hasAnyWatermark {
	// First run — no catch-up needed, seed all DAGs with current time
	logger.Info(ctx, "No per-DAG watermarks found, skipping catch-up")
	if err := c.dagStateStore.SaveAll(dags, c.clock()); err != nil {
		return nil, fmt.Errorf("failed to save initial watermarks: %w", err)
	}
	result.Duration = c.clock().Sub(start)
	return result, nil
}
```
Minor inconsistency: `c.clock()` called separately for `SaveAll` vs. `catchupTo`.

Line 72 captures `catchupTo := c.clock()`, but line 86 calls `c.clock()` again for `SaveAll`. In the first-run path, the saved watermark could be slightly later than the `catchupTo` used elsewhere. Use `catchupTo` for consistency.
Proposed fix:

```diff
-	if err := c.dagStateStore.SaveAll(dags, c.clock()); err != nil {
+	if err := c.dagStateStore.SaveAll(dags, catchupTo); err != nil {
```

🤖 Prompt for AI Agents
In `@internal/service/scheduler/catchup.go` around lines 83 - 91, The first-run
branch uses c.clock() twice causing inconsistent timestamps: replace the second
call to c.clock() inside the hasAnyWatermark == false branch with the previously
captured variable catchupTo so the saved watermarks and the catchup reference
use the same timestamp; update the SaveAll invocation (called via
c.dagStateStore.SaveAll) to pass catchupTo instead of c.clock(), leaving
hasAnyWatermark, catchupTo and result.Duration logic unchanged.
```go
// Dispatch candidates
for _, cand := range candidates {
	if ctx.Err() != nil {
		break
	}

	dispatched, err := c.dispatchCandidate(ctx, cand)
	if err != nil {
		logger.Error(ctx, "Catch-up dispatch failed, stopping catch-up",
			tag.DAG(cand.dag.Name),
			tag.Error(err),
		)
		// Save watermark at the last successful dispatch point
		break
	}

	if dispatched {
		result.Dispatched++
		// Advance this DAG's watermark after each successful dispatch
		if err := c.dagStateStore.Save(cand.dag, dagState{LastTick: cand.scheduledTime}); err != nil {
			logger.Error(ctx, "Failed to save DAG state", tag.Error(err))
		}
	} else {
		result.Skipped++
	}

	time.Sleep(c.config.Scheduler.CatchupRateLimit)
}

// Set watermarks to catchupTo after all dispatches
if err := c.dagStateStore.SaveAll(dags, catchupTo); err != nil {
	logger.Error(ctx, "Failed to save final watermarks", tag.Error(err))
}
```
Watermark advances to `catchupTo` even after dispatch failure — missed runs will be lost.

When a dispatch fails (line 120), the loop breaks, but line 143 unconditionally calls `SaveAll(dags, catchupTo)`, advancing all DAG watermarks to `catchupTo`. This means:

- The failed DAG's watermark jumps past any un-dispatched candidates.
- Any remaining candidates for other DAGs that weren't reached in the loop are also skipped.

On the next restart, catch-up won't retry these skipped runs because the watermark already advanced past them.

Consider tracking which DAGs completed successfully and only advancing their watermarks to `catchupTo`, while leaving failed/incomplete DAGs at their last successful dispatch point (already saved incrementally at line 132).
Sketch of a fix:

```diff
+	// Track which DAG (if any) hit a dispatch failure
+	failedDAG := ""
+
 	// Dispatch candidates
 	for _, cand := range candidates {
 		if ctx.Err() != nil {
 			break
 		}
 		dispatched, err := c.dispatchCandidate(ctx, cand)
 		if err != nil {
 			logger.Error(ctx, "Catch-up dispatch failed, stopping catch-up",
 				tag.DAG(cand.dag.Name),
 				tag.Error(err),
 			)
-			// Save watermark at the last successful dispatch point
+			failedDAG = cand.dag.Name
 			break
 		}
 		if dispatched {
 			result.Dispatched++
 			// Advance this DAG's watermark after each successful dispatch
 			if err := c.dagStateStore.Save(cand.dag, dagState{LastTick: cand.scheduledTime}); err != nil {
 				logger.Error(ctx, "Failed to save DAG state", tag.Error(err))
 			}
 		} else {
 			result.Skipped++
 		}
 		time.Sleep(c.config.Scheduler.CatchupRateLimit)
 	}
-	// Set watermarks to catchupTo after all dispatches
-	if err := c.dagStateStore.SaveAll(dags, catchupTo); err != nil {
-		logger.Error(ctx, "Failed to save final watermarks", tag.Error(err))
+	// Only advance watermarks for DAGs that weren't interrupted
+	for k, dag := range dags {
+		if k == failedDAG {
+			continue // Leave at last successful dispatch watermark
+		}
+		if err := c.dagStateStore.Save(dag, dagState{LastTick: catchupTo}); err != nil {
+			logger.Error(ctx, "Failed to save final watermark", tag.Error(err))
+		}
 	}
```
🤖 Prompt for AI Agents
In `@internal/service/scheduler/catchup.go` around lines 113 - 145, The loop
currently breaks on dispatch failure but unconditionally calls
c.dagStateStore.SaveAll(dags, catchupTo), advancing every DAG watermark; change
this so only DAGs that completed catch-up are advanced to catchupTo: track
successful DAGs inside the dispatch loop (e.g., a map keyed by cand.dag.Name or
dag ID updated when you increment result.Dispatched and after successful per-DAG
Save(cand.dag, dagState{LastTick: cand.scheduledTime})), and after the loop call
SaveAll only for that subset (or iterate and call Save(dag, dagState{LastTick:
catchupTo}) for each successful DAG). Ensure failing DAGs are left at their last
saved state (you already Save per-dispatch in dispatchCandidate flow) and do not
include them in the final SaveAll(catchupTo) advance.
Codecov Report

❌ Patch coverage is …

Additional details and impacted files:

```text
@@            Coverage Diff             @@
##             main    #1647      +/-   ##
==========================================
- Coverage   69.88%   69.80%   -0.09%
==========================================
  Files         335      338       +3
  Lines       37440    37733     +293
==========================================
+ Hits        26166    26340     +174
- Misses       9204     9293      +89
- Partials     2070     2100      +30
```

... and 5 files with indirect coverage changes. Continue to review the full report in Codecov by Sentry.
Summary by CodeRabbit
Release Notes
New Features
Configuration