-
-
Notifications
You must be signed in to change notification settings - Fork 224
feat: enhance stability of distributed execution #1580
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Important Review skippedAuto incremental reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the You can disable this status message by setting the 📝 WalkthroughWalkthroughThe PR implements shared-nothing worker mode, refactors the coordinator handler API from functional options to struct-based config, extends retry configuration for coordinator connections, improves remote progress display with terminal width handling, adds file sync durability to DAG writes, and includes error tracking for sub-DAG polling with consecutive failure thresholds. Changes
Sequence Diagram(s)sequenceDiagram
participant Worker
participant Context
participant CoordinatorClient
participant Coordinator
participant TaskPayload
Worker->>Context: NewContext(cmd, config)
Context->>Context: isSharedNothingWorker()?
alt Shared-nothing Mode
Context->>Context: Skip local store init
Context->>Context: Apply retry config
Context-->>Worker: Return context (nil stores)
Worker->>TaskPayload: Extract DAG definition
Worker->>CoordinatorClient: Send status updates
CoordinatorClient->>Coordinator: Push status & logs
else Standard Mode
Context->>Context: Initialize all local stores
Context-->>Worker: Return context (ready stores)
end
Estimated code review effort🎯 4 (Complex) | ⏱️ ~70 minutes Possibly related PRs
🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (2)
internal/cmn/config/loader.go (1)
293-311: Consider logging a warning onRetryIntervalparse failure for consistency.Other duration fields in this file (e.g.,
LockStaleThresholdat lines 740-744) log warnings when parsing fails. The silent failure here is inconsistent and could make configuration issues harder to diagnose.♻️ Suggested fix
if def.RetryInterval != "" { if d, err := time.ParseDuration(def.RetryInterval); err == nil { peer.RetryInterval = d + } else { + l.warnings = append(l.warnings, fmt.Sprintf("Invalid peer.retryInterval value: %s", def.RetryInterval)) } }internal/cmd/progress_remote.go (1)
251-263: Edge case: truncation may cut into the arrow prefix.When
availableWidthis small (e.g., 5-7), the truncation at line 260 could cut into the " → " prefix, resulting in malformed output like " →…" or " …".Consider ensuring minimum width accounts for the arrow:
♻️ Suggested improvement
func (p *RemoteProgressDisplay) truncateWorkerID(availableWidth int) string { - if p.workerID == "" || availableWidth <= 5 { + // Arrow " → " is 3 chars + at least 1 char of ID + ellipsis = 5 minimum + const arrowLen = 4 // " → " with space + if p.workerID == "" || availableWidth < arrowLen+2 { return "" } workerSuffix := " → " + p.workerID if len(workerSuffix) > availableWidth { - return workerSuffix[:availableWidth-1] + "…" + // Ensure we keep the arrow and truncate only the worker ID part + maxIDLen := availableWidth - arrowLen - 1 // -1 for ellipsis + if maxIDLen <= 0 { + return "" + } + return " → " + p.workerID[:maxIDLen] + "…" } return workerSuffix }
📜 Review details
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (17)
internal/cmd/context.gointernal/cmd/coord.gointernal/cmd/progress_remote.gointernal/cmn/config/config.gointernal/cmn/config/definition.gointernal/cmn/config/loader.gointernal/persis/filedagrun/writer.gointernal/runtime/executor/dag_runner.gointernal/service/coordinator/handler.gointernal/service/coordinator/handler_test.gointernal/service/worker/handler.gointernal/service/worker/remote_handler.gointernal/service/worker/worker_test.gointernal/test/coordinator.goui/src/features/dags/components/dag-details/DAGDetailsPanel.tsxui/src/features/dags/components/dag-details/DAGStepTable.tsxui/src/features/dags/components/dag-details/NodeStatusTable.tsx
🧰 Additional context used
📓 Path-based instructions (5)
ui/**/*.{ts,tsx}
📄 CodeRabbit inference engine (AGENTS.md)
ui/**/*.{ts,tsx}: The React + TypeScript frontend resides inui/, with production bundles copied tointernal/service/frontend/assetsbymake ui
UI code follows ESLint + Prettier (2-space indent) and Tailwind utilities; name React components in PascalCase (JobList.tsx) and hooks withuse*(useJobs.ts)
Files:
ui/src/features/dags/components/dag-details/DAGDetailsPanel.tsxui/src/features/dags/components/dag-details/NodeStatusTable.tsxui/src/features/dags/components/dag-details/DAGStepTable.tsx
ui/**/*.{tsx,ts,jsx,js}
📄 CodeRabbit inference engine (ui/CLAUDE.md)
Avoid full-page loading overlays and LoadingIndicator components that hide content - show stale data while fetching updates instead
Files:
ui/src/features/dags/components/dag-details/DAGDetailsPanel.tsxui/src/features/dags/components/dag-details/NodeStatusTable.tsxui/src/features/dags/components/dag-details/DAGStepTable.tsx
ui/**/*.{tsx,jsx}
📄 CodeRabbit inference engine (ui/CLAUDE.md)
ui/**/*.{tsx,jsx}: Keep modal headers small and information-dense with minimal padding (e.g.,p-2orp-3instead ofp-4orp-6)
Use compact form element heights: select boxes withh-7or smaller, buttons withh-7orh-8, inputs with compact padding (py-0.5orpy-1)
Minimize table and list row heights while maintaining readability, merge related columns to save space, and handle long text withwhitespace-normal break-words
Use consistent metadata styling with uniform backgrounds (e.g.,bg-slate-200 dark:bg-slate-700) and text hierarchy using size/weight over color variation
Use flexbox-first layouts withmin-h-0andoverflow-hiddento prevent layout breaks, and account for fixed elements when setting heights
Support keyboard navigation in all interactive components including modals with arrow keys, enter, and escape keys
Avoid auto-focusing first items in modals unless it makes sense for the specific use case
Maintain sufficient color contrast in both light and dark modes, use proper ARIA labels, and ensure text remains readable at smaller sizes
Use transparent backgrounds in navigation elements and keep navigation components small and unobtrusive
Avoid two-line displays for single metadata items, excessive whitespace between elements, decorative elements without purpose, and modals that take up excessive screen space
Files:
ui/src/features/dags/components/dag-details/DAGDetailsPanel.tsxui/src/features/dags/components/dag-details/NodeStatusTable.tsxui/src/features/dags/components/dag-details/DAGStepTable.tsx
**/*.go
📄 CodeRabbit inference engine (AGENTS.md)
**/*.go: Backend entrypoint incmd/orchestrates the scheduler and CLI; runtime, persistence, and service layers sit underinternal/*(for exampleinternal/runtime,internal/persistence)
Keep Go filesgofmt/goimportsclean; use tabs, PascalCase for exported symbols (SchedulerClient), lowerCamelCase for locals, andErr...names for package-level errors
Repository linting relies ongolangci-lint; prefer idiomatic Go patterns, minimal global state, and structured logging helpers ininternal/common
Files:
internal/runtime/executor/dag_runner.gointernal/cmn/config/config.gointernal/service/worker/remote_handler.gointernal/service/worker/handler.gointernal/cmn/config/loader.gointernal/persis/filedagrun/writer.gointernal/service/worker/worker_test.gointernal/test/coordinator.gointernal/cmd/coord.gointernal/cmn/config/definition.gointernal/cmd/context.gointernal/service/coordinator/handler_test.gointernal/cmd/progress_remote.gointernal/service/coordinator/handler.go
**/*_test.go
📄 CodeRabbit inference engine (AGENTS.md)
**/*_test.go: Co-locate Go tests as*_test.go; favour table-driven cases and cover failure paths
Usestretchr/testify/requireand shared fixtures frominternal/testinstead of duplicating mocks
Files:
internal/service/worker/worker_test.gointernal/service/coordinator/handler_test.go
🧠 Learnings (8)
📚 Learning: 2026-01-12T16:59:14.740Z
Learnt from: CR
Repo: dagu-org/dagu PR: 0
File: ui/CLAUDE.md:0-0
Timestamp: 2026-01-12T16:59:14.740Z
Learning: Applies to ui/**/*.{tsx,jsx} : Use flexbox-first layouts with `min-h-0` and `overflow-hidden` to prevent layout breaks, and account for fixed elements when setting heights
Applied to files:
ui/src/features/dags/components/dag-details/DAGDetailsPanel.tsxui/src/features/dags/components/dag-details/NodeStatusTable.tsx
📚 Learning: 2026-01-12T16:59:14.740Z
Learnt from: CR
Repo: dagu-org/dagu PR: 0
File: ui/CLAUDE.md:0-0
Timestamp: 2026-01-12T16:59:14.740Z
Learning: Applies to ui/**/*.{tsx,jsx} : Minimize table and list row heights while maintaining readability, merge related columns to save space, and handle long text with `whitespace-normal break-words`
Applied to files:
ui/src/features/dags/components/dag-details/DAGDetailsPanel.tsxui/src/features/dags/components/dag-details/NodeStatusTable.tsxui/src/features/dags/components/dag-details/DAGStepTable.tsx
📚 Learning: 2026-01-12T16:59:14.740Z
Learnt from: CR
Repo: dagu-org/dagu PR: 0
File: ui/CLAUDE.md:0-0
Timestamp: 2026-01-12T16:59:14.740Z
Learning: Applies to ui/**/*.{tsx,jsx} : Avoid two-line displays for single metadata items, excessive whitespace between elements, decorative elements without purpose, and modals that take up excessive screen space
Applied to files:
ui/src/features/dags/components/dag-details/DAGDetailsPanel.tsxui/src/features/dags/components/dag-details/NodeStatusTable.tsxui/src/features/dags/components/dag-details/DAGStepTable.tsx
📚 Learning: 2026-01-12T16:59:14.740Z
Learnt from: CR
Repo: dagu-org/dagu PR: 0
File: ui/CLAUDE.md:0-0
Timestamp: 2026-01-12T16:59:14.740Z
Learning: Applies to ui/**/*.{tsx,jsx} : Use transparent backgrounds in navigation elements and keep navigation components small and unobtrusive
Applied to files:
ui/src/features/dags/components/dag-details/DAGDetailsPanel.tsxui/src/features/dags/components/dag-details/NodeStatusTable.tsx
📚 Learning: 2026-01-12T16:59:14.740Z
Learnt from: CR
Repo: dagu-org/dagu PR: 0
File: ui/CLAUDE.md:0-0
Timestamp: 2026-01-12T16:59:14.740Z
Learning: Applies to ui/**/*.{tsx,ts,jsx,js} : Avoid full-page loading overlays and LoadingIndicator components that hide content - show stale data while fetching updates instead
Applied to files:
ui/src/features/dags/components/dag-details/DAGDetailsPanel.tsx
📚 Learning: 2026-01-12T16:59:14.740Z
Learnt from: CR
Repo: dagu-org/dagu PR: 0
File: ui/CLAUDE.md:0-0
Timestamp: 2026-01-12T16:59:14.740Z
Learning: Applies to ui/**/*.{tsx,jsx} : Keep modal headers small and information-dense with minimal padding (e.g., `p-2` or `p-3` instead of `p-4` or `p-6`)
Applied to files:
ui/src/features/dags/components/dag-details/NodeStatusTable.tsxui/src/features/dags/components/dag-details/DAGStepTable.tsx
📚 Learning: 2026-01-12T16:59:14.740Z
Learnt from: CR
Repo: dagu-org/dagu PR: 0
File: ui/CLAUDE.md:0-0
Timestamp: 2026-01-12T16:59:14.740Z
Learning: Applies to ui/**/*.{tsx,jsx} : Use consistent metadata styling with uniform backgrounds (e.g., `bg-slate-200 dark:bg-slate-700`) and text hierarchy using size/weight over color variation
Applied to files:
ui/src/features/dags/components/dag-details/NodeStatusTable.tsxui/src/features/dags/components/dag-details/DAGStepTable.tsx
📚 Learning: 2026-01-12T16:59:14.740Z
Learnt from: CR
Repo: dagu-org/dagu PR: 0
File: ui/CLAUDE.md:0-0
Timestamp: 2026-01-12T16:59:14.740Z
Learning: Applies to ui/**/*.{tsx,jsx} : Use compact form element heights: select boxes with `h-7` or smaller, buttons with `h-7` or `h-8`, inputs with compact padding (`py-0.5` or `py-1`)
Applied to files:
ui/src/features/dags/components/dag-details/NodeStatusTable.tsxui/src/features/dags/components/dag-details/DAGStepTable.tsx
🧬 Code graph analysis (7)
internal/runtime/executor/dag_runner.go (1)
api/v2/api.gen.go (1)
Error(682-691)
internal/service/worker/remote_handler.go (1)
internal/core/spec/loader.go (2)
LoadOption(41-41)WithBaseConfig(44-48)
internal/test/coordinator.go (4)
internal/service/coordinator/handler.go (2)
HandlerConfig(72-84)NewHandler(94-105)internal/test/helper.go (2)
WithStatusPersistence(84-88)WithLogPersistence(92-96)internal/core/exec/dagrun.go (1)
DAGRunStore(23-52)internal/runtime/runner.go (1)
Config(96-110)
internal/cmd/coord.go (2)
internal/service/coordinator/handler.go (2)
NewHandler(94-105)HandlerConfig(72-84)internal/core/exec/dagrun.go (1)
DAGRunStore(23-52)
internal/cmd/context.go (1)
internal/cmn/config/config.go (3)
Worker(336-341)Core(57-85)Peer(363-387)
internal/service/coordinator/handler_test.go (1)
internal/service/coordinator/handler.go (2)
NewHandler(94-105)HandlerConfig(72-84)
internal/service/coordinator/handler.go (3)
internal/core/exec/dagrun.go (1)
DAGRunStore(23-52)proto/coordinator/v1/coordinator.pb.go (3)
Task(368-389)Task(402-402)Task(417-419)internal/core/status.go (1)
NotStarted(7-7)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: Test on ubuntu-latest
- GitHub Check: Build
🔇 Additional comments (22)
ui/src/features/dags/components/dag-details/DAGStepTable.tsx (1)
36-36: LGTM!Removing the
p-pxpadding aligns with the broader overflow handling refinements in this PR. The 1px padding was unnecessary for this table container and its removal helps achieve consistent spacing. Based on coding guidelines, this avoids excessive whitespace between elements.ui/src/features/dags/components/dag-details/DAGDetailsPanel.tsx (1)
222-222: LGTM - Proper overflow containment.Adding
overflow-x-hiddenat the panel level creates a clean overflow boundary while child tables (withoverflow-x-auto) maintain their own horizontal scrolling capability. This follows the coding guideline to useoverflow-hiddento prevent layout breaks. The layered approach ensures horizontal scroll is contained within the intended table areas.ui/src/features/dags/components/dag-details/NodeStatusTable.tsx (1)
46-46: LGTM!Consistent with the
p-pxremoval inDAGStepTable.tsx. The horizontal scroll capability is preserved viaoverflow-x-auto, and the innermin-w-[900px]constraint ensures proper table rendering at intermediate viewport sizes.internal/runtime/executor/dag_runner.go (2)
365-367: LGTM! Good addition of consecutive error tracking.The introduction of
maxConsecutiveErrorsand the counter improves resilience by detecting when the coordinator becomes unreachable, preventing indefinite polling.One consideration: if network conditions vary significantly across deployments, this threshold could be made configurable (e.g., via an option or environment variable). However, 10 consecutive failures at 1-second intervals (~10 seconds) is a reasonable default for most scenarios.
426-436: LGTM! Error handling logic is correct.The implementation properly:
- Increments the counter on each failure
- Logs a warning with the current consecutive error count for observability
- Returns an informative error when the threshold is reached (wrapping the original error)
- Resets the counter on success to avoid penalizing transient failures
internal/persis/filedagrun/writer.go (2)
131-134: Consider the performance impact of per-write fsync.Calling
file.Sync()after every write ensures durability for multi-coordinator visibility, but incurs significant syscall overhead. For high-frequency status updates, this could become a bottleneck.If write frequency is expected to be high, consider:
- Batching writes before sync
- Using a configurable sync interval
- Documenting the expected write frequency and performance characteristics
If the trade-off is acceptable for your use case (correctness over throughput), this is fine as-is.
102-137: Clean error handling refactor.The simplified error handling with direct returns improves readability. Each error is properly wrapped with context, and the control flow is clear and idiomatic.
internal/service/worker/handler.go (1)
43-62: LGTM!The temporary DAG file creation and cleanup logic is correct. The deferred cleanup with
os.IsNotExistcheck properly handles edge cases. Logging only the temporary file path (instead of both original and temp) simplifies output and aligns with the shared-nothing architecture where the coordinator provides definitions.internal/cmn/config/definition.go (1)
186-194: LGTM!The new retry configuration fields are well-documented with clear default values. Using
stringforRetryInterval(to be parsed totime.Durationduring config loading) follows the existing pattern in other definition structs likeMonitoringDef.RetentionandSchedulerDef.LockStaleThreshold.internal/service/worker/remote_handler.go (1)
209-226: LGTM!The comments clearly document the architectural rationale for not including
DAGsDirin shared-nothing mode. The load options are correctly composed:
WithBaseConfigpreserves base configuration inheritanceWithNameensures the original DAG name is used (not the temp file path)WithParamspasses through task parameters for parallel executioninternal/service/worker/worker_test.go (1)
94-99: LGTM!Tests are correctly updated to include the
Definitionfield, aligning with the new validation inDispatchthat requirestask.Definitionfor distributed execution. The YAML definition is minimal but valid, sufficient for test purposes.Also applies to: 133-138, 204-209, 259-265, 450-456, 480-487, 637-641, 699-703, 771-775
internal/service/coordinator/handler.go (4)
71-105: LGTM!The refactor from functional options to
HandlerConfigstruct improves clarity and makes the constructor easier to use. TheapplyDefaultsmethod correctly handles the optionalStaleHeartbeatThresholdfield.
186-218: LGTM!The validation requiring
task.Definitionensures shared-nothing workers always receive the full DAG definition. The attempt creation logic appropriately distinguishes root DAGs from sub-DAGs. The design choice to log warnings but not fail dispatch when attempt creation fails provides good resilience—tasks can still execute, and the warning alerts operators to storage issues.
264-270: LGTM!The nil
dagRunStoreguards provide good defense-in-depth for test scenarios. ThewriteInitialStatushelper elegantly prevents "corrupted status file" errors by ensuring the file is never empty.Minor note:
StartedAtis set at dispatch time rather than actual execution start, but this is acceptable since workers will overwrite the status with accurate timestamps once execution begins.Also applies to: 346-351, 393-406
325-327: LGTM!Initial status writes are correctly placed after
attempt.Open()succeeds but before caching, ensuring consistent behavior for both root DAG and sub-DAG paths.Also applies to: 376-378
internal/service/coordinator/handler_test.go (1)
231-231: LGTM! Consistent migration to struct-based HandlerConfig.The test file correctly updates all
NewHandlercalls to use the newHandlerConfig{}struct-based API. The changes are consistent across all test cases, and the mock implementations (mockDAGRunStore,mockDAGRunAttempt) are well-designed with proper mutex usage for thread-safety.Also applies to: 248-248, 280-280, 305-312
internal/cmn/config/config.go (1)
378-386: LGTM! Well-documented retry configuration fields.The new
MaxRetriesandRetryIntervalfields are clearly documented with exponential backoff behavior. The zero-value semantics (disabled retries) are appropriate, and defaults are correctly applied at the usage sites inloader.goandcontext.go.internal/cmd/context.go (2)
167-182: LGTM! Shared-nothing worker mode is well-implemented.The early return path for shared-nothing workers correctly:
- Logs the mode with coordinator addresses for debugging
- Returns a minimal Context with nil stores
- Documents why stores are nil (status pushed to coordinator, DAG definitions from task payload)
The approach avoids unnecessary file I/O in distributed worker scenarios.
310-316: LGTM! Retry configuration applied correctly.The conditional application of
MaxRetriesandRetryIntervalonly when positive values are configured preserves the defaults fromcoordinator.DefaultConfig()while allowing explicit overrides.internal/test/coordinator.go (1)
58-68: LGTM! Test helper correctly migrated to config-based API.The test setup properly:
- Initializes an empty
HandlerConfig- Conditionally populates
DAGRunStoreandLogDirbased on test options- Passes the config to
NewHandlerThis aligns with the new API in
internal/service/coordinator/handler.go.internal/cmd/coord.go (1)
197-200: LGTM! Coordinator handler correctly initialized with config.The
HandlerConfigis properly populated with:
DAGRunStore: for status persistence in shared-nothing modeLogDir: for remote log streamingThis aligns with the documented requirements in
HandlerConfigfor shared-nothing worker architecture.internal/cmd/progress_remote.go (1)
48-60: LGTM! Terminal width detection with sensible defaults.Good defensive programming:
- Default width of 80 characters
- Only attempts to get terminal size when TTY is detected
- Validates the returned width before using it
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## main #1580 +/- ##
==========================================
- Coverage 64.87% 64.63% -0.24%
==========================================
Files 255 255
Lines 28430 28519 +89
==========================================
- Hits 18444 18434 -10
- Misses 8337 8429 +92
- Partials 1649 1656 +7
... and 3 files with indirect coverage changes Continue to review full report in Codecov by Sentry.
🚀 New features to boost your workflow:
|
Summary by CodeRabbit
Release Notes
New Features
Bug Fixes
Configuration
Style
✏️ Tip: You can customize this high-level summary in your review settings.