build: deploy companion charts in parallel#6428
Conversation
There was a problem hiding this comment.
Pull request overview
This PR improves CI deploy performance by installing companion Helm charts concurrently (instead of serially) in the camunda-deployer tooling under scripts/, reducing pre-deploy wall time for scenarios with multiple companions.
Changes:
- Deploy companion charts in parallel using
errgroup, preserving the existing lifecycle barrier before post-infra hooks and the main chart deploy. - Serialize
helm repo add/updateacross concurrent companions with a mutex to avoid races onrepositories.yaml. - Add unit tests validating parallel overlap, sibling cancellation on error, single-companion behavior, and repo op serialization; bump
golang.org/x/syncto supporterrgroup.
Reviewed changes
Copilot reviewed 5 out of 7 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| scripts/camunda-deployer/pkg/deployer/helm.go | Adds deployCompanionCharts (errgroup fan-out) and mutex-guarded repo add/update in companion deploy path. |
| scripts/camunda-deployer/pkg/deployer/deployer.go | Switches Deploy() from serial companion loop to the new parallel helper while keeping ordering before post-infra hooks. |
| scripts/camunda-deployer/pkg/deployer/helm_test.go | Adds unit tests covering parallelism, cancellation, and repo serialization. |
| scripts/camunda-deployer/go.mod | Adds direct dependency on golang.org/x/sync for errgroup. |
| scripts/camunda-deployer/go.sum | Records golang.org/x/sync v0.21.0 checksums. |
| scripts/deploy-camunda/go.mod | Bumps indirect golang.org/x/sync version. |
| scripts/deploy-camunda/go.sum | Updates golang.org/x/sync checksums to match the new version. |
| // companionRepoMu serializes helm repo add/update across concurrent companion | ||
| // deployments — those commands rewrite the shared repositories.yaml, so | ||
| // concurrent writers would race. No current scenario pairs two remote-repo | ||
| // companions, so contention is effectively zero; this guards future ones. |
| // deployCompanionCharts deploys all configured companion charts concurrently as | ||
| // separate Helm releases in the same namespace. Companions are independent of | ||
| // each other (only the main chart depends on them), so they install in | ||
| // parallel. The call blocks until every companion is ready; on the first | ||
| // failure the shared context is cancelled, terminating in-flight siblings, and | ||
| // the first error is returned. A single companion behaves identically to a | ||
| // serial deploy. |
| // Deploy companion charts (e.g., OpenSearch) as separate Helm releases in | ||
| // the same namespace, concurrently — they are independent of each other and | ||
| // each is deployed with --wait. deployCompanionCharts blocks until all are | ||
| // ready, which keeps them ordered before the post-infra hooks and the main | ||
| // Camunda chart below. |
eamonnmoloney
left a comment
There was a problem hiding this comment.
crev review
PR #6428 parallelizes companion chart deployment using errgroup with a mutex guard for helm repo operations — a well-structured change with four new tests. One P1: companionRepoMu.Unlock() is a plain call rather than defer, leaving a panic-induced permanent deadlock path; fix with defer companionRepoMu.Unlock() immediately after Lock(). Two P2: the serialization test's repoUpdate stub is an instant no-op (leaving the full {helmRepoAdd + helmRepoUpdate} atomicity claim untested), and the golang.org/x/sync bump in the separate deploy-camunda module appears incidental and not required by this change.
Specialists run: correctness, adr-conformance, devils-advocate, verifier, escalation-assessor. Devil's-advocate hypotheses: 9 raised, 1 promoted.
Hypotheses by stance: adversarial-input=3 author-blind-spot=2 missing-case=3 scope-discipline=1 · by disposition: covered_by_specialist=2 dropped_low_severity=2 dropped_rubric=2 dropped_ungroundable=2 promoted=1
Escalation: AI review sufficient (score: 0.17, threshold: 0.50). Scripts-only Go change with no chart templates, values.yaml, security surfaces, or golden files touched — all hard escalation rules are clean. Dominant upward signals are first-time contributor (author_familiarity_inverse=1.0) and moderate novelty from introducing an errgroup-based concurrency primitive into CI tooling (novelty_score=0.6). The P1 mutex finding is a scoped correctness issue, not a rule violation. Composite score 0.173 is well below the 0.5 threshold.
| } | ||
| return nil | ||
| }() | ||
| companionRepoMu.Unlock() |
There was a problem hiding this comment.
P1 · companionRepoMu.Unlock() is not deferred — panic in the locked section permanently deadlocks (via correctness)
The mutex is locked at line 247 with companionRepoMu.Lock() and unlocked with a plain companionRepoMu.Unlock() call at line 257, after the immediately-invoked function expression (IIFE) returns. This pattern is unsafe: if helmRepoAdd or helmRepoUpdate panics inside the IIFE, the anonymous function exits via panic rather than a normal return, and companionRepoMu.Unlock() at line 257 is never reached. The mutex remains permanently locked. Every subsequent goroutine that calls deployCompanionChart with a non-empty RepoName will block forever in companionRepoMu.Lock(), causing deployCompanionCharts (and the entire Deploy call) to hang until the parent context times out.
Go panics in os/exec-based code are rare but not impossible: nil function variables (e.g. if helmRepoAddFn is accidentally nil-set in a test or misconfigured binary), runtime out-of-memory, or stack overflow in recursive calls can all produce panics. The idiomatic Go pattern for any mutex usage is unconditional defer.
Recommendation: Replace the plain unlock with a deferred unlock:
companionRepoMu.Lock()
defer companionRepoMu.Unlock()
err := func() error {
if err := helmRepoAdd(ctx, cc.RepoName, cc.RepoURL); err != nil {
return fmt.Errorf("companion chart %q: repo add failed: %w", cc.ReleaseName, err)
}
if err := helmRepoUpdate(ctx); err != nil {
return fmt.Errorf("companion chart %q: repo update failed: %w", cc.ReleaseName, err)
}
return nil
}()
if err != nil {
return err
}The IIFE is no longer needed once defer handles the unlock — the block can be flattened.
companionRepoMu.Unlock()
There was a problem hiding this comment.
Good catch — fixed in 3b41e6d.
One adjustment to the suggested placement: I moved Lock() + defer Unlock() inside the IIFE rather than at the deployCompanionChart scope. A function-level defer would hold the mutex through the subsequent helm upgrade --install --wait (lines 263+), serializing every companion deploy and defeating the parallelism this PR adds. Scoping the defer to the IIFE releases the lock the instant repo registration returns — panic-safe and still concurrent.
Verified go build ./... + go test ./pkg/deployer/... -race (all companion/repo tests pass).
3b41e6d to
154e9ac
Compare
CI scenarios with multiple companion charts (e.g. 8.10 eske = Elasticsearch + Keycloak) deployed companions serially, each with --wait, so total pre-deploy time was the sum of all companion install times before the main Camunda chart started. Companions are independent of each other (only the main chart depends on them), so they can install concurrently. Fan out companion deploys via errgroup. g.Wait() acts as a barrier that keeps all companions ordered before the post-infra hooks and the main chart, preserving the maintainer-required lifecycle (post-infra runs after companions are ready, e.g. Bitnami 8.9->8.10 migration). On the first failure the shared context is cancelled, terminating in-flight siblings. helm repo add/update rewrite the shared repositories.yaml; guard them with a mutex so concurrent companions never race on that file. No current scenario pairs two remote-repo companions, so contention is effectively zero today; the guard protects future ones. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Remove rationale sentences from companionRepoMu, deployCompanionCharts, and the Deploy() call site. Keep only the non-obvious behavioral constraints (mutex protects shared file writes; barrier ordering). Rationale lives in the PR body and commit message per AGENTS.md. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Move companionRepoMu.Lock()/Unlock() inside the IIFE so the unlock is deferred and runs even if helmRepoAdd/helmRepoUpdate panic, while keeping the lock scoped to repo registration only — the long-running helm upgrade --install --wait still runs unlocked, preserving parallelism. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
154e9ac to
4f5c55a
Compare
Which problem does the PR fix?
Fixes #6252.
CI scenarios with multiple companion charts (e.g. 8.10
eske= Elasticsearch + Keycloak) deployed companions serially, each with--wait, so total pre-deploy time was the sum of all companion install times (~6 min) before the main Camunda chart even started.What's in this PR?
Companions are independent of each other (only the main chart depends on them), so they now install concurrently via
errgroup:deployCompanionChartshelper fans out one goroutine per companion.g.Wait()is a barrier that keeps all companions ordered before the post-infra hooks and the main chart — preserving the maintainer-required lifecycle (post-infra runs after companions are ready, e.g. Bitnami 8.9→8.10 migration scripts). Loop position indeployer.gois unchanged, so the ordering constraint holds.helm.RunCaptureStderr→executil.RunCommandCaptureStderrusesexec.CommandContext).helm repo add/updaterewrite the sharedrepositories.yaml; a mutex guards them so concurrent companions never race on that file. No current scenario pairs two remote-repo companions (8.10eske: Keycloak is a local chart, only Elasticsearch is remote), so contention is effectively zero today — the guard protects future ones.Expected impact: pre-deploy time drops from sum → max (~2 min saved per multi-companion scenario, across the 8.10
identity: keycloakmatrix).Tests added in
helm_test.go: parallel overlap, error-cancels-siblings, single-companion preserved, repo-registration serialized (all pass under-race).Files changed are Go tooling under
scripts/only (no user-facing chart files), hencebuild:per the CI PR-title rule.Checklist
Please make sure to follow our Contributing Guide.
Before opening the PR:
make go.update-golden-only.After opening the PR:
🤖 Generated with Claude Code