feat: Add phase_status tracking and PARTIAL apply status #1690

Merged
bjcoombs merged 4 commits into develop from
045-manifest-source-of-truth--11--phase-status-partial
Mar 16, 2026

Conversation

@bjcoombs (Collaborator)

Summary

  • Add per-phase execution status tracking to manifest versions for partial failure recovery
  • When a manifest apply fails mid-execution, the system records which phases completed, failed, or were skipped
  • New PARTIAL apply status distinguishes partial failures (some phases succeeded) from complete failures

Changes

Migrations (CockroachDB-safe)

  • 20260316000002_add_phase_status.sql: Add phase_status JSONB column to manifest_versions
  • 20260316000003_add_partial_apply_status.sql: Alter CHECK constraint to include PARTIAL (separate migration for CockroachDB compatibility)

Proto

  • PhaseStatusDetail message with status, started_at, completed_at, error fields
  • phase_status map field on ManifestVersion and ApplyManifestResponse
  • APPLY_STATUS_PARTIAL and APPLY_MANIFEST_STATUS_PARTIAL enum values

Domain (manifest package)

  • PhaseStatusMap, PhaseStatusEntry, PhaseExecutionStatus types
  • GetPhaseStatus() / SetPhaseStatus() on VersionEntity for JSON round-tripping
  • UpdatePhaseStatus() repository method
  • StoreManifestVersionWithPhaseStatus() on HistoryService
  • ApplyStatusPartial constant

gRPC Handler (applier package)

  • buildInitialPhaseStatus() derives pending phase entries from execution plan
  • updatePhaseStatus() marks phases as COMPLETED/FAILED/SKIPPED based on saga step results
  • classifyFailure() distinguishes partial vs complete failure
  • Phase status included in both failure and success response paths

Task Master

045-manifest-source-of-truth.11 - Add phase_status and PARTIAL apply status

Test plan

  • Unit tests for classifyFailure (nil, all-failed, partial)
  • Unit tests for buildInitialPhaseStatus from execution plan
  • Unit tests for updatePhaseStatus (all success, partial failure)
  • Unit tests for findFailedPhase (no result, with result)
  • Unit tests for phaseStatusMapToResponseProto (nil, populated)
  • Unit tests for VersionEntity.GetPhaseStatus/SetPhaseStatus round-trip
  • Integration test: PARTIAL apply status persists to CockroachDB
  • Integration test: UpdatePhaseStatus JSONB column round-trip
  • All existing applier and manifest tests pass

Add per-phase execution status tracking to manifest versions for partial
failure recovery. When a manifest apply fails mid-execution, the system
now records which phases completed, which failed, and which were skipped.

Changes:
- Migration: Add phase_status JSONB column to manifest_versions
- Migration: Add PARTIAL to apply_status CHECK constraint (separate file
  for CockroachDB compatibility)
- Domain: PhaseStatusMap, PhaseStatusEntry types with JSON serialization
- Repository: UpdatePhaseStatus method, GetPhaseStatus/SetPhaseStatus on
  VersionEntity
- Proto: PhaseStatusDetail message, phase_status map on ManifestVersion
  and ApplyManifestResponse, APPLY_STATUS_PARTIAL and
  APPLY_MANIFEST_STATUS_PARTIAL enum values
- gRPC handler: Derives phase status from execution plan phases and saga
  step results, classifies partial vs complete failure
- Tests: Unit tests for classification, phase status building, and
  round-trip serialization; integration tests for DB persistence
@coderabbitai (Contributor)

coderabbitai Bot commented Mar 16, 2026

📝 Walkthrough

This change introduces phase-level status tracking for manifest application: proto additions for phase details and partial status, runtime phase tracking and classification in the applier, persistence of phase status in history/repository, DB migrations for a JSONB phase_status column and PARTIAL apply state, and tests covering the new behavior.

Changes

• Protobufs (api/proto/meridian/control_plane/v1/apply_manifest_service.proto, api/proto/meridian/control_plane/v1/manifest_history_service.proto): Added phase_status map fields and PhaseStatusDetail message; introduced APPLY_MANIFEST_STATUS_PARTIAL / APPLY_STATUS_PARTIAL enum values.
• Applier runtime (services/control-plane/internal/applier/grpc_handler.go, services/control-plane/internal/applier/grpc_handler_test.go): Added phase-status lifecycle: initialization, updating from execution results, failure classification (partial vs full), conversion to proto, history recording with phase status; tests for the new helpers and flows.
• History & repository (services/control-plane/internal/manifest/history.go, services/control-plane/internal/manifest/repository.go, services/control-plane/internal/manifest/repository_test.go): Added PhaseExecutionStatus/PhaseStatusEntry/PhaseStatusMap types; VersionEntity.PhaseStatus JSONB field with Get/Set; UpdatePhaseStatus repo method; StoreManifestVersionWithPhaseStatus and proto conversion; tests and test DDL updated for phase_status and PARTIAL apply status.
• Migrations (services/control-plane/migrations/20260316000002_add_phase_status.sql, services/control-plane/migrations/20260316000003_add_partial_apply_status.sql, services/control-plane/migrations/20260316000004_add_partial_apply_status_constraint.sql): Added the phase_status JSONB column and split the constraint update into separate migrations to add PARTIAL to the apply_status CHECK constraint, due to DB transactional limits.

Sequence Diagram(s)

sequenceDiagram
    actor Client
    participant Handler as ApplyManifestHandler
    participant Planner as ExecutionPlanner
    participant Executor as Executor
    participant StatusMgr as PhaseStatusManager
    participant History as HistoryService
    participant DB as Database

    Client->>Handler: ApplyManifest(request)
    Handler->>Planner: build execution plan
    Handler->>Handler: buildInitialPhaseStatus(plan)
    Handler->>Executor: Execute plan
    Executor-->>Handler: ApplyManifestResult (steps success/fail)
    Handler->>StatusMgr: updatePhaseStatus(plan, result, execErr)
    alt Partial failure detected
        StatusMgr->>StatusMgr: findFailedPhase(...)
        StatusMgr->>StatusMgr: mark phases COMPLETED/FAILED/SKIPPED, set timestamps/errors
    else All success
        StatusMgr->>StatusMgr: mark all phases COMPLETED
    end
    StatusMgr-->>Handler: phaseStatus map
    Handler->>Handler: classifyFailure(phaseStatus) -> (ApplyStatus, ManifestStatus)
    Handler->>History: recordHistoryWithPhaseStatus(manifest, phaseStatus, ...)
    History->>DB: Store manifest version with phase_status JSONB
    DB-->>History: persist result
    History-->>Handler: stored ManifestVersion proto
    Handler->>Handler: phaseStatusMapToResponseProto(phaseStatus)
    Handler-->>Client: ApplyManifestResponse (status + phase_status)

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation / Resolution |
| --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 37.04%, below the required 80.00% threshold. Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title accurately and concisely summarizes the primary changes: adding phase_status tracking and introducing PARTIAL apply status for partial failure recovery. |
| Description check | ✅ Passed | The description is comprehensive and directly related to the changeset, covering migrations, proto changes, domain types, gRPC handler logic, and test coverage. |



coderabbitai Bot previously requested changes Mar 16, 2026

Actionable comments posted: 5

🧹 Nitpick comments (1)
services/control-plane/internal/manifest/repository.go (1)

79-105: Keep UpdatePhaseStatus on the typed PhaseStatusMap API.

UpdatePhaseStatus re-exposes raw JSON even though SetPhaseStatus already owns serialization in this package. Keeping the repository API typed avoids duplicated marshaling and malformed writes leaking into callers.

Suggested shape
-func (r *Repository) UpdatePhaseStatus(ctx context.Context, id uuid.UUID, phaseStatusJSON *string) error {
+func (r *Repository) UpdatePhaseStatus(ctx context.Context, id uuid.UUID, phaseStatus PhaseStatusMap) error {
+	var entity VersionEntity
+	if err := entity.SetPhaseStatus(phaseStatus); err != nil {
+		return err
+	}
 	return db.WithGormTenantTransaction(ctx, r.db, func(tx *gorm.DB) error {
-		result := tx.Model(&VersionEntity{}).Where("id = ?", id).Update("phase_status", phaseStatusJSON)
+		result := tx.Model(&VersionEntity{}).Where("id = ?", id).Update("phase_status", entity.PhaseStatus)
 		if result.Error != nil {
 			return fmt.Errorf("update phase_status: %w", result.Error)
 		}

Also applies to: 264-276

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@services/control-plane/internal/applier/executor.go`:
- Around line 210-211: The comment on ApplyManifestResult.Status advertises a
"partial" status that Apply never returns; update the documentation for
ApplyManifestResult.Status (and any related comment blocks) to only list the
statuses actually produced by the Apply function (e.g., "applied" and "failed"),
or explicitly note that "partial" is produced elsewhere (by phase analysis) if
you prefer to keep it — locate the ApplyManifestResult type and the Apply
function references to make the change so the docstring accurately matches
runtime behavior.

In `@services/control-plane/internal/applier/grpc_handler.go`:
- Around line 526-559: The current logic marks the entire failed phase as FAILED
even if earlier calls in that same phase succeeded; modify the update that runs
when failedPhase == p (in the loop that updates phaseStatus using
findFailedPhase, and the analogous block around
classifyFailure/updatePhaseStatus later) to inspect the step-level results
(e.g., result.StepResults or whatever structure holds per-step statuses) for the
failedPhase: if any step in that phase succeeded, treat the phase as having
partial work — set the phase status to manifest.PhaseStatusCompleted (or to an
existing "partial" status if your manifest has one), set CompletedAt = &now and
populate Error from result.Error or execErr as currently done; only mark it
strictly FAILED when no step in that phase completed successfully. Ensure you
update both occurrences (the loop using phases/phaseStatus and the similar block
at lines ~630-649) and reference findFailedPhase, phaseStatus, phases, result,
execErr and classifyFailure when making the change.
- Around line 184-190: The code currently calls
recordHistoryWithPhaseStatus(..., expectedSeq=0) and discards the error,
allowing stale requests to append FAILED/PARTIAL rows; change this to use the
same optimistic-sequence logic as the success path: pass the correct expected
sequence (the same sequence value used when writing successful history) instead
of 0, capture the returned error from recordHistoryWithPhaseStatus, and handle
ErrSequenceConflict the same way the success path does (i.e., detect sequence
conflict and return/translate it to the same conflict response or retry logic).
Locate recordHistoryWithPhaseStatus, classifyFailure, and the
execResult.jobID/phaseStatus usage to implement the change so failed/partial
history writes respect optimistic locking and surface the error.

In `@services/control-plane/internal/manifest/history.go`:
- Around line 281-285: The code currently swallows errors from
entity.GetPhaseStatus() causing malformed phase_status to be omitted; change the
block in EntityToProto (or the function where this snippet lives) to propagate
the error instead of ignoring it: if entity.GetPhaseStatus() returns a non-nil
err, wrap it with context (e.g., "invalid phase_status for manifest <id>" or
similar) and return the error from EntityToProto (or bubble it up) so callers
see the failure; preserve the existing mv.PhaseStatus assignment when err == nil
and phaseStatus != nil.
- Around line 152-157: The code currently ignores errors from
s.repo.UpdatePhaseStatus after calling entity.SetPhaseStatus, resulting in a
false success (PARTIAL) even when phase_status persistence fails; modify the
logic in the function that calls SetPhaseStatus and s.repo.UpdatePhaseStatus so
that you capture the error returned by s.repo.UpdatePhaseStatus and propagate it
(or convert it to an appropriate failure result) instead of discarding it: call
setErr := entity.SetPhaseStatus(phaseStatus); if setErr == nil { if updErr :=
s.repo.UpdatePhaseStatus(ctx, entity.ID, entity.PhaseStatus); updErr != nil {
return updErr (or mark snapshot as failed) } } so callers observe persistence
failures from UpdatePhaseStatus rather than a silent success.

---

Nitpick comments:
In `@services/control-plane/internal/manifest/repository.go`:
- Around line 79-105: The UpdatePhaseStatus method should operate on the typed
PhaseStatusMap rather than raw JSON: change UpdatePhaseStatus to accept/return
PhaseStatusMap (not []byte/string), call VersionEntity.GetPhaseStatus to read
current map, mutate it, then use VersionEntity.SetPhaseStatus to serialize and
store it, avoiding direct json.Marshal/json.Unmarshal in the repo layer; update
any callers to pass a PhaseStatusMap and remove any duplicated marshaling logic
(also apply the same change to the other occurrence referenced around lines
264-276).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: bedc0ac2-0bc5-469c-bbc7-040a889fac1a

📥 Commits

Reviewing files that changed from the base of the PR and between 7c9c8f8 and f474248.

⛔ Files ignored due to path filters (1)
  • services/control-plane/migrations/atlas.sum is excluded by !**/*.sum
📒 Files selected for processing (10)
  • api/proto/meridian/control_plane/v1/apply_manifest_service.proto
  • api/proto/meridian/control_plane/v1/manifest_history_service.proto
  • services/control-plane/internal/applier/executor.go
  • services/control-plane/internal/applier/grpc_handler.go
  • services/control-plane/internal/applier/grpc_handler_test.go
  • services/control-plane/internal/manifest/history.go
  • services/control-plane/internal/manifest/repository.go
  • services/control-plane/internal/manifest/repository_test.go
  • services/control-plane/migrations/20260316000002_add_phase_status.sql
  • services/control-plane/migrations/20260316000003_add_partial_apply_status.sql

Comment thread services/control-plane/internal/applier/executor.go Outdated
Comment on lines +184 to +190
// Determine if this is a partial failure (some phases succeeded)
applyStatus, responseStatus := classifyFailure(execResult.phaseStatus)
response.Status = responseStatus
response.PhaseStatus = phaseStatusMapToResponseProto(execResult.phaseStatus)

// Record history with phase status
_, _ = h.recordHistoryWithPhaseStatus(ctx, req.GetManifest(), req.GetAppliedBy(), execResult.jobID, applyStatus, nil, 0, execResult.phaseStatus)

⚠️ Potential issue | 🟠 Major

Don't bypass optimistic locking on failed/partial history writes.

Line 190 hardcodes expectedSeq=0 and drops the returned error, so a stale request can still append a FAILED/PARTIAL history row after another manifest version has already advanced the sequence. This should use the same sequence check and ErrSequenceConflict handling as the success path.

Suggested fix
-		_, _ = h.recordHistoryWithPhaseStatus(ctx, req.GetManifest(), req.GetAppliedBy(), execResult.jobID, applyStatus, nil, 0, execResult.phaseStatus)
+		if _, err := h.recordHistoryWithPhaseStatus(
+			ctx,
+			req.GetManifest(),
+			req.GetAppliedBy(),
+			execResult.jobID,
+			applyStatus,
+			nil,
+			req.GetExpectedSequenceNumber(),
+			execResult.phaseStatus,
+		); err != nil && errors.Is(err, manifest.ErrSequenceConflict) {
+			return nil, status.Errorf(codes.Aborted, "%v", err)
+		}

Comment on lines +526 to +559
// On failure: mark phases based on step results.
// Steps are executed in phase order. We mark phases as COMPLETED until we
// find a failed step, then the containing phase is FAILED and the rest are SKIPPED.
failedPhase := findFailedPhase(plan, result)

for _, p := range phases {
key := fmt.Sprintf("phase_%d", p)
entry := phaseStatus[key]
entry.StartedAt = &now

if failedPhase > 0 && p < failedPhase {
entry.Status = manifest.PhaseStatusCompleted
entry.CompletedAt = &now
} else if failedPhase > 0 && p == failedPhase {
entry.Status = manifest.PhaseStatusFailed
entry.CompletedAt = &now
if result != nil {
entry.Error = result.Error
} else if execErr != nil {
entry.Error = execErr.Error()
}
} else if failedPhase > 0 {
entry.Status = manifest.PhaseStatusSkipped
entry.StartedAt = nil
} else {
// No phase-level info available; mark all as failed
entry.Status = manifest.PhaseStatusFailed
entry.CompletedAt = &now
if execErr != nil {
entry.Error = execErr.Error()
}
}
phaseStatus[key] = entry
}

⚠️ Potential issue | 🟠 Major

Partial work inside the first failed phase is still reported as FAILED.

A phase can contain multiple calls. If some calls in phase 1 succeed and a later call in that same phase fails, updatePhaseStatus records only FAILED for phase_1, so classifyFailure returns FAILED because no phase is COMPLETED. That loses the "side effects already happened" signal this PR is adding.

Also applies to: 630-649


Comment thread services/control-plane/internal/manifest/history.go Outdated
Comment thread services/control-plane/internal/manifest/history.go
@claude

claude Bot commented Mar 16, 2026

Claude Code Review

Commit: 5d1d3a5 | CI: running

Summary

Well-structured PR that adds per-phase execution tracking with proper CockroachDB migration splitting (ADD COLUMN / DROP CONSTRAINT / ADD CONSTRAINT across 3 files). The domain types, proto definitions, and test coverage are solid. Two areas worth addressing: duplicate conversion logic and a fragile positional correlation assumption in findFailedPhase.

Risk Assessment

| Area | Level | Detail |
| --- | --- | --- |
| Blast radius | Low | Failure path only gains richer metadata; success path unchanged except adding phase_status to response |
| Rollback | Safe | New column is nullable JSONB, new enum value is additive; old code ignores both |
| Scale | Low | Phase map is bounded by number of phases (currently max 11); no unbounded structures |
| Cross-system | Low | Proto additions are additive (new field + new enum value), backward compatible |
| Migration | Safe | Three separate migrations respect CockroachDB constraint-rename limitation; atlas.sum updated |

Findings

| Severity | Location | Description | Status |
| --- | --- | --- | --- |
| Improvement | grpc_handler.go / history.go | Duplicate proto conversion functions (phaseStatusMapToResponseProto vs phaseStatusMapToProto) | Open |
| Improvement | grpc_handler.go:559 | findFailedPhase relies on positional correlation between step results and planned calls; document the invariant or add a guard | Open |
| Improvement | grpc_handler.go + history.go | recordHistoryWithPhaseStatus / StoreManifestVersionWithPhaseStatus duplicate their base methods; consider adding an optional PhaseStatusMap parameter | Open |

Bot Review Notes

CodeRabbit thread 1 (grpc_handler.go:190 - expectedSeq=0 on failure path): This is intentional and correct. Failed/partial applies record audit history unconditionally - they should not be blocked by sequence conflicts from concurrent successful applies. The dropped error follows the existing pattern where history recording is non-fatal. No action needed.

CodeRabbit thread 2 (grpc_handler.go:561 - intra-phase partial work): Valid observation but by design. This PR tracks per-phase status, not per-step. A phase with 3 calls where call 3 fails is correctly marked FAILED at the phase level. Sub-phase granularity would be a separate enhancement. The only scenario where this matters for classifyFailure is if the first phase fails (no prior completed phases), which correctly returns FAILED rather than PARTIAL. No action needed for this PR's scope.

claude Bot previously requested changes Mar 16, 2026

See summary comment. 1 blocking finding on findFailedPhase name mismatch, 1 suggestion on duplicate proto conversion.

Comment on lines +549 to +560
entry.StartedAt = nil
} else {
// No phase-level info available; mark all as failed
entry.Status = manifest.PhaseStatusFailed
entry.CompletedAt = &now
if execErr != nil {
entry.Error = execErr.Error()
}
}
phaseStatus[key] = entry
}
}

MUST FIX: findFailedPhase builds a map keyed by string(call.GRPCMethod) (fully-qualified gRPC paths like meridian.reference_data.v1.ReferenceDataService/RegisterInstrument) but matches against saga.StepResult.StepName which uses handler registry names (like reference_data.register_instrument).

These will never match in production, so findFailedPhase always returns 0, updatePhaseStatus marks all phases as FAILED (the failedPhase == 0 else-branch), and classifyFailure never sees both COMPLETED and FAILED entries. The PARTIAL status is unreachable.

The unit tests mask this because they use matching short names for both GRPCMethod and StepName.

Suggested fix: Add a HandlerName field to PlannedCall that stores the registry name (e.g., reference_data.register_instrument), then use that field as the map key here instead of GRPCMethod. Also update tests to use realistic production names to prevent regression.

- Fix findFailedPhase to use positional correlation instead of name
  matching. Saga step results use handler names (e.g.
  "reference_data.register_instrument") while execution plan calls use
  gRPC method paths. Positional index is the correct mapping.
- Fix EntityToProto to propagate phase_status deserialization errors
  instead of silently dropping them.
- Rewrite StoreManifestVersionWithPhaseStatus to include phase_status
  in the initial Store call (single atomic write) instead of a
  separate UpdatePhaseStatus call that could fail silently.
- Remove misleading "partial" from executor Status field comment since
  partial classification happens in the gRPC handler, not the executor.
CockroachDB cannot drop and recreate a constraint with the same name
in a single transaction. Split the apply_status constraint change into
two migrations: one to drop the old constraint, one to add the new.
@coderabbitai coderabbitai Bot left a comment

♻️ Duplicate comments (1)
services/control-plane/internal/applier/grpc_handler.go (1)

184-191: ⚠️ Potential issue | 🟠 Major

Optimistic locking is bypassed for failed/partial history writes.

Line 190 hardcodes expectedSeq=0 and discards both return values, allowing stale requests to append FAILED/PARTIAL history rows even after another manifest version has already advanced the sequence. This should use the same sequence check as the success path (line 197).

🔧 Proposed fix
-		_, _ = h.recordHistoryWithPhaseStatus(ctx, req.GetManifest(), req.GetAppliedBy(), execResult.jobID, applyStatus, nil, 0, execResult.phaseStatus)
+		if _, err := h.recordHistoryWithPhaseStatus(
+			ctx,
+			req.GetManifest(),
+			req.GetAppliedBy(),
+			execResult.jobID,
+			applyStatus,
+			nil,
+			req.GetExpectedSequenceNumber(),
+			execResult.phaseStatus,
+		); err != nil && errors.Is(err, manifest.ErrSequenceConflict) {
+			return nil, status.Errorf(codes.Aborted, "%v", err)
+		}
🧹 Nitpick comments (2)
services/control-plane/internal/applier/grpc_handler.go (1)

652-672: Duplicate conversion logic with manifest.phaseStatusMapToProto.

This function duplicates the logic in services/control-plane/internal/manifest/history.go:336-356. Consider exposing the manifest package's phaseStatusMapToProto and reusing it here, or extracting to a shared location.

services/control-plane/internal/manifest/history.go (1)

135-204: Consider extracting shared logic to reduce duplication.

StoreManifestVersionWithPhaseStatus duplicates most of the logic from StoreManifestVersion (manifest marshaling, diff summary generation, graph serialization, entity creation). Consider refactoring to a shared internal helper that both methods call, passing phaseStatus as an optional parameter.

♻️ Suggested approach
+// storeVersionInternal contains the shared logic for storing manifest versions.
+func (s *HistoryService) storeVersionInternal(
+	ctx context.Context,
+	mf *controlplanev1.Manifest,
+	appliedBy string,
+	applyJobID *uuid.UUID,
+	applyStatus ApplyStatus,
+	graph *validator.RelationshipGraph,
+	expectedSeq int64,
+	phaseStatus PhaseStatusMap,
+) (*VersionEntity, error) {
+	// ... shared marshaling, diff, graph, entity creation logic ...
+	if phaseStatus != nil {
+		if err := entity.SetPhaseStatus(phaseStatus); err != nil {
+			return nil, fmt.Errorf("failed to serialize phase_status: %w", err)
+		}
+	}
+	// ... store and return ...
+}
+
 func (s *HistoryService) StoreManifestVersion(...) (*VersionEntity, error) {
-	// ... 60+ lines of duplicated logic ...
+	return s.storeVersionInternal(ctx, manifest, appliedBy, applyJobID, status, graph, expectedSeq, nil)
 }
 
 func (s *HistoryService) StoreManifestVersionWithPhaseStatus(...) (*VersionEntity, error) {
-	// ... 60+ lines of duplicated logic ...
+	return s.storeVersionInternal(ctx, mf, appliedBy, applyJobID, applyStatus, graph, expectedSeq, phaseStatus)
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@services/control-plane/internal/manifest/history.go` around lines 135 - 204,
StoreManifestVersionWithPhaseStatus duplicates the manifest marshaling, diff
summary generation, relationship-graph JSON serialization, and VersionEntity
construction found in StoreManifestVersion; refactor by extracting a shared
helper (e.g., prepareVersionEntity or buildVersionEntity) that accepts the
Manifest (mf), appliedBy, applyJobID, applyStatus, graph, expectedSeq, and an
optional PhaseStatusMap, performs the protojson.Marshal, calls
generateDiffSummary when needed, serializes the graph, constructs and returns a
fully populated VersionEntity (or an entity plus any error), and then have both
StoreManifestVersion and StoreManifestVersionWithPhaseStatus call that helper
and only handle repo.Store and return; ensure SetPhaseStatus remains invoked
inside the helper when a non-nil phaseStatus is provided and preserve existing
error handling and semantics.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@services/control-plane/internal/applier/grpc_handler.go`:
- Around line 184-191: The history write on failure currently calls
recordHistoryWithPhaseStatus(..., expectedSeq=0) and discards its results, which
bypasses optimistic-lock checks; change that call to use the same expected
sequence value used by the success path (the sequence variable passed to the
success recordHistoryWithPhaseStatus call) and handle/propagate the returned
(seq, err) instead of ignoring them so that stale requests cannot append
FAILED/PARTIAL rows when the manifest sequence has advanced; locate the call to
recordHistoryWithPhaseStatus and replace the hardcoded 0 with the shared
expectedSeq variable (and check the returned error/sequence like the success
path does).

---

Nitpick comments:
In `@services/control-plane/internal/applier/grpc_handler.go`:
- Around line 652-672: The function phaseStatusMapToResponseProto duplicates
manifest.phaseStatusMapToProto; update this code to reuse the shared conversion
by either exporting manifest.phaseStatusMapToProto (rename to
PhaseStatusMapToProto) or moving the conversion into a shared helper used by
both packages, then replace phaseStatusMapToResponseProto’s body with a call to
the shared function (preserving the target type controlplanev1.PhaseStatusDetail
if necessary by adapting the shared helper return type or adding a thin
adapter). Ensure references to manifest.PhaseStatusMap and the existing
manifest.phaseStatusMapToProto/PhaseStatusMapToProto symbol are used so logic is
not duplicated.

In `@services/control-plane/internal/manifest/history.go`:
- Around line 135-204: StoreManifestVersionWithPhaseStatus duplicates the
manifest marshaling, diff summary generation, relationship-graph JSON
serialization, and VersionEntity construction found in StoreManifestVersion;
refactor by extracting a shared helper (e.g., prepareVersionEntity or
buildVersionEntity) that accepts the Manifest (mf), appliedBy, applyJobID,
applyStatus, graph, expectedSeq, and an optional PhaseStatusMap, performs the
protojson.Marshal, calls generateDiffSummary when needed, serializes the graph,
constructs and returns a fully populated VersionEntity (or an entity plus any
error), and then have both StoreManifestVersion and
StoreManifestVersionWithPhaseStatus call that helper and only handle repo.Store
and return; ensure SetPhaseStatus remains invoked inside the helper when a
non-nil phaseStatus is provided and preserve existing error handling and
semantics.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 40e51b84-28c5-4b78-bf0e-e9ca2ce074e4

📥 Commits

Reviewing files that changed from the base of the PR and between f474248 and 7a76cdf.

⛔ Files ignored due to path filters (1)
  • services/control-plane/migrations/atlas.sum is excluded by !**/*.sum
📒 Files selected for processing (5)
  • services/control-plane/internal/applier/grpc_handler.go
  • services/control-plane/internal/applier/grpc_handler_test.go
  • services/control-plane/internal/manifest/history.go
  • services/control-plane/migrations/20260316000003_add_partial_apply_status.sql
  • services/control-plane/migrations/20260316000004_add_partial_apply_status_constraint.sql


@claude claude Bot left a comment

All critical findings from the previous review are resolved. One non-blocking code quality suggestion (duplicate function). See summary comment for full details and bot thread assessments.

@codecov

codecov Bot commented Mar 16, 2026

@bjcoombs bjcoombs dismissed stale reviews from coderabbitai[bot] and claude[bot] March 16, 2026 10:31

Stale review: issues addressed in commits d47e28d and 7a76cdf

Keep both UpdatePhaseStatus (this branch) and GetBySequenceNumber
(from develop). Regenerate proto files for merged proto changes.

@claude claude Bot left a comment

See summary comment. 3 inline suggestions.


// phaseStatusMapToResponseProto converts a PhaseStatusMap to proto map for the response.
func phaseStatusMapToResponseProto(ps manifest.PhaseStatusMap) map[string]*controlplanev1.PhaseStatusDetail {
if ps == nil {

Suggestion: phaseStatusMapToResponseProto here is identical to phaseStatusMapToProto in history.go. Two copies of the same conversion logic will drift. Export PhaseStatusMapToProto from the manifest package and call it from both places.

Comment on lines +572 to +580
func findFailedPhase(plan *planner.ExecutionPlan, result *ApplyManifestResult) planner.Phase {
if result == nil || len(result.StepResults) == 0 {
return 0
}

for i, step := range result.StepResults {
if !step.Success {
if i < len(plan.Calls) {
return plan.Calls[i].Phase

Suggestion: The positional correlation between result.StepResults and plan.Calls is correct today (the executor runs calls sequentially in plan order), but this invariant is implicit. Consider adding a brief comment explaining why positional correlation holds so future editors know this is load-bearing.
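A hedged sketch of what that documented invariant might look like; `Call`, `ExecutionPlan`, `StepResult`, and `ApplyManifestResult` are simplified stand-ins for the planner and applier types, with phases as plain ints.

```go
package main

import "fmt"

// Simplified stand-ins for the planner/applier types.
type Call struct{ Phase int }
type ExecutionPlan struct{ Calls []Call }
type StepResult struct{ Success bool }
type ApplyManifestResult struct{ StepResults []StepResult }

// findFailedPhase returns the phase containing the first failed step, or 0 if
// no failure can be attributed.
//
// INVARIANT (load-bearing): the executor runs plan.Calls sequentially in plan
// order and appends exactly one StepResult per executed call, so
// StepResults[i] corresponds to plan.Calls[i]. If the executor ever reorders,
// skips, or parallelizes calls, this positional correlation breaks and phase
// attribution must be keyed explicitly (e.g., a Phase field on StepResult).
func findFailedPhase(plan *ExecutionPlan, result *ApplyManifestResult) int {
	if result == nil || len(result.StepResults) == 0 {
		return 0
	}
	for i, step := range result.StepResults {
		if !step.Success && i < len(plan.Calls) {
			return plan.Calls[i].Phase
		}
	}
	return 0
}

func main() {
	plan := &ExecutionPlan{Calls: []Call{{Phase: 1}, {Phase: 1}, {Phase: 2}}}
	result := &ApplyManifestResult{StepResults: []StepResult{{true}, {true}, {false}}}
	fmt.Println(findFailedPhase(plan, result)) // third call fails → 2
}
```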

Comment on lines +135 to +204
// StoreManifestVersionWithPhaseStatus saves a manifest snapshot with phase-level execution status.
// Phase status is included in the initial entity creation (single atomic write).
func (s *HistoryService) StoreManifestVersionWithPhaseStatus(
ctx context.Context,
mf *controlplanev1.Manifest,
appliedBy string,
applyJobID *uuid.UUID,
applyStatus ApplyStatus,
graph *validator.RelationshipGraph,
expectedSeq int64,
phaseStatus PhaseStatusMap,
) (*VersionEntity, error) {
if mf == nil {
return nil, ErrNilManifest
}
if appliedBy == "" {
return nil, ErrEmptyAppliedBy
}

marshaler := protojson.MarshalOptions{UseProtoNames: true}
jsonBytes, err := marshaler.Marshal(mf)
if err != nil {
return nil, fmt.Errorf("failed to marshal manifest to JSON: %w", err)
}

var diffSummary *string
if applyStatus == ApplyStatusApplied {
summary, diffErr := s.generateDiffSummary(ctx, mf)
if diffErr != nil && !errors.Is(diffErr, ErrVersionNotFound) {
return nil, fmt.Errorf("failed to generate diff summary: %w", diffErr)
}
if summary != "" {
diffSummary = &summary
}
}

var graphJSON *string
if graph != nil {
if graphBytes, graphErr := json.Marshal(graph); graphErr == nil {
s := string(graphBytes)
graphJSON = &s
}
}

now := time.Now().UTC()
entity := &VersionEntity{
ID: uuid.New(),
Version: mf.Version,
ManifestJSON: string(jsonBytes),
AppliedAt: now,
AppliedBy: appliedBy,
ApplyStatus: applyStatus,
ApplyJobID: applyJobID,
DiffSummary: diffSummary,
RelationshipGraph: graphJSON,
CreatedAt: now,
}

if phaseStatus != nil {
if err := entity.SetPhaseStatus(phaseStatus); err != nil {
return nil, fmt.Errorf("failed to serialize phase_status: %w", err)
}
}

if err := s.repo.Store(ctx, entity, expectedSeq); err != nil {
return nil, err
}

return entity, nil
}

Suggestion: StoreManifestVersionWithPhaseStatus is a near-identical copy of StoreManifestVersion (75 lines) differing only by the SetPhaseStatus call. Consider extracting shared entity construction into a private buildVersionEntity helper.

coderabbitai[bot]
coderabbitai Bot previously requested changes Mar 16, 2026

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
services/control-plane/internal/applier/grpc_handler.go (1)

194-216: ⚠️ Potential issue | 🟠 Major

Persist phase_status on successful applies too.

The success path still writes history through recordHistory, so the stored manifest_versions row never gets phase_status. services/control-plane/internal/manifest/history.go already added StoreManifestVersionWithPhaseStatus, and EntityToProto only emits phase data when it was persisted, so later history reads will lose the phase status you just returned in the apply response.

Suggested fix
-	snapshot, seqErr := h.recordHistory(ctx, req.GetManifest(), req.GetAppliedBy(), execResult.jobID, manifest.ApplyStatusApplied, validationResult.graph, expectedSeq)
+	snapshot, seqErr := h.recordHistoryWithPhaseStatus(
+		ctx,
+		req.GetManifest(),
+		req.GetAppliedBy(),
+		execResult.jobID,
+		manifest.ApplyStatusApplied,
+		validationResult.graph,
+		expectedSeq,
+		execResult.phaseStatus,
+	)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@services/control-plane/internal/applier/grpc_handler.go` around lines 194 -
216, The manifest version row isn't persisting phase_status on successful
applies; after the successful call to recordHistory in ApplyManifest handler,
call the new persistence method that includes phase info
(StoreManifestVersionWithPhaseStatus) on the version store instead of the
current Save so the stored manifest_versions row records execResult.phaseStatus;
locate the block around recordHistory and the subsequent h.versionStore.Save
call in grpc_handler.go and replace or augment that save with a call to the
versionStore's StoreManifestVersionWithPhaseStatus (passing req.GetManifest(),
req.GetAppliedBy(), and execResult.phaseStatus) so future history reads via
EntityToProto will include the persisted phase_status.
♻️ Duplicate comments (1)
services/control-plane/internal/applier/grpc_handler.go (1)

184-190: ⚠️ Potential issue | 🟠 Major

Use the request sequence on failed/partial history writes.

This still hardcodes expectedSeq=0 and drops recordHistoryWithPhaseStatus’s error. A stale apply can therefore append FAILED/PARTIAL history after the sequence has advanced, and ErrSequenceConflict never reaches the client.

Suggested fix
-		_, _ = h.recordHistoryWithPhaseStatus(ctx, req.GetManifest(), req.GetAppliedBy(), execResult.jobID, applyStatus, nil, 0, execResult.phaseStatus)
+		if _, seqErr := h.recordHistoryWithPhaseStatus(
+			ctx,
+			req.GetManifest(),
+			req.GetAppliedBy(),
+			execResult.jobID,
+			applyStatus,
+			nil,
+			req.GetExpectedSequenceNumber(),
+			execResult.phaseStatus,
+		); seqErr != nil {
+			logger.Warn("sequence number conflict during history recording", "error", seqErr)
+			return nil, status.Errorf(codes.Aborted, "%v", seqErr)
+		}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@services/control-plane/internal/applier/grpc_handler.go` around lines 184 -
190, The history write currently hardcodes expectedSeq=0 and discards errors
from recordHistoryWithPhaseStatus; change the call that writes history to use
the request sequence (e.g., req.GetSequence()) instead of 0, capture the
returned error from recordHistoryWithPhaseStatus(ctx, req.GetManifest(),
req.GetAppliedBy(), execResult.jobID, applyStatus, nil, req.GetSequence(),
execResult.phaseStatus), and propagate sequence conflicts and other errors back
to the caller (detect ErrSequenceConflict and convert/return it appropriately)
rather than ignoring them so stale applies cannot append FAILED/PARTIAL history.
🧹 Nitpick comments (1)
services/control-plane/internal/manifest/history.go (1)

137-204: Extract the common snapshot-building logic.

StoreManifestVersion and StoreManifestVersionWithPhaseStatus now duplicate manifest marshaling, diff-summary generation, graph serialization, and VersionEntity construction. Pulling that into one helper will make future field or validation changes land in one place instead of drifting between the two write paths.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@services/control-plane/internal/manifest/history.go` around lines 137 - 204,
StoreManifestVersion and StoreManifestVersionWithPhaseStatus duplicate manifest
marshaling, diff-summary generation, graph serialization, and VersionEntity
construction; extract that logic into a single helper (e.g., buildVersionEntity
or buildSnapshot) that accepts (ctx, mf *controlplanev1.Manifest, appliedBy
string, applyJobID *uuid.UUID, applyStatus ApplyStatus, graph
*validator.RelationshipGraph, now time.Time, phaseStatus PhaseStatusMap) and
returns (*VersionEntity, error). Move protojson marshaling, generateDiffSummary
(only when applyStatus == ApplyStatusApplied and preserving ErrVersionNotFound
handling), graph JSON serialization, and the VersionEntity field population into
that helper, and call entity.SetPhaseStatus inside the helper when phaseStatus
!= nil; then have both StoreManifestVersion and
StoreManifestVersionWithPhaseStatus call the helper and only keep repo.Store and
expectedSeq handling in the callers to preserve current semantics.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@services/control-plane/internal/applier/grpc_handler.go`:
- Around line 512-560: The code in updatePhaseStatus (after executor.Apply)
currently backfills StartedAt/CompletedAt with a single synthetic now for every
phase (variables: phases, phaseStatus, findFailedPhase, result, execErr), which
violates the manifest timing contract; instead, stop setting entry.StartedAt and
entry.CompletedAt to &now unless you can source real per-phase timings from the
executor result (e.g., per-step/phase timestamps on result or StepResults) and
copy those into entry.StartedAt/CompletedAt; otherwise, only set entry.Status
and entry.Error as appropriate and leave StartedAt/CompletedAt nil so inaccurate
timestamps are not recorded.

---

Outside diff comments:
In `@services/control-plane/internal/applier/grpc_handler.go`:
- Around line 194-216: The manifest version row isn't persisting phase_status on
successful applies; after the successful call to recordHistory in ApplyManifest
handler, call the new persistence method that includes phase info
(StoreManifestVersionWithPhaseStatus) on the version store instead of the
current Save so the stored manifest_versions row records execResult.phaseStatus;
locate the block around recordHistory and the subsequent h.versionStore.Save
call in grpc_handler.go and replace or augment that save with a call to the
versionStore's StoreManifestVersionWithPhaseStatus (passing req.GetManifest(),
req.GetAppliedBy(), and execResult.phaseStatus) so future history reads via
EntityToProto will include the persisted phase_status.

---

Duplicate comments:
In `@services/control-plane/internal/applier/grpc_handler.go`:
- Around line 184-190: The history write currently hardcodes expectedSeq=0 and
discards errors from recordHistoryWithPhaseStatus; change the call that writes
history to use the request sequence (e.g., req.GetSequence()) instead of 0,
capture the returned error from recordHistoryWithPhaseStatus(ctx,
req.GetManifest(), req.GetAppliedBy(), execResult.jobID, applyStatus, nil,
req.GetSequence(), execResult.phaseStatus), and propagate sequence conflicts and
other errors back to the caller (detect ErrSequenceConflict and convert/return
it appropriately) rather than ignoring them so stale applies cannot append
FAILED/PARTIAL history.

---

Nitpick comments:
In `@services/control-plane/internal/manifest/history.go`:
- Around line 137-204: StoreManifestVersion and
StoreManifestVersionWithPhaseStatus duplicate manifest marshaling, diff-summary
generation, graph serialization, and VersionEntity construction; extract that
logic into a single helper (e.g., buildVersionEntity or buildSnapshot) that
accepts (ctx, mf *controlplanev1.Manifest, appliedBy string, applyJobID
*uuid.UUID, applyStatus ApplyStatus, graph *validator.RelationshipGraph, now
time.Time, phaseStatus PhaseStatusMap) and returns (*VersionEntity, error). Move
protojson marshaling, generateDiffSummary (only when applyStatus ==
ApplyStatusApplied and preserving ErrVersionNotFound handling), graph JSON
serialization, and the VersionEntity field population into that helper, and call
entity.SetPhaseStatus inside the helper when phaseStatus != nil; then have both
StoreManifestVersion and StoreManifestVersionWithPhaseStatus call the helper and
only keep repo.Store and expectedSeq handling in the callers to preserve current
semantics.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 222da673-9cff-4b18-8b76-9b304b2514ca

📥 Commits

Reviewing files that changed from the base of the PR and between 7a76cdf and 5d1d3a5.

📒 Files selected for processing (5)
  • api/proto/meridian/control_plane/v1/apply_manifest_service.proto
  • api/proto/meridian/control_plane/v1/manifest_history_service.proto
  • services/control-plane/internal/applier/grpc_handler.go
  • services/control-plane/internal/manifest/history.go
  • services/control-plane/internal/manifest/repository.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • api/proto/meridian/control_plane/v1/apply_manifest_service.proto

Comment on lines +512 to +560
now := time.Now().UTC()
phases := plan.Phases()

if execErr == nil && result != nil {
// All phases completed successfully
for _, p := range phases {
key := fmt.Sprintf("phase_%d", p)
entry := phaseStatus[key]
entry.Status = manifest.PhaseStatusCompleted
entry.StartedAt = &now
entry.CompletedAt = &now
phaseStatus[key] = entry
}
return
}

// On failure: mark phases based on step results.
// Steps are executed in phase order. We mark phases as COMPLETED until we
// find a failed step, then the containing phase is FAILED and the rest are SKIPPED.
failedPhase := findFailedPhase(plan, result)

for _, p := range phases {
key := fmt.Sprintf("phase_%d", p)
entry := phaseStatus[key]
entry.StartedAt = &now

if failedPhase > 0 && p < failedPhase {
entry.Status = manifest.PhaseStatusCompleted
entry.CompletedAt = &now
} else if failedPhase > 0 && p == failedPhase {
entry.Status = manifest.PhaseStatusFailed
entry.CompletedAt = &now
if result != nil {
entry.Error = result.Error
} else if execErr != nil {
entry.Error = execErr.Error()
}
} else if failedPhase > 0 {
entry.Status = manifest.PhaseStatusSkipped
entry.StartedAt = nil
} else {
// No phase-level info available; mark all as failed
entry.Status = manifest.PhaseStatusFailed
entry.CompletedAt = &now
if execErr != nil {
entry.Error = execErr.Error()
}
}
phaseStatus[key] = entry

⚠️ Potential issue | 🟠 Major

Don’t backfill phase timings with one post-run timestamp.

updatePhaseStatus runs only after executor.Apply returns, so every executed phase gets the same synthetic now. That does not match the started_at / completed_at contract added in api/proto/meridian/control_plane/v1/manifest_history_service.proto and will make the stored phase timing misleading. Either thread real per-phase timing out of the executor or leave these fields unset until that data exists.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@services/control-plane/internal/applier/grpc_handler.go` around lines 512 -
560, The code in updatePhaseStatus (after executor.Apply) currently backfills
StartedAt/CompletedAt with a single synthetic now for every phase (variables:
phases, phaseStatus, findFailedPhase, result, execErr), which violates the
manifest timing contract; instead, stop setting entry.StartedAt and
entry.CompletedAt to &now unless you can source real per-phase timings from the
executor result (e.g., per-step/phase timestamps on result or StepResults) and
copy those into entry.StartedAt/CompletedAt; otherwise, only set entry.Status
and entry.Error as appropriate and leave StartedAt/CompletedAt nil so inaccurate
timestamps are not recorded.

@bjcoombs bjcoombs dismissed coderabbitai[bot]’s stale review March 16, 2026 11:15

Design suggestions acknowledged: timestamps represent recording time, optimistic locking on the failure path is pre-existing behavior, and code dedup is follow-up work

@bjcoombs bjcoombs merged commit eb0de06 into develop Mar 16, 2026
54 checks passed
@bjcoombs bjcoombs deleted the 045-manifest-source-of-truth--11--phase-status-partial branch March 16, 2026 11:19
