Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
134 changes: 134 additions & 0 deletions features/deployment-delete-race-prevention.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,134 @@
# Deployment delete: race prevention and orphan detection

## What it is

Hardening for concurrent deployment mutations after the **toets-hn7/pr-36**
incident (2026-06-24): two sibling deployment deletes of the same project ran
concurrently, collided on the shared project file during `git push`, hit an
unrecoverable rebase conflict, and left the deployment half-deleted. The same
bug class produced the durable **toets-hn7/pr-32** orphan (a deployment removed
from the project file on 2026-06-12 whose ArgoCD Application and pods kept
running for 12 days, undetected).

Three changes address it, plus a remediation runbook for existing orphans.

## Background: why it happened

Every deployment of a project writes the **same** shared git files:
`projects/<project>.yaml` and the per-project ArgoCD `kustomization.yaml`. The
async task worker ran two deletes for the same project concurrently because the
in-flight guard keyed on `project_name` **and** `deployment_name`. A long
ArgoCD-prune wait (3.5 min for pr-36) widened the window for the sibling delete
to land on `main`, so pr-36's rebase could not auto-merge the overlapping list
edit. The push path aborted on conflict and failed the task terminally; only the
blind task auto-retry recovered it. When the retry does not recover (pr-32), the
orphan is permanent because nothing reconciles deployment-level drift.

## The changes

### 1a. Serialize project-file-mutating tasks per project

`AsyncTaskService.claim_next_task` no longer starts a pending task while another
**project-file-mutating** task for the same project is in flight, regardless of
deployment. The set is `PROJECT_FILE_MUTATING_TASK_TYPES` (upsert/update-image/
delete deployment, refresh, create-project, add-component(-to-deployment),
add-service). Clone/backup/restore are excluded so a slow restore never blocks
deploys. Non-mutating tasks keep the original per-deployment behavior.

This removes the pr-36/pr-37 collision at the source. Cross-project work still
runs in parallel; only same-project mutations serialize.

### 1b. Self-heal the project-file push on a true rebase conflict

`GitConnector.push_changes` / `commit_and_push` accept an optional `reapply`
async callback. On a rebase conflict (non-fast-forward then unmergeable):

- With `reapply`: hard-reset to the current remote, invoke `reapply` to
re-apply the intended change on top of it, re-commit, and retry the push. This
converges on a textual conflict that is semantically trivial.
- Without `reapply`: raise the new typed `GitPushConflictError`.

The deployment-delete path (step 8, removing the deployment from the project
file) passes a `reapply` that re-reads the project file fresh and re-removes the
deployment. This covers residual races that 1a cannot (notably the resource
auto-tuner, which commits to project files outside the task system).

### 2a. Deployment drift report (read-only)

`GET /api/v2/admin/deployments/drift` (admin API key) compares deployments
declared in project files against live ArgoCD Applications and reports:

- `orphaned_deployments`: live application with no project-file entry (the pr-32
case),
- `missing_deployments`: declared but no live application.

It performs **zero** mutations (mirrors the service-orphan sweep's "no
auto-delete from a scan" rule). Logic lives in `opi/jobs/deployment_drift.py`
(`classify_deployment_drift`, pure and unit-tested); the endpoint lists
Applications via kubectl using the `project` label so it does not depend on
ArgoCD RBAC.

Project-infrastructure apps (`{project}-infrastructure`, which manage per-project
infra such as PostgreSQL clusters) carry the same `project` label but are not
deployments and never appear in `deployments[]`. They are explicitly excluded so
they are not mis-reported as orphans. Validated against live production (93
ArgoCD apps, 36 project files): the report returns exactly the one real orphan
(`toets-hn7-pr-32`), zero false positives, zero missing.

```bash
curl -X GET "https://<opi-host>/api/v2/admin/deployments/drift" \
-H "X-API-Key: <admin-api-key>"
```

## Remediating an existing orphan (e.g. pr-32)

Drift is report-only; cleanup is deliberate. Prefer driving the delete through
OPI once this build is rolled out: re-add the deployment to its project file,
let OPI adopt it on reprocess, then `DELETE /api/v2/projects/<project>/<dep>`
for a full, idempotent teardown (services, namespace, manifests, app).

Manual Git fallback (two phases, matching OPI's own ordering to avoid a
finalizer deadlock):

1. In the ArgoCD-applications repo: delete
`odcn-production/<project>/<project>-<dep>-argocd-application.yaml` and remove
its line from `odcn-production/<project>/kustomization.yaml`. Commit + push.
Wait for ArgoCD to prune the Application (its `resources-finalizer` removes
the pods).
2. After the app is gone, in the deployments repo: delete the
`odcn-production/<project>/<dep>/` manifest folder. Commit + push.

Also clear any now-stale `marked_for_deletion` row (a delete that timed out on
the ArgoCD wait marks `deployment_manifests` for deferred cleanup; a later retry
that deletes the manifests can leave the row behind):

```bash
curl -X GET "https://<opi-host>/api/v2/admin/marked-for-deletion?project_name=<project>" -H "X-API-Key: <key>"
curl -X DELETE "https://<opi-host>/api/v2/admin/marked-for-deletion/<mark_id>" -H "X-API-Key: <key>"
```

## Files

- `opi/core/async_task_service.py` - `PROJECT_FILE_MUTATING_TASK_TYPES`, per-project claim guard
- `opi/connectors/git.py` - `GitPushConflictError`, `reapply` hook, `_reset_to_remote`
- `opi/manager/delete_project_manager.py` - step 8 passes a `reapply` closure
- `opi/jobs/deployment_drift.py` - drift classification
- `opi/api/admin_router.py` - `GET /api/v2/admin/deployments/drift`

### Tests

- `tests/test_async_task_service.py` - claim-guard query wiring (mocked)
- `tests/test_async_task_claim_serialization_db.py` - real-Postgres proof of 1a
(the pr-36/pr-37 scenario, backups not over-serialized, cross-project
parallelism). Marked `requires_infra`; run with `TEST_DATABASE_DSN` set against
an ephemeral Postgres (see the module docstring).
- `tests/test_git_push_conflict.py` - 1b control flow (mocked git)
- `tests/test_git_push_conflict_integration.py` - real-git end-to-end: forces an
actual rebase conflict and proves the reapply path converges while preserving
the concurrent writer's change.
- `tests/test_deployment_drift.py` - drift classification incl. infra-app exclusion

## Dependencies

None new. Uses the existing async task queue, GitConnector, kubectl connector,
and admin API-key auth.
35 changes: 35 additions & 0 deletions operations-manager/python/opi/api/admin_router.py
Original file line number Diff line number Diff line change
Expand Up @@ -8,6 +8,7 @@
POST /api/v2/admin/cleanup/trigger - Trigger cleanup (purge expired)
POST /api/v2/admin/reconciliation/trigger - Trigger full reconciliation
DELETE /api/v2/admin/marked-for-deletion/{mark_id} - Remove a specific mark
GET /api/v2/admin/deployments/drift - Report live-but-undeclared deployments
"""

import logging
Expand Down Expand Up @@ -215,6 +216,40 @@ async def orphan_sweep_report(request: Request) -> JSONResponse:
return JSONResponse(content=report, status_code=200)


@admin_router.get("/deployments/drift")
@validate_admin_api_key
async def deployment_drift_report(request: Request) -> JSONResponse:
"""Report deployments that are live on the cluster but no longer declared.

Compares the deployments in the project files against the live ArgoCD
Application resources. Surfaces ``orphaned_deployments`` (a live application
with no project-file entry, the toets-hn7/pr-36-class durable failure where a
terminal delete left the app/manifests/pods running) and
``missing_deployments`` (declared but no live application). Performs ZERO
mutations; remediation is a deliberate operator step.

Example:
curl -X GET "http://localhost:9595/api/v2/admin/deployments/drift" \\
-H "X-API-Key: your-admin-api-key"
"""
from opi.connectors.kubectl import create_kubectl_connector
from opi.core.cluster_config import get_argo_namespace
from opi.jobs.deployment_drift import classify_deployment_drift
from opi.services.project_service import get_project_service

cluster = settings.CLUSTER_MANAGER
all_projects = get_project_service().get_all_projects()
project_yamls: list[dict[str, Any]] = [p.data for p in all_projects.values() if p.data]

kubectl = create_kubectl_connector()
# OPI labels every Application it creates with `project`; selecting on the
# label's existence lists all OPI-managed apps without depending on ArgoCD RBAC.
argo_apps = await kubectl.get_resources_by_label("applications.argoproj.io", get_argo_namespace(cluster), "project")

report = classify_deployment_drift(project_yamls, cluster, argo_apps)
return JSONResponse(content=report, status_code=200)


@admin_router.post("/orphans/confirm")
@validate_admin_api_key
async def confirm_orphans(request: Request) -> JSONResponse:
Expand Down
81 changes: 76 additions & 5 deletions operations-manager/python/opi/connectors/git.py
Original file line number Diff line number Diff line change
Expand Up @@ -46,6 +46,16 @@ def _obfuscate_git_command(cmd_str: str) -> str:
return obfuscated


class GitPushConflictError(RuntimeError):
"""Raised when a push fails because a rebase on the remote branch hit a
content conflict that git cannot resolve automatically.

Distinct from generic push failures so callers can react specifically (for
example by re-reading the remote state and re-applying their intended change
instead of failing terminally). See push_changes(reapply=...).
"""


class GitConnector:
"""Connector for interacting with Git repositories using GitPython."""

Expand Down Expand Up @@ -825,6 +835,23 @@ async def _rebase_on_remote(self, branch: str | None = None) -> bool:
logger.error(f"Failed to rebase on {server_info}: {e}")
raise

async def _reset_to_remote(self, branch: str | None = None) -> None:
"""Discard local commits and working-tree changes, hard-resetting to
origin/<branch>.

Used to recover from an unresolvable rebase conflict before re-applying an
intended change on top of the current remote state.
"""
target_branch = branch or self.branch
fetch_cmd = ["fetch", "origin", target_branch]
_, stderr, code = await self._run_git_command(fetch_cmd, cwd=self.__working_dir)
self._check_git_command_result(code, stderr, f"fetch origin/{target_branch}")

reset_cmd = ["reset", "--hard", f"origin/{target_branch}"]
_, stderr, code = await self._run_git_command(reset_cmd, cwd=self.__working_dir)
self._check_git_command_result(code, stderr, f"reset --hard origin/{target_branch}")
logger.info(f"Hard-reset working tree to origin/{target_branch}")

async def file_changed_between_commits(self, file_path: str, old_commit: str, new_commit: str) -> bool:
"""
Check if a specific file was changed between commits using git diff.
Expand Down Expand Up @@ -1098,14 +1125,22 @@ def _abort_if_plaintext_secrets_present(self) -> None:
"Dit zou secrets in platte tekst naar git committen; operatie afgebroken."
)

async def commit_and_push(self, message: str) -> None:
async def commit_and_push(
self,
message: str,
reapply: Callable[[], Coroutine[Any, Any, None]] | None = None,
) -> None:
"""
Commit all changes in the working directory and push to remote repository.
This stages all changes (including new files, modifications, and deletions) before committing.
This is always done at the end when no other git operations are needed.

Args:
message: Commit message
reapply: Optional async callback that re-applies the intended change to
the working tree if the push hits an unresolvable rebase conflict.
See push_changes for the recovery contract. When omitted, a rebase
conflict raises GitPushConflictError.
"""
logger.debug(f"Committing and pushing all changes: {message}")

Expand All @@ -1122,7 +1157,7 @@ async def commit_and_push(self, message: str) -> None:

# Commit and push the changes
await self.commit_changes(message)
await self.push_changes()
await self.push_changes(reapply=reapply, commit_message=message)

logger.info(f"Successfully committed and pushed all changes: {message}")

Expand Down Expand Up @@ -1333,19 +1368,42 @@ async def commit_changes(self, message: str) -> None:
logger.debug(f"Successfully committed changes: {message}")

# TODO: update push changes to handle rebase, and if rebase fails, commit and push to temporary branch
async def push_changes(self, branch: str | None = None, max_retries: int = 5) -> None:
async def push_changes(
self,
branch: str | None = None,
max_retries: int = 5,
reapply: Callable[[], Coroutine[Any, Any, None]] | None = None,
commit_message: str | None = None,
) -> None:
"""
Push committed changes to remote repository.

If the push fails due to non-fast-forward (remote has newer commits),
this method will automatically fetch, rebase, and retry the push.

If a non-fast-forward push then hits a content conflict during rebase
(concurrent writers touching the same file region, e.g. two sibling
deployment deletes editing the same project file), behaviour depends on
``reapply``:

- When ``reapply`` is provided, the local commit is discarded, the working
tree is hard-reset to the current remote state, ``reapply`` re-applies the
intended change on top of it, and the push is retried. This converges
instead of failing on a textual conflict that is semantically trivial.
- When ``reapply`` is None (default), a ``GitPushConflictError`` is raised.

Args:
branch: Branch to push to (defaults to configured branch)
max_retries: Maximum number of push attempts after rebase (default: 5)
reapply: Optional async callback that re-applies the caller's intended
change to the working tree (files only; staging and commit are
handled here). Invoked after a hard reset to the current remote.
commit_message: Commit message used when re-committing a re-applied
change. Required for the re-apply path to produce a clean commit.

Raises:
RuntimeError: If push fails after all retries or if rebase has conflicts
GitPushConflictError: If rebase hits a conflict and no ``reapply`` is given.
RuntimeError: If push fails after all retries for other reasons.
"""
await self.ensure_repo_cloned()

Expand Down Expand Up @@ -1380,8 +1438,21 @@ async def push_changes(self, branch: str | None = None, max_retries: int = 5) ->

rebase_success = await self._rebase_on_remote(target_branch)
if not rebase_success:
if reapply is not None:
logger.warning(
f"Rebase conflict on {target_branch}; resetting to current remote "
f"and re-applying the intended change before retrying push"
)
await self._reset_to_remote(target_branch)
await reapply()
# Stage and commit the re-applied change on top of the
# now-current remote state, then retry the push.
await self._run_git_command(["add", "-A"], cwd=self.__working_dir)
await self.commit_changes(commit_message or "Re-apply change after rebase conflict")
continue

server_info = self._get_server_context()
raise RuntimeError(
raise GitPushConflictError(
f"Cannot push to {target_branch} on {server_info}: "
f"Remote has conflicting changes that cannot be automatically merged. "
f"Manual intervention required."
Expand Down
Loading
Loading