Skip to content

Reconcile CAS races against concurrent writers#8

Closed
nodo wants to merge 5 commits into
mainfrom
nodo/fix-race-refs
Closed

Reconcile CAS races against concurrent writers#8
nodo wants to merge 5 commits into
mainfrom
nodo/fix-race-refs

Conversation

@nodo

@nodo nodo commented Apr 21, 2026

Copy link
Copy Markdown
Collaborator

Two concurrent push using mirror-worker now could cause a race, e.g.

Trigger timeline (12:15:55–12:15:57 UTC)

  ┌──────────────┬───────────────────────┬────────────────┬───────────┬──────────┐
  │     Time     │         Event         │     Source     │ Body size │ Duration │
  ├──────────────┼───────────────────────┼────────────────┼───────────┼──────────┤
  │ 12:15:55.864 │ POST /webhooks/github │ 140.82.115.152 │ 20,206 B  │ 198 ms   │
  ├──────────────┼───────────────────────┼────────────────┼───────────┼──────────┤
  │ 12:15:56.098 │ POST /webhooks/github │ 140.82.115.117 │ 16,088 B  │ 76 ms    │
  ├──────────────┼───────────────────────┼────────────────┼───────────┼──────────┤
  │ 12:15:56.587 │ POST /webhooks/github │ 140.82.115.37  │ 25,547 B  │ 69 ms    │
  └──────────────┴───────────────────────┴────────────────┴───────────┴──────────┘

  All three are GitHub-Hookshot/d0accb2 (GitHub's webhook dispatcher) from different GitHub source IPs with different body sizes — so these are three distinct events, not retries. Each produced one NATS job (seq 106, 107, 108).

  How it cascaded into the CAS race

  Mirror-worker pods picked up two of those jobs in parallel (entiredb/server/entire-server receive-pack spans confirm this within the same trace_ids we saw failing):

  ┌─────────────────────────────────┬───────────────────────────┬─────────────────────────────────┬────────────────────────────────────┐
  │           Pod / trace           │ /info/refs (list target)  │    /git-receive-pack (push)     │              Outcome               │
  ├─────────────────────────────────┼───────────────────────────┼─────────────────────────────────┼────────────────────────────────────┤
  │ hc8t9 — seq 106 — db577226…f630 │ 12:15:56.737              │ 12:15:57.082 (564ms)            │ OK — advanced ref to Y             │
  ├─────────────────────────────────┼───────────────────────────┼─────────────────────────────────┼────────────────────────────────────┤
  │ 2kkgk — seq 107 — c290ed6a…0e83 │ 12:15:56.775 (38ms later) │ 12:15:57.105 (23ms after hc8t9) │ FAIL — CAS expected X, target is Y │
  └─────────────────────────────────┴───────────────────────────┴─────────────────────────────────┴────────────────────────────────────┘

  Both pods saw the same TargetHash when listing refs (38ms apart), both built CAS plans against it, hc8t9's push landed first and moved the ref, 2kkgk's push arrived 23ms later and its CAS fired remote ref has changed.
  

This change changes git-sync to check if the new ref is actually what we need if that's the case just succeed.


Note

Medium Risk
Changes push error semantics and makes strategies treat some receive-pack failures as success after a target ref refresh; incorrect reconciliation logic could mask real push failures or misreport relay outcomes.

Overview
Adds push reconciliation to treat benign CAS races as success. gitproto now surfaces structured receive-pack report-status failures via PushReportError (unpack vs per-ref failures) and exposes Pusher.ListRefs() to re-advertise target refs.

A new strategy/pushreconcile helper checks per-ref failures by re-listing target refs and verifying the target already matches each plan’s intended outcome (including deletes), logging warnings for unexpected status strings. The incremental, replicate, and materialized strategies (and syncer) are updated to pass a lister and to swallow reconciled push errors while setting relay reason to reconciled; extensive tests cover reconciliation success/failure paths and the new error shaping.

Reviewed by Cursor Bugbot for commit 4fc3761. Configure here.

nodo and others added 3 commits April 21, 2026 15:55
When two writers push the same source state concurrently (common with
multiple mirror-worker pods handling bursty webhook fan-out), receive-pack
returns per-ref "remote ref has changed" from the loser's CAS. Today this
bubbles up as a sync error even though the target is already where we
wanted it.

Surface report-status per-ref failures as a structured
gitproto.PushReportError, and add an internal/strategy/pushreconcile helper
that, on such an error, re-advertises the target and checks whether every
failed ref already matches the plan's intended outcome. Wired into the
replicate, incremental, and materialized strategies.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Address adversarial review findings on the previous commit:

1. pushreconcile.Check now gates reconciliation on an allowlist of CAS-race
   statuses ("remote ref has changed", "already exists"). Previously any
   per-ref rejection — ACL denials, hook/policy failures, etc. — was
   silently swallowed if a concurrent writer happened to converge the ref
   to the planned hash, hiding real misconfiguration behind "reconciled"
   successes.

2. buildReportError now treats any unpack status other than exact "ok" as
   fatal, including empty string. Previously a malformed/degraded
   receive-pack response with a blank unpack line could fall through as
   success if no per-ref failures were reported, weakening detection
   relative to go-git's ReportStatus.Error semantics.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Entire-Checkpoint: bd56885bd5b7
@nodo

nodo commented Apr 21, 2026

Copy link
Copy Markdown
Collaborator Author

bugbot run

@cursor cursor Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✅ Bugbot reviewed your changes and found no new issues!

Comment @cursor review or bugbot run to trigger another review on this PR

Reviewed by Cursor Bugbot for commit 4fc3761. Configure here.

@nodo nodo closed this Apr 22, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Development

Successfully merging this pull request may close these issues.

1 participant