|
| 1 | +# Round-4 review — owner decisions on new findings |
| 2 | + |
| 3 | +**Date:** 2026-05-23 |
| 4 | +**Context:** PR #1477 (draft). Round-3 fixes (C5/C6) have shipped — hint API is gone, e2e drill rewritten, reason length checked. Round-4 review surfaced ~16 new findings at a different abstraction level (frontend regressions + non-versioning subscriber edge cases + audit semantics). This doc records owner-perspective best handling. |
| 5 | + |
| 6 | +## Framing principles (unchanged from round-3) |
| 7 | + |
| 8 | +1. Feature flag is off by default — risk-bounded. |
| 9 | +2. Draft PR, cheap to add commits. |
| 10 | +3. Don't re-litigate locked decisions. |
| 11 | +4. Honesty over momentum — fix structural mismatches. |
| 12 | +5. Two new block-merge items live in code that's outside `pkg/core/versioning/` proper. They still must be addressed because they're regressions this PR introduces. |
| 13 | + |
| 14 | +## Decision matrix |
| 15 | + |
| 16 | +| # | Finding | Severity | Decision | Bucket | Rationale | |
| 17 | +|---|---------|----------|----------|--------|-----------| |
| 18 | +| R4-1 | `processConfigDelete` calls `toRuleRes(... NodeData)` only to get a `ResourceKind`, then silently drops the event if `toRuleRes` returns nil (zk_config.go:208) | High | **VERIFY → FIX (C7)** | Block-merge | The nil-guard was added defensively, but if `toRuleRes` legitimately returns nil on empty `NodeData` (which ZK delete events often have), this drops the entire delete from both the in-memory store and the ledger. Fix: derive `ResourceKind` from the rule-suffix that `processDelete` already switched on, instead of reconstructing from NodeData. Two lines; eliminates the silent-drop class. | |
| 19 | +| R4-2 | `updateByFormView.vue:327` removed the `configVersion == 'v3.0'` guard around condition parsing | Medium-High | **FIX (C7)** | Block-merge | The parser below (`item.split(' => ')`) is v3.0-only. Removing the guard means non-v3.0 conditions silently produce broken `routeList`. Either restore the guard, or add an explicit comment that v3.0 is the only on-the-wire format (matching `updateRoutingRule`'s `configVersion \|\| 'v3.0'` default) and skip the parse for anything else. Owner pick: **restore the guard + log a warning** — preserves backward compat at zero cost. | |
| 20 | +| R4-3 | `validateRuleVersionReason` doesn't validate emptiness, only length (rule_version.go:144) | Low | **FIX (C7)** | Cheap polish | Rename to `validateRuleVersionReasonLength` OR add the empty-check at handler level for symmetry with Abandon (where the service-layer check is what catches empty). Owner pick: rename — keeps the validation single-sourced at the service layer. Two-token change. | |
| 21 | +| R4-4 | `currentResourceForIntent` returns `(current, deleted, err)` where `deleted = !exists` — naming flip inside the function (rule_version.go:258) | Low | **FIX (C7)** | Cheap polish | One-line comment above the return explaining the inversion. No semantic change. | |
| 22 | +| R4-5 | `ErrRollbackToCurrent` is raised for both ID match AND content-hash match — name reads literally only | Low | **FIX (C7)** | Cheap polish | Update the error message to "cannot roll back to a version identical to current" (keep the constant name). No callers depend on substring matching. | |
| 23 | +| R4-6 | `RepairOpenIntents` runs at startup but there's no sweeper for stale Pending intents | Medium | **DEFER → follow-up issue** | Out-of-scope | A Pending intent that the registry never echoes (e.g., admin write crashed before registry roundtrip) blocks all subsequent writes to that rule until an operator explicitly repairs/abandons. Repair/abandon endpoints + UI surfacing exist, so it's recoverable today — just manual. A sweeper that auto-abandons intents older than N hours is a reasonable v2. File issue, link from PR §3. | |
| 24 | +| R4-7 | `Subscriber.AsyncEnabled() bool { return false }` — synchronous DB write on every registry push | Medium | **DOCUMENT (C8) + monitor** | Doc-only | With the feature flag off, this is a no-op. With it on under bursty pushes, every rule event still pays a ledger DB roundtrip. Current code keeps the subscriber synchronous and lossless; document the latency trade-off plus the bootstrap O(rules) cost. File follow-up to benchmark under load before broad enablement. | |
| 25 | +| R4-8 | `NormalizeSpec` round-trip stabilizes top-level map keys but not items inside arrays-of-maps | Low | **VERIFY → test (C7)** | Defensive test | None of the current proto schemas (ConditionRoute / TagRoute / DynamicConfig) have list-of-map fields where map ordering can vary, so no real bug today. Add a unit test that pins hash stability for a proto containing a map field, so a future proto change can't silently break dedup. ~15 LOC test. | |
| 26 | +| R4-9 | 409 / 503 wire shape (`gin.H{}` with flat fields) diverges from `model.CommonResp` | Low | **WON'T-FIX (inherits round-3 #12)** | Locked | Already documented as deliberate. Owner-comment in `writeVersioningResp` from C6 should already explain. | |
| 27 | +| R4-10 | `GormStore.mu` is in-process; multi-instance correctness rests on untested `FOR UPDATE` semantics | Low | **DOCUMENT (C8)** | Doc-only | Tightens round-3 #5 won't-fix. Phase 9 kept the mutex; round-4 calls out that the cross-instance test gap is real but acceptable when shipping flag-off. Add one sentence to PR §8: "multi-instance writes rely on `SELECT ... FOR UPDATE` row locking on the meta table; tested in-process only." | |
| 28 | +| R4-11 | `Source` dedup makes UPSTREAM echoes after ADMIN invisible in the ledger | Medium | **DOCUMENT (C8)** | Doc-only | Already covered by round-3 C6 ("effective state change log, not API-call log"). Adding one sentence to the History drawer UI tooltip ("Author shows the originating writer; identical follow-up echoes from registries are coalesced into this row") is optional. Owner pick: **skip the UI tooltip, keep PR-description note** — the History drawer is for the admin auditor, not end users, and the audit chain is queryable via `contentHash` if anyone needs the full event count. | |
| 29 | +| R4-12 | Reason field in 409 payload (`intentId`) is conditionally populated; frontend handles both branches | Low | **WON'T-FIX (deliberate)** | Locked | Asymmetry is intentional — pure `ErrVersionIntentPending` (without struct error) has no intent ID to report. Frontend's `notifyVersionLedgerPending` handles both branches. No-op. | |
| 30 | +| R4-13 | Auth gate is authentication-only; no RBAC on rollback | Low | **WON'T-FIX** | Locked | Matches existing rule CRUD (PUT/POST/DELETE have the same gate). No new gap introduced by this PR. RBAC is a project-wide topic, not a versioning concern. | |
| 31 | +| R4-14 | E2E drill (`e2e_rollback_drill_test.go`) still doesn't go through handler/lock layer after the C5 rewrite | Low | **NOT REGRESSION** | OK | Round-3 #2 rewrote the drill to go through `applyRuleMutationIntent`. Going further to drive the gin handler would need real auth wiring; the service-layer drive is the right level. Handler-level coverage lives in `rule_version_test.go` and `component_test.go`. | |
| 32 | +| R4-15 | Frontend 2s `setTimeout` after rule updates is still present in `updateByFormView.vue` | Medium | **INHERITS round-3 #4 (deferred)** | Locked | Decision already made: out of this PR's smoke gate. Follow-up issue to be filed at PR merge. | |
| 33 | +| R4-16 | `Diff` of a historical version when the rule is currently deleted returns 404 | Low | **INHERITS round-3 #8 (documented)** | Locked | Already in C6 doc. | |
| 34 | + |
| 35 | +## Bucket summary |
| 36 | + |
| 37 | +### Block-merge (this PR, in C7) |
| 38 | +- R4-1 — Derive ResourceKind from suffix in `processConfigDelete`, drop the nil-NodeData silent skip. Real correctness gap. |
| 39 | +- R4-2 — Restore `configVersion == 'v3.0'` guard (or document the single-format invariant). UI regression. |
| 40 | + |
| 41 | +### Cheap polish (bundle into C7) |
| 42 | +- R4-3 — Rename `validateRuleVersionReason` to `validateRuleVersionReasonLength`. |
| 43 | +- R4-4 — Comment on `currentResourceForIntent` return semantics. |
| 44 | +- R4-5 — Update `ErrRollbackToCurrent` message to cover content-hash equivalence. |
| 45 | +- R4-8 — Add hash-stability unit test for proto-with-map field. |
| 46 | + |
| 47 | +### Doc-only (this PR, in C8) |
| 48 | +- R4-7 — PR §8 perf characteristics: synchronous lossless ledger writes + O(rules) bootstrap. |
| 49 | +- R4-10 — PR §8 multi-instance correctness rests on `FOR UPDATE`, untested cross-process. |
| 50 | +- R4-11 — Already covered by C6. |
| 51 | + |
| 52 | +### Deferred (follow-up issues) |
| 53 | +- R4-6 — Stale-Pending-intent sweeper. |
| 54 | + |
| 55 | +### Inherit (no new action) |
| 56 | +- R4-9, R4-12, R4-13, R4-14, R4-15, R4-16. |
| 57 | + |
| 58 | +## Commit packaging |
| 59 | + |
| 60 | +- **C7 — round-4 fixes (`fix(versioning): subscriber delete event resilience + UI config-version compat`)** |
| 61 | + - `processConfigDelete`: derive ResourceKind from the already-switched suffix, drop the toRuleRes-nil silent skip. |
| 62 | + - `updateByFormView.vue`: restore `configVersion == 'v3.0'` guard (or equivalent explicit single-format assertion with logger warn). |
| 63 | + - Rename `validateRuleVersionReason` → `validateRuleVersionReasonLength`. |
| 64 | + - Comment on `currentResourceForIntent`. |
| 65 | + - Update `ErrRollbackToCurrent` error message. |
| 66 | + - Add `TestNormalizeSpecHashStableForProtoWithMaps`. |
| 67 | + |
| 68 | +- **C8 — round-4 docs (`docs(versioning): performance characteristics + multi-instance caveats`)** |
| 69 | + - PR §8 (Upgrade and rollback): perf section covering synchronous lossless subscriber + O(rules) bootstrap + DB roundtrip per event. |
| 70 | + - PR §8: multi-instance row-lock caveat. |
| 71 | + |
| 72 | +## Acceptance gate (delta from round-3) |
| 73 | + |
| 74 | +Round-3 gate: `go vet`, focused `go test`, `npm run build`, no smoke re-run. |
| 75 | + |
| 76 | +C7 changes: |
| 77 | +- Modifies `pkg/core/discovery/subscriber/zk_config.go` (existing test in `zk_config_test.go` must still pass; add one delete-with-empty-NodeData case). |
| 78 | +- Modifies a single `.vue` file in routingRule path (frontend `npm run build` re-run, no UI smoke required because the change is restoring removed code). |
| 79 | +- Adds one Go test in `pkg/core/versioning/`. |
| 80 | + |
| 81 | +Keep: no end-to-end smoke re-run (still no public wire change beyond what's already shipped). |
| 82 | + |
| 83 | +## Follow-up issues to file (delta from round-3) |
| 84 | + |
| 85 | +Additions to the round-3 follow-up list: |
| 86 | +5. **Stale-Pending-intent sweeper.** Auto-abandon intents in `IntentStatusPending` older than configurable duration (default 24h), with audit log. Prevents a crashed-admin-write from indefinitely blocking a rule. |
| 87 | +6. **Benchmark versioning subscriber under bursty ZK push.** Confirm the synchronous lossless DB-write design holds under realistic load; consider an explicit durable queue before switching to async best-effort dispatch. |
| 88 | +7. **Cross-process correctness test for `GormStore`.** Spin up two GORM connections against a real MySQL/Postgres, drive concurrent `InsertVersion`, assert monotonic `version_no` and meta convergence. Validates the `FOR UPDATE` path that the SQLite test can't exercise. |
| 89 | + |
| 90 | +## What round-4 confirmed is good |
| 91 | + |
| 92 | +- C5/C6 from round-3 actually landed (dead hint API gone, e2e drill rewritten via service layer, reason length checked, weak-CAS documented). |
| 93 | +- Intent lifecycle is internally consistent — pending/applied/committed/failed transitions are guarded in both stores. |
| 94 | +- Feature flag genuinely off by default; new endpoints return 503 + structured body. |
| 95 | +- Frontend interceptor correctly filters 409 codes from the global toast. |
| 96 | +- Auth middleware test confirms rollback endpoints require a session. |
| 97 | + |
| 98 | +## Won't-relitigate list (carrying forward) |
| 99 | + |
| 100 | +Same as round-3, plus: |
| 101 | +- `Subscriber.AsyncEnabled = false` — kept to preserve lossless ledger semantics until benchmark data and a durable async design justify changing it. |
| 102 | +- `NormalizeSpec`'s reliance on Go stdlib JSON key sorting — kept; the proto schemas in scope don't have the problematic shape. |
| 103 | + |
| 104 | +## C8 documentation addendum |
| 105 | + |
| 106 | +### Performance characteristics when versioning is enabled |
| 107 | + |
| 108 | +Rule versioning is feature-flagged off by default. When enabled, startup scans all in-memory governor rule resources and records bootstrap snapshots, so startup cost is O(number of rules). Runtime rule events are handled by the versioning subscriber synchronously: each create/update/delete pays normalization plus one ledger write before the event bus returns. This keeps the history lossless, but under bursty registry pushes it can add latency to registry event processing. Before enabling broadly in a large deployment, benchmark realistic ZK push bursts and size the backing database accordingly. |
| 109 | + |
| 110 | +### Multi-instance caveat |
| 111 | + |
| 112 | +`GormStore` uses an in-process mutex plus `SELECT ... FOR UPDATE` on the `rule_version_meta` row to serialize version number allocation. Current automated coverage proves the in-process path; cross-process correctness still relies on the database row lock behavior and should be validated with a real MySQL/Postgres concurrency test before broad multi-instance enablement. |
0 commit comments