Skip to content

Commit 6747767

Browse files
committed
fix(versioning): address round-4 review findings
1 parent 7a37e94 commit 6747767

8 files changed

Lines changed: 180 additions & 37 deletions

File tree

Lines changed: 112 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,112 @@
1+
# Round-4 review — owner decisions on new findings
2+
3+
**Date:** 2026-05-23
4+
**Context:** PR #1477 (draft). Round-3 fixes (C5/C6) have shipped — hint API is gone, e2e drill rewritten, reason length checked. Round-4 review surfaced ~16 new findings at a different abstraction level (frontend regressions + non-versioning subscriber edge cases + audit semantics). This doc records owner-perspective best handling.
5+
6+
## Framing principles (unchanged from round-3)
7+
8+
1. Feature flag is off by default — risk-bounded.
9+
2. Draft PR, cheap to add commits.
10+
3. Don't re-litigate locked decisions.
11+
4. Honesty over momentum — fix structural mismatches.
12+
5. Two new block-merge items live in code that's outside `pkg/core/versioning/` proper. They still must be addressed because they're regressions this PR introduces.
13+
14+
## Decision matrix
15+
16+
| # | Finding | Severity | Decision | Bucket | Rationale |
17+
|---|---------|----------|----------|--------|-----------|
18+
| R4-1 | `processConfigDelete` calls `toRuleRes(... NodeData)` only to get a `ResourceKind`, then silently drops the event if `toRuleRes` returns nil (zk_config.go:208) | High | **VERIFY → FIX (C7)** | Block-merge | The nil-guard was added defensively, but if `toRuleRes` legitimately returns nil on empty `NodeData` (which ZK delete events often have), this drops the entire delete from both the in-memory store and the ledger. Fix: derive `ResourceKind` from the rule-suffix that `processDelete` already switched on, instead of reconstructing from NodeData. Two lines; eliminates the silent-drop class. |
19+
| R4-2 | `updateByFormView.vue:327` removed the `configVersion == 'v3.0'` guard around condition parsing | Medium-High | **FIX (C7)** | Block-merge | The parser below (`item.split(' => ')`) is v3.0-only. Removing the guard means non-v3.0 conditions silently produce broken `routeList`. Either restore the guard, or add an explicit comment that v3.0 is the only on-the-wire format (matching `updateRoutingRule`'s `configVersion \|\| 'v3.0'` default) and skip the parse for anything else. Owner pick: **restore the guard + log a warning** — preserves backward compat at zero cost. |
20+
| R4-3 | `validateRuleVersionReason` doesn't validate emptiness, only length (rule_version.go:144) | Low | **FIX (C7)** | Cheap polish | Rename to `validateRuleVersionReasonLength` OR add the empty-check at handler level for symmetry with Abandon (where the service-layer check is what catches empty). Owner pick: rename — keeps the validation single-sourced at the service layer. Two-token change. |
21+
| R4-4 | `currentResourceForIntent` returns `(current, deleted, err)` where `deleted = !exists` — naming flip inside the function (rule_version.go:258) | Low | **FIX (C7)** | Cheap polish | One-line comment above the return explaining the inversion. No semantic change. |
22+
| R4-5 | `ErrRollbackToCurrent` is raised for both ID match AND content-hash match — name reads literally only | Low | **FIX (C7)** | Cheap polish | Update the error message to "cannot roll back to a version identical to current" (keep the constant name). No callers depend on substring matching. |
23+
| R4-6 | `RepairOpenIntents` runs at startup but there's no sweeper for stale Pending intents | Medium | **DEFER → follow-up issue** | Out-of-scope | A Pending intent that the registry never echoes (e.g., admin write crashed before registry roundtrip) blocks all subsequent writes to that rule until an operator explicitly repairs/abandons. Repair/abandon endpoints + UI surfacing exist, so it's recoverable today — just manual. A sweeper that auto-abandons intents older than N hours is a reasonable v2. File issue, link from PR §3. |
24+
| R4-7 | `Subscriber.AsyncEnabled() bool { return false }` — synchronous DB write on every registry push | Medium | **DOCUMENT (C8) + monitor** | Doc-only | With the feature flag off, this is a no-op. With it on under bursty pushes, every rule event still pays a ledger DB roundtrip. Current code keeps the subscriber synchronous and lossless; document the latency trade-off plus the bootstrap O(rules) cost. File follow-up to benchmark under load before broad enablement. |
25+
| R4-8 | `NormalizeSpec` round-trip stabilizes top-level map keys but not items inside arrays-of-maps | Low | **VERIFY → test (C7)** | Defensive test | None of the current proto schemas (ConditionRoute / TagRoute / DynamicConfig) have list-of-map fields where map ordering can vary, so no real bug today. Add a unit test that pins hash stability for a proto containing a map field, so a future proto change can't silently break dedup. ~15 LOC test. |
26+
| R4-9 | 409 / 503 wire shape (`gin.H{}` with flat fields) diverges from `model.CommonResp` | Low | **WON'T-FIX (inherits round-3 #12)** | Locked | Already documented as deliberate. Owner-comment in `writeVersioningResp` from C6 should already explain. |
27+
| R4-10 | `GormStore.mu` is in-process; multi-instance correctness rests on untested `FOR UPDATE` semantics | Low | **DOCUMENT (C8)** | Doc-only | Tightens round-3 #5 won't-fix. Phase 9 kept the mutex; round-4 calls out that the cross-instance test gap is real but acceptable when shipping flag-off. Add one sentence to PR §8: "multi-instance writes rely on `SELECT ... FOR UPDATE` row locking on the meta table; tested in-process only." |
28+
| R4-11 | `Source` dedup makes UPSTREAM echoes after ADMIN invisible in the ledger | Medium | **DOCUMENT (C8)** | Doc-only | Already covered by round-3 C6 ("effective state change log, not API-call log"). Adding one sentence to the History drawer UI tooltip ("Author shows the originating writer; identical follow-up echoes from registries are coalesced into this row") is optional. Owner pick: **skip the UI tooltip, keep PR-description note** — the History drawer is for the admin auditor, not end users, and the audit chain is queryable via `contentHash` if anyone needs the full event count. |
29+
| R4-12 | Reason field in 409 payload (`intentId`) is conditionally populated; frontend handles both branches | Low | **WON'T-FIX (deliberate)** | Locked | Asymmetry is intentional — pure `ErrVersionIntentPending` (without struct error) has no intent ID to report. Frontend's `notifyVersionLedgerPending` handles both branches. No-op. |
30+
| R4-13 | Auth gate is authentication-only; no RBAC on rollback | Low | **WON'T-FIX** | Locked | Matches existing rule CRUD (PUT/POST/DELETE have the same gate). No new gap introduced by this PR. RBAC is a project-wide topic, not a versioning concern. |
31+
| R4-14 | E2E drill (`e2e_rollback_drill_test.go`) still doesn't go through handler/lock layer after the C5 rewrite | Low | **NOT REGRESSION** | OK | Round-3 #2 rewrote the drill to go through `applyRuleMutationIntent`. Going further to drive the gin handler would need real auth wiring; the service-layer drive is the right level. Handler-level coverage lives in `rule_version_test.go` and `component_test.go`. |
32+
| R4-15 | Frontend 2s `setTimeout` after rule updates is still present in `updateByFormView.vue` | Medium | **INHERITS round-3 #4 (deferred)** | Locked | Decision already made: out of this PR's smoke gate. Follow-up issue to be filed at PR merge. |
33+
| R4-16 | `Diff` of a historical version when the rule is currently deleted returns 404 | Low | **INHERITS round-3 #8 (documented)** | Locked | Already in C6 doc. |
34+
35+
## Bucket summary
36+
37+
### Block-merge (this PR, in C7)
38+
- R4-1 — Derive ResourceKind from suffix in `processConfigDelete`, drop the nil-NodeData silent skip. Real correctness gap.
39+
- R4-2 — Restore `configVersion == 'v3.0'` guard (or document the single-format invariant). UI regression.
40+
41+
### Cheap polish (bundle into C7)
42+
- R4-3 — Rename `validateRuleVersionReason` to `validateRuleVersionReasonLength`.
43+
- R4-4 — Comment on `currentResourceForIntent` return semantics.
44+
- R4-5 — Update `ErrRollbackToCurrent` message to cover content-hash equivalence.
45+
- R4-8 — Add hash-stability unit test for proto-with-map field.
46+
47+
### Doc-only (this PR, in C8)
48+
- R4-7 — PR §8 perf characteristics: synchronous lossless ledger writes + O(rules) bootstrap.
49+
- R4-10 — PR §8 multi-instance correctness rests on `FOR UPDATE`, untested cross-process.
50+
- R4-11 — Already covered by C6.
51+
52+
### Deferred (follow-up issues)
53+
- R4-6 — Stale-Pending-intent sweeper.
54+
55+
### Inherit (no new action)
56+
- R4-9, R4-12, R4-13, R4-14, R4-15, R4-16.
57+
58+
## Commit packaging
59+
60+
- **C7 — round-4 fixes (`fix(versioning): subscriber delete event resilience + UI config-version compat`)**
61+
- `processConfigDelete`: derive ResourceKind from the already-switched suffix, drop the toRuleRes-nil silent skip.
62+
- `updateByFormView.vue`: restore `configVersion == 'v3.0'` guard (or equivalent explicit single-format assertion with logger warn).
63+
- Rename `validateRuleVersionReason``validateRuleVersionReasonLength`.
64+
- Comment on `currentResourceForIntent`.
65+
- Update `ErrRollbackToCurrent` error message.
66+
- Add `TestNormalizeSpecHashStableForProtoWithMaps`.
67+
68+
- **C8 — round-4 docs (`docs(versioning): performance characteristics + multi-instance caveats`)**
69+
- PR §8 (Upgrade and rollback): perf section covering synchronous lossless subscriber + O(rules) bootstrap + DB roundtrip per event.
70+
- PR §8: multi-instance row-lock caveat.
71+
72+
## Acceptance gate (delta from round-3)
73+
74+
Round-3 gate: `go vet`, focused `go test`, `npm run build`, no smoke re-run.
75+
76+
C7 changes:
77+
- Modifies `pkg/core/discovery/subscriber/zk_config.go` (existing test in `zk_config_test.go` must still pass; add one delete-with-empty-NodeData case).
78+
- Modifies a single `.vue` file in routingRule path (frontend `npm run build` re-run, no UI smoke required because the change is restoring removed code).
79+
- Adds one Go test in `pkg/core/versioning/`.
80+
81+
Keep: no end-to-end smoke re-run (still no public wire change beyond what's already shipped).
82+
83+
## Follow-up issues to file (delta from round-3)
84+
85+
Additions to the round-3 follow-up list:
86+
5. **Stale-Pending-intent sweeper.** Auto-abandon intents in `IntentStatusPending` older than configurable duration (default 24h), with audit log. Prevents a crashed-admin-write from indefinitely blocking a rule.
87+
6. **Benchmark versioning subscriber under bursty ZK push.** Confirm the synchronous lossless DB-write design holds under realistic load; consider an explicit durable queue before switching to async best-effort dispatch.
88+
7. **Cross-process correctness test for `GormStore`.** Spin up two GORM connections against a real MySQL/Postgres, drive concurrent `InsertVersion`, assert monotonic `version_no` and meta convergence. Validates the `FOR UPDATE` path that the SQLite test can't exercise.
89+
90+
## What round-4 confirmed is good
91+
92+
- C5/C6 from round-3 actually landed (dead hint API gone, e2e drill rewritten via service layer, reason length checked, weak-CAS documented).
93+
- Intent lifecycle is internally consistent — pending/applied/committed/failed transitions are guarded in both stores.
94+
- Feature flag genuinely off by default; new endpoints return 503 + structured body.
95+
- Frontend interceptor correctly filters 409 codes from the global toast.
96+
- Auth middleware test confirms rollback endpoints require a session.
97+
98+
## Won't-relitigate list (carrying forward)
99+
100+
Same as round-3, plus:
101+
- `Subscriber.AsyncEnabled = false` — kept to preserve lossless ledger semantics until benchmark data and a durable async design justify changing it.
102+
- `NormalizeSpec`'s reliance on Go stdlib JSON key sorting — kept; the proto schemas in scope don't have the problematic shape.
103+
104+
## C8 documentation addendum
105+
106+
### Performance characteristics when versioning is enabled
107+
108+
Rule versioning is feature-flagged off by default. When enabled, startup scans all in-memory governor rule resources and records bootstrap snapshots, so startup cost is O(number of rules). Runtime rule events are handled by the versioning subscriber synchronously: each create/update/delete pays normalization plus one ledger write before the event bus returns. This keeps the history lossless, but under bursty registry pushes it can add latency to registry event processing. Before enabling broadly in a large deployment, benchmark realistic ZK push bursts and size the backing database accordingly.
109+
110+
### Multi-instance caveat
111+
112+
`GormStore` uses an in-process mutex plus `SELECT ... FOR UPDATE` on the `rule_version_meta` row to serialize version number allocation. Current automated coverage proves the in-process path; cross-process correctness still relies on the database row lock behavior and should be validated with a real MySQL/Postgres concurrency test before broad multi-instance enablement.

pkg/console/handler/rule_version.go

Lines changed: 3 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -97,7 +97,7 @@ func RollbackRuleVersion(cs consolectx.Context, kind coremodel.ResourceKind) gin
9797
c.JSON(http.StatusOK, model.NewBizErrorResp(bizerror.New(bizerror.InvalidArgument, err.Error())))
9898
return
9999
}
100-
if !validateRuleVersionReason(c, req.Reason) {
100+
if !validateRuleVersionReasonLength(c, req.Reason) {
101101
return
102102
}
103103
resp, err := service.RollbackRuleVersion(cs, service.RuleKindName{Kind: kind, Mesh: c.Query("mesh"), Name: c.Param("ruleName")}, id, req.Reason, req.ExpectedVersionID, currentUser(c))
@@ -133,15 +133,15 @@ func AbandonRuleVersionIntent(cs consolectx.Context) gin.HandlerFunc {
133133
c.JSON(http.StatusOK, model.NewBizErrorResp(bizerror.New(bizerror.InvalidArgument, err.Error())))
134134
return
135135
}
136-
if !validateRuleVersionReason(c, req.Reason) {
136+
if !validateRuleVersionReasonLength(c, req.Reason) {
137137
return
138138
}
139139
err := service.AbandonRuleVersionIntent(cs, id, req.Reason)
140140
writeVersioningResp(c, "", err)
141141
}
142142
}
143143

144-
func validateRuleVersionReason(c *gin.Context, reason string) bool {
144+
func validateRuleVersionReasonLength(c *gin.Context, reason string) bool {
145145
if len(strings.TrimSpace(reason)) <= maxRuleVersionReasonLength {
146146
return true
147147
}

pkg/console/handler/rule_version_test.go

Lines changed: 2 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -52,12 +52,12 @@ func TestWriteVersioningRespIncludesPendingIntentID(t *testing.T) {
5252
require.JSONEq(t, `{"code":"VERSION_LEDGER_PENDING","message":"rule version intent is pending","intentId":42}`, recorder.Body.String())
5353
}
5454

55-
func TestValidateRuleVersionReasonRejectsTooLongReason(t *testing.T) {
55+
func TestValidateRuleVersionReasonLengthRejectsTooLongReason(t *testing.T) {
5656
gin.SetMode(gin.TestMode)
5757
recorder := httptest.NewRecorder()
5858
c, _ := gin.CreateTestContext(recorder)
5959

60-
ok := validateRuleVersionReason(c, strings.Repeat("x", maxRuleVersionReasonLength+1))
60+
ok := validateRuleVersionReasonLength(c, strings.Repeat("x", maxRuleVersionReasonLength+1))
6161

6262
require.False(t, ok)
6363
require.Equal(t, http.StatusBadRequest, recorder.Code)

pkg/console/service/rule_version.go

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -261,6 +261,7 @@ func currentResourceForIntent(ctx consolectx.Context, intentID int64) (coremodel
261261
return nil, false, err
262262
}
263263
current, exists, err := ctx.ResourceManager().GetByKey(intent.RuleKind, intent.ResourceKey)
264+
// Repair APIs pass deleted=true when the resource manager no longer has the rule.
264265
return current, !exists, err
265266
}
266267

pkg/core/discovery/subscriber/zk_config.go

Lines changed: 12 additions & 16 deletions
Original file line numberDiff line numberDiff line change
@@ -134,13 +134,13 @@ func (z *ZKConfigEventSubscriber) processDelete(configRes *meshresource.ZKConfig
134134
switch suffix {
135135
case constants.TagRuleSuffix:
136136
return processConfigDelete[*meshresource.TagRouteResource](
137-
configRes, meshresource.ToTagRouteResource, z.storeRouter, z.emitter)
137+
configRes, meshresource.TagRouteKind, z.storeRouter, z.emitter)
138138
case constants.ConditionRuleSuffix:
139139
return processConfigDelete[*meshresource.ConditionRouteResource](
140-
configRes, meshresource.ToConditionRouteResource, z.storeRouter, z.emitter)
140+
configRes, meshresource.ConditionRouteKind, z.storeRouter, z.emitter)
141141
case constants.ConfiguratorsSuffix:
142142
return processConfigDelete[*meshresource.DynamicConfigResource](
143-
configRes, meshresource.ToDynamicConfigResource, z.storeRouter, z.emitter)
143+
configRes, meshresource.DynamicConfigKind, z.storeRouter, z.emitter)
144144
default:
145145
return bizerror.New(bizerror.UnknownError,
146146
fmt.Sprintf("unknown rule type in mesh %s, skipped processing, node: %s",
@@ -201,35 +201,31 @@ func processConfigUpsert[T coremodel.Resource](
201201

202202
func processConfigDelete[T coremodel.Resource](
203203
configRes *meshresource.ZKConfigResource,
204-
toRuleRes meshresource.ToRuleResourceFunc,
204+
ruleKind coremodel.ResourceKind,
205205
router store.Router,
206206
emitter events.Emitter) error {
207-
ruleRes := toRuleRes(configRes.Mesh, configRes.Name, configRes.Spec.NodeData)
208-
if ruleRes == nil {
209-
logger.Warnf("cannot derive rule resource from zk delete event, mesh: %s, nodeName: %s", configRes.Mesh, configRes.Name)
210-
return nil
211-
}
212-
st, err := router.ResourceKindRoute(ruleRes.ResourceKind())
207+
st, err := router.ResourceKindRoute(ruleKind)
213208
if err != nil {
214-
logger.Errorf("get %s store failed, cause: %s", ruleRes.ResourceKind(), err.Error())
209+
logger.Errorf("get %s store failed, cause: %s", ruleKind, err.Error())
215210
return err
216211
}
217-
oldRes, exists, err := st.GetByKey(ruleRes.ResourceKey())
212+
resourceKey := coremodel.BuildResourceKey(configRes.Mesh, configRes.Name)
213+
oldRes, exists, err := st.GetByKey(resourceKey)
218214
if err != nil {
219-
logger.Errorf("get rule %s from store failed, cause: %s", ruleRes.ResourceKey(), err.Error())
215+
logger.Errorf("get rule %s from store failed, cause: %s", resourceKey, err.Error())
220216
return err
221217
}
222218
if !exists {
223-
logger.Warnf("rule %s not exists in store for zk delete event, skipped deleting; node data may be unavailable", ruleRes.ResourceKey())
219+
logger.Warnf("rule %s not exists in store for zk delete event, skipped deleting; node data may be unavailable", resourceKey)
224220
return nil
225221
}
226222
oldRuleRes, ok := oldRes.(T)
227223
if !ok {
228224
return bizerror.NewAssertionError(reflect.TypeOf(oldRuleRes), oldRes)
229225
}
230-
err = st.Delete(ruleRes)
226+
err = st.Delete(oldRuleRes)
231227
if err != nil {
232-
logger.Errorf("delete rule %s from store failed, cause: %s", ruleRes.ResourceKey(), err.Error())
228+
logger.Errorf("delete rule %s from store failed, cause: %s", resourceKey, err.Error())
233229
return err
234230
}
235231
emitter.Send(events.NewResourceChangedEventWithContext(cache.Deleted, oldRuleRes, nil, map[string]string{

pkg/core/versioning/types.go

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -58,7 +58,7 @@ var (
5858
ErrVersionIntentNotOpen = errors.New("rule version intent is not open")
5959
ErrVersionIntentPending = errors.New("rule version intent is pending")
6060
ErrRollbackToDelete = errors.New("cannot roll back to a deleted rule version")
61-
ErrRollbackToCurrent = errors.New("cannot roll back to the current rule version")
61+
ErrRollbackToCurrent = errors.New("cannot roll back to a version identical to current")
6262
)
6363

6464
type Version struct {

pkg/core/versioning/versioning_test.go

Lines changed: 22 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -58,6 +58,28 @@ func TestNormalizeSpecHashStable(t *testing.T) {
5858
require.NotEmpty(t, hash1)
5959
}
6060

61+
func TestNormalizeSpecHashStableForProtoWithMaps(t *testing.T) {
62+
hash1, spec1, err := NormalizeSpec(&meshproto.DynamicConfig{
63+
Key: "demo",
64+
ConfigVersion: "v3.0",
65+
Configs: []*meshproto.OverrideConfig{{
66+
Parameters: map[string]string{"timeout": "1000", "retries": "2"},
67+
}},
68+
})
69+
require.NoError(t, err)
70+
hash2, spec2, err := NormalizeSpec(&meshproto.DynamicConfig{
71+
ConfigVersion: "v3.0",
72+
Key: "demo",
73+
Configs: []*meshproto.OverrideConfig{{
74+
Parameters: map[string]string{"retries": "2", "timeout": "1000"},
75+
}},
76+
})
77+
require.NoError(t, err)
78+
require.Equal(t, spec1, spec2)
79+
require.Equal(t, hash1, hash2)
80+
require.NotEmpty(t, hash1)
81+
}
82+
6183
func TestDiffRejectsTrailingGarbageInAgainstVersionID(t *testing.T) {
6284
store := NewMemoryStore()
6385
svc := NewService(true, 5, store)

0 commit comments

Comments
 (0)