feat: dial-unified-config — server (Configuration API + MergedConfigStore + secret encryption)#1529
feat: dial-unified-config — server (Configuration API + MergedConfigStore + secret encryption)#1529siarhei-fedziukovich wants to merge 136 commits into
Conversation
Add the 10-document proposal for unifying DIAL Core configuration management — Configuration API + dial-cli — under docs/sandbox/dial-unified-config/. Covers problem context, architecture, API reference, security/audit, CLI design, user guide, migration & rollout, open questions, and admin MCP spec. Status v2.20: decisions locked, ready for Phase 1 implementation. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Execution playbook for the Phase 1-3 (+4 NTH) MVP: slice register, agent loop, simplification principles, branching model, halt conditions, LSP integration notes. Companion to the proposal docs (01-09); spec stays the contract, IMPLEMENTATION.md governs execution. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Project-scoped slash command at .claude/commands/dial-mvp.md that resumes the dial-unified-config MVP implementation in any new session. Loads context from IMPLEMENTATION.md + project memory, runs the agent loop, halts at architect-plan and on any halt-condition trigger. Usage: /dial-mvp [<slice-id> | status]. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add .classpath, .factorypath, .project, .settings/ to .gitignore. These are generated by Eclipse and the JDTLS language server (used by Claude Code's LSP integration), should not be tracked. Existing entries cover IntelliJ (.idea/) and VS Code (.vscode/). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The .claude/commands/review-unified-config.md file is a personal/local slash command, not part of the shared MVP tooling. Gitignored to avoid accidental inclusion in commits. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Refine the branch-hygiene policy in IMPLEMENTATION.md §3.2 / §9. Default remains rebase (linear history; force-push; sub-branches rebase onto new tip). Per-sync merge override is now allowed when situational — early in MVP with few slices in flight, or when conflicts resolve more cleanly with a merge commit. Late in MVP, prefer rebase to keep history readable for the final big-PR review. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…-DELETE locks - config.reload.onInvalidEntity default flipped from skip to abort (aligns reload-side with config.write.softValidation: false; opt-in skip retained for scale-up resilience). - Singleton /v1/settings/platform/global gains DELETE to release API control and revert the projection to file-sourced (or default); POST still 405. Three-state source field: api | file | default. Touches docs 01, 02, 03, 05, 06, 07, 08, 09, plus IMPLEMENTATION.md slice rows 1S.3, 2S.9, 3S.2 and the locked-rules summary. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Git refuses to create refs/heads/feature/unified-config/<x> while refs/heads/feature/unified-config exists as a branch (ref-vs-directory conflict). Rename slice sub-branches from feature/unified-config/<id>-<short-title> to feature/unified-config-<id>-<short-title>. Wildcard ergonomics preserved (git branch --list 'feature/unified-config-*'). Discovered on first /dial-mvp 1S.0 run when the orchestrator hit halt-condition §4.1 #1 (constraint contradicts plan) and surfaced three options; Option A picked (smallest blast radius — no rename of an existing branch). Updates IMPLEMENTATION.md §3.2 + §9 and the /dial-mvp slash command. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ce, EntityBucketBinding (#1513) Co-authored-by: SiarheiFedziukovich <siarhei_fedziukovich@epam.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Slices now integrate via local `git merge --squash` into feature/unified-config — no per-slice PR, no per-slice formal code-owner review. Orchestrator halts for the user to approve the slice diff and a §3.5-formatted commit message before merging, then deletes the sub-branch and updates the slice register Status to merged. The single big PR feature/unified-config → development at MVP-complete remains the only formal external review checkpoint. Adds IMPLEMENTATION.md §3.5 (commit-message format with type guide and example), updates §3.2 (branching diagram + integration bullet), §4 (drops per-slice /ultrareview recommendation; reserved for MVP-complete or ad-hoc), §5 (status legend: in-review → awaiting-merge; PR column → Commit), and §9 (decisions log). Updates /dial-mvp slash command step 7 (OPEN PR → MERGE LOCALLY) and Important notes. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ction
Replaces the 1S.0 stub's 405 fall-through with a model-read handler off the
in-memory volatile Config. Public/Owner field projection: status is always
"valid" in Phase 1; source ("file") is admin-only. Empty name returns 404 —
the listing slot is reserved for 1S.2. Unblocks Track B (CLI 1C.0) on wire
contract.
Design anchors: 03 §1, §2, §4; 04 §1.5
Tests: server/src/test/java/com/epam/aidial/core/server/ConfigModelReadTest.java
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
New project-scoped slash command that runs multiple dial-mvp slices sequentially. The two routine halts (architect-plan approval, merge-diff approval) become conditional, gated by self-tests in IMPLEMENTATION.md §4.2 — auto-approval is earned, not assumed. Halt conditions §4.1 still always trigger; auto-mode never bypasses them. Adds IMPLEMENTATION.md §4.2 (Auto-mode policy) with the §A ARCHITECT and §B MERGE LOCALLY self-test checklists, when-to-use guidance (mechanical/semi-mechanical only — Phase-3 sweep, Phase-2 prereqs), when-NOT-to-use guidance (1S.0, 2S.8, 2S.10, 4S.0 — high-uncertainty slices need plain /dial-mvp), and auditability format. Usage: /dial-mvp-auto (until blocked) | /dial-mvp-auto 3 (count) | /dial-mvp-auto until-phase-end. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds the per-bucket listing endpoint behind the existing CONFIG_RESOURCE route. Phase 1 returns the full in-memory snapshot with hasMore: false (per 03 §4 forward-compat); ?limit shape-validated, ?cursor accepted-and- ignored. Public/Owner field projection on items is shared with the single- GET path via a projectModelItem helper. Trailing slash optional. Also fixes RegexUtil.collectGroups to skip optional named groups whose start is -1, which surfaced as a server-side IOOBE only when 1S.2 exercised /v1/models/public without a trailing slash. Design anchors: 03 §1, §4 Tests: server/src/test/java/.../ConfigModelListTest.java (new); ConfigModelReadTest update Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Extends ConfigResourceController GET dispatch to interceptors, roles, keys,
routes (platform/), schemas (public/), and the singleton settings at
platform/global. Bucket-aware authz already gates non-admin off platform/.
Key.key is masked with "***" — Phase 1 has no reveal-secrets surface. Settings
GET projects globalInterceptors with file/default source; POST/PUT/DELETE
return 405 with Allow: GET, PUT, DELETE.
Design anchors: 03 §1; 04 §1.2
Tests: server/src/test/java/com/epam/aidial/core/server/Config{Interceptor,Role,Key,Route,Schema,Settings}Test.java
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ResourceController overrides handle(resourceUrl) to add an additive admin admit on GETs to applications/toolsets. When the unified-config gate grants admin access, the request bypasses the rules-based AccessService check and proceeds with hasWriteAccess=true (full data); everyone else falls through to the existing flow unchanged. Phase 1 reads only — write preflight ships in 1S.5. Design anchors: 03 §1; 02 §6 Tests: server/src/test/java/com/epam/aidial/core/server/ConfigAdminAppToolsetReadTest.java Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
AccessControlBaseController.handle(resourceUrl) gains a ConfigAuthorizationService preflight: when the admin role is asserted AND the unified-config gate authorizes the request (public/ + admin always; user-bucket only when admin owns it), the rules-based AccessService check is skipped and the handler runs with hasWriteAccess=true. Otherwise the request falls through to the existing path. Covers all 4 subclasses (Resource + 3 FILES) and both reads/writes; supersedes the narrower 1S.4 override on ResourceController. OQ-33's "admin can't reach user buckets" is enforced by the gate not admitting admin onto user buckets; existing share-based grants (publication review) continue through the rules path unchanged. Design anchors: 03 §1; OQ-21; OQ-33 Tests: server/src/test/java/com/epam/aidial/core/server/ConfigAdminPreflightTest.java Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds AdminExportController + RouteTemplate.CONFIG_EXPORT for /v1/admin/export. Admin-only via ConfigAuthorizationService; default JSON, YAML when ?format=yaml or Accept: application/yaml. Keys are re-attached with masked secrets — Config field's @JsonProperty(WRITE_ONLY) suppresses the map at serialization time. JSON-string round-trip avoids the TokenBuffer/writeRaw incompatibility in applicationTypeSchemas' custom serializer. Design anchors: 03 §1; 07 Phase 1 Tests: server/src/test/java/com/epam/aidial/core/server/AdminExportTest.java Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds AdminHealthConfigController + RouteTemplate.CONFIG_HEALTH for
/v1/admin/health/config. Admin-only via ConfigAuthorizationService; Phase 1
always reports {status:"healthy",skipped:[]} unconditionally — the invalid-
entity sibling store that populates skipped, plus the dial_config_skipped_*
Prometheus metrics, ship together in slice 2S.9. Cardinality-zero metric
scaffolds skipped here per §2.1/§2.3 (no observable behavior in Phase 1).
Design anchors: 07 Phase 2; 02 §4.1
Tests: server/src/test/java/com/epam/aidial/core/server/AdminHealthConfigTest.java
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…r handler
testAdminCanReachPlatformEntity asserted the 1S.0 stub's 405 ("no handler yet")
on admin GET /v1/interceptors/platform/anything. After slice 1S.3 wired the
interceptors read handler, the same request now resolves to 404 (interceptor
"anything" not found) — still proves the admin gate admitted. Adjusted the
expectation; comment notes the success signal can be 404 or 405 depending on
whether the type has its handler yet.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
§A self-test item 4 read literally would halt every Phase-2 prereq slice (pure-internal refactors with no HTTP surface) — contradicting §4.2's own "When to invoke" list that names them as auto-eligible. Loosen the wording to require integration tests only when the slice exposes HTTP behaviour; well-targeted unit tests otherwise. No behavioural change to the merge gate; the amendment unblocks the 2S.0-pre … 2S.7-pre auto-mode batch. Design anchors: IMPLEMENTATION.md §4.2 §A; §8 doc-amendment lifecycle Tests: no new tests Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replaces unconditional value.setKey(apiKey) with a blank-guarded call so API-managed keys (whose Key.key already holds the decrypted secret) are not silently overwritten by the human-readable map key. Legacy file-sourced format (map key = secret, Key.key blank) is unaffected. Compile-time prereq for the Phase-2 keys controller. Design anchors: 07 Phase 2 prereqs (line 98); OQ-12 Tests: server/src/test/java/com/epam/aidial/core/server/security/ApiKeyStoreTest.java Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ce locks Address @astsiapanay's review threads on ConfigResourceController: - Pass context.getUserDisplayName() as author to every putResource call (was null). Matches ResourceController#putResource pattern; the author is recorded on create and preserved across updates by ResourceService. - Drop the per-resource ResourceService lock from handlePost / handlePut / handleDelete. ConfigResourceController handles only admin-only types (models, interceptors, roles, keys, routes, schemas, settings), which are never written by non-admin paths. The cluster-wide AdminWriteLock alone serializes all admin writes — the second lock added nothing. See 02-architecture.md §4.4. - For handleSettingsPut / handleSettingsDelete, switch to the (..., false) skip-lock overloads of putResource / deleteResource since the AdminWriteLock is already held. AdminWriteSerializationTest still passes — the cluster-wide lock contract is preserved; this only removes the redundant per-resource lock layered above it. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Reverses four prior locked decisions to align /v1/{type}/{bucket}/{name}
(models / interceptors / roles / keys / routes / schemas / settings) with
the existing user Resource API: PUT-upsert + If-None-Match: * / If-Match
conditional headers (POST -> 405); listing moves to /v1/metadata/... with
ResourceFolderMetadata / ResourceItemMetadata shape (blob-only); ?token=&
limit= (max 1000) pagination; no projection on listings. Canonical and
lite proposal docs swept; one memory entry records the lock reversal.
Design anchors: 02 §4.1, §4.3, §5.1; 03 §1, §3, §4, §6, §7; 04 §1.5, §2.5,
§3.2; 05 §1; 06 §2.8; 09 §6.1; project_unified_config_review.md 2026-05-20
Tests: server/src/test/.../ConfigResourceConditionalHeaderTest.java + 14
flipped per-type tests
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… polish note Slice U.0 status 🚧 → ✅, commit e9f0944. Append round-3 polish: GET 304 off event loop; PUT-upsert single-pre-read collapse; metadata-controller verb guard; shared MetadataQuery helper; settings 405-on-metadata-route uses Allow: GET per RFC 9110 §15.5.6; shared descriptorFor helper; SETTINGS_TYPE → enum; narrative slice-ID comments removed; lite docs swept in lockstep. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Removes AdminExportController, AdminExportTest, SecretMaskingApiTest, RouteTemplate.CONFIG_EXPORT, and the ControllerSelector handler. Core team has concerns about the current implementation; export is not MVP-critical and is deferred to a later phase alongside audit. Slices 1S.6 (merged ec1ac53) and 1C.4 marked dropped. Design docs (full + lite) annotate every callsite "deferred — Defer.1"; design content preserved. SecretMaskingApiTest deletion is safe — its only assertion path was GET /v1/admin/export; per-entity-GET secret masking remains covered by 2S.10's SecretFieldProcessor + dual-ObjectMapper tests. Design anchors: 02 §10; 03 §1, §4, §7; 04 §1.2, §3.7; 05 §1; 06 §2.2; 07 Phase 1/4/5; 08 OQ-10/11/17; 09 §6.1/§6.2; IMPLEMENTATION.md §5.5 Defer.1 Tests: removed AdminExportTest + SecretMaskingApiTest; no new tests (deletion only) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Aligns the documented entity-name contract (02 §4, 03 §3) with code by adding a 400-on-mismatch regex check ^[A-Za-z0-9._%:-]+$ in ConfigResourceController for PUT/DELETE on new admin-config types. Applied to the URL-decoded segment. Conservative floor; docs note the set is extendable on client request. Lite docs swept in lockstep. Design anchors: 02 §4, 03 §3 Tests: server/src/test/.../ModelWriteApiTest.java (amended testPutGetWithUrlEncodedName + 2 new rejection guards) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…entries
Carve file-sourced entries out of the per-entity Configuration API and
put them on a dedicated read-only path /v1/admin/config/file/{type}[/{name}].
Per-entity GET (and singleton settings GET) becomes blob-only — file
entries return 404. Authz: admin role for all types EXCEPT keys, which
requires security-admin (file Config.keys map keys equal secrets per
OQ-12). The source field is retired entirely — the URL itself discloses
the source. Write verbs on the new route return 405 + Allow: GET per
RFC 9110 §15.5.6 (no fall-through to GlobalRouteController).
Closes the file-keys URL-leak: previously a file key's secret value
could land in logs/traces/metrics via the simple-name fallback on
GET /v1/keys/platform/{name}. Reopens locked decisions: Polish.1
(simple-name fallback retired), OQ-10 (singleton three-state projection
retired), per-entity source field (gone). Runtime resolution unchanged
— union still flows through MergedConfigStore into Config.
Design anchors: 02 §4 (operator-vs-runtime), 03 §1/§2 (file-config
endpoints, blob-only singleton), 04 §1.5/§1.5.1 (security-admin gate),
08 OQ-10 (three-state projection retired)
Tests: server/src/test/.../FileConfigApiTest.java (new HTTP integration),
FileConfigControllerTest.java (new unit), plus 12 existing tests flipped
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ocks Replace the single "admin-writes" Redis key with the per-bucket lock-key derivation that LockService#underBucketLocks already uses (BlobStorageUtil.toStoragePath + distinct + sorted). Composite Lock keeps the controller try-with-resources callsites unchanged. Cross-bucket serialization (design 02 §4.4) is preserved by holding both PUBLIC_LOCATION and PLATFORM_LOCATION locks at once. Design anchors: 02 §4.4 Tests: server/src/test/.../AdminWriteSerializationTest.java Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Public view is a per-type allowlist (matching today's ModelData / ApplicationData), not "all-non-secrets". Retires the false claim that /openai/models exposes upstreams/endpoints to authenticated readers and removes the "mechanical @JSONVIEW(Public)" mass-annotation rule. Lite §Authorization sheds the ConfigAuthorizationService impl-detail sentence. New U.2 slice queued in IMPLEMENTATION.md §5.5 to land the projection contract end-to-end (Views.java, DEFAULT_VIEW_INCLUSION = false, per-type allowlists) — today the controller serializes the full entity, exposing upstreams/endpoints to non-admin readers. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
| // Serializes rebuild() across init/reload/rebuildNow/timer-fire callsites. ReentrantLock | ||
| // (vs. synchronized) lets virtual threads park instead of pin their carrier while waiting — | ||
| // rebuild() does blob-storage IO that can take tens to hundreds of ms. | ||
| private final ReentrantLock rebuildLock = new ReentrantLock(); |
There was a problem hiding this comment.
The lock is used to apply changes sequentially in the end the last operation wins. The code doesn't resolve conflicts: it either deletes or updates an entity in a global config map. We can safely remove the lock.
There was a problem hiding this comment.
I'd like to push back on removing the lock before changing the code — rebuildLock is doing more than ordering single-key updates. Specifically:
-
Multi-map atomic swap across the entire
Config.rebuild()and theapplyEntity*/applyBatchpaths each produce a freshConfig(and a freshinvalidEntitiesmap) and assign it to the volatileconfigfield at the end. Without the lock, two concurrent producers can interleave such that the final volatile swap reflects only one of them — the other's writes (to multiple type-maps, plus the siblinginvalidEntitiesstore) are silently dropped. "Last operation wins" is true at the swap level, but the in-flight builders are mutating different freshConfiginstances, so we'd lose the changes from whichever lost the race rather than merging them. Therebuild()path in particular readsfileConfigStore.get(), scans blob storage for every managed type, and runsSecretFieldProcessor.decryptFieldson each entity — none of that work composes if two run in parallel. -
Cross-map invariants inside one rebuild.
applyEntityWriteLockedforINTERCEPTORresurrects models frominvalidEntitiesintonext.modelsbased on the new interceptor being present innext.interceptors.applyEntityDeleteLockedforINTERCEPTORcascades throughnext.modelsto invalidate model chains that referenced the deleted interceptor. Both read and mutate multiple maps before the volatile swap. Without serialization an interleaved write could see a half-builtnextand produce an inconsistent merged config (interceptor present but models still invalid, or interceptor deleted but models still pointing at it). -
Mixed callers. The lock is acquired from:
init,reload,requestRebuild's timer-fire,applyReplicaEventon pub/sub, the writer-podrebuildNow(), and now the bulkapplyBatch. These can fire concurrently — pub/sub events arrive while a debounced rebuild is running, while an admin PUT is running, while a file-store reload callback fires. The lock is what serializes them; "last operation wins" is only safe semantics if there is exactly one in-flight operation at a time. -
ReentrantLock specifically (not
synchronized). Chosen so virtual threads park instead of pinning their carrier —rebuild()does blob-storage IO that can take tens to hundreds of ms, which would block a Vert.x carrier undersynchronizedon the writer-pod path.
Happy to be wrong — if you have a specific concurrent path in mind where the lock is preventing parallelism that would compose correctly, I'd take that as the place to narrow rather than remove. But removing the lock outright would, I think, regress (2) (cross-map invariants) in particular, and the workload doesn't justify the risk — admin writes are rare (dozens/day on real envs per §4.4), so the contention the lock creates isn't a real cost.
Wanted to flag rather than just act on this one. Want me to leave it as-is, or would you like me to look at narrowing the lock to specific sections (e.g., the rebuild path vs. the partial-update path) rather than removing it?
There was a problem hiding this comment.
All operations are executed under distributed Redis lock of bucket level such as apply*. So the local lock is not required any longer.
However the operations reload, rebuildNow, rebuild are executed under a local lock rebuildLock. So DIAL Core may read dirty state, e.g. applyBatch is in progress. We need to run these operations under distributed bucket lock as well.
There was a problem hiding this comment.
You're right that all apply* callsites from the write controllers (ConfigResourceController.handlePut/handleDelete/handleSettings* at L445/L546/L387/L415 and AdminApplyController.process at L174) are wrapped in lockService.underBucketLocks(ADMIN_BUCKET_LOCATIONS, ...). So at the cluster level the distributed lock is the outer serializer for those paths.
However, rebuildLock is doing real work that the distributed lock doesn't cover — even on the writer pod. Two reasons:
-
applyReplicaEvent(pub/sub fast-path, L238). Dispatched fromonResourceEventviataskExecutor.submit(...)(L212-L219) with noLockServicecall anywhere in the chain. On a replica pod, multiple replica events can race concurrently —rebuildLockis the only thing serializing them. -
Same-pod
rebuild()racing againstapply*. This is the one I want to flag because it bit me when I first thought we could drop the lock fromapply*.apply*doesnext = shallowClone(this.config)before its mutations, then swapsthis.config = nextat the end. The debouncedrebuild()timer-fire (L399) fires on its own thread, holds onlyrebuildLock, and does the same shape: readthis.config, buildmerged, swap. WithoutrebuildLockshared between them, an interleaving like- apply:
next = shallowClone(this.config)(snapshot taken) - rebuild: full sweep, then
this.config = merged - apply:
this.config = next(clobbers rebuild's swap)
loses any state rebuild surfaced from blob storage that apply didn't touch (e.g., a new entity that landed via pub/sub from another pod, or a previously-invalid entity that became valid on this sweep). The same race exists on any pod that has both an in-flight admin write and a pending debounced rebuild — i.e. the writer pod, every time.
- apply:
So rebuildLock is correctness-load-bearing on every pod, not redundant with the distributed lock. The writer-pod ReentrantLock acquisition on the uncontended apply path is nanoseconds — cheap enough that keeping it doesn't justify a redesign.
Recommend leaving rebuildLock as-is for this PR. If you'd like to revisit the lock layering more broadly (e.g., extend the distributed lock to applyReplicaEvent and rebuild() callsites to remove the local lock entirely), that wants a separate design discussion since it turns the file-store reload callback into a Redis round-trip and creates pub/sub thundering-herd against the bucket lock. Happy to spin it up as a follow-up OQ if you want.
There was a problem hiding this comment.
Came back to this with fresh eyes — your concern was real and I had it narrowed to the wrong race. The cross-pod dirty read on partial blob batches: writer pod's AdminApplyController.applyBatch runs resourceService.putResource(...) once per entity, then mergedConfigStore.applyBatch (the in-memory swap) only after ALL blobs are written. During that interval T_write_first_blob → T_inmemory_swap the blob store carries partial-batch state. Any other pod doing rebuild() / applyReplicaEvent / reload() in that window seeds a torn Config — exactly what MODE_ABORT cannot recover from cleanly, and what MODE_SKIP only converges from after subsequent pub/sub events.
The distributed lock on the writer only serialized writers against each other; it did nothing to block the replica's read paths.
Fixed in d8e10df6 on feature/unified-config-server. The 5 read/dispatch callsites in MergedConfigStore (init, reload, rebuildNow, onRebuildTimerFire, onResourceEvent→applyReplicaEvent) now acquire the same underBucketLocks(ADMIN_BUCKET_LOCATIONS) the writers acquire. The lock-holding writer pod owns the cross-bucket read horizon until its in-memory swap completes; replica rebuilds, pub/sub fast-paths, and admin-reload all wait.
Layering, summarized:
| Outer (distributed, cross-pod) | Inner (intra-pod) | Acquired by |
|---|---|---|
ADMIN_BUCKET_LOCATIONS via underBucketLocks |
rebuildLock (ReentrantLock) |
init, reload, rebuildNow, debounced timer-fire, applyReplicaEvent dispatch, write controllers around apply* |
| — | rebuildLock only |
applyEntityWrite, applyEntityDelete, applyBatch, applySettingsWrite, applySettingsDelete (their callers hold the outer lock) |
The inner rebuildLock stays as defensive same-pod serialization (e.g., timer-fire rebuild on a virtual thread vs. an in-flight controller apply on another virtual thread on the same pod). Removing it would require contract-changing every public apply* to "must be called under the outer lock" — there's no compile-time guard for that today, so I kept the local lock and added contract Javadocs to every apply* so the contract is explicit. Tracked as a future Simplicity-First cleanup, not load-bearing now.
Non-reentrancy. LockService.lock generates a fresh random owner per call; the Redis Lua script doesn't recognize re-entry. So nested underBucketLocks on the same thread would deadlock. Verified: none of the 5 new callsites runs inside an outer lock — init is at boot, reload enters from ConfigController.taskExecutor.submit, rebuildNow has no production caller, the timer fire and pub/sub dispatch both go through fresh taskExecutor.submit virtual threads. Each apply* method's Javadoc now explicitly states the caller-holds-outer-lock contract.
Event loop. No callsite I touched runs on the event loop — all five hop to virtual threads before acquiring Redis locks.
Tests.
- New integration test
AdminReadSerializationTest.testReloadBlocksWhileAdminLockHeld— holds the platform bucket lock from the test thread, firesPOST /v1/ops/config/reload, asserts the reload blocks (TimeoutException at 500ms) until release. Mirrors the existingAdminWriteSerializationTestpattern. ./gradlew checkstyleMain checkstyleTest :server:test— full server suite passes.
Doc 02 §4.4 rewritten to describe the new layering, the non-reentrancy rule, the lock-ordering invariant, and the inner/outer split. Two stale references to the removed AdminWriteLockService cleaned up.
Withdrawing my recommendation from r3303191590 — option (c) as I had described it was unsafe (writer-pod apply* and timer-fire race on the same this.config swap). The fix above is closer in spirit to your original suggestion: replica reads serialize against writer writes via the same bucket locks. Ready for another look.
There was a problem hiding this comment.
Reverted the lock-layering change (68f80668 reverts d8e10df6) — want to think about this a little more before committing to an approach. The cross-pod dirty-read is real, but extending the bucket locks to every rebuild / replica-event / reload path turns frequent local operations into Redis round-trips and changes the propagation model more than I'd like to do without more thought (pub/sub thundering-herd, file-reload callback latency, the no-reentrancy footgun on apply*). Will come back with a more considered proposal. Leaving the thread open.
There was a problem hiding this comment.
Fixed in 71d9066. The three full-rebuild entry points — reload(), rebuildNow(), and the debounced onRebuildTimerFire() — now wrap their rebuild() blob scan in lockService.underBucketLocks(MergedConfigStore.ADMIN_BUCKET_LOCATIONS, …), the same [public/, platform/] bucket locks the write controllers (ConfigResourceController, AdminApplyController) already acquire. A writer pod holds those locks across its whole multi-blob batch and the in-memory swap, so a peer pod's rebuild can no longer scan partial-batch (applyBatch-in-progress) state.
Scope, per your follow-up:
init()is left unlocked — it runs once at startup before any concurrent admin writer.- The replica fast path (
onResourceEvent/applyReplicaEvent) and theapply*partial-update methods are untouched. The replica path is best-effort and self-heals via itsrequestRebuild()fallback plus the polling SLA, so locking the high-frequency per-event path would serialize every replica against every writer for no correctness gain over the fallback.
ADMIN_BUCKET_LOCATIONS is consolidated to a single public static final on MergedConfigStore (both controllers reference it) so the reader and writer lock-key sets stay byte-identical. underBucketLocks sorts keys internally and the locked callsites run on virtual threads (taskExecutor), so no ordering inversion and no event-loop blocking. New regression test AdminReadSerializationTest holds the platform-bucket lock and asserts POST /v1/ops/config/reload blocks until release.
…atch Reviewer flagged that applyBatch cloned the full per-type map on each entry via putEntity/removeEntity, yielding O(batch × map) work — 100 entries on a 100-entry map = 10k operations. Pre-clone each touched type-map once at the top of applyBatch and mutate in place across all entries; single volatile swap of this.config at the end. Per-entry validation/coercion failures roll back the (type, canonicalId) slot in next + the matching invalidEntities record so subsequent entries observe the pre-change state, preserving single-entity atomicity. applyEntityWriteLocked / applyEntityDeleteLocked now share the same in-place helpers (cloneTypeMap once + applyChangeInPlace). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Reviewer flagged the wrapper as reinventing LockService.underBucketLocks functionality. Wrapper deleted; the five admin-write callsites inline lockService.underBucketLocks(ADMIN_BUCKET_LOCATIONS, () -> ...) directly, with ADMIN_BUCKET_LOCATIONS = List.of(PUBLIC_LOCATION, PLATFORM_LOCATION) declared per-controller. Lock-key identity, sorted acquisition order, and cluster-wide admin-write serialization (02-architecture.md §4.4) are preserved — the wrapper used the same BlobStorageUtil.toStoragePath + sorted derivation that underBucketLocks already performs. AdminWriteSerializationTest comment updated; lock-probe key derivation unchanged. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Sync 17 commits from development (PRs #1509–#1560): toolset tool listing/discovery fixes, Snowflake-managed MCP "Invalid client" fix, auto-sharing for prompts + citation attachments, applicationTypeRoutes exposure, MCP feature flag for schema-rich apps, Veo 3.1 timeout fix, Netty CVE remediation, CI workflow bumps. Conflict resolution: - build.gradle: adopt dev's netty_version=4.1.133.Final property and subprojects-level netty-transport-native-epoll exclusion (CVE-compliant per project_netty_vertx_pin.md memory); rewrite this branch's netty-codec / netty-codec-http forces to use ${netty_version}; drop the now-redundant transport-native-epoll force. gradle.properties (auto-merged): Jackson bump 2.18.6 → 2.21.2, new mcp_sdk_version=1.1.2 and netty_version=4.1.133.Final. server/build.gradle (auto-merged): MCP SDK deps added by dev coexist with this branch's vertx-web-client scope dedupe. AccessService.java (auto-merged): dev's attached-prompts check sits above the attached-deployments block; non-overlapping with this branch's admin-authz preflight surface. Per-sync merge override per IMPLEMENTATION.md §3.2 — rationale: keep PR #1529 review threads anchored to existing SHAs (force-push would detach review-thread context). ./gradlew checkstyleMain checkstyleTest :server:test :storage:test :config:test :credentials:test all green. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
This comment has been minimized.
This comment has been minimized.
ApplicationService and ToolSetService were unconditionally stripping forwardAuthToken on every write — pre-existing user-safety policy from 00c75ec, but it locked admin out too. Adds `preserveForwardAuthToken` parameter; controllers compute `descriptor.isPublic() && hasAdminAccess` for the carve-out. Bulk apply preserves; publication-approval keeps stripping (defense in depth). Design anchors: 04 §1.4 (amended); IMPLEMENTATION.md §5.5 Tests: server/src/test/java/.../ConfigAdminAppToolsetWriteTest.java Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
DIAL team agreed in review call to postpone the plaintext-reveal feature. Encryption-at-rest stays; @JsonProperty(WRITE_ONLY) drops secret fields from GET responses instead of emitting "***". File-config /keys denied for every caller (admin included) — file map keys equal secrets per OQ-12 and no role distinguishes "may read secret-equivalent names" from plain admin anymore. Design preserved in 04 §2.6 marked deferred. Design anchors: 04 §1.5, §1.5.1, §2.5, §2.6 (amended); 03 §1, §3.1 (amended); 08 OQ-12 (amended); IMPLEMENTATION.md §5.5 Tests: full :server:test suite passes; new DualMapperTest / SecretFieldProcessorTest coverage for stripEncryptedFields and the mask-as-literal preserve-on-omit contract Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…eload Closes the cross-pod dirty-read window where a replica pod's rebuild(), applyReplicaEvent, or reload() could scan blob storage while another pod's AdminApplyController.applyBatch is mid-batch (some blobs written, in-memory swap pending). Wraps the 5 read/dispatch callsites in MergedConfigStore in underBucketLocks(ADMIN_BUCKET_LOCATIONS, ...) — same key set the writers acquire. Inner rebuildLock retained as intra-pod defensive serialization. Lock contract documented on every apply*/rebuild method. Design anchors: 02 §4.4 (rewritten) Tests: server/src/test/.../AdminReadSerializationTest.java Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…eplica/reload" This reverts commit d8e10df.
…d reads under admin bucket locks Closes the cross-pod dirty-read window where a peer pod's full-rebuild blob scan could observe partial-batch state: a writer pod's apply writes blobs one-by-one then swaps the in-memory Config after the loop, so the blob store carries a torn view during the batch. Per reviewer astsiapanay's narrowing, the three full-rebuild entry points — reload(), rebuildNow(), and the debounced onRebuildTimerFire() — now wrap their rebuild() blob scan in lockService.underBucketLocks(ADMIN_BUCKET_LOCATIONS, …), the same [public/, platform/] bucket locks the write controllers already hold. init() (startup-only) and the replica fast path (onResourceEvent / applyReplicaEvent, best-effort with a requestRebuild() fallback + polling SLA) are deliberately left unlocked. ADMIN_BUCKET_LOCATIONS is consolidated to a single public static final on MergedConfigStore; both controllers now reference it so reader and writer lock-key sets stay byte-identical. Design anchors: 02 §2 / §4.4 (rewritten) / §11.1 Tests: server/.../AdminReadSerializationTest.java Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Aligns the lite docs with the full proposal: corrects the pre-existing AdminWriteLockService / "admin-writes" single-key drift to the actual two-bucket underBucketLocks mechanism, and documents that the full-rebuild blob scan (reload / rebuildNow / debounced timer) shares those locks while init() and the replica fast path are deliberately left unlocked. Touches lite 02-architecture.md (locking table row + serialization paragraph) and 03-api.md (cluster-wide serialization bullet). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Resolve conflicts: - ResourceTypes: keep unified-config enums + RESPONSE_MAPPING - .gitignore: keep both ignore sets - App/CustomApplication tests: adopt #1565 secured upstream exposure (upstreams returned only to write-access callers) Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
This comment has been minimized.
This comment has been minimized.
Records the core-team concern that Upstream.extraData is a wide property (more often non-secret than secret, sometimes mixing both in one JSON document) while the post-U.4 state encrypts it wholesale with no read path. Proposed design: plaintext extraData + encrypted secretExtraData with merge-at-consumption on the X-UPSTREAM-EXTRA-DATA header. Status PROPOSED, not locked - full design in 04 §2.11, OQ-34 open, slice U.5 registered ⏸ blocked pending core-team decision; no locked text amended. Design anchors: 04 §2.11 (new, proposed); 08 OQ-34 (new, open) Tests: no new tests (docs-only) Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Dependency Review✅ No vulnerabilities or license issues or OpenSSF Scorecard issues found.Scanned FilesNone |
…s deny-all The lite docs still described the pre-U.4 security-admin tier gating file-sourced keys; U.4 retired that tier and denies type=keys for every caller including admin. Also document the names-only stub shape of the file-config list endpoint. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…listing
The list response shape of GET /v1/admin/config/file/{type} existed only
in FileConfigController and its tests — 03-api-reference.md defined the
endpoints and authz but never the wire contract. Pin it: names-only stub
items, full bodies only on the single-entry GET, bodies-in-bulk deferred
to /v1/admin/export (Defer.1). Extra per-item fields (type, description)
are YAGNI but the shape is forward-compatible with adding them.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Applicable issues
Description of changes
Checklist
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.