Skip to content

feat: dial-unified-config — server (Configuration API + MergedConfigStore + secret encryption)#1529

Open
siarhei-fedziukovich wants to merge 136 commits into
developmentfrom
feature/unified-config-server
Open

feat: dial-unified-config — server (Configuration API + MergedConfigStore + secret encryption)#1529
siarhei-fedziukovich wants to merge 136 commits into
developmentfrom
feature/unified-config-server

Conversation

@siarhei-fedziukovich
Copy link
Copy Markdown
Contributor

  • chore: add dial-unified-config proposal documentation
  • chore: add dial-unified-config implementation plan
  • chore: add /dial-mvp orchestrator slash command
  • chore: gitignore Eclipse / JDTLS IDE artifacts
  • chore: gitignore personal review-unified-config slash command
  • chore: allow per-sync merge override on feature-branch integration
  • docs(dial-unified-config): apply 2026-05-01 default-flip and settings-DELETE locks
  • chore: rename slice sub-branches to use hyphen separator
  • feat: 1S.0: bootstrap CONFIG_RESOURCE route, ConfigAuthorizationService, EntityBucketBinding (feat: 1S.0: bootstrap CONFIG_RESOURCE route, ConfigAuthorizationService, EntityBucketBinding #1513)
  • docs(dial-unified-config): mark slice 1S.0 merged
  • chore: drop per-slice PRs in favor of local squash-merge
  • feat: 1S.1: GET /v1/models/public/{name} read with Public/Owner projection
  • docs(dial-unified-config): mark slice 1S.1 merged
  • chore: add /dial-mvp-auto for batch slice execution with self-test gates
  • feat: 1S.2: GET /v1/models/public/ listing with pagination envelope
  • docs(dial-unified-config): mark slice 1S.2 merged
  • feat: 1S.3: read API for platform-bucket types + schemas/settings
  • docs(dial-unified-config): mark slice 1S.3 merged
  • feat: 1S.4: ConfigAuthorizationService preflight for app/toolset reads
  • docs(dial-unified-config): mark slice 1S.4 merged
  • feat: 1S.5: admin authz preflight on FILES/RESOURCE controllers
  • docs(dial-unified-config): mark slice 1S.5 merged
  • feat: 1S.6: GET /v1/admin/export — full Config snapshot (JSON/YAML)
  • docs(dial-unified-config): mark slice 1S.6 merged
  • feat: 1S.7: GET /v1/admin/health/config — Phase 1 healthy stub
  • docs(dial-unified-config): mark slice 1S.7 merged
  • fix(test): update ConfigBootstrapTest expectation for 1S.3 interceptor handler
  • docs(dial-unified-config): clarify §A item 4 for pure-internal refactors
  • refactor: 2S.0-pre: ApiKeyStore.addProjectKeys() dual-format guard
  • docs(dial-unified-config): mark slice 2S.0-pre merged
  • refactor: 2S.1-pre: platform/ bucket plumbing — constants, factory, ResourceTypes
  • docs(dial-unified-config): mark slice 2S.1-pre merged
  • refactor: 2S.2-pre: ResourceType.urlSegment() decouples URL from blob group
  • docs(dial-unified-config): mark slice 2S.2-pre merged
  • refactor: 2S.3-pre: ApiKeyStore concurrent migration + OQ-12 put-by-secret
  • docs(dial-unified-config): mark slice 2S.3-pre merged + amend row scope
  • docs(dial-unified-config): drop slice 2S.4-pre — capability already exists
  • refactor: 2S.5-pre: FileConfigStore initialOnReloadCallbacks
  • docs(dial-unified-config): mark slice 2S.5-pre merged
  • refactor: 2S.7-pre: extract ConfigPostProcessor from FileConfigStore.load()
  • docs(dial-unified-config): record 2S.7-pre commit SHA
  • refactor: 2S.6-pre: move addProjectKeys into ConfigPostProcessor
  • docs(dial-unified-config): mark slice 2S.6-pre merged + record 2S.7-pre dep
  • feat: 2S.8: introduce MergedConfigStore unioning FileConfigStore + ResourceService
  • docs(dial-unified-config): mark slice 2S.8 merged + record Option C scope
  • feat: 2S.9: invalid-entity sibling store + skip|abort knob
  • docs(dial-unified-config): mark slice 2S.9 merged
  • feat: 2S.10: SecretFieldProcessor + dual ObjectMapper for @EncryptedField
  • docs(dial-unified-config): mark slice 2S.10 merged
  • feat: 2S.11: MODEL POST/PUT/DELETE on /v1/models/public/{name}
  • docs(dial-unified-config): mark slice 2S.11 merged
  • feat: 2S.12: POST /v1/admin/validate model-scoped
  • docs(dial-unified-config): mark slice 2S.12 merged
  • feat: 2S.13: cross-ref validation on per-entity model writes
  • docs(dial-unified-config): mark slice 2S.13 merged
  • feat: 2S.14: writer-pod immediate Config swap via rebuildNow()
  • docs(dial-unified-config): mark slice 2S.14 merged
  • feat: 1.5S.0-pre: ResourceTopic codec tolerates unknown fields
  • docs(dial-unified-config): mark slice 1.5S.0-pre merged
  • feat: 1.5S.1: ResourceTopic.subscribeAll for cross-cutting listeners
  • docs(dial-unified-config): mark slice 1.5S.1 merged
  • feat: 1.5S.2: stamp senderPodId on every ResourceEvent
  • docs(dial-unified-config): mark slice 1.5S.2 merged
  • feat: 1.5S.3: MergedConfigStore subscribeAll listener
  • docs(dial-unified-config): mark slice 1.5S.3 merged
  • feat: 3S.0-pre: codeVerifier encryption with lazy plaintext fallback
  • docs(dial-unified-config): mark slice 3S.0-pre merged
  • feat: 3S.1: BlobEntityValidator helper for apps/toolsets
  • docs(dial-unified-config): mark slice 3S.1 merged
  • feat: 3S.4: admin write paths for files/prompts/conversations (tests)
  • docs(dial-unified-config): mark slice 3S.4 merged
  • feat: 3S.2: write API for interceptors/roles/keys/routes/schemas
  • docs(dial-unified-config): mark slice 3S.2 merged; add 3S.2-settings sibling row
  • feat: 3S.2-settings: settings singleton write API + GET-from-blob
  • docs(dial-unified-config): mark slice 3S.2-settings merged
  • feat: 3S.3: admin write paths for applications/toolsets (tests)
  • docs(dial-unified-config): mark slice 3S.3 merged with scope narrow note
  • feat: 4S.0: POST /v1/admin/apply bulk admin write endpoint
  • docs(dial-unified-config): mark slice 4S.0 merged with unknown-kind narrow note
  • feat: 4S.1: POST /v1/admin/validate multi-entity batch validate
  • docs(dial-unified-config): mark slice 4S.1 merged with phase-2 replace + sentinel-fix note
  • test: cover api-key→admin-role authz path on /v1/roles/platform/
  • test: cover api-key→admin-role authz path on /v1/roles/platform/
  • chore: add /dial-uc-debug for unified-config investigation with plan-halt
  • feat: 2S.15: canonical IDs in entity.getName() for API-managed entries
  • docs(dial-unified-config): mark slice 2S.15 merged
  • chore: add MCP wrapper slices to IMPLEMENTATION.md (Track C)
  • fix: unblock PR CI — bump Netty to 4.1.133.Final and dedupe vertx-web-client scope
  • feat: Polish.1: listing canonical IDs for API entries + dedup-by-key
  • docs(dial-unified-config): mark slice Polish.1 merged with halt-at-scope retrospective

Applicable issues

  • fixes #

Description of changes

Checklist

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

SiarheiFedziukovich and others added 30 commits May 1, 2026 17:45
Add the 10-document proposal for unifying DIAL Core configuration management — Configuration API + dial-cli — under docs/sandbox/dial-unified-config/. Covers problem context, architecture, API reference, security/audit, CLI design, user guide, migration & rollout, open questions, and admin MCP spec.

Status v2.20: decisions locked, ready for Phase 1 implementation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Execution playbook for the Phase 1-3 (+4 NTH) MVP: slice register, agent loop, simplification principles, branching model, halt conditions, LSP integration notes. Companion to the proposal docs (01-09); spec stays the contract, IMPLEMENTATION.md governs execution.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Project-scoped slash command at .claude/commands/dial-mvp.md that resumes the dial-unified-config MVP implementation in any new session. Loads context from IMPLEMENTATION.md + project memory, runs the agent loop, halts at architect-plan and on any halt-condition trigger. Usage: /dial-mvp [<slice-id> | status].

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add .classpath, .factorypath, .project, .settings/ to .gitignore. These are generated by Eclipse and the JDTLS language server (used by Claude Code's LSP integration), should not be tracked. Existing entries cover IntelliJ (.idea/) and VS Code (.vscode/).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The .claude/commands/review-unified-config.md file is a personal/local slash command, not part of the shared MVP tooling. Gitignored to avoid accidental inclusion in commits.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Refine the branch-hygiene policy in IMPLEMENTATION.md §3.2 / §9. Default remains rebase (linear history; force-push; sub-branches rebase onto new tip). Per-sync merge override is now allowed when situational — early in MVP with few slices in flight, or when conflicts resolve more cleanly with a merge commit. Late in MVP, prefer rebase to keep history readable for the final big-PR review.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…-DELETE locks

- config.reload.onInvalidEntity default flipped from skip to abort
  (aligns reload-side with config.write.softValidation: false; opt-in
  skip retained for scale-up resilience).
- Singleton /v1/settings/platform/global gains DELETE to release API
  control and revert the projection to file-sourced (or default);
  POST still 405. Three-state source field: api | file | default.

Touches docs 01, 02, 03, 05, 06, 07, 08, 09, plus IMPLEMENTATION.md
slice rows 1S.3, 2S.9, 3S.2 and the locked-rules summary.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Git refuses to create refs/heads/feature/unified-config/<x> while refs/heads/feature/unified-config exists as a branch (ref-vs-directory conflict). Rename slice sub-branches from feature/unified-config/<id>-<short-title> to feature/unified-config-<id>-<short-title>. Wildcard ergonomics preserved (git branch --list 'feature/unified-config-*').

Discovered on first /dial-mvp 1S.0 run when the orchestrator hit halt-condition §4.1 #1 (constraint contradicts plan) and surfaced three options; Option A picked (smallest blast radius — no rename of an existing branch).

Updates IMPLEMENTATION.md §3.2 + §9 and the /dial-mvp slash command.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ce, EntityBucketBinding (#1513)

Co-authored-by: SiarheiFedziukovich <siarhei_fedziukovich@epam.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Slices now integrate via local `git merge --squash` into feature/unified-config — no per-slice PR, no per-slice formal code-owner review. Orchestrator halts for the user to approve the slice diff and a §3.5-formatted commit message before merging, then deletes the sub-branch and updates the slice register Status to merged. The single big PR feature/unified-config → development at MVP-complete remains the only formal external review checkpoint.

Adds IMPLEMENTATION.md §3.5 (commit-message format with type guide and example), updates §3.2 (branching diagram + integration bullet), §4 (drops per-slice /ultrareview recommendation; reserved for MVP-complete or ad-hoc), §5 (status legend: in-review → awaiting-merge; PR column → Commit), and §9 (decisions log). Updates /dial-mvp slash command step 7 (OPEN PR → MERGE LOCALLY) and Important notes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ction

Replaces the 1S.0 stub's 405 fall-through with a model-read handler off the
in-memory volatile Config. Public/Owner field projection: status is always
"valid" in Phase 1; source ("file") is admin-only. Empty name returns 404 —
the listing slot is reserved for 1S.2. Unblocks Track B (CLI 1C.0) on wire
contract.

Design anchors: 03 §1, §2, §4; 04 §1.5
Tests: server/src/test/java/com/epam/aidial/core/server/ConfigModelReadTest.java

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
New project-scoped slash command that runs multiple dial-mvp slices sequentially. The two routine halts (architect-plan approval, merge-diff approval) become conditional, gated by self-tests in IMPLEMENTATION.md §4.2 — auto-approval is earned, not assumed. Halt conditions §4.1 still always trigger; auto-mode never bypasses them.

Adds IMPLEMENTATION.md §4.2 (Auto-mode policy) with the §A ARCHITECT and §B MERGE LOCALLY self-test checklists, when-to-use guidance (mechanical/semi-mechanical only — Phase-3 sweep, Phase-2 prereqs), when-NOT-to-use guidance (1S.0, 2S.8, 2S.10, 4S.0 — high-uncertainty slices need plain /dial-mvp), and auditability format.

Usage: /dial-mvp-auto (until blocked) | /dial-mvp-auto 3 (count) | /dial-mvp-auto until-phase-end.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds the per-bucket listing endpoint behind the existing CONFIG_RESOURCE
route. Phase 1 returns the full in-memory snapshot with hasMore: false
(per 03 §4 forward-compat); ?limit shape-validated, ?cursor accepted-and-
ignored. Public/Owner field projection on items is shared with the single-
GET path via a projectModelItem helper. Trailing slash optional. Also
fixes RegexUtil.collectGroups to skip optional named groups whose start
is -1, which surfaced as a server-side IOOBE only when 1S.2 exercised
/v1/models/public without a trailing slash.

Design anchors: 03 §1, §4
Tests: server/src/test/java/.../ConfigModelListTest.java (new); ConfigModelReadTest update

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Extends ConfigResourceController GET dispatch to interceptors, roles, keys,
routes (platform/), schemas (public/), and the singleton settings at
platform/global. Bucket-aware authz already gates non-admin off platform/.
Key.key is masked with "***" — Phase 1 has no reveal-secrets surface. Settings
GET projects globalInterceptors with file/default source; POST/PUT/DELETE
return 405 with Allow: GET, PUT, DELETE.

Design anchors: 03 §1; 04 §1.2
Tests: server/src/test/java/com/epam/aidial/core/server/Config{Interceptor,Role,Key,Route,Schema,Settings}Test.java

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ResourceController overrides handle(resourceUrl) to add an additive admin
admit on GETs to applications/toolsets. When the unified-config gate grants
admin access, the request bypasses the rules-based AccessService check and
proceeds with hasWriteAccess=true (full data); everyone else falls through
to the existing flow unchanged. Phase 1 reads only — write preflight ships
in 1S.5.

Design anchors: 03 §1; 02 §6
Tests: server/src/test/java/com/epam/aidial/core/server/ConfigAdminAppToolsetReadTest.java

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
AccessControlBaseController.handle(resourceUrl) gains a ConfigAuthorizationService
preflight: when the admin role is asserted AND the unified-config gate authorizes
the request (public/ + admin always; user-bucket only when admin owns it), the
rules-based AccessService check is skipped and the handler runs with
hasWriteAccess=true. Otherwise the request falls through to the existing path.
Covers all 4 subclasses (Resource + 3 FILES) and both reads/writes; supersedes
the narrower 1S.4 override on ResourceController.

OQ-33's "admin can't reach user buckets" is enforced by the gate not admitting
admin onto user buckets; existing share-based grants (publication review) continue
through the rules path unchanged.

Design anchors: 03 §1; OQ-21; OQ-33
Tests: server/src/test/java/com/epam/aidial/core/server/ConfigAdminPreflightTest.java

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds AdminExportController + RouteTemplate.CONFIG_EXPORT for /v1/admin/export.
Admin-only via ConfigAuthorizationService; default JSON, YAML when ?format=yaml
or Accept: application/yaml. Keys are re-attached with masked secrets — Config
field's @JsonProperty(WRITE_ONLY) suppresses the map at serialization time.
JSON-string round-trip avoids the TokenBuffer/writeRaw incompatibility in
applicationTypeSchemas' custom serializer.

Design anchors: 03 §1; 07 Phase 1
Tests: server/src/test/java/com/epam/aidial/core/server/AdminExportTest.java

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds AdminHealthConfigController + RouteTemplate.CONFIG_HEALTH for
/v1/admin/health/config. Admin-only via ConfigAuthorizationService; Phase 1
always reports {status:"healthy",skipped:[]} unconditionally — the invalid-
entity sibling store that populates skipped, plus the dial_config_skipped_*
Prometheus metrics, ship together in slice 2S.9. Cardinality-zero metric
scaffolds skipped here per §2.1/§2.3 (no observable behavior in Phase 1).

Design anchors: 07 Phase 2; 02 §4.1
Tests: server/src/test/java/com/epam/aidial/core/server/AdminHealthConfigTest.java

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…r handler

testAdminCanReachPlatformEntity asserted the 1S.0 stub's 405 ("no handler yet")
on admin GET /v1/interceptors/platform/anything. After slice 1S.3 wired the
interceptors read handler, the same request now resolves to 404 (interceptor
"anything" not found) — still proves the admin gate admitted. Adjusted the
expectation; comment notes the success signal can be 404 or 405 depending on
whether the type has its handler yet.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
§A self-test item 4 read literally would halt every Phase-2 prereq
slice (pure-internal refactors with no HTTP surface) — contradicting
§4.2's own "When to invoke" list that names them as auto-eligible.
Loosen the wording to require integration tests only when the slice
exposes HTTP behaviour; well-targeted unit tests otherwise. No
behavioural change to the merge gate; the amendment unblocks the
2S.0-pre … 2S.7-pre auto-mode batch.

Design anchors: IMPLEMENTATION.md §4.2 §A; §8 doc-amendment lifecycle
Tests: no new tests

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replaces unconditional value.setKey(apiKey) with a blank-guarded call so
API-managed keys (whose Key.key already holds the decrypted secret) are
not silently overwritten by the human-readable map key. Legacy
file-sourced format (map key = secret, Key.key blank) is unaffected.
Compile-time prereq for the Phase-2 keys controller.

Design anchors: 07 Phase 2 prereqs (line 98); OQ-12
Tests: server/src/test/java/com/epam/aidial/core/server/security/ApiKeyStoreTest.java

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ce locks

Address @astsiapanay's review threads on ConfigResourceController:
- Pass context.getUserDisplayName() as author to every putResource call
  (was null). Matches ResourceController#putResource pattern; the author
  is recorded on create and preserved across updates by ResourceService.
- Drop the per-resource ResourceService lock from handlePost / handlePut /
  handleDelete. ConfigResourceController handles only admin-only types
  (models, interceptors, roles, keys, routes, schemas, settings), which
  are never written by non-admin paths. The cluster-wide AdminWriteLock
  alone serializes all admin writes — the second lock added nothing. See
  02-architecture.md §4.4.
- For handleSettingsPut / handleSettingsDelete, switch to the (..., false)
  skip-lock overloads of putResource / deleteResource since the
  AdminWriteLock is already held.

AdminWriteSerializationTest still passes — the cluster-wide lock contract
is preserved; this only removes the redundant per-resource lock layered
above it.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Comment thread server/src/main/java/com/epam/aidial/core/server/config/MergedConfigStore.java Outdated
Comment thread server/src/main/java/com/epam/aidial/core/server/config/MergedConfigStore.java Outdated
Comment thread server/src/main/java/com/epam/aidial/core/server/config/MergedConfigStore.java Outdated
SiarheiFedziukovich and others added 9 commits May 20, 2026 13:00
Reverses four prior locked decisions to align /v1/{type}/{bucket}/{name}
(models / interceptors / roles / keys / routes / schemas / settings) with
the existing user Resource API: PUT-upsert + If-None-Match: * / If-Match
conditional headers (POST -> 405); listing moves to /v1/metadata/... with
ResourceFolderMetadata / ResourceItemMetadata shape (blob-only); ?token=&
limit= (max 1000) pagination; no projection on listings. Canonical and
lite proposal docs swept; one memory entry records the lock reversal.

Design anchors: 02 §4.1, §4.3, §5.1; 03 §1, §3, §4, §6, §7; 04 §1.5, §2.5,
§3.2; 05 §1; 06 §2.8; 09 §6.1; project_unified_config_review.md 2026-05-20
Tests: server/src/test/.../ConfigResourceConditionalHeaderTest.java + 14
flipped per-type tests

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… polish note

Slice U.0 status 🚧 → ✅, commit e9f0944. Append round-3 polish: GET 304 off
event loop; PUT-upsert single-pre-read collapse; metadata-controller verb guard;
shared MetadataQuery helper; settings 405-on-metadata-route uses Allow: GET per
RFC 9110 §15.5.6; shared descriptorFor helper; SETTINGS_TYPE → enum; narrative
slice-ID comments removed; lite docs swept in lockstep.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Removes AdminExportController, AdminExportTest, SecretMaskingApiTest,
RouteTemplate.CONFIG_EXPORT, and the ControllerSelector handler. Core
team has concerns about the current implementation; export is not
MVP-critical and is deferred to a later phase alongside audit. Slices
1S.6 (merged ec1ac53) and 1C.4 marked dropped. Design docs (full +
lite) annotate every callsite "deferred — Defer.1"; design content
preserved. SecretMaskingApiTest deletion is safe — its only assertion
path was GET /v1/admin/export; per-entity-GET secret masking remains
covered by 2S.10's SecretFieldProcessor + dual-ObjectMapper tests.

Design anchors: 02 §10; 03 §1, §4, §7; 04 §1.2, §3.7; 05 §1;
              06 §2.2; 07 Phase 1/4/5; 08 OQ-10/11/17; 09 §6.1/§6.2;
              IMPLEMENTATION.md §5.5 Defer.1
Tests: removed AdminExportTest + SecretMaskingApiTest;
       no new tests (deletion only)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Aligns the documented entity-name contract (02 §4, 03 §3) with code by
adding a 400-on-mismatch regex check ^[A-Za-z0-9._%:-]+$ in
ConfigResourceController for PUT/DELETE on new admin-config types.
Applied to the URL-decoded segment. Conservative floor; docs note the
set is extendable on client request. Lite docs swept in lockstep.

Design anchors: 02 §4, 03 §3
Tests: server/src/test/.../ModelWriteApiTest.java
  (amended testPutGetWithUrlEncodedName + 2 new rejection guards)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…entries

Carve file-sourced entries out of the per-entity Configuration API and
put them on a dedicated read-only path /v1/admin/config/file/{type}[/{name}].
Per-entity GET (and singleton settings GET) becomes blob-only — file
entries return 404. Authz: admin role for all types EXCEPT keys, which
requires security-admin (file Config.keys map keys equal secrets per
OQ-12). The source field is retired entirely — the URL itself discloses
the source. Write verbs on the new route return 405 + Allow: GET per
RFC 9110 §15.5.6 (no fall-through to GlobalRouteController).

Closes the file-keys URL-leak: previously a file key's secret value
could land in logs/traces/metrics via the simple-name fallback on
GET /v1/keys/platform/{name}. Reopens locked decisions: Polish.1
(simple-name fallback retired), OQ-10 (singleton three-state projection
retired), per-entity source field (gone). Runtime resolution unchanged
— union still flows through MergedConfigStore into Config.

Design anchors: 02 §4 (operator-vs-runtime), 03 §1/§2 (file-config
endpoints, blob-only singleton), 04 §1.5/§1.5.1 (security-admin gate),
08 OQ-10 (three-state projection retired)
Tests: server/src/test/.../FileConfigApiTest.java (new HTTP integration),
FileConfigControllerTest.java (new unit), plus 12 existing tests flipped

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ocks

Replace the single "admin-writes" Redis key with the per-bucket lock-key
derivation that LockService#underBucketLocks already uses
(BlobStorageUtil.toStoragePath + distinct + sorted). Composite Lock keeps
the controller try-with-resources callsites unchanged. Cross-bucket
serialization (design 02 §4.4) is preserved by holding both PUBLIC_LOCATION
and PLATFORM_LOCATION locks at once.

Design anchors: 02 §4.4
Tests: server/src/test/.../AdminWriteSerializationTest.java

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Public view is a per-type allowlist (matching today's ModelData /
ApplicationData), not "all-non-secrets". Retires the false claim that
/openai/models exposes upstreams/endpoints to authenticated readers and
removes the "mechanical @JSONVIEW(Public)" mass-annotation rule.
Lite §Authorization sheds the ConfigAuthorizationService impl-detail
sentence. New U.2 slice queued in IMPLEMENTATION.md §5.5 to land the
projection contract end-to-end (Views.java, DEFAULT_VIEW_INCLUSION =
false, per-type allowlists) — today the controller serializes the full
entity, exposing upstreams/endpoints to non-admin readers.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
// Serializes rebuild() across init/reload/rebuildNow/timer-fire callsites. ReentrantLock
// (vs. synchronized) lets virtual threads park instead of pin their carrier while waiting —
// rebuild() does blob-storage IO that can take tens to hundreds of ms.
private final ReentrantLock rebuildLock = new ReentrantLock();
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The lock is used to apply changes sequentially in the end the last operation wins. The code doesn't resolve conflicts: it either deletes or updates an entity in a global config map. We can safely remove the lock.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd like to push back on removing the lock before changing the code — rebuildLock is doing more than ordering single-key updates. Specifically:

  1. Multi-map atomic swap across the entire Config. rebuild() and the applyEntity* / applyBatch paths each produce a fresh Config (and a fresh invalidEntities map) and assign it to the volatile config field at the end. Without the lock, two concurrent producers can interleave such that the final volatile swap reflects only one of them — the other's writes (to multiple type-maps, plus the sibling invalidEntities store) are silently dropped. "Last operation wins" is true at the swap level, but the in-flight builders are mutating different fresh Config instances, so we'd lose the changes from whichever lost the race rather than merging them. The rebuild() path in particular reads fileConfigStore.get(), scans blob storage for every managed type, and runs SecretFieldProcessor.decryptFields on each entity — none of that work composes if two run in parallel.

  2. Cross-map invariants inside one rebuild. applyEntityWriteLocked for INTERCEPTOR resurrects models from invalidEntities into next.models based on the new interceptor being present in next.interceptors. applyEntityDeleteLocked for INTERCEPTOR cascades through next.models to invalidate model chains that referenced the deleted interceptor. Both read and mutate multiple maps before the volatile swap. Without serialization an interleaved write could see a half-built next and produce an inconsistent merged config (interceptor present but models still invalid, or interceptor deleted but models still pointing at it).

  3. Mixed callers. The lock is acquired from: init, reload, requestRebuild's timer-fire, applyReplicaEvent on pub/sub, the writer-pod rebuildNow(), and now the bulk applyBatch. These can fire concurrently — pub/sub events arrive while a debounced rebuild is running, while an admin PUT is running, while a file-store reload callback fires. The lock is what serializes them; "last operation wins" is only safe semantics if there is exactly one in-flight operation at a time.

  4. ReentrantLock specifically (not synchronized). Chosen so virtual threads park instead of pinning their carrier — rebuild() does blob-storage IO that can take tens to hundreds of ms, which would block a Vert.x carrier under synchronized on the writer-pod path.

Happy to be wrong — if you have a specific concurrent path in mind where the lock is preventing parallelism that would compose correctly, I'd take that as the place to narrow rather than remove. But removing the lock outright would, I think, regress (2) (cross-map invariants) in particular, and the workload doesn't justify the risk — admin writes are rare (dozens/day on real envs per §4.4), so the contention the lock creates isn't a real cost.

Wanted to flag rather than just act on this one. Want me to leave it as-is, or would you like me to look at narrowing the lock to specific sections (e.g., the rebuild path vs. the partial-update path) rather than removing it?

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All operations are executed under distributed Redis lock of bucket level such as apply*. So the local lock is not required any longer.

However the operations reload, rebuildNow, rebuild are executed under a local lock rebuildLock. So DIAL Core may read dirty state, e.g. applyBatch is in progress. We need to run these operations under distributed bucket lock as well.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right that all apply* callsites from the write controllers (ConfigResourceController.handlePut/handleDelete/handleSettings* at L445/L546/L387/L415 and AdminApplyController.process at L174) are wrapped in lockService.underBucketLocks(ADMIN_BUCKET_LOCATIONS, ...). So at the cluster level the distributed lock is the outer serializer for those paths.

However, rebuildLock is doing real work that the distributed lock doesn't cover — even on the writer pod. Two reasons:

  1. applyReplicaEvent (pub/sub fast-path, L238). Dispatched from onResourceEvent via taskExecutor.submit(...) (L212-L219) with no LockService call anywhere in the chain. On a replica pod, multiple replica events can race concurrently — rebuildLock is the only thing serializing them.

  2. Same-pod rebuild() racing against apply*. This is the one I want to flag because it bit me when I first thought we could drop the lock from apply*. apply* does next = shallowClone(this.config) before its mutations, then swaps this.config = next at the end. The debounced rebuild() timer-fire (L399) fires on its own thread, holds only rebuildLock, and does the same shape: read this.config, build merged, swap. Without rebuildLock shared between them, an interleaving like

    • apply: next = shallowClone(this.config) (snapshot taken)
    • rebuild: full sweep, then this.config = merged
    • apply: this.config = next (clobbers rebuild's swap)

    loses any state rebuild surfaced from blob storage that apply didn't touch (e.g., a new entity that landed via pub/sub from another pod, or a previously-invalid entity that became valid on this sweep). The same race exists on any pod that has both an in-flight admin write and a pending debounced rebuild — i.e. the writer pod, every time.

So rebuildLock is correctness-load-bearing on every pod, not redundant with the distributed lock. The writer-pod ReentrantLock acquisition on the uncontended apply path is nanoseconds — cheap enough that keeping it doesn't justify a redesign.

Recommend leaving rebuildLock as-is for this PR. If you'd like to revisit the lock layering more broadly (e.g., extend the distributed lock to applyReplicaEvent and rebuild() callsites to remove the local lock entirely), that wants a separate design discussion since it turns the file-store reload callback into a Redis round-trip and creates pub/sub thundering-herd against the bucket lock. Happy to spin it up as a follow-up OQ if you want.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Came back to this with fresh eyes — your concern was real and I had it narrowed to the wrong race. The cross-pod dirty read on partial blob batches: writer pod's AdminApplyController.applyBatch runs resourceService.putResource(...) once per entity, then mergedConfigStore.applyBatch (the in-memory swap) only after ALL blobs are written. During that interval T_write_first_blob → T_inmemory_swap the blob store carries partial-batch state. Any other pod doing rebuild() / applyReplicaEvent / reload() in that window seeds a torn Config — exactly what MODE_ABORT cannot recover from cleanly, and what MODE_SKIP only converges from after subsequent pub/sub events.

The distributed lock on the writer only serialized writers against each other; it did nothing to block the replica's read paths.

Fixed in d8e10df6 on feature/unified-config-server. The 5 read/dispatch callsites in MergedConfigStore (init, reload, rebuildNow, onRebuildTimerFire, onResourceEventapplyReplicaEvent) now acquire the same underBucketLocks(ADMIN_BUCKET_LOCATIONS) the writers acquire. The lock-holding writer pod owns the cross-bucket read horizon until its in-memory swap completes; replica rebuilds, pub/sub fast-paths, and admin-reload all wait.

Layering, summarized:

Outer (distributed, cross-pod) Inner (intra-pod) Acquired by
ADMIN_BUCKET_LOCATIONS via underBucketLocks rebuildLock (ReentrantLock) init, reload, rebuildNow, debounced timer-fire, applyReplicaEvent dispatch, write controllers around apply*
rebuildLock only applyEntityWrite, applyEntityDelete, applyBatch, applySettingsWrite, applySettingsDelete (their callers hold the outer lock)

The inner rebuildLock stays as defensive same-pod serialization (e.g., timer-fire rebuild on a virtual thread vs. an in-flight controller apply on another virtual thread on the same pod). Removing it would require contract-changing every public apply* to "must be called under the outer lock" — there's no compile-time guard for that today, so I kept the local lock and added contract Javadocs to every apply* so the contract is explicit. Tracked as a future Simplicity-First cleanup, not load-bearing now.

Non-reentrancy. LockService.lock generates a fresh random owner per call; the Redis Lua script doesn't recognize re-entry. So nested underBucketLocks on the same thread would deadlock. Verified: none of the 5 new callsites runs inside an outer lock — init is at boot, reload enters from ConfigController.taskExecutor.submit, rebuildNow has no production caller, the timer fire and pub/sub dispatch both go through fresh taskExecutor.submit virtual threads. Each apply* method's Javadoc now explicitly states the caller-holds-outer-lock contract.

Event loop. No callsite I touched runs on the event loop — all five hop to virtual threads before acquiring Redis locks.

Tests.

  • New integration test AdminReadSerializationTest.testReloadBlocksWhileAdminLockHeld — holds the platform bucket lock from the test thread, fires POST /v1/ops/config/reload, asserts the reload blocks (TimeoutException at 500ms) until release. Mirrors the existing AdminWriteSerializationTest pattern.
  • ./gradlew checkstyleMain checkstyleTest :server:test — full server suite passes.

Doc 02 §4.4 rewritten to describe the new layering, the non-reentrancy rule, the lock-ordering invariant, and the inner/outer split. Two stale references to the removed AdminWriteLockService cleaned up.

Withdrawing my recommendation from r3303191590 — option (c) as I had described it was unsafe (writer-pod apply* and timer-fire race on the same this.config swap). The fix above is closer in spirit to your original suggestion: replica reads serialize against writer writes via the same bucket locks. Ready for another look.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reverted the lock-layering change (68f80668 reverts d8e10df6) — want to think about this a little more before committing to an approach. The cross-pod dirty-read is real, but extending the bucket locks to every rebuild / replica-event / reload path turns frequent local operations into Redis round-trips and changes the propagation model more than I'd like to do without more thought (pub/sub thundering-herd, file-reload callback latency, the no-reentrancy footgun on apply*). Will come back with a more considered proposal. Leaving the thread open.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 71d9066. The three full-rebuild entry points — reload(), rebuildNow(), and the debounced onRebuildTimerFire() — now wrap their rebuild() blob scan in lockService.underBucketLocks(MergedConfigStore.ADMIN_BUCKET_LOCATIONS, …), the same [public/, platform/] bucket locks the write controllers (ConfigResourceController, AdminApplyController) already acquire. A writer pod holds those locks across its whole multi-blob batch and the in-memory swap, so a peer pod's rebuild can no longer scan partial-batch (applyBatch-in-progress) state.

Scope, per your follow-up:

  • init() is left unlocked — it runs once at startup before any concurrent admin writer.
  • The replica fast path (onResourceEvent/applyReplicaEvent) and the apply* partial-update methods are untouched. The replica path is best-effort and self-heals via its requestRebuild() fallback plus the polling SLA, so locking the high-frequency per-event path would serialize every replica against every writer for no correctness gain over the fallback.

ADMIN_BUCKET_LOCATIONS is consolidated to a single public static final on MergedConfigStore (both controllers reference it) so the reader and writer lock-key sets stay byte-identical. underBucketLocks sorts keys internally and the locked callsites run on virtual threads (taskExecutor), so no ordering inversion and no event-loop blocking. New regression test AdminReadSerializationTest holds the platform-bucket lock and asserts POST /v1/ops/config/reload blocks until release.

Comment thread server/src/main/java/com/epam/aidial/core/server/config/MergedConfigStore.java Outdated
SiarheiFedziukovich and others added 3 commits May 23, 2026 00:48
…atch

Reviewer flagged that applyBatch cloned the full per-type map on each entry
via putEntity/removeEntity, yielding O(batch × map) work — 100 entries on
a 100-entry map = 10k operations.

Pre-clone each touched type-map once at the top of applyBatch and mutate
in place across all entries; single volatile swap of this.config at the
end. Per-entry validation/coercion failures roll back the (type, canonicalId)
slot in next + the matching invalidEntities record so subsequent entries
observe the pre-change state, preserving single-entity atomicity.

applyEntityWriteLocked / applyEntityDeleteLocked now share the same
in-place helpers (cloneTypeMap once + applyChangeInPlace).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Reviewer flagged the wrapper as reinventing LockService.underBucketLocks
functionality. Wrapper deleted; the five admin-write callsites inline
lockService.underBucketLocks(ADMIN_BUCKET_LOCATIONS, () -> ...) directly,
with ADMIN_BUCKET_LOCATIONS = List.of(PUBLIC_LOCATION, PLATFORM_LOCATION)
declared per-controller.

Lock-key identity, sorted acquisition order, and cluster-wide admin-write
serialization (02-architecture.md §4.4) are preserved — the wrapper used
the same BlobStorageUtil.toStoragePath + sorted derivation that
underBucketLocks already performs.

AdminWriteSerializationTest comment updated; lock-probe key derivation
unchanged.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Sync 17 commits from development (PRs #1509#1560): toolset tool
listing/discovery fixes, Snowflake-managed MCP "Invalid client" fix,
auto-sharing for prompts + citation attachments, applicationTypeRoutes
exposure, MCP feature flag for schema-rich apps, Veo 3.1 timeout fix,
Netty CVE remediation, CI workflow bumps.

Conflict resolution:
- build.gradle: adopt dev's netty_version=4.1.133.Final property and
  subprojects-level netty-transport-native-epoll exclusion (CVE-compliant
  per project_netty_vertx_pin.md memory); rewrite this branch's
  netty-codec / netty-codec-http forces to use ${netty_version}; drop the
  now-redundant transport-native-epoll force.

gradle.properties (auto-merged): Jackson bump 2.18.6 → 2.21.2, new
mcp_sdk_version=1.1.2 and netty_version=4.1.133.Final.

server/build.gradle (auto-merged): MCP SDK deps added by dev coexist
with this branch's vertx-web-client scope dedupe.

AccessService.java (auto-merged): dev's attached-prompts check sits
above the attached-deployments block; non-overlapping with this branch's
admin-authz preflight surface.

Per-sync merge override per IMPLEMENTATION.md §3.2 — rationale: keep
PR #1529 review threads anchored to existing SHAs (force-push would
detach review-thread context).

./gradlew checkstyleMain checkstyleTest :server:test :storage:test
:config:test :credentials:test all green.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@ai-dial-actions

This comment has been minimized.

SiarheiFedziukovich and others added 9 commits May 26, 2026 12:48
ApplicationService and ToolSetService were unconditionally stripping
forwardAuthToken on every write — pre-existing user-safety policy from
00c75ec, but it locked admin out too. Adds `preserveForwardAuthToken`
parameter; controllers compute `descriptor.isPublic() && hasAdminAccess`
for the carve-out. Bulk apply preserves; publication-approval keeps
stripping (defense in depth).

Design anchors: 04 §1.4 (amended); IMPLEMENTATION.md §5.5
Tests: server/src/test/java/.../ConfigAdminAppToolsetWriteTest.java

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
DIAL team agreed in review call to postpone the plaintext-reveal feature.
Encryption-at-rest stays; @JsonProperty(WRITE_ONLY) drops secret fields
from GET responses instead of emitting "***". File-config /keys denied
for every caller (admin included) — file map keys equal secrets per
OQ-12 and no role distinguishes "may read secret-equivalent names" from
plain admin anymore. Design preserved in 04 §2.6 marked deferred.

Design anchors: 04 §1.5, §1.5.1, §2.5, §2.6 (amended); 03 §1, §3.1
(amended); 08 OQ-12 (amended); IMPLEMENTATION.md §5.5
Tests: full :server:test suite passes; new DualMapperTest /
SecretFieldProcessorTest coverage for stripEncryptedFields and the
mask-as-literal preserve-on-omit contract

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…eload

Closes the cross-pod dirty-read window where a replica pod's rebuild(),
applyReplicaEvent, or reload() could scan blob storage while another pod's
AdminApplyController.applyBatch is mid-batch (some blobs written, in-memory
swap pending). Wraps the 5 read/dispatch callsites in MergedConfigStore in
underBucketLocks(ADMIN_BUCKET_LOCATIONS, ...) — same key set the writers
acquire. Inner rebuildLock retained as intra-pod defensive serialization.
Lock contract documented on every apply*/rebuild method.

Design anchors: 02 §4.4 (rewritten)
Tests: server/src/test/.../AdminReadSerializationTest.java

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…d reads under admin bucket locks

Closes the cross-pod dirty-read window where a peer pod's full-rebuild blob
scan could observe partial-batch state: a writer pod's apply writes blobs
one-by-one then swaps the in-memory Config after the loop, so the blob store
carries a torn view during the batch.

Per reviewer astsiapanay's narrowing, the three full-rebuild entry points —
reload(), rebuildNow(), and the debounced onRebuildTimerFire() — now wrap their
rebuild() blob scan in lockService.underBucketLocks(ADMIN_BUCKET_LOCATIONS, …),
the same [public/, platform/] bucket locks the write controllers already hold.
init() (startup-only) and the replica fast path (onResourceEvent /
applyReplicaEvent, best-effort with a requestRebuild() fallback + polling SLA)
are deliberately left unlocked.

ADMIN_BUCKET_LOCATIONS is consolidated to a single public static final on
MergedConfigStore; both controllers now reference it so reader and writer
lock-key sets stay byte-identical.

Design anchors: 02 §2 / §4.4 (rewritten) / §11.1
Tests: server/.../AdminReadSerializationTest.java

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Aligns the lite docs with the full proposal: corrects the pre-existing
AdminWriteLockService / "admin-writes" single-key drift to the actual
two-bucket underBucketLocks mechanism, and documents that the full-rebuild
blob scan (reload / rebuildNow / debounced timer) shares those locks while
init() and the replica fast path are deliberately left unlocked.

Touches lite 02-architecture.md (locking table row + serialization paragraph)
and 03-api.md (cluster-wide serialization bullet).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Resolve conflicts:
- ResourceTypes: keep unified-config enums + RESPONSE_MAPPING
- .gitignore: keep both ignore sets
- App/CustomApplication tests: adopt #1565 secured upstream exposure
  (upstreams returned only to write-access callers)

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@ai-dial-actions

This comment has been minimized.

Records the core-team concern that Upstream.extraData is a wide property
(more often non-secret than secret, sometimes mixing both in one JSON
document) while the post-U.4 state encrypts it wholesale with no read
path. Proposed design: plaintext extraData + encrypted secretExtraData
with merge-at-consumption on the X-UPSTREAM-EXTRA-DATA header. Status
PROPOSED, not locked - full design in 04 §2.11, OQ-34 open, slice U.5
registered ⏸ blocked pending core-team decision; no locked text amended.

Design anchors: 04 §2.11 (new, proposed); 08 OQ-34 (new, open)
Tests: no new tests (docs-only)

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@ai-dial-actions
Copy link
Copy Markdown
Contributor

Dependency Review

✅ No vulnerabilities or license issues or OpenSSF Scorecard issues found.

Scanned Files

None

SiarheiFedziukovich and others added 2 commits June 3, 2026 23:44
…s deny-all

The lite docs still described the pre-U.4 security-admin tier gating
file-sourced keys; U.4 retired that tier and denies type=keys for every
caller including admin. Also document the names-only stub shape of the
file-config list endpoint.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…listing

The list response shape of GET /v1/admin/config/file/{type} existed only
in FileConfigController and its tests — 03-api-reference.md defined the
endpoints and authz but never the wire contract. Pin it: names-only stub
items, full bodies only on the single-entry GET, bodies-in-bulk deferred
to /v1/admin/export (Defer.1). Extra per-item fields (type, description)
are YAGNI but the shape is forward-compatible with adding them.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants