Commit 06bac4a

docs: add ADR-0037 for scheduler attribution design (#2165)
Documents the decision to use attributed identity (Actor struct with Authenticated=false) with a separate actorContextKey to prevent trust escalation, satisfying SOC 2 CC6.1 and ISO 27001 A.5.16 requirements. Covers Phase A implementation, context key separation rationale, and Phase C JWT authentication forward-compatibility.

Co-authored-by: Ben Coombs <bjcoombs@users.noreply.github.com>
1 parent 1ab9bbd commit 06bac4a

1 file changed

Lines changed: 249 additions & 0 deletions

@@ -0,0 +1,249 @@
---
name: adr-0037-scheduler-attribution-design
description: Attribute scheduled operations to distinct actor identities rather than anonymous system context
triggers:
  - Implementing or modifying scheduled job execution in any service
  - Adding a new scheduler or worker background process
  - Reviewing audit trail completeness for SOC 2 or ISO 27001
  - Designing actor identity for non-human system operations
instructions: |
  Schedulers must inject an Actor into the execution context before calling any
  downstream service. Use auth.WithActor with Type=ActorTypeScheduler,
  Authenticated=false, and ID in the format "system:scheduler:{scheduler-name}".
  Never set Authenticated=true in schedulers or workers - only the gRPC auth
  interceptor may set that field. Use a separate actorContextKey (not UserIDContextKey)
  to prevent scheduler identity from being mistaken for an authenticated user session.
---

# 37. Scheduler Attribution Design

Date: 2026-04-07

## Status

Accepted

## Context

Meridian's scheduler infrastructure executes jobs on behalf of tenants using
`context.Background()`. Without additional identity context, any mutations
performed during execution are recorded with `changed_by = "system"` - an
anonymous attribution that fails two compliance requirements:

- **SOC 2 CC6.1**: Logical and physical access controls must identify the
  actor responsible for each operation.
- **ISO 27001 A.5.16**: Identity management requires that operations be
  attributable to a specific, identifiable actor.

When the audit trail shows `changed_by = "system"` for all scheduler-triggered
mutations, it is impossible to distinguish human operations from scheduled
operations, or to identify which scheduler triggered a particular change.

### The Trust Escalation Risk

The identity service uses `GetUserIDFromContext` as an authentication gate:

```go
// services/identity/service/grpc_identity_endpoints.go:129
if _, ok := auth.GetUserIDFromContext(ctx); !ok {
	return nil, status.Errorf(codes.Unauthenticated, "missing authentication context")
}

// services/identity/service/grpc_role_endpoints.go:147
if _, ok := auth.GetUserIDFromContext(ctx); !ok {
	return nil, status.Errorf(codes.Unauthenticated, "missing authentication context")
}
```

If a scheduler identity were placed in `UserIDContextKey` (the same key used
by the gRPC auth interceptor for JWT-validated sessions), scheduler jobs would
bypass authentication checks and gain access to identity-management endpoints.
This would violate the principle of least privilege and allow a misconfigured
scheduler to perform privileged operations.

### Phase Design

Attribution is introduced in two phases:

- **Phase A (current)**: Attributed identity. The scheduler asserts its own
  identity string. The claim is not cryptographically verified - it is trusted
  because it originates from platform-controlled code, not from external input.
- **Phase C (deferred)**: Authenticated identity. Each scheduler acquires a
  short-lived JWT from the platform's identity provider. The gRPC auth
  interceptor verifies the token and sets `Authenticated=true`.

## Decision Drivers

* Audit trails must identify which scheduler triggered each mutation for
  SOC 2 and ISO 27001 compliance
* Scheduler identity must not be mistakable for an authenticated human session
* `Actor.Authenticated=false` must be preserved to prevent privilege escalation
  through trust promotion
* The design must be forward-compatible with Phase C JWT-based authentication
* Multiple scheduler instances (billing, settlement, catch-up) must produce
  distinguishable audit records

## Considered Options

1. **Attributed identity with separate context key** (chosen)
2. **Attributed identity reusing UserIDContextKey**
3. **Defer all attribution to Phase C JWT authentication**
4. **Service account per scheduler with full JWT issuance**

## Decision Outcome

Chosen option: **Attributed identity with separate context key**, because it
satisfies the immediate compliance requirements while preserving the security
boundary between scheduler and authenticated-user identity paths.

### Implementation

The `Actor` struct in `shared/platform/auth/actor.go` carries four fields:

| Field | Purpose |
|-------|---------|
| `ID` | Identifier string, e.g. `system:scheduler:billing-cron` |
| `Type` | `ActorTypeScheduler`, `ActorTypeWorker`, `ActorTypeHuman`, etc. |
| `Authenticated` | `false` for all schedulers; `true` only when set by the gRPC auth interceptor |
| `Source` | Describes the injection path, e.g. `"cron-scheduler"`, `"catch-up"` |

The `actorContextKey` type is an unexported struct distinct from `contextKey`
(used for `UserIDContextKey`). This structural separation is the enforcement
mechanism: `auth.GetUserIDFromContext` cannot retrieve an Actor, and
`auth.ActorFromContext` cannot retrieve a user ID. The two identity channels
cannot collide regardless of the values they carry.

The Actor is injected in `executeJob` (live cron execution) and
`catchUpSchedule` (startup catch-up) before any downstream call:

```go
// shared/platform/scheduler/cron.go - executeJob
ctx = auth.WithActor(ctx, auth.Actor{
	ID:            fmt.Sprintf("system:scheduler:%s", s.config.Name),
	Type:          auth.ActorTypeScheduler,
	Authenticated: false,
	Source:        "cron-scheduler",
})

// shared/platform/scheduler/catchup.go - catchUpSchedule
ctx = auth.WithActor(ctx, auth.Actor{
	ID:            fmt.Sprintf("system:scheduler:%s", s.config.Name),
	Type:          auth.ActorTypeScheduler,
	Authenticated: false,
	Source:        "catch-up",
})
```

The `GetUserFromContext` function in `shared/platform/audit/context.go` checks
for an Actor first, then falls back to `UserIDContextKey`, then to
`DefaultAuditUser`:

```go
if actor, ok := auth.ActorFromContext(ctx); ok && actor.ID != "" {
	return actor.ID
}
userID, ok := auth.GetUserIDFromContext(ctx)
// ...
return DefaultAuditUser
```

The `changed_by` column in audit tables will contain
`system:scheduler:{scheduler-name}` for all scheduler-triggered mutations,
making them distinguishable from human operations (`<user-uuid>`) and from
genuinely anonymous operations (`system`).

The scheduler name is injected at construction time via `CronSchedulerConfig.Name`,
meaning different scheduler instances (e.g., `billing-cron`, `settlement-cron`)
produce distinct attribution strings without any shared configuration.
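To make the per-instance naming concrete, the sketch below derives the attribution string from a config value. `CronSchedulerConfig.Name` and the `system:scheduler:%s` format come from this ADR; the `CronScheduler` struct, the `NewCronScheduler` constructor, and the `actorID` helper are hypothetical names introduced for illustration.

```go
package main

import "fmt"

// CronSchedulerConfig mirrors the config type named in the ADR; only the Name
// field is modelled here.
type CronSchedulerConfig struct {
	Name string
}

// CronScheduler is a hypothetical minimal scheduler holding its config.
type CronScheduler struct {
	config CronSchedulerConfig
}

// NewCronScheduler is a hypothetical constructor that fixes the name once.
func NewCronScheduler(cfg CronSchedulerConfig) *CronScheduler {
	return &CronScheduler{config: cfg}
}

// actorID is a hypothetical helper that builds the attribution string in the
// format used by executeJob and catchUpSchedule.
func (s *CronScheduler) actorID() string {
	return fmt.Sprintf("system:scheduler:%s", s.config.Name)
}

func main() {
	billing := NewCronScheduler(CronSchedulerConfig{Name: "billing-cron"})
	settlement := NewCronScheduler(CronSchedulerConfig{Name: "settlement-cron"})

	fmt.Println(billing.actorID())    // system:scheduler:billing-cron
	fmt.Println(settlement.actorID()) // system:scheduler:settlement-cron
}
```

Centralising the format in one helper is a design option rather than something the ADR prescribes; it would keep the live and catch-up paths from drifting apart.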

Tenant ID is not included in the `changed_by` string. The audit trail is
already scoped to the tenant schema; including the tenant ID in `changed_by`
would denormalise data that is implicit from the row's location.

### Positive Consequences

* Audit trails distinguish scheduler-triggered mutations from human operations
  and anonymous system operations, satisfying SOC 2 CC6.1 and ISO 27001 A.5.16
* `Actor.Authenticated=false` marks scheduler identity as unverified, so
  downstream code that checks the field never treats a scheduler as an
  authenticated principal
* Separate context key enforces the boundary at the type level; a check built
  on `auth.GetUserIDFromContext` cannot accidentally treat a scheduler actor as
  an authenticated user
* The `Source` field on `Actor` supports forensic analysis: catch-up executions
  (`"catch-up"`) are distinguishable from live cron executions (`"cron-scheduler"`)
  in diagnostic logs even when both carry the same `ID`
* The design is forward-compatible with Phase C: adding JWT issuance requires
  only setting `Authenticated=true` in the interceptor; no downstream code
  changes are needed

### Negative Consequences

* Attribution strings are asserted, not verified. A bug in platform code could
  inject an incorrect `Actor.ID`. Mitigation: attribution is set only in
  platform-controlled scheduler code, not at service boundaries or in
  tenant-configurable logic
* A tenant may be suspended after a scheduler acquires context but before
  execution completes. The tenant status check runs before semaphore acquisition
  and execution, but not continuously during execution. This is an acceptable
  window given that execution is bounded (default: 5 minutes)

## Pros and Cons of the Options

### Option 1: Attributed identity with separate context key (chosen)

* Good, because audit compliance is satisfied immediately without Phase C
* Good, because the structural key separation prevents scheduler identity from
  being mistaken for an authenticated session at the type level
* Good, because multiple schedulers produce distinct attribution without
  additional configuration
* Good, because `Actor.Authenticated=false` is a stable invariant that
  downstream authorization checks can rely on
* Bad, because attribution strings are not cryptographically verified

### Option 2: Attributed identity reusing UserIDContextKey

* Good, because no new context key is needed
* Bad, because scheduler identity would pass the `GetUserIDFromContext` auth
  gate in identity endpoints, granting schedulers unintended access to
  privileged operations
* Bad, because the audit trail cannot distinguish a scheduler from an
  authenticated human user without inspecting the ID format string

### Option 3: Defer all attribution to Phase C JWT authentication

* Good, because authenticated identity is stronger than attributed identity
* Bad, because the compliance gap persists until Phase C is complete
* Bad, because Phase C requires identity provider integration, token issuance
  infrastructure, and interceptor changes - a multi-week effort
* Bad, because a scheduler with no identity at all fails SOC 2 CC6.1 today

### Option 4: Service account per scheduler with full JWT issuance

* Good, because each scheduler has a cryptographically verified identity today
* Bad, because it requires standing up service account management, token
  issuance, and rotation infrastructure before any compliance benefit is realised
* Bad, because it is overkill for the current threat model: schedulers run in
  platform-controlled code, not at tenant-configurable boundaries

## Links

* PR #2151 - Update audit context to check Actor for scheduler attribution
* PR #2163 - Inject Actor and correlation ID in scheduler executeJob
* `shared/platform/auth/actor.go` - Actor struct and actorContextKey
* `shared/platform/audit/context.go` - GetUserFromContext with Actor check
* `shared/platform/scheduler/cron.go` - Actor injection in executeJob
* `shared/platform/scheduler/catchup.go` - Actor injection in catchUpSchedule

## Notes

* **Phase C trigger**: When service accounts or machine-identity JWTs are
  introduced, the scheduler should acquire a short-lived token from the
  identity provider at startup and refresh it on expiry. The gRPC auth
  interceptor would verify the token and set `Authenticated=true`. No changes
  to the `Actor` struct, context keys, or downstream audit code are required.
* **New scheduler checklist**: Any new background worker or scheduler must
  inject an `Actor` with `Authenticated=false` before its first downstream
  call. Omitting this reverts attribution to `"system"` and reopens the
  compliance gap.
* **Do not copy Authenticated from external input**: The `Actor.Authenticated`
  field must never be populated from proto messages, HTTP headers, or request
  bodies. Only the gRPC auth interceptor may set it to `true`.
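One way to make the checklist and the external-input rule harder to violate is to route construction through a helper that never accepts an `Authenticated` argument, paired with a check usable in unit tests. The sketch below is hypothetical: `newSchedulerActor` and `checkAttributedActor` are not part of the platform described in this ADR, and `Type` is simplified to a plain string.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// Actor mirrors the four fields described in this ADR, with Type simplified
// to a string for the sketch.
type Actor struct {
	ID            string
	Type          string
	Authenticated bool
	Source        string
}

const schedulerIDPrefix = "system:scheduler:"

// newSchedulerActor is a hypothetical constructor. It takes no Authenticated
// parameter, so callers cannot forward a value from external input.
func newSchedulerActor(name, source string) Actor {
	return Actor{
		ID:            schedulerIDPrefix + name,
		Type:          "scheduler",
		Authenticated: false,
		Source:        source,
	}
}

// checkAttributedActor is a hypothetical guard for tests or startup checks.
func checkAttributedActor(a Actor) error {
	if a.Authenticated {
		return errors.New("attributed actor must not be marked authenticated")
	}
	if !strings.HasPrefix(a.ID, schedulerIDPrefix) || a.ID == schedulerIDPrefix {
		return fmt.Errorf("actor ID %q does not match system:scheduler:{scheduler-name}", a.ID)
	}
	return nil
}

func main() {
	good := newSchedulerActor("billing-cron", "cron-scheduler")
	fmt.Println(checkAttributedActor(good)) // <nil>

	// Simulates a bug where a flag was copied from a request body.
	bad := good
	bad.Authenticated = true
	fmt.Println(checkAttributedActor(bad))
}
```

A guard like this does not verify the claim cryptographically; it only catches the two mistakes the notes above warn about, so it complements Phase A rather than replacing Phase C.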
