| 1 | +--- |
| 2 | +name: adr-0037-scheduler-attribution-design |
| 3 | +description: Attribute scheduled operations to distinct actor identities rather than anonymous system context |
| 4 | +triggers: |
| 5 | + - Implementing or modifying scheduled job execution in any service |
| 6 | + - Adding a new scheduler or worker background process |
| 7 | + - Reviewing audit trail completeness for SOC 2 or ISO 27001 |
| 8 | + - Designing actor identity for non-human system operations |
| 9 | +instructions: | |
| 10 | + Schedulers must inject an Actor into the execution context before calling any |
| 11 | + downstream service. Use auth.WithActor with Type=ActorTypeScheduler, |
| 12 | + Authenticated=false, and ID in the format "system:scheduler:{scheduler-name}". |
| 13 | + Never set Authenticated=true in schedulers or workers - only the gRPC auth |
| 14 | + interceptor may set that field. Use a separate actorContextKey (not UserIDContextKey) |
| 15 | + to prevent scheduler identity from being mistaken for an authenticated user session. |
| 16 | +--- |
| 17 | + |
| 18 | +# 37. Scheduler Attribution Design |
| 19 | + |
| 20 | +Date: 2026-04-07 |
| 21 | + |
| 22 | +## Status |
| 23 | + |
| 24 | +Accepted |
| 25 | + |
| 26 | +## Context |
| 27 | + |
| 28 | +Meridian's scheduler infrastructure executes jobs on behalf of tenants using |
| 29 | +`context.Background()`. Without additional identity context, any mutations |
| 30 | +performed during execution are recorded with `changed_by = "system"` - an |
| 31 | +anonymous attribution that fails two compliance requirements: |
| 32 | + |
| 33 | +- **SOC 2 CC6.1**: Logical and physical access controls must identify the |
| 34 | + actor responsible for each operation. |
| 35 | +- **ISO 27001 A.5.16**: Identity management requires that operations be |
| 36 | + attributable to a specific, identifiable actor. |
| 37 | + |
| 38 | +When the audit trail shows `changed_by = "system"` for all scheduler-triggered |
| 39 | +mutations, it is impossible to distinguish human operations from scheduled |
| 40 | +operations, or to identify which scheduler triggered a particular change. |
| 41 | + |
| 42 | +### The Trust Escalation Risk |
| 43 | + |
| 44 | +The identity service uses `GetUserIDFromContext` as an authentication gate: |
| 45 | + |
| 46 | +```go |
| 47 | +// services/identity/service/grpc_identity_endpoints.go:129 |
| 48 | +if _, ok := auth.GetUserIDFromContext(ctx); !ok { |
| 49 | + return nil, status.Errorf(codes.Unauthenticated, "missing authentication context") |
| 50 | +} |
| 51 | + |
| 52 | +// services/identity/service/grpc_role_endpoints.go:147 |
| 53 | +if _, ok := auth.GetUserIDFromContext(ctx); !ok { |
| 54 | + return nil, status.Errorf(codes.Unauthenticated, "missing authentication context") |
| 55 | +} |
| 56 | +``` |
| 57 | + |
| 58 | +If a scheduler identity were placed in `UserIDContextKey` (the same key used |
| 59 | +by the gRPC auth interceptor for JWT-validated sessions), scheduler jobs would |
| 60 | +bypass authentication checks and gain access to identity-management endpoints. |
| 61 | +This would violate the principle of least privilege and allow a misconfigured |
| 62 | +scheduler to perform privileged operations. |
| 63 | + |
| 64 | +### Phase Design |
| 65 | + |
| 66 | +Attribution is introduced in two phases: |
| 67 | + |
| 68 | +- **Phase A (current)**: Attributed identity. The scheduler asserts its own |
| 69 | + identity string. The claim is not cryptographically verified - it is trusted |
| 70 | + because it originates from platform-controlled code, not from external input. |
| 71 | +- **Phase C (deferred)**: Authenticated identity. Each scheduler acquires a |
| 72 | + short-lived JWT from the platform's identity provider. The gRPC auth |
| 73 | + interceptor verifies the token and sets `Authenticated=true`. |
| 74 | + |
| 75 | +## Decision Drivers |
| 76 | + |
| 77 | +* Audit trails must identify which scheduler triggered each mutation for |
| 78 | + SOC 2 and ISO 27001 compliance |
| 79 | +* Scheduler identity must not be mistakable for an authenticated human session |
| 80 | +* `Actor.Authenticated=false` must be preserved to prevent privilege escalation |
| 81 | + through trust promotion |
| 82 | +* The design must be forward-compatible with Phase C JWT-based authentication |
| 83 | +* Multiple scheduler instances (billing, settlement, catch-up) must produce |
| 84 | + distinguishable audit records |
| 85 | + |
| 86 | +## Considered Options |
| 87 | + |
| 88 | +1. **Attributed identity with separate context key** (chosen) |
| 89 | +2. **Attributed identity reusing UserIDContextKey** |
| 90 | +3. **Defer all attribution to Phase C JWT authentication** |
| 91 | +4. **Service account per scheduler with full JWT issuance** |
| 92 | + |
| 93 | +## Decision Outcome |
| 94 | + |
| 95 | +Chosen option: **Attributed identity with separate context key**, because it |
| 96 | +satisfies the immediate compliance requirements while preserving the security |
| 97 | +boundary between scheduler and authenticated-user identity paths. |
| 98 | + |
| 99 | +### Implementation |
| 100 | + |
| 101 | +The `Actor` struct in `shared/platform/auth/actor.go` carries four fields: |
| 102 | + |
| 103 | +| Field | Purpose | |
| 104 | +|-------|---------| |
| 105 | +| `ID` | Identifier string, e.g. `system:scheduler:billing-cron` | |
| 106 | +| `Type` | `ActorTypeScheduler`, `ActorTypeWorker`, `ActorTypeHuman`, etc. | |
| 107 | +| `Authenticated` | `false` for all schedulers; `true` only when set by the gRPC auth interceptor | |
| 108 | +| `Source` | Describes the injection path, e.g. `"cron-scheduler"`, `"catch-up"` | |
| 109 | + |
| 110 | +The `actorContextKey` type is an unexported struct distinct from `contextKey` |
| 111 | +(used for `UserIDContextKey`). This structural separation is the enforcement |
| 112 | +mechanism: `auth.GetUserIDFromContext` cannot retrieve an Actor, and |
| 113 | +`auth.ActorFromContext` cannot retrieve a user ID. The two identity channels |
| 114 | +cannot collide regardless of the values they carry. |
| 115 | + |
| 116 | +The Actor is injected in `executeJob` (live cron execution) and |
| 117 | +`catchUpSchedule` (startup catch-up) before any downstream call: |
| 118 | + |
| 119 | +```go |
| 120 | +// shared/platform/scheduler/cron.go - executeJob |
| 121 | +ctx = auth.WithActor(ctx, auth.Actor{ |
| 122 | + ID: fmt.Sprintf("system:scheduler:%s", s.config.Name), |
| 123 | + Type: auth.ActorTypeScheduler, |
| 124 | + Authenticated: false, |
| 125 | + Source: "cron-scheduler", |
| 126 | +}) |
| 127 | + |
| 128 | +// shared/platform/scheduler/catchup.go - catchUpSchedule |
| 129 | +ctx = auth.WithActor(ctx, auth.Actor{ |
| 130 | + ID: fmt.Sprintf("system:scheduler:%s", s.config.Name), |
| 131 | + Type: auth.ActorTypeScheduler, |
| 132 | + Authenticated: false, |
| 133 | + Source: "catch-up", |
| 134 | +}) |
| 135 | +``` |
| 136 | + |
| 137 | +The `GetUserFromContext` function in `shared/platform/audit/context.go` checks |
| 138 | +for an Actor first, then falls back to `UserIDContextKey`, then to |
| 139 | +`DefaultAuditUser`: |
| 140 | + |
| 141 | +```go |
| 142 | +if actor, ok := auth.ActorFromContext(ctx); ok && actor.ID != "" { |
| 143 | + return actor.ID |
| 144 | +} |
| 145 | +userID, ok := auth.GetUserIDFromContext(ctx) |
| 146 | +// ... return userID when ok; otherwise fall through to the default |
| 147 | +return DefaultAuditUser |
| 148 | +``` |
| 149 | + |
| 150 | +The `changed_by` column in audit tables will contain |
| 151 | +`system:scheduler:{scheduler-name}` for all scheduler-triggered mutations, |
| 152 | +making them distinguishable from human operations (`<user-uuid>`) and from |
| 153 | +genuinely anonymous operations (`system`). |
| 154 | + |
| 155 | +The scheduler name is injected at construction time via `CronSchedulerConfig.Name`, |
| 156 | +meaning different scheduler instances (e.g., `billing-cron`, `settlement-cron`) |
| 157 | +produce distinct attribution strings without any shared configuration. |
| 158 | + |
| 159 | +Tenant ID is not included in the `changed_by` string. The audit trail is |
| 160 | +already scoped to the tenant schema; including the tenant ID in `changed_by` |
| 161 | +would denormalise data that is implicit from the row's location. |
| 162 | + |
| 163 | +### Positive Consequences |
| 164 | + |
| 165 | +* Audit trails distinguish scheduler-triggered mutations from human operations |
| 166 | + and anonymous system operations, satisfying SOC 2 CC6.1 and ISO 27001 A.5.16 |
| 167 | +* `Actor.Authenticated=false` ensures schedulers cannot be promoted to |
| 168 | + authenticated status by any downstream code path |
| 169 | +* The separate context key enforces the boundary at the type level; a check built |
| 170 | + on `auth.GetUserIDFromContext` cannot accidentally treat a scheduler actor as an authenticated user |
| 171 | +* The `Source` field on `Actor` supports forensic analysis: catch-up executions |
| 172 | + (`"catch-up"`) are distinguishable from live cron executions (`"cron-scheduler"`) |
| 173 | + in diagnostic logs even when both carry the same `ID` |
| 174 | +* The design is forward-compatible with Phase C: once schedulers present |
| 175 | + verified JWTs, the interceptor sets `Authenticated=true`; the `Actor` struct, |
| 176 | + context keys, and downstream audit code need no changes |
| 177 | + |
| 178 | +### Negative Consequences |
| 179 | + |
| 180 | +* Attribution strings are asserted, not verified. A bug in platform code could |
| 181 | + inject an incorrect `Actor.ID`. Mitigation: attribution is set only in |
| 182 | + platform-controlled scheduler code, not at service boundaries or in |
| 183 | + tenant-configurable logic |
| 184 | +* A tenant may be suspended after a scheduler acquires context but before |
| 185 | + execution completes. The tenant status check runs before semaphore acquisition |
| 186 | + and execution, but not continuously during execution. This is an acceptable |
| 187 | + window given that execution is bounded (default: 5 minutes) |
| 188 | + |
| 189 | +## Pros and Cons of the Options |
| 190 | + |
| 191 | +### Option 1: Attributed identity with separate context key (chosen) |
| 192 | + |
| 193 | +* Good, because audit compliance is satisfied immediately without Phase C |
| 194 | +* Good, because the structural key separation prevents scheduler identity from |
| 195 | + being mistaken for an authenticated session at the type level |
| 196 | +* Good, because multiple schedulers produce distinct attribution without |
| 197 | + additional configuration |
| 198 | +* Good, because `Actor.Authenticated=false` is a stable invariant that |
| 199 | + downstream authorization checks can rely on |
| 200 | +* Bad, because attribution strings are not cryptographically verified |
| 201 | + |
| 202 | +### Option 2: Attributed identity reusing UserIDContextKey |
| 203 | + |
| 204 | +* Good, because no new context key is needed |
| 205 | +* Bad, because scheduler identity would pass the `GetUserIDFromContext` auth |
| 206 | + gate in identity endpoints, granting schedulers unintended access to |
| 207 | + privileged operations |
| 208 | +* Bad, because the audit trail cannot distinguish a scheduler from an |
| 209 | + authenticated human user without inspecting the ID format string |
| 210 | + |
| 211 | +### Option 3: Defer all attribution to Phase C JWT authentication |
| 212 | + |
| 213 | +* Good, because authenticated identity is stronger than attributed identity |
| 214 | +* Bad, because the compliance gap persists until Phase C is complete |
| 215 | +* Bad, because Phase C requires identity provider integration, token issuance |
| 216 | + infrastructure, and interceptor changes - a multi-week effort |
| 217 | +* Bad, because a scheduler with no identity at all fails SOC 2 CC6.1 today |
| 218 | + |
| 219 | +### Option 4: Service account per scheduler with full JWT issuance |
| 220 | + |
| 221 | +* Good, because each scheduler has a cryptographically verified identity today |
| 222 | +* Bad, because it requires standing up service account management, token issuance, |
| 223 | + and rotation infrastructure before any compliance benefit is realised |
| 224 | +* Bad, because it is overkill for the current threat model: schedulers run in |
| 225 | + platform-controlled code, not at tenant-configurable boundaries |
| 226 | + |
| 227 | +## Links |
| 228 | + |
| 229 | +* PR #2151 - Update audit context to check Actor for scheduler attribution |
| 230 | +* PR #2163 - Inject Actor and correlation ID in scheduler executeJob |
| 231 | +* `shared/platform/auth/actor.go` - Actor struct and actorContextKey |
| 232 | +* `shared/platform/audit/context.go` - GetUserFromContext with Actor check |
| 233 | +* `shared/platform/scheduler/cron.go` - Actor injection in executeJob |
| 234 | +* `shared/platform/scheduler/catchup.go` - Actor injection in catchUpSchedule |
| 235 | + |
| 236 | +## Notes |
| 237 | + |
| 238 | +* **Phase C trigger**: When service accounts or machine-identity JWTs are |
| 239 | + introduced, the scheduler should acquire a short-lived token from the |
| 240 | + identity provider at startup and refresh it on expiry. The gRPC auth |
| 241 | + interceptor would verify the token and set `Authenticated=true`. No changes |
| 242 | + to the `Actor` struct, context keys, or downstream audit code are required. |
| 243 | +* **New scheduler checklist**: Any new background worker or scheduler must |
| 244 | + inject an `Actor` with `Authenticated=false` before its first downstream |
| 245 | + call. Omitting this reverts attribution to `"system"` and reopens the |
| 246 | + compliance gap. |
| 247 | +* **Do not copy Authenticated from external input**: The `Actor.Authenticated` |
| 248 | + field must never be populated from proto messages, HTTP headers, or request |
| 249 | + bodies. Only the gRPC auth interceptor may set it to `true`. |
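
One way to make the checklist harder to get wrong is a small constructor that never accepts `Authenticated` as input. The sketch below is an illustration only: `NewSchedulerActor` is a hypothetical helper that does not exist in the referenced PRs, and the `Actor` and `ActorType` definitions are assumed stand-ins for those in `shared/platform/auth/actor.go`.

```go
package main

import (
	"errors"
	"fmt"
)

// Assumed stand-ins for the real auth types.
type ActorType string

const ActorTypeScheduler ActorType = "scheduler"

type Actor struct {
	ID            string
	Type          ActorType
	Authenticated bool
	Source        string
}

// NewSchedulerActor (hypothetical) builds an attributed scheduler identity.
// It has no parameter for Authenticated, so the flag cannot be copied from
// a proto message, header, or request body by mistake.
func NewSchedulerActor(name, source string) (Actor, error) {
	if name == "" {
		return Actor{}, errors.New("scheduler name must not be empty")
	}
	return Actor{
		ID:            fmt.Sprintf("system:scheduler:%s", name),
		Type:          ActorTypeScheduler,
		Authenticated: false, // never derived from caller input
		Source:        source,
	}, nil
}

func main() {
	actor, err := NewSchedulerActor("settlement-cron", "cron-scheduler")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%s authenticated=%t source=%s\n", actor.ID, actor.Authenticated, actor.Source)
}
```

Rejecting an empty name avoids emitting the ambiguous attribution string `system:scheduler:`, which would blur the distinction between scheduler instances in the audit trail.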