| name | prd-per-tenant-scheduled-execution |
|---|---|
| description | Phased architecture for per-tenant scheduling identity: Actor struct for audit attribution, manifest-driven scheduling via a `tenant_schedule` DB table, and deferred authenticated system identity with JWT. |
| triggers | |
| instructions | Key design decisions: `SystemActorContextKey` MUST be separate from `UserIDContextKey` (verified auth bypass risk in identity service endpoints); the Actor struct with an Authenticated boolean prevents trust escalation; `changed_by` format is `system:scheduler:{service}` (no tenant ID); database-backed `tenant_schedule` table, not a ManifestScheduleProvider; Deliverable A (attribution) ships standalone for existing schedulers; Deliverable C (JWT auth) is deferred until cross-service calls are needed |
Meridian's scheduler infrastructure has three gaps that compound as the platform scales:

- No execution identity. All scheduled work (billing, forecasting, reconciliation) runs as bare `context.Background()` with implicit god-mode database access. Every entity mutation records `changed_by = "system"` - indistinguishable across billing runs, catch-up replays, background workers, migrations, and any unauthenticated code path. This fails SOC 2 CC6.1 (logical access controls) and ISO 27001 A.5.16 (identity management).
- Manifest schedule gap. The manifest proto declares `scheduled:` triggers (manifest.proto:289-298) and validates them (uniqueness checks, prefix parsing), but has no cron expression field and no bridge to the `CronScheduler` infrastructure. Validation passes but no schedule is registered - the trigger is syntactically accepted but operationally inert. The MCP server documentation (reference.go:212) contradicts the manifest examples - documenting `scheduled:<cron-expression>` while manifests use `scheduled:<name>`.
- Missing scaling guardrails. `executeJob()` spawns unbounded goroutines via `lifecycle.ExecuteGuarded()` with no concurrency semaphore. Tenant suspension status is never checked before execution. No minimum cron interval is enforced. At the current scale (~3 schedules), these are invisible. At N tenants x M schedule types, they become resource exhaustion and data integrity risks.
The shared/platform/scheduler package provides a CronScheduler with:

- `ScheduleProvider` interface returning all schedules across all tenants
- `Schedule` struct carrying `TenantID` for per-tenant schema routing
- Redis-based distributed locking (`shared/platform/redislock`) preventing duplicate execution across replicas
- `ExecutionStore` for audit trail persistence
- Catch-up logic for missed windows on startup
Three services consume it:
| Service | Provider | Schedule Source | Multi-tenant? |
|---|---|---|---|
| payment-order | BillingScheduleProvider | Static env var | Single tenant per config |
| forecasting | ForecastScheduleProvider | forecasting_strategy DB table | Yes - per-tenant, per-strategy |
| reconciliation | SettlementScheduleProvider | Reference Data gRPC (stub) | Yes (planned) |
The forecasting service is the existence proof - it already does dynamic per-tenant scheduling from a database table, keyed by `tenant_id` with per-strategy cron expressions.
The scheduler creates `context.Background()` at cron.go:296 and injects only tenant context for schema routing. No `UserIDContextKey` is set. The audit system (shared/platform/audit) falls back to `DefaultAuditUser = "system"` (audit/context.go:11-13).
The identity service uses `auth.GetUserIDFromContext(ctx)` as an authentication gate in 5+ endpoints (grpc_identity_endpoints.go:129, grpc_role_endpoints.go:147, etc.). Any value in `UserIDContextKey` - including an attributed string - passes these gates. This means attributed identity MUST NOT be injected into `UserIDContextKey`.
- A `"service"` RBAC role is defined (shared/platform/auth/rbac.go:32) with account/position/transaction permissions but is assigned to no identity - forward-looking scaffolding for system actors.
- OAuth2 client credentials exist for service-to-service token exchange (`shared/platform/auth/service_auth.go`) but are not used by the scheduler.
The original question - "per-tenant system user with auth token" - is the right destination but wrong starting point. The dependency chain is: attribution before authentication, scheduler hardening before manifest bridge.
Attributed identity is a structured string in context that appears in audit trails. It answers "who did this?" without cryptographic proof. Sufficient for SOC 2 Type I and most Type II audits with compensating controls.
Authenticated identity is a verified principal (JWT) that passes through the auth interceptor chain. It answers "who did this AND were they authorized?" Required when scheduled work makes cross-service authenticated gRPC calls.
Phase A provides attributed identity. Phase C provides authenticated
identity. The Actor struct is designed to support both without data
migration.
Attributed system actor identity MUST use a separate
SystemActorContextKey, NOT the existing UserIDContextKey. This is
non-negotiable based on verified code evidence:
- `GetUserIDFromContext` is used as an auth gate in 5+ identity service endpoints
- A JWT `sub` claim could theoretically contain `system:scheduler:*` strings, creating namespace collision
- Separate context keys create a clean boundary: the JWT path populates `UserIDContextKey`, the scheduler path populates `SystemActorContextKey`, and they can never collide
A single typed struct replaces context key proliferation:
```go
type Actor struct {
    ID            string    // "system:scheduler:billing" or user UUID
    Type          ActorType // Human, Scheduler, Worker, Migration
    Authenticated bool      // true only if set by auth interceptor
    Source        string    // "grpc-interceptor", "cron-scheduler", "catch-up"
}
```

- The gRPC interceptor sets `Actor{ID: userID, Type: Human, Authenticated: true, Source: "grpc-interceptor"}`
- The scheduler sets `Actor{ID: "system:scheduler:billing", Type: Scheduler, Authenticated: false, Source: "cron-scheduler"}`
- Audit hooks read `actor.ID` for `changed_by` regardless of type
- Auth gates check `actor.Authenticated` - attributed strings never pass auth checks
- Future actor types (workers, migrations, webhooks) extend via `ActorType` without new context keys
`system:scheduler:{service}` - no tenant ID. The tenant is implicit in the schema-scoped audit trail. Including the tenant ID is redundant, creates privacy leakage in cross-tenant audit views, and renders poorly in the UI.
Scope: Standalone value for the 3 existing schedulers. No manifest changes, no proto changes, no API surface changes.
Estimated complexity: 5 story points
Create `shared/platform/auth/actor.go`:

- `Actor` struct with `ID`, `Type`, `Authenticated`, `Source` fields
- `ActorType` enum: `Human`, `Scheduler`, `Worker`, `Migration`
- `ActorContextKey` context key
- `WithActor(ctx, actor)` and `ActorFromContext(ctx)` helpers
- Update `audit.GetUserFromContext()` to check `ActorContextKey` first, then `UserIDContextKey`, then fall back to `DefaultAuditUser`
In `shared/platform/scheduler/cron.go`, `executeJob()`:

- Inject `Actor{ID: "system:scheduler:{schedulerName}", Type: Scheduler, Authenticated: false, Source: "cron-scheduler"}` into context
- Inject `audit.WithCorrelationID(ctx, execID.String())` to link all audit records from one execution
- For catch-up executions, use `Source: "catch-up"`
In executeJob(), before calling the executor:
- Query tenant status (active/suspended/deprovisioned)
- Skip execution for non-active tenants, record as `SKIPPED` with a reason
- Known limitation: a tenant can be suspended mid-execution. Saga-level handling is a future concern.
In `executeJob()` or `CronScheduler`:

- Add a configurable semaphore (default max 20 concurrent executions)
- Excess executions are `SKIPPED` with reason "concurrency limit reached"
- Prevents DB connection pool exhaustion when schedules align (e.g., `0 0 1 * *` for all tenants)
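A minimal sketch of the skip-don't-queue behavior, assuming a plain buffered channel (the names `execSemaphore` and `tryAcquire` are illustrative, not the actual scheduler API):

```go
package main

import "fmt"

// execSemaphore caps concurrent schedule executions. Excess runs are
// skipped rather than queued, matching the SKIPPED-with-reason behavior.
type execSemaphore chan struct{}

func newExecSemaphore(max int) execSemaphore { return make(execSemaphore, max) }

// tryAcquire returns false immediately when the limit is reached,
// instead of blocking the cron goroutine.
func (s execSemaphore) tryAcquire() bool {
	select {
	case s <- struct{}{}:
		return true
	default:
		return false
	}
}

func (s execSemaphore) release() { <-s }

func main() {
	sem := newExecSemaphore(2) // default would be 20; 2 keeps the demo small
	fmt.Println(sem.tryAcquire()) // true
	fmt.Println(sem.tryAcquire()) // true
	fmt.Println(sem.tryAcquire()) // false: record SKIPPED "concurrency limit reached"
	sem.release()
	fmt.Println(sem.tryAcquire()) // true again after a slot frees
}
```

The non-blocking `select` is the key design choice: a blocked acquire would silently serialize executions, while a failed acquire produces an auditable `SKIPPED` record.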
In refreshSchedules():
- Add random jitter (0-10s) to the refresh ticker interval
- Prevents synchronized `ListSchedules()` bursts when multiple replicas start simultaneously
Document:
- Attribution vs authentication decision and rationale
- `SystemActorContextKey` separate from `UserIDContextKey` rationale (with code evidence)
- `Actor` struct design and forward-compatibility with Phase C
- Known limitations (attributed strings are not cryptographically verified)
Scope: Bridges manifest `scheduled:` trigger declarations to the `CronScheduler`. Uses database-backed schedule storage (proven by the forecasting pattern).
Estimated complexity: 8 story points
Prerequisite: Deliverable A (attribution must be in place before scaling schedule count)
Per-tenant-schema table written by manifest application, read by ScheduleProvider:
```sql
CREATE TABLE tenant_schedule (
    id                  UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    schedule_name       VARCHAR(128) NOT NULL,
    saga_name           VARCHAR(128) NOT NULL,
    cron_expr           VARCHAR(64) NOT NULL,
    enabled             BOOLEAN NOT NULL DEFAULT true,
    manifest_version_id UUID,
    metadata            JSONB,
    created_at          TIMESTAMPTZ NOT NULL DEFAULT now(),
    updated_at          TIMESTAMPTZ NOT NULL DEFAULT now(),
    UNIQUE(schedule_name)
);
```

`manifest_version_id` provides traceability: which manifest version created this schedule. It references control-plane manifest versions (cross-schema, no FK constraint) - a soft reference for audit/debugging purposes, not a hard database relationship.
During `ApplyManifest`, for `scheduled:` triggers:

- Parse schedule configuration from the manifest
- Translate friendly abstractions to cron expressions (if applicable)
- Diff declared schedules against existing `tenant_schedule` rows
- Insert/update/delete schedule rows
- Return registered schedules with `next_execution` time in the apply response
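The diff step can be sketched as a pure function over `schedule_name -> cron_expr` maps (a hypothetical helper, not the actual apply code):

```go
package main

import "fmt"

// diffSchedules compares the manifest's declared schedules against the
// current tenant_schedule rows and yields the three row operations the
// apply step performs: insert new names, update changed cron expressions,
// and remove names no longer declared.
func diffSchedules(declared, existing map[string]string) (insert, update, remove []string) {
	for name, cron := range declared {
		old, ok := existing[name]
		switch {
		case !ok:
			insert = append(insert, name)
		case old != cron:
			update = append(update, name)
		}
	}
	for name := range existing {
		if _, ok := declared[name]; !ok {
			remove = append(remove, name)
		}
	}
	return insert, update, remove
}

func main() {
	declared := map[string]string{"monthly_billing": "0 2 1 * *", "daily_forecast": "0 4 * * *"}
	existing := map[string]string{"monthly_billing": "0 3 1 * *", "old_recon": "0 1 * * 0"}
	ins, upd, del := diffSchedules(declared, existing)
	fmt.Println(ins, upd, del) // [daily_forecast] [monthly_billing] [old_recon]
}
```

Making the diff a pure function keeps it testable independently of the DB layer; the apply step then wraps the three resulting row operations in one transaction.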
A `TenantScheduleProvider` that queries `tenant_schedule` across all tenant schemas. Replaces or supplements per-service providers:

- Billing: migrate from the env var to `tenant_schedule`
- Reconciliation: implement against `tenant_schedule` instead of the Reference Data stub
- Forecasting: can adopt `tenant_schedule` or keep `forecasting_strategy` (per-service decision)
Enforced at manifest validation time:
- Minimum cron interval: 15 minutes
- Maximum schedules per tenant manifest: 10-20 (configurable)
- Reject syntactically valid but semantically nonsensical expressions (e.g., `0 0 31 2 *`)
- Warn on very infrequent schedules (e.g., annual) as likely bugs
Runtime guardrails in `executeJob()`:

- Per-tenant concurrent execution limit (3-5, configurable)
- Excess executions recorded as `SKIPPED` with a tenant-specific reason
- Distinct from the global semaphore in A.4
Observability (can ship in parallel):
- Execution latency histogram per scheduler/tenant
- Lock contention metrics
- Expected-vs-actual execution frequency check (alert when a schedule hasn't fired in 2x its expected interval)
- Redis health metric
Design decision (before B.2 implementation):

- Raw cron expressions only? (`schedule: "0 2 1 * *"`)
- Friendly abstractions? (`schedule: { every: "1h" }`, `schedule: { monthly: { day: 1, hour: 2 } }`)
- Named presets? (`schedule: "monthly_billing"` mapping to a reference data entry)
- Resolve the MCP docs contradiction (`scheduled:<cron>` vs `scheduled:<name>`)
The `tenant_schedule` table decouples this decision from the scheduler - manifest DX can evolve independently because the translation to cron expressions happens at the application layer.
Scope: Per-tenant system user with JWT-scoped execution. Full auth chain for scheduled work.
Estimated complexity: 13 story points
Trigger: Required when scheduled sagas need to make cross-service authenticated gRPC calls, OR when a customer requires SOC 2 Type II with cryptographically verifiable chain of custody.
- Created during tenant provisioning as a post-provisioning hook
- Assigned the existing `"service"` RBAC role
- Per-service role scoping if needed (billing doesn't need the same permissions as forecasting)
- Lifecycle: created on provision, suspended on tenant suspend, deprovisioned on tenant deprovision
- Scheduler mints short-lived JWTs per execution (not long-lived cached credentials)
- Token lifetime = `ExecutionTimeout` + buffer (e.g., 15 minutes)
- Minting is in-process (no external call that can fail)
- Token carries tenant ID, service role, and execution correlation ID
- Executor injects JWT into context before saga execution
- Saga steps that make gRPC calls carry the token through interceptors
- Token-expiry-mid-saga handling: design upfront (fail + compensate, or refresh mid-saga)
- No long-lived credentials to rotate (per-execution minting)
- System user suspension on tenant deactivation
- Monitoring: alert on system user token mint failures
- Removing the `public.platform_saga_definition` table (the control plane uses it for `apply_manifest`)
- Changing the forecasting service's `forecasting_strategy` table (it can adopt `tenant_schedule` or keep its own)
- Tenant-to-tenant data sharing / mesh scheduling (future architecture)
- Schedule-triggered notifications to tenants (future DX feature)
- Existing scheduler tests pass with `Actor` context injection
- `changed_by` fields show `system:scheduler:{service}` instead of `"system"`
- Audit records carry a `correlation_id` linking to `scheduler_execution.id`
- A suspended tenant's schedules are skipped with an audit trail
- Concurrent execution is capped at the semaphore limit
- A manifest with a `scheduled:` trigger creates a `tenant_schedule` row
- The schedule appears in `CronScheduler` within the 60s refresh interval
- The apply response includes registered schedules with next execution time
- Cron expressions below the 15-min floor are rejected at validation
- A per-tenant schedule count exceeding the cap is rejected
- Scheduled saga steps can make authenticated gRPC calls to other services
- Token carries correct tenant ID and service role
- Token expiry is handled gracefully (saga fails cleanly, not silently)
- Six Thinking Hats analysis: 5-person panel (security, distributed systems, SRE, product, compliance)
- Key code paths: `shared/platform/scheduler/cron.go`, `shared/platform/auth/`, `shared/platform/audit/`
- Existing patterns: forecasting `StrategyRepository.ListAllActive()` (DB-backed schedule provider)
- Unused scaffolding: `"service"` RBAC role (shared/platform/auth/rbac.go:32)