docs: PRD-060 per-tenant scheduled execution architecture#2142
Define phased approach to per-tenant scheduling identity:

- Deliverable A: `Actor` struct, audit attribution, concurrency limits, tenant status checks (5 points, standalone value)
- Deliverable B: Manifest-driven scheduling via `tenant_schedule` DB table, cron guardrails, schedule health monitoring (8 points)
- Deliverable C: Authenticated system identity with per-execution JWT minting (13 points, deferred until cross-service auth needed)

Key design decisions from Six Thinking Hats panel analysis:

- `SystemActorContextKey` separate from `UserIDContextKey` (verified auth bypass risk in 5+ identity service endpoints)
- `Actor` struct with `Authenticated` boolean prevents trust escalation
- Database-backed schedule table decouples manifest DX from scheduler
No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID:
📒 Files selected for processing (2)
🚧 Files skipped from review as they are similar to previous changes (2)
📝 Walkthrough

Adds PRD-060 documenting a phased design for per-tenant scheduled execution: attribution and audit identity, a manifest-driven schedule bridge with validation and scaling guardrails, and a deferred plan for authenticated per-tenant system identities and short-lived JWTs. Also updates the PRD index/README to surface this and other new PRDs.

Changes
Sequence Diagram(s)

sequenceDiagram
participant Scheduler as Scheduler
participant DB as TenantScheduleDB
participant Limiter as GlobalConcurrencyLimiter
participant Saga as SagaEngine
participant Audit as AuditService
Scheduler->>DB: read tenant schedules (manifest bridge)
DB-->>Scheduler: return schedules (with tenant id)
Scheduler->>Scheduler: check tenant status (suspended/deprovisioned) and apply refresh jitter
Scheduler->>Limiter: request execution slot for tenant
alt slot granted
Limiter-->>Scheduler: granted
Scheduler->>Saga: start execution (inject Actor + correlation ID)
Saga->>Audit: emit attributed audit events
Saga->>Audit: emit execution outcome
else slot denied
Limiter-->>Scheduler: denied
Scheduler->>Audit: emit SKIPPED outcome (concurrency/limit)
end
Estimated code review effort: 🎯 1 (Trivial) | ⏱️ ~2 minutes
🚥 Pre-merge checks: ✅ Passed checks (3 passed)
- Add YAML frontmatter to PRD-060 for Claude Code skill triggers - Add missing PRDs to README index: 046 (economy visualization), 047 (security audit), 051 (party navigation), 056 (correspondence), 057 (convergent manifest apply), 058 (economy visibility), 059 (asset-agnostic posting layer), 060 (scheduled execution) - Add PRD-060 to Operations & SaaS category - Add PRD-059 to Core Platform category
Actionable comments posted: 2
🧹 Nitpick comments (4)
docs/prd/060-per-tenant-scheduled-execution.md (4)
343-351: Add verification criterion for refresh jitter (A.5).

The verification criteria for Deliverable A don't include a test for the refresh jitter introduced in A.5 (lines 191-196). Consider adding:
- Multiple scheduler replicas refresh schedules with jitter (verify refresh timestamps are not synchronized)
This ensures the jitter implementation actually prevents thundering herd on the `ListSchedules()` call.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/prd/060-per-tenant-scheduled-execution.md` around lines 343 - 351, Add a sixth verification criterion to Deliverable A to test the refresh jitter from A.5: verify that multiple scheduler replicas refresh schedules with jitter by asserting their refresh timestamps for ListSchedules() calls are not synchronized (e.g., collect refresh timestamps from multiple replicas and confirm variance/offsets), reference Deliverable A and the A.5 jitter behavior as the target of the test and ensure the criterion mentions ListSchedules() specifically.
266-273: Per-tenant execution limit may be too restrictive.

The proposed per-tenant concurrent execution limit of "3-5, configurable" seems quite low. Consider a tenant with 10 schedules where several align on the same cron expression (e.g., `0 2 * * *` for multiple nightly jobs). All would attempt to execute simultaneously, and 5-7 would be skipped.

This might be the intended behavior for resource protection, but it could surprise tenants whose schedules are legitimately concurrent by design. Consider:
- Documenting this limit prominently in manifest documentation
- Providing tenant-visible metrics when schedules are skipped due to concurrency limits
- Allowing the limit to be configured per-tenant (not just globally) for high-value customers
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/prd/060-per-tenant-scheduled-execution.md` around lines 266 - 273, The per-tenant concurrent execution limit described for executeJob() (currently "3-5, configurable") is likely too restrictive and can unexpectedly SKIP many legitimate concurrent schedules; update the design so executeJob() supports a per-tenant configurable concurrency limit (falling back to a global default), add tenant-visible metrics/telemetry emitted when executions are SKIPPED for quota reasons, and document this behavior and configuration prominently in the manifest/docs; specifically, add a tenantConfig field (or equivalent) to override the global limit, ensure executeJob() reads tenantConfig.concurrentLimit before applying the semaphore, and emit a metric/log with tenant id, schedule ids, and reason when SKIPPED.
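A hedged Go sketch of the per-tenant override with global fallback described above. The `tenantConfig`/`concurrentLimit` names come from the review prompt; the types, the default of 5, and the helper functions are illustrative:

```go
package main

import "fmt"

// TenantConfig mirrors the reviewer's suggested tenantConfig override;
// a zero ConcurrentLimit means "use the global default".
type TenantConfig struct {
	ConcurrentLimit int
}

const globalConcurrentLimit = 5 // illustrative default from the "3-5" range

// effectiveLimit resolves the per-tenant override, falling back globally.
func effectiveLimit(cfg TenantConfig) int {
	if cfg.ConcurrentLimit > 0 {
		return cfg.ConcurrentLimit
	}
	return globalConcurrentLimit
}

// tryAcquire models the per-tenant semaphore check executeJob() would run
// before starting work: it reports whether a new execution may start
// given the tenant's current in-flight count.
func tryAcquire(inFlight int, cfg TenantConfig) bool {
	return inFlight < effectiveLimit(cfg)
}

func main() {
	highValue := TenantConfig{ConcurrentLimit: 10}
	fmt.Println(tryAcquire(7, highValue))      // true: under the per-tenant override
	fmt.Println(tryAcquire(7, TenantConfig{})) // false: global default applies, execution is SKIPPED
}
```

When `tryAcquire` returns false, the design above calls for emitting a tenant-visible metric with tenant ID, schedule IDs, and the skip reason rather than silently dropping the run.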
221-234: Add index on `enabled` column for query performance.

The `tenant_schedule` table will likely be queried with `WHERE enabled = true` by the `ScheduleProvider` to fetch active schedules. Without an index on the `enabled` column, this becomes a full table scan as the schedule count grows.

📊 Suggested index addition
```sql
CREATE TABLE tenant_schedule (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    schedule_name VARCHAR(128) NOT NULL,
    saga_name VARCHAR(128) NOT NULL,
    cron_expr VARCHAR(64) NOT NULL,
    enabled BOOLEAN NOT NULL DEFAULT true,
    manifest_version_id UUID,
    metadata JSONB,
    created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
    updated_at TIMESTAMPTZ NOT NULL DEFAULT now(),
    UNIQUE(schedule_name)
);

CREATE INDEX idx_tenant_schedule_enabled ON tenant_schedule(enabled) WHERE enabled = true;
```

Using a partial index (`WHERE enabled = true`) is even more efficient since it only indexes active schedules.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/prd/060-per-tenant-scheduled-execution.md` around lines 221 - 234, Add a partial index to the tenant_schedule table on the enabled column so queries from ScheduleProvider that filter WHERE enabled = true avoid full table scans; specifically, create an index named something like tenant_schedule_enabled_idx on tenant_schedule targeting the enabled column with a WHERE enabled = true predicate and include this DDL alongside the table definition/migrations so active-schedule lookups are served by the index.
184-189: Consider making the concurrency limit configurable per environment.

The proposed default of "max 20 concurrent executions" may be appropriate for production but could be too restrictive for development or too permissive for resource-constrained staging environments. Since this is a global runtime limiter that affects all tenants, consider making it environment-configurable (via config or environment variable) rather than a hard-coded default.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/prd/060-per-tenant-scheduled-execution.md` around lines 184 - 189, The concurrency limiter is currently described with a hard-coded default of 20; make it environment/configurable by adding a global semaphore whose max is read from configuration (e.g., env var MAX_CONCURRENT_SCHEDULED_EXECUTIONS or config key scheduler.maxConcurrentExecutions, default 20), initialize that semaphore in the CronScheduler/boot path, and have executeJob() acquire the semaphore before starting work and release it after; if acquire fails because limit is reached, immediately mark the execution as SKIPPED with reason "concurrency limit reached" and persist that state. Ensure the semaphore instance is shared globally by CronScheduler and referenced by executeJob() (use the semaphore variable name used in the diff or create a clear symbol like globalScheduleSemaphore) so all tenant schedules respect the same configurable limit.
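The env-configurable global semaphore suggested above might be sketched as follows. `MAX_CONCURRENT_SCHEDULED_EXECUTIONS`, `globalScheduleSemaphore`, `executeJob()`, and the SKIPPED outcome all come from the review comment; the buffered-channel implementation and the callback signature are assumptions:

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// maxConcurrent reads the limit from the env var the reviewer suggested,
// falling back to the PRD's default of 20.
func maxConcurrent() int {
	if v, err := strconv.Atoi(os.Getenv("MAX_CONCURRENT_SCHEDULED_EXECUTIONS")); err == nil && v > 0 {
		return v
	}
	return 20
}

// globalScheduleSemaphore is a buffered channel used as a counting
// semaphore, shared by all tenant schedules.
var globalScheduleSemaphore = make(chan struct{}, maxConcurrent())

// executeJob starts work only if a slot is free; otherwise the execution
// is marked SKIPPED rather than queued, matching the review's suggestion.
func executeJob(run func()) (status string) {
	select {
	case globalScheduleSemaphore <- struct{}{}:
		defer func() { <-globalScheduleSemaphore }() // release slot on return
		run()
		return "COMPLETED"
	default:
		return "SKIPPED" // reason: concurrency limit reached
	}
}

func main() {
	fmt.Println(executeJob(func() {})) // prints "COMPLETED" when a slot is free
}
```

A non-blocking `select` with `default` gives the skip-don't-queue semantics: a denied execution returns immediately so the scheduler can persist the SKIPPED outcome and emit the audit event.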
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@docs/prd/060-per-tenant-scheduled-execution.md`:
- Around line 17-26: The document incorrectly states that manifests with
`scheduled:` triggers "validate and nothing happens"; instead, update the text
to state that `validateScheduledTriggers` in domain_validator.go does parse and
validate `scheduled:` entries (extracting schedule names and enforcing
uniqueness) but there is no execution bridge to the CronScheduler
infrastructure. Change the wording to describe the real gap: manifest.proto
declares `scheduled:` triggers without a cron-expression field or mapping to
CronScheduler, and reference.go's docs are inconsistent with the manifest;
recommend adding a clear migration path (e.g., manifest-to-CronScheduler mapping
or a cron-expression field and runtime registration) so manifests both validate
and are registered with the CronScheduler.
- Line 228: Add a short clarification in the PRD noting that the
manifest_version_id field references control-plane-managed manifest versions
(not tenant-schema rows) and therefore cannot use a direct foreign key; update
the description for manifest_version_id to state it provides traceability to
control-plane manifest versions, explain how scheduled executions should resolve
and record control‑plane artifact IDs (e.g., store control-plane UUID and
optional snapshot metadata) and describe the expected lookup/validation flow
when creating or running scheduled executions so readers understand the
cross‑schema relationship.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 39d8de26-83e7-490a-90ba-6d6372aae927
📒 Files selected for processing (1)
docs/prd/060-per-tenant-scheduled-execution.md
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@docs/prd/README.md`:
- Line 69: The PRD entry for "Per-Tenant Scheduled Execution"
(060-per-tenant-scheduled-execution.md) appears twice with conflicting statuses;
open the README and keep only one entry for that PRD under the correct status
section (either the "In Progress" block or the "Not Started" block) by removing
the duplicate line `| [Per-Tenant Scheduled
Execution](060-per-tenant-scheduled-execution.md) | \`remove-platform-ref\`
(Deliverable A) | Draft |` from the incorrect section so the document has a
single authoritative status for that PRD.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: ca429340-a0d6-4f61-ab1a-d471f98da34f
📒 Files selected for processing (2)
docs/prd/060-per-tenant-scheduled-execution.md
docs/prd/README.md
✅ Files skipped from review due to trivial changes (1)
- docs/prd/060-per-tenant-scheduled-execution.md
- Nuance manifest gap claim: validation does process scheduled triggers (uniqueness, prefix parsing) but no schedule is registered - the trigger is syntactically accepted but operationally inert
- Document `manifest_version_id` as a cross-schema soft reference to control-plane manifest versions (no FK constraint)
- Remove duplicate PRD-060 status entry from README In Progress section (keep only in Not Started)
Claude Code Review

Commit:

Summary

Well-structured PRD with unusually strong codebase evidence backing each claim. The phased decomposition (attribution -> manifest bridge -> authenticated identity) follows a sound dependency chain and avoids the common trap of building authentication before attribution is in place. All 10 specific code references verified against the codebase (line numbers, function names, patterns).

Risk Assessment
Findings
Bot Review Notes

Three CodeRabbit comments were posted, all marked as "Addressed in commit decf141":
No unresolved bot threads remain.

Cross-Document Consistency Check
Questions for the Author
All threads resolved after code fixes
PRD-060 is well-researched with verified code references, internally consistent, and aligned with existing ADRs and PRDs. One minor suggestion in summary comment (document application-managed updated_at in B.1 schema) and one question (global vs per-tenant fairness in A.4 semaphore). See summary comment for full details.
Summary
`tenant_schedule` DB table, cron guardrails, schedule health monitoring

Analysis
Based on a Six Thinking Hats panel (security, distributed systems, SRE, product, compliance) that identified:
- `GetUserIDFromContext` used as auth gate in 5+ identity service endpoints - attributed identity MUST use separate `SystemActorContextKey`
- `Authenticated` boolean prevents trust escalation between attributed and authenticated identities
- `tenant_schedule` table (proven by forecasting pattern) decouples manifest DX from scheduler mechanics

Test plan