Skip to content

docs: PRD-060 per-tenant scheduled execution architecture#2142

Merged
bjcoombs merged 3 commits intodevelopfrom
prd-tenant-scheduling
Apr 6, 2026
Merged

docs: PRD-060 per-tenant scheduled execution architecture#2142
bjcoombs merged 3 commits intodevelopfrom
prd-tenant-scheduling

Conversation

@bjcoombs
Copy link
Copy Markdown
Collaborator

@bjcoombs bjcoombs commented Apr 6, 2026

Summary

  • Define phased architecture for per-tenant scheduling identity and manifest-driven scheduling
  • Deliverable A (5 pts): Actor struct, audit attribution, concurrency limits, tenant status checks - standalone value for existing schedulers
  • Deliverable B (8 pts): Manifest-driven scheduling via tenant_schedule DB table, cron guardrails, schedule health monitoring
  • Deliverable C (13 pts): Authenticated system identity with per-execution JWT minting (deferred)

Analysis

Based on a Six Thinking Hats panel (security, distributed systems, SRE, product, compliance) that identified:

  • Verified auth bypass risk: GetUserIDFromContext used as auth gate in 5+ identity service endpoints - attributed identity MUST use separate SystemActorContextKey
  • Actor struct design: Single typed struct with Authenticated boolean prevents trust escalation between attributed and authenticated identities
  • Database-backed scheduling: tenant_schedule table (proven by forecasting pattern) decouples manifest DX from scheduler mechanics
  • Corrected Phase 0: Triple-execution risk was phantom (dead scaffolding). Real risks are unbounded goroutine concurrency, missing tenant status checks, and no cron interval floor

Test plan

  • PRD review by team for architectural alignment
  • Verify design decisions against current codebase (context key separation, Actor struct compatibility)
  • Confirm deliverable sizing estimates

Define phased approach to per-tenant scheduling identity:
- Deliverable A: Actor struct, audit attribution, concurrency limits,
  tenant status checks (5 points, standalone value)
- Deliverable B: Manifest-driven scheduling via tenant_schedule DB
  table, cron guardrails, schedule health monitoring (8 points)
- Deliverable C: Authenticated system identity with per-execution JWT
  minting (13 points, deferred until cross-service auth needed)

Key design decisions from Six Thinking Hats panel analysis:
- SystemActorContextKey separate from UserIDContextKey (verified auth
  bypass risk in 5+ identity service endpoints)
- Actor struct with Authenticated boolean prevents trust escalation
- Database-backed schedule table decouples manifest DX from scheduler
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Apr 6, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: c5a4f7fb-d0cf-4eae-873b-7595ab5ac265

📥 Commits

Reviewing files that changed from the base of the PR and between cbd5b26 and decf141.

📒 Files selected for processing (2)
  • docs/prd/060-per-tenant-scheduled-execution.md
  • docs/prd/README.md
🚧 Files skipped from review as they are similar to previous changes (2)
  • docs/prd/README.md
  • docs/prd/060-per-tenant-scheduled-execution.md

📝 Walkthrough

Walkthrough

Adds PRD-060 documenting a phased design for per-tenant scheduled execution: attribution and audit identity, a manifest-driven schedule bridge with validation and scaling guardrails, and a deferred plan for authenticated per-tenant system identities and short-lived JWTs. Also updates the PRD index/README to surface this and other new PRDs.

Changes

Cohort / File(s) Summary
Per-Tenant Scheduled Execution PRD
docs/prd/060-per-tenant-scheduled-execution.md
New PRD (~371 lines) defining three deliverables: A — attributed scheduling identity (Actor context key), execution correlation IDs, tenant-status checks, global concurrency semaphore with SKIPPED outcomes, refresh jitter, and ADR; B — manifest-driven tenant_schedule table, TenantScheduleProvider, validation rules, per-tenant caps and health monitoring; C — deferred design for per-tenant system users and short-lived JWTs for authenticated cross-service calls.
PRD Index / Status
docs/prd/README.md
Updated PRD index: added "Per-Tenant Scheduled Execution" entry and multiple new PRDs to "Not Started" and category lists (Core Platform, Operations & SaaS, Task Master docs).

Sequence Diagram(s)

sequenceDiagram
    participant Scheduler as Scheduler
    participant DB as TenantScheduleDB
    participant Limiter as GlobalConcurrencyLimiter
    participant Saga as SagaEngine
    participant Audit as AuditService

    Scheduler->>DB: read tenant schedules (manifest bridge)
    DB-->>Scheduler: return schedules (with tenant id)
    Scheduler->>Scheduler: check tenant status (suspended/deprovisioned) and apply refresh jitter
    Scheduler->>Limiter: request execution slot for tenant
    alt slot granted
        Limiter-->>Scheduler: granted
        Scheduler->>Saga: start execution (inject Actor + correlation ID)
        Saga->>Audit: emit attributed audit events
        Saga->>Audit: emit execution outcome
    else slot denied
        Limiter-->>Scheduler: denied
        Scheduler->>Audit: emit SKIPPED outcome (concurrency/limit)
    end
Loading

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~2 minutes

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly and concisely summarizes the main change: adding PRD-060 documentation that defines a phased architecture for per-tenant scheduled execution, which aligns with the changeset.
Description check ✅ Passed The description is well-organized and directly related to the changeset, covering the phased deliverables, architectural decisions, and test plan for the per-tenant scheduling architecture.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch prd-tenant-scheduling

Comment @coderabbitai help to get the list of available commands and usage tips.

- Add YAML frontmatter to PRD-060 for Claude Code skill triggers
- Add missing PRDs to README index: 046 (economy visualization),
  047 (security audit), 051 (party navigation), 056 (correspondence),
  057 (convergent manifest apply), 058 (economy visibility),
  059 (asset-agnostic posting layer), 060 (scheduled execution)
- Add PRD-060 to Operations & SaaS category
- Add PRD-059 to Core Platform category
coderabbitai[bot]
coderabbitai Bot previously requested changes Apr 6, 2026
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (4)
docs/prd/060-per-tenant-scheduled-execution.md (4)

343-351: Add verification criterion for refresh jitter (A.5).

The verification criteria for Deliverable A don't include a test for the refresh jitter introduced in A.5 (lines 191-196). Consider adding:

  1. Multiple scheduler replicas refresh schedules with jitter (verify refresh timestamps are not synchronized)

This ensures the jitter implementation actually prevents thundering herd on the ListSchedules() call.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/prd/060-per-tenant-scheduled-execution.md` around lines 343 - 351, Add a
sixth verification criterion to Deliverable A to test the refresh jitter from
A.5: verify that multiple scheduler replicas refresh schedules with jitter by
asserting their refresh timestamps for ListSchedules() calls are not
synchronized (e.g., collect refresh timestamps from multiple replicas and
confirm variance/offsets), reference Deliverable A and the A.5 jitter behavior
as the target of the test and ensure the criterion mentions ListSchedules()
specifically.

266-273: Per-tenant execution limit may be too restrictive.

The proposed per-tenant concurrent execution limit of "3-5, configurable" seems quite low. Consider a tenant with 10 schedules where several align on the same cron expression (e.g., 0 2 * * * for multiple nightly jobs). All would attempt to execute simultaneously, and 5-7 would be skipped.

This might be the intended behavior for resource protection, but it could surprise tenants whose schedules are legitimately concurrent by design. Consider:

  • Documenting this limit prominently in manifest documentation
  • Providing tenant-visible metrics when schedules are skipped due to concurrency limits
  • Allowing the limit to be configured per-tenant (not just globally) for high-value customers
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/prd/060-per-tenant-scheduled-execution.md` around lines 266 - 273, The
per-tenant concurrent execution limit described for executeJob() (currently
"3-5, configurable") is likely too restrictive and can unexpectedly SKIP many
legitimate concurrent schedules; update the design so executeJob() supports a
per-tenant configurable concurrency limit (falling back to a global default),
add tenant-visible metrics/telemetry emitted when executions are SKIPPED for
quota reasons, and document this behavior and configuration prominently in the
manifest/docs; specifically, add a tenantConfig field (or equivalent) to
override the global limit, ensure executeJob() reads
tenantConfig.concurrentLimit before applying the semaphore, and emit a
metric/log with tenant id, schedule ids, and reason when SKIPPED.

221-234: Add index on enabled column for query performance.

The tenant_schedule table will likely be queried with WHERE enabled = true by the ScheduleProvider to fetch active schedules. Without an index on the enabled column, this becomes a full table scan as the schedule count grows.

📊 Suggested index addition
 CREATE TABLE tenant_schedule (
     id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
     schedule_name VARCHAR(128) NOT NULL,
     saga_name VARCHAR(128) NOT NULL,
     cron_expr VARCHAR(64) NOT NULL,
     enabled BOOLEAN NOT NULL DEFAULT true,
     manifest_version_id UUID,
     metadata JSONB,
     created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
     updated_at TIMESTAMPTZ NOT NULL DEFAULT now(),
     UNIQUE(schedule_name)
 );
+
+CREATE INDEX idx_tenant_schedule_enabled ON tenant_schedule(enabled) WHERE enabled = true;

Using a partial index (WHERE enabled = true) is even more efficient since it only indexes active schedules.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/prd/060-per-tenant-scheduled-execution.md` around lines 221 - 234, Add a
partial index to the tenant_schedule table on the enabled column so queries from
ScheduleProvider that filter WHERE enabled = true avoid full table scans;
specifically, create an index named something like tenant_schedule_enabled_idx
on tenant_schedule targeting the enabled column with a WHERE enabled = true
predicate and include this DDL alongside the table definition/migrations so
active-schedule lookups are served by the index.

184-189: Consider making the concurrency limit configurable per environment.

The proposed default of "max 20 concurrent executions" may be appropriate for production but could be too restrictive for development or too permissive for resource-constrained staging environments. Since this is a global runtime limiter that affects all tenants, consider making it environment-configurable (via config or environment variable) rather than a hard-coded default.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/prd/060-per-tenant-scheduled-execution.md` around lines 184 - 189, The
concurrency limiter is currently described with a hard-coded default of 20; make
it environment/configurable by adding a global semaphore whose max is read from
configuration (e.g., env var MAX_CONCURRENT_SCHEDULED_EXECUTIONS or config key
scheduler.maxConcurrentExecutions, default 20), initialize that semaphore in the
CronScheduler/boot path, and have executeJob() acquire the semaphore before
starting work and release it after; if acquire fails because limit is reached,
immediately mark the execution as SKIPPED with reason "concurrency limit
reached" and persist that state. Ensure the semaphore instance is shared
globally by CronScheduler and referenced by executeJob() (use the semaphore
variable name used in the diff or create a clear symbol like
globalScheduleSemaphore) so all tenant schedules respect the same configurable
limit.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@docs/prd/060-per-tenant-scheduled-execution.md`:
- Around line 17-26: The document incorrectly states that manifests with
`scheduled:` triggers "validate and nothing happens"; instead, update the text
to state that `validateScheduledTriggers` in domain_validator.go does parse and
validate `scheduled:` entries (extracting schedule names and enforcing
uniqueness) but there is no execution bridge to the CronScheduler
infrastructure. Change the wording to describe the real gap: manifest.proto
declares `scheduled:` triggers without a cron-expression field or mapping to
CronScheduler, and reference.go's docs are inconsistent with the manifest;
recommend adding a clear migration path (e.g., manifest-to-CronScheduler mapping
or a cron-expression field and runtime registration) so manifests both validate
and are registered with the CronScheduler.
- Line 228: Add a short clarification in the PRD noting that the
manifest_version_id field references control-plane-managed manifest versions
(not tenant-schema rows) and therefore cannot use a direct foreign key; update
the description for manifest_version_id to state it provides traceability to
control-plane manifest versions, explain how scheduled executions should resolve
and record control‑plane artifact IDs (e.g., store control-plane UUID and
optional snapshot metadata) and describe the expected lookup/validation flow
when creating or running scheduled executions so readers understand the
cross‑schema relationship.

---

Nitpick comments:
In `@docs/prd/060-per-tenant-scheduled-execution.md`:
- Around line 343-351: Add a sixth verification criterion to Deliverable A to
test the refresh jitter from A.5: verify that multiple scheduler replicas
refresh schedules with jitter by asserting their refresh timestamps for
ListSchedules() calls are not synchronized (e.g., collect refresh timestamps
from multiple replicas and confirm variance/offsets), reference Deliverable A
and the A.5 jitter behavior as the target of the test and ensure the criterion
mentions ListSchedules() specifically.
- Around line 266-273: The per-tenant concurrent execution limit described for
executeJob() (currently "3-5, configurable") is likely too restrictive and can
unexpectedly SKIP many legitimate concurrent schedules; update the design so
executeJob() supports a per-tenant configurable concurrency limit (falling back
to a global default), add tenant-visible metrics/telemetry emitted when
executions are SKIPPED for quota reasons, and document this behavior and
configuration prominently in the manifest/docs; specifically, add a tenantConfig
field (or equivalent) to override the global limit, ensure executeJob() reads
tenantConfig.concurrentLimit before applying the semaphore, and emit a
metric/log with tenant id, schedule ids, and reason when SKIPPED.
- Around line 221-234: Add a partial index to the tenant_schedule table on the
enabled column so queries from ScheduleProvider that filter WHERE enabled = true
avoid full table scans; specifically, create an index named something like
tenant_schedule_enabled_idx on tenant_schedule targeting the enabled column with
a WHERE enabled = true predicate and include this DDL alongside the table
definition/migrations so active-schedule lookups are served by the index.
- Around line 184-189: The concurrency limiter is currently described with a
hard-coded default of 20; make it environment/configurable by adding a global
semaphore whose max is read from configuration (e.g., env var
MAX_CONCURRENT_SCHEDULED_EXECUTIONS or config key
scheduler.maxConcurrentExecutions, default 20), initialize that semaphore in the
CronScheduler/boot path, and have executeJob() acquire the semaphore before
starting work and release it after; if acquire fails because limit is reached,
immediately mark the execution as SKIPPED with reason "concurrency limit
reached" and persist that state. Ensure the semaphore instance is shared
globally by CronScheduler and referenced by executeJob() (use the semaphore
variable name used in the diff or create a clear symbol like
globalScheduleSemaphore) so all tenant schedules respect the same configurable
limit.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 39d8de26-83e7-490a-90ba-6d6372aae927

📥 Commits

Reviewing files that changed from the base of the PR and between 28e0e04 and 99dee60.

📒 Files selected for processing (1)
  • docs/prd/060-per-tenant-scheduled-execution.md

Comment thread docs/prd/060-per-tenant-scheduled-execution.md Outdated
Comment thread docs/prd/060-per-tenant-scheduled-execution.md
coderabbitai[bot]
coderabbitai Bot previously requested changes Apr 6, 2026
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@docs/prd/README.md`:
- Line 69: The PRD entry for "Per-Tenant Scheduled Execution"
(060-per-tenant-scheduled-execution.md) appears twice with conflicting statuses;
open the README and keep only one entry for that PRD under the correct status
section (either the "In Progress" block or the "Not Started" block) by removing
the duplicate line `| [Per-Tenant Scheduled
Execution](060-per-tenant-scheduled-execution.md) | \`remove-platform-ref\`
(Deliverable A) | Draft |` from the incorrect section so the document has a
single authoritative status for that PRD.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: ca429340-a0d6-4f61-ab1a-d471f98da34f

📥 Commits

Reviewing files that changed from the base of the PR and between 99dee60 and cbd5b26.

📒 Files selected for processing (2)
  • docs/prd/060-per-tenant-scheduled-execution.md
  • docs/prd/README.md
✅ Files skipped from review due to trivial changes (1)
  • docs/prd/060-per-tenant-scheduled-execution.md

Comment thread docs/prd/README.md Outdated
- Nuance manifest gap claim: validation does process scheduled triggers
  (uniqueness, prefix parsing) but no schedule is registered - trigger is
  syntactically accepted but operationally inert
- Document manifest_version_id as a cross-schema soft reference to
  control-plane manifest versions (no FK constraint)
- Remove duplicate PRD-060 status entry from README In Progress section
  (keep only in Not Started)
@claude
Copy link
Copy Markdown

claude Bot commented Apr 6, 2026

Claude Code Review

Commit: decf141a | CI: running (most checks pending, CodeRabbit passed)

Summary

Well-structured PRD with unusually strong codebase evidence backing each claim. The phased decomposition (attribution -> manifest bridge -> authenticated identity) follows a sound dependency chain and avoids the common trap of building authentication before attribution is in place. The Actor struct design is forward-compatible with Phase C without data migration.

All 10 specific code references verified against the codebase (line numbers, function names, patterns). The separate SystemActorContextKey design is correctly motivated — the auth bypass risk via GetUserIDFromContext is real and confirmed at the cited endpoints.

Risk Assessment

Area Level Detail
Blast radius Low PRD only — no code changes
Rollback Safe Delete file, revert README
Scale N/A Specification document
Cross-system Low Touches scheduler, auth, audit, manifest concepts but no runtime changes
Migration N/A No migrations in this PR

Findings

Severity Location Description Status
Improvement 060-per-tenant-scheduled-execution.md:252 B.1 tenant_schedule schema: updated_at has no auto-update mechanism. CockroachDB lacks PL/pgSQL triggers. Document that application code must set updated_at in B.2, or it will always equal created_at. Open
Note 060-per-tenant-scheduled-execution.md:40-48 Gap 2 wording accurate after revision. "Syntactically accepted but operationally inert" correctly describes the state. Resolved

Bot Review Notes

Three CodeRabbit comments were posted, all marked as "Addressed in commit decf141":

  1. Gap 2 wording (Major): Author revised to "syntactically accepted but operationally inert" — accurate. Already addressed.
  2. manifest_version_id cross-schema reference (Minor): Author added "cross-schema, no FK constraint — a soft reference for audit/debugging." Already addressed.
  3. README dual-listing (Major): Fixed — PRD-060 now appears only in "Not Started." Already addressed.

No unresolved bot threads remain.

Cross-Document Consistency Check

  • ADR-0029 (Settlement Scheduler): PRD-060's TenantScheduleProvider is consistent with in-process cron + Redis leader election. No conflicts.
  • ADR-0033 (Event-Triggered Sagas): PRD-060 covers scheduled: triggers only. No overlap.
  • ADR-0036 (Embedded Dex Identity): Phase C JWT is in-process, not via Dex. Complementary.
  • PRD-021 (Platform Scheduler): PRD-060 extends it with attribution, tenant checks, manifest bridge. Consistent.
  • PRD-045 (Manifest as Sole Source of Truth): B.2 writes via ApplyManifest. Consistent.
  • PRD-057 (Convergent Manifest Apply): B.2's "diff declared schedules against existing rows" aligns with convergent semantics. Consistent.

Questions for the Author

  1. 060-per-tenant-scheduled-execution.md:196 — A.4 defaults to max 20 concurrent executions globally. A global semaphore means one tenant's burst can starve others. Was per-tenant fairness considered here, or is that intentionally deferred to B.5's per-tenant limits?

@bjcoombs bjcoombs dismissed stale reviews from coderabbitai[bot] and coderabbitai[bot] April 6, 2026 09:44

All threads resolved after code fixes

Copy link
Copy Markdown

@claude claude Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PRD-060 is well-researched with verified code references, internally consistent, and aligned with existing ADRs and PRDs. One minor suggestion in summary comment (document application-managed updated_at in B.1 schema) and one question (global vs per-tenant fairness in A.4 semaphore). See summary comment for full details.

@bjcoombs bjcoombs merged commit 98230f2 into develop Apr 6, 2026
22 checks passed
@bjcoombs bjcoombs deleted the prd-tenant-scheduling branch April 6, 2026 09:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant