✨ Declarative limits management via YAML files and Lambda provisioner

## Problem or Use Case

Today, operators manage rate limits imperatively through individual CLI commands (`zae-limiter system set-defaults`, `resource set-defaults`, `entity set-limits`) or Python API calls (`repo.set_system_defaults()`, `repo.set_resource_defaults()`, `repo.set_limits()`). This works for one-off changes but creates challenges at scale:

- **No audit trail of desired state** -- operators know what limits *are* but not what they *should be*
- **Drift detection is impossible** -- no way to compare live limits against a source-of-truth document
- **Multi-namespace rollouts are tedious** -- setting the same limits across `tenant-alpha`, `tenant-beta`, etc. requires repeated CLI invocations
- **No GitOps workflow** -- limits cannot be version-controlled and reviewed through normal PR processes
- **No idempotent apply** -- running the same CLI commands twice may produce warnings or unexpected behavior

Operators need a way to declare their desired limits configuration in version-controlled YAML files and have those declarations applied idempotently, similar to how Terraform manages infrastructure state.

## Proposed Solution

### YAML Configuration Format

One YAML file per namespace. Namespaces are auto-registered on first apply if they don't exist.

```yaml
# tenant-alpha.limits.yaml
namespace: tenant-alpha

system:
  on_unavailable: allow
  limits:
    rpm:
      capacity: 10000
    tpm:
      capacity: 100000

resources:
  gpt-4:
    limits:
      rpm:
        capacity: 1000
      tpm:
        capacity: 50000
        burst: 75000
        refill_amount: 50000
        refill_period: 60
  claude-3:
    limits:
      tpm:
        capacity: 200000

entities:
  user-123:
    resources:
      gpt-4:
        limits:
          rpm:
            capacity: 500
      _default_:
        limits:
          rpm:
            capacity: 200
```

**Limit shorthand:** Only `capacity` is required. Defaults: `burst = capacity`, `refill_amount = capacity`, `refill_period = 60` (matches `Limit.per_minute()`). Override any field explicitly for custom refill behavior.
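
The defaulting rules above can be sketched as a small expansion step. This is only an illustration: `LimitSpec` and `expand_limit` are hypothetical names, not part of the existing `zae_limiter` API, and the real parser may validate more strictly.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LimitSpec:
    """Fully expanded limit (hypothetical type, not the library's `Limit`)."""
    name: str
    capacity: int
    burst: int
    refill_amount: int
    refill_period: int  # seconds


def expand_limit(name: str, raw: dict) -> LimitSpec:
    """Fill in the documented defaults for any field the YAML omits."""
    if "capacity" not in raw:
        raise ValueError(f"limit {name!r}: 'capacity' is required")
    capacity = raw["capacity"]
    return LimitSpec(
        name=name,
        capacity=capacity,
        burst=raw.get("burst", capacity),
        refill_amount=raw.get("refill_amount", capacity),
        refill_period=raw.get("refill_period", 60),
    )


# Shorthand: only capacity given, everything else defaulted.
rpm = expand_limit("rpm", {"capacity": 1000})
# Partial override: burst is explicit, the rest still default.
tpm = expand_limit("tpm", {"capacity": 50000, "burst": 75000})
```

Here `rpm` ends up with `burst`, `refill_amount`, and `refill_period` of 1000, 1000, and 60, while `tpm` keeps its explicit `burst` of 75000.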

**Map keys for limits:** Limits are keyed by name (map, not list) to prevent duplicate limit names in a single declaration.

### Architecture

Both CLI and CloudFormation paths converge at a single Lambda provisioner:

```
YAML → CLI parse → Lambda invoke ──→ diff vs #PROVISIONER ──→ Repository API ──→ DynamoDB
CFN event ─────────→ Lambda ────────→ diff vs #PROVISIONER ──→ Repository API ──→ DynamoDB
```

The Lambda is the only writer. No direct DynamoDB access from the CLI. LocalStack supports Lambda containers, so no fallback is needed.

### CLI Commands

```bash
# Preview changes (dry-run, like terraform plan)
zae-limiter limits plan -f tenant-alpha.limits.yaml

# Apply limits from YAML (idempotent, like terraform apply)
zae-limiter limits apply -f tenant-alpha.limits.yaml

# Show diff between YAML and live DynamoDB state (detect out-of-band drift)
zae-limiter limits diff -f tenant-alpha.limits.yaml

# Generate a CloudFormation template from YAML
zae-limiter limits cfn-template -f tenant-alpha.limits.yaml
```

All commands inherit `--name`, `--region`, and `--endpoint-url` from the parent CLI.

### Lambda Provisioner

A new Lambda function in the main stack handles both CLI invocations and CFN custom resource events.

**CLI invocation payload:**

```json
{
  "action": "plan|apply",
  "manifest": {
    "namespace": "tenant-alpha",
    "system": { "on_unavailable": "allow", "limits": { "rpm": { "capacity": 10000 } } },
    "resources": { "gpt-4": { "limits": { "rpm": { "capacity": 1000 } } } },
    "entities": { "user-123": { "resources": { "gpt-4": { "limits": { "rpm": { "capacity": 500 } } } } } }
  }
}
```
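
On the CLI side, building the payload above could look roughly like the following. `build_invoke_payload` is a hypothetical helper, and the two validation checks are assumptions about what the CLI would reject before invoking the Lambda.

```python
import json

VALID_ACTIONS = {"plan", "apply"}


def build_invoke_payload(action: str, manifest: dict) -> bytes:
    """Serialize a parsed manifest into the provisioner's invocation payload."""
    if action not in VALID_ACTIONS:
        raise ValueError(f"unknown action {action!r}; expected one of {sorted(VALID_ACTIONS)}")
    if not manifest.get("namespace"):
        raise ValueError("manifest must declare a namespace")
    body = {"action": action, "manifest": manifest}
    # sort_keys keeps the serialized payload stable across runs
    return json.dumps(body, sort_keys=True).encode("utf-8")


payload = build_invoke_payload(
    "plan",
    {
        "namespace": "tenant-alpha",
        "resources": {"gpt-4": {"limits": {"rpm": {"capacity": 1000}}}},
    },
)
```

The resulting bytes could then be passed as the `Payload` argument of the boto3 Lambda client's `invoke` call, with `FunctionName` set to the provisioner function.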

**Response:**

```json
{
  "status": "applied|planned",
  "changes": [
    {"action": "create", "level": "resource", "target": "gpt-4", "limits": {"rpm": {"capacity": 1000}}},
    {"action": "update", "level": "entity", "target": "user-123/gpt-4", "limits": {"rpm": {"capacity": 500}}},
    {"action": "delete", "level": "resource", "target": "gpt-3.5-turbo"}
  ],
  "manifest_hash": "sha256:abc..."
}
```
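
As a sketch of how the CLI might turn this response into a human-readable plan, the snippet below prints one line per change. The `+`/`~`/`-` symbols and the `format_changes` name are assumptions borrowed from Terraform-style output, not something the proposal specifies.

```python
SYMBOLS = {"create": "+", "update": "~", "delete": "-"}


def format_changes(response: dict) -> str:
    """Render the provisioner's `changes` list as one line per change."""
    lines = []
    for change in response.get("changes", []):
        symbol = SYMBOLS.get(change["action"], "?")
        line = f"{symbol} {change['level']} {change['target']}"
        limits = change.get("limits")
        if limits:
            # Show only capacity to keep each line short
            summary = ", ".join(
                f"{name}={spec['capacity']}" for name, spec in sorted(limits.items())
            )
            line += f" ({summary})"
        lines.append(line)
    return "\n".join(lines) if lines else "No changes."


response = {
    "status": "planned",
    "changes": [
        {"action": "create", "level": "resource", "target": "gpt-4",
         "limits": {"rpm": {"capacity": 1000}}},
        {"action": "update", "level": "entity", "target": "user-123/gpt-4",
         "limits": {"rpm": {"capacity": 500}}},
        {"action": "delete", "level": "resource", "target": "gpt-3.5-turbo"},
    ],
}
report = format_changes(response)
```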

**CFN lifecycle mapping:**

| CFN Event | Lambda Action |
|-----------|---------------|
| Create | `apply` (first apply, empty provisioner state) |
| Update | `apply` (diff against previous provisioner state) |
| Delete | `apply` with empty manifest (deletes all managed items) |
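
The mapping in this table could be implemented along these lines. `RequestType` and `ResourceProperties` are standard fields of a CloudFormation custom resource event; the function name and the choice to keep the PascalCase property keys as-is are assumptions for illustration.

```python
def to_provisioner_request(event: dict) -> dict:
    """Translate a CFN custom resource event into an `apply` request."""
    request_type = event["RequestType"]
    if request_type not in ("Create", "Update", "Delete"):
        raise ValueError(f"unexpected RequestType {request_type!r}")

    properties = dict(event.get("ResourceProperties", {}))
    # ServiceToken is routing information for CloudFormation, not manifest content
    properties.pop("ServiceToken", None)

    if request_type == "Delete":
        # Empty manifest: only the namespace survives, so every managed item is removed
        manifest = {"Namespace": properties.get("Namespace")}
    else:
        manifest = properties
    return {"action": "apply", "manifest": manifest}


create_event = {
    "RequestType": "Create",
    "ResourceProperties": {
        "ServiceToken": "arn:aws:lambda:us-east-1:123456789012:function:example",
        "Namespace": "tenant-alpha",
        "System": {"OnUnavailable": "allow"},
    },
}
delete_event = {**create_event, "RequestType": "Delete"}

create_request = to_provisioner_request(create_event)
delete_request = to_provisioner_request(delete_event)
```

Create and Update share one code path because the diff against the provisioner state already distinguishes a first apply from a later one.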

### State Tracking (Terraform-style)

The provisioner tracks "managed" vs "unmanaged" limits to avoid clobbering manual overrides:

| Scenario | Behavior |
|----------|----------|
| In YAML, not in DynamoDB | **Create** it |
| In YAML, in DynamoDB (managed) | **Update** it if changed |
| Removed from YAML, was managed | **Delete** it |
| In DynamoDB, never in YAML | **Leave it alone** |

State is stored in a DynamoDB record:

| Field | Description |
|-------|-------------|
| PK | `{ns}/SYSTEM#` |
| SK | `#PROVISIONER` |
| `managed_system` | Boolean: whether system defaults are managed |
| `managed_resources` | List of resource names managed by this manifest |
| `managed_entities` | Map of `{entity_id: [resource_list]}` managed by this manifest |
| `last_applied` | ISO timestamp of last apply |
| `applied_hash` | SHA-256 of the YAML content (for drift detection) |
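
A minimal sketch of the `applied_hash` computation, assuming the hash is taken over the raw YAML text and stored with a `sha256:` prefix like the `manifest_hash` in the response example. Hashing the raw text means comment or whitespace edits also change the hash; hashing a normalized form of the manifest would be an alternative.

```python
import hashlib
from typing import Optional


def manifest_hash(yaml_text: str) -> str:
    """SHA-256 of the YAML content, prefixed with the algorithm name."""
    digest = hashlib.sha256(yaml_text.encode("utf-8")).hexdigest()
    return f"sha256:{digest}"


def has_changed(yaml_text: str, applied_hash: Optional[str]) -> bool:
    """True when the file differs from what was last applied (or nothing was applied)."""
    return applied_hash != manifest_hash(yaml_text)


text = "namespace: tenant-alpha\n"
stored = manifest_hash(text)
```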

**Known limitation:** The `#PROVISIONER` record is a single DynamoDB item, and DynamoDB caps items at 400 KB. This limits managed entities to roughly 5,000 per namespace. This is documented as a known limit and can be addressed later with sharding or item-level tagging if needed.

### CloudFormation Integration

**Main stack** exports the provisioner Lambda ARN:

```yaml
Resources:
  LimitsProvisionerFunction:
    Type: AWS::Lambda::Function
    Properties:
      FunctionName: !Sub "${AWS::StackName}-limits-provisioner"
      Handler: zae_limiter_provisioner.handler.on_event
      # Code, Role, Runtime, and other properties omitted for brevity

Outputs:
  LimitsProvisionerArn:
    Value: !GetAtt LimitsProvisionerFunction.Arn
    Export:
      Name: !Sub "${AWS::StackName}-LimitsProvisionerArn"

**Tenant stack** (generated by `limits cfn-template`), one per namespace:

```yaml
Resources:
  TenantLimits:
    Type: Custom::ZaeLimiterLimits
    Properties:
      ServiceToken: !ImportValue "my-app-LimitsProvisionerArn"
      Namespace: tenant-alpha
      System:
        OnUnavailable: allow
        Limits:
          rpm:
            Capacity: 10000
      Resources:
        gpt-4:
          Limits:
            tpm:
              Capacity: 100000
```

### New Package Structure

```
src/zae_limiter_provisioner/
├── __init__.py          # Re-exports handler, types
├── handler.py           # Lambda entry: on_event() for CLI + CFN
├── manifest.py          # LimitsManifest dataclass, YAML parsing, validation
├── differ.py            # Diff engine: manifest vs #PROVISIONER record
└── applier.py           # Applies changes via Repository API

src/zae_limiter/
├── cli.py               # New `limits` command group (plan, apply, diff, cfn-template)
├── schema.py            # New key builder: sk_provisioner()
└── repository.py        # New methods: get_provisioner_state(), put_provisioner_state()
```

### Apply Algorithm

```
apply(manifest: LimitsManifest, repo: Repository):
  1. Auto-register namespace if it doesn't exist
  2. Read current #PROVISIONER record from DynamoDB (or empty if first apply)
  3. Compute diff:

     SYSTEM:
       manifest has system + previous didn't    → set_system_defaults()
       manifest has system + previous did        → set_system_defaults() (overwrite)
       manifest lacks system + previous had      → delete_system_defaults()

     RESOURCES:
       in manifest, not in previous_managed      → set_resource_defaults()
       in manifest, in previous_managed          → set_resource_defaults() (overwrite)
       not in manifest, in previous_managed      → delete_resource_defaults()
       not in manifest, not in previous_managed  → skip (unmanaged)

     ENTITIES (same pattern):
       in manifest, not in previous_managed      → set_limits()
       in manifest, in previous_managed          → set_limits() (overwrite)
       not in manifest, in previous_managed      → delete_limits()
       not in manifest, not in previous_managed  → skip (unmanaged)

  4. Write updated #PROVISIONER record with new managed set + hash
```

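**Plan mode** reads the `#PROVISIONER` record and computes the diff (steps 2-3), then returns the diff without applying it. Since `plan` must not modify DynamoDB, namespace auto-registration (step 1) happens only on `apply`.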
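
The RESOURCES and ENTITIES rules in the algorithm reduce to set operations over names. The sketch below assumes the manifest and the previous managed list have already been reduced to sets of names; `diff_names` is a hypothetical helper, not the proposed `differ.py` interface.

```python
def diff_names(declared: set, previously_managed: set) -> dict:
    """Classify names into create / update / delete per the apply algorithm.

    Names that are in neither set never appear here, which is how
    unmanaged items end up being skipped.
    """
    return {
        "create": sorted(declared - previously_managed),
        "update": sorted(declared & previously_managed),  # overwritten on apply
        "delete": sorted(previously_managed - declared),
        "new_managed": sorted(declared),  # written back to #PROVISIONER in step 4
    }


plan = diff_names(
    declared={"gpt-4", "claude-3"},
    previously_managed={"gpt-4", "gpt-3.5-turbo"},
)
```

In this example `claude-3` is created, `gpt-4` is overwritten, and `gpt-3.5-turbo` is deleted because it was managed before and is no longer declared.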

### Cost and Performance

| Operation | DynamoDB Cost |
|-----------|---------------|
| `limits plan` (dry-run) | Read-only: 1 RCU (provisioner state) + 1 RCU per level checked |
| `limits apply` (no changes) | 1 RCU (state) + 1 RCU per level (verify) + 1 WCU (state update) |
| `limits apply` (with changes) | 1 RCU (state) + 1 WCU per changed config + 1 WCU (state update) |

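These figures assume each item fits in a single capacity unit (4 KB per strongly consistent read, 1 KB per write); a large `#PROVISIONER` record consumes proportionally more. This is an administrative operation, not on the hot path -- cost is negligible.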

## Alternatives Considered

1. **CLI direct to DynamoDB (no Lambda)** -- simpler, but requires operators to hold DynamoDB credentials and cannot be used as a CloudFormation Custom Resource
2. **CLI fallback when Lambda unavailable** -- adds a second code path to maintain. LocalStack supports Lambda containers, so no fallback is needed
3. **List-style limits** (`[{name: rpm, capacity: 1000}]`) -- allows duplicate limit names by accident. Map style (`rpm: {capacity: 1000}`) prevents this
4. **Multi-namespace YAML files** -- adds complexity. One file per namespace is cleaner, matches one CFN stack per namespace
5. **Full sync (delete everything not in YAML)** -- would clobber limits set manually via CLI/API. Terraform-style state tracking preserves unmanaged items
6. **External state file** (like terraform.tfstate) -- adds file/S3 management complexity. DynamoDB `#PROVISIONER` record is self-contained

## Acceptance Criteria

- [x] YAML schema defined and validated (Pydantic or dataclass) covering `namespace`, `system`, `resources`, and `entities` sections with map-style limits
- [x] Shorthand limit syntax supported: only `capacity` required, defaults to `Limit.per_minute()` behavior
- [x] `zae-limiter limits plan -f <file>` invokes Lambda and prints a human-readable diff without modifying DynamoDB
- [x] `zae-limiter limits apply -f <file>` invokes Lambda and applies limits idempotently -- running twice with the same file produces zero changes on second run
- [x] `zae-limiter limits diff -f <file>` shows differences between YAML declaration and live DynamoDB state
- [x] `zae-limiter limits cfn-template -f <file>` generates a CloudFormation template with `Custom::ZaeLimiterLimits` resource
- [x] `#PROVISIONER` state record (`PK={ns}/SYSTEM#, SK=#PROVISIONER`) tracks managed resources, entities, and applied hash
- [x] Removing a limit from YAML and running `apply` deletes it from DynamoDB (only if previously managed per `#PROVISIONER` record)
- [x] Limits set manually via `set_system_defaults`/`set_resource_defaults`/`set_limits` are not deleted by `apply` unless tracked in `#PROVISIONER`
- [x] Namespaces are auto-registered on first `apply` if they don't exist
- [x] Lambda provisioner function handles both CLI invocations and CFN Custom Resource events (Create/Update/Delete)
- [x] Main CloudFormation stack exports `LimitsProvisionerArn` for cross-stack reference
- [x] Unit tests cover YAML parsing, diff computation, and state tracking logic
- [x] Integration tests verify idempotent apply, removal tracking, and CFN custom resource lifecycle
- [x] Documentation updated: CLAUDE.md (schema, CLI commands, access patterns), CLI docs, and operator guide cover declarative limits workflow

