✨ Declarative limits management via YAML files and Lambda provisioner #405

Description

@sodre

Problem or Use Case

Today, operators manage rate limits imperatively through individual CLI commands (zae-limiter system set-defaults, resource set-defaults, entity set-limits) or Python API calls (repo.set_system_defaults(), repo.set_resource_defaults(), repo.set_limits()). This works for one-off changes but creates challenges at scale:

  • No audit trail of desired state -- operators know what limits are but not what they should be
  • Drift detection is impossible -- no way to compare live limits against a source-of-truth document
  • Multi-namespace rollouts are tedious -- setting the same limits across tenant-alpha, tenant-beta, etc. requires repeated CLI invocations
  • No GitOps workflow -- limits cannot be version-controlled and reviewed through normal PR processes
  • No idempotent apply -- running the same CLI commands twice may produce warnings or unexpected behavior

Operators need a way to declare their desired limits configuration in version-controlled YAML files and have those declarations applied idempotently, similar to how Terraform manages infrastructure state.

Proposed Solution

YAML Configuration Format

One YAML file per namespace. Namespaces are auto-registered on first apply if they don't exist.

# tenant-alpha.limits.yaml
namespace: tenant-alpha

system:
  on_unavailable: allow
  limits:
    rpm:
      capacity: 10000
    tpm:
      capacity: 100000

resources:
  gpt-4:
    limits:
      rpm:
        capacity: 1000
      tpm:
        capacity: 50000
        burst: 75000
        refill_amount: 50000
        refill_period: 60
  claude-3:
    limits:
      tpm:
        capacity: 200000

entities:
  user-123:
    resources:
      gpt-4:
        limits:
          rpm:
            capacity: 500
      _default_:
        limits:
          rpm:
            capacity: 200

Limit shorthand: Only capacity is required. The other fields default to burst = capacity, refill_amount = capacity, and refill_period = 60 seconds, which matches Limit.per_minute(). Override any field explicitly for custom refill behavior.
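The shorthand rule above can be sketched as a small expansion step. This is a minimal illustration, not the project's implementation: `LimitSpec` and `expand_limit` are hypothetical names, and the period is assumed to be in seconds because a default of 60 is said to match Limit.per_minute().

```python
from dataclasses import dataclass

_ALLOWED_FIELDS = {"capacity", "burst", "refill_amount", "refill_period"}


@dataclass(frozen=True)
class LimitSpec:
    """Hypothetical fully expanded limit, mirroring the YAML fields."""
    capacity: int
    burst: int
    refill_amount: int
    refill_period: int  # assumed to be seconds


def expand_limit(raw: dict) -> LimitSpec:
    """Fill in the documented defaults for any field the YAML omits."""
    if "capacity" not in raw:
        raise ValueError("capacity is required")
    unknown = set(raw) - _ALLOWED_FIELDS
    if unknown:
        raise ValueError(f"unknown limit fields: {sorted(unknown)}")
    capacity = raw["capacity"]
    return LimitSpec(
        capacity=capacity,
        burst=raw.get("burst", capacity),
        refill_amount=raw.get("refill_amount", capacity),
        refill_period=raw.get("refill_period", 60),
    )
```

Rejecting unknown fields is a design choice of this sketch; it catches typos such as `refil_period` at parse time instead of silently applying the default.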

Map keys for limits: Limits are keyed by name (map, not list) to prevent duplicate limit names in a single declaration.

Architecture

Both CLI and CloudFormation paths converge at a single Lambda provisioner:

YAML → CLI parse → Lambda invoke ──→ diff vs #PROVISIONER ──→ Repository API ──→ DynamoDB
CFN event ─────────→ Lambda ────────→ diff vs #PROVISIONER ──→ Repository API ──→ DynamoDB

The Lambda is the only writer. No direct DynamoDB access from the CLI. LocalStack supports Lambda containers, so no fallback is needed.

CLI Commands

# Preview changes (dry-run, like terraform plan)
zae-limiter limits plan -f tenant-alpha.limits.yaml

# Apply limits from YAML (idempotent, like terraform apply)
zae-limiter limits apply -f tenant-alpha.limits.yaml

# Show diff between YAML and live DynamoDB state (detect out-of-band drift)
zae-limiter limits diff -f tenant-alpha.limits.yaml

# Generate a CloudFormation template from YAML
zae-limiter limits cfn-template -f tenant-alpha.limits.yaml

All commands inherit --name and --region (and --endpoint-url) from the parent CLI.

Lambda Provisioner

A new Lambda function in the main stack that handles both CLI invocations and CFN custom resource events.

CLI invocation payload (action is either "plan" or "apply"):

{
  "action": "plan|apply",
  "manifest": {
    "namespace": "tenant-alpha",
    "system": { "on_unavailable": "allow", "limits": { "rpm": { "capacity": 10000 } } },
    "resources": { "gpt-4": { "limits": { "rpm": { "capacity": 1000 } } } },
    "entities": { "user-123": { "resources": { "gpt-4": { "limits": { "rpm": { "capacity": 500 } } } } } }
  }
}
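As a rough sketch of the CLI side, the payload could be assembled and validated before the Lambda is invoked. `build_payload` is a hypothetical helper; the manifest is assumed to be a dict already parsed from the YAML file, and the set of accepted actions and sections is taken from the payload shown above.

```python
import json

ALLOWED_ACTIONS = {"plan", "apply"}
ALLOWED_SECTIONS = {"namespace", "system", "resources", "entities"}


def build_payload(action: str, manifest: dict) -> bytes:
    """Serialize an invocation payload, rejecting obviously malformed input."""
    if action not in ALLOWED_ACTIONS:
        raise ValueError(f"action must be one of {sorted(ALLOWED_ACTIONS)}")
    if not manifest.get("namespace"):
        raise ValueError("manifest must declare a namespace")
    unknown = set(manifest) - ALLOWED_SECTIONS
    if unknown:
        raise ValueError(f"unknown manifest sections: {sorted(unknown)}")
    # sort_keys makes the serialized form stable across runs
    body = {"action": action, "manifest": manifest}
    return json.dumps(body, sort_keys=True).encode("utf-8")
```

The resulting bytes could then be passed as the `Payload` argument of a boto3 Lambda client `invoke` call; that call is left out here so the sketch runs without AWS dependencies.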

Response (status is "planned" for plan and "applied" for apply):

{
  "status": "applied|planned",
  "changes": [
    {"action": "create", "level": "resource", "target": "gpt-4", "limits": {"rpm": {"capacity": 1000}}},
    {"action": "update", "level": "entity", "target": "user-123/gpt-4", "limits": {"rpm": {"capacity": 500}}},
    {"action": "delete", "level": "resource", "target": "gpt-3.5-turbo"}
  ],
  "manifest_hash": "sha256:abc..."
}
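One way the CLI might turn this response into a human-readable plan is sketched below. `render_changes` and the `+`/`~`/`-` markers are illustrative choices, not part of the proposal; only the response fields shown above are assumed.

```python
def render_changes(response: dict) -> str:
    """Format a provisioner response as one line per change."""
    symbols = {"create": "+", "update": "~", "delete": "-"}
    lines = []
    for change in response.get("changes", []):
        line = f"{symbols[change['action']]} {change['level']} {change['target']}"
        limits = change.get("limits")
        if limits:
            details = ", ".join(
                f"{name}.capacity={spec['capacity']}"
                for name, spec in sorted(limits.items())
            )
            line += f" ({details})"
        lines.append(line)
    if not lines:
        lines.append("No changes.")
    lines.append(f"status: {response['status']}")
    return "\n".join(lines)
```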

CFN lifecycle mapping:

| CFN Event | Lambda Action |
|-----------|---------------|
| Create | apply (first apply, empty provisioner state) |
| Update | apply (diff against previous provisioner state) |
| Delete | apply with empty manifest (deletes all managed items) |
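The lifecycle mapping can be sketched as a translation from a CloudFormation custom resource event into the same payload the CLI sends. `RequestType` and `ResourceProperties` are standard fields of custom resource events; everything else here (`cfn_event_to_invocation`, the key map, the `Entities`, `Burst`, `RefillAmount`, and `RefillPeriod` property names) is an assumption for illustration. The integer coercion assumes CloudFormation delivers scalar property values as strings.

```python
# Hypothetical mapping from PascalCase template properties to manifest keys.
_KEY_MAP = {
    "Namespace": "namespace", "System": "system", "Resources": "resources",
    "Entities": "entities", "OnUnavailable": "on_unavailable", "Limits": "limits",
    "Capacity": "capacity", "Burst": "burst",
    "RefillAmount": "refill_amount", "RefillPeriod": "refill_period",
}
_INT_FIELDS = {"capacity", "burst", "refill_amount", "refill_period"}


def _to_manifest(value, key=None):
    """Recursively rename known keys and coerce numeric limit fields to int."""
    if isinstance(value, dict):
        out = {}
        for k, v in value.items():
            new_key = _KEY_MAP.get(k, k)  # resource and limit names pass through
            out[new_key] = _to_manifest(v, new_key)
        return out
    if key in _INT_FIELDS:
        return int(value)
    return value


def cfn_event_to_invocation(event: dict) -> dict:
    """Map Create/Update/Delete onto an apply with the right manifest."""
    props = {
        k: v for k, v in event.get("ResourceProperties", {}).items()
        if k != "ServiceToken"
    }
    request_type = event["RequestType"]
    if request_type in ("Create", "Update"):
        manifest = _to_manifest(props)
    elif request_type == "Delete":
        # Empty manifest: everything previously managed gets removed.
        manifest = {"namespace": props["Namespace"]}
    else:
        raise ValueError(f"unsupported RequestType: {request_type}")
    return {"action": "apply", "manifest": manifest}
```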

State Tracking (Terraform-style)

The provisioner tracks "managed" vs "unmanaged" limits to avoid clobbering manual overrides:

| Scenario | Behavior |
|----------|----------|
| In YAML, not in DynamoDB | Create it |
| In YAML, in DynamoDB (managed) | Update it if changed |
| Removed from YAML, was managed | Delete it |
| In DynamoDB, never in YAML | Leave it alone |
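The four scenarios reduce to a small decision function. This is a sketch under one stated assumption: the table does not cover an item that is in YAML and already in DynamoDB but unmanaged, so the sketch treats it like a managed item and updates it if changed. `Action` and `decide` are hypothetical names.

```python
from enum import Enum


class Action(Enum):
    CREATE = "create"
    UPDATE = "update"
    DELETE = "delete"
    SKIP = "skip"


def decide(in_yaml: bool, in_dynamodb: bool, was_managed: bool,
           changed: bool = True) -> Action:
    """Pick the provisioner action for one limit config."""
    if in_yaml:
        if not in_dynamodb:
            return Action.CREATE
        # Assumption: an existing unmanaged item declared in YAML is adopted.
        return Action.UPDATE if changed else Action.SKIP
    if was_managed:
        return Action.DELETE
    # Never declared in YAML: a manual override, left untouched.
    return Action.SKIP
```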

State is stored in a DynamoDB record:

| Field | Description |
|-------|-------------|
| PK | {ns}/SYSTEM# |
| SK | #PROVISIONER |
| managed_system | Boolean: whether system defaults are managed |
| managed_resources | List of resource names managed by this manifest |
| managed_entities | Map of {entity_id: [resource_list]} managed by this manifest |
| last_applied | ISO timestamp of last apply |
| applied_hash | SHA-256 of the YAML content (for drift detection) |

Known limitation: The #PROVISIONER record is a single DynamoDB item, and DynamoDB caps items at 400 KB. This limits managed entities to roughly 5,000 per namespace. This will be documented as a known limit; it can be addressed later with sharding or item-level tagging if needed.

CloudFormation Integration

Main stack exports the provisioner Lambda ARN:

Resources:
  LimitsProvisionerFunction:
    Type: AWS::Lambda::Function
    Properties:
      FunctionName: !Sub "${AWS::StackName}-limits-provisioner"
      Handler: zae_limiter_provisioner.handler.on_event
      # Role, Runtime, and Code omitted for brevity

Outputs:
  LimitsProvisionerArn:
    Value: !GetAtt LimitsProvisionerFunction.Arn
    Export:
      Name: !Sub "${AWS::StackName}-LimitsProvisionerArn"

Tenant stack (generated by limits cfn-template), one per namespace:

Resources:
  TenantLimits:
    Type: Custom::ZaeLimiterLimits
    Properties:
      ServiceToken: !ImportValue "my-app-LimitsProvisionerArn"
      Namespace: tenant-alpha
      System:
        OnUnavailable: allow
        Limits:
          rpm:
            Capacity: 10000
      Resources:
        gpt-4:
          Limits:
            tpm:
              Capacity: 100000

New Package Structure

src/zae_limiter_provisioner/
├── __init__.py          # Re-exports handler, types
├── handler.py           # Lambda entry: on_event() for CLI + CFN
├── manifest.py          # LimitsManifest dataclass, YAML parsing, validation
├── differ.py            # Diff engine: manifest vs #PROVISIONER record
└── applier.py           # Applies changes via Repository API

src/zae_limiter/
├── cli.py               # New `limits` command group (plan, apply, diff, cfn-template)
├── schema.py            # New key builder: sk_provisioner()
└── repository.py        # New methods: get_provisioner_state(), put_provisioner_state()

Apply Algorithm

apply(manifest: LimitsManifest, repo: Repository):
  1. Auto-register namespace if it doesn't exist
  2. Read current #PROVISIONER record from DynamoDB (or empty if first apply)
  3. Compute diff:

     SYSTEM:
       manifest has system + previous didn't    → set_system_defaults()
       manifest has system + previous did        → set_system_defaults() (overwrite)
       manifest lacks system + previous had      → delete_system_defaults()

     RESOURCES:
       in manifest, not in previous_managed      → set_resource_defaults()
       in manifest, in previous_managed          → set_resource_defaults() (overwrite)
       not in manifest, in previous_managed      → delete_resource_defaults()
       not in manifest, not in previous_managed  → skip (unmanaged)

     ENTITIES (same pattern):
       in manifest, not in previous_managed      → set_limits()
       in manifest, in previous_managed          → set_limits() (overwrite)
       not in manifest, in previous_managed      → delete_limits()
       not in manifest, not in previous_managed  → skip (unmanaged)

  4. Write updated #PROVISIONER record with new managed set + hash

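Plan mode computes the same diff (steps 2-3) and returns it without writing anything: it skips the namespace auto-registration in step 1, makes none of the set/delete calls, and does not update the #PROVISIONER record in step 4.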
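The diff step of the algorithm can be sketched as a pure function over the manifest and the previous #PROVISIONER state, which is what makes a plan mode cheap to support. `compute_changes` is a hypothetical name. Two simplifications are assumed: "create" versus "update" is decided from the managed set alone, and unchanged items are not filtered out, so a real differ would also compare against stored values to report zero changes on a repeated apply.

```python
def _action(was_managed: bool) -> str:
    return "update" if was_managed else "create"


def compute_changes(manifest: dict, previous: dict) -> tuple[list, dict]:
    """Return (changes, new managed state) without touching DynamoDB."""
    changes = []

    # SYSTEM
    has_system = "system" in manifest
    was_system = previous.get("managed_system", False)
    if has_system:
        changes.append({"action": _action(was_system), "level": "system", "target": "system"})
    elif was_system:
        changes.append({"action": "delete", "level": "system", "target": "system"})

    # RESOURCES
    wanted = set(manifest.get("resources", {}))
    managed = set(previous.get("managed_resources", []))
    for name in sorted(wanted):
        changes.append({"action": _action(name in managed), "level": "resource", "target": name})
    for name in sorted(managed - wanted):
        changes.append({"action": "delete", "level": "resource", "target": name})

    # ENTITIES, tracked as (entity, resource) pairs
    wanted_pairs = {
        (entity, resource)
        for entity, spec in manifest.get("entities", {}).items()
        for resource in spec.get("resources", {})
    }
    managed_pairs = {
        (entity, resource)
        for entity, resources in previous.get("managed_entities", {}).items()
        for resource in resources
    }
    for entity, resource in sorted(wanted_pairs):
        changes.append({"action": _action((entity, resource) in managed_pairs),
                        "level": "entity", "target": f"{entity}/{resource}"})
    for entity, resource in sorted(managed_pairs - wanted_pairs):
        changes.append({"action": "delete", "level": "entity", "target": f"{entity}/{resource}"})

    # Anything never managed is simply absent from both sets, so it is skipped.
    managed_entities: dict = {}
    for entity, resource in sorted(wanted_pairs):
        managed_entities.setdefault(entity, []).append(resource)
    new_state = {
        "managed_system": has_system,
        "managed_resources": sorted(wanted),
        "managed_entities": managed_entities,
    }
    return changes, new_state
```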

Cost and Performance

| Operation | DynamoDB Cost |
|-----------|---------------|
| limits plan (dry-run) | Read-only: 1 RCU (provisioner state) + 1 RCU per level checked |
| limits apply (no changes) | 1 RCU (state) + 1 RCU per level (verify) + 1 WCU (state update) |
| limits apply (with changes) | 1 RCU (state) + 1 WCU per changed config + 1 WCU (state update) |

This is an administrative operation, not on the hot path -- cost is negligible.

Alternatives Considered

  1. CLI direct to DynamoDB (no Lambda) -- simpler, but requires operators to have DynamoDB credentials and cannot be used as a CloudFormation Custom Resource
  2. CLI fallback when Lambda unavailable -- adds a second code path to maintain. LocalStack supports Lambda containers, so no fallback needed
  3. List-style limits ([{name: rpm, capacity: 1000}]) -- allows duplicate limit names by accident. Map style (rpm: {capacity: 1000}) prevents this
  4. Multi-namespace YAML files -- adds complexity. One file per namespace is cleaner, matches one CFN stack per namespace
  5. Full sync (delete everything not in YAML) -- would clobber limits set manually via CLI/API. Terraform-style state tracking preserves unmanaged items
  6. External state file (like terraform.tfstate) -- adds file/S3 management complexity. DynamoDB #PROVISIONER record is self-contained

Acceptance Criteria

  • YAML schema defined and validated (Pydantic or dataclass) covering namespace, system, resources, and entities sections with map-style limits
  • Shorthand limit syntax supported: only capacity required, defaults to Limit.per_minute() behavior
  • zae-limiter limits plan -f <file> invokes Lambda and prints a human-readable diff without modifying DynamoDB
  • zae-limiter limits apply -f <file> invokes Lambda and applies limits idempotently -- running twice with the same file produces zero changes on second run
  • zae-limiter limits diff -f <file> shows differences between YAML declaration and live DynamoDB state
  • zae-limiter limits cfn-template -f <file> generates a CloudFormation template with Custom::ZaeLimiterLimits resource
  • #PROVISIONER state record (PK={ns}/SYSTEM#, SK=#PROVISIONER) tracks managed resources, entities, and applied hash
  • Removing a limit from YAML and running apply deletes it from DynamoDB (only if previously managed per #PROVISIONER record)
  • Limits set manually via set_system_defaults/set_resource_defaults/set_limits are not deleted by apply unless tracked in #PROVISIONER
  • Namespaces are auto-registered on first apply if they don't exist
  • Lambda provisioner function handles both CLI invocations and CFN Custom Resource events (Create/Update/Delete)
  • Main CloudFormation stack exports LimitsProvisionerArn for cross-stack reference
  • Unit tests cover YAML parsing, diff computation, and state tracking logic
  • Integration tests verify idempotent apply, removal tracking, and CFN custom resource lifecycle
  • Documentation updated: CLAUDE.md (schema, CLI commands, access patterns), CLI docs, and operator guide cover declarative limits workflow


Labels: area/cli (Command line interface), area/infra (CloudFormation, IAM, infrastructure)
