diff --git a/.claude/commands/openspec/apply.md b/.claude/commands/openspec/apply.md new file mode 100644 index 00000000..a36fd964 --- /dev/null +++ b/.claude/commands/openspec/apply.md @@ -0,0 +1,23 @@ +--- +name: OpenSpec: Apply +description: Implement an approved OpenSpec change and keep tasks in sync. +category: OpenSpec +tags: [openspec, apply] +--- + +**Guardrails** +- Favor straightforward, minimal implementations first and add complexity only when it is requested or clearly required. +- Keep changes tightly scoped to the requested outcome. +- Refer to `openspec/AGENTS.md` (located inside the `openspec/` directory—run `ls openspec` or `openspec update` if you don't see it) if you need additional OpenSpec conventions or clarifications. + +**Steps** +Track these steps as TODOs and complete them one by one. +1. Read `changes//proposal.md`, `design.md` (if present), and `tasks.md` to confirm scope and acceptance criteria. +2. Work through tasks sequentially, keeping edits minimal and focused on the requested change. +3. Confirm completion before updating statuses—make sure every item in `tasks.md` is finished. +4. Update the checklist after all work is done so each task is marked `- [x]` and reflects reality. +5. Reference `openspec list` or `openspec show ` when additional context is required. + +**Reference** +- Use `openspec show --json --deltas-only` if you need additional context from the proposal while implementing. + diff --git a/.claude/commands/openspec/archive.md b/.claude/commands/openspec/archive.md new file mode 100644 index 00000000..dbc76958 --- /dev/null +++ b/.claude/commands/openspec/archive.md @@ -0,0 +1,27 @@ +--- +name: OpenSpec: Archive +description: Archive a deployed OpenSpec change and update specs. +category: OpenSpec +tags: [openspec, archive] +--- + +**Guardrails** +- Favor straightforward, minimal implementations first and add complexity only when it is requested or clearly required. +- Keep changes tightly scoped to the requested outcome. +- Refer to `openspec/AGENTS.md` (located inside the `openspec/` directory—run `ls openspec` or `openspec update` if you don't see it) if you need additional OpenSpec conventions or clarifications. + +**Steps** +1. Determine the change ID to archive: + - If this prompt already includes a specific change ID (for example inside a `` block populated by slash-command arguments), use that value after trimming whitespace. + - If the conversation references a change loosely (for example by title or summary), run `openspec list` to surface likely IDs, share the relevant candidates, and confirm which one the user intends. + - Otherwise, review the conversation, run `openspec list`, and ask the user which change to archive; wait for a confirmed change ID before proceeding. + - If you still cannot identify a single change ID, stop and tell the user you cannot archive anything yet. +2. Validate the change ID by running `openspec list` (or `openspec show `) and stop if the change is missing, already archived, or otherwise not ready to archive. +3. Run `openspec archive --yes` so the CLI moves the change and applies spec updates without prompts (use `--skip-specs` only for tooling-only work). +4. Review the command output to confirm the target specs were updated and the change landed in `changes/archive/`. +5. Validate with `openspec validate --strict` and inspect with `openspec show ` if anything looks off. + +**Reference** +- Use `openspec list` to confirm change IDs before archiving. +- Inspect refreshed specs with `openspec list --specs` and address any validation issues before handing off. + diff --git a/.claude/commands/openspec/proposal.md b/.claude/commands/openspec/proposal.md new file mode 100644 index 00000000..f4c1c97c --- /dev/null +++ b/.claude/commands/openspec/proposal.md @@ -0,0 +1,27 @@ +--- +name: OpenSpec: Proposal +description: Scaffold a new OpenSpec change and validate strictly. +category: OpenSpec +tags: [openspec, change] +--- + +**Guardrails** +- Favor straightforward, minimal implementations first and add complexity only when it is requested or clearly required. +- Keep changes tightly scoped to the requested outcome. +- Refer to `openspec/AGENTS.md` (located inside the `openspec/` directory—run `ls openspec` or `openspec update` if you don't see it) if you need additional OpenSpec conventions or clarifications. +- Identify any vague or ambiguous details and ask the necessary follow-up questions before editing files. + +**Steps** +1. Review `openspec/project.md`, run `openspec list` and `openspec list --specs`, and inspect related code or docs (e.g., via `rg`/`ls`) to ground the proposal in current behaviour; note any gaps that require clarification. +2. Choose a unique verb-led `change-id` and scaffold `proposal.md`, `tasks.md`, and `design.md` (when needed) under `openspec/changes//`. +3. Map the change into concrete capabilities or requirements, breaking multi-scope efforts into distinct spec deltas with clear relationships and sequencing. +4. Capture architectural reasoning in `design.md` when the solution spans multiple systems, introduces new patterns, or demands trade-off discussion before committing to specs. +5. Draft spec deltas in `changes//specs//spec.md` (one folder per capability) using `## ADDED|MODIFIED|REMOVED Requirements` with at least one `#### Scenario:` per requirement and cross-reference related capabilities when relevant. +6. Draft `tasks.md` as an ordered list of small, verifiable work items that deliver user-visible progress, include validation (tests, tooling), and highlight dependencies or parallelizable work. +7. Validate with `openspec validate --strict` and resolve every issue before sharing the proposal. + +**Reference** +- Use `openspec show --json --deltas-only` or `openspec show --type spec` to inspect details when validation fails. +- Search existing requirements with `rg -n "Requirement:|Scenario:" openspec/specs` before writing new ones. +- Explore the codebase with `rg `, `ls`, or direct file reads so proposals align with current implementation realities. + diff --git a/AGENTS.md b/AGENTS.md new file mode 100644 index 00000000..06696994 --- /dev/null +++ b/AGENTS.md @@ -0,0 +1,18 @@ + +# OpenSpec Instructions + +These instructions are for AI assistants working in this project. + +Always open `@/openspec/AGENTS.md` when the request: +- Mentions planning or proposals (words like proposal, spec, change, plan) +- Introduces new capabilities, breaking changes, architecture shifts, or big performance/security work +- Sounds ambiguous and you need the authoritative spec before coding + +Use `@/openspec/AGENTS.md` to learn: +- How to create and apply change proposals +- Spec format and conventions +- Project structure and guidelines + +Keep this managed block so 'openspec update' can refresh the instructions. + + \ No newline at end of file diff --git a/CLAUDE.md b/CLAUDE.md index 791ead93..a4439ded 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -1,3 +1,22 @@ + +# OpenSpec Instructions + +These instructions are for AI assistants working in this project. + +Always open `@/openspec/AGENTS.md` when the request: +- Mentions planning or proposals (words like proposal, spec, change, plan) +- Introduces new capabilities, breaking changes, architecture shifts, or big performance/security work +- Sounds ambiguous and you need the authoritative spec before coding + +Use `@/openspec/AGENTS.md` to learn: +- How to create and apply change proposals +- Spec format and conventions +- Project structure and guidelines + +Keep this managed block so 'openspec update' can refresh the instructions. + + + # Claude Developer Guide This file provides context and guidance for Claude (or other AI assistants) working on the LabLink codebase. diff --git a/openspec/AGENTS.md b/openspec/AGENTS.md new file mode 100644 index 00000000..355969d8 --- /dev/null +++ b/openspec/AGENTS.md @@ -0,0 +1,454 @@ +# OpenSpec Instructions + +Instructions for AI coding assistants using OpenSpec for spec-driven development. + +## TL;DR Quick Checklist + +- Search existing work: `openspec spec list --long`, `openspec list` (use `rg` only for full-text search) +- Decide scope: new capability vs modify existing capability +- Pick a unique `change-id`: kebab-case, verb-led (`add-`, `update-`, `remove-`, `refactor-`) +- Scaffold: `proposal.md`, `tasks.md`, `design.md` (only if needed), and delta specs per affected capability +- Write deltas: use `## ADDED|MODIFIED|REMOVED|RENAMED Requirements`; include at least one `#### Scenario:` per requirement +- Validate: `openspec validate [change-id] --strict` and fix issues +- Request approval: Do not start implementation until proposal is approved + +## Three-Stage Workflow + +### Stage 1: Creating Changes +Create proposal when you need to: +- Add features or functionality +- Make breaking changes (API, schema) +- Change architecture or patterns +- Optimize performance (changes behavior) +- Update security patterns + +Triggers (examples): +- "Help me create a change proposal" +- "Help me plan a change" +- "Help me create a proposal" +- "I want to create a spec proposal" +- "I want to create a spec" + +Loose matching guidance: +- Contains one of: `proposal`, `change`, `spec` +- With one of: `create`, `plan`, `make`, `start`, `help` + +Skip proposal for: +- Bug fixes (restore intended behavior) +- Typos, formatting, comments +- Dependency updates (non-breaking) +- Configuration changes +- Tests for existing behavior + +**Workflow** +1. Review `openspec/project.md`, `openspec list`, and `openspec list --specs` to understand current context. +2. Choose a unique verb-led `change-id` and scaffold `proposal.md`, `tasks.md`, optional `design.md`, and spec deltas under `openspec/changes//`. +3. Draft spec deltas using `## ADDED|MODIFIED|REMOVED Requirements` with at least one `#### Scenario:` per requirement. +4. Run `openspec validate --strict` and resolve any issues before sharing the proposal. + +### Stage 2: Implementing Changes +Track these steps as TODOs and complete them one by one. +1. **Read proposal.md** - Understand what's being built +2. **Read design.md** (if exists) - Review technical decisions +3. **Read tasks.md** - Get implementation checklist +4. **Implement tasks sequentially** - Complete in order +5. **Confirm completion** - Ensure every item in `tasks.md` is finished before updating statuses +6. **Update checklist** - After all work is done, set every task to `- [x]` so the list reflects reality +7. **Approval gate** - Do not start implementation until the proposal is reviewed and approved + +### Stage 3: Archiving Changes +After deployment, create separate PR to: +- Move `changes/[name]/` → `changes/archive/YYYY-MM-DD-[name]/` +- Update `specs/` if capabilities changed +- Use `openspec archive --skip-specs --yes` for tooling-only changes (always pass the change ID explicitly) +- Run `openspec validate --strict` to confirm the archived change passes checks + +## Before Any Task + +**Context Checklist:** +- [ ] Read relevant specs in `specs/[capability]/spec.md` +- [ ] Check pending changes in `changes/` for conflicts +- [ ] Read `openspec/project.md` for conventions +- [ ] Run `openspec list` to see active changes +- [ ] Run `openspec list --specs` to see existing capabilities + +**Before Creating Specs:** +- Always check if capability already exists +- Prefer modifying existing specs over creating duplicates +- Use `openspec show [spec]` to review current state +- If request is ambiguous, ask 1–2 clarifying questions before scaffolding + +### Search Guidance +- Enumerate specs: `openspec spec list --long` (or `--json` for scripts) +- Enumerate changes: `openspec list` (or `openspec change list --json` - deprecated but available) +- Show details: + - Spec: `openspec show --type spec` (use `--json` for filters) + - Change: `openspec show --json --deltas-only` +- Full-text search (use ripgrep): `rg -n "Requirement:|Scenario:" openspec/specs` + +## Quick Start + +### CLI Commands + +```bash +# Essential commands +openspec list # List active changes +openspec list --specs # List specifications +openspec show [item] # Display change or spec +openspec validate [item] # Validate changes or specs +openspec archive [--yes|-y] # Archive after deployment (add --yes for non-interactive runs) + +# Project management +openspec init [path] # Initialize OpenSpec +openspec update [path] # Update instruction files + +# Interactive mode +openspec show # Prompts for selection +openspec validate # Bulk validation mode + +# Debugging +openspec show [change] --json --deltas-only +openspec validate [change] --strict +``` + +### Command Flags + +- `--json` - Machine-readable output +- `--type change|spec` - Disambiguate items +- `--strict` - Comprehensive validation +- `--no-interactive` - Disable prompts +- `--skip-specs` - Archive without spec updates +- `--yes`/`-y` - Skip confirmation prompts (non-interactive archive) + +## Directory Structure + +``` +openspec/ +├── project.md # Project conventions +├── specs/ # Current truth - what IS built +│ └── [capability]/ # Single focused capability +│ ├── spec.md # Requirements and scenarios +│ └── design.md # Technical patterns +├── changes/ # Proposals - what SHOULD change +│ ├── [change-name]/ +│ │ ├── proposal.md # Why, what, impact +│ │ ├── tasks.md # Implementation checklist +│ │ ├── design.md # Technical decisions (optional; see criteria) +│ │ └── specs/ # Delta changes +│ │ └── [capability]/ +│ │ └── spec.md # ADDED/MODIFIED/REMOVED +│ └── archive/ # Completed changes +``` + +## Creating Change Proposals + +### Decision Tree + +``` +New request? +├─ Bug fix restoring spec behavior? → Fix directly +├─ Typo/format/comment? → Fix directly +├─ New feature/capability? → Create proposal +├─ Breaking change? → Create proposal +├─ Architecture change? → Create proposal +└─ Unclear? → Create proposal (safer) +``` + +### Proposal Structure + +1. **Create directory:** `changes/[change-id]/` (kebab-case, verb-led, unique) + +2. **Write proposal.md:** +```markdown +## Why +[1-2 sentences on problem/opportunity] + +## What Changes +- [Bullet list of changes] +- [Mark breaking changes with **BREAKING**] + +## Impact +- Affected specs: [list capabilities] +- Affected code: [key files/systems] +``` + +3. **Create spec deltas:** `specs/[capability]/spec.md` +```markdown +## ADDED Requirements +### Requirement: New Feature +The system SHALL provide... + +#### Scenario: Success case +- **WHEN** user performs action +- **THEN** expected result + +## MODIFIED Requirements +### Requirement: Existing Feature +[Complete modified requirement] + +## REMOVED Requirements +### Requirement: Old Feature +**Reason**: [Why removing] +**Migration**: [How to handle] +``` +If multiple capabilities are affected, create multiple delta files under `changes/[change-id]/specs//spec.md`—one per capability. + +4. **Create tasks.md:** +```markdown +## 1. Implementation +- [ ] 1.1 Create database schema +- [ ] 1.2 Implement API endpoint +- [ ] 1.3 Add frontend component +- [ ] 1.4 Write tests +``` + +5. **Create design.md when needed:** +Create `design.md` if any of the following apply; otherwise omit it: +- Cross-cutting change (multiple services/modules) or a new architectural pattern +- New external dependency or significant data model changes +- Security, performance, or migration complexity +- Ambiguity that benefits from technical decisions before coding + +Minimal `design.md` skeleton: +```markdown +## Context +[Background, constraints, stakeholders] + +## Goals / Non-Goals +- Goals: [...] +- Non-Goals: [...] + +## Decisions +- Decision: [What and why] +- Alternatives considered: [Options + rationale] + +## Risks / Trade-offs +- [Risk] → Mitigation + +## Migration Plan +[Steps, rollback] + +## Open Questions +- [...] +``` + +## Spec File Format + +### Critical: Scenario Formatting + +**CORRECT** (use #### headers): +```markdown +#### Scenario: User login success +- **WHEN** valid credentials provided +- **THEN** return JWT token +``` + +**WRONG** (don't use bullets or bold): +```markdown +- **Scenario: User login** ❌ +**Scenario**: User login ❌ +### Scenario: User login ❌ +``` + +Every requirement MUST have at least one scenario. + +### Requirement Wording +- Use SHALL/MUST for normative requirements (avoid should/may unless intentionally non-normative) + +### Delta Operations + +- `## ADDED Requirements` - New capabilities +- `## MODIFIED Requirements` - Changed behavior +- `## REMOVED Requirements` - Deprecated features +- `## RENAMED Requirements` - Name changes + +Headers matched with `trim(header)` - whitespace ignored. + +#### When to use ADDED vs MODIFIED +- ADDED: Introduces a new capability or sub-capability that can stand alone as a requirement. Prefer ADDED when the change is orthogonal (e.g., adding "Slash Command Configuration") rather than altering the semantics of an existing requirement. +- MODIFIED: Changes the behavior, scope, or acceptance criteria of an existing requirement. Always paste the full, updated requirement content (header + all scenarios). The archiver will replace the entire requirement with what you provide here; partial deltas will drop previous details. +- RENAMED: Use when only the name changes. If you also change behavior, use RENAMED (name) plus MODIFIED (content) referencing the new name. + +Common pitfall: Using MODIFIED to add a new concern without including the previous text. This causes loss of detail at archive time. If you aren’t explicitly changing the existing requirement, add a new requirement under ADDED instead. + +Authoring a MODIFIED requirement correctly: +1) Locate the existing requirement in `openspec/specs//spec.md`. +2) Copy the entire requirement block (from `### Requirement: ...` through its scenarios). +3) Paste it under `## MODIFIED Requirements` and edit to reflect the new behavior. +4) Ensure the header text matches exactly (whitespace-insensitive) and keep at least one `#### Scenario:`. + +Example for RENAMED: +```markdown +## RENAMED Requirements +- FROM: `### Requirement: Login` +- TO: `### Requirement: User Authentication` +``` + +## Troubleshooting + +### Common Errors + +**"Change must have at least one delta"** +- Check `changes/[name]/specs/` exists with .md files +- Verify files have operation prefixes (## ADDED Requirements) + +**"Requirement must have at least one scenario"** +- Check scenarios use `#### Scenario:` format (4 hashtags) +- Don't use bullet points or bold for scenario headers + +**Silent scenario parsing failures** +- Exact format required: `#### Scenario: Name` +- Debug with: `openspec show [change] --json --deltas-only` + +### Validation Tips + +```bash +# Always use strict mode for comprehensive checks +openspec validate [change] --strict + +# Debug delta parsing +openspec show [change] --json | jq '.deltas' + +# Check specific requirement +openspec show [spec] --json -r 1 +``` + +## Happy Path Script + +```bash +# 1) Explore current state +openspec spec list --long +openspec list +# Optional full-text search: +# rg -n "Requirement:|Scenario:" openspec/specs +# rg -n "^#|Requirement:" openspec/changes + +# 2) Choose change id and scaffold +CHANGE=add-two-factor-auth +mkdir -p openspec/changes/$CHANGE/{specs/auth} +printf "## Why\n...\n\n## What Changes\n- ...\n\n## Impact\n- ...\n" > openspec/changes/$CHANGE/proposal.md +printf "## 1. Implementation\n- [ ] 1.1 ...\n" > openspec/changes/$CHANGE/tasks.md + +# 3) Add deltas (example) +cat > openspec/changes/$CHANGE/specs/auth/spec.md << 'EOF' +## ADDED Requirements +### Requirement: Two-Factor Authentication +Users MUST provide a second factor during login. + +#### Scenario: OTP required +- **WHEN** valid credentials are provided +- **THEN** an OTP challenge is required +EOF + +# 4) Validate +openspec validate $CHANGE --strict +``` + +## Multi-Capability Example + +``` +openspec/changes/add-2fa-notify/ +├── proposal.md +├── tasks.md +└── specs/ + ├── auth/ + │ └── spec.md # ADDED: Two-Factor Authentication + └── notifications/ + └── spec.md # ADDED: OTP email notification +``` + +auth/spec.md +```markdown +## ADDED Requirements +### Requirement: Two-Factor Authentication +... +``` + +notifications/spec.md +```markdown +## ADDED Requirements +### Requirement: OTP Email Notification +... +``` + +## Best Practices + +### Simplicity First +- Default to <100 lines of new code +- Single-file implementations until proven insufficient +- Avoid frameworks without clear justification +- Choose boring, proven patterns + +### Complexity Triggers +Only add complexity with: +- Performance data showing current solution too slow +- Concrete scale requirements (>1000 users, >100MB data) +- Multiple proven use cases requiring abstraction + +### Clear References +- Use `file.ts:42` format for code locations +- Reference specs as `specs/auth/spec.md` +- Link related changes and PRs + +### Capability Naming +- Use verb-noun: `user-auth`, `payment-capture` +- Single purpose per capability +- 10-minute understandability rule +- Split if description needs "AND" + +### Change ID Naming +- Use kebab-case, short and descriptive: `add-two-factor-auth` +- Prefer verb-led prefixes: `add-`, `update-`, `remove-`, `refactor-` +- Ensure uniqueness; if taken, append `-2`, `-3`, etc. + +## Tool Selection Guide + +| Task | Tool | Why | +|------|------|-----| +| Find files by pattern | Glob | Fast pattern matching | +| Search code content | Grep | Optimized regex search | +| Read specific files | Read | Direct file access | +| Explore unknown scope | Task | Multi-step investigation | + +## Error Recovery + +### Change Conflicts +1. Run `openspec list` to see active changes +2. Check for overlapping specs +3. Coordinate with change owners +4. Consider combining proposals + +### Validation Failures +1. Run with `--strict` flag +2. Check JSON output for details +3. Verify spec file format +4. Ensure scenarios properly formatted + +### Missing Context +1. Read project.md first +2. Check related specs +3. Review recent archives +4. Ask for clarification + +## Quick Reference + +### Stage Indicators +- `changes/` - Proposed, not yet built +- `specs/` - Built and deployed +- `archive/` - Completed changes + +### File Purposes +- `proposal.md` - Why and what +- `tasks.md` - Implementation steps +- `design.md` - Technical decisions +- `spec.md` - Requirements and behavior + +### CLI Essentials +```bash +openspec list # What's in progress? +openspec show [item] # View details +openspec validate --strict # Is it correct? +openspec archive [--yes|-y] # Mark complete (add --yes for automation) +``` + +Remember: Specs are truth. Changes are proposals. Keep them in sync. diff --git a/openspec/changes/simplify-domain-ssl-config/proposal.md b/openspec/changes/simplify-domain-ssl-config/proposal.md new file mode 100644 index 00000000..397eec44 --- /dev/null +++ b/openspec/changes/simplify-domain-ssl-config/proposal.md @@ -0,0 +1,170 @@ +# Simplify Domain and SSL Configuration + +## Why + +**Security vulnerability** identified: A security researcher found our test subdomain `test.lablink.sleap.ai` pointing to a decommissioned AWS IP (54.214.215.124), creating a subdomain takeover vulnerability (CVSS 8.2 - High Severity). The root cause is **dangling DNS records** after infrastructure destruction, not the use of sub-subdomains themselves. + +**Current configuration problems:** +1. **No DNS cleanup on destroy**: Terraform-managed DNS records aren't guaranteed to be deleted when infrastructure is destroyed +2. **Invalid combinations permitted**: System allows SSL without DNS, leading to runtime failures like Caddy configured with `http://N/A` +3. **Configuration redundancy**: Multiple overlapping fields (`dns.app_name`, `dns.pattern`, `dns.custom_subdomain`) create confusion and bugs +4. **Subdomain concatenation bugs**: Empty `custom_subdomain` creates malformed FQDNs like `.lablink.sleap.ai` +5. **Dual source of truth**: Allocator constructs URLs from config while Terraform computes FQDNs, causing client connection failures (#212) +6. **No pre-deployment validation**: Misconfigured setups only fail at runtime + +## What Changes + +**BREAKING CHANGES:** + +### 1. Simplified DNS Configuration Schema +- **REMOVE** redundant fields: `dns.app_name`, `dns.pattern`, `dns.custom_subdomain`, `dns.create_zone` +- **CHANGE** `dns.domain` to accept full domain (e.g., `lablink.sleap.ai`, `test.lablink.sleap.ai`) + - **Note**: Sub-subdomains ARE allowed (common pattern for environments) + - Validation ensures domain is non-empty when DNS enabled +- **CLARIFY** `dns.zone_id` as optional (auto-lookup if not provided) +- **KEEP** `dns.terraform_managed` flag (true = Terraform manages DNS, false = external DNS like CloudFlare) + +### 2. DNS Lifecycle Management (Terraform) +- **ADD** Terraform lifecycle hooks to ensure DNS cleanup on destroy (when `terraform_managed=true`) +- **ADD** `prevent_destroy` option for production DNS records (configurable) +- **ADD** post-destroy verification (optional GitHub Action) to check for dangling DNS records + +### 3. SSL Configuration Enhancement +- **CHANGE** `ssl.provider` enum values: + - `"none"` - HTTP only, no SSL + - `"letsencrypt"` - Free automated SSL via Caddy + Let's Encrypt + - `"cloudflare"` - CloudFlare proxy handles SSL (certificate in CloudFlare) + - `"acm"` - AWS Certificate Manager (purchased or validated cert, requires ALB) +- **REMOVE** `ssl.staging` (use `dns.enabled=false` + `ssl.provider="none"` for testing instead) +- **ADD** `ssl.certificate_arn` (optional, required when `provider="acm"`) +- **ENFORCE** validation: `ssl.provider != "none"` requires `dns.enabled=true` + +### 4. FQDN Computed by Terraform +- **ADD** `ALLOCATOR_FQDN` environment variable computed by Terraform and passed to allocator container +- **CHANGE** Terraform to compute FQDN from config (based on `dns.domain` and `ssl.provider`) +- **CHANGE** Allocator to use `ALLOCATOR_FQDN` environment variable as authoritative source +- **FIX** Issue #212: Eliminates dual source of truth between Terraform-computed FQDN and allocator-computed URL + +### 5. Enhanced Configuration Validation +- **ENHANCE** existing `lablink-validate-config` CLI with new validation rules: + - SSL (non-"none") requires DNS enabled + - DNS enabled requires non-empty `domain` field + - Domain cannot start/end with dots (catches `.lablink.sleap.ai` bug) + - CloudFlare SSL requires `terraform_managed=false` (external DNS) + - Let's Encrypt requires valid email + - ACM requires `certificate_arn` when `provider="acm"` + - `terraform_managed=true` implies Route53 (document external DNS uses `terraform_managed=false`) +- **ADD** CI validation workflow to run on config changes (lablink-template repo) + +### 6. Five Canonical Use Cases +Document and validate five explicit deployment patterns: + +**Use Case 1: IP-only Testing (Local/Dev)** +```yaml +dns: + enabled: false +ssl: + provider: "none" +``` + +**Use Case 2: CloudFlare DNS + SSL (External DNS)** +```yaml +dns: + enabled: true + terraform_managed: false # Manual CloudFlare DNS + domain: "lablink.sleap.ai" # Or "test.lablink.sleap.ai" for environments +ssl: + provider: "cloudflare" + email: "" # Not used with CloudFlare +``` + +**Use Case 3: Route53 + Let's Encrypt (Terraform-managed DNS)** +```yaml +dns: + enabled: true + terraform_managed: true + domain: "lablink.sleap.ai" # Or "test.lablink.sleap.ai" + zone_id: "Z1234567890ABC" # Optional, auto-lookup if omitted +ssl: + provider: "letsencrypt" + email: "admin@sleap.ai" +``` + +**Use Case 4: Route53 + AWS Certificate Manager** +```yaml +dns: + enabled: true + terraform_managed: true + domain: "lablink.sleap.ai" + zone_id: "Z1234567890ABC" +ssl: + provider: "acm" + certificate_arn: "arn:aws:acm:us-west-2:123456789012:certificate/abc-123" + email: "" # Not used with ACM +``` + +**Use Case 5: Route53 + Let's Encrypt (Manual DNS)** +```yaml +dns: + enabled: true + terraform_managed: false # Manual Route53 DNS records + domain: "lablink.sleap.ai" +ssl: + provider: "letsencrypt" + email: "admin@sleap.ai" +``` + +## Impact + +**Security:** +- **Primary fix**: Ensures DNS records are cleaned up when infrastructure is destroyed (prevents dangling records) +- Enforces valid SSL/DNS combinations through pre-deployment validation +- Supports enterprise SSL via AWS Certificate Manager +- Allows sub-subdomains (common pattern) while preventing malformed domains + +**Affected Repositories:** +- `talmolab/lablink`: Configuration schema, enhanced validation CLI, allocator URL logic +- `talmolab/lablink-template`: Terraform DNS lifecycle, SSL/ALB logic, example configs, CI validation + +**Affected Code (lablink repo):** +- `packages/allocator/src/lablink_allocator_service/conf/structured_config.py` - Config schema (BREAKING) +- `packages/allocator/src/lablink_allocator_service/conf/config.yaml` - Example config update +- `packages/allocator/src/lablink_allocator_service/validate_config.py` - Enhanced validation rules +- `packages/allocator/src/lablink_allocator_service/main.py` - FQDN environment variable support +- `packages/allocator/src/lablink_allocator_service/get_config.py` - Validation integration +- Tests for new validation logic + +**Affected Code (lablink-template repo):** +- Terraform DNS configuration (remove pattern logic, use single domain, add lifecycle hooks) +- Terraform SSL/Caddy configuration (pass FQDN to container) +- Terraform ALB/ACM configuration (conditional creation when ssl.provider="acm") +- Example config files for all environments (dev, test, ci-test, prod) +- CI validation workflow (add lablink-validate-config step) +- Optional post-destroy DNS verification workflow + +**Migration Path:** +- Existing deployments must update configs before next allocator release +- Migration guide provided with examples for each use case +- Validation errors provide clear migration instructions +- Sub-subdomains continue to work (no breaking change in functionality, only config schema) + +**Related Issues:** +- Fixes: talmolab/lablink-template#7 (Simplify DNS and SSL) +- Fixes: talmolab/lablink#200 (Subdomain bug persists - empty subdomain creates `.lablink.sleap.ai`) +- Fixes: talmolab/lablink#212 (FQDN environment variable) +- Implements: talmolab/lablink-template#12 (Config validation in CI) +- Addresses security disclosure: Subdomain takeover via dangling DNS records + +**Release Plan:** +1. Merge changes to lablink package (this repo) +2. Update tests for new validation rules +3. Release new allocator package version to PyPI (breaking change, bump minor version) +4. Update lablink-template with new Terraform logic (DNS lifecycle, ACM/ALB support) +5. Document migration guide in both repos with all five use cases +6. Notify users of breaking changes via GitHub release notes + +**Notes on ACM Support:** +- ACM certificates require Application Load Balancer (ALB) or CloudFront +- Terraform will conditionally create ALB when `ssl.provider="acm"` +- ALB adds cost (~$16/month + data transfer) vs Caddy (free) +- Let's Encrypt remains the default recommendation for cost-effective SSL \ No newline at end of file diff --git a/openspec/changes/simplify-domain-ssl-config/specs/configuration/spec.md b/openspec/changes/simplify-domain-ssl-config/specs/configuration/spec.md new file mode 100644 index 00000000..7d71c7cf --- /dev/null +++ b/openspec/changes/simplify-domain-ssl-config/specs/configuration/spec.md @@ -0,0 +1,252 @@ +# Configuration Management Specification + +## ADDED Requirements + +### Requirement: Domain Format Validation +The system SHALL validate domain names to prevent malformed domains while allowing sub-subdomains for environment separation. + +#### Scenario: Valid single-level subdomain accepted +- **GIVEN** a config with domain "lablink.sleap.ai" +- **WHEN** validation runs +- **THEN** validation passes + +#### Scenario: Valid sub-subdomain accepted +- **GIVEN** a config with domain "test.lablink.sleap.ai" +- **WHEN** validation runs +- **THEN** validation passes + +#### Scenario: Domain with leading dot rejected +- **GIVEN** a config with domain ".lablink.sleap.ai" +- **WHEN** validation runs +- **THEN** validation fails with error "Domain cannot start with a dot" + +#### Scenario: Domain with trailing dot rejected +- **GIVEN** a config with domain "lablink.sleap.ai." +- **WHEN** validation runs +- **THEN** validation fails with error "Domain cannot end with a dot" + +#### Scenario: Base domain without subdomain accepted +- **GIVEN** a config with domain "sleap.ai" +- **WHEN** validation runs +- **THEN** validation passes + +### Requirement: SSL and DNS Dependency Validation +The system SHALL enforce that SSL (when enabled) requires DNS to be enabled, preventing invalid runtime configurations. + +#### Scenario: SSL with DNS enabled passes validation +- **GIVEN** a config with ssl.provider="letsencrypt" and dns.enabled=true +- **WHEN** validation runs +- **THEN** validation passes + +#### Scenario: SSL without DNS fails validation +- **GIVEN** a config with ssl.provider="letsencrypt" and dns.enabled=false +- **WHEN** validation runs +- **THEN** validation fails with error "SSL requires DNS to be enabled" + +#### Scenario: No SSL with no DNS passes validation +- **GIVEN** a config with ssl.provider="none" and dns.enabled=false +- **WHEN** validation runs +- **THEN** validation passes + +### Requirement: Provider-Specific Configuration Validation +The system SHALL validate that required configuration fields are present for each SSL provider. + +#### Scenario: Let's Encrypt requires email +- **GIVEN** a config with ssl.provider="letsencrypt" and ssl.email="" +- **WHEN** validation runs +- **THEN** validation fails with error "Let's Encrypt requires email address" + +#### Scenario: ACM requires certificate ARN +- **GIVEN** a config with ssl.provider="acm" and ssl.certificate_arn="" +- **WHEN** validation runs +- **THEN** validation fails with error "ACM provider requires certificate_arn" + +#### Scenario: CloudFlare does not require email or ARN +- **GIVEN** a config with ssl.provider="cloudflare" and ssl.email="" and ssl.certificate_arn="" +- **WHEN** validation runs +- **THEN** validation passes + +#### Scenario: CloudFlare requires external DNS management +- **GIVEN** a config with ssl.provider="cloudflare" and dns.terraform_managed=true +- **WHEN** validation runs +- **THEN** validation fails with error "CloudFlare SSL requires terraform_managed=false (external DNS)" + +### Requirement: DNS Domain Required When Enabled +The system SHALL require a non-empty domain field when DNS is enabled. + +#### Scenario: DNS enabled with domain passes +- **GIVEN** a config with dns.enabled=true and dns.domain="lablink.sleap.ai" +- **WHEN** validation runs +- **THEN** validation passes + +#### Scenario: DNS enabled without domain fails +- **GIVEN** a config with dns.enabled=true and dns.domain="" +- **WHEN** validation runs +- **THEN** validation fails with error "DNS enabled requires non-empty domain field" + +#### Scenario: DNS disabled with empty domain passes +- **GIVEN** a config with dns.enabled=false and dns.domain="" +- **WHEN** validation runs +- **THEN** validation passes + +### Requirement: FQDN Computed by Infrastructure +The system SHALL compute the allocator FQDN during infrastructure deployment and provide it to the allocator service as the authoritative URL source. + +#### Scenario: FQDN computed from DNS configuration +- **GIVEN** config with dns.enabled=true, dns.domain="lablink.sleap.ai", ssl.provider="letsencrypt" +- **WHEN** Terraform deploys infrastructure +- **THEN** computes ALLOCATOR_FQDN as "https://lablink.sleap.ai" +- **AND** passes ALLOCATOR_FQDN to allocator container as environment variable + +#### Scenario: HTTP FQDN when SSL disabled +- **GIVEN** config with dns.enabled=true, dns.domain="lablink.sleap.ai", ssl.provider="none" +- **WHEN** Terraform deploys infrastructure +- **THEN** computes ALLOCATOR_FQDN as "http://lablink.sleap.ai" + +#### Scenario: IP-only mode when DNS disabled +- **GIVEN** config with dns.enabled=false +- **WHEN** Terraform deploys infrastructure +- **THEN** computes ALLOCATOR_FQDN using allocator public IP +- **AND** uses http protocol + +#### Scenario: Allocator uses FQDN from environment +- **GIVEN** ALLOCATOR_FQDN environment variable is set to "https://lablink.sleap.ai" +- **WHEN** allocator starts +- **THEN** allocator uses "https://lablink.sleap.ai" as its URL +- **AND** does not recompute URL from config +- **AND** logs indicate "Using ALLOCATOR_FQDN from environment: https://lablink.sleap.ai" + +### Requirement: DNS Lifecycle Management +The system SHALL ensure DNS records are cleaned up when infrastructure is destroyed to prevent subdomain takeover vulnerabilities. + +#### Scenario: Terraform-managed DNS records deleted on destroy +- **GIVEN** dns.terraform_managed=true +- **AND** infrastructure is deployed with DNS records created +- **WHEN** terraform destroy is executed +- **THEN** DNS records are deleted before other resources +- **AND** no dangling DNS records remain + +#### Scenario: External DNS records not managed by Terraform +- **GIVEN** dns.terraform_managed=false +- **AND** infrastructure is deployed +- **WHEN** terraform destroy is executed +- **THEN** Terraform does not attempt to delete DNS records +- **AND** user is responsible for DNS cleanup + +#### Scenario: Optional prevent_destroy protection for production +- **GIVEN** dns.terraform_managed=true +- **AND** prevent_destroy lifecycle policy is configured +- **WHEN** terraform destroy is attempted on production environment +- **THEN** Terraform blocks DNS record destruction +- **AND** provides clear error message requiring manual intervention + +## MODIFIED Requirements + +### Requirement: DNS Configuration Schema +The system SHALL support DNS configuration through a simplified schema that prevents common misconfiguration patterns while supporting flexible domain naming including sub-subdomains. + +**Configuration Fields:** +- `enabled` (bool): Whether DNS is enabled +- `terraform_managed` (bool): Whether Terraform manages DNS records (true = Route53, false = external DNS like CloudFlare) +- `domain` (str): Full domain name (e.g., "lablink.sleap.ai", "test.lablink.sleap.ai") +- `zone_id` (str, optional): Route53 hosted zone ID for explicit zone targeting (auto-lookup if omitted) + +**Removed Fields (BREAKING):** +- `app_name`: Replaced by complete domain in `domain` field +- `pattern`: Removed, use explicit domain instead +- `custom_subdomain`: Replaced by complete domain in `domain` field +- `create_zone`: Removed (zones should be pre-created and managed separately) + +#### Scenario: Simplified DNS configuration with full domain +- **GIVEN** a config with dns.enabled=true and dns.domain="lablink.sleap.ai" +- **WHEN** DNS records are created +- **THEN** A record points to allocator IP with name "lablink.sleap.ai" + +#### Scenario: Sub-subdomain for environment separation +- **GIVEN** a config with dns.enabled=true and dns.domain="test.lablink.sleap.ai" +- **WHEN** DNS records are created +- **THEN** A record points to allocator IP with name "test.lablink.sleap.ai" + +#### Scenario: Optional zone_id for explicit zone targeting +- **GIVEN** a config with dns.zone_id="Z1234567890ABC" +- **AND** dns.terraform_managed=true +- **WHEN** Terraform looks up Route53 zone +- **THEN** uses provided zone_id without auto-lookup + +#### Scenario: Zone auto-lookup when zone_id not provided +- **GIVEN** a config with dns.domain="lablink.sleap.ai" and dns.zone_id="" +- **AND** dns.terraform_managed=true +- **WHEN** Terraform looks up Route53 zone +- **THEN** queries Route53 for hosted zone matching "sleap.ai" + +#### Scenario: External DNS provider configuration +- **GIVEN** dns.terraform_managed=false and dns.domain="lablink.sleap.ai" +- **WHEN** Terraform deploys infrastructure +- **THEN** Terraform outputs allocator IP for manual DNS configuration +- **AND** does not create or manage DNS records + +### Requirement: SSL Configuration Schema +The system SHALL support multiple SSL providers through a unified configuration schema with proper validation. + +**Configuration Fields:** +- `provider` (enum): SSL certificate provider + - `"none"`: HTTP only, no SSL + - `"letsencrypt"`: Automated SSL via Caddy + Let's Encrypt + - `"cloudflare"`: CloudFlare proxy handles SSL (requires terraform_managed=false) + - `"acm"`: AWS Certificate Manager (requires ALB) +- `email` (str): Email for Let's Encrypt notifications (required when provider="letsencrypt") +- `certificate_arn` (str, optional): ACM certificate ARN (required when provider="acm") + +**Removed Fields (BREAKING):** +- `staging`: Removed (use dns.enabled=false + ssl.provider="none" for testing) + +#### Scenario: Let's Encrypt SSL with email +- **GIVEN** ssl.provider="letsencrypt" and ssl.email="admin@sleap.ai" +- **AND** dns.enabled=true +- **WHEN** allocator starts +- **THEN** Caddy requests Let's Encrypt certificate +- **AND** sends notifications to admin@sleap.ai + +#### Scenario: CloudFlare SSL without email +- **GIVEN** ssl.provider="cloudflare" and ssl.email="" +- **AND** dns.terraform_managed=false +- **WHEN** allocator starts +- **THEN** skips Caddy SSL setup (CloudFlare handles SSL) +- **AND** configures HTTP backend for CloudFlare proxy + +#### Scenario: ACM SSL with certificate ARN +- **GIVEN** ssl.provider="acm" and ssl.certificate_arn="arn:aws:acm:us-west-2:123:certificate/abc" +- **AND** dns.enabled=true +- **WHEN** Terraform deploys infrastructure +- **THEN** creates ALB with ACM certificate attached +- **AND** ALB terminates SSL, forwards HTTP to allocator + +#### Scenario: No SSL for testing +- **GIVEN** ssl.provider="none" and dns.enabled=false +- **WHEN** allocator starts +- **THEN** serves HTTP only on port 80 +- **AND** no Caddy or ALB created + +## REMOVED Requirements + +### Requirement: DNS Pattern-Based Subdomain Construction +**Reason**: This pattern-based approach led to confusion, bugs, and allowed malformed domains like `.lablink.sleap.ai`. The concatenation of app_name + pattern + custom_subdomain created multiple ways to express the same configuration, making it difficult to understand and debug. + +**Migration**: Use explicit `dns.domain` field with complete domain name. + +**Old Pattern Mapping:** +- `pattern="auto"` + `app_name="lablink"` + environment → Use `domain="lablink.sleap.ai"` (prod) or `domain="test.lablink.sleap.ai"` (test) +- `pattern="app-only"` + `app_name="lablink"` → Use `domain="lablink.sleap.ai"` +- `pattern="custom"` + `custom_subdomain="dev"` + `domain="sleap.ai"` → Use `domain="dev.sleap.ai"` + +### Requirement: SSL Staging Mode +**Reason**: The `ssl.staging` boolean added complexity without clear benefit. For unlimited testing, users should disable DNS and SSL entirely (IP-only mode), which is clearer than a "staging" boolean that modifies SSL behavior in non-obvious ways. + +**Migration**: +- Old `ssl.staging=true` (HTTP-only for testing) → Use `dns.enabled=false` + `ssl.provider="none"` +- Old `ssl.staging=false` (production HTTPS) → Use `ssl.provider="letsencrypt"` with dns.enabled=true + +### Requirement: Automatic DNS Zone Creation +**Reason**: Creating Route53 hosted zones during infrastructure deployment mixes concerns and can lead to accidental zone deletion. DNS zones should be managed separately as long-lived infrastructure. + +**Migration**: Pre-create Route53 hosted zones manually or via separate Terraform module, then reference zone_id in config or rely on auto-lookup by domain. \ No newline at end of file diff --git a/openspec/changes/simplify-domain-ssl-config/tasks.md b/openspec/changes/simplify-domain-ssl-config/tasks.md new file mode 100644 index 00000000..5ae3bdee --- /dev/null +++ b/openspec/changes/simplify-domain-ssl-config/tasks.md @@ -0,0 +1,130 @@ +# Implementation Tasks + +## 1. Configuration Schema Updates (lablink repo) +- [x] 1.1 Update `DNSConfig` dataclass in `structured_config.py` + - [x] Remove: `app_name`, `pattern`, `custom_subdomain`, `create_zone` + - [x] Keep: `enabled`, `terraform_managed`, `domain`, `zone_id` + - [x] Update docstrings with new behavior +- [x] 1.2 Update `SSLConfig` dataclass in `structured_config.py` + - [x] Change `provider` enum: `"none"`, `"letsencrypt"`, `"cloudflare"`, `"acm"` + - [x] Remove: `staging` field + - [x] Add: `certificate_arn` field (optional, for ACM) + - [x] Update docstrings +- [x] 1.3 Update example `config.yaml` with new schema + - [x] Use single `domain` field + - [x] Remove deprecated fields + - [x] Add comments explaining new structure + +## 2. Validation Logic Enhancement (lablink repo) +- [x] 2.1 Create domain validator function + - [x] Validate domain format (allows sub-subdomains) + - [x] Reject domains starting/ending with dots + - [x] Allow multi-level subdomains (e.g., `test.lablink.sleap.ai`) +- [x] 2.2 Enhance `validate_config.py` with new rules + - [x] SSL non-"none" requires DNS enabled + - [x] DNS enabled requires non-empty domain + - [x] CloudFlare SSL requires terraform_managed=false + - [x] Let's Encrypt requires valid email + - [x] ACM requires certificate_arn when provider="acm" + - [x] Domain format validation (leading/trailing dots) +- [x] 2.3 Add validation hook in `get_config.py` + - [x] Call validators after config load + - [x] Provide clear error messages on validation failure + +## 3. Allocator FQDN Support (lablink repo) +- [x] 3.1 Update `config_helpers.py` to read `ALLOCATOR_FQDN` environment variable + - [x] Priority: ALLOCATOR_FQDN > DNS config > IP fallback + - [x] Log which URL source is being used +- [x] 3.2 Update URL construction logic + - [x] Remove pattern-based concatenation + - [x] Use simple domain from config or FQDN from env +- [x] 3.3 Update client connection logic + - [x] Ensure clients can reach allocator via FQDN + - [x] Update logging for debugging connection issues + +## 4. Testing (lablink repo) +- [x] 4.1 Unit tests for domain validator + - [x] Valid single and multi-level subdomains pass + - [x] Sub-subdomains allowed (updated from proposal) + - [x] Edge cases (no dots, trailing dots, leading dots) +- [x] 4.2 Unit tests for enhanced validation + - [x] Test all five canonical use cases validate correctly + - [x] Test invalid combinations rejected + - [x] Test error messages are clear +- [x] 4.3 Integration tests for FQDN environment variable + - [x] Test FQDN takes precedence over config + - [x] Test fallback to config domain + - [x] Test fallback to IP-only mode +- [x] 4.4 Update existing tests for schema changes + - [x] Update test configs to new schema + - [x] Remove tests for deprecated fields + +## 5. Documentation (lablink repo) +- [ ] 5.1 Update CLAUDE.md with new configuration structure (future work) +- [ ] 5.2 Create migration guide document (future work) + - [ ] Old schema → new schema mapping + - [ ] Examples for each use case + - [ ] Common migration scenarios +- [ ] 5.3 Update docs/ with new configuration guide (future work) + - [ ] Document all five use cases + - [ ] Add ACM setup instructions + - [ ] Add troubleshooting section + +## 6. Package Release (lablink repo) +- [ ] 6.1 Bump allocator package version (breaking change) (future work) + - [ ] Update `pyproject.toml` version + - [ ] Update CHANGELOG.md +- [ ] 6.2 Test package build locally (future work) + - [ ] Build with `uv build` + - [ ] Test installation from built package +- [ ] 6.3 Create git tag and push (future work) +- [ ] 6.4 Publish to PyPI via workflow (future work) +- [ ] 6.5 Trigger Docker image build with new version (future work) + +## 7. Infrastructure Updates (lablink-template repo) +- [ ] 7.1 Update Terraform DNS logic + - [ ] Remove pattern-based subdomain construction + - [ ] Use single `dns.domain` value directly + - [ ] Update Route53 record creation +- [ ] 7.2 Update Terraform SSL/Caddy logic + - [ ] Handle new ssl.provider values + - [ ] Conditional Caddy installation based on provider + - [ ] Pass ALLOCATOR_FQDN as environment variable +- [ ] 7.3 Add Terraform ACM/ALB support + - [ ] Conditional ALB creation when ssl.provider="acm" + - [ ] ACM certificate attachment to ALB + - [ ] ALB target group pointing to allocator EC2 + - [ ] Security group updates for ALB +- [ ] 7.4 Update example configs + - [ ] Update dev/test/ci-test/prod configs + - [ ] Add ACM example config + - [ ] Remove deprecated fields +- [ ] 7.5 Add CI validation workflow + - [ ] Install lablink-allocator package + - [ ] Run lablink-validate-config on PRs modifying config + - [ ] Block merge if validation fails + +## 8. Testing and Validation (lablink-template repo) +- [ ] 8.1 Test deployment with each use case + - [ ] IP-only (no DNS, no SSL) + - [ ] CloudFlare DNS + SSL + - [ ] Route53 + Let's Encrypt + - [ ] Route53 + ACM (if ACM cert available) + - [ ] Route53 manual DNS + Let's Encrypt +- [ ] 8.2 Verify subdomain takeover fix + - [ ] Confirm sub-subdomains rejected at validation + - [ ] Test that old configs fail validation with clear errors +- [ ] 8.3 Verify FQDN environment variable works + - [ ] Check allocator logs show correct URL source + - [ ] Verify clients connect successfully + +## 9. Documentation and Release (lablink-template repo) +- [ ] 9.1 Update README with breaking changes notice +- [ ] 9.2 Create migration guide in docs + - [ ] Step-by-step config conversion + - [ ] Link to lablink repo migration guide +- [ ] 9.3 Update example deployment instructions + - [ ] New config structure in quick start + - [ ] ACM setup guide +- [ ] 9.4 Create GitHub release with migration notes +- [ ] 9.5 Notify users via GitHub discussions/issues \ No newline at end of file diff --git a/openspec/project.md b/openspec/project.md new file mode 100644 index 00000000..9909aea1 --- /dev/null +++ b/openspec/project.md @@ -0,0 +1,249 @@ +# Project Context + +## Purpose +**LabLink** is a dynamic VM allocation and management system for computational research workflows. It automates deployment and management of cloud-based VMs for running research software. + +### Key Goals +- Automate allocation of AWS EC2 instances for research workloads +- Provide GPU-enabled VMs for computational research +- Track VM state (available, in-use, failed) via PostgreSQL database +- Enable researchers to request VMs via web interface +- Support containerized research software deployment + +## Tech Stack + +### Backend +- **Python 3.9+**: Core language for both services + - Allocator: Python 3.11 (from `uv:python3.11` base image) + - Client: Python 3.10 (Ubuntu 22.04 default) +- **Flask**: Web framework for allocator service +- **PostgreSQL**: Database for VM state management +- **Hydra/OmegaConf**: Configuration management + +### Infrastructure +- **Terraform**: Infrastructure as Code (AWS) +- **Docker**: Containerization for both services +- **AWS EC2**: Virtual machines +- **AWS S3**: Terraform state storage +- **AWS IAM**: Authentication and authorization +- **AWS Route 53**: DNS (optional) + +### Development Tools +- **uv**: Python package manager (recommended) +- **pytest**: Testing framework with coverage +- **ruff**: Linting and formatting (PEP 8, max line 88) +- **GitHub Actions**: CI/CD pipelines +- **MkDocs Material**: Documentation + +### Package Management +- **Monorepo structure** under `packages/` + - `packages/allocator/`: Allocator service package + - `packages/client/`: Client service package +- **PyPI**: Package distribution +- **GHCR**: Docker image registry + +## Project Conventions + +### Code Style +- **Follow PEP 8** via ruff linting +- **Max line length**: 88 characters (ruff default) +- **Type hints**: Required for public functions +- **Docstrings**: Google style for public functions +- **String formatting**: Use f-strings +- **Import order**: Standard library → Third-party → Local (ruff handles) + +### Naming Conventions +- **Files**: `snake_case.py` +- **Functions/Variables**: `snake_case` +- **Classes**: `PascalCase` +- **Constants**: `UPPER_SNAKE_CASE` +- **Private members**: `_leading_underscore` + +### Architecture Patterns + +#### Monorepo Structure +``` +packages/ +├── allocator/ # Allocator service package +│ ├── src/lablink_allocator/ +│ │ ├── main.py # Flask application +│ │ ├── database.py # Database operations +│ │ ├── get_config.py # Config loader +│ │ ├── conf/ # Hydra configuration +│ │ └── terraform/ # Client VM provisioning (part of package) +│ ├── tests/ +│ ├── Dockerfile # Production image (from PyPI) +│ └── Dockerfile.dev # Development image (local code) +└── client/ # Client service package + ├── src/lablink_client/ + │ ├── subscribe.py # Allocator subscription + │ ├── check_gpu.py # GPU health checks + │ └── conf/ # Configuration + ├── tests/ + ├── Dockerfile # Production image (from PyPI) + └── Dockerfile.dev # Development image (local code) +``` + +#### Configuration Management +- **Hydra-based** structured configs (`conf/structured_config.py`) +- **Allocator**: Loads from `/config/config.yaml` (Docker mount) or falls back to bundled `conf/config.yaml` +- **Client**: Uses `ALLOCATOR_URL` env var for HTTPS support, falls back to config.yaml +- **Overrides**: Environment variables, command-line args, YAML edits + +#### Docker Strategy +- **Two Dockerfiles per package**: + - `Dockerfile.dev`: Local code with `uv sync`, dev dependencies, for CI/testing + - `Dockerfile`: PyPI packages with `uv pip install`, production-ready +- **Explicit venv paths**: `/app/.venv` (allocator), `/home/client/.venv` (client) +- **Console scripts**: Entry points defined in `pyproject.toml` + +#### Infrastructure Separation +- **Package repo** (this): Python packages, Docker images, client VM Terraform +- **Template repo** (lablink-template): Allocator infrastructure deployment (EC2, DNS, SSL) + +### Testing Strategy + +#### Unit Tests +- **Framework**: pytest with coverage plugin +- **Minimum coverage**: 90% (enforced in CI) +- **Location**: `tests/` directory per package +- **Mocking**: Mock external dependencies (AWS, database) +- **Terraform tests**: `terraform plan` validation with fixture data + +#### Test Execution +```bash +# Run tests +PYTHONPATH=. pytest + +# With coverage +PYTHONPATH=. pytest --cov + +# Specific test file +PYTHONPATH=. pytest tests/test_api_calls.py +``` + +#### CI Testing +- **Linting**: `ruff check` on both packages +- **Tests**: pytest with coverage on both packages +- **Docker build**: Verify dev images build and run correctly +- **Terraform**: Validate client VM Terraform plans + +### Git Workflow + +#### Branching Strategy +- **main**: Production-ready code +- **test**: Staging environment +- **feature branches**: For development work +- **PR required**: All changes via pull requests + +#### Commit Conventions +- **Format**: Conventional Commits style +- **Examples**: + - `feat: Add HTTPS support to client services` + - `fix: Resolve database connection timeout` + - `chore: Update dependencies` + - `docs: Add API endpoint documentation` + +#### Release Process +``` +1. Development + └─ PR → CI tests, lint, build dev images + +2. Merge to Main + └─ Auto-build latest Docker images from PyPI + +3. Publish to PyPI + └─ Tag: lablink-{package}_v{version} + └─ Trigger publish-pip.yml + +4. Build Production Images + └─ Manual trigger with version parameters +``` + +## Domain Context + +### VM State Machine +VMs transition through states: +- **available**: Ready for assignment to users +- **in-use**: Currently assigned and running research workload +- **failed**: Encountered error, needs attention + +### Allocator-Client Communication +1. **Subscription**: Client VMs register with allocator on startup +2. **Health checks**: Clients report GPU status periodically +3. **Status updates**: Clients update their in-use status +4. **HTTPS support**: Clients use `ALLOCATOR_URL` env var for secure communication + +### VM Lifecycle +1. Admin creates VMs via Terraform (allocator orchestrates) +2. Client VMs boot and run Docker containers +3. Clients subscribe to allocator +4. Users request VMs via web interface +5. Allocator assigns available VMs +6. Clients report health and status +7. Admin can destroy all VMs when needed + +### Research Workflow Integration +- Support custom Docker images for research software +- Support custom Git repositories +- GPU health monitoring for ML/DL workloads +- Configurable instance types and AMIs + +## Important Constraints + +### Security +- **NEVER commit secrets** to version control +- **Change default passwords** immediately in production +- **Rotate SSH keys** every 90 days +- **Use AWS OIDC** for GitHub Actions (no stored credentials) +- **Security groups**: Carefully configure for minimal exposure + +### Python Version Compatibility +- **Minimum**: Python 3.9 (both packages) +- **Allocator**: Developed with Python 3.11 +- **Client**: Developed with Python 3.10 + +### AWS Resource Limits +- **EC2 limits**: Check regional instance limits +- **EIP limits**: Limited number of Elastic IPs per region +- **S3**: Single bucket for Terraform state per environment + +### Docker Image Size +- **Client images**: ~6GB (includes CUDA, GPU drivers) +- **Build time**: Client builds can be slow due to NVIDIA base image + +### Known Issues +- **PostgreSQL restart**: May need manual restart after first deployment +- **Security group persistence**: May need manual deletion when recreating infrastructure +- **SSH key permissions**: Must be `chmod 600` + +## External Dependencies + +### AWS Services +- **EC2**: Virtual machine instances +- **S3**: Terraform state storage +- **IAM**: Authentication, authorization, OIDC for GitHub Actions +- **Security Groups**: Network security +- **Route 53**: DNS (optional) +- **EIP**: Elastic IP addresses + +### Third-Party Services +- **PyPI**: Python package distribution +- **GHCR** (GitHub Container Registry): Docker image storage +- **GitHub Actions**: CI/CD pipelines +- **GitHub Pages**: Documentation hosting + +### External Packages +- **Flask**: Web framework +- **psycopg2**: PostgreSQL adapter +- **Hydra/OmegaConf**: Configuration management +- **boto3**: AWS SDK (for future features) +- **pytest**: Testing framework +- **ruff**: Linting and formatting +- **uv**: Python package manager + +### Infrastructure Dependencies +- **Terraform**: 1.0+ +- **Docker**: 20.10+ +- **PostgreSQL**: 13+ +- **NVIDIA CUDA**: 12.8.1 (client images) diff --git a/packages/allocator/src/lablink_allocator_service/conf/config.yaml b/packages/allocator/src/lablink_allocator_service/conf/config.yaml index d9bf634c..26b08764 100644 --- a/packages/allocator/src/lablink_allocator_service/conf/config.yaml +++ b/packages/allocator/src/lablink_allocator_service/conf/config.yaml @@ -37,22 +37,18 @@ app: dns: enabled: true # true = use DNS for allocator URL, false = IP-only access - terraform_managed: false # false = manual DNS records, true = Terraform creates/destroys records - domain: "lablink.sleap.ai" # Your Route53 hosted zone domain - zone_id: "" # (Optional) Hardcode zone ID to skip lookup - app_name: "" # Not used with custom pattern - pattern: "custom" # "auto" = {env}.{app_name}.{domain}, "custom" = {custom_subdomain}.{domain} - custom_subdomain: "dev" # Will create: dev.lablink.sleap.ai - create_zone: false # false = use existing zone, true = create new zone + terraform_managed: false # false = manual DNS (CloudFlare, etc.), true = Terraform manages Route53 records + domain: "dev.lablink.sleap.ai" # Full domain name for the allocator (supports sub-subdomains for environments) + zone_id: "" # (Optional) Route53 hosted zone ID to skip auto-lookup eip: strategy: "dynamic" # "persistent" = reuse tagged EIP, "dynamic" = create new EIP each time tag_name: "lablink-eip" # Tag Name to identify reusable EIP (used with persistent strategy) ssl: - provider: "letsencrypt" # "letsencrypt" = Caddy auto-SSL, "cloudflare" = CloudFlare proxy, "none" = HTTP only - email: "admin@sleap.ai" # Email for Let's Encrypt notifications - staging: false # true = HTTP only (unlimited testing), false = HTTPS with trusted Let's Encrypt certs (rate limited) + provider: "letsencrypt" # "none" = HTTP only, "letsencrypt" = Caddy auto-SSL, "cloudflare" = CloudFlare proxy, "acm" = AWS Certificate Manager + email: "admin@sleap.ai" # Email for Let's Encrypt notifications (required when provider="letsencrypt") + certificate_arn: "" # ACM certificate ARN (required when provider="acm") allocator: # Docker image tag for the allocator service diff --git a/packages/allocator/src/lablink_allocator_service/conf/structured_config.py b/packages/allocator/src/lablink_allocator_service/conf/structured_config.py index 1ead9fb7..a2a1c0c4 100644 --- a/packages/allocator/src/lablink_allocator_service/conf/structured_config.py +++ b/packages/allocator/src/lablink_allocator_service/conf/structured_config.py @@ -76,34 +76,23 @@ class DNSConfig: """Configuration for DNS and domain setup. This class defines DNS settings for Route 53 hosted zones and records. - DNS can be disabled entirely, or configured with different naming patterns. + DNS can be disabled entirely, or configured with external DNS providers. Attributes: enabled (bool): Whether DNS is enabled. If False, only IP addresses are used. terraform_managed (bool): Whether Terraform creates/destroys DNS records. - If False, DNS records must be created manually in Route53. - domain (str): The base domain name (e.g., "sleap.ai"). + - True: Terraform manages Route53 DNS records (creates and destroys) + - False: External DNS management (CloudFlare, manual Route53, etc.) + domain (str): Full domain name for the allocator (e.g., "lablink.sleap.ai", "test.lablink.sleap.ai"). + Required when enabled=true. Supports sub-subdomains for environment separation. zone_id (str): Optional Route53 hosted zone ID. If provided, skips zone lookup. Use this when zone lookup finds the wrong zone (e.g., parent vs subdomain). - app_name (str): The application name used in subdomains (e.g., "lablink"). - pattern (str): Naming pattern for DNS records. Options: - - "auto": Automatically generate based on environment - prod: {app_name}.{domain} - non-prod: {env}.{app_name}.{domain} - - "app-only": Always use {app_name}.{domain} - - "custom": Use custom_subdomain value - custom_subdomain (str): Custom subdomain when pattern="custom" - create_zone (bool): Whether to create a new Route 53 hosted zone """ enabled: bool = field(default=False) terraform_managed: bool = field(default=True) domain: str = field(default="") zone_id: str = field(default="") - app_name: str = field(default="lablink") - pattern: str = field(default="auto") - custom_subdomain: str = field(default="") - create_zone: bool = field(default=False) @dataclass @@ -127,18 +116,17 @@ class SSLConfig: Attributes: provider (str): SSL provider. Options: - - "letsencrypt": Automatic SSL via Caddy + Let's Encrypt - - "cloudflare": CloudFlare proxy handles SSL - "none": HTTP only, no SSL - email (str): Email address for Let's Encrypt notifications - staging (bool): When true, serve HTTP only for unlimited testing. - When false, serve HTTPS with trusted Let's Encrypt certificates - (rate limited to 5 duplicate certs per week). + - "letsencrypt": Automatic SSL via Caddy + Let's Encrypt + - "cloudflare": CloudFlare proxy handles SSL (requires terraform_managed=false) + - "acm": AWS Certificate Manager (requires ALB, certificate_arn) + email (str): Email address for Let's Encrypt notifications (required when provider="letsencrypt") + certificate_arn (str): AWS ACM certificate ARN (required when provider="acm") """ provider: str = field(default="letsencrypt") email: str = field(default="") - staging: bool = field(default=False) + certificate_arn: str = field(default="") @dataclass diff --git a/packages/allocator/src/lablink_allocator_service/main.py b/packages/allocator/src/lablink_allocator_service/main.py index 9f74ff9f..9a48bcdb 100644 --- a/packages/allocator/src/lablink_allocator_service/main.py +++ b/packages/allocator/src/lablink_allocator_service/main.py @@ -69,42 +69,6 @@ database = None -def generate_dns_name(dns_config: DNSConfig, environment: str) -> str: - """Generate DNS name based on configuration and environment. - - Args: - dns_config: DNSConfig object from configuration - environment: Current environment (prod, test, dev, etc.) - - Returns: - str: Generated DNS name or empty string if DNS is disabled - """ - if not dns_config.enabled or not dns_config.domain: - return "" - - if dns_config.pattern == "auto": - # prod: {app_name}.{domain}, others: {env}.{app_name}.{domain} - if environment == "prod": - return f"{dns_config.app_name}.{dns_config.domain}" - else: - return f"{environment}.{dns_config.app_name}.{dns_config.domain}" - elif dns_config.pattern == "app-only": - # Always use {app_name}.{domain} - return f"{dns_config.app_name}.{dns_config.domain}" - elif dns_config.pattern == "custom": - # Use custom subdomain - if dns_config.custom_subdomain: - return dns_config.custom_subdomain - else: - logger.warning( - "DNS pattern is 'custom' but custom_subdomain is empty. Using IP only." - ) - return "" - else: - logger.warning(f"Unknown DNS pattern '{dns_config.pattern}'. Using IP only.") - return "" - - def init_database(): """Initialize the database connection.""" global database diff --git a/packages/allocator/src/lablink_allocator_service/utils/config_helpers.py b/packages/allocator/src/lablink_allocator_service/utils/config_helpers.py index 1d15eba5..7ee54752 100644 --- a/packages/allocator/src/lablink_allocator_service/utils/config_helpers.py +++ b/packages/allocator/src/lablink_allocator_service/utils/config_helpers.py @@ -1,13 +1,20 @@ """Configuration helper functions for building URLs and determining settings.""" +import os +import logging from typing import Tuple +logger = logging.getLogger(__name__) + def get_allocator_url(cfg, allocator_ip: str) -> Tuple[str, str]: """ Build the allocator URL based on configuration. - Automatically determines the correct URL based on DNS and SSL settings. + Priority order: + 1. ALLOCATOR_FQDN environment variable (set by Terraform) + 2. DNS configuration from config + 3. IP address fallback Args: cfg: Hydra configuration object @@ -17,67 +24,56 @@ def get_allocator_url(cfg, allocator_ip: str) -> Tuple[str, str]: Tuple of (base_url, protocol) Examples: - DNS enabled + Let's Encrypt SSL (production): + ALLOCATOR_FQDN environment variable: ("https://test.lablink.sleap.ai", "https") - DNS enabled + Let's Encrypt SSL (staging): - ("http://test.lablink.sleap.ai", "http") + DNS enabled + Let's Encrypt SSL: + ("https://test.lablink.sleap.ai", "https") DNS disabled + No SSL: ("http://52.40.142.146", "http") + """ + # Priority 1: Check for ALLOCATOR_FQDN environment variable (set by Terraform) + allocator_fqdn = os.getenv("ALLOCATOR_FQDN") + if allocator_fqdn: + # FQDN already includes protocol + if allocator_fqdn.startswith("https://"): + protocol = "https" + elif allocator_fqdn.startswith("http://"): + protocol = "http" + else: + # Default to http if no protocol specified + protocol = "http" + allocator_fqdn = f"{protocol}://{allocator_fqdn}" - DNS enabled + No SSL: - ("http://test.lablink.sleap.ai", "http") + logger.info(f"Using ALLOCATOR_FQDN from environment: {allocator_fqdn}") + return allocator_fqdn, protocol - DNS enabled + Cloudflare SSL: - ("https://test.lablink.sleap.ai", "https") - """ + # Priority 2: Build from DNS configuration # Determine protocol based on SSL provider - # When staging=true, Caddy serves HTTP only (no SSL certificates) - # When staging=false, Caddy serves HTTPS with trusted Let's Encrypt certs if hasattr(cfg, "ssl") and cfg.ssl.provider != "none": - is_staging = hasattr(cfg.ssl, "staging") and cfg.ssl.staging - if is_staging: - # Staging mode: HTTP only - protocol = "http" - else: - # Production mode: HTTPS with trusted certificates - protocol = "https" + protocol = "https" else: protocol = "http" # Determine host based on DNS configuration - if hasattr(cfg, "dns") and cfg.dns.enabled: - # Use DNS hostname - if cfg.dns.pattern == "custom": - # Only add subdomain if it's non-empty - if cfg.dns.custom_subdomain: - host = f"{cfg.dns.custom_subdomain}.{cfg.dns.domain}" - else: - host = cfg.dns.domain - elif cfg.dns.pattern == "auto": - # For auto pattern, would need environment/resource_suffix - # For now, fall back to custom_subdomain if available - if cfg.dns.custom_subdomain: - host = f"{cfg.dns.custom_subdomain}.{cfg.dns.domain}" - else: - host = cfg.dns.domain - else: - # Default to just the domain - host = cfg.dns.domain + if hasattr(cfg, "dns") and cfg.dns.enabled and cfg.dns.domain: + # Use DNS domain directly (now includes full domain) + host = cfg.dns.domain - # Just to be safe + # Remove leading dots if present (safety check) if host.startswith("."): host = host[1:] + logger.warning(f"Removed leading dot from domain: {host}") + + logger.info(f"Using domain from config: {host}") else: - # Use IP address + # Priority 3: Use IP address host = allocator_ip + logger.info(f"Using IP-only mode: {host}") base_url = f"{protocol}://{host}" - # Sanitize URL to remove prepended dots - base_url = base_url.replace("://.", "://") - return base_url, protocol diff --git a/packages/allocator/src/lablink_allocator_service/validate_config.py b/packages/allocator/src/lablink_allocator_service/validate_config.py index 3a513151..b8b860ff 100644 --- a/packages/allocator/src/lablink_allocator_service/validate_config.py +++ b/packages/allocator/src/lablink_allocator_service/validate_config.py @@ -24,6 +24,74 @@ logger = logging.getLogger(__name__) +def validate_domain_format(domain: str) -> Tuple[bool, str]: + """Validate domain format to prevent malformed domains. + + Args: + domain: Domain name to validate + + Returns: + Tuple of (is_valid, error_message) + """ + if not domain: + return True, "" # Empty is allowed when DNS disabled + + # Check for leading/trailing dots + if domain.startswith("."): + return False, "Domain cannot start with a dot" + if domain.endswith("."): + return False, "Domain cannot end with a dot" + + return True, "" + + +def validate_config_logic(cfg) -> Tuple[bool, str]: + """Validate configuration logic and dependencies. + + Args: + cfg: Loaded configuration object + + Returns: + Tuple of (is_valid, error_message) + """ + errors = [] + + # DNS enabled requires non-empty domain + if cfg.dns.enabled and not cfg.dns.domain: + errors.append("DNS enabled requires non-empty domain field") + + # Validate domain format + is_valid, error_msg = validate_domain_format(cfg.dns.domain) + if not is_valid: + errors.append(error_msg) + + # SSL (non-"none") requires DNS + if cfg.ssl.provider != "none" and not cfg.dns.enabled: + errors.append('SSL requires DNS to be enabled (use provider="none" for HTTP-only)') + + # Let's Encrypt requires email + if cfg.ssl.provider == "letsencrypt" and not cfg.ssl.email: + errors.append("Let's Encrypt requires email address") + + # ACM requires certificate_arn + if cfg.ssl.provider == "acm" and not cfg.ssl.certificate_arn: + errors.append("ACM provider requires certificate_arn") + + # CloudFlare SSL requires external DNS (terraform_managed=false) + if cfg.ssl.provider == "cloudflare" and cfg.dns.terraform_managed: + errors.append( + "CloudFlare SSL requires terraform_managed=false (external DNS management)" + ) + + if errors: + error_msg = "[FAIL] Config validation failed:\n" + for error in errors: + error_msg += f" - {error}\n" + return False, error_msg + + return True, "" + + def validate_config(config_path: str) -> Tuple[bool, str]: """Validate a configuration file against the schema. @@ -54,7 +122,13 @@ def validate_config(config_path: str) -> Tuple[bool, str]: try: # Use get_config() with explicit path - it validates automatically - get_config(config_path=str(path)) + cfg = get_config(config_path=path.as_posix()) + + # Run logic validation + is_valid, error_msg = validate_config_logic(cfg) + if not is_valid: + return False, error_msg + return True, "[PASS] Config validation passed" except ConfigCompositionException as e: diff --git a/packages/allocator/tests/conftest.py b/packages/allocator/tests/conftest.py index 93f4745e..8bb2706d 100644 --- a/packages/allocator/tests/conftest.py +++ b/packages/allocator/tests/conftest.py @@ -37,10 +37,6 @@ def omega_config(): "terraform_managed": False, "domain": "", "zone_id": "", - "app_name": "lablink", - "pattern": "auto", - "custom_subdomain": "", - "create_zone": False, }, "eip": { "strategy": "dynamic", @@ -49,6 +45,7 @@ def omega_config(): "ssl": { "provider": "none", "email": "", + "certificate_arn": "", }, "bucket_name": "test-bucket", "startup_script": { @@ -152,10 +149,6 @@ def valid_config_dict(): "terraform_managed": False, "domain": "lablink.example.com", "zone_id": "", - "app_name": "lablink", - "pattern": "auto", - "custom_subdomain": "", - "create_zone": False, }, "eip": { "strategy": "dynamic", @@ -164,7 +157,7 @@ def valid_config_dict(): "ssl": { "provider": "none", "email": "admin@example.com", - "staging": True, + "certificate_arn": "", }, "allocator": { "image_tag": "linux-amd64-latest-test", diff --git a/packages/allocator/tests/test_validate_config.py b/packages/allocator/tests/test_validate_config.py index 3e5b8d5b..8aadabe3 100644 --- a/packages/allocator/tests/test_validate_config.py +++ b/packages/allocator/tests/test_validate_config.py @@ -172,3 +172,133 @@ def mock_get_config(*args, **kwargs): assert "[FAIL]" in message assert "RuntimeError" in message assert "unexpected error" in message.lower() + + +# New tests for simplified DNS/SSL configuration validation + +def test_ssl_requires_dns(valid_config_dict, write_config_file): + """Test that SSL requires DNS to be enabled.""" + config = valid_config_dict.copy() + config["dns"]["enabled"] = False + config["ssl"]["provider"] = "letsencrypt" + config["ssl"]["email"] = "test@example.com" + + config_path = write_config_file(config) + is_valid, message = validate_config(config_path) + + assert is_valid is False + assert "SSL requires DNS" in message + + +def test_dns_requires_domain(valid_config_dict, write_config_file): + """Test that DNS enabled requires non-empty domain.""" + config = valid_config_dict.copy() + config["dns"]["enabled"] = True + config["dns"]["domain"] = "" + + config_path = write_config_file(config) + is_valid, message = validate_config(config_path) + + assert is_valid is False + assert "domain field" in message.lower() + + +def test_domain_cannot_start_with_dot(valid_config_dict, write_config_file): + """Test that domains cannot start with a dot.""" + config = valid_config_dict.copy() + config["dns"]["enabled"] = True + config["dns"]["domain"] = ".lablink.sleap.ai" + + config_path = write_config_file(config) + is_valid, message = validate_config(config_path) + + assert is_valid is False + assert "cannot start with a dot" in message.lower() + + +def test_domain_cannot_end_with_dot(valid_config_dict, write_config_file): + """Test that domains cannot end with a dot.""" + config = valid_config_dict.copy() + config["dns"]["enabled"] = True + config["dns"]["domain"] = "lablink.sleap.ai." + + config_path = write_config_file(config) + is_valid, message = validate_config(config_path) + + assert is_valid is False + assert "cannot end with a dot" in message.lower() + + +def test_letsencrypt_requires_email(valid_config_dict, write_config_file): + """Test that Let's Encrypt requires email address.""" + config = valid_config_dict.copy() + config["dns"]["enabled"] = True + config["dns"]["domain"] = "test.lablink.sleap.ai" + config["ssl"]["provider"] = "letsencrypt" + config["ssl"]["email"] = "" + + config_path = write_config_file(config) + is_valid, message = validate_config(config_path) + + assert is_valid is False + assert "Let's Encrypt requires email" in message + + +def test_acm_requires_certificate_arn(valid_config_dict, write_config_file): + """Test that ACM requires certificate ARN.""" + config = valid_config_dict.copy() + config["dns"]["enabled"] = True + config["dns"]["domain"] = "test.lablink.sleap.ai" + config["ssl"]["provider"] = "acm" + config["ssl"]["certificate_arn"] = "" + + config_path = write_config_file(config) + is_valid, message = validate_config(config_path) + + assert is_valid is False + assert "ACM provider requires certificate_arn" in message + + +def test_cloudflare_requires_external_dns(valid_config_dict, write_config_file): + """Test that CloudFlare SSL requires terraform_managed=false.""" + config = valid_config_dict.copy() + config["dns"]["enabled"] = True + config["dns"]["terraform_managed"] = True # Should be False for CloudFlare + config["dns"]["domain"] = "test.lablink.sleap.ai" + config["ssl"]["provider"] = "cloudflare" + + config_path = write_config_file(config) + is_valid, message = validate_config(config_path) + + assert is_valid is False + assert "CloudFlare" in message + assert "terraform_managed=false" in message + + +def test_valid_sub_subdomain(valid_config_dict, write_config_file): + """Test that sub-subdomains are allowed.""" + config = valid_config_dict.copy() + config["dns"]["enabled"] = True + config["dns"]["domain"] = "test.lablink.sleap.ai" # Sub-subdomain + config["ssl"]["provider"] = "letsencrypt" + config["ssl"]["email"] = "admin@example.com" + + config_path = write_config_file(config) + is_valid, message = validate_config(config_path) + + assert is_valid is True + assert "[PASS]" in message + + +def test_ip_only_mode_valid(valid_config_dict, write_config_file): + """Test that IP-only mode (no DNS, no SSL) is valid.""" + config = valid_config_dict.copy() + config["dns"]["enabled"] = False + config["dns"]["domain"] = "" + config["ssl"]["provider"] = "none" + + config_path = write_config_file(config) + is_valid, message = validate_config(config_path) + + assert is_valid is True + assert "[PASS]" in message