diff --git a/.coderabbit.yaml b/.coderabbit.yaml index 0dc708981..2188db328 100644 --- a/.coderabbit.yaml +++ b/.coderabbit.yaml @@ -3,3 +3,6 @@ reviews: profile: chill # assertive / chill profile auto_review: enabled: true # Enable auto-review for this repository + path_instructions: + - path: "src/**" + instructions: "If this PR adds, removes, or renames a service, module, or major component, check that AGENTS.md is updated accordingly." diff --git a/AGENTS.md b/AGENTS.md index acc1bcf81..521a3dece 100644 --- a/AGENTS.md +++ b/AGENTS.md @@ -6,6 +6,22 @@ This file provides guidance to AI agents when working with the OSMO codebase. OSMO is a workflow orchestration platform for Physical AI, managing heterogeneous Kubernetes clusters for training, simulation, and edge compute workloads. +## Workflow Requirements + +Before making any code changes in this repo, you MUST: + +1. **Explore first**: Use the Codebase Structure section below to orient yourself, then read relevant source files before proposing changes. Read existing implementations, tests, and related modules. Never modify code you haven't read. +2. **Plan before implementing**: For any non-trivial change (more than a simple one-line fix), create an explicit plan that identifies: + - Which files need to change and why + - How the change fits with existing patterns in the codebase + - What tests exist and what new tests are needed + - Any cross-cutting concerns (e.g., auth, storage backends, IPC protocols) + - A verification plan: how to confirm the change works (e.g., specific tests to run, build commands, manual checks) +3. **Check for downstream impact**: This is a multi-service platform — changes in shared libraries (`lib/`, `utils/`) can affect multiple services. Grep for usages before modifying shared code. +4. **Verify after implementation**: After completing changes, execute the verification plan — run the relevant tests/builds and confirm they pass before claiming the work is done. Never assert success without evidence. +5. **Simplify before committing**: Review your changes for unnecessary complexity, redundancy, and over-engineering before committing. Prefer the simplest solution that meets the requirements. +6. **Update documentation**: If adding, removing, or renaming a service, module, or major component, update the "Codebase Structure" section in this file as part of the same change. + ## Team Guidelines - Follow existing code patterns and conventions in the codebase @@ -15,15 +31,10 @@ OSMO is a workflow orchestration platform for Physical AI, managing heterogeneou - Copyright headers must keep "All rights reserved." on the same line as "NVIDIA CORPORATION & AFFILIATES" - If copyright lines exceed 100 characters, add `# pylint: disable=line-too-long` comment instead of breaking into multiple lines -## Tool Usage Preferences - -- Use specialized tools (Read, Edit, Write, Grep, Glob) instead of Bash commands whenever possible -- Bash tools require user intervention to allow and should only be used as a last resort -- Prefer Read over cat, Edit over sed, Write over echo/heredoc, Grep over grep, and Glob over find - -## Coding Standards +## Python Coding Standards ### Import Statements + - All imports must be at the top level of the module - Place all imports at the top of the file after the module docstring - **No exceptions**: Imports inside functions are not allowed @@ -35,12 +46,14 @@ OSMO is a workflow orchestration platform for Physical AI, managing heterogeneou - Use late binding or forward references for type hints (PEP 563) ### Variable Naming + - Do not use abbreviations in variable names unless they are well-understood abbreviations or common conventions - **Good**: `topology_key`, `config`, `i` (iterator), `x`, `y`, `z` (coordinates) - **Bad**: `tk` (for topology_key), `topo` (for topology), `req` (for requirement) - Use full, descriptive names that make code self-documenting ### Type Annotations and Data Structures + - **Use strict typing**: Add type annotations where they improve code clarity and catch errors - **Prefer dataclasses over dictionaries**: When passing structured data with multiple fields, use dataclasses instead of `Dict[str, Any]` - **Good**: `@dataclasses.dataclass class TaskTopology: name: str; requirements: List[...]` @@ -55,8 +68,178 @@ OSMO is a workflow orchestration platform for Physical AI, managing heterogeneou - **Bad**: `def process(items: List[str] = []) -> None:` - all callers share the same list instance! ### Assertions + - **Do not use `assert` statements in production code** - only in unit tests - **Reason**: Assertions can be disabled with Python's `-O` flag and should not be relied upon for runtime validation - **Use proper error handling instead**: Raise appropriate exceptions (ValueError, TypeError, etc.) for validation - **Good**: `if value is None: raise ValueError("Value cannot be None")` - **Bad**: `assert value is not None, "Value cannot be None"` + +## Codebase Structure (`src/`) + +All paths below are relative to `src/`. + +### Core Service (`service/core/`) — Main FastAPI Microservice + +Entry point: `service/core/service.py`. Framework: FastAPI + Uvicorn + OpenTelemetry. + +| Submodule | Purpose | +|-----------|---------| +| `auth/` | JWT token lifecycle, access token CRUD, user management, role assignment | +| `workflow/` | Workflow submit/list/cancel, resource quota, pool allocation, task coordination, credential management | +| `config/` | Service/workflow/dataset configuration CRUD with versioning and history. Pod templates, resource validation rules, pool/backend config | +| `data/` | Dataset/collection management, versioning with tags, multi-backend storage, streaming downloads | +| `app/` | Workflow app lifecycle (create, version, rename, delete), YAML spec validation | +| `profile/` | User profile/preferences, token identity, role/pool visibility | + +**Error types**: Defined in `lib/utils/` — see the `OSMOError` hierarchy for the full list. + +### Supporting Services + +| Service | Purpose | +|---------|---------| +| `service/router/` | Routes HTTP/WebSocket requests to backends. Sticky session routing. WebSocket endpoints for exec, portforward, rsync. | +| `service/worker/` | Kombu-based Redis job queue consumer. Deduplicates jobs. Executes `FrontendJob` subclasses. | +| `service/agent/` | Backend cluster integration via WebSocket. Receives node/pod/event/heartbeat streams from K8s clusters. | +| `service/logger/` | Receives structured logs from osmo-ctrl containers. Persists task metrics to PostgreSQL. Distributed barriers via Redis. | +| `service/delayed_job_monitor/` | Polls Redis for scheduled jobs, promotes to main queue when ready. | + +### Python Libraries (`lib/`) + +| Library | Key Classes | Purpose | +|---------|-------------|---------| +| `lib/data/storage/` | `Client`, `StorageBackend`, `ExecutorParameters`, `StoragePath` | Multi-cloud storage SDK (S3, Azure, GCS, Swift, TOS). Parallel multiprocess+multithread executor. Streaming upload/download. | +| `lib/data/dataset/` | `Manager` | Dataset lifecycle (upload, download, migrate) built on storage SDK. | +| `lib/utils/` | `LoginManager`, `ServiceClient`, `OSMOError` hierarchy | Client SDK for HTTP/WebSocket requests with JWT auth. Error types, logging, validation, credential management. | +| `lib/rsync/` | `RsyncClient` | File watch-based rsync with debounce/reconciliation. Port forwarding for remote access. | + +### Python Utilities (`utils/`) + +| Module | Key Classes | Purpose | +|--------|-------------|---------| +| `utils/job/` | `Task`, `FrontendJob`, `K8sObjectFactory`, `PodGroupTopologyBuilder` | Workflow execution framework. Task → K8s spec generation. Gang scheduling via PodGroup. Topology constraints. Backend job definitions. | +| `utils/connectors/` | `ClusterConnector`, `PostgresConnector`, `RedisConnector` | K8s API wrapper, PostgreSQL operations, Redis job queue management. | +| `utils/secret_manager/` | `SecretManager` | JWE-based secret encryption/decryption. MEK/UEK key management. | +| `utils/progress_check/` | — | Liveness/progress tracking for long-running services. | +| `utils/metrics/` | — | Prometheus metrics collection and export. | + +### CLI (`cli/`) + +Entry point: `cli.py` → `main_parser.py` (argparse). Subcommand modules: + + +| Module | Commands | +| -------------------------------------------------------------------------------------------------------------- | -------------------------------- | +| `workflow.py` | submit, list, cancel, exec, logs | +| `data.py` | upload, download, list, delete | +| `dataset.py` | Dataset management | +| `app.py` | App submission/management | +| `config.py` | Service configuration | +| `profile.py` | User profiles | +| `login.py` | Authentication | +| `pool.py`, `resources.py`, `user.py`, `credential.py`, `access_token.py`, `bucket.py`, `task.py`, `version.py` | Supporting commands | +| `backend.py` | Backend cluster management | + +Features: Tab completion (shtab), response formatting (`formatters.py`), spec editor (`editor.py`), PyInstaller packaging (`cli_builder.py`, `packaging/`). + +### Go Runtime Containers (`runtime/`) + +| Binary | Purpose | +|--------|---------| +| `runtime/cmd/ctrl/` | **osmo_ctrl** — Orchestrates workflow execution. WebSocket to workflow service. Unix socket to osmo_user. Manages data download/upload, barriers for multi-task sync, port forwarding. | +| `runtime/cmd/user/` | **osmo_user** — Executes user commands with PTY. Streams stdout/stderr to ctrl. Handles checkpointing (periodic uploads). | +| `runtime/cmd/rsync/` | **osmo_rsync** — Rsync daemon with bandwidth limiting. | + +### Go Runtime Packages (`runtime/pkg/`) + +| Package | Purpose | +|---------|---------| +| `args/` | CLI flag parsing for ctrl and user containers. | +| `messages/` | IPC message protocol between containers (exec lifecycle, log streaming, barriers). | +| `common/` | Shared utilities: command execution, file operations, circular buffer. | +| `data/` | Input/output data handling. Storage backend abstraction (S3, Swift, GCS, TOS). Mount/download/upload with retry and checkpointing. | +| `metrics/` | Execution timing and data transfer metrics collection. | +| `osmo_errors/` | Error handling with categorized exit codes and termination logging. | +| `rsync/` | Rsync daemon subprocess management with monitoring. | + +### Go Utilities (`utils/` — Go) + +| Package | Purpose | +|---------|---------| +| `roles/` | Semantic RBAC. Actions like `workflow:Create`, `dataset:Read`. LRU cache with TTL. Role sync from IDP. Pool access evaluation. | +| `postgres/` | PostgreSQL client with pgx connection pool and pgroll schema version support. | +| `redis/` | Redis client with optional TLS. | +| `logging/` | Structured slog handler compatible with Fluent Bit parsers. | +| `env.go` | Environment variable helpers with YAML config file fallback. | + +### Authorization Sidecar (`service/authz_sidecar/`) — Go gRPC + +- Implements external authorization for the API gateway +- Flow: Extract user/roles from request headers → sync roles from IDP → resolve role policies from cache/DB → evaluate semantic RBAC → return allow/deny with `x-osmo-user`, `x-osmo-roles`, `x-osmo-allowed-pools` headers + +### Frontend (`ui/`) + +- **Framework**: Next.js (App Router, Turbopack) + React + TypeScript (see `package.json` for versions) +- **Styling**: Tailwind CSS + shadcn/ui +- **State**: TanStack Query (data fetching), Zustand (UI state), nuqs (URL state) +- **Testing**: Vitest (unit), Playwright (E2E), MSW (API mocking) +- **API layer**: OpenAPI-generated types (`lib/api/generated.ts` — DO NOT EDIT) + adapter layer (`lib/api/adapter/`) that bridges backend quirks to UI expectations +- **Key routes**: pools, resources, workflows, datasets, occupancy, profile, log-viewer (under `app/(dashboard)/`) +- **Import rules**: Absolute imports only (`@/...`), no barrel exports, API types from adapter (not generated) + +### Operator (`operator/`) + +- `backend_listener.py` — WebSocket listener for backend cluster status +- `backend_worker.py` — Job execution engine for backend tasks +- `backend_test_runner/` — Test orchestration for backend validation +- `utils/node_validation_test/` — GPU validation (nvidia-smi, tflops benchmark, stuck pod detection) +### Tests + + +| Location | Framework | Scope | +| ------------------------------------------- | ------------------------------ | ---------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `tests/common/` | pytest + testcontainers | Shared fixtures: PostgreSQL (`database/`), S3/Swift/Redis storage (`storage/`), Docker network/registry (`core/`, `registry/`), Envoy TLS proxy (`envoy/`) | +| `tests/common/database/testdata/schema.sql` | — | Database schema definition (source of truth for test DB) | +| `runtime/pkg/*_test.go` | Go testing + testcontainers-go | Runtime package unit/integration tests | +| `utils/*_test.go` | Go testing | Go utility tests (roles, postgres, redis) | +| `ui/src/**/*.test.ts` | Vitest | Frontend unit tests | +| `ui/e2e/` | Playwright | Frontend E2E tests with page object models | + +### Key Architecture Patterns + +- **Container runtime**: Three container types per workflow — ctrl (orchestrator), user (execution), data (rsync sidecar) +- **IPC**: WebSocket (ctrl↔workflow service), Unix sockets (ctrl↔user), gRPC (authz sidecar) +- **Auth**: API gateway → authz_sidecar (semantic RBAC with `x-osmo-user`, `x-osmo-roles`, `x-osmo-allowed-pools` headers) +- **Storage**: Multi-cloud abstraction (S3/Azure/GCS/Swift/TOS) with parallel multiprocess+multithread transfer +- **Job queue**: Redis-backed Kombu queue with deduplication. Delayed jobs via Redis ZSET. +- **Databases**: PostgreSQL (pgx), Redis (caching + job queue + event streams + barriers), pgroll for schema versioning +- **Monitoring**: Prometheus + Grafana + Loki. OpenTelemetry instrumentation on FastAPI. + +### Inter-Service Communication + +``` +Client → API Gateway → authz_sidecar (gRPC Check) → Core Service (FastAPI) + ├── PostgreSQL (state) + ├── Redis (cache, job queue, events) + ├── → Worker (job consumer) + ├── ↔ Agent (WebSocket backend events) + ├── ↔ Logger (WebSocket log streaming) + ├── → Router (HTTP/WS request routing) + └── → Delayed Job Monitor (scheduled jobs) + +Workflow Execution: + Core Service → K8s Backend → [osmo_ctrl ↔ osmo_user ↔ osmo_rsync] + osmo_ctrl ↔ Core Service (WebSocket) + osmo_ctrl → Logger (WebSocket logs/metrics) +``` + +### Build & Test + +- **Build system**: Bazel (`MODULE.bazel`, `.bazelrc`) — check `MODULE.bazel` for current version +- **Python**: ruff linter (`.ruff.toml`, Google style) — check `MODULE.bazel` for Python version +- **Go**: module `go.corp.nvidia.com/osmo` (single `go.mod` at `src/`) — check `go.mod` for Go version +- **Frontend**: Next.js + pnpm, TypeScript strict mode, ESLint + Prettier — check `ui/package.json` for versions +- **Tests**: Bazel test rules, pytest + testcontainers (Python), testcontainers-go (Go), Vitest + Playwright (frontend) +- **Container images**: Built via `rules_oci` (amd64, arm64), distroless base from NVIDIA NGC +- **API spec**: OpenAPI auto-generated from FastAPI via `bazel run //src/scripts:export_openapi` + diff --git a/CLAUDE.md b/CLAUDE.md new file mode 100644 index 000000000..9319e0556 --- /dev/null +++ b/CLAUDE.md @@ -0,0 +1,7 @@ +@AGENTS.md + +## Tool Usage Preferences + +- Use specialized tools (Read, Edit, Write, Grep, Glob) instead of Bash commands whenever possible +- Bash tools require user intervention to allow and should only be used as a last resort +- Prefer Read over cat, Edit over sed, Write over echo/heredoc, Grep over grep, and Glob over find