Commit fdb0014

Add Workflow Requirements and detailed codebase index to AGENTS.md (#675)
1 parent 5edeab4 commit fdb0014

3 files changed: +200 -7 lines changed

.coderabbit.yaml

Lines changed: 3 additions & 0 deletions
@@ -3,3 +3,6 @@ reviews:
   profile: chill # assertive / chill profile
   auto_review:
     enabled: true # Enable auto-review for this repository
+  path_instructions:
+    - path: "src/**"
+      instructions: "If this PR adds, removes, or renames a service, module, or major component, check that AGENTS.md is updated accordingly."

AGENTS.md

Lines changed: 190 additions & 7 deletions
@@ -6,6 +6,22 @@ This file provides guidance to AI agents when working with the OSMO codebase.
 
 OSMO is a workflow orchestration platform for Physical AI, managing heterogeneous Kubernetes clusters for training, simulation, and edge compute workloads.
 
+## Workflow Requirements
+
+Before making any code changes in this repo, you MUST:
+
+1. **Explore first**: Use the Codebase Structure section below to orient yourself, then read relevant source files before proposing changes. Read existing implementations, tests, and related modules. Never modify code you haven't read.
+2. **Plan before implementing**: For any non-trivial change (more than a simple one-line fix), create an explicit plan that identifies:
+   - Which files need to change and why
+   - How the change fits with existing patterns in the codebase
+   - What tests exist and what new tests are needed
+   - Any cross-cutting concerns (e.g., auth, storage backends, IPC protocols)
+   - A verification plan: how to confirm the change works (e.g., specific tests to run, build commands, manual checks)
+3. **Check for downstream impact**: This is a multi-service platform — changes in shared libraries (`lib/`, `utils/`) can affect multiple services. Grep for usages before modifying shared code.
+4. **Verify after implementation**: After completing changes, execute the verification plan — run the relevant tests/builds and confirm they pass before claiming the work is done. Never assert success without evidence.
+5. **Simplify before committing**: Review your changes for unnecessary complexity, redundancy, and over-engineering before committing. Prefer the simplest solution that meets the requirements.
+6. **Update documentation**: If adding, removing, or renaming a service, module, or major component, update the "Codebase Structure" section in this file as part of the same change.
+
 ## Team Guidelines
 
 - Follow existing code patterns and conventions in the codebase
@@ -15,15 +31,10 @@ OSMO is a workflow orchestration platform for Physical AI, managing heterogeneou
 - Copyright headers must keep "All rights reserved." on the same line as "NVIDIA CORPORATION & AFFILIATES"
 - If copyright lines exceed 100 characters, add `# pylint: disable=line-too-long` comment instead of breaking into multiple lines
 
-## Tool Usage Preferences
-
-- Use specialized tools (Read, Edit, Write, Grep, Glob) instead of Bash commands whenever possible
-- Bash tools require user intervention to allow and should only be used as a last resort
-- Prefer Read over cat, Edit over sed, Write over echo/heredoc, Grep over grep, and Glob over find
-
-## Coding Standards
+## Python Coding Standards
 
 ### Import Statements
+
 - All imports must be at the top level of the module
 - Place all imports at the top of the file after the module docstring
 - **No exceptions**: Imports inside functions are not allowed
@@ -35,12 +46,14 @@ OSMO is a workflow orchestration platform for Physical AI, managing heterogeneou
 - Use late binding or forward references for type hints (PEP 563)
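
As an illustration of the forward-reference bullet above, here is a minimal sketch using a quoted annotation, so the class can refer to itself before its definition is complete. `TreeNode` is a hypothetical class, not something from the OSMO codebase. Under PEP 563 the same effect can be had without quotes by placing `from __future__ import annotations` at the top of the module.

```python
import dataclasses
from typing import Optional


@dataclasses.dataclass
class TreeNode:
    """Hypothetical node type, used only to show a self-referential type hint."""

    name: str
    # Quoted forward reference: TreeNode is not fully defined at this point,
    # so the annotation is written as a string and resolved lazily.
    parent: Optional["TreeNode"] = None

    def root(self) -> "TreeNode":
        """Walk parent links until the top-most node is reached."""
        node = self
        while node.parent is not None:
            node = node.parent
        return node
```

All imports stay at module top level, in line with the import rules in this hunk.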
 
 ### Variable Naming
+
 - Do not use abbreviations in variable names unless they are well-understood abbreviations or common conventions
   - **Good**: `topology_key`, `config`, `i` (iterator), `x`, `y`, `z` (coordinates)
   - **Bad**: `tk` (for topology_key), `topo` (for topology), `req` (for requirement)
 - Use full, descriptive names that make code self-documenting
 
 ### Type Annotations and Data Structures
+
 - **Use strict typing**: Add type annotations where they improve code clarity and catch errors
 - **Prefer dataclasses over dictionaries**: When passing structured data with multiple fields, use dataclasses instead of `Dict[str, Any]`
   - **Good**: `@dataclasses.dataclass class TaskTopology: name: str; requirements: List[...]`
@@ -55,8 +68,178 @@ OSMO is a workflow orchestration platform for Physical AI, managing heterogeneou
   - **Bad**: `def process(items: List[str] = []) -> None:` - all callers share the same list instance!
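
The shared-list problem named in the **Bad** example can be reproduced in a few lines. This is a sketch under assumptions: the function names are hypothetical, and the `None` sentinel shown is one common fix, not necessarily the one AGENTS.md prescribes in the lines this hunk does not show.

```python
from typing import List, Optional


def process_shared(item: str, items: List[str] = []) -> List[str]:  # pylint: disable=dangerous-default-value
    """Anti-pattern: the default list is created once, when the function is defined."""
    items.append(item)
    return items


def process_safe(item: str, items: Optional[List[str]] = None) -> List[str]:
    """Safer variant: a fresh list is created on every call that omits `items`."""
    if items is None:
        items = []
    items.append(item)
    return items
```

Two calls to `process_shared` without an explicit list return the very same list object, so the second call sees the first call's item. `process_safe` returns an independent list each time.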
 
 ### Assertions
+
 - **Do not use `assert` statements in production code** - only in unit tests
   - **Reason**: Assertions can be disabled with Python's `-O` flag and should not be relied upon for runtime validation
 - **Use proper error handling instead**: Raise appropriate exceptions (ValueError, TypeError, etc.) for validation
   - **Good**: `if value is None: raise ValueError("Value cannot be None")`
   - **Bad**: `assert value is not None, "Value cannot be None"`
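
The **Good** one-liner above, expanded into a small function for illustration. `require_value` is a hypothetical helper name. Unlike an `assert`, the check still runs when Python is started with `-O`.

```python
from typing import Optional


def require_value(value: Optional[str]) -> str:
    """Validate input with an explicit exception instead of an assert."""
    if value is None:
        raise ValueError("Value cannot be None")
    return value
```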
+
+## Codebase Structure (`src/`)
+
+All paths below are relative to `src/`.
+
+### Core Service (`service/core/`) — Main FastAPI Microservice
+
+Entry point: `service/core/service.py`. Framework: FastAPI + Uvicorn + OpenTelemetry.
+
+| Submodule | Purpose |
+|-----------|---------|
+| `auth/` | JWT token lifecycle, access token CRUD, user management, role assignment |
+| `workflow/` | Workflow submit/list/cancel, resource quota, pool allocation, task coordination, credential management |
+| `config/` | Service/workflow/dataset configuration CRUD with versioning and history. Pod templates, resource validation rules, pool/backend config |
+| `data/` | Dataset/collection management, versioning with tags, multi-backend storage, streaming downloads |
+| `app/` | Workflow app lifecycle (create, version, rename, delete), YAML spec validation |
+| `profile/` | User profile/preferences, token identity, role/pool visibility |
+
+**Error types**: Defined in `lib/utils/` — see the `OSMOError` hierarchy for the full list.
+
+### Supporting Services
+
+| Service | Purpose |
+|---------|---------|
+| `service/router/` | Routes HTTP/WebSocket requests to backends. Sticky session routing. WebSocket endpoints for exec, portforward, rsync. |
+| `service/worker/` | Kombu-based Redis job queue consumer. Deduplicates jobs. Executes `FrontendJob` subclasses. |
+| `service/agent/` | Backend cluster integration via WebSocket. Receives node/pod/event/heartbeat streams from K8s clusters. |
+| `service/logger/` | Receives structured logs from osmo-ctrl containers. Persists task metrics to PostgreSQL. Distributed barriers via Redis. |
+| `service/delayed_job_monitor/` | Polls Redis for scheduled jobs, promotes to main queue when ready. |
+
+### Python Libraries (`lib/`)
+
+| Library | Key Classes | Purpose |
+|---------|-------------|---------|
+| `lib/data/storage/` | `Client`, `StorageBackend`, `ExecutorParameters`, `StoragePath` | Multi-cloud storage SDK (S3, Azure, GCS, Swift, TOS). Parallel multiprocess+multithread executor. Streaming upload/download. |
+| `lib/data/dataset/` | `Manager` | Dataset lifecycle (upload, download, migrate) built on storage SDK. |
+| `lib/utils/` | `LoginManager`, `ServiceClient`, `OSMOError` hierarchy | Client SDK for HTTP/WebSocket requests with JWT auth. Error types, logging, validation, credential management. |
+| `lib/rsync/` | `RsyncClient` | File watch-based rsync with debounce/reconciliation. Port forwarding for remote access. |
+
+### Python Utilities (`utils/`)
+
+| Module | Key Classes | Purpose |
+|--------|-------------|---------|
+| `utils/job/` | `Task`, `FrontendJob`, `K8sObjectFactory`, `PodGroupTopologyBuilder` | Workflow execution framework. Task → K8s spec generation. Gang scheduling via PodGroup. Topology constraints. Backend job definitions. |
+| `utils/connectors/` | `ClusterConnector`, `PostgresConnector`, `RedisConnector` | K8s API wrapper, PostgreSQL operations, Redis job queue management. |
+| `utils/secret_manager/` | `SecretManager` | JWE-based secret encryption/decryption. MEK/UEK key management. |
+| `utils/progress_check/` | | Liveness/progress tracking for long-running services. |
+| `utils/metrics/` | | Prometheus metrics collection and export. |
+
+### CLI (`cli/`)
+
+Entry point: `cli.py` → `main_parser.py` (argparse). Subcommand modules:
+
+
+| Module | Commands |
+| ------ | -------- |
+| `workflow.py` | submit, list, cancel, exec, logs |
+| `data.py` | upload, download, list, delete |
+| `dataset.py` | Dataset management |
+| `app.py` | App submission/management |
+| `config.py` | Service configuration |
+| `profile.py` | User profiles |
+| `login.py` | Authentication |
+| `pool.py`, `resources.py`, `user.py`, `credential.py`, `access_token.py`, `bucket.py`, `task.py`, `version.py` | Supporting commands |
+| `backend.py` | Backend cluster management |
+
+Features: Tab completion (shtab), response formatting (`formatters.py`), spec editor (`editor.py`), PyInstaller packaging (`cli_builder.py`, `packaging/`).
+
+### Go Runtime Containers (`runtime/`)
+
+| Binary | Purpose |
+|--------|---------|
+| `runtime/cmd/ctrl/` | **osmo_ctrl** — Orchestrates workflow execution. WebSocket to workflow service. Unix socket to osmo_user. Manages data download/upload, barriers for multi-task sync, port forwarding. |
+| `runtime/cmd/user/` | **osmo_user** — Executes user commands with PTY. Streams stdout/stderr to ctrl. Handles checkpointing (periodic uploads). |
+| `runtime/cmd/rsync/` | **osmo_rsync** — Rsync daemon with bandwidth limiting. |
+
+### Go Runtime Packages (`runtime/pkg/`)
+
+| Package | Purpose |
+|---------|---------|
+| `args/` | CLI flag parsing for ctrl and user containers. |
+| `messages/` | IPC message protocol between containers (exec lifecycle, log streaming, barriers). |
+| `common/` | Shared utilities: command execution, file operations, circular buffer. |
+| `data/` | Input/output data handling. Storage backend abstraction (S3, Swift, GCS, TOS). Mount/download/upload with retry and checkpointing. |
+| `metrics/` | Execution timing and data transfer metrics collection. |
+| `osmo_errors/` | Error handling with categorized exit codes and termination logging. |
+| `rsync/` | Rsync daemon subprocess management with monitoring. |
+
+### Go Utilities (`utils/` — Go)
+
+| Package | Purpose |
+|---------|---------|
+| `roles/` | Semantic RBAC. Actions like `workflow:Create`, `dataset:Read`. LRU cache with TTL. Role sync from IDP. Pool access evaluation. |
+| `postgres/` | PostgreSQL client with pgx connection pool and pgroll schema version support. |
+| `redis/` | Redis client with optional TLS. |
+| `logging/` | Structured slog handler compatible with Fluent Bit parsers. |
+| `env.go` | Environment variable helpers with YAML config file fallback. |
+
+### Authorization Sidecar (`service/authz_sidecar/`) — Go gRPC
+
+- Implements external authorization for the API gateway
+- Flow: Extract user/roles from request headers → sync roles from IDP → resolve role policies from cache/DB → evaluate semantic RBAC → return allow/deny with `x-osmo-user`, `x-osmo-roles`, `x-osmo-allowed-pools` headers
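
To make the "evaluate semantic RBAC → return allow/deny" step of the flow above concrete, here is a rough Python sketch. The real sidecar is a Go gRPC service, and everything below is an assumption made for illustration: the policy table, the glob-style action patterns, the comma-joined header value, and the `Decision` and `check` names. IDP role sync, the cache/DB lookup, and pool evaluation (`x-osmo-allowed-pools`) are deliberately left out.

```python
import dataclasses
import fnmatch
from typing import Dict, List

# Hypothetical policy table: role name -> action patterns that role may perform.
ROLE_POLICIES: Dict[str, List[str]] = {
    "viewer": ["workflow:Read", "dataset:Read"],
    "developer": ["workflow:*", "dataset:Read"],
}


@dataclasses.dataclass
class Decision:
    """Result of an authorization check, plus headers to forward upstream."""

    allowed: bool
    headers: Dict[str, str]


def check(user: str, roles: List[str], action: str) -> Decision:
    """Allow the request if any of the user's roles has a matching pattern."""
    allowed = any(
        fnmatch.fnmatchcase(action, pattern)
        for role in roles
        for pattern in ROLE_POLICIES.get(role, [])
    )
    headers: Dict[str, str] = {}
    if allowed:
        # Header names come from AGENTS.md; the value format is assumed.
        headers = {"x-osmo-user": user, "x-osmo-roles": ",".join(roles)}
    return Decision(allowed=allowed, headers=headers)
```

With this table, `check("alice", ["developer"], "workflow:Create")` is allowed because `workflow:*` matches, while the same action for a `viewer` is denied.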
+
+### Frontend (`ui/`)
+
+- **Framework**: Next.js (App Router, Turbopack) + React + TypeScript (see `package.json` for versions)
+- **Styling**: Tailwind CSS + shadcn/ui
+- **State**: TanStack Query (data fetching), Zustand (UI state), nuqs (URL state)
+- **Testing**: Vitest (unit), Playwright (E2E), MSW (API mocking)
+- **API layer**: OpenAPI-generated types (`lib/api/generated.ts` — DO NOT EDIT) + adapter layer (`lib/api/adapter/`) that bridges backend quirks to UI expectations
+- **Key routes**: pools, resources, workflows, datasets, occupancy, profile, log-viewer (under `app/(dashboard)/`)
+- **Import rules**: Absolute imports only (`@/...`), no barrel exports, API types from adapter (not generated)
+
+### Operator (`operator/`)
+
+- `backend_listener.py` — WebSocket listener for backend cluster status
+- `backend_worker.py` — Job execution engine for backend tasks
+- `backend_test_runner/` — Test orchestration for backend validation
+- `utils/node_validation_test/` — GPU validation (nvidia-smi, tflops benchmark, stuck pod detection)
+### Tests
+
+
+| Location | Framework | Scope |
+| -------- | --------- | ----- |
+| `tests/common/` | pytest + testcontainers | Shared fixtures: PostgreSQL (`database/`), S3/Swift/Redis storage (`storage/`), Docker network/registry (`core/`, `registry/`), Envoy TLS proxy (`envoy/`) |
+| `tests/common/database/testdata/schema.sql` | | Database schema definition (source of truth for test DB) |
+| `runtime/pkg/*_test.go` | Go testing + testcontainers-go | Runtime package unit/integration tests |
+| `utils/*_test.go` | Go testing | Go utility tests (roles, postgres, redis) |
+| `ui/src/**/*.test.ts` | Vitest | Frontend unit tests |
+| `ui/e2e/` | Playwright | Frontend E2E tests with page object models |
+
+### Key Architecture Patterns
+
+- **Container runtime**: Three container types per workflow — ctrl (orchestrator), user (execution), data (rsync sidecar)
+- **IPC**: WebSocket (ctrl↔workflow service), Unix sockets (ctrl↔user), gRPC (authz sidecar)
+- **Auth**: API gateway → authz_sidecar (semantic RBAC with `x-osmo-user`, `x-osmo-roles`, `x-osmo-allowed-pools` headers)
+- **Storage**: Multi-cloud abstraction (S3/Azure/GCS/Swift/TOS) with parallel multiprocess+multithread transfer
+- **Job queue**: Redis-backed Kombu queue with deduplication. Delayed jobs via Redis ZSET.
+- **Databases**: PostgreSQL (pgx), Redis (caching + job queue + event streams + barriers), pgroll for schema versioning
+- **Monitoring**: Prometheus + Grafana + Loki. OpenTelemetry instrumentation on FastAPI.
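
The "delayed jobs via Redis ZSET" and deduplication bullets above describe a common pattern: score each job by its ready time, then periodically move every job whose score is at or below the current time onto the main queue. The sketch below imitates that idea in memory with `heapq` instead of Redis, so it runs without a server. The class and method names are hypothetical and do not reflect how OSMO's `delayed_job_monitor` is implemented.

```python
import heapq
from typing import List, Set, Tuple


class DelayedQueue:
    """In-memory stand-in for a score-ordered set where score = ready time."""

    def __init__(self) -> None:
        self._scheduled: List[Tuple[float, str]] = []  # min-heap of (ready_at, job_id)
        self._known_job_ids: Set[str] = set()
        self.main_queue: List[str] = []

    def schedule(self, job_id: str, ready_at: float) -> bool:
        """Register a delayed job; duplicates of a known job_id are ignored."""
        if job_id in self._known_job_ids:
            return False
        self._known_job_ids.add(job_id)
        heapq.heappush(self._scheduled, (ready_at, job_id))
        return True

    def promote_ready(self, now: float) -> List[str]:
        """Move every job with ready_at <= now onto the main queue, earliest first."""
        promoted: List[str] = []
        while self._scheduled and self._scheduled[0][0] <= now:
            _, job_id = heapq.heappop(self._scheduled)
            self.main_queue.append(job_id)
            promoted.append(job_id)
        return promoted
```

A poller would call `promote_ready` with the current time on each tick. With Redis, the same steps are typically a range query on the sorted set by score, followed by removal and an enqueue.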
+
+### Inter-Service Communication
+
+```
+Client → API Gateway → authz_sidecar (gRPC Check) → Core Service (FastAPI)
+                                                    ├── PostgreSQL (state)
+                                                    ├── Redis (cache, job queue, events)
+                                                    ├── → Worker (job consumer)
+                                                    ├── ↔ Agent (WebSocket backend events)
+                                                    ├── ↔ Logger (WebSocket log streaming)
+                                                    ├── → Router (HTTP/WS request routing)
+                                                    └── → Delayed Job Monitor (scheduled jobs)
+
+Workflow Execution:
+  Core Service → K8s Backend → [osmo_ctrl ↔ osmo_user ↔ osmo_rsync]
+  osmo_ctrl ↔ Core Service (WebSocket)
+  osmo_ctrl → Logger (WebSocket logs/metrics)
+```
+
+### Build & Test
+
+- **Build system**: Bazel (`MODULE.bazel`, `.bazelrc`) — check `MODULE.bazel` for current version
+- **Python**: ruff linter (`.ruff.toml`, Google style) — check `MODULE.bazel` for Python version
+- **Go**: module `go.corp.nvidia.com/osmo` (single `go.mod` at `src/`) — check `go.mod` for Go version
+- **Frontend**: Next.js + pnpm, TypeScript strict mode, ESLint + Prettier — check `ui/package.json` for versions
+- **Tests**: Bazel test rules, pytest + testcontainers (Python), testcontainers-go (Go), Vitest + Playwright (frontend)
+- **Container images**: Built via `rules_oci` (amd64, arm64), distroless base from NVIDIA NGC
+- **API spec**: OpenAPI auto-generated from FastAPI via `bazel run //src/scripts:export_openapi`
+

CLAUDE.md

Lines changed: 7 additions & 0 deletions
@@ -0,0 +1,7 @@
+@AGENTS.md
+
+## Tool Usage Preferences
+
+- Use specialized tools (Read, Edit, Write, Grep, Glob) instead of Bash commands whenever possible
+- Bash tools require user intervention to allow and should only be used as a last resort
+- Prefer Read over cat, Edit over sed, Write over echo/heredoc, Grep over grep, and Glob over find
