Commit 1a2c61d

Merge pull request #110 from shiv-tyagi/sync-knowledge
Fix stale and incomplete information in CLAUDE.md and README.md
2 parents 6ff27cb + 508fd22 commit 1a2c61d

2 files changed

Lines changed: 14 additions & 12 deletions

CLAUDE.md

Lines changed: 11 additions & 9 deletions
@@ -18,15 +18,16 @@ Build takes ~30s on a modern machine. All crates build in one `cargo build`.
 cargo test
 ```

-314 tests, all must pass. No external services needed (no database, no network, no GPU). Tests are self-contained.
+All tests must pass. No external services needed (no database, no network, no GPU). Tests are self-contained.

 ## Project Layout

 ```
-proto/slurm.proto # Single protobuf file — all gRPC service definitions
+proto/slurm.proto # Public API: Slurm-compatible gRPC service definitions
+proto/raft_internal.proto # Internal: Raft consensus RPCs between controllers
 crates/
 spur-proto/ # Generated gRPC code (build.rs runs tonic-build)
-spur-core/ # Core types: Job, Node, ResourceSet, config, hostlist, WalOperation
+spur-core/ # Core types: Job, Node, ResourceSet, config, hostlist, WalOperation, partition, qos, step, topology, reservation, dependency, array, auth, accounting
 spur-net/ # WireGuard mesh networking, IP pool, address detection
 spur-sched/ # Backfill scheduler
 spurctld/ # Controller daemon (the brain)
@@ -36,13 +37,13 @@ crates/
 spur-cli/ # Multi-call CLI binary (spur, sbatch, squeue, etc.)
 spur-ffi/ # C FFI shim (libspur_compat.so)
 spur-spank/ # SPANK plugin host
-spur-k8s/ # K8s integration (stub)
+spur-k8s/ # K8s integration
 spur-tests/ # Integration test suite
 ```

 ## Key Architecture Decisions

-- **Single proto file**: All services defined in `proto/slurm.proto`. Controller is `SlurmController` (port 6817), agent is `SlurmAgent` (port 6818), accounting is `SlurmAccounting` (port 6819).
+- **Proto files**: `proto/slurm.proto` defines the public API — `SlurmController` (port 6817), `SlurmAgent` (port 6818), `SlurmAccounting` (port 6819). `proto/raft_internal.proto` is separate because Raft consensus is internal controller-to-controller plumbing, not part of the Slurm-compatible API surface that FFI and REST depend on.
 - **State**: Always-on Raft consensus (openraft) in `spurctld/src/raft.rs`. Even single-node deployments run a 1-member Raft cluster. The Raft log is the sole durable store; snapshots are JSON-serialized `ClusterSnapshot` blobs. Recovery happens via Raft log replay + snapshot restore.
 - **Scheduler**: Backfill scheduler in `spur-sched`. Runs every N seconds, assigns pending jobs to idle/mixed nodes.
 - **Job dispatch**: Controller dispatches `LaunchJobRequest` to ALL allocated nodes (not just the first). Each node gets `peer_nodes` list and `task_offset`.
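
The job-dispatch bullet at the end of the hunk above says that every allocated node receives a launch request carrying the peer list and a task offset. A minimal sketch of how such per-node requests might be assembled is shown here. The `LaunchPlan` struct, the `plan_launch` function, the uniform `tasks_per_node` layout, and the choice to include the receiving node in its own `peer_nodes` list are all assumptions made for illustration; the real `LaunchJobRequest` is a generated proto type whose fields are not shown in this commit.

```rust
/// Hypothetical stand-in for the per-node launch message
/// (NOT the generated `LaunchJobRequest` type).
#[derive(Debug, Clone, PartialEq)]
struct LaunchPlan {
    node: String,
    /// Assumption: the full allocation, including the receiving node.
    peer_nodes: Vec<String>,
    /// Index of the first task that runs on this node.
    task_offset: u32,
}

/// Build one plan per allocated node, not just the first.
/// Assumption: every node runs the same number of tasks.
fn plan_launch(nodes: &[&str], tasks_per_node: u32) -> Vec<LaunchPlan> {
    let peers: Vec<String> = nodes.iter().map(|n| n.to_string()).collect();
    nodes
        .iter()
        .enumerate()
        .map(|(i, node)| LaunchPlan {
            node: node.to_string(),
            peer_nodes: peers.clone(),
            task_offset: i as u32 * tasks_per_node,
        })
        .collect()
}

fn main() {
    for plan in plan_launch(&["n1", "n2", "n3"], 4) {
        println!(
            "{} -> task_offset {} peers {:?}",
            plan.node, plan.task_offset, plan.peer_nodes
        );
    }
}
```

With a per-node offset, each agent can number its local tasks globally without asking the controller again, which is one plausible reason to ship the offset in the launch message.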
@@ -62,7 +63,8 @@ crates/

 1. Create a new module in `crates/spur-cli/src/` (see `net.rs` as an example)
 2. Add `mod yourcommand;` to `crates/spur-cli/src/main.rs`
-3. Add dispatch in the `match args[1].as_str()` block
+3. Add symlink dispatch in the `match bin_name` block (for backward-compat invocation via argv[0])
+4. Add native dispatch in the `match args[1].as_str()` block (for `spur <command>` invocation)

 ### Adding a new config section

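
Steps 3 and 4 in the hunk above describe two dispatch paths in the multi-call binary: one keyed on the name the binary was invoked as (argv[0]) and one keyed on the first argument. A minimal sketch of that pattern follows. The `Command` enum, the `resolve` function, and the native subcommand names `submit` and `queue` are hypothetical; only the `sbatch` and `squeue` binary names and the two `match` targets come from the diff.

```rust
use std::path::Path;

/// Hypothetical command set used only for this illustration.
#[derive(Debug, PartialEq)]
enum Command {
    Submit,
    Queue,
    Unknown(String),
}

fn resolve(args: &[String]) -> Command {
    // Strip any directory prefix from argv[0] so that
    // "/usr/bin/sbatch" and "sbatch" behave the same.
    let bin_name = args
        .first()
        .and_then(|a| Path::new(a).file_name())
        .and_then(|n| n.to_str())
        .unwrap_or("spur");

    match bin_name {
        // Symlink dispatch: invoked under a Slurm-style name.
        "sbatch" => Command::Submit,
        "squeue" => Command::Queue,
        // Native dispatch: `spur <command>`.
        // The subcommand names below are assumptions.
        _ => match args.get(1).map(String::as_str) {
            Some("submit") => Command::Submit,
            Some("queue") => Command::Queue,
            Some(other) => Command::Unknown(other.to_string()),
            None => Command::Unknown(String::new()),
        },
    }
}

fn main() {
    let args: Vec<String> = std::env::args().collect();
    println!("{:?}", resolve(&args));
}
```

A new command that is wired into only one of the two `match` blocks would work under one invocation style and fail under the other, which is presumably why the updated instructions list both steps.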
@@ -85,14 +87,14 @@ crates/
 - Async runtime: tokio (full features).
 - Logging: `tracing` crate with `tracing-subscriber`.
 - Proto conversion: Each gRPC handler converts between proto types and core types. Conversion helpers live in the same file as the server (e.g., `server.rs` has `proto_to_job_spec`, `job_to_proto`, etc.).
-- Node state machine: `IdleMixedAllocated` based on resource usage; `Down`/`Drain`/`Error` are admin states that override.
-- Job state machine: `Pending → Running → Completing → Completed/Failed/Cancelled`. See `spur-core/src/job.rs`.
+- Node state machine: `Idle`/`Mixed`/`Allocated` based on resource usage; `Down`/`Drain`/`Draining`/`Error`/`Unknown`/`Suspended` are admin/system states that override.
+- Job state machine: `Pending → Running → Completing → Completed/Failed/Cancelled/Timeout/NodeFail/Preempted`. Jobs can also be `Suspended`. See `spur-core/src/job.rs`.

 ## Environment Variables

 | Variable | Used by | Description |
 |----------|---------|-------------|
-| `SPUR_CONTROLLER_ADDR` | CLI, spurd | Controller gRPC address (default: `http://localhost:6817`) |
+| `SPUR_CONTROLLER_ADDR` | CLI, spurd, spurrestd | Controller gRPC address (default: `http://localhost:6817`) |
 | `SPUR_WG_INTERFACE` | spurd | WireGuard interface name for address detection (default: `spur0`) |
 | `SPUR_PROLOG` | spurd | Script to run before each job |
 | `SPUR_EPILOG` | spurd | Script to run after each job |
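
The updated job state machine line in the hunk above widens the set of terminal states and mentions `Suspended`. A minimal sketch of that chain as a Rust enum follows. It is not the definition in `spur-core/src/job.rs`; the method names `is_terminal` and `can_transition_to` are hypothetical, and the `Running`/`Suspended` round trip is an assumption, since the diff says only that jobs can also be `Suspended`. Transitions the diff does not spell out, such as cancelling a pending job, are deliberately left out.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum JobState {
    Pending,
    Running,
    Suspended,
    Completing,
    Completed,
    Failed,
    Cancelled,
    Timeout,
    NodeFail,
    Preempted,
}

impl JobState {
    /// States listed after `Completing` in the documented chain.
    fn is_terminal(self) -> bool {
        matches!(
            self,
            JobState::Completed
                | JobState::Failed
                | JobState::Cancelled
                | JobState::Timeout
                | JobState::NodeFail
                | JobState::Preempted
        )
    }

    /// Allowed moves along the documented chain only.
    fn can_transition_to(self, next: JobState) -> bool {
        use JobState::*;
        match (self, next) {
            (Pending, Running) => true,
            (Running, Completing) => true,
            // Assumption: suspension pauses and resumes a running job.
            (Running, Suspended) | (Suspended, Running) => true,
            (Completing, n) => n.is_terminal(),
            _ => false,
        }
    }
}

fn main() {
    println!("{}", JobState::Pending.can_transition_to(JobState::Running));
    println!("{}", JobState::Completing.can_transition_to(JobState::Timeout));
    println!("{}", JobState::Completed.can_transition_to(JobState::Running));
}
```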

README.md

Lines changed: 3 additions & 3 deletions
@@ -278,7 +278,7 @@ spur/
 ├── docs/quickstart.md # Getting started guide
 ├── crates/
 │ ├── spur-proto/ # Generated gRPC code
-│ ├── spur-core/ # Job, Node, ResourceSet, config, hostlist, WalOperation
+│ ├── spur-core/ # Core types: Job, Node, ResourceSet, config, hostlist, partition, qos, and more
 │ ├── spur-net/ # WireGuard mesh networking, address detection
 │ ├── spur-sched/ # Backfill scheduler, priority, timeline
@@ -289,14 +289,14 @@ spur/
 │ ├── spur-cli/ # CLI binary (multi-call: spur, sbatch, squeue, ...)
 │ ├── spur-ffi/ # C FFI shim (libspur_compat.so)
 │ ├── spur-spank/ # SPANK plugin host
-│ ├── spur-k8s/ # K8s integration (post-MVP)
+│ ├── spur-k8s/ # K8s integration
 │ └── spur-tests/ # Test suite (mirrors Slurm numbering)
 ```

 ## Testing

 ```bash
-cargo test # Run all 314 tests
+cargo test # Run all tests
 cargo test -p spur-tests # Run integration test suite only
 cargo test -p spur-core # Run core library tests only
 ```
