|
| 1 | +# CLAUDE.md |
| 2 | +<!-- Generated by Claude Sonnet 4 --> |
| 3 | + |
| 4 | +This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository. |
| 5 | + |
| 6 | +## Development Commands |
| 7 | + |
| 8 | +### Building |
| 9 | +```bash |
| 10 | +# Build for current platform |
| 11 | +go build -tags containers_image_openpgp -ldflags="-w -s" ./... |
| 12 | + |
| 13 | +# Build for specific platforms (as used in CI) |
| 14 | +GOOS=linux GOARCH=amd64 CGO_ENABLED=0 go build -tags containers_image_openpgp -ldflags="-w -s" -o linux-amd64/ ./... |
| 15 | +GOOS=darwin GOARCH=arm64 CGO_ENABLED=0 go build -tags containers_image_openpgp -ldflags="-w -s" -o darwin-apple-silicon/ ./... |
| 16 | +``` |
| 17 | + |
| 18 | +### Testing |
| 19 | +```bash |
| 20 | +# Run the full test suite (requires Podman or Docker and a Kubernetes cluster) |
| 21 | +go test -tags containers_image_openpgp -race -json -v -coverprofile=coverage.out ./... |
| 22 | + |
| 23 | +# Generate coverage report |
| 24 | +go tool cover -func coverage.out |
| 25 | +``` |
| 26 | + |
| 27 | +### Code Quality |
| 28 | +```bash |
| 29 | +# Run security scanner |
| 30 | +gosec --exclude G402 ./... |
| 31 | + |
| 32 | +# Run static code analyzer |
| 33 | +staticcheck -checks all ./... |
| 34 | +``` |
| 35 | + |
| 36 | +### Dependencies |
| 37 | +- Requires Go 1.23.3+ |
| 38 | +- Requires either Podman or Docker to be installed as the container runtime |
| 39 | +- Tests require a Kubernetes cluster (kind is used in CI) |
| 40 | +- On Ubuntu: `sudo apt-get install podman libbtrfs-dev nodejs wamerican libgpgme-dev` |
| 41 | + |
| 42 | +## Architecture Overview |
| 43 | + |
| 44 | +### Core Components |
| 45 | + |
| 46 | +**Entry Point (main.go)** |
| 47 | +- Initializes configuration from `pkg/config/config.json` |
| 48 | +- Detects container runtime (Podman/Docker) via `utils.DetectContainerRuntime` |
| 49 | +- Creates scenario orchestrator and provider factory instances |
| 50 | +- Delegates to Cobra CLI commands in `cmd/` |
| 51 | + |
| 52 | +**Configuration System (pkg/config/)** |
| 53 | +- `config.json`: Central configuration with container registries, API endpoints, paths |
| 54 | +- `config.go`: Configuration struct and loading logic with embedded JSON file |
| 55 | + |
| 56 | +**CLI Commands (cmd/)** |
| 57 | +- `root.go`: Main command structure with subcommands and global flags |
| 58 | +- Individual command files: `run.go`, `list.go`, `describe.go`, `clean.go`, etc. |
| 59 | +- Support for private registry authentication via flags or environment variables |
| 60 | + |
| 61 | +**Provider System (pkg/provider/)** |
| 62 | +- `factory/`: Factory pattern for different container registry providers |
| 63 | +- `quay/`: Quay.io registry implementation |
| 64 | +- `registryv2/`: Generic Docker Registry v2 API support |
| 65 | +- `models/`: Data structures for registry interactions |
| 66 | + |
| 67 | +**Scenario Orchestrator (pkg/scenarioorchestrator/)** |
| 68 | +- Abstracts container runtime operations (Podman/Docker) |
| 69 | +- `podman/`: Podman-specific implementation |
| 70 | +- Manages chaos scenario container lifecycle |
| 71 | + |
| 72 | +**Utility Packages** |
| 73 | +- `pkg/utils/`: Common utilities and helpers |
| 74 | +- `pkg/typing/`: Type definitions and validation |
| 75 | +- `pkg/dependencygraph/`: Dependency graph management for scenario workflows |
| 76 | +- `pkg/randomgraph/`: Random scenario generation |
| 77 | + |
| 78 | +### Key Features |
| 79 | + |
| 80 | +**Scenario Management** |
| 81 | +- List available chaos scenarios from container registries |
| 82 | +- Describe scenario details and input requirements |
| 83 | +- Run individual scenarios or orchestrated workflows |
| 84 | +- Support for detached execution mode |
| 85 | + |
| 86 | +**Graph Workflows** |
| 87 | +- Define dependency graphs of chaos scenarios in JSON format |
| 88 | +- Execute scenarios in dependency order |
| 89 | +- Support for parallel execution where dependencies allow |
| 90 | +- Scaffold new workflow templates |
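To illustrate what "dependency order" means for a graph workflow, here is a minimal shell sketch built on the standard `tsort` utility. It is not krknctl's implementation (which lives in `pkg/dependencygraph/`), and the edge-list input is an assumption made for the example; the real workflow files are JSON and their schema is not shown here. The scenario names are placeholders.

```bash
#!/usr/bin/env bash
# Hypothetical sketch: derive an execution order from dependency pairs.
# Each input line is "<dependency> <dependent>", meaning the first
# scenario must finish before the second one starts.
execution_order() {
  # tsort prints one valid topological order, one node per line,
  # and reports cycles on stderr.
  tsort
}

# Example: scenario-b depends on scenario-a, scenario-c depends on scenario-b.
#   printf 'scenario-a scenario-b\nscenario-b scenario-c\n' | execution_order
```

Scenarios with no path between them in the graph are the ones that could run in parallel; this sketch only produces a single linear order.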
| 91 | + |
| 92 | +**Random Testing** |
| 93 | +- Generate random scenario execution plans |
| 94 | +- Control parallelism and scenario count |
| 95 | +- Use seed files for template-based random generation |
| 96 | + |
| 97 | +**Private Registry Support** |
| 98 | +- Basic authentication and token-based authentication |
| 99 | +- Custom domain support beyond quay.io |
| 100 | +- TLS configuration options |
| 101 | + |
| 102 | +### Container Runtime Integration |
| 103 | + |
| 104 | +The tool auto-detects and supports both Podman and Docker: |
| 105 | +- Podman: Uses socket communication via `unix://` sockets |
| 106 | +- Docker: Standard Docker socket integration |
| 107 | +- Platform detection for Darwin vs Linux socket paths |
| 108 | +- Graceful fallback between runtimes |
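The fallback between runtimes can be pictured with the following shell sketch. It is only an approximation: the source says the real detection happens in `utils.DetectContainerRuntime` and works through sockets, whereas this example simply looks for binaries on `PATH`. The function name and the Podman-first ordering are assumptions made for illustration.

```bash
#!/usr/bin/env bash
# Hypothetical sketch: pick the first available container runtime.
# With no arguments it tries podman, then docker (assumed order).
detect_container_runtime() {
  if [ "$#" -eq 0 ]; then
    set -- podman docker
  fi
  local candidate
  for candidate in "$@"; do
    # command -v succeeds only if the name resolves to something runnable
    if command -v "$candidate" >/dev/null 2>&1; then
      echo "$candidate"
      return 0
    fi
  done
  echo "no container runtime found (tried: $*)" >&2
  return 1
}

# Example:
#   runtime="$(detect_container_runtime)" || exit 1
```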
| 109 | + |
| 110 | +### Configuration Patterns |
| 111 | + |
| 112 | +- Global configuration embedded in binary via `go:embed` |
| 113 | +- Runtime configuration via CLI flags and environment variables |
| 114 | +- Kubeconfig path resolution for Kubernetes integration |
| 115 | +- Custom alerts and metrics profile support |
| 116 | + |
| 117 | +## Lightspeed AI-Powered Assistance |
| 118 | + |
| 119 | +### Overview |
| 120 | +Lightspeed is krknctl's AI-powered chaos engineering assistance feature that provides intelligent command suggestions and documentation search using Retrieval-Augmented Generation (RAG) with GPU acceleration. |
| 121 | + |
| 122 | +### Major Implementation Tasks Completed |
| 123 | + |
| 124 | +#### 1. GPU Detection System Redesign |
| 125 | +**Previous System**: Complex container-based GPU detection using test images |
| 126 | +- Removed complex GPU check implementation using container images |
| 127 | +- Eliminated `GetSupportedGPUTypes()` and container-based testing approach |
| 128 | + |
| 129 | +**New System**: Platform-based automatic detection |
| 130 | +- **macOS arm64**: Automatically assumes Apple Silicon GPU support (Metal via libkrun) |
| 131 | +- **Linux with NVIDIA devices**: Detects physical NVIDIA devices (`/dev/nvidia0`, `/dev/nvidiactl`, `/dev/nvidia-uvm`) |
| 132 | +- **Generic fallback**: CPU-only mode for all other platforms |
| 133 | +- Added `--no-gpu` flag to force CPU-only mode without device mounting |
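The detection order described above can be sketched in shell as follows. This is an illustration, not the Go code in `pkg/gpucheck/gpucheck.go`. The function name, its arguments, and the choice to require all three NVIDIA device nodes (rather than any one of them) are assumptions. The returned labels mirror the container image suffixes used in this document.

```bash
#!/usr/bin/env bash
# Hypothetical sketch of platform-based GPU detection.
# Arguments default to the live system, but can be overridden so the
# logic can be exercised against a fake device directory.
#   $1 = OS name (uname -s), $2 = CPU arch (uname -m),
#   $3 = device directory, $4 = "true" to emulate --no-gpu
detect_gpu_type() {
  local os="${1:-$(uname -s)}"
  local arch="${2:-$(uname -m)}"
  local dev_dir="${3:-/dev}"
  local no_gpu="${4:-false}"

  # --no-gpu forces CPU-only mode regardless of hardware
  if [ "$no_gpu" = "true" ]; then
    echo "generic"
    return 0
  fi

  # macOS on arm64: assume Apple Silicon GPU support
  if [ "$os" = "Darwin" ] && [ "$arch" = "arm64" ]; then
    echo "apple-silicon"
    return 0
  fi

  # Linux: look for the NVIDIA device nodes (assumption: all must exist)
  if [ "$os" = "Linux" ] \
    && [ -e "$dev_dir/nvidia0" ] \
    && [ -e "$dev_dir/nvidiactl" ] \
    && [ -e "$dev_dir/nvidia-uvm" ]; then
    echo "nvidia"
    return 0
  fi

  # Everything else falls back to CPU-only
  echo "generic"
}
```

Calling `detect_gpu_type` with no arguments inspects the current machine; passing explicit values makes each branch easy to check in isolation.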
| 134 | + |
| 135 | +#### 2. Container Runtime Support |
| 136 | +- **Podman Only**: Lightspeed supports only the Podman container runtime |
| 137 | +- **Docker Blocking**: Commands fail gracefully with helpful error messages when Docker is detected |
| 138 | +- **Error Handling**: Provides links to Podman GPU documentation (https://podman-desktop.io/docs/podman/gpu) |
| 139 | + |
| 140 | +#### 3. Container Architecture |
| 141 | +**Three Specialized Containers**: |
| 142 | +- **Apple Silicon** (`rag-model-apple-silicon`): Vulkan backend for Apple M1/M2/M3/M4 GPUs |
| 143 | +- **NVIDIA** (`rag-model-nvidia`): CUDA backend for NVIDIA GPUs |
| 144 | +- **Generic** (`rag-model-generic`): CPU-only fallback for all other platforms |
| 145 | + |
| 146 | +**Container Selection Logic**: |
| 147 | +- Uses `PlatformGPUDetector.GetLightspeedImageURI()` to select appropriate container |
| 148 | +- Tag construction follows `{rag_model_tag}-{architecture}` pattern from config |
| 149 | +- Device mounting handled by `PlatformGPUDetector.GetDeviceMounts()` |
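As a small illustration of the `{rag_model_tag}-{architecture}` pattern, the shell sketch below joins the two parts. The function name is hypothetical, and the example value `rag-model` for `rag_model_tag` is an assumption chosen because it would reproduce the image names listed above; the actual value lives in `pkg/config/config.json`.

```bash
#!/usr/bin/env bash
# Hypothetical sketch: build a Lightspeed image tag from its two parts.
#   $1 = rag_model_tag value from the config (assumed example: rag-model)
#   $2 = architecture label (apple-silicon, nvidia, or generic)
lightspeed_image_tag() {
  local rag_model_tag="${1:-}"
  local architecture="${2:-}"
  if [ -z "$rag_model_tag" ] || [ -z "$architecture" ]; then
    echo "usage: lightspeed_image_tag <rag_model_tag> <architecture>" >&2
    return 1
  fi
  printf '%s-%s\n' "$rag_model_tag" "$architecture"
}

# Example:
#   lightspeed_image_tag rag-model nvidia
```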
| 150 | + |
| 151 | +#### 4. Configuration Integration |
| 152 | +- **Config-Based Tags**: Uses `rag_model_tag` from `pkg/config/config.json` to construct container tags |
| 153 | +- **Centralized Settings**: All RAG service parameters (ports, endpoints, timeouts) in configuration |
| 154 | +- **Private Registry Support**: Full integration with existing private registry authentication |
| 155 | + |
| 156 | +#### 5. Multi-Stage Container Build Fix |
| 157 | +**Problem**: Documentation indexing failed in builder stage of multi-stage builds |
| 158 | +- **Root Cause**: Git and Python dependencies not fully available during builder stage |
| 159 | +- **Solution**: Moved documentation indexing from builder stage to runtime stage |
| 160 | +- **Impact**: Fixed the NVIDIA and Generic containers (the single-stage Apple Silicon build already worked) |
| 161 | + |
| 162 | +**Fixed Containers**: |
| 163 | +- **NVIDIA** (`Containerfile.nvidia`): Multi-stage build with runtime indexing |
| 164 | +- **Generic** (`Containerfile.generic`): Multi-stage build with runtime indexing |
| 165 | +- **Apple Silicon** (`Containerfile.apple-silicon`): Single-stage build (already working) |
| 166 | + |
| 167 | +#### 6. Documentation Indexing System |
| 168 | +**Sources Indexed**: |
| 169 | +- Local krknctl help documentation |
| 170 | +- Live krkn-chaos/website repository (chaos engineering guides) |
| 171 | +- Live krkn-chaos/krkn-hub repository (scenario definitions) |
| 172 | + |
| 173 | +**Indexing Process**: |
| 174 | +- **Build Time**: Creates cached indices for offline/airgapped environments |
| 175 | +- **Runtime**: Can rebuild indices with fresh documentation or use cached versions |
| 176 | +- **Verification**: Automatic validation of indexed document sources and counts |
| 177 | + |
| 178 | +#### 7. User Experience Improvements |
| 179 | +**Progress Feedback**: |
| 180 | +- Spinner with dynamic progress messages during container image pulls |
| 181 | +- Real-time feedback during RAG model deployment |
| 182 | +- Health checking with automatic retry and timeout handling |
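The retry-and-timeout behavior of the health check can be sketched as a generic shell loop. This is not krknctl's code: the function name and argument order are hypothetical, and the endpoint in the usage comment is a placeholder because the real port and path come from the configuration.

```bash
#!/usr/bin/env bash
# Hypothetical sketch: retry a health probe until it succeeds or the
# attempt budget runs out.
#   $1 = maximum attempts, $2 = seconds to wait between attempts,
#   remaining arguments = the probe command to run
wait_until_healthy() {
  local max_attempts="$1"
  local delay_seconds="$2"
  shift 2

  local attempt=1
  while [ "$attempt" -le "$max_attempts" ]; do
    # The probe's exit status decides health; its output is discarded
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    attempt=$((attempt + 1))
    sleep "$delay_seconds"
  done
  return 1
}

# Example (placeholder URL, not the real endpoint):
#   wait_until_healthy 30 2 curl -fsS http://localhost:8080/health
```

The total timeout is roughly `max_attempts * delay_seconds`, plus however long each probe takes.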
| 183 | + |
| 184 | +**Error Handling**: |
| 185 | +- Platform-specific error messages with actionable solutions |
| 186 | +- Automatic fallback from live indexing to cached documentation |
| 187 | +- Container cleanup on deployment failures |
| 188 | + |
| 189 | +### Technical Implementation |
| 190 | + |
| 191 | +#### Core Components |
| 192 | +- **`pkg/gpucheck/gpucheck.go`**: Platform-based GPU detection logic |
| 193 | +- **`cmd/lightspeed_check.go`**: Lightspeed commands with Docker runtime blocking |
| 194 | +- **`cmd/lightspeed.go`**: RAG model deployment with GPU-specific container selection |
| 195 | +- **`pkg/config/config.go`**: Enhanced with Lightspeed-specific configuration methods |
| 196 | + |
| 197 | +#### Container Files |
| 198 | +- **`containers/lightspeed-rag/Containerfile.apple-silicon`**: Single-stage Vulkan build |
| 199 | +- **`containers/lightspeed-rag/Containerfile.nvidia`**: Multi-stage CUDA build |
| 200 | +- **`containers/lightspeed-rag/Containerfile.generic`**: Multi-stage CPU-only build |
| 201 | + |
| 202 | +#### Key Functions |
| 203 | +- **`DetectGPUAcceleration()`**: Platform-based GPU type detection |
| 204 | +- **`deployRAGModelWithGPUType()`**: GPU-aware container deployment |
| 205 | +- **`HandleContainerError()`**: Enhanced error reporting with helpful suggestions |
| 206 | + |
| 207 | +### Usage Examples |
| 208 | + |
| 209 | +```bash |
| 210 | +# Automatic GPU detection and deployment |
| 211 | +krknctl lightspeed check |
| 212 | + |
| 213 | +# AI-powered assistance with auto-detected GPU |
| 214 | +krknctl lightspeed run |
| 215 | + |
| 216 | +# Force CPU-only mode (no GPU acceleration) |
| 217 | +krknctl lightspeed run --no-gpu |
| 218 | + |
| 219 | +# Offline mode for airgapped environments |
| 220 | +krknctl lightspeed run --offline |
| 221 | +``` |
| 222 | + |
| 223 | +### Development Notes |
| 224 | +- **Testing**: Updated test suite to use new `PlatformGPUDetector` API |
| 225 | +- **Backwards Compatibility**: Maintains existing CLI interface while simplifying internals |
| 226 | +- **Build System**: All containers build successfully with proper documentation indexing |
| 227 | +- **Error Recovery**: Graceful degradation when GPU features are unavailable |