Commit 7adf9ed

feat: Complete Lightspeed integration and documentation
- Add comprehensive Lightspeed documentation to CLAUDE.md
- Update vendor dependencies for new packages
- Integrate Lightspeed commands with existing CLI structure
- Add configuration tests and utility functions
- Update scenario orchestrator for container port mapping support

Documentation:
- Complete implementation guide in CLAUDE.md
- All major tasks and technical details documented
- Usage examples and development notes included

Integration:
- Full integration with existing krknctl architecture
- Maintains backward compatibility
- Follows established patterns and conventions

Dependencies:
- Updated vendor modules for testing and mock frameworks
- Added necessary packages for Lightspeed functionality
- Clean integration without breaking existing features
1 parent 19842d0 commit 7adf9ed

116 files changed, +8534 -1437 lines changed

.gitignore

Lines changed: 12 additions & 0 deletions
@@ -30,5 +30,17 @@ krknctl-*
 .env
 .idea
 bin/
+bin-linux/
+
+# Python
+__pycache__/
+*.py[cod]
+*$py.class
+*.so
+.Python
+venv/
+env/
+ENV/
+*.egg-info/
 
 .DS_Store

CLAUDE.md

Lines changed: 227 additions & 0 deletions
@@ -0,0 +1,227 @@
+# CLAUDE.md
+<!-- Generated by Claude Sonnet 4 -->
+
+This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
+
+## Development Commands
+
+### Building
+```bash
+# Build for current platform
+go build -tags containers_image_openpgp -ldflags="-w -s" ./...
+
+# Build for specific platforms (as used in CI)
+GOOS=linux GOARCH=amd64 CGO_ENABLED=0 go build -tags containers_image_openpgp -ldflags="-w -s" -o linux-amd64/ ./...
+GOOS=darwin GOARCH=arm64 CGO_ENABLED=0 go build -tags containers_image_openpgp -ldflags="-w -s" -o darwin-apple-silicon/ ./...
+```
+
+### Testing
+```bash
+# Run full test suite (requires podman/docker and Kubernetes cluster)
+go test -tags containers_image_openpgp -race -json -v -coverprofile=coverage.out ./...
+
+# Generate coverage report
+go tool cover -func coverage.out
+```
+
+### Code Quality
+```bash
+# Run security scanner
+gosec --exclude G402 ./...
+
+# Run static code analyzer
+staticcheck -checks all ./...
+```
+
+### Dependencies
+- Requires Go 1.23.3+
+- Requires either Podman or Docker runtime installed
+- Tests require a Kubernetes cluster (kind is used in CI)
+- On Ubuntu: `sudo apt-get install podman libbtrfs-dev nodejs wamerican libgpgme-dev`
+
+## Architecture Overview
+
+### Core Components
+
+**Entry Point (main.go)**
+- Initializes configuration from `pkg/config/config.json`
+- Detects container runtime (Podman/Docker) via `utils.DetectContainerRuntime`
+- Creates scenario orchestrator and provider factory instances
+- Delegates to Cobra CLI commands in `cmd/`
+
+**Configuration System (pkg/config/)**
+- `config.json`: Central configuration with container registries, API endpoints, paths
+- `config.go`: Configuration struct and loading logic with embedded JSON file
+
+**CLI Commands (cmd/)**
+- `root.go`: Main command structure with subcommands and global flags
+- Individual command files: `run.go`, `list.go`, `describe.go`, `clean.go`, etc.
+- Support for private registry authentication via flags or environment variables
+
+**Provider System (pkg/provider/)**
+- `factory/`: Factory pattern for different container registry providers
+- `quay/`: Quay.io registry implementation
+- `registryv2/`: Generic Docker Registry v2 API support
+- `models/`: Data structures for registry interactions
+
+**Scenario Orchestrator (pkg/scenarioorchestrator/)**
+- Abstracts container runtime operations (Podman/Docker)
+- `podman/`: Podman-specific implementation
+- Manages chaos scenario container lifecycle
+
+**Utility Packages**
+- `pkg/utils/`: Common utilities and helpers
+- `pkg/typing/`: Type definitions and validation
+- `pkg/dependencygraph/`: Dependency graph management for scenario workflows
+- `pkg/randomgraph/`: Random scenario generation
+
+### Key Features
+
+**Scenario Management**
+- List available chaos scenarios from container registries
+- Describe scenario details and input requirements
+- Run individual scenarios or orchestrated workflows
+- Support for detached execution mode
+
+**Graph Workflows**
+- Define dependency graphs of chaos scenarios in JSON format
+- Execute scenarios in dependency order
+- Support for parallel execution where dependencies allow
+- Scaffold new workflow templates
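
Editor's sketch (not part of the commit): the "dependency order" and "parallel execution where dependencies allow" bullets above can be illustrated with a layered topological sort. This is a minimal sketch using Kahn's algorithm, not the code in `pkg/dependencygraph/`. The function name `executionLayers`, the map-based input (a stand-in for the JSON graph format, which is not shown here), and the scenario names are all assumptions made for illustration.

```go
package main

import (
	"errors"
	"fmt"
	"sort"
)

// executionLayers groups scenarios into layers using Kahn's algorithm.
// deps maps each scenario ID to the IDs it depends on. Every scenario in a
// layer depends only on scenarios from earlier layers, so the members of one
// layer could be started in parallel. (Hypothetical helper, for illustration.)
func executionLayers(deps map[string][]string) ([][]string, error) {
	remaining := make(map[string]int, len(deps)) // unmet dependency count per scenario
	dependents := make(map[string][]string)      // reverse edges: parent -> children
	for node, parents := range deps {
		remaining[node] = len(parents)
		for _, p := range parents {
			if _, ok := deps[p]; !ok {
				return nil, fmt.Errorf("%q depends on unknown scenario %q", node, p)
			}
			dependents[p] = append(dependents[p], node)
		}
	}

	var layers [][]string
	for done := 0; done < len(deps); {
		// Collect every scenario whose dependencies have all completed.
		var layer []string
		for node, n := range remaining {
			if n == 0 {
				layer = append(layer, node)
			}
		}
		if len(layer) == 0 {
			return nil, errors.New("dependency cycle detected")
		}
		sort.Strings(layer) // deterministic order within a layer
		for _, node := range layer {
			delete(remaining, node)
			for _, child := range dependents[node] {
				remaining[child]--
			}
		}
		layers = append(layers, layer)
		done += len(layer)
	}
	return layers, nil
}

func main() {
	// Hypothetical workflow: two independent scenarios feeding a third.
	deps := map[string][]string{
		"node-cpu-hog": {},
		"pod-kill":     {},
		"app-outage":   {"node-cpu-hog", "pod-kill"},
	}
	layers, err := executionLayers(deps)
	if err != nil {
		panic(err)
	}
	fmt.Println(layers)
}
```

Returning layers rather than a flat order keeps the parallelism decision with the caller: a runner can start each layer's scenarios concurrently and wait for them before moving on.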
+
+**Random Testing**
+- Generate random scenario execution plans
+- Control parallelism and scenario count
+- Use seed files for template-based random generation
+
+**Private Registry Support**
+- Basic authentication and token-based authentication
+- Custom domain support beyond quay.io
+- TLS configuration options
+
+### Container Runtime Integration
+
+The tool auto-detects and supports both Podman and Docker:
+- Podman: Uses socket communication via `unix://` sockets
+- Docker: Standard Docker socket integration
+- Platform detection for Darwin vs Linux socket paths
+- Graceful fallback between runtimes
+
+### Configuration Patterns
+
+- Global configuration embedded in binary via `go:embed`
+- Runtime configuration via CLI flags and environment variables
+- Kubeconfig path resolution for Kubernetes integration
+- Custom alerts and metrics profile support
+
+## Lightspeed AI-Powered Assistance
+
+### Overview
+Lightspeed is krknctl's AI-powered chaos engineering assistance feature that provides intelligent command suggestions and documentation search using Retrieval-Augmented Generation (RAG) with GPU acceleration.
+
+### Major Implementation Tasks Completed
+
+#### 1. GPU Detection System Redesign
+**Previous System**: Complex container-based GPU detection using test images
+- Removed complex GPU check implementation using container images
+- Eliminated `GetSupportedGPUTypes()` and container-based testing approach
+
+**New System**: Platform-based automatic detection
+- **macOS arm64**: Automatically assumes Apple Silicon GPU support (Metal via libkrun)
+- **Linux with NVIDIA devices**: Detects physical NVIDIA devices (`/dev/nvidia0`, `/dev/nvidiactl`, `/dev/nvidia-uvm`)
+- **Generic fallback**: CPU-only mode for all other platforms
+- Added `--no-gpu` flag to force CPU-only mode without device mounting
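
Editor's sketch (not part of the commit): the platform-based rules above could look roughly like the following. This is not the code in `pkg/gpucheck/gpucheck.go`. The function `detectGPU`, the constant names, and the injected `exists` probe are hypothetical. The sketch also assumes that finding any one of the listed device nodes is enough to select NVIDIA, which the documentation does not state.

```go
package main

import (
	"fmt"
	"os"
	"runtime"
)

// GPU type labels used by this sketch; the names are illustrative only.
const (
	gpuAppleSilicon = "apple-silicon"
	gpuNvidia       = "nvidia"
	gpuGeneric      = "generic"
)

// nvidiaDevices are the device nodes named in the documentation above.
var nvidiaDevices = []string{"/dev/nvidia0", "/dev/nvidiactl", "/dev/nvidia-uvm"}

// detectGPU applies the documented rules in order:
//  1. --no-gpu forces CPU-only mode,
//  2. macOS on arm64 is assumed to have Apple Silicon GPU support,
//  3. Linux with an NVIDIA device node present selects NVIDIA
//     (assumption: one matching node is sufficient),
//  4. everything else falls back to CPU-only.
//
// exists is injected so the logic can be exercised without real devices.
func detectGPU(goos, goarch string, noGPU bool, exists func(string) bool) string {
	if noGPU {
		return gpuGeneric
	}
	if goos == "darwin" && goarch == "arm64" {
		return gpuAppleSilicon
	}
	if goos == "linux" {
		for _, dev := range nvidiaDevices {
			if exists(dev) {
				return gpuNvidia
			}
		}
	}
	return gpuGeneric
}

// fileExists reports whether path can be stat'ed on the local filesystem.
func fileExists(path string) bool {
	_, err := os.Stat(path)
	return err == nil
}

func main() {
	// Detect for the machine this program is running on.
	fmt.Println(detectGPU(runtime.GOOS, runtime.GOARCH, false, fileExists))
}
```

Passing the OS, architecture, and filesystem probe as parameters keeps the decision a pure function, so each branch can be tested on any machine.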
+
+#### 2. Container Runtime Support
+- **Podman Only**: Lightspeed exclusively supports Podman container runtime
+- **Docker Blocking**: Commands fail gracefully with helpful error messages when Docker is detected
+- **Error Handling**: Provides links to Podman GPU documentation (https://podman-desktop.io/docs/podman/gpu)
+
+#### 3. Container Architecture
+**Three Specialized Containers**:
+- **Apple Silicon** (`rag-model-apple-silicon`): Vulkan backend for Apple M1/M2/M3/M4 GPUs
+- **NVIDIA** (`rag-model-nvidia`): CUDA backend for NVIDIA GPUs
+- **Generic** (`rag-model-generic`): CPU-only fallback for all other platforms
+
+**Container Selection Logic**:
+- Uses `PlatformGPUDetector.GetLightspeedImageURI()` to select appropriate container
+- Tag construction follows `{rag_model_tag}-{architecture}` pattern from config
+- Device mounting handled by `PlatformGPUDetector.GetDeviceMounts()`
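
Editor's sketch (not part of the commit): the `{rag_model_tag}-{architecture}` pattern above can be read as plain string composition. The helper below is hypothetical and is not `PlatformGPUDetector.GetLightspeedImageURI()`. The repository path and tag value are placeholders, and joining repository and tag with a colon is assumed from the usual image reference convention.

```go
package main

import "fmt"

// lightspeedImageURI builds "<repository>:<rag_model_tag>-<architecture>".
// All three inputs are assumptions of this sketch: in krknctl the tag is said
// to come from pkg/config/config.json and the architecture from GPU detection.
func lightspeedImageURI(repository, ragModelTag, architecture string) string {
	return fmt.Sprintf("%s:%s-%s", repository, ragModelTag, architecture)
}

func main() {
	// Placeholder repository and tag, chosen only to show the resulting shape.
	for _, arch := range []string{"apple-silicon", "nvidia", "generic"} {
		fmt.Println(lightspeedImageURI("registry.example.com/org/rag-model", "v1", arch))
	}
}
```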
+
+#### 4. Configuration Integration
+- **Config-Based Tags**: Uses `rag_model_tag` from `pkg/config/config.json` to construct container tags
+- **Centralized Settings**: All RAG service parameters (ports, endpoints, timeouts) in configuration
+- **Private Registry Support**: Full integration with existing private registry authentication
+
+#### 5. Multi-Stage Container Build Fix
+**Problem**: Documentation indexing failed in builder stage of multi-stage builds
+- **Root Cause**: Git and Python dependencies not fully available during builder stage
+- **Solution**: Moved documentation indexing from builder stage to runtime stage
+- **Impact**: Fixed NVIDIA and Generic containers (Apple single-stage already worked)
+
+**Fixed Containers**:
+- **NVIDIA** (`Containerfile.nvidia`): Multi-stage build with runtime indexing
+- **Generic** (`Containerfile.generic`): Multi-stage build with runtime indexing
+- **Apple Silicon** (`Containerfile.apple-silicon`): Single-stage build (already working)
+
+#### 6. Documentation Indexing System
+**Sources Indexed**:
+- Local krknctl help documentation
+- Live krkn-chaos/website repository (chaos engineering guides)
+- Live krkn-chaos/krkn-hub repository (scenario definitions)
+
+**Indexing Process**:
+- **Build Time**: Creates cached indices for offline/airgapped environments
+- **Runtime**: Can rebuild indices with fresh documentation or use cached versions
+- **Verification**: Automatic validation of indexed document sources and counts
+
+#### 7. User Experience Improvements
+**Progress Feedback**:
+- Spinner with dynamic progress messages during container image pulls
+- Real-time feedback during RAG model deployment
+- Health checking with automatic retry and timeout handling
+
+**Error Handling**:
+- Platform-specific error messages with actionable solutions
+- Automatic fallback from live indexing to cached documentation
+- Container cleanup on deployment failures
+
+### Technical Implementation
+
+#### Core Components
+- **`pkg/gpucheck/gpucheck.go`**: Platform-based GPU detection logic
+- **`cmd/lightspeed_check.go`**: Lightspeed commands with Docker runtime blocking
+- **`cmd/lightspeed.go`**: RAG model deployment with GPU-specific container selection
+- **`pkg/config/config.go`**: Enhanced with Lightspeed-specific configuration methods
+
+#### Container Files
+- **`containers/lightspeed-rag/Containerfile.apple-silicon`**: Single-stage Vulkan build
+- **`containers/lightspeed-rag/Containerfile.nvidia`**: Multi-stage CUDA build
+- **`containers/lightspeed-rag/Containerfile.generic`**: Multi-stage CPU-only build
+
+#### Key Functions
+- **`DetectGPUAcceleration()`**: Platform-based GPU type detection
+- **`deployRAGModelWithGPUType()`**: GPU-aware container deployment
+- **`HandleContainerError()`**: Enhanced error reporting with helpful suggestions
+
+### Usage Examples
+
+```bash
+# Automatic GPU detection and deployment
+krknctl lightspeed check
+
+# AI-powered assistance with auto-detected GPU
+krknctl lightspeed run
+
+# Force CPU-only mode (no GPU acceleration)
+krknctl lightspeed run --no-gpu
+
+# Offline mode for airgapped environments
+krknctl lightspeed run --offline
+```
+
+### Development Notes
+- **Testing**: Updated test suite to use new `PlatformGPUDetector` API
+- **Backwards Compatibility**: Maintains existing CLI interface while simplifying internals
+- **Build System**: All containers build successfully with proper documentation indexing
+- **Error Recovery**: Graceful degradation when GPU features are unavailable

cmd/run.go

Lines changed: 2 additions & 2 deletions
@@ -313,7 +313,7 @@ func NewRunCommand(factory *factory.ProviderFactory, scenarioOrchestrator *scena
 spinner.Stop()
 }()
 
-_, err = (*scenarioOrchestrator).RunAttached(quayImageURI+":"+scenarioDetail.Name, containerName, environment, false, volumes, os.Stdout, os.Stderr, &commChan, conn, registrySettings)
+_, err = (*scenarioOrchestrator).RunAttached(quayImageURI+":"+scenarioDetail.Name, containerName, environment, false, volumes, nil, os.Stdout, os.Stderr, &commChan, conn, registrySettings)
 if err != nil {
 var staterr *utils.ExitError
 if errors.As(err, &staterr) {
@@ -324,7 +324,7 @@ func NewRunCommand(factory *factory.ProviderFactory, scenarioOrchestrator *scena
 scenarioDuration := time.Since(startTime)
 fmt.Printf("%s ran for %s\n", scenarioDetail.Name, scenarioDuration.String())
 } else {
-containerID, err := (*scenarioOrchestrator).Run(quayImageURI+":"+scenarioDetail.Name, containerName, environment, false, volumes, nil, conn, registrySettings)
+containerID, err := (*scenarioOrchestrator).Run(quayImageURI+":"+scenarioDetail.Name, containerName, environment, false, volumes, nil, nil, conn, registrySettings, nil)
 if err != nil {
 return err
 }
