Complete Lightspeed AI-Powered Assistance Implementation with Platform-Based GPU Detection #78

Open

tsebastiani wants to merge 49 commits into main from gpu_check

Conversation

@tsebastiani (Contributor) commented Sep 4, 2025

Summary

This PR introduces a comprehensive Lightspeed AI-powered assistance feature for krknctl, implementing intelligent chaos engineering command suggestions using Retrieval-Augmented Generation (RAG) with automatic GPU detection and acceleration. The implementation includes a complete redesign of the GPU detection system, multi-platform container support, and robust error handling.

🚀 Major Implementation Tasks Completed

1. GPU Detection System Redesign

Previous System: Complex container-based GPU detection using test images ❌

  • Removed the complex GPU-check implementation that relied on test container images
  • Eliminated GetSupportedGPUTypes() and the container-based testing approach

New System: Platform-based automatic detection ✅

  • macOS arm64: Automatically assumes Apple Silicon GPU support (Metal via libkrun)
  • Linux with NVIDIA devices: Detects physical NVIDIA devices (/dev/nvidia0, /dev/nvidiactl, /dev/nvidia-uvm)
  • Generic fallback: CPU-only mode for all other platforms
  • Added a --no-gpu flag to force CPU-only mode without device mounting (the detection flow is sketched below)
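
For illustration, the platform-based detection can be approximated with the following Go sketch. The type name, constants, and the assumption that all three NVIDIA device nodes must be present are hypothetical; the actual logic lives in pkg/gpucheck/gpucheck.go.

package gpucheck

import (
	"os"
	"runtime"
)

// GPUType enumerates the acceleration modes the detector can report (illustrative names).
type GPUType string

const (
	GPUAppleSilicon GPUType = "apple-silicon" // Metal via libkrun on macOS arm64
	GPUNvidia       GPUType = "nvidia"        // CUDA on Linux with NVIDIA device nodes
	GPUGeneric      GPUType = "generic"       // CPU-only fallback
)

// nvidiaDevices are the device nodes whose presence indicates a usable NVIDIA GPU.
var nvidiaDevices = []string{"/dev/nvidia0", "/dev/nvidiactl", "/dev/nvidia-uvm"}

// DetectGPUAcceleration picks an acceleration mode purely from the host platform.
// noGPU mirrors the --no-gpu flag and forces the CPU-only path without device mounting.
func DetectGPUAcceleration(noGPU bool) GPUType {
	if noGPU {
		return GPUGeneric
	}
	// macOS on arm64 is assumed to be Apple Silicon with Metal support.
	if runtime.GOOS == "darwin" && runtime.GOARCH == "arm64" {
		return GPUAppleSilicon
	}
	// On Linux, require all NVIDIA device nodes (an assumption; the real check may accept any subset).
	if runtime.GOOS == "linux" {
		for _, dev := range nvidiaDevices {
			if _, err := os.Stat(dev); err != nil {
				return GPUGeneric
			}
		}
		return GPUNvidia
	}
	return GPUGeneric
}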

2. Container Runtime Support

  • Podman Only: Lightspeed exclusively supports the Podman container runtime
  • Docker Blocking: Commands fail gracefully with helpful error messages when Docker is detected
  • Error Handling: Provides links to the Podman GPU documentation (https://podman-desktop.io/docs/podman/gpu); a sketch of this guard follows below
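
A minimal sketch of the Podman-only guard, assuming a helper that receives the detected runtime name; the function name and wiring in cmd/lightspeed_check.go may differ:

package lightspeed

import (
	"fmt"
	"strings"
)

const podmanGPUDocs = "https://podman-desktop.io/docs/podman/gpu"

// ensurePodmanRuntime (hypothetical name) rejects any runtime other than Podman
// with an actionable error pointing at the Podman GPU documentation.
func ensurePodmanRuntime(runtimeName string) error {
	if strings.EqualFold(runtimeName, "podman") {
		return nil
	}
	return fmt.Errorf("lightspeed requires the Podman container runtime (detected: %s); see %s for GPU setup instructions",
		runtimeName, podmanGPUDocs)
}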

3. Container Architecture

Three Specialized Containers:

  • Apple Silicon (rag-model-apple-silicon): Vulkan backend for Apple M1/M2/M3/M4 GPUs
  • NVIDIA (rag-model-nvidia): CUDA backend for NVIDIA GPUs
  • Generic (rag-model-generic): CPU-only fallback for all other platforms

Container Selection Logic:

  • Uses PlatformGPUDetector.GetLightspeedImageURI() to select the appropriate container
  • Tag construction follows the {rag_model_tag}-{architecture} pattern from config
  • Device mounting is handled by PlatformGPUDetector.GetDeviceMounts() (see the sketch below)
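
Building on the detection sketch above, the selection logic could look roughly like this; the registry layout, architecture suffixes, and function signatures are assumptions, not the exact PlatformGPUDetector API:

package gpucheck

import "fmt"

// GetLightspeedImageURI maps the detected GPU type to one of the three RAG images and
// appends a tag following the {rag_model_tag}-{architecture} pattern.
func GetLightspeedImageURI(registry, ragModelTag string, gpu GPUType) string {
	image := map[GPUType]string{
		GPUAppleSilicon: "rag-model-apple-silicon",
		GPUNvidia:       "rag-model-nvidia",
		GPUGeneric:      "rag-model-generic",
	}[gpu]
	// The architecture suffix values below are illustrative.
	arch := map[GPUType]string{
		GPUAppleSilicon: "arm64",
		GPUNvidia:       "amd64",
		GPUGeneric:      "amd64",
	}[gpu]
	return fmt.Sprintf("%s/%s:%s-%s", registry, image, ragModelTag, arch)
}

// GetDeviceMounts returns the device nodes Podman should mount for the selected GPU type.
func GetDeviceMounts(gpu GPUType) []string {
	if gpu == GPUNvidia {
		return []string{"/dev/nvidia0", "/dev/nvidiactl", "/dev/nvidia-uvm"}
	}
	// Apple Silicon goes through libkrun/Metal rather than explicit device nodes; generic mounts nothing.
	return nil
}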

4. Multi-Stage Container Build Fix

Problem: Documentation indexing failed in the builder stage of multi-stage builds

  • Root Cause: Git and Python dependencies were not fully available during the builder stage
  • Solution: Moved documentation indexing from the builder stage to the runtime stage
  • Impact: Fixed the NVIDIA and Generic containers (the single-stage Apple Silicon build already worked)

5. Documentation Indexing System

Sources Indexed:

  • Local krknctl help documentation
  • Live krkn-chaos/website repository (chaos engineering guides)
  • Live krkn-chaos/krkn-hub repository (scenario definitions)

Indexing Process:

  • Build Time: Creates cached indices for offline/airgapped environments
  • Runtime: Can rebuild indices with fresh documentation or use cached versions
  • Verification: Automatic validation of indexed document sources and counts

🔧 Technical Implementation

Core Components

  • pkg/gpucheck/gpucheck.go: Platform-based GPU detection logic
  • cmd/lightspeed_check.go: Lightspeed commands with Docker runtime blocking
  • cmd/lightspeed.go: RAG model deployment with GPU-specific container selection
  • pkg/config/config.go: Enhanced with Lightspeed-specific configuration methods

Container Files

  • containers/lightspeed-rag/Containerfile.apple-silicon: Single-stage Vulkan build
  • containers/lightspeed-rag/Containerfile.nvidia: Multi-stage CUDA build
  • containers/lightspeed-rag/Containerfile.generic: Multi-stage CPU-only build

Key Functions

  • DetectGPUAcceleration(): Platform-based GPU type detection
  • deployRAGModelWithGPUType(): GPU-aware container deployment
  • HandleContainerError(): Enhanced error reporting with helpful suggestions

🎯 New CLI Commands

Lightspeed Check

# Automatic GPU detection and validation
krknctl lightspeed check

# Force CPU-only mode
krknctl lightspeed check --no-gpu

Lightspeed Run

# AI-powered assistance with auto-detected GPU
krknctl lightspeed run

# Force CPU-only mode (no GPU acceleration)
krknctl lightspeed run --no-gpu

# Offline mode for airgapped environments  
krknctl lightspeed run --offline

🏗️ Configuration Integration

  • Config-Based Tags: Uses rag_model_tag from pkg/config/config.json to construct container tags (see the sketch below)
  • Centralized Settings: All RAG service parameters (ports, endpoints, timeouts) in configuration
  • Private Registry Support: Full integration with existing private registry authentication
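
As a rough sketch of the configuration surface; apart from rag_model_tag, the field and JSON key names are assumptions, so consult pkg/config/config.json and pkg/config/config.go for the real schema:

package config

import "fmt"

// Config mirrors the Lightspeed-related settings added to pkg/config/config.json.
type Config struct {
	RagModelTag           string `json:"rag_model_tag"`
	LightspeedPort        int    `json:"lightspeed_port"`            // hypothetical key
	LightspeedHealthPath  string `json:"lightspeed_health_path"`     // hypothetical key
	LightspeedTimeoutSecs int    `json:"lightspeed_timeout_seconds"` // hypothetical key
}

// LightspeedImageTag builds a container tag following the {rag_model_tag}-{architecture} pattern.
func (c Config) LightspeedImageTag(architecture string) string {
	return fmt.Sprintf("%s-%s", c.RagModelTag, architecture)
}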

📊 User Experience Improvements

Progress Feedback

  • Spinner with dynamic progress messages during container image pulls
  • Real-time feedback during RAG model deployment
  • Health checking with automatic retry and timeout handling (a polling sketch follows below)
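
A minimal sketch of the health-check loop, assuming the endpoint and retry budget visible in the command output below (http://localhost:8080/health, up to 60 attempts); the function name and exact timing are illustrative:

package lightspeed

import (
	"fmt"
	"net/http"
	"time"
)

// waitForHealthy polls the RAG service health endpoint until it answers 200 OK
// or the retry budget is exhausted.
func waitForHealthy(url string, attempts int, interval time.Duration) error {
	client := &http.Client{Timeout: 5 * time.Second}
	for i := 1; i <= attempts; i++ {
		resp, err := client.Get(url)
		if err == nil {
			resp.Body.Close()
			if resp.StatusCode == http.StatusOK {
				return nil
			}
		}
		fmt.Printf("⏳ Waiting for service to become ready... (%d/%d)\n", i, attempts)
		time.Sleep(interval)
	}
	return fmt.Errorf("lightspeed service did not become healthy after %d attempts", attempts)
}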

Error Handling

  • Platform-specific error messages with actionable solutions
  • Automatic fallback from live indexing to cached documentation
  • Container cleanup on deployment failures

🧪 Testing

✅ Comprehensive Test Coverage

  • Updated the test suite to use the new PlatformGPUDetector API (a minimal example follows below)
  • Platform detection tests for all supported GPU types
  • Container deployment tests with mock orchestrator
  • Error handling tests for various failure scenarios
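
For example, the --no-gpu override can be covered by a platform-independent test like the one below (hypothetical, building on the detection sketch earlier in this description); the real suite in pkg/gpucheck/gpucheck_test.go is considerably broader:

package gpucheck

import "testing"

// TestDetectGPUAcceleration_NoGPUFlag checks only the --no-gpu override,
// which behaves identically on every platform.
func TestDetectGPUAcceleration_NoGPUFlag(t *testing.T) {
	if got := DetectGPUAcceleration(true); got != GPUGeneric {
		t.Fatalf("expected CPU-only mode with --no-gpu, got %q", got)
	}
}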

✅ Manual Testing Verified

  • All three container types build and run successfully
  • GPU detection works correctly on Apple Silicon and NVIDIA systems
  • Documentation indexing includes all expected sources (krknctl + website + krkn-hub)
  • Interactive RAG service provides accurate responses about chaos engineering

📁 Files Changed

New Files

  • pkg/gpucheck/gpucheck.go - Platform-based GPU detection logic
  • pkg/gpucheck/gpucheck_test.go - Comprehensive unit tests
  • cmd/lightspeed_check.go - Lightspeed commands implementation
  • cmd/lightspeed_check_test.go - CLI command tests
  • cmd/lightspeed.go - RAG model deployment functions
  • containers/lightspeed-rag/Containerfile.apple-silicon - Single-stage Vulkan build
  • containers/lightspeed-rag/Containerfile.nvidia - Multi-stage CUDA build
  • containers/lightspeed-rag/Containerfile.generic - Multi-stage CPU-only build
  • containers/lightspeed-rag/rag_service.py - FastAPI RAG service
  • containers/lightspeed-rag/index_docs.py - Documentation indexing script
  • containers/lightspeed-rag/entrypoint.sh - Container entrypoint
  • containers/lightspeed-rag/requirements.txt - Python dependencies

Modified Files

  • pkg/config/config.json - Added Lightspeed configuration parameters
  • pkg/config/config.go - Added Lightspeed configuration methods
  • cmd/root.go - Integrated lightspeed command into CLI structure
  • CLAUDE.md - Added comprehensive Lightspeed documentation

🌟 Key Benefits

Intelligent Assistance

  • Natural language queries about chaos engineering scenarios
  • Smart command suggestions based on user intent
  • Comprehensive documentation search across all krkn projects

Automatic GPU Optimization

  • Zero-configuration GPU detection and utilization
  • Cross-platform compatibility (Apple Silicon, NVIDIA, generic CPU)
  • Graceful degradation when GPU acceleration is unavailable

Developer Experience

  • Simple CLI interface following krknctl conventions
  • Comprehensive error messages with actionable solutions
  • Offline support for airgapped environments

🚀 Future Enhancements Ready

This foundation enables future integrations with the krkn-lightspeed repository for:

  • Enhanced RAG model capabilities
  • Advanced chaos engineering assistance
  • Integration with additional AI/ML features

🏁 Build and Test Instructions

# Build the project
go build -tags containers_image_openpgp -ldflags="-w -s" .

# Run tests
go test -tags containers_image_openpgp ./pkg/gpucheck ./pkg/config ./cmd

# Test Lightspeed commands
krknctl lightspeed check
krknctl lightspeed run --help

Assisted-by: Claude Sonnet 4
Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

tsebastiani force-pushed the gpu_check branch 7 times, most recently from da45ab9 to d16a735 on September 12, 2025 at 09:00
tsebastiani changed the title from "Add GPU support detection functionality with lightspeed command" to "Complete Lightspeed AI-Powered Assistance Implementation with Platform-Based GPU Detection" on Sep 12, 2025
@paigerube14 (Contributor) commented:

tested with the gpu alpha release and works great!!

% krknctl-gpu lightspeed run
📣📦 a newer version of krknctl, v0.10.4-beta, is currently available, check it out! https://github.com/krkn-chaos/krknctl/releases/latest


container runtime: Podman

🔍 Detecting GPU acceleration...
✅ GPU acceleration: Apple Silicon GPU (M1, M2, M3, M4 with Metal via libkrun)

🚀 Deploying lightspeed model...
🚀 RAG container started: bf84c3cb5460e5ea1cdc6850fa544ef4fb9dfad8c9fd7b52cfbeff006ca96151
📡 Port mapping: localhost:8080 -> container:8080

🩺 Performing health check...
🩺 Health checking Lightspeed service at http://localhost:8080/health...
⏳ Waiting for service to become ready... (1/60)
⏳ Waiting for service to become ready... (2/60)
✅ Service healthy: llama-cpp-python:Llama-3.2-1B-Instruct-Q4_K_M.gguf with 199 documents indexed
✅ Lightspeed service is ready!

🤖 Starting interactive Lightspeed service on port 8080...
Type your chaos engineering questions and get intelligent krknctl command suggestions!
Type 'exit' or 'quit' to stop.
🤖 AI Assistant ready! Ask me about krknctl commands or chaos engineering:
📍 Service available at: http://localhost:8080
💡 Try asking: 'How do I run a pod deletion scenario?'
🚪 Type 'exit', 'quit', or press Ctrl+C to stop.

> how can I run a pod scenario on my namespace test-app-1

🤖 To run a pod scenario on your namespace test-app-1, use the following command: 

🎯 krknctl run pod-scenarios --namespace test-app-1

This command will target the pods in the "test-app-1" namespace. If you don't specify a namespace, it will target all pods in the namespace, including those matching the label "app: test". The default value for the "disruption-count" flag is 1. The default value for the "kill-timeout" flag is 180 seconds. If you want to run the scenario for a specific duration, you can specify it as an argument, e.g., `krknctl run pod-scenarios --chaos-duration 600`. Remember that the "source: krkn-hub" data provides the authoritative flags, so make sure to use it over the official documentation. If you're unsure about the flags, please ask and I'll help you out. 🎯

> exit
👋 Goodbye!

🧹 Cleaning up Lightspeed service...
✅ Lightspeed service stopped successfully

tsebastiani force-pushed the gpu_check branch 5 times, most recently from 591b748 to 07db134 on September 23, 2025 at 16:29
- zone-outage-scenario: Simulate availability zone outages
- cloud-outage-scenario: Simulate cloud provider outages

#### Litmus Scenarios
Review comment (Contributor) on the hunk above: should we take out any litmus related docs?

tsebastiani force-pushed the gpu_check branch 3 times, most recently from 78e07c0 to 75d8c70 on September 24, 2025 at 11:07
tsebastiani and others added 4 commits October 9, 2025 10:00
Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>
- Add extensive unit tests for forms package functionality
- Test form creation, field separation, and validation logic
- Cover edge cases including malformed fields and invalid values
- Test all field types: string, number, boolean, enum, file validation
- Validate predefined values and default value handling
- Test environment variable conversion and form result operations
- 37.1% statement coverage with 100% coverage on pure functions

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>
Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>
tsebastiani force-pushed the gpu_check branch 2 times, most recently from e29c1bf to 5ddb3c7 on October 30, 2025 at 10:53
tsebastiani and others added 5 commits October 31, 2025 16:36
- Add new pkg/gpucheck package with PlatformGPUDetector
- Implement automatic GPU detection (Apple Silicon, NVIDIA, Generic)
- Add --no-gpu flag support for CPU-only mode
- Replace complex container-based GPU check with simple platform detection
- Add Docker runtime blocking with helpful error messages
- Integrate with existing configuration system using rag_model_tag

Key Features:
- macOS arm64: Auto-detect Apple Silicon GPU (Metal via libkrun)
- Linux: Detect NVIDIA devices (/dev/nvidia0, /dev/nvidiactl, /dev/nvidia-uvm)
- Generic: CPU-only fallback for other platforms
- Error handling with links to Podman GPU documentation

Files Added:
- pkg/gpucheck/gpucheck.go - Core detection logic
- pkg/gpucheck/gpucheck_test.go - Comprehensive test suite
- cmd/lightspeed_check.go - CLI commands implementation
- cmd/lightspeed_check_test.go - CLI tests
- Implement deployRAGModelWithGPUType for GPU-aware container deployment
- Add three specialized containers (Apple Silicon, NVIDIA, Generic)
- Create comprehensive RAG service with FastAPI and llama-cpp-python
- Implement documentation indexing from multiple sources
- Add spinner progress feedback during container pulls
- Integrate with existing CLI structure and private registry support

Container Architecture:
- Apple Silicon: Single-stage Vulkan build with Metal support
- NVIDIA: Multi-stage CUDA build with optimization
- Generic: Multi-stage CPU-only build for fallback

RAG Features:
- Live documentation indexing (krkn-chaos/website + krkn-hub)
- Cached index support for offline/airgapped environments
- Health checking with automatic retry and cleanup
- Interactive chat interface for chaos engineering assistance

Files Added:
- cmd/lightspeed.go - RAG deployment functions
- containers/lightspeed-rag/* - All container definitions and scripts
- Updated go.mod/go.sum with new dependencies
- Add comprehensive Lightspeed documentation to CLAUDE.md
- Update vendor dependencies for new packages
- Integrate Lightspeed commands with existing CLI structure
- Add configuration tests and utility functions
- Update scenario orchestrator for container port mapping support

Documentation:
- Complete implementation guide in CLAUDE.md
- All major tasks and technical details documented
- Usage examples and development notes included

Integration:
- Full integration with existing krknctl architecture
- Maintains backward compatibility
- Follows established patterns and conventions

Dependencies:
- Updated vendor modules for testing and mock frameworks
- Added necessary packages for Lightspeed functionality
- Clean integration without breaking existing features
- Replace individual Python scripts with full krkn-lightspeed repository checkout
- Update all three Containerfiles (NVIDIA, Apple Silicon, Generic) to:
  - Clone krkn-lightspeed repo and checkout krknctl_lightspeed branch
  - Install requirements from the repo (except llama-cpp-python built separately)
  - Copy the full repository into container at /app/krkn-lightspeed/
- Update entrypoint.sh to:
  - Use new FastAPI server from krkn-lightspeed repository
  - Set correct MODEL_PATH environment variable
  - Change working directory to /app/krkn-lightspeed before starting service
- Maintain backward compatibility with existing krknctl integration
- Use port 8080 for consistency with existing configuration

🤖 Generated with Claude Code (claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
- Pre-download Qwen and fallback embedding models in all Containerfiles
- Copy embedding model cache from builder to runtime stage
- Add CONTAINER_ENV environment variable for container detection
- Prevent runtime downloads that could cause blocking
- Improve GPU detection and reduce startup time

Changes:
- NVIDIA: Pre-download models in builder, copy cache to runtime
- Apple Silicon: Pre-download models during build
- Generic: Pre-download models in builder, copy cache to runtime
- Entrypoint: Set CONTAINER_ENV=true for proper detection

🤖 Generated with Claude Code (claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
tsebastiani and others added 25 commits October 31, 2025 16:36
…sponses

- Remove debug JSON output from interactive prompt
- Add printScenarioDetail function identical to cmd/describe.go formatting
- Copy newArgumentTable from cmd/tables.go for consistent parameter display
- Display complete scenario information with:
  * Green underlined title
  * Justified description text (65 chars per line)
  * Formatted parameter table with colors (Name, Type, Description, Required, Default)
- Maintain visual consistency with `krknctl describe` command output

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>
Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>
Major optimizations implemented:
- Reduced build dependencies (removed wget, which, cmake, make)
- Minimal package installation with --no-deps to avoid dependency bloat
- Aggressive cleanup of Python cache files and test directories
- Ultra-minimal runtime stage with selective file copying
- Single-layer package installation for runtime
- Shallow git clones with --depth=1 to reduce download size
- Removed nvidia-container-toolkit from runtime (not needed)
- Optimized virtual environment structure

Expected size reduction: ~1-2GB from current 8GB+ container

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
…anch

The --depth=1 flag only downloads the default branch, preventing access
to the krknctl_lightspeed branch. Removed shallow clones where branch
checkout is required.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
- Add setuptools, wheel, scikit-build-core for Python package builds
- Allow dependencies for fastapi, uvicorn, pydantic, chromadb
- Keep sentence-transformers with deps for proper functionality
- Add huggingface-hub for model downloads
- Remove overly aggressive --no-deps flags that break builds

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
- Remove complex CUDA toolkit installation that fails on different architectures
- Use precompiled llama-cpp-python wheels from PyPI with server extras
- Add fallback to basic llama-cpp-python if server extras fail
- Keep build dependencies minimal but functional

This approach works across different architectures (x86_64, aarch64) and
avoids compilation issues while maintaining CUDA support where available.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
- Add langchain-community for document loaders
- Add langchain-core for base functionality
- Add langchain-text-splitters for document processing
- Required for prebuild_chromadb.py script execution

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
- Add opentelemetry-api and opentelemetry-sdk for langchain tracing support
- Resolves StopIteration errors in opentelemetry context loading
- Required for langsmith integration in langchain-core

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>
Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>
Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>
Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>
Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

fix

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>
Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>
Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

Containerfile updates

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>
Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>
Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>
Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>
Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>
Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>
Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

linting

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>
Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>
Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>
Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>
Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

linting

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>