Complete Lightspeed AI-Powered Assistance Implementation with Platform-Based GPU Detection #78

Open

tsebastiani wants to merge 49 commits into main from gpu_check

Conversation

@tsebastiani (Contributor) commented Sep 4, 2025

Summary

This PR introduces a comprehensive Lightspeed AI-powered assistance feature for krknctl, implementing intelligent chaos engineering command suggestions using Retrieval-Augmented Generation (RAG) with automatic GPU detection and acceleration. The implementation includes a complete redesign of the GPU detection system, multi-platform container support, and robust error handling.

🚀 Major Implementation Tasks Completed

1. GPU Detection System Redesign

Previous System: Complex container-based GPU detection using test images ❌

  • Removed the complex GPU-check implementation that relied on test container images
  • Eliminated GetSupportedGPUTypes() and the container-based testing approach

New System: Platform-based automatic detection ✅

  • macOS arm64: Automatically assumes Apple Silicon GPU support (Metal via libkrun)
  • Linux with NVIDIA devices: Detects physical NVIDIA devices (/dev/nvidia0, /dev/nvidiactl, /dev/nvidia-uvm)
  • Generic fallback: CPU-only mode for all other platforms
  • Added a --no-gpu flag to force CPU-only mode without device mounting (the detection flow is sketched below)
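
For illustration, the platform-based detection can be approximated with the following Go sketch. The type name, constants, and the assumption that all three NVIDIA device nodes must be present are hypothetical; the actual logic lives in pkg/gpucheck/gpucheck.go.

package gpucheck

import (
	"os"
	"runtime"
)

// GPUType enumerates the acceleration modes the detector can report (illustrative names).
type GPUType string

const (
	GPUAppleSilicon GPUType = "apple-silicon" // Metal via libkrun on macOS arm64
	GPUNvidia       GPUType = "nvidia"        // CUDA on Linux with NVIDIA device nodes
	GPUGeneric      GPUType = "generic"       // CPU-only fallback
)

// nvidiaDevices are the device nodes whose presence indicates a usable NVIDIA GPU.
var nvidiaDevices = []string{"/dev/nvidia0", "/dev/nvidiactl", "/dev/nvidia-uvm"}

// DetectGPUAcceleration picks an acceleration mode purely from the host platform.
// noGPU mirrors the --no-gpu flag and forces the CPU-only path without device mounting.
func DetectGPUAcceleration(noGPU bool) GPUType {
	if noGPU {
		return GPUGeneric
	}
	// macOS on arm64 is assumed to be Apple Silicon with Metal support.
	if runtime.GOOS == "darwin" && runtime.GOARCH == "arm64" {
		return GPUAppleSilicon
	}
	// On Linux, require all NVIDIA device nodes (an assumption; the real check may accept any subset).
	if runtime.GOOS == "linux" {
		for _, dev := range nvidiaDevices {
			if _, err := os.Stat(dev); err != nil {
				return GPUGeneric
			}
		}
		return GPUNvidia
	}
	return GPUGeneric
}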

2. Container Runtime Support

  • Podman Only: Lightspeed exclusively supports the Podman container runtime
  • Docker Blocking: Commands fail gracefully with helpful error messages when Docker is detected
  • Error Handling: Provides links to the Podman GPU documentation (https://podman-desktop.io/docs/podman/gpu); a sketch of this guard follows below
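
A minimal sketch of the Podman-only guard, assuming a helper that receives the detected runtime name; the function name and wiring in cmd/lightspeed_check.go may differ:

package lightspeed

import (
	"fmt"
	"strings"
)

const podmanGPUDocs = "https://podman-desktop.io/docs/podman/gpu"

// ensurePodmanRuntime (hypothetical name) rejects any runtime other than Podman
// with an actionable error pointing at the Podman GPU documentation.
func ensurePodmanRuntime(runtimeName string) error {
	if strings.EqualFold(runtimeName, "podman") {
		return nil
	}
	return fmt.Errorf("lightspeed requires the Podman container runtime (detected: %s); see %s for GPU setup instructions",
		runtimeName, podmanGPUDocs)
}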

3. Container Architecture

Three Specialized Containers:

  • Apple Silicon (rag-model-apple-silicon): Vulkan backend for Apple M1/M2/M3/M4 GPUs
  • NVIDIA (rag-model-nvidia): CUDA backend for NVIDIA GPUs
  • Generic (rag-model-generic): CPU-only fallback for all other platforms

Container Selection Logic:

  • Uses PlatformGPUDetector.GetLightspeedImageURI() to select the appropriate container
  • Tag construction follows the {rag_model_tag}-{architecture} pattern from config
  • Device mounting is handled by PlatformGPUDetector.GetDeviceMounts() (see the sketch below)
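
Building on the detection sketch above, the selection logic could look roughly like this; the registry layout, architecture suffixes, and function signatures are assumptions, not the exact PlatformGPUDetector API:

package gpucheck

import "fmt"

// GetLightspeedImageURI maps the detected GPU type to one of the three RAG images and
// appends a tag following the {rag_model_tag}-{architecture} pattern.
func GetLightspeedImageURI(registry, ragModelTag string, gpu GPUType) string {
	image := map[GPUType]string{
		GPUAppleSilicon: "rag-model-apple-silicon",
		GPUNvidia:       "rag-model-nvidia",
		GPUGeneric:      "rag-model-generic",
	}[gpu]
	// The architecture suffix values below are illustrative.
	arch := map[GPUType]string{
		GPUAppleSilicon: "arm64",
		GPUNvidia:       "amd64",
		GPUGeneric:      "amd64",
	}[gpu]
	return fmt.Sprintf("%s/%s:%s-%s", registry, image, ragModelTag, arch)
}

// GetDeviceMounts returns the device nodes Podman should mount for the selected GPU type.
func GetDeviceMounts(gpu GPUType) []string {
	if gpu == GPUNvidia {
		return []string{"/dev/nvidia0", "/dev/nvidiactl", "/dev/nvidia-uvm"}
	}
	// Apple Silicon goes through libkrun/Metal rather than explicit device nodes; generic mounts nothing.
	return nil
}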

4. Multi-Stage Container Build Fix

Problem: Documentation indexing failed in the builder stage of multi-stage builds

  • Root Cause: Git and Python dependencies were not fully available during the builder stage
  • Solution: Moved documentation indexing from the builder stage to the runtime stage
  • Impact: Fixed the NVIDIA and Generic containers (the single-stage Apple Silicon build already worked)

5. Documentation Indexing System

Sources Indexed:

  • Local krknctl help documentation
  • Live krkn-chaos/website repository (chaos engineering guides)
  • Live krkn-chaos/krkn-hub repository (scenario definitions)

Indexing Process:

  • Build Time: Creates cached indices for offline/airgapped environments
  • Runtime: Can rebuild indices with fresh documentation or use cached versions
  • Verification: Automatic validation of indexed document sources and counts

🔧 Technical Implementation

Core Components

  • pkg/gpucheck/gpucheck.go: Platform-based GPU detection logic
  • cmd/lightspeed_check.go: Lightspeed commands with Docker runtime blocking
  • cmd/lightspeed.go: RAG model deployment with GPU-specific container selection
  • pkg/config/config.go: Enhanced with Lightspeed-specific configuration methods

Container Files

  • containers/lightspeed-rag/Containerfile.apple-silicon: Single-stage Vulkan build
  • containers/lightspeed-rag/Containerfile.nvidia: Multi-stage CUDA build
  • containers/lightspeed-rag/Containerfile.generic: Multi-stage CPU-only build

Key Functions

  • DetectGPUAcceleration(): Platform-based GPU type detection
  • deployRAGModelWithGPUType(): GPU-aware container deployment
  • HandleContainerError(): Enhanced error reporting with helpful suggestions

🎯 New CLI Commands

Lightspeed Check

# Automatic GPU detection and validation
krknctl lightspeed check

# Force CPU-only mode
krknctl lightspeed check --no-gpu

Lightspeed Run

# AI-powered assistance with auto-detected GPU
krknctl lightspeed run

# Force CPU-only mode (no GPU acceleration)
krknctl lightspeed run --no-gpu

# Offline mode for airgapped environments  
krknctl lightspeed run --offline

🏗️ Configuration Integration

  • Config-Based Tags: Uses rag_model_tag from pkg/config/config.json to construct container tags (see the sketch below)
  • Centralized Settings: All RAG service parameters (ports, endpoints, timeouts) in configuration
  • Private Registry Support: Full integration with existing private registry authentication
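
As a rough sketch of the configuration surface; apart from rag_model_tag, the field and JSON key names are assumptions, so consult pkg/config/config.json and pkg/config/config.go for the real schema:

package config

import "fmt"

// Config mirrors the Lightspeed-related settings added to pkg/config/config.json.
type Config struct {
	RagModelTag           string `json:"rag_model_tag"`
	LightspeedPort        int    `json:"lightspeed_port"`            // hypothetical key
	LightspeedHealthPath  string `json:"lightspeed_health_path"`     // hypothetical key
	LightspeedTimeoutSecs int    `json:"lightspeed_timeout_seconds"` // hypothetical key
}

// LightspeedImageTag builds a container tag following the {rag_model_tag}-{architecture} pattern.
func (c Config) LightspeedImageTag(architecture string) string {
	return fmt.Sprintf("%s-%s", c.RagModelTag, architecture)
}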

📊 User Experience Improvements

Progress Feedback

  • Spinner with dynamic progress messages during container image pulls
  • Real-time feedback during RAG model deployment
  • Health checking with automatic retry and timeout handling (a polling sketch follows below)
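
A minimal sketch of the health-check loop, assuming the endpoint and retry budget visible in the command output below (http://localhost:8080/health, up to 60 attempts); the function name and exact timing are illustrative:

package lightspeed

import (
	"fmt"
	"net/http"
	"time"
)

// waitForHealthy polls the RAG service health endpoint until it answers 200 OK
// or the retry budget is exhausted.
func waitForHealthy(url string, attempts int, interval time.Duration) error {
	client := &http.Client{Timeout: 5 * time.Second}
	for i := 1; i <= attempts; i++ {
		resp, err := client.Get(url)
		if err == nil {
			resp.Body.Close()
			if resp.StatusCode == http.StatusOK {
				return nil
			}
		}
		fmt.Printf("⏳ Waiting for service to become ready... (%d/%d)\n", i, attempts)
		time.Sleep(interval)
	}
	return fmt.Errorf("lightspeed service did not become healthy after %d attempts", attempts)
}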

Error Handling

  • Platform-specific error messages with actionable solutions
  • Automatic fallback from live indexing to cached documentation
  • Container cleanup on deployment failures

🧪 Testing

✅ Comprehensive Test Coverage

  • Updated the test suite to use the new PlatformGPUDetector API (a minimal example follows below)
  • Platform detection tests for all supported GPU types
  • Container deployment tests with mock orchestrator
  • Error handling tests for various failure scenarios
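
For example, the --no-gpu override can be covered by a platform-independent test like the one below (hypothetical, building on the detection sketch earlier in this description); the real suite in pkg/gpucheck/gpucheck_test.go is considerably broader:

package gpucheck

import "testing"

// TestDetectGPUAcceleration_NoGPUFlag checks only the --no-gpu override,
// which behaves identically on every platform.
func TestDetectGPUAcceleration_NoGPUFlag(t *testing.T) {
	if got := DetectGPUAcceleration(true); got != GPUGeneric {
		t.Fatalf("expected CPU-only mode with --no-gpu, got %q", got)
	}
}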

✅ Manual Testing Verified

  • All three container types build and run successfully
  • GPU detection works correctly on Apple Silicon and NVIDIA systems
  • Documentation indexing includes all expected sources (krknctl + website + krkn-hub)
  • Interactive RAG service provides accurate responses about chaos engineering

📁 Files Changed

New Files

  • pkg/gpucheck/gpucheck.go - Platform-based GPU detection logic
  • pkg/gpucheck/gpucheck_test.go - Comprehensive unit tests
  • cmd/lightspeed_check.go - Lightspeed commands implementation
  • cmd/lightspeed_check_test.go - CLI command tests
  • cmd/lightspeed.go - RAG model deployment functions
  • containers/lightspeed-rag/Containerfile.apple-silicon - Single-stage Vulkan build
  • containers/lightspeed-rag/Containerfile.nvidia - Multi-stage CUDA build
  • containers/lightspeed-rag/Containerfile.generic - Multi-stage CPU-only build
  • containers/lightspeed-rag/rag_service.py - FastAPI RAG service
  • containers/lightspeed-rag/index_docs.py - Documentation indexing script
  • containers/lightspeed-rag/entrypoint.sh - Container entrypoint
  • containers/lightspeed-rag/requirements.txt - Python dependencies

Modified Files

  • pkg/config/config.json - Added Lightspeed configuration parameters
  • pkg/config/config.go - Added Lightspeed configuration methods
  • cmd/root.go - Integrated lightspeed command into CLI structure
  • CLAUDE.md - Added comprehensive Lightspeed documentation

🌟 Key Benefits

Intelligent Assistance

  • Natural language queries about chaos engineering scenarios
  • Smart command suggestions based on user intent
  • Comprehensive documentation search across all krkn projects

Automatic GPU Optimization

  • Zero-configuration GPU detection and utilization
  • Cross-platform compatibility (Apple Silicon, NVIDIA, generic CPU)
  • Graceful degradation when GPU acceleration is unavailable

Developer Experience

  • Simple CLI interface following krknctl conventions
  • Comprehensive error messages with actionable solutions
  • Offline support for airgapped environments

🚀 Future Enhancements Ready

This foundation enables future integrations with the krkn-lightspeed repository for:

  • Enhanced RAG model capabilities
  • Advanced chaos engineering assistance
  • Integration with additional AI/ML features

🏁 Build and Test Instructions

# Build the project
go build -tags containers_image_openpgp -ldflags="-w -s" .

# Run tests
go test -tags containers_image_openpgp ./pkg/gpucheck ./pkg/config ./cmd

# Test Lightspeed commands
krknctl lightspeed check
krknctl lightspeed run --help

Assisted-by: Claude Sonnet 4
Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

tsebastiani force-pushed the gpu_check branch 7 times, most recently from da45ab9 to d16a735 on September 12, 2025 at 09:00
tsebastiani changed the title from "Add GPU support detection functionality with lightspeed command" to "Complete Lightspeed AI-Powered Assistance Implementation with Platform-Based GPU Detection" on Sep 12, 2025
@paigerube14 (Contributor) commented:

tested with the gpu alpha release and works great!!

% krknctl-gpu lightspeed run
📣📦 a newer version of krknctl, v0.10.4-beta, is currently available, check it out! https://github.com/krkn-chaos/krknctl/releases/latest


container runtime: Podman

🔍 Detecting GPU acceleration...
✅ GPU acceleration: Apple Silicon GPU (M1, M2, M3, M4 with Metal via libkrun)

🚀 Deploying lightspeed model...
🚀 RAG container started: bf84c3cb5460e5ea1cdc6850fa544ef4fb9dfad8c9fd7b52cfbeff006ca96151
📡 Port mapping: localhost:8080 -> container:8080

🩺 Performing health check...
🩺 Health checking Lightspeed service at http://localhost:8080/health...
⏳ Waiting for service to become ready... (1/60)
⏳ Waiting for service to become ready... (2/60)
✅ Service healthy: llama-cpp-python:Llama-3.2-1B-Instruct-Q4_K_M.gguf with 199 documents indexed
✅ Lightspeed service is ready!

🤖 Starting interactive Lightspeed service on port 8080...
Type your chaos engineering questions and get intelligent krknctl command suggestions!
Type 'exit' or 'quit' to stop.
🤖 AI Assistant ready! Ask me about krknctl commands or chaos engineering:
📍 Service available at: http://localhost:8080
💡 Try asking: 'How do I run a pod deletion scenario?'
🚪 Type 'exit', 'quit', or press Ctrl+C to stop.

> how can I run a pod scenario on my namespace test-app-1

🤖 To run a pod scenario on your namespace test-app-1, use the following command: 

🎯 krknctl run pod-scenarios --namespace test-app-1

This command will target the pods in the "test-app-1" namespace. If you don't specify a namespace, it will target all pods in the namespace, including those matching the label "app: test". The default value for the "disruption-count" flag is 1. The default value for the "kill-timeout" flag is 180 seconds. If you want to run the scenario for a specific duration, you can specify it as an argument, e.g., `krknctl run pod-scenarios --chaos-duration 600`. Remember that the "source: krkn-hub" data provides the authoritative flags, so make sure to use it over the official documentation. If you're unsure about the flags, please ask and I'll help you out. 🎯

> exit
👋 Goodbye!

🧹 Cleaning up Lightspeed service...
✅ Lightspeed service stopped successfully

tsebastiani force-pushed the gpu_check branch 5 times, most recently from 591b748 to 07db134 on September 23, 2025 at 16:29
- zone-outage-scenario: Simulate availability zone outages
- cloud-outage-scenario: Simulate cloud provider outages

#### Litmus Scenarios
Review comment (Contributor) on the hunk above: should we take out any litmus related docs?

tsebastiani force-pushed the gpu_check branch 3 times, most recently from 78e07c0 to 75d8c70 on September 24, 2025 at 11:07
tsebastiani and others added 4 commits October 9, 2025 10:00
Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>
- Add extensive unit tests for forms package functionality
- Test form creation, field separation, and validation logic
- Cover edge cases including malformed fields and invalid values
- Test all field types: string, number, boolean, enum, file validation
- Validate predefined values and default value handling
- Test environment variable conversion and form result operations
- 37.1% statement coverage with 100% coverage on pure functions

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>
Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>
tsebastiani force-pushed the gpu_check branch 2 times, most recently from e29c1bf to 5ddb3c7 on October 30, 2025 at 10:53
tsebastiani and others added 5 commits October 31, 2025 16:36
- Add new pkg/gpucheck package with PlatformGPUDetector
- Implement automatic GPU detection (Apple Silicon, NVIDIA, Generic)
- Add --no-gpu flag support for CPU-only mode
- Replace complex container-based GPU check with simple platform detection
- Add Docker runtime blocking with helpful error messages
- Integrate with existing configuration system using rag_model_tag

Key Features:
- macOS arm64: Auto-detect Apple Silicon GPU (Metal via libkrun)
- Linux: Detect NVIDIA devices (/dev/nvidia0, /dev/nvidiactl, /dev/nvidia-uvm)
- Generic: CPU-only fallback for other platforms
- Error handling with links to Podman GPU documentation

Files Added:
- pkg/gpucheck/gpucheck.go - Core detection logic
- pkg/gpucheck/gpucheck_test.go - Comprehensive test suite
- cmd/lightspeed_check.go - CLI commands implementation
- cmd/lightspeed_check_test.go - CLI tests
- Implement deployRAGModelWithGPUType for GPU-aware container deployment
- Add three specialized containers (Apple Silicon, NVIDIA, Generic)
- Create comprehensive RAG service with FastAPI and llama-cpp-python
- Implement documentation indexing from multiple sources
- Add spinner progress feedback during container pulls
- Integrate with existing CLI structure and private registry support

Container Architecture:
- Apple Silicon: Single-stage Vulkan build with Metal support
- NVIDIA: Multi-stage CUDA build with optimization
- Generic: Multi-stage CPU-only build for fallback

RAG Features:
- Live documentation indexing (krkn-chaos/website + krkn-hub)
- Cached index support for offline/airgapped environments
- Health checking with automatic retry and cleanup
- Interactive chat interface for chaos engineering assistance

Files Added:
- cmd/lightspeed.go - RAG deployment functions
- containers/lightspeed-rag/* - All container definitions and scripts
- Updated go.mod/go.sum with new dependencies
- Add comprehensive Lightspeed documentation to CLAUDE.md
- Update vendor dependencies for new packages
- Integrate Lightspeed commands with existing CLI structure
- Add configuration tests and utility functions
- Update scenario orchestrator for container port mapping support

Documentation:
- Complete implementation guide in CLAUDE.md
- All major tasks and technical details documented
- Usage examples and development notes included

Integration:
- Full integration with existing krknctl architecture
- Maintains backward compatibility
- Follows established patterns and conventions

Dependencies:
- Updated vendor modules for testing and mock frameworks
- Added necessary packages for Lightspeed functionality
- Clean integration without breaking existing features
- Replace individual Python scripts with full krkn-lightspeed repository checkout
- Update all three Containerfiles (NVIDIA, Apple Silicon, Generic) to:
  - Clone krkn-lightspeed repo and checkout krknctl_lightspeed branch
  - Install requirements from the repo (except llama-cpp-python built separately)
  - Copy the full repository into container at /app/krkn-lightspeed/
- Update entrypoint.sh to:
  - Use new FastAPI server from krkn-lightspeed repository
  - Set correct MODEL_PATH environment variable
  - Change working directory to /app/krkn-lightspeed before starting service
- Maintain backward compatibility with existing krknctl integration
- Use port 8080 for consistency with existing configuration

🤖 Generated with Claude Code (claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
- Pre-download Qwen and fallback embedding models in all Containerfiles
- Copy embedding model cache from builder to runtime stage
- Add CONTAINER_ENV environment variable for container detection
- Prevent runtime downloads that could cause blocking
- Improve GPU detection and reduce startup time

Changes:
- NVIDIA: Pre-download models in builder, copy cache to runtime
- Apple Silicon: Pre-download models during build
- Generic: Pre-download models in builder, copy cache to runtime
- Entrypoint: Set CONTAINER_ENV=true for proper detection

🤖 Generated with Claude Code (claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
tsebastiani and others added 25 commits October 31, 2025 16:36
…sponses

- Remove debug JSON output from interactive prompt
- Add printScenarioDetail function identical to cmd/describe.go formatting
- Copy newArgumentTable from cmd/tables.go for consistent parameter display
- Display complete scenario information with:
  * Green underlined title
  * Justified description text (65 chars per line)
  * Formatted parameter table with colors (Name, Type, Description, Required, Default)
- Maintain visual consistency with `krknctl describe` command output

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>
Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>
Major optimizations implemented:
- Reduced build dependencies (removed wget, which, cmake, make)
- Minimal package installation with --no-deps to avoid dependency bloat
- Aggressive cleanup of Python cache files and test directories
- Ultra-minimal runtime stage with selective file copying
- Single-layer package installation for runtime
- Shallow git clones with --depth=1 to reduce download size
- Removed nvidia-container-toolkit from runtime (not needed)
- Optimized virtual environment structure

Expected size reduction: ~1-2GB from current 8GB+ container

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
…anch

The --depth=1 flag only downloads the default branch, preventing access
to the krknctl_lightspeed branch. Removed shallow clones where branch
checkout is required.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
- Add setuptools, wheel, scikit-build-core for Python package builds
- Allow dependencies for fastapi, uvicorn, pydantic, chromadb
- Keep sentence-transformers with deps for proper functionality
- Add huggingface-hub for model downloads
- Remove overly aggressive --no-deps flags that break builds

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
- Remove complex CUDA toolkit installation that fails on different architectures
- Use precompiled llama-cpp-python wheels from PyPI with server extras
- Add fallback to basic llama-cpp-python if server extras fail
- Keep build dependencies minimal but functional

This approach works across different architectures (x86_64, aarch64) and
avoids compilation issues while maintaining CUDA support where available.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
- Add langchain-community for document loaders
- Add langchain-core for base functionality
- Add langchain-text-splitters for document processing
- Required for prebuild_chromadb.py script execution

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
- Add opentelemetry-api and opentelemetry-sdk for langchain tracing support
- Resolves StopIteration errors in opentelemetry context loading
- Required for langsmith integration in langchain-core

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>
Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>
Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>
Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>
Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

fix

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>
Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>
Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

Containerfile updates

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>
Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>
Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>
Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>
Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>
Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>
Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

linting

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>
Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>
Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>
Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>
Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>

linting

Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>