Complete Lightspeed AI-Powered Assistance Implementation with Platform-Based GPU Detection #78
Open
tsebastiani wants to merge 49 commits into main
da45ab9 to d16a735
226f6ba to 7adf9ed
Contributor
tested with the gpu alpha release and works great!!
591b748 to 07db134
paigerube14 reviewed Sep 23, 2025
- zone-outage-scenario: Simulate availability zone outages
- cloud-outage-scenario: Simulate cloud provider outages

#### Litmus Scenarios
Contributor
should we take out any litmus related docs?
78e07c0 to 75d8c70
Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>
- Add extensive unit tests for forms package functionality
- Test form creation, field separation, and validation logic
- Cover edge cases including malformed fields and invalid values
- Test all field types: string, number, boolean, enum, file validation
- Validate predefined values and default value handling
- Test environment variable conversion and form result operations
- 37.1% statement coverage with 100% coverage on pure functions

🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>
e29c1bf to 5ddb3c7
- Add new pkg/gpucheck package with PlatformGPUDetector
- Implement automatic GPU detection (Apple Silicon, NVIDIA, Generic)
- Add --no-gpu flag support for CPU-only mode
- Replace complex container-based GPU check with simple platform detection
- Add Docker runtime blocking with helpful error messages
- Integrate with existing configuration system using rag_model_tag

Key Features:
- macOS arm64: Auto-detect Apple Silicon GPU (Metal via libkrun)
- Linux: Detect NVIDIA devices (/dev/nvidia0, /dev/nvidiactl, /dev/nvidia-uvm)
- Generic: CPU-only fallback for other platforms
- Error handling with links to Podman GPU documentation

Files Added:
- pkg/gpucheck/gpucheck.go - Core detection logic
- pkg/gpucheck/gpucheck_test.go - Comprehensive test suite
- cmd/lightspeed_check.go - CLI commands implementation
- cmd/lightspeed_check_test.go - CLI tests
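The detection rules listed in this commit message can be sketched roughly as follows. This is an illustrative sketch, not the code in pkg/gpucheck: the names `GPUType`, `detectGPU`, and `fileExists` are hypothetical, and treating any one of the three NVIDIA device nodes as sufficient is an assumption, since the commit message does not say whether all of them must be present.

```go
package main

import (
	"fmt"
	"os"
	"runtime"
)

// GPUType is a hypothetical label for the three container flavours.
type GPUType string

const (
	GPUAppleSilicon GPUType = "apple-silicon"
	GPUNvidia       GPUType = "nvidia"
	GPUGeneric      GPUType = "generic"
)

// nvidiaDevices lists the device nodes named in the commit message.
var nvidiaDevices = []string{"/dev/nvidia0", "/dev/nvidiactl", "/dev/nvidia-uvm"}

// detectGPU picks a GPU type from the OS, the CPU architecture, and a
// file-existence probe. The probe is injected so the decision logic can be
// exercised without real hardware.
func detectGPU(goos, goarch string, noGPU bool, exists func(string) bool) GPUType {
	if noGPU {
		// --no-gpu forces CPU-only mode regardless of the platform.
		return GPUGeneric
	}
	if goos == "darwin" && goarch == "arm64" {
		return GPUAppleSilicon
	}
	if goos == "linux" {
		// Assumption: one visible device node is enough to select NVIDIA.
		for _, dev := range nvidiaDevices {
			if exists(dev) {
				return GPUNvidia
			}
		}
	}
	return GPUGeneric
}

// fileExists reports whether a path can be stat'ed.
func fileExists(path string) bool {
	_, err := os.Stat(path)
	return err == nil
}

func main() {
	fmt.Println(detectGPU(runtime.GOOS, runtime.GOARCH, false, fileExists))
}
```

Passing the probe as a parameter keeps the platform decision a pure function, which is one way to unit-test it without real hardware.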
- Implement deployRAGModelWithGPUType for GPU-aware container deployment
- Add three specialized containers (Apple Silicon, NVIDIA, Generic)
- Create comprehensive RAG service with FastAPI and llama-cpp-python
- Implement documentation indexing from multiple sources
- Add spinner progress feedback during container pulls
- Integrate with existing CLI structure and private registry support

Container Architecture:
- Apple Silicon: Single-stage Vulkan build with Metal support
- NVIDIA: Multi-stage CUDA build with optimization
- Generic: Multi-stage CPU-only build for fallback

RAG Features:
- Live documentation indexing (krkn-chaos/website + krkn-hub)
- Cached index support for offline/airgapped environments
- Health checking with automatic retry and cleanup
- Interactive chat interface for chaos engineering assistance

Files Added:
- cmd/lightspeed.go - RAG deployment functions
- containers/lightspeed-rag/* - All container definitions and scripts
- Updated go.mod/go.sum with new dependencies
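The PR description says the container tag follows a `{rag_model_tag}-{architecture}` pattern and that device mounts depend on the detected GPU. A minimal sketch of that selection step follows, with several assumptions: the function names `imageURI` and `deviceMounts`, the registry path, and the tag value `rag-model` are all hypothetical, and `/dev/dri` for Apple Silicon is a guess that this page does not confirm.

```go
package main

import "fmt"

// imageURI builds "<repository>:<ragModelTag>-<arch>", following the
// {rag_model_tag}-{architecture} pattern mentioned in the PR description.
func imageURI(repository, ragModelTag, arch string) string {
	return fmt.Sprintf("%s:%s-%s", repository, ragModelTag, arch)
}

// deviceMounts returns the host devices to expose to the container.
// The NVIDIA paths come from the commit message; /dev/dri for Apple Silicon
// is an assumption; the generic (CPU-only) flavour mounts nothing.
func deviceMounts(arch string) []string {
	switch arch {
	case "nvidia":
		return []string{"/dev/nvidia0", "/dev/nvidiactl", "/dev/nvidia-uvm"}
	case "apple-silicon":
		return []string{"/dev/dri"}
	default:
		return nil
	}
}

func main() {
	// Hypothetical repository and tag value, used only for illustration.
	uri := imageURI("quay.io/example/lightspeed", "rag-model", "nvidia")
	fmt.Println(uri, deviceMounts("nvidia"))
}
```

With the assumed tag value `rag-model`, the pattern yields `rag-model-nvidia`, which matches the container names listed in the description.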
- Add comprehensive Lightspeed documentation to CLAUDE.md
- Update vendor dependencies for new packages
- Integrate Lightspeed commands with existing CLI structure
- Add configuration tests and utility functions
- Update scenario orchestrator for container port mapping support

Documentation:
- Complete implementation guide in CLAUDE.md
- All major tasks and technical details documented
- Usage examples and development notes included

Integration:
- Full integration with existing krknctl architecture
- Maintains backward compatibility
- Follows established patterns and conventions

Dependencies:
- Updated vendor modules for testing and mock frameworks
- Added necessary packages for Lightspeed functionality
- Clean integration without breaking existing features
- Replace individual Python scripts with full krkn-lightspeed repository checkout
- Update all three Containerfiles (NVIDIA, Apple Silicon, Generic) to:
  - Clone krkn-lightspeed repo and checkout krknctl_lightspeed branch
  - Install requirements from the repo (except llama-cpp-python built separately)
  - Copy the full repository into container at /app/krkn-lightspeed/
- Update entrypoint.sh to:
  - Use new FastAPI server from krkn-lightspeed repository
  - Set correct MODEL_PATH environment variable
  - Change working directory to /app/krkn-lightspeed before starting service
- Maintain backward compatibility with existing krknctl integration
- Use port 8080 for consistency with existing configuration

🤖 Generated with Claude Code (claude.ai/code)
Co-Authored-By: Claude <noreply@anthropic.com>
- Pre-download Qwen and fallback embedding models in all Containerfiles
- Copy embedding model cache from builder to runtime stage
- Add CONTAINER_ENV environment variable for container detection
- Prevent runtime downloads that could cause blocking
- Improve GPU detection and reduce startup time

Changes:
- NVIDIA: Pre-download models in builder, copy cache to runtime
- Apple Silicon: Pre-download models during build
- Generic: Pre-download models in builder, copy cache to runtime
- Entrypoint: Set CONTAINER_ENV=true for proper detection

🤖 Generated with Claude Code (claude.ai/code)
Co-Authored-By: Claude <noreply@anthropic.com>
…sponses
- Remove debug JSON output from interactive prompt
- Add printScenarioDetail function identical to cmd/describe.go formatting
- Copy newArgumentTable from cmd/tables.go for consistent parameter display
- Display complete scenario information with:
  * Green underlined title
  * Justified description text (65 chars per line)
  * Formatted parameter table with colors (Name, Type, Description, Required, Default)
- Maintain visual consistency with `krknctl describe` command output

🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <noreply@anthropic.com>
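The 65-characters-per-line description formatting could look roughly like the sketch below. It is a plain greedy word wrap, not full justification and not the code from cmd/describe.go. The function name `wrap` is hypothetical, and width is counted in bytes, which assumes ASCII text.

```go
package main

import (
	"fmt"
	"strings"
)

// wrap splits text into lines of at most width bytes, breaking only at
// whitespace. A single word longer than width is kept whole on its own line.
func wrap(text string, width int) []string {
	var lines []string
	cur := ""
	for _, word := range strings.Fields(text) {
		switch {
		case cur == "":
			cur = word
		case len(cur)+1+len(word) > width:
			// Adding the word would overflow: close the current line.
			lines = append(lines, cur)
			cur = word
		default:
			cur += " " + word
		}
	}
	if cur != "" {
		lines = append(lines, cur)
	}
	return lines
}

func main() {
	desc := "Simulate availability zone outages by blocking traffic to and from every node in the selected zone for a configurable duration."
	for _, line := range wrap(desc, 65) {
		fmt.Println(line)
	}
}
```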
Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>
Major optimizations implemented:
- Reduced build dependencies (removed wget, which, cmake, make)
- Minimal package installation with --no-deps to avoid dependency bloat
- Aggressive cleanup of Python cache files and test directories
- Ultra-minimal runtime stage with selective file copying
- Single-layer package installation for runtime
- Shallow git clones with --depth=1 to reduce download size
- Removed nvidia-container-toolkit from runtime (not needed)
- Optimized virtual environment structure

Expected size reduction: ~1-2GB from current 8GB+ container

🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <noreply@anthropic.com>
…anch

The --depth=1 flag only downloads the default branch, preventing access to the krknctl_lightspeed branch. Removed shallow clones where branch checkout is required.

🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <noreply@anthropic.com>
- Add setuptools, wheel, scikit-build-core for Python package builds
- Allow dependencies for fastapi, uvicorn, pydantic, chromadb
- Keep sentence-transformers with deps for proper functionality
- Add huggingface-hub for model downloads
- Remove overly aggressive --no-deps flags that break builds

🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <noreply@anthropic.com>
- Remove complex CUDA toolkit installation that fails on different architectures
- Use precompiled llama-cpp-python wheels from PyPI with server extras
- Add fallback to basic llama-cpp-python if server extras fail
- Keep build dependencies minimal but functional

This approach works across different architectures (x86_64, aarch64) and avoids compilation issues while maintaining CUDA support where available.

🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <noreply@anthropic.com>
- Add langchain-community for document loaders
- Add langchain-core for base functionality
- Add langchain-text-splitters for document processing
- Required for prebuild_chromadb.py script execution

🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <noreply@anthropic.com>
- Add opentelemetry-api and opentelemetry-sdk for langchain tracing support
- Resolves StopIteration errors in opentelemetry context loading
- Required for langsmith integration in langchain-core

🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>
Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>
fix
Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>
Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>
Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>
Containerfile updates
Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>
Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>
linting
Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>
4499cd2 to 7a14c70
Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>
linting
Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>
f347e9c to 1e47a35
Summary
This PR introduces a comprehensive Lightspeed AI-powered assistance feature for krknctl, implementing intelligent chaos engineering command suggestions using Retrieval-Augmented Generation (RAG) with automatic GPU detection and acceleration. The implementation includes a complete redesign of the GPU detection system, multi-platform container support, and robust error handling.
🚀 Major Implementation Tasks Completed
1. GPU Detection System Redesign
Previous System: Complex container-based GPU detection using test images ❌
- GetSupportedGPUTypes() and container-based testing approach
New System: Platform-based automatic detection ✅
- Linux: NVIDIA devices (/dev/nvidia0, /dev/nvidiactl, /dev/nvidia-uvm)
- --no-gpu flag to force CPU-only mode without device mounting
2. Container Runtime Support
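The commit messages above mention that the Docker runtime is blocked with a helpful error message that links to the Podman GPU documentation. One possible shape for that check is sketched below. The names `checkRuntime` and `ErrDockerUnsupported` and the error wording are hypothetical, and the documentation link is a placeholder because the real URL does not appear on this page.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// podmanGPUDocs is a placeholder; the actual link is not shown in this PR page.
const podmanGPUDocs = "<Podman GPU documentation URL>"

// ErrDockerUnsupported is a hypothetical sentinel error for the blocked runtime.
var ErrDockerUnsupported = errors.New("lightspeed is not supported on the Docker runtime")

// checkRuntime rejects Docker and accepts any other runtime name.
func checkRuntime(name string) error {
	if strings.EqualFold(name, "docker") {
		return fmt.Errorf("%w: use Podman instead, see %s", ErrDockerUnsupported, podmanGPUDocs)
	}
	return nil
}

func main() {
	for _, rt := range []string{"podman", "docker"} {
		if err := checkRuntime(rt); err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Println(rt, "is supported")
	}
}
```

Wrapping a sentinel with `%w` lets callers test for the condition with `errors.Is` while the message still carries the documentation pointer.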
3. Container Architecture
Three Specialized Containers:
- Apple Silicon (rag-model-apple-silicon): Vulkan backend for Apple M1/M2/M3/M4 GPUs
- NVIDIA (rag-model-nvidia): CUDA backend for NVIDIA GPUs
- Generic (rag-model-generic): CPU-only fallback for all other platforms
Container Selection Logic:
- PlatformGPUDetector.GetLightspeedImageURI() to select appropriate container
- {rag_model_tag}-{architecture} pattern from config
- PlatformGPUDetector.GetDeviceMounts()
4. Multi-Stage Container Build Fix
Problem: Documentation indexing failed in builder stage of multi-stage builds
5. Documentation Indexing System
Sources Indexed:
Indexing Process:
🔧 Technical Implementation
Core Components
- pkg/gpucheck/gpucheck.go: Platform-based GPU detection logic
- cmd/lightspeed_check.go: Lightspeed commands with Docker runtime blocking
- cmd/lightspeed.go: RAG model deployment with GPU-specific container selection
- pkg/config/config.go: Enhanced with Lightspeed-specific configuration methods
Container Files
- containers/lightspeed-rag/Containerfile.apple-silicon: Single-stage Vulkan build
- containers/lightspeed-rag/Containerfile.nvidia: Multi-stage CUDA build
- containers/lightspeed-rag/Containerfile.generic: Multi-stage CPU-only build
Key Functions
- DetectGPUAcceleration(): Platform-based GPU type detection
- deployRAGModelWithGPUType(): GPU-aware container deployment
- HandleContainerError(): Enhanced error reporting with helpful suggestions
🎯 New CLI Commands
Lightspeed Check
Lightspeed Run
🏗️ Configuration Integration
- rag_model_tag from pkg/config/config.json to construct container tags
📊 User Experience Improvements
Progress Feedback
Error Handling
🧪 Testing
✅ Comprehensive Test Coverage
- PlatformGPUDetector API
✅ Manual Testing Verified
📁 Files Changed
New Files
- pkg/gpucheck/gpucheck.go - Platform-based GPU detection logic
- pkg/gpucheck/gpucheck_test.go - Comprehensive unit tests
- cmd/lightspeed_check.go - Lightspeed commands implementation
- cmd/lightspeed_check_test.go - CLI command tests
- cmd/lightspeed.go - RAG model deployment functions
- containers/lightspeed-rag/Containerfile.apple-silicon - Single-stage Vulkan build
- containers/lightspeed-rag/Containerfile.nvidia - Multi-stage CUDA build
- containers/lightspeed-rag/Containerfile.generic - Multi-stage CPU-only build
- containers/lightspeed-rag/rag_service.py - FastAPI RAG service
- containers/lightspeed-rag/index_docs.py - Documentation indexing script
- containers/lightspeed-rag/entrypoint.sh - Container entrypoint
- containers/lightspeed-rag/requirements.txt - Python dependencies
Modified Files
- pkg/config/config.json - Added Lightspeed configuration parameters
- pkg/config/config.go - Added Lightspeed configuration methods
- cmd/root.go - Integrated lightspeed command into CLI structure
- CLAUDE.md - Added comprehensive Lightspeed documentation
🌟 Key Benefits
Intelligent Assistance
Automatic GPU Optimization
Developer Experience
🚀 Future Enhancements Ready
This foundation enables future integrations with the krkn-lightspeed repository for:
🏁 Build and Test Instructions
Assisted-by: Claude Sonnet 4
Signed-off-by: Tullio Sebastiani <tsebasti@redhat.com>