CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

Project Overview

RamaLama is a CLI tool for managing and serving AI models using containers. It provides a container-centric approach to AI model management, supporting multiple model registries (Hugging Face, Ollama, OCI registries) and automatic GPU detection with appropriate container image selection.

Build and Development Commands

Setup

make install-requirements    # Install dev dependencies via pip

Testing

# Unit tests (pytest via tox)
make unit-tests              # Run unit tests
make unit-tests-verbose      # Run with full trace output
tox                          # Direct tox invocation

# E2E tests (pytest)
make e2e-tests               # Run with Podman (default)
make e2e-tests-docker        # Run with Docker
make e2e-tests-nocontainer   # Run without container engine

# All tests
make tests                   # Run unit tests and e2e tests, with and without a container engine

Running a single test

# Unit test
tox -- test/unit/test_cli.py::test_function_name -vvv

# E2E test
tox -e e2e -- test/e2e/test_basic.py::test_function_name -vvv

Code Quality

make validate                # Run all validation (codespell, lint, format check, man-check, type check)
make lint                    # Run ruff + shellcheck
make check-format            # Check ruff formatting + import sorting
make format                  # Auto-format with ruff + import sorting
make type-check              # Run mypy type checking
make codespell               # Check spelling

Documentation

make docs                    # Build manpages and docsite

Architecture

Source Structure (ramalama/)

  • cli.py - Main CLI entry point, argparse setup, subcommand dispatch
  • config.py - Configuration constants and loading
  • common.py - Shared utility functions, GPU detection (get_accel())
  • engine.py - Container engine abstraction (Podman/Docker)

Transport System (ramalama/transports/)

Handles pulling/pushing models from different registries:

  • base.py - Base Transport class defining the interface
  • transport_factory.py - New() and TransportFactory for creating transports
  • huggingface.py, ollama.py, oci.py, modelscope.py, rlcr.py - Registry-specific implementations
  • Transports are selected via URL scheme prefixes: huggingface://, ollama://, oci://, etc.
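The scheme-prefix selection above can be sketched as a small dispatch function. This is an illustrative sketch, not RamaLama's actual `transport_factory.py` code; the function name `select_transport` and the fallback behavior are assumptions for demonstration only.

```python
# Hypothetical sketch of URL-scheme transport dispatch.
# select_transport() and its fallback are illustrative, not RamaLama's API.
KNOWN_SCHEMES = ("huggingface", "ollama", "oci", "modelscope", "rlcr")

def select_transport(model_ref: str, default: str = "ollama") -> str:
    """Map a reference like 'huggingface://org/model' to a transport name."""
    for scheme in KNOWN_SCHEMES:
        if model_ref.startswith(scheme + "://"):
            return scheme
    # Bare references (no scheme prefix) fall back to a default transport.
    return default

print(select_transport("oci://quay.io/org/model"))  # oci
print(select_transport("tinyllama"))                # ollama
```

In the real code, the factory returns a Transport instance (a subclass of the base class in base.py) rather than a string, but the prefix-matching idea is the same.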

Model Store (ramalama/model_store/)

Manages local model storage:

  • global_store.py - GlobalModelStore for model management
  • store.py - Low-level storage operations
  • reffile.py - Reference file handling for tracking model origins

Runtime Plugin System (ramalama/plugins/)

Each inference engine is a self-contained Python plugin:

  • interface.py - Abstract base classes: RuntimePlugin and InferenceRuntimePlugin
  • loader.py - Plugin discovery and get_all_runtimes() / get_runtime(); entry point group ramalama.runtimes.v1alpha
  • registry.py - Plugin registration
  • runtimes/inference/common.py - Concrete base classes:
    • BaseInferenceRuntime — registers run and serve, implements handle_subcommand() dispatch
    • ContainerizedInferenceRuntimePlugin — extends the above with container-specific args (--api, --generate)
  • runtimes/inference/llama_cpp.py - LlamaCppPlugin(LlamaCppCommands, ContainerizedInferenceRuntimePlugin) (default); owns all RAG subcommand logic
  • runtimes/inference/llama_cpp_commands.py - LlamaCppCommands mixin providing _cmd_run, _cmd_serve, and other llama.cpp command builders
  • runtimes/inference/vllm.py - VllmPlugin(ContainerizedInferenceRuntimePlugin)
  • runtimes/inference/mlx.py - MlxPlugin(BaseInferenceRuntime) (macOS only; always --nocontainer)

configure_subcommands() in cli.py calls register_subcommands() on only the selected runtime plugin, so --help output is filtered to the active runtime's supported subcommands.

Key Patterns

  • GPU Detection: get_accel() in common.py detects GPU type (CUDA, ROCm, Vulkan, etc.) and selects appropriate container image
  • Container Images: GPU-specific images at quay.io/ramalama/{ramalama,cuda,rocm,intel-gpu,...}
  • Inference Engines: llama.cpp (default), vllm, mlx (macOS only) - each implemented as a runtime plugin under ramalama/plugins/runtimes/inference/

Test Structure

  • test/unit/ - pytest unit tests (fast, no external dependencies)
  • test/e2e/ - pytest end-to-end tests (marked with @pytest.mark.e2e)

Code Style

  • Python 3.10+ required
  • Line length: 120 characters
  • Formatting: ruff format + ruff check (I rules)
  • Type hints encouraged (mypy checked)
  • Commits require DCO sign-off (git commit -s)
  • Commit messages for PRs:
    • Use exactly one sign-off, added via git commit -s only.
    • Do not manually add a "Signed-off-by:" line in the message body.
    • Do not add any trailers (e.g. "Made-with: Cursor") after the sign-off.
    • The author's Signed-off-by must be the last line of the commit message.