Automated parameter tuning for LLM inference engines.
Features:
- Dual Deployment Modes: OME (Kubernetes) or Docker (Standalone)
- Web API: FastAPI-based REST API for task management
- Background Processing: ARQ task queue with Redis
- Database: SQLite for task and experiment tracking
OME Mode: Full-featured Kubernetes deployment using the OME operator.
Use cases:
- Production deployments
- Multi-node clusters
- Advanced orchestration needs
Requirements:
- Kubernetes v1.28+
- OME operator installed
- kubectl configured
Quick start:
./install.sh --install-ome
python src/run_autotuner.py examples/simple_task.json --mode ome

Docker Mode: Lightweight standalone deployment using Docker containers.
Use cases:
- Development and testing
- Single-node deployments
- CI/CD pipelines
- Quick prototyping
Requirements:
- Docker with GPU support
- Model files downloaded locally
- No Kubernetes needed
Quick start:
pip install docker
python src/run_autotuner.py examples/docker_task.json --mode docker

See docs/DOCKER_MODE.md for complete Docker mode documentation.
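Under the hood, Docker mode drives containers through the Docker SDK for Python (the `pip install docker` step above). The sketch below only illustrates that mechanism and is not the actual docker_controller.py code; the image name, server flags, ports, and paths are assumptions.

```python
# Hypothetical sketch: launch one GPU-enabled inference container for a single
# candidate parameter set via the Docker SDK (docker-py).
import docker

client = docker.from_env()

params = {"tp_size": 1, "mem_frac": 0.85}  # one grid point from the task config

container = client.containers.run(
    image="lmsysorg/sglang:latest",              # illustrative image name
    command=[
        "python", "-m", "sglang.launch_server",  # flags are illustrative
        "--model-path", "/models/llama-3-2-1b-instruct",
        "--tp-size", str(params["tp_size"]),
        "--mem-fraction-static", str(params["mem_frac"]),
    ],
    detach=True,
    device_requests=[docker.types.DeviceRequest(count=-1, capabilities=[["gpu"]])],
    volumes={"/mnt/data/models": {"bind": "/models", "mode": "ro"}},
    ports={"30000/tcp": 8000},  # expose the server on host port 8000
)
print("started", container.short_id)
```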
The autotuner includes a FastAPI-based web service for managing tuning tasks:
Features:
- Create and manage tuning tasks via REST API
- Track experiment progress and results
- Background job processing with ARQ and Redis
- OpenAPI/Swagger documentation at /docs
Starting the API server:
# From anywhere in the project
cd /root/work/inference-autotuner/src
python web/server.py
# Or using the virtual environment
/root/work/inference-autotuner/env/bin/python src/web/server.py

The server will start on http://0.0.0.0:8000 with:
- API endpoints at /api/*
- Interactive docs at /docs
- Health check at /health
Key Endpoints:
- POST /api/tasks/ - Create new tuning task
- GET /api/tasks/ - List all tasks
- GET /api/tasks/{id} - Get task details
- POST /api/tasks/{id}/start - Start task execution
- GET /api/experiments/task/{id} - Get experiments for task
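For programmatic access, the endpoints can be driven with any HTTP client. The snippet below is a hedged example using requests: it assumes the request body mirrors examples/simple_task.json and that responses carry an id field, while the authoritative field names live in the Pydantic schemas under src/web/schemas/.

```python
# Illustrative REST client usage (field names are assumptions, see src/web/schemas/).
import json

import requests

BASE = "http://localhost:8000"

with open("examples/simple_task.json") as f:
    task_config = json.load(f)

# Create a task, start it, then poll its status.
task = requests.post(f"{BASE}/api/tasks/", json=task_config).json()
task_id = task["id"]  # assumes the response includes an "id" field

requests.post(f"{BASE}/api/tasks/{task_id}/start")

status = requests.get(f"{BASE}/api/tasks/{task_id}").json()
print(status)
```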
Database Storage: Task and experiment data is stored in SQLite at:
~/.local/share/inference-autotuner/autotuner.db
This location follows XDG Base Directory standards and persists independently of the codebase.
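If you want to inspect the database directly, a small sqlite3 snippet is enough; the sketch below only lists the tables rather than assuming specific table names.

```python
# Peek at the SQLite database without assuming its schema: list the tables.
import sqlite3
from pathlib import Path

db_path = Path.home() / ".local/share/inference-autotuner/autotuner.db"
conn = sqlite3.connect(db_path)
tables = conn.execute(
    "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name"
).fetchall()
print([name for (name,) in tables])
conn.close()
```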
inference-autotuner/
├── src/ # Main source code
│ ├── controllers/ # Deployment controllers
│ │ ├── ome_controller.py
│ │ ├── docker_controller.py
│ │ ├── benchmark_controller.py
│ │ └── direct_benchmark_controller.py
│ ├── utils/ # Utilities
│ │ └── optimizer.py # Parameter grid generation
│ ├── templates/ # Kubernetes YAML templates
│ ├── web/ # Web API (FastAPI)
│ │ ├── app.py # FastAPI application
│ │ ├── server.py # Development server
│ │ ├── config.py # Settings configuration
│ │ ├── routes/ # API endpoints
│ │ ├── db/ # Database models & session
│ │ ├── schemas/ # Pydantic schemas
│ │ └── workers/ # ARQ background workers
│ ├── orchestrator.py # Main orchestration logic
│ └── run_autotuner.py # CLI entry point
├── examples/ # Task configuration examples
│ ├── simple_task.json # OME mode example
│ └── docker_task.json # Docker mode example
├── config/ # Kubernetes resources
├── docs/ # Documentation
├── requirements.txt # Python dependencies
└── README.md
Key Components:
- CLI Interface: src/run_autotuner.py - Command-line tool for running experiments
- Web API: src/web/ - REST API for task management
- Orchestrator: src/orchestrator.py - Core experiment coordination logic
- Controllers: src/controllers/ - Deployment-specific implementations
- Database: ~/.local/share/inference-autotuner/autotuner.db - SQLite storage
IMPORTANT: OME (Open Model Engine) is a required prerequisite for OME mode.
- OME Operator (Open Model Engine) - REQUIRED
  - Version: v0.1.3 or later
  - Installed in the ome namespace
  - All CRDs must be present: inferenceservices, benchmarkjobs, clusterbasemodels, clusterservingruntimes
  - Installation Guide: See docs/OME_INSTALLATION.md for detailed setup instructions
- Kubernetes cluster (v1.28+) with OME installed
  - Tested on Minikube v1.34.0
  - Single-node or multi-node cluster
  - GPU support required for inference workloads
- kubectl configured to access the cluster
- Python 3.8+ with pip
- Model and Runtime Resources
  - At least one ClusterBaseModel available
  - At least one ClusterServingRuntime configured
  - Example: llama-3-2-1b-instruct model with llama-3-2-1b-instruct-rt runtime
  - Setup instructions in docs/OME_INSTALLATION.md
For Docker Mode:

- Docker with GPU support
  - Docker 20.10+ with NVIDIA Container Toolkit
  - Test: docker run --rm --gpus all nvidia/cuda:12.1.0-base-ubuntu22.04 nvidia-smi
- Python 3.8+ with dependencies
  pip install -r requirements.txt
- Model files downloaded locally
  mkdir -p /mnt/data/models
  huggingface-cli download meta-llama/Llama-3.2-1B-Instruct \
    --local-dir /mnt/data/models/llama-3-2-1b-instruct
- genai-bench for benchmarking
  pip install genai-bench
- Redis (optional, for Web API background jobs)
  docker run -d -p 6379:6379 redis:alpine
See docs/DOCKER_MODE.md for complete setup guide.
For OME Mode:
# Check Kubernetes connection
kubectl cluster-info
# Check OME installation
kubectl get pods -n ome
kubectl get crd | grep ome.io
# Check available models and runtimes
kubectl get clusterbasemodels
kubectl get clusterservingruntimes
# Verify resources
kubectl describe node | grep -A 5 "Allocated resources"

For Docker Mode:
# Check Docker
docker --version
docker ps
# Check GPU access
nvidia-smi
docker run --rm --gpus all nvidia/cuda:12.1.0-base-ubuntu22.04 nvidia-smi
# Check Python dependencies
python -c "import docker; print('Docker SDK:', docker.__version__)"
python -c "import genai_bench; print('genai-bench installed')"Expected output:
- OME controller pods running
- CRDs: inferenceservices.ome.io, benchmarkjobs.ome.io, etc.
- At least one model in Ready state
- At least one runtime available
The installation script automatically installs all dependencies including OME:
# Clone repository
git clone <repository-url>
cd inference-autotuner
# Run installation with OME
./install.sh --install-ome

This will:
- ✅ Install Python virtual environment and dependencies
- ✅ Install genai-bench CLI
- ✅ Install cert-manager (OME dependency)
- ✅ Install OME operator with all CRDs
- ✅ Create Kubernetes namespace and PVC
- ✅ Verify all installations
If you prefer to install OME separately or already have it installed:
# 1. Install OME first (if not already installed)
# See docs/OME_INSTALLATION.md for detailed instructions
# 2. Run autotuner installation
./install.sh

./install.sh --help # Show all options
./install.sh --install-ome # Install with OME (recommended)
./install.sh --skip-venv # Skip Python virtual environment
./install.sh --skip-k8s # Skip Kubernetes resources

After installation, create model and runtime resources:
# Apply example resources (requires model access)
kubectl apply -f third_party/ome/config/models/meta/Llama-3.2-1B-Instruct.yaml
# Or create your own ClusterBaseModel and ClusterServingRuntime
# See docs/OME_INSTALLATION.md for examples

# Show help
python src/run_autotuner.py --help
# OME mode (default) with K8s BenchmarkJob
python src/run_autotuner.py examples/simple_task.json
# OME mode with direct genai-bench CLI
python src/run_autotuner.py examples/simple_task.json --direct
# Docker mode (standalone)
python src/run_autotuner.py examples/docker_task.json --mode docker
# Docker mode with custom model path
python src/run_autotuner.py examples/docker_task.json --mode docker --model-path /data/models

| CLI Argument | Description | Default |
|---|---|---|
| --mode ome | Use Kubernetes + OME | Yes |
| --mode docker | Use standalone Docker | No |
| --direct | Use direct genai-bench CLI (OME mode only) | No |
| --kubeconfig PATH | Path to kubeconfig (OME mode) | Auto-detect |
| --model-path PATH | Base path for models (Docker mode) | /mnt/data/models |
The autotuner supports two benchmark execution modes:
- Kubernetes BenchmarkJob Mode (OME mode only):
- Uses OME's BenchmarkJob CRD
- Runs genai-bench in Kubernetes pods
- Requires working genai-bench Docker image
- More complex but native to OME
- Direct CLI Mode (Recommended):
- Runs genai-bench directly using local installation
- Automatic port forwarding to InferenceService
- Bypasses Docker image issues
- Faster and more reliable for prototyping
Run benchmarks using the local genai-bench installation:
python src/run_autotuner.py examples/simple_task.json --direct

How it works:
- Deploys InferenceService via OME
- Automatically sets up kubectl port-forward to access the service
- Runs genai-bench CLI directly from env/bin/genai-bench
- Cleans up port forward after completion
- No Docker image dependencies
Requirements:
- genai-bench installed in Python environment (pip install genai-bench)
- kubectl configured and accessible
- No additional configuration needed
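The essence of the mechanism above can be sketched in a few lines of Python; the service name, namespace, and ports below are placeholders, and the real logic lives in src/controllers/direct_benchmark_controller.py.

```python
# Rough sketch of direct mode: port-forward the InferenceService, run the
# benchmark against localhost, then tear the forward down.
import subprocess
import time

pf = subprocess.Popen(
    ["kubectl", "port-forward", "svc/my-inference-service", "8080:80", "-n", "autotuner"]
)
try:
    time.sleep(5)  # give the tunnel a moment to come up
    # Placeholder for the real benchmark invocation against http://localhost:8080;
    # see genai-bench's own documentation for its CLI flags.
    subprocess.run(["env/bin/genai-bench", "--help"], check=True)
finally:
    pf.terminate()
    pf.wait()
```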
Run benchmarks using OME's BenchmarkJob CRD:
python src/run_autotuner.py examples/simple_task.json

How it works:
- Creates Kubernetes BenchmarkJob resources
- Uses genai-bench Docker image
- Results stored in PersistentVolumeClaim
Requirements:
- PVC created (see installation step 3b)
- Working genai-bench Docker image accessible to cluster
See examples/simple_task.json for the schema:
{
"task_name": "simple-tune",
"description": "Description of the tuning task",
"model": {
"name": "llama-3-2-1b-instruct",
"namespace": "autotuner"
},
"base_runtime": "sglang-base-runtime",
"parameters": {
"tp_size": {"type": "choice", "values": [1, 2]},
"mem_frac": {"type": "choice", "values": [0.85, 0.9]}
},
"optimization": {
"strategy": "grid_search",
"objective": "minimize_latency",
"max_iterations": 4,
"timeout_per_iteration": 600
},
"benchmark": {
"task": "text-to-text",
"traffic_scenarios": ["D(100,100)"],
"num_concurrency": [1, 4],
"max_time_per_iteration": 10,
"max_requests_per_iteration": 50,
"additional_params": {"temperature": "0.0"}
}
}

# Basic usage (uses default kubeconfig)
python src/run_autotuner.py examples/simple_task.json
# Specify kubeconfig path
python src/run_autotuner.py examples/simple_task.json /path/to/kubeconfig

Results are saved to results/<task_name>_results.json
- Load Task: Read JSON configuration file
- Generate Parameter Grid: Create all parameter combinations (grid search)
- For Each Configuration:
- Deploy InferenceService with parameters
- Wait for service to be ready
- Create and run BenchmarkJob
- Collect metrics
- Clean up resources
- Find Best: Compare objective scores and report best configuration
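Step 2 (parameter grid generation) is conceptually just a cartesian product over the choice values. A minimal sketch, which may differ from the actual src/utils/optimizer.py implementation:

```python
# Expand "choice" parameters into every combination (grid search).
from itertools import product

parameters = {
    "tp_size": {"type": "choice", "values": [1, 2]},
    "mem_frac": {"type": "choice", "values": [0.85, 0.9]},
}

names = list(parameters)
grid = [
    dict(zip(names, combo))
    for combo in product(*(parameters[n]["values"] for n in names))
]
print(grid)
# [{'tp_size': 1, 'mem_frac': 0.85}, {'tp_size': 1, 'mem_frac': 0.9},
#  {'tp_size': 2, 'mem_frac': 0.85}, {'tp_size': 2, 'mem_frac': 0.9}]
```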
Task: simple-tune (4 combinations: 2 x 2)
Experiment 1: {tp_size: 1, mem_frac: 0.85}
→ Deploy InferenceService
→ Wait for ready
→ Run benchmark
→ Score: 125.3ms
Experiment 2: {tp_size: 1, mem_frac: 0.9}
→ Deploy InferenceService
→ Wait for ready
→ Run benchmark
→ Score: 118.7ms
... (continue for all combinations)
Best: {tp_size: 2, mem_frac: 0.9} → Score: 89.2ms
Parameter types currently supported:
- choice: List of discrete values

Optimization strategies currently supported:
- grid_search: Exhaustive search over all combinations

Optimization objectives currently supported:
- minimize_latency: Minimize average end-to-end latency
- maximize_throughput: Maximize tokens/second
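Once every experiment has a score, picking the winner reduces to a min or max over the objective. A minimal sketch (the actual aggregation in src/orchestrator.py may differ), reusing the scores from the example run above:

```python
# Select the best experiment according to the configured objective.
def pick_best(experiments, objective):
    """experiments: list of {'params': dict, 'score': float}."""
    reverse = objective == "maximize_throughput"  # higher is better for throughput
    return sorted(experiments, key=lambda e: e["score"], reverse=reverse)[0]

experiments = [
    {"params": {"tp_size": 1, "mem_frac": 0.85}, "score": 125.3},
    {"params": {"tp_size": 2, "mem_frac": 0.9}, "score": 89.2},
]
print(pick_best(experiments, "minimize_latency"))
# {'params': {'tp_size': 2, 'mem_frac': 0.9}, 'score': 89.2}
```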
- ~~No database persistence~~ ✅ SQLite database implemented
- ~~No web frontend~~ ✅ REST API implemented (frontend TODO)
- Grid search only (no Bayesian optimization)
- Sequential execution (no parallel experiments)
- Basic error handling
- Simplified metric extraction
Completed:
- ✅ Database persistence (SQLite with SQLAlchemy)
- ✅ REST API (FastAPI with OpenAPI docs)
- ✅ Background job processing (ARQ with Redis)
- ✅ Dual deployment modes (OME + Docker)
- ✅ User data separation (~/.local/share/)
- ✅ Code reorganization (unified src/ structure)
TODO:
- Web frontend (React/Vue.js)
- Bayesian optimization
- Parallel experiment execution
- Advanced error handling
- Real-time progress tracking via WebSocket
- Result visualization and comparison
For detailed troubleshooting guidance, common issues, and solutions, see docs/TROUBLESHOOTING.md.
Quick reference:
- InferenceService deployment issues
- GPU resource problems
- Docker and Kubernetes configuration
- Model download and transfer
- Benchmark execution errors
- Monitoring and performance tips
For a production implementation:
- ~~Add database backend~~ ✅ Completed - SQLite with SQLAlchemy ORM
- ~~Implement REST API~~ ✅ Completed - FastAPI with OpenAPI
- Add web UI (React/Vue.js + WebSocket for real-time updates)
- Add Bayesian optimization (switch from grid search)
- Enable parallel experiment execution (multi-threaded/async)
- Improve error handling and retry logic
- Add comprehensive logging and monitoring
- Implement metric aggregation and visualization
- Add user authentication and multi-tenancy
- Migrate to PostgreSQL for production scale
- DOCKER_MODE.md - Docker deployment guide
- OME_INSTALLATION.md - Kubernetes/OME setup
- TROUBLESHOOTING.md - Common issues and solutions
- GENAI_BENCH_LOGS.md - Viewing benchmark logs
See CLAUDE.md for development guidelines and project architecture.