diff --git a/skills/configure-cache-llm-d/SKILL.md b/skills/configure-cache-llm-d/SKILL.md
new file mode 100644
index 0000000..b12b2fa
--- /dev/null
+++ b/skills/configure-cache-llm-d/SKILL.md
@@ -0,0 +1,320 @@
---
name: configure-cache-llm-d
description: Configure and tune cache memory settings in existing llm-d deployments. Use this skill when you need to optimize cache performance by adjusting GPU memory utilization, KV cache capacity, shared memory, block size, or context length. Ideal for improving throughput, reducing latency, supporting longer contexts, fixing OOM errors, or tuning cache hit rates in production deployments.
---

# llm-d Cache Configuration Skill

## 📋 Command Execution Notice

**Before executing any command, I will:**
1. **Explain what the command does** - Clear description of purpose and expected outcome
2. **Show the actual command** - The exact command to be executed
3. **Explain why it's needed** - How it fits into the workflow

> ## 🔔 ALWAYS NOTIFY BEFORE CREATING RESOURCES
>
> **RULE**: Before creating ANY resource (namespaces, files, Kubernetes objects), notify the user first.
>
> **Format**: "I am about to create `<resource-type>` named `<name>` because `<reason>`. Proceeding now."
>
> **Never silently create resources.** Check existence first, then notify before acting.

## Critical Rules

1. **ALWAYS use existing skill scripts first** - Use `show-current-config.sh` and `update-cache-config.sh` before manual edits. Only perform manual edits if the scripts fail due to a non-standard deployment structure.

2. **Check for existing resources** - Before deployment, check for old/conflicting deployments and clean them up. Use `helm list -n ${NAMESPACE}` and `kubectl get all -n ${NAMESPACE}`.

3. **Verify cluster resources** - Check available GPU/RDMA resources before applying changes. Use `kubectl describe nodes` to verify capacity.

4. **Do NOT change cluster-level definitions** - All changes must be within the designated namespace.
Never modify cluster-wide resources. Always scope commands with `-n ${NAMESPACE}`. + +5. **Do NOT modify existing repository code** - Only create new files. Never edit pre-existing repository files. + +6. **Script modifications** - If existing scripts need updates, copy them to your deployment directory and modify the copy. Never edit scripts in `skills/configure-cache-llm-d/scripts/` directly. + +## Overview + +Modify cache settings in existing llm-d deployments: GPU memory utilization, block size, max context length, and shared memory (SHM). Changes apply via rolling updates with automatic backups. + +**For deployments with CPU offloading already enabled**: You can also tune CPU cache size and InferencePool prefix cache scorer configurations. + +**Note**: Initial setup of tiered prefix cache offloading (CPU RAM, local disk, or shared storage) requires redeployment. See [`guides/tiered-prefix-cache/README.md`](../../guides/tiered-prefix-cache/README.md) for new deployments. + +## When to Use + +This skill enables you to tune cache performance without redeployment: + +- **GPU Memory Utilization** (`-g`): Adjust GPU memory allocation (0.0-1.0) to balance throughput vs. OOM risk +- **Block Size** (`-b`): Change cache granularity (16-128 tokens) to optimize cache hit rates and memory efficiency +- **Max Context Length** (`-m`): Extend or reduce maximum context window to support longer documents or save memory +- **Shared Memory** (`-s`): Configure SHM size for multi-GPU tensor parallelism setups +- **CPU Cache Size**: Tune CPU offloading capacity for deployments with tiered caching already enabled +- **Prefix Cache Routing**: Adjust InferencePool scorer weights to optimize cache-aware request scheduling + + +## Workflow + +### 1. Check Current Configuration + +```bash +bash skills/configure-cache-llm-d/scripts/show-current-config.sh ${NAMESPACE} +``` + +### 2. 
Update Settings

**Preview first:**
```bash
bash skills/configure-cache-llm-d/scripts/update-cache-config.sh \
  -n ${NAMESPACE} -g 0.90 -b 32 --dry-run
```

**Apply:**
```bash
bash skills/configure-cache-llm-d/scripts/update-cache-config.sh \
  -n ${NAMESPACE} -g 0.90 -b 32
```

**Options:**
- `-g <value>` - GPU memory utilization (0.0-1.0)
- `-b <tokens>` - Block size in tokens (16-128)
- `-m <tokens>` - Max model length in tokens
- `-s <size>` - Shared memory size (e.g., 20Gi, 30Gi)

### 3. Verify

```bash
kubectl get pods -n ${NAMESPACE}
kubectl logs -l llm-d.ai/role=decode -n ${NAMESPACE} | grep -E "gpu_memory_utilization|block_size"
```

## Tuning CPU Cache (If Already Enabled)

If your deployment already has CPU offloading enabled via OffloadingConnector or LMCache, you can tune the CPU cache size and InferencePool configuration.

### Adjust CPU Cache Size

**For vLLM OffloadingConnector:**

Edit the deployment to modify `cpu_bytes_to_use` in the `--kv-transfer-config` argument:

```bash
kubectl edit deployment <deployment-name> -n ${NAMESPACE}
```

Find and modify the `cpu_bytes_to_use` value (in bytes):
```yaml
--kv-transfer-config '{"kv_connector":"OffloadingConnector","kv_role":"kv_both","kv_connector_extra_config":{"cpu_bytes_to_use":107374182400}}'
```

Example: Change from 100GB (107374182400) to 150GB (161061273600)

**For LMCache Connector:**

Edit the deployment to modify the `LMCACHE_MAX_LOCAL_CPU_SIZE` environment variable:

```bash
kubectl edit deployment <deployment-name> -n ${NAMESPACE}
```

Find and modify the environment variable (in GB):
```yaml
- name: LMCACHE_MAX_LOCAL_CPU_SIZE
  value: "200.0" # Change to desired size in GB
```

### Tune InferencePool Prefix Cache Scorers

**What are prefix cache scorers?**
The InferencePool uses scorers to decide which server should handle each request.
When CPU offloading is enabled, you configure separate scorers for GPU cache and CPU cache to help route requests to servers that already have relevant cached data.

**Tuning the configuration:**

```bash
helm upgrade llm-d-infpool \
  oci://registry.k8s.io/gateway-api-inference-extension/charts/inferencepool \
  --version v1.4.0 -n ${NAMESPACE} -f <your-values.yaml>
```

**Key parameters:**

1. **`lruCapacityPerServer`**: Total CPU cache capacity per server (in blocks)
   - Must be manually configured since vLLM doesn't emit CPU block metrics
   - Example: `41000` blocks = ~100GB for Qwen-32B (41,000 blocks × 2.5MB/block)
   - Calculation: 160KB/token × 16 block size = 2.5MB/block
   - Adjust based on your model's block size (check vLLM logs)

2. **Scorer weights**: Balance how the InferencePool prioritizes different factors
   - Default: queue (2.0), kv-cache-util (2.0), gpu-prefix (1.0), cpu-prefix (1.0)
   - CPU cache is a superset of GPU cache (CPU offloading copies GPU entries to CPU)
   - The combined GPU + CPU prefix scorer weight (1.0 + 1.0 = 2.0) balances with the other scorers
   - Tune the ratio between GPU and CPU scorers based on your workload

See [`guides/tiered-prefix-cache/cpu/manifests/inferencepool/values.yaml`](../../guides/tiered-prefix-cache/cpu/manifests/inferencepool/values.yaml) for a full configuration example.

## Common Scenarios

### Increase Cache Hit Rate
```bash
bash skills/configure-cache-llm-d/scripts/update-cache-config.sh \
  -n ${NAMESPACE} -g 0.88 -b 32
```
Reduces GPU memory (0.95→0.88) for more cache, and decreases block size (64→32) for finer-grained prefix matching.

### Support Longer Contexts
```bash
bash skills/configure-cache-llm-d/scripts/update-cache-config.sh \
  -n ${NAMESPACE} -m 16384 -g 0.85 -s 30Gi
```
Increases max length (8192→16384), reduces GPU memory (0.95→0.85), increases SHM (20Gi→30Gi).
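To reason about how `-g` and `-b` translate into cache capacity, a rough sizing sketch can help. All device and model numbers below are illustrative assumptions; the 160KB/token and resulting 2.5MB/block figures match the Qwen-32B calculation earlier in this document.

```bash
# Back-of-the-envelope KV cache sizing (all values are illustrative assumptions).
GPU_MEM_GIB=80          # assumed total GPU memory per device
GPU_MEM_UTIL_PCT=85     # -g 0.85, expressed as a percentage for integer math
WEIGHTS_GIB=62          # assumed model weights + runtime overhead
KV_PER_TOKEN_KB=160     # per-token KV size from the Qwen-32B example above
BLOCK_SIZE=16           # -b 16

BLOCK_KB=$(( KV_PER_TOKEN_KB * BLOCK_SIZE ))                        # 2560KB = 2.5MB/block
CACHE_GIB=$(( GPU_MEM_GIB * GPU_MEM_UTIL_PCT / 100 - WEIGHTS_GIB )) # memory left for KV cache
CACHE_BLOCKS=$(( CACHE_GIB * 1024 * 1024 / BLOCK_KB ))              # blocks that fit in that budget
echo "KV cache: ~${CACHE_GIB}GiB => ~${CACHE_BLOCKS} blocks of ${BLOCK_KB}KB"
```

With these assumed numbers the sketch reports roughly 6GiB of KV cache, or about 2,457 blocks; raising `-g` or lowering the weights footprint grows the block count linearly.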
### Maximize Throughput
```bash
bash skills/configure-cache-llm-d/scripts/update-cache-config.sh \
  -n ${NAMESPACE} -g 0.95 -b 64
```
Increases GPU memory (0.90→0.95) for more capacity, with the standard block size (32→64).

### Fix OOM Errors
```bash
bash skills/configure-cache-llm-d/scripts/update-cache-config.sh \
  -n ${NAMESPACE} -g 0.85
```
Reduces GPU memory utilization (0.95→0.85) to relieve memory pressure.

### Adjust Shared Memory for Multi-GPU
```bash
bash skills/configure-cache-llm-d/scripts/update-cache-config.sh \
  -n ${NAMESPACE} -s 50Gi
```
Increases SHM (20Gi→50Gi) based on the tensor parallelism configuration.

### Manual Edits for Non-Standard Deployments
If the script fails, manually edit the configuration files:
1. Update `values-modelservice.yaml`: Change `--block-size` and `--gpu-memory-utilization`
2. Update `values-inferencepool.yaml`: Adjust `lruCapacityPerServer`
3. Apply: `cd deployment-dir && helmfile apply -n ${NAMESPACE}`
4. Verify: `kubectl rollout status deployment/<name> -n ${NAMESPACE}`

## Non-Standard Deployment Patterns

For deployments with custom directory structures or file naming:

**Using Scripts:**
```bash
# Specify deployment directory explicitly
bash skills/configure-cache-llm-d/scripts/update-cache-config.sh \
  -d deployments/your-deployment -n ${NAMESPACE} -g 0.95 -b 64
```

**Manual Updates:**
1. Locate your ModelService and InferencePool values files
2. Edit the ModelService values for `--block-size` and `--gpu-memory-utilization`
3. **Important**: When changing block size, recalculate the InferencePool cache capacities:
   - Formula: `new_capacity = old_capacity × (old_block_size / new_block_size)`
   - Example: 32→64 blocks: GPU cache 31,250→15,625, CPU cache 41,000→20,500
4.
Apply changes: `cd deployment-dir && helmfile apply -n ${NAMESPACE}`

## Validation and Rollback

### Validate Configuration Consistency

Before applying changes, verify:
```bash
# Check block size consistency
kubectl get deployment -n ${NAMESPACE} -o yaml | grep "block-size"

# Verify cache capacity calculations
kubectl get inferencepool -n ${NAMESPACE} -o yaml | grep "lruCapacityPerServer"

# Check SHM allocation
kubectl get pods -n ${NAMESPACE} -o yaml | grep -A 2 "shm"
```

### Rollback Procedure

If changes cause issues:
```bash
# Automatic backups are created in deployments/<deployment>/backups/
# as <file>.YYYYMMDD-HHMMSS (timestamped copies made by update-cache-config.sh)
cd deployments/<deployment>/backups/

# Restore from backup
cp ms-values.yaml.YYYYMMDD-HHMMSS ../ms-values.yaml
cp gaie-values.yaml.YYYYMMDD-HHMMSS ../gaie-values.yaml

# Reapply
cd ..
helmfile apply -n ${NAMESPACE}

# Verify rollback
kubectl rollout status deployment/<name> -n ${NAMESPACE}
kubectl logs -l llm-d.ai/role=decode -n ${NAMESPACE} | grep -E "gpu_memory_utilization|block_size"
```

## Monitoring

### Monitoring Commands
```bash
# KV Cache Usage
kubectl logs -l llm-d.ai/role=decode -n ${NAMESPACE} | grep "kv_cache_usage"

# GPU Memory
kubectl exec <pod-name> -n ${NAMESPACE} -- nvidia-smi

# Cache Hit Rate
kubectl logs -l inferencepool=<pool-name> -n ${NAMESPACE} | grep "cache_hit_rate"
```

## Troubleshooting Guidance

For detailed troubleshooting guidance, see [TROUBLESHOOTING.md](./references/TROUBLESHOOTING.md).

## Pre-Deployment Checklist

Before applying cache configuration changes:

1. **Check for old deployments** (ask user before cleanup):
   ```bash
   helm list -n ${NAMESPACE}
   kubectl get all -n ${NAMESPACE}
   ```
   If old deployments exist, **ask user**: "Found old deployments [list]. Should I clean them up?"
   Only proceed with cleanup after user approval.

2.
**Verify cluster resources**: + ```bash + kubectl describe nodes | grep -A 5 "Allocated resources" + kubectl get nodes -o custom-columns=NAME:.metadata.name,GPUs:.status.allocatable."nvidia\.com/gpu" + ``` + +3. **Check current configuration**: + ```bash + bash skills/configure-cache-llm-d/scripts/show-current-config.sh ${NAMESPACE} + ``` + +4. **Preview changes with --dry-run**: + ```bash + bash skills/configure-cache-llm-d/scripts/update-cache-config.sh -n ${NAMESPACE} -g 0.95 --dry-run + ``` + +## Safety Guidelines + +- ✅ **NEVER delete resources without user approval** +- ✅ Check for conflicting deployments, ask before cleanup +- ✅ Verify sufficient cluster resources (GPU, RDMA, memory) +- ✅ Always check current config first +- ✅ Use `--dry-run` to preview changes +- ✅ Automatic backups created in `deployments//backups/` +- ✅ Rolling updates maintain availability +- ✅ Verify settings in pod logs after changes + +### Guides +- **[Tiered Prefix Cache](../../guides/tiered-prefix-cache/README.md)**: Comprehensive guide on prefix cache offloading strategies + - **[CPU Offloading](../../guides/tiered-prefix-cache/cpu/README.md)**: Initial setup requires redeployment; tuning can be done on existing deployments + - **[Storage Offloading](../../guides/tiered-prefix-cache/storage/README.md)**: Requires redeployment +- **[Inference Scheduling](../../guides/inference-scheduling/README.md)**: Prefix-aware request scheduling optimizations + +## Scripts Reference + +See [scripts/README.md](scripts/README.md) for detailed documentation. 
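As a worked example of the block-size capacity recalculation described under Non-Standard Deployment Patterns, a small hypothetical helper (for illustration only; the inputs mirror the 32→64 example above) could look like:

```bash
# Hypothetical helper: recompute an InferencePool lruCapacityPerServer value
# after a block-size change, using the formula from the manual-update steps:
#   new_capacity = old_capacity * (old_block_size / new_block_size)
recalc_capacity() {
  local old_capacity=$1 old_block=$2 new_block=$3
  echo $(( old_capacity * old_block / new_block ))
}

recalc_capacity 31250 32 64   # GPU cache example: prints 15625
recalc_capacity 41000 32 64   # CPU cache example: prints 20500
```

Doubling the block size halves the number of blocks the same memory holds, which is why both capacities shrink by the same ratio.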
\ No newline at end of file
diff --git a/skills/configure-cache-llm-d/references/TROUBLESHOOTING.md b/skills/configure-cache-llm-d/references/TROUBLESHOOTING.md
new file mode 100644
index 0000000..06701cc
--- /dev/null
+++ b/skills/configure-cache-llm-d/references/TROUBLESHOOTING.md
@@ -0,0 +1,223 @@
# Troubleshooting Guidance

## Common Issues

### Deployment Issues

#### Helm Secret Conflict
**Symptom**: `Secret "xxx" exists and cannot be imported into the current release: invalid ownership metadata`

**Root Cause**: The Secret was owned by a previous Helm release with a different name. Helm tracks resource ownership via annotations and cannot automatically adopt resources from other releases.

**Solution**:
```bash
# Delete the conflicting secret
kubectl delete secret <secret-name> -n ${NAMESPACE}

# Then retry helmfile apply
helmfile apply -n ${NAMESPACE}
```

**Prevention**: Clean up old deployments before creating new ones with different release names.

#### Pods Pending - Insufficient Resources
**Symptom**: `0/X nodes available: Y Insufficient nvidia.com/gpu, Z Insufficient rdma/ib`

**Root Cause**: Insufficient GPU or RDMA resources in the cluster.

**Solutions**:

1. **Check resources first**:
   ```bash
   kubectl describe nodes | grep -A 5 "Allocated resources"
   kubectl get nodes -o custom-columns=NAME:.metadata.name,GPUs:.status.allocatable."nvidia\.com/gpu"
   ```

2. **Clean up old deployments** (ask user first):
   ```bash
   helm list -n ${NAMESPACE}
   # Ask user: "Found releases [X, Y]. Should I uninstall them?"
   # Only after approval: helm uninstall <release> -n ${NAMESPACE}
   ```

3. **Scale down replicas**:
   ```bash
   kubectl scale deployment <prefill-deployment> --replicas=1 -n ${NAMESPACE}
   kubectl scale deployment <decode-deployment> --replicas=2 -n ${NAMESPACE}
   ```

4.
**Make RDMA optional** (edit values file, comment out `rdma/ib` from resources) + +#### Script Can't Find Deployment Files +**Symptom**: `Error: Could not find ModelService values file` + +**Root Cause**: Script auto-detection expects standard file patterns but deployment uses different structure. + +**Solutions**: + +1. **For standard deployments**: + ```bash + bash skills/configure-cache-llm-d/scripts/update-cache-config.sh -n ${NAMESPACE} -g 0.95 + ``` + +2. **For custom directory structures**: + ```bash + # Specify deployment directory explicitly + bash skills/configure-cache-llm-d/scripts/update-cache-config.sh \ + -d deployments/your-deployment -n ${NAMESPACE} -g 0.95 -b 64 + ``` + +3. **For non-standard file naming**: + - Manually edit the values files + - Follow the manual update procedure in SKILL.md + +### Configuration Issues + +#### InferencePool Cache Capacity Mismatch After Block Size Change +**Symptom**: After changing block size, cache hit rates drop or InferencePool routing becomes inefficient. + +**Root Cause**: When block size changes, the InferencePool cache capacities (`lruCapacityPerServer`) must be recalculated. The script may not automatically update these values. 
+ +**Solution**: +```bash +# Manual calculation formula: +# new_capacity = old_capacity × (old_block_size / new_block_size) + +# Example: Changing from block size 32 to 64 +# GPU cache: 31,250 × (32/64) = 15,625 blocks +# CPU cache: 41,000 × (32/64) = 20,500 blocks + +# Edit your InferencePool values file +# Update lruCapacityPerServer for both gpu-prefix-cache-scorer and cpu-prefix-cache-scorer +``` + +**Verification**: +```bash +# Check InferencePool configuration +kubectl get inferencepool -n ${NAMESPACE} -o yaml | grep -A 5 "lruCapacityPerServer" + +# Verify block size in ModelService +kubectl logs -l llm-d.ai/role=decode -n ${NAMESPACE} | grep "block_size" +``` + +#### Block Size Inconsistency Between ModelService and InferencePool +**Symptom**: Cache routing inefficiency, unexpected cache misses. + +**Root Cause**: ModelService and InferencePool have different block size configurations. + +**Solution**: +```bash +# Check ModelService block size +kubectl get deployment -n ${NAMESPACE} -o yaml | grep "block-size" + +# Check InferencePool configuration +kubectl get inferencepool -n ${NAMESPACE} -o yaml | grep -A 10 "prefixCacheScorers" + +# Ensure both use the same block size +# Update both configurations to match +``` + +### Runtime Issues + +#### OOM Errors +**Symptom**: Pods crash with out-of-memory errors. + +**Solutions**: +```bash +# Reduce GPU memory utilization +bash skills/configure-cache-llm-d/scripts/update-cache-config.sh -n ${NAMESPACE} -g 0.85 + +# Reduce max model length +bash skills/configure-cache-llm-d/scripts/update-cache-config.sh -n ${NAMESPACE} -m 4096 + +# Check actual GPU memory usage +kubectl exec -n ${NAMESPACE} -- nvidia-smi +``` + +#### Low Cache Hit Rate +**Symptom**: Cache hit rate metrics show low values. 
**Solutions**:
```bash
# Decrease block size for finer-grained matching
bash skills/configure-cache-llm-d/scripts/update-cache-config.sh -n ${NAMESPACE} -b 32

# Reduce GPU memory to allocate more blocks
bash skills/configure-cache-llm-d/scripts/update-cache-config.sh -n ${NAMESPACE} -g 0.88 -b 32

# Verify block size consistency
kubectl logs -l llm-d.ai/role=decode -n ${NAMESPACE} | grep "block_size"
kubectl get inferencepool -n ${NAMESPACE} -o yaml | grep "lruCapacityPerServer"
```

#### SHM Errors
**Symptom**: Errors related to shared memory allocation, especially with tensor parallelism > 2.

**Solutions**:
```bash
# Increase shared memory size
bash skills/configure-cache-llm-d/scripts/update-cache-config.sh -n ${NAMESPACE} -s 40Gi

# Verify SHM allocation
kubectl exec <pod-name> -n ${NAMESPACE} -- df -h /dev/shm

# For TP=4, recommended SHM: 30-40Gi
# For TP=8, recommended SHM: 50-60Gi
```

#### Pods Not Restarting After Configuration Change
**Symptom**: Configuration changes were applied but pods are still running with the old settings.

**Solutions**:
```bash
# Force rolling restart
kubectl rollout restart deployment/<name> -n ${NAMESPACE}

# Monitor rollout status
kubectl rollout status deployment/<name> -n ${NAMESPACE}

# Verify new configuration in logs
kubectl logs -l llm-d.ai/role=decode -n ${NAMESPACE} | grep -E "gpu_memory_utilization|block_size"
```

## Pre-Deployment Checklist

See SKILL.md for the complete checklist. Key points:
- Check for old deployments (ask user before cleanup)
- Verify cluster resources
- Review current config
- Preview with `--dry-run`

## Rollback Procedure

If configuration changes cause issues:

```bash
# Navigate to backup directory
cd deployments/<deployment>/backups/

# List available backups
ls -lt

# Restore from the most recent backup (files are named <file>.YYYYMMDD-HHMMSS;
# adjust paths based on your deployment structure)
cp ms-values.yaml.YYYYMMDD-HHMMSS ../ms-values.yaml
cp gaie-values.yaml.YYYYMMDD-HHMMSS ../gaie-values.yaml

# Reapply configuration
cd ..
helmfile apply -n ${NAMESPACE}

# Verify rollback
kubectl rollout status deployment/<name> -n ${NAMESPACE}
kubectl logs -l llm-d.ai/role=decode -n ${NAMESPACE} | grep -E "gpu_memory_utilization|block_size"
```

## Getting Help

**Debug commands**:
- Pod events: `kubectl describe pod <pod-name> -n ${NAMESPACE}`
- Pod logs: `kubectl logs <pod-name> -n ${NAMESPACE}`
- InferencePool logs: `kubectl logs -l inferencepool=<pool-name> -n ${NAMESPACE}`
- Helm status: `helm status <release> -n ${NAMESPACE}`

**When filing issues**, include: deployment structure, config changes, error logs, and resource availability.
diff --git a/skills/configure-cache-llm-d/scripts/README.md b/skills/configure-cache-llm-d/scripts/README.md
new file mode 100644
index 0000000..5b9405c
--- /dev/null
+++ b/skills/configure-cache-llm-d/scripts/README.md
@@ -0,0 +1,151 @@
# llm-d Cache Configuration Scripts

Helper scripts for modifying cache settings in llm-d deployments.

## Scripts

### show-current-config.sh

Display the current cache configuration for a deployment.

**Usage:**
```bash
bash skills/configure-cache-llm-d/scripts/show-current-config.sh <namespace>
```

**Example:**
```bash
bash skills/configure-cache-llm-d/scripts/show-current-config.sh llmd-ns
```

**Output:**
- GPU memory utilization
- Block size
- Max model length
- Shared memory (SHM) size
- Tensor parallelism
- InferencePool configuration
- Resource usage

### update-cache-config.sh

Update the cache configuration and apply changes with a rolling update.
**Usage:**
```bash
bash skills/configure-cache-llm-d/scripts/update-cache-config.sh -n <namespace> [options]
```

**Options:**
- `-n <namespace>` - Target namespace (required)
- `-g <value>` - GPU memory utilization (0.0-1.0)
- `-b <tokens>` - Block size in tokens
- `-m <tokens>` - Max model length in tokens
- `-s <size>` - Shared memory size (e.g., 20Gi, 30Gi)
- `-d <dir>` - Deployment directory (auto-detected)
- `-r <release>` - Helm release name (auto-detected)
- `--dry-run` - Preview changes without applying

**Examples:**

Increase cache capacity:
```bash
bash skills/configure-cache-llm-d/scripts/update-cache-config.sh \
  -n llmd-ns -g 0.90 -b 32
```

Support longer contexts:
```bash
bash skills/configure-cache-llm-d/scripts/update-cache-config.sh \
  -n llmd-ns -m 16384 -g 0.85 -s 30Gi
```

Preview changes:
```bash
bash skills/configure-cache-llm-d/scripts/update-cache-config.sh \
  -n llmd-ns -g 0.90 --dry-run
```

**Features:**
- Auto-detects the deployment directory
- Backs up configuration files
- Updates both ms-values.yaml and gaie-values.yaml (if needed)
- Applies changes with helmfile
- Verifies rollout completion
- Provides rollback instructions

## Quick Reference

### Cache Parameters

| Parameter | Location | Default | Range | Purpose |
|-----------|----------|---------|-------|---------|
| GPU Memory Utilization | `--gpu-memory-utilization` | 0.95 | 0.0-1.0 | Controls KV cache capacity |
| Block Size | `--block-size` | 64 | 16-128 | Cache granularity for prefix matching |
| Max Model Length | `--max-model-len` | Model-specific | Up to model max | Maximum context window |
| Shared Memory | `sizeLimit` | 20Gi | 10Gi-50Gi | IPC for multi-GPU setups |

### Common Scenarios

**Increase cache hit rate:**
```bash
-g 0.90 -b 32
```

**Support longer contexts:**
```bash
-m 16384 -g 0.85 -s 30Gi
```

**Maximize throughput:**
```bash
-g 0.95 -b 64
```

**High tensor parallelism (TP=8):**
```bash
-s 40Gi
```

## Workflow

1.
**Check current config:**
   ```bash
   bash show-current-config.sh <namespace>
   ```

2. **Preview changes:**
   ```bash
   bash update-cache-config.sh -n <namespace> -g 0.90 --dry-run
   ```

3. **Apply changes:**
   ```bash
   bash update-cache-config.sh -n <namespace> -g 0.90
   ```

4. **Verify:**
   ```bash
   kubectl get pods -n <namespace>
   kubectl logs <pod-name> -n <namespace> | grep gpu_memory_utilization
   ```

## Troubleshooting

**Script can't find the deployment directory:**
- Specify it with the `-d` option: `-d deployments/deploy-<name>`

**Changes not applied:**
- Check the Helm release: `helm list -n <namespace>`
- Force a restart: `kubectl rollout restart deployment/<name> -n <namespace>`

**Block size mismatch warning:**
- The script automatically updates both ms-values.yaml and gaie-values.yaml
- Verify: `bash show-current-config.sh <namespace>`

## Safety

- All changes create timestamped backups in `deployments/<deployment>/backups/`
- Rolling updates maintain availability (zero downtime)
- Use `--dry-run` to preview changes first
- Rollback instructions are provided after each update
\ No newline at end of file
diff --git a/skills/configure-cache-llm-d/scripts/show-current-config.sh b/skills/configure-cache-llm-d/scripts/show-current-config.sh
new file mode 100755
index 0000000..1f6d76d
--- /dev/null
+++ b/skills/configure-cache-llm-d/scripts/show-current-config.sh
@@ -0,0 +1,86 @@
#!/bin/bash
# Show current cache configuration for llm-d deployment

set -e

NAMESPACE=${1:-${NAMESPACE}}

if [ -z "$NAMESPACE" ]; then
  echo "Usage: $0 <namespace>"
  echo "Or set NAMESPACE environment variable"
  exit 1
fi

echo "=== Current Cache Configuration in namespace: $NAMESPACE ==="
echo ""

# Find model service deployments
DEPLOYMENTS=$(kubectl get deployment -n "$NAMESPACE" -l llm-d.ai/role=decode -o name 2>/dev/null || true)

if [ -z "$DEPLOYMENTS" ]; then
  echo "No decode deployments found in namespace $NAMESPACE"
  exit 1
fi

for DEPLOY in $DEPLOYMENTS; do
  DEPLOY_NAME=$(echo "$DEPLOY" | cut -d'/' -f2)
  echo "Deployment: $DEPLOY_NAME"
  echo "---"

  # Get pod to
inspect + POD=$(kubectl get pods -n "$NAMESPACE" -l llm-d.ai/role=decode --field-selector=status.phase=Running -o name 2>/dev/null | head -1) + + if [ -n "$POD" ]; then + POD_NAME=$(echo "$POD" | cut -d'/' -f2) + + # Extract cache settings from pod spec + echo "GPU Memory Utilization:" + kubectl get "$POD" -n "$NAMESPACE" -o jsonpath='{.spec.containers[0].args}' | \ + grep -o 'gpu-memory-utilization=[0-9.]*' || echo " Not set (using default)" + + echo "" + echo "Block Size:" + kubectl get "$POD" -n "$NAMESPACE" -o jsonpath='{.spec.containers[0].args}' | \ + grep -o 'block-size=[0-9]*' || echo " Not set (using default)" + + echo "" + echo "Max Model Length:" + kubectl get "$POD" -n "$NAMESPACE" -o jsonpath='{.spec.containers[0].args}' | \ + grep -o 'max-model-len=[0-9]*' || echo " Not set (using default)" + + echo "" + echo "Shared Memory (SHM):" + kubectl get "$POD" -n "$NAMESPACE" -o jsonpath='{.spec.volumes[?(@.name=="shm")].emptyDir.sizeLimit}' || echo " Not configured" + + echo "" + echo "Tensor Parallelism:" + kubectl get "$POD" -n "$NAMESPACE" -o jsonpath='{.spec.containers[0].args}' | \ + grep -o 'tensor-parallel-size=[0-9]*' || echo " Not set (TP=1)" + + echo "" + echo "---" + echo "" + else + echo " No running pods found" + echo "" + fi +done + +# Check for InferencePool configuration +echo "=== InferencePool Configuration ===" +POOLS=$(kubectl get inferencepool -n "$NAMESPACE" -o name 2>/dev/null || true) +if [ -n "$POOLS" ]; then + for POOL in $POOLS; do + POOL_NAME=$(echo "$POOL" | cut -d'/' -f2) + echo "Pool: $POOL_NAME" + kubectl get inferencepool "$POOL_NAME" -n "$NAMESPACE" -o jsonpath='{.spec.modelServerType}' 2>/dev/null || echo " Type: unknown" + echo "" + done +else + echo "No InferencePools found" +fi + +echo "" +echo "=== Resource Usage ===" +kubectl top pods -n "$NAMESPACE" -l llm-d.ai/role=decode 2>/dev/null || echo "Metrics not available (metrics-server may not be installed)" + diff --git 
a/skills/configure-cache-llm-d/scripts/update-cache-config.sh b/skills/configure-cache-llm-d/scripts/update-cache-config.sh
new file mode 100755
index 0000000..b49df0a
--- /dev/null
+++ b/skills/configure-cache-llm-d/scripts/update-cache-config.sh
@@ -0,0 +1,238 @@
#!/bin/bash
# Update cache configuration for llm-d deployment

set -e

usage() {
  cat << EOF
Update cache configuration for llm-d deployment

Usage: $0 -n <namespace> [options]

Required:
  -n <namespace>   Target namespace

Cache Options (at least one required):
  -g <value>       GPU memory utilization (0.0-1.0, e.g., 0.90)
  -b <tokens>      Block size in tokens (e.g., 32, 64)
  -m <tokens>      Max model length in tokens (e.g., 8192, 16384)
  -s <size>        Shared memory size (e.g., 20Gi, 30Gi)

Deployment Options:
  -d <dir>         Deployment directory (auto-detected if not provided)
  -r <release>     Helm release name (auto-detected if not provided)
  --dry-run        Show changes without applying

Examples:
  # Increase cache capacity
  $0 -n llmd-ns -g 0.90 -b 32

  # Support longer contexts
  $0 -n llmd-ns -m 16384 -g 0.85 -s 30Gi

  # Preview changes
  $0 -n llmd-ns -g 0.90 --dry-run

EOF
  exit 1
}

# Parse arguments
NAMESPACE=""
GPU_MEM=""
BLOCK_SIZE=""
MAX_LEN=""
SHM_SIZE=""
DEPLOY_DIR=""
RELEASE_NAME=""
DRY_RUN=false

while [[ $# -gt 0 ]]; do
  case $1 in
    -n) NAMESPACE="$2"; shift 2 ;;
    -g) GPU_MEM="$2"; shift 2 ;;
    -b) BLOCK_SIZE="$2"; shift 2 ;;
    -m) MAX_LEN="$2"; shift 2 ;;
    -s) SHM_SIZE="$2"; shift 2 ;;
    -d) DEPLOY_DIR="$2"; shift 2 ;;
    -r) RELEASE_NAME="$2"; shift 2 ;;
    --dry-run) DRY_RUN=true; shift ;;
    -h|--help) usage ;;
    *) echo "Unknown option: $1"; usage ;;
  esac
done

# Validate required arguments
if [ -z "$NAMESPACE" ]; then
  echo "Error: Namespace is required"
  usage
fi

if [ -z "$GPU_MEM" ] && [ -z "$BLOCK_SIZE" ] && [ -z "$MAX_LEN" ] && [ -z "$SHM_SIZE" ]; then
  echo "Error: At least one cache option must be specified"
  usage
fi

# Auto-detect deployment directory if not provided
if [ -z "$DEPLOY_DIR" ]; then
  echo "Auto-detecting deployment directory..."
  # Guard against "no match": piping empty output into `xargs dirname` would
  # fail and, under `set -e`, kill the script before the friendly error below.
  MS_FILE=$(find deployments -type f -name "ms-values.yaml" -path "*/deploy-*/*" 2>/dev/null | head -1)
  if [ -z "$MS_FILE" ]; then
    echo "Error: Could not auto-detect deployment directory"
    echo "Please specify with -d option"
    exit 1
  fi
  DEPLOY_DIR=$(dirname "$MS_FILE")
  echo "Found: $DEPLOY_DIR"
fi

# Verify deployment directory exists
if [ ! -d "$DEPLOY_DIR" ]; then
  echo "Error: Deployment directory not found: $DEPLOY_DIR"
  exit 1
fi

MS_VALUES="$DEPLOY_DIR/ms-values.yaml"
GAIE_VALUES="$DEPLOY_DIR/gaie-values.yaml"

if [ ! -f "$MS_VALUES" ]; then
  echo "Error: ms-values.yaml not found in $DEPLOY_DIR"
  exit 1
fi

echo "=== Cache Configuration Update ==="
echo "Namespace: $NAMESPACE"
echo "Deployment: $DEPLOY_DIR"
echo ""

# Show current configuration
echo "Current Configuration:"
if [ -f "$MS_VALUES" ]; then
  echo "  GPU Memory Utilization: $(grep -o 'gpu-memory-utilization=[0-9.]*' "$MS_VALUES" || echo 'not set')"
  echo "  Block Size: $(grep -o 'block-size=[0-9]*' "$MS_VALUES" || echo 'not set')"
  echo "  Max Model Length: $(grep -o 'max-model-len=[0-9]*' "$MS_VALUES" || echo 'not set')"
  echo "  Shared Memory: $(grep -A 2 'name: shm' "$MS_VALUES" | grep 'sizeLimit:' | awk '{print $2}' || echo 'not set')"
fi
echo ""

# Show proposed changes
echo "Proposed Changes:"
[ -n "$GPU_MEM" ] && echo "  GPU Memory Utilization: $GPU_MEM"
[ -n "$BLOCK_SIZE" ] && echo "  Block Size: $BLOCK_SIZE"
[ -n "$MAX_LEN" ] && echo "  Max Model Length: $MAX_LEN"
[ -n "$SHM_SIZE" ] && echo "  Shared Memory: $SHM_SIZE"
echo ""

if [ "$DRY_RUN" = true ]; then
  echo "DRY RUN - No changes will be applied"
  echo ""
  echo "To apply these changes, run without the --dry-run flag"
  exit 0
fi

# Backup original files
BACKUP_DIR="$DEPLOY_DIR/backups"
mkdir -p "$BACKUP_DIR"
TIMESTAMP=$(date +%Y%m%d-%H%M%S)
cp "$MS_VALUES" "$BACKUP_DIR/ms-values.yaml.$TIMESTAMP"
echo "Backed up ms-values.yaml to $BACKUP_DIR/ms-values.yaml.$TIMESTAMP"
+ +if [ -f "$GAIE_VALUES" ] && [ -n "$BLOCK_SIZE" ]; then + cp "$GAIE_VALUES" "$BACKUP_DIR/gaie-values.yaml.$TIMESTAMP" + echo "Backed up gaie-values.yaml to $BACKUP_DIR/gaie-values.yaml.$TIMESTAMP" +fi +echo "" + +# Update ms-values.yaml +echo "Updating $MS_VALUES..." + +if [ -n "$GPU_MEM" ]; then + if grep -q "gpu-memory-utilization=" "$MS_VALUES"; then + sed -i.tmp "s/--gpu-memory-utilization=[0-9.]*/--gpu-memory-utilization=$GPU_MEM/" "$MS_VALUES" + echo " Updated GPU memory utilization to $GPU_MEM" + else + echo " Warning: gpu-memory-utilization not found in file" + fi +fi + +if [ -n "$BLOCK_SIZE" ]; then + if grep -q "block-size=" "$MS_VALUES"; then + sed -i.tmp "s/--block-size=[0-9]*/--block-size=$BLOCK_SIZE/" "$MS_VALUES" + echo " Updated block size to $BLOCK_SIZE" + else + echo " Warning: block-size not found in file" + fi +fi + +if [ -n "$MAX_LEN" ]; then + if grep -q "max-model-len=" "$MS_VALUES"; then + sed -i.tmp "s/--max-model-len=[0-9]*/--max-model-len=$MAX_LEN/" "$MS_VALUES" + echo " Updated max model length to $MAX_LEN" + else + echo " Warning: max-model-len not found in file" + fi +fi + +if [ -n "$SHM_SIZE" ]; then + if grep -q "sizeLimit:" "$MS_VALUES"; then + sed -i.tmp "/name: shm/,/sizeLimit:/ s/sizeLimit: .*/sizeLimit: $SHM_SIZE/" "$MS_VALUES" + echo " Updated shared memory to $SHM_SIZE" + else + echo " Warning: sizeLimit not found in file" + fi +fi + +# Clean up temp files +rm -f "$MS_VALUES.tmp" + +# Update gaie-values.yaml if block size changed +if [ -n "$BLOCK_SIZE" ] && [ -f "$GAIE_VALUES" ]; then + echo "" + echo "Updating $GAIE_VALUES..." 
+ if grep -q "blockSize:" "$GAIE_VALUES"; then + sed -i.tmp "s/blockSize: [0-9]*/blockSize: $BLOCK_SIZE/" "$GAIE_VALUES" + echo " Updated blockSize to $BLOCK_SIZE (must match vLLM)" + rm -f "$GAIE_VALUES.tmp" + else + echo " Note: blockSize not found (may not be using precise prefix cache)" + fi +fi + +echo "" +echo "=== Applying Changes ===" +echo "Running: helmfile apply -n $NAMESPACE" +echo "" + +cd "$DEPLOY_DIR" +helmfile apply -n "$NAMESPACE" + +echo "" +echo "=== Verifying Deployment ===" +echo "Waiting for rollout to complete..." +sleep 5 + +# Wait for rollout +DEPLOYMENTS=$(kubectl get deployment -n "$NAMESPACE" -l llm-d.ai/role=decode -o name 2>/dev/null || true) +for DEPLOY in $DEPLOYMENTS; do + echo "Checking $DEPLOY..." + kubectl rollout status "$DEPLOY" -n "$NAMESPACE" --timeout=300s || true +done + +echo "" +echo "=== New Configuration ===" +POD=$(kubectl get pods -n "$NAMESPACE" -l llm-d.ai/role=decode --field-selector=status.phase=Running -o name 2>/dev/null | head -1) +if [ -n "$POD" ]; then + echo "Verifying settings in pod..." + kubectl logs "$POD" -n "$NAMESPACE" --tail=100 | grep -E "gpu_memory_utilization|block_size|max_model_len" || echo "Settings not yet in logs (pod may still be starting)" +fi + +echo "" +echo "✓ Cache configuration updated successfully!" +echo "" +echo "To verify:" +echo " kubectl get pods -n $NAMESPACE" +echo " kubectl logs -n $NAMESPACE | grep -E 'gpu_memory_utilization|block_size|max_model_len'" +echo "" +echo "To rollback:" +echo " cp $BACKUP_DIR/ms-values.yaml.$TIMESTAMP $MS_VALUES" +echo " cd $DEPLOY_DIR && helmfile apply -n $NAMESPACE" +