Date: January 16, 2026
Repository: parallelworks/activate-rag-vllm
Objective: Improve, refactor, and consolidate the repository for long-term supportability, ease of use, and multi-environment deployment (Singularity-focused for HPC).
This plan outlines the steps to:
- Merge the `nemotron` branch improvements into `main`
- Consolidate duplicate code and configurations
- Add flexible model sourcing (local path or HuggingFace pull)
- Improve Singularity deployment for HPC environments
- Create a unified, user-friendly workflow experience
┌─────────────────────────────────────────────────────────────────┐
│ User / Open WebUI │
└───────────────────────────┬─────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ RAG Proxy (Port 8081) │
│ - OpenAI-compatible endpoints │
│ - Injects RAG context into prompts │
│ - Citation handling │
└───────────────┬─────────────────────────────────────────────────┘
│
┌───────┴───────┐
▼ ▼
┌───────────────┐ ┌─────────────────────────────────────────────┐
│ RAG Server │ │ vLLM Server (8000) │
│ (8080) │ │ - OpenAI-compatible inference API │
│ - ChromaDB │ │ - GPU acceleration │
└───────┬───────┘ └─────────────────────────────────────────────┘
│
▼
┌───────────────┐ ┌─────────────────────────────────────────────┐
│ ChromaDB │◄─│ Indexer (background) │
│ (8001) │ │ - File watcher for docs │
└───────────────┘ └─────────────────────────────────────────────┘
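Once the stack is up, each component in the diagram can be sanity-checked on its default port. A minimal probe sketch follows; the vLLM `/health` and `/v1/models` routes are standard vLLM endpoints, but the RAG proxy, RAG server, and ChromaDB paths shown here are assumptions that may not match this repository's actual routes:

```bash
# Sketch: probe each service on the default ports from the diagram.
# Only the vLLM routes are known vLLM endpoints; the rest are illustrative guesses.
curl -sf http://127.0.0.1:8000/health >/dev/null           && echo "vLLM OK"
curl -sf http://127.0.0.1:8081/v1/models >/dev/null        && echo "RAG proxy OK"    # assumed route
curl -sf http://127.0.0.1:8080/ >/dev/null                 && echo "RAG server OK"   # assumed route
curl -sf http://127.0.0.1:8001/api/v1/heartbeat >/dev/null && echo "ChromaDB OK"     # assumed route
```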
**Priority**: High | **Effort**: Medium
The `nemotron` branch contains valuable improvements that should be merged:
| Feature | Description | Action |
|---|---|---|
| `controller.sh` | Extracted preprocessing logic | ✅ Adopt |
| `parallelworks/checkout` action | Cleaner git clone | ✅ Adopt |
| PBS scheduler support | Extended HPC compatibility | ✅ Adopt |
| vLLM attention backend options | 20+ backend choices | ✅ Adopt |
| Offline mode defaults | `TRANSFORMERS_OFFLINE=1` | ✅ Adopt |
| Container pull options | `pull` boolean + bucket source | ✅ Adopt |
| Tiktoken encodings download | Offline tokenizer support | ✅ Adopt |
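As a concrete illustration of the offline-mode and tokenizer rows above, the merged launch script could default to something like the following; a minimal sketch where only `TRANSFORMERS_OFFLINE` comes from the branch, while `HF_HUB_OFFLINE` and `TIKTOKEN_CACHE_DIR` are standard huggingface_hub/tiktoken variables assumed here rather than confirmed branch code:

```bash
# Sketch: offline defaults for air-gapped HPC compute nodes.
export TRANSFORMERS_OFFLINE="${TRANSFORMERS_OFFLINE:-1}"                  # no Hub calls from transformers
export HF_HUB_OFFLINE="${HF_HUB_OFFLINE:-1}"                              # no Hub calls from huggingface_hub
export TIKTOKEN_CACHE_DIR="${TIKTOKEN_CACHE_DIR:-$HOME/.cache/tiktoken}"  # pre-downloaded tiktoken encodings
```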
Implementation Steps:
```bash
# Create integration branch
git checkout main
git checkout -b feature/nemotron-integration
git merge origin/nemotron --no-commit
# Resolve conflicts, keeping best of both
# Test thoroughly before merging to main
```

**Priority**: High | **Effort**: Medium
Current State: 4 similar workflow files with 70%+ code duplication
- `workflow.yaml` (main)
- `workflow-vllm.yaml` (vLLM-only mode)
- `yamls/hsp.yaml` (HPC-specific)
- `yamls/emed.yaml` (medical domain)
Target State: Single `workflow.yaml` with conditional sections
Implementation:
```yaml
# Proposed unified workflow.yaml structure
name: activate-rag-vllm
description: Deploy vLLM + RAG stack on HPC or cloud
inputs:
# === Mode Selection ===
deployment_mode:
type: dropdown
label: Deployment Mode
options:
- label: "vLLM + RAG (Full Stack)"
value: all
- label: "vLLM Only"
value: vllm
default: all
# === Model Configuration ===
model_source:
type: dropdown
label: Model Source
options:
- label: "Local Path (pre-downloaded)"
value: local
- label: "HuggingFace Hub (auto-download)"
value: huggingface
default: local
model_path:
type: text
label: Local Model Path
description: "Full path to model weights directory"
hidden: inputs.model_source != 'local'
hf_model_id:
type: text
label: HuggingFace Model ID
placeholder: "meta-llama/Llama-3.1-8B-Instruct"
hidden: inputs.model_source != 'huggingface'
# === Scheduler Selection ===
scheduler:
type: dropdown
label: Job Scheduler
options:
- { label: SSH (direct), value: ssh }
- { label: SLURM, value: slurm }
- { label: PBS, value: pbs }
default: slurm
# Conditional scheduler options shown based on selection
slurm_partition:
    hidden: inputs.scheduler != 'slurm'
pbs_queue:
    hidden: inputs.scheduler != 'pbs'
```

**Priority**: High | **Effort**: Low
Current State: Logic split between `start_service.sh` and `controller.sh`
Target State: Single `start_service.sh` with modular functions
Proposed Structure:
```bash
#!/bin/bash
# start_service.sh - Unified entrypoint
set -euo pipefail
# Source common functions
source "$(dirname "$0")/lib/functions.sh"
# Main execution
main() {
parse_arguments "$@"
detect_environment # Docker vs Singularity vs local
validate_config
setup_model # New: handles local vs HF download
configure_ports
launch_services
wait_for_ready
export_session_port
}
main "$@"
```
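`detect_environment` is called above but not defined anywhere in this plan; a minimal sketch of what it might look like (the `CONTAINER_RUNTIME` variable name and the probe order are assumptions):

```bash
# Sketch: decide which runtime the rest of start_service.sh should target.
detect_environment() {
    if command -v singularity &>/dev/null || command -v apptainer &>/dev/null; then
        CONTAINER_RUNTIME="singularity"
    elif command -v docker &>/dev/null; then
        CONTAINER_RUNTIME="docker"
    else
        CONTAINER_RUNTIME="local"   # run the Python services directly on the host
    fi
    info "Detected container runtime: $CONTAINER_RUNTIME"
    export CONTAINER_RUNTIME
}
```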
**Priority**: High | **Effort**: Medium

Create a model management system that supports:
- Local pre-downloaded models
- HuggingFace Hub downloads (git-lfs preferred for HPC)
- Cached model reuse across runs
**New File**: `lib/model_manager.sh`
```bash
#!/bin/bash
# lib/model_manager.sh - Model download and validation
MODEL_CACHE_BASE="${MODEL_CACHE_BASE:-$HOME/.cache/activate-models}"
setup_model() {
local source="$1" # local | huggingface
local model_id="$2" # path or HF model ID
local hf_token="$3" # optional HF token
case "$source" in
local)
validate_local_model "$model_id"
MODEL_PATH="$model_id"
;;
huggingface)
download_hf_model "$model_id" "$hf_token"
MODEL_PATH="$MODEL_CACHE_BASE/$model_id"
;;
esac
export MODEL_PATH
}
validate_local_model() {
local path="$1"
if [[ ! -d "$path" ]]; then
error "Model directory not found: $path"
exit 1
fi
# Check for required files
local required_files=("config.json" "tokenizer.json")
for file in "${required_files[@]}"; do
if [[ ! -f "$path/$file" ]]; then
warn "Missing expected file: $path/$file"
fi
done
info "Local model validated: $path"
}
download_hf_model() {
local model_id="$1"
local hf_token="$2"
local target_dir="$MODEL_CACHE_BASE/$model_id"
if [[ -d "$target_dir" ]] && model_is_complete "$target_dir"; then
info "Model already cached: $target_dir"
return 0
fi
mkdir -p "$target_dir"
# Prefer git-lfs for HPC (more reliable than hf_hub_download)
info "Downloading model via git-lfs: $model_id"
local repo_url="https://huggingface.co/$model_id"
if [[ -n "$hf_token" ]]; then
repo_url="https://user:${hf_token}@huggingface.co/$model_id"
fi
GIT_LFS_SKIP_SMUDGE=0 git clone --depth 1 "$repo_url" "$target_dir"
# Verify download
if ! model_is_complete "$target_dir"; then
error "Model download incomplete"
exit 1
fi
info "Model downloaded successfully: $target_dir"
}
model_is_complete() {
local path="$1"
[[ -f "$path/config.json" ]] && \
[[ -f "$path/tokenizer.json" || -f "$path/tokenizer_config.json" ]]
}
```
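The risk table at the end of this plan calls for retry logic on model downloads; one lightweight way to add it to `download_hf_model` is a generic retry wrapper (a sketch; the `retry` helper is hypothetical, not existing repo code):

```bash
# Sketch: retry a flaky network command with exponential backoff.
retry() {
    local attempts="$1"; shift
    local delay=10 i
    for ((i = 1; i <= attempts; i++)); do
        "$@" && return 0
        warn "Attempt $i/$attempts failed: $*"
        sleep "$delay"
        delay=$((delay * 2))
    done
    return 1
}

# Possible call site inside download_hf_model:
#   retry 3 git clone --depth 1 "$repo_url" "$target_dir"
```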
**Priority**: High | **Effort**: Medium

Update `workflow.yaml` to show/hide form elements based on model source:
```yaml
inputs:
model:
type: section
label: Model Configuration
source:
type: dropdown
label: Model Source
options:
- label: "📁 Local Path (recommended for HPC)"
value: local
description: "Use pre-downloaded model weights"
- label: "🤗 HuggingFace Hub"
value: huggingface
description: "Download from HuggingFace (requires network)"
default: local
# Shown when source=local
local_path:
type: text
label: Model Path
placeholder: /path/to/model/weights
description: "Full path to directory containing model weights"
required: true
hidden:
source: '!= local'
# Shown when source=huggingface
hf_model_id:
type: text
label: HuggingFace Model ID
placeholder: meta-llama/Llama-3.1-8B-Instruct
hidden:
source: '!= huggingface'
hf_token:
type: secret
label: HuggingFace Token
description: "Required for gated models (Llama, etc.)"
hidden:
source: '!= huggingface'
cache_dir:
type: text
label: Model Cache Directory
default: ~/pw/models
description: "Where to store downloaded models"
hidden:
        source: '!= huggingface'
```

**Priority**: High | **Effort**: Medium
Issues to Address:
- Manual `__MODEL_PATH__` substitution
- No native env var interpolation
- Port management complexity
Proposed `singularity/singularity-compose.yml`:
```yaml
version: "1.0"
instances:
vllm:
build:
context: .
recipe: Singularity.vllm
ports:
- "${VLLM_PORT:-8000}:8000"
volumes:
- "${MODEL_PATH}:/models/active:ro"
- "${HF_CACHE:-./cache}:/root/.cache/huggingface"
environment:
- MODEL_NAME=/models/active
- VLLM_API_KEY=${VLLM_API_KEY:-}
- CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES:-all}
runtime:
options: "--nv" # GPU support
start:
options: "--env-file env.sh"
rag:
build:
context: .
recipe: Singularity.rag
depends_on:
- vllm
ports:
- "${RAG_PROXY_PORT:-8081}:8081"
- "${RAG_SERVER_PORT:-8080}:8080"
- "${CHROMA_PORT:-8001}:8001"
volumes:
- "${DOCS_DIR:-./docs}:/docs:rw"
- "${CHROMA_DATA:-./chroma_data}:/chroma_data"
environment:
- VLLM_URL=http://127.0.0.1:${VLLM_PORT:-8000}/v1
      - VLLM_API_KEY=${VLLM_API_KEY:-}
```
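Given the "no native env var interpolation" issue noted above, one workaround is to treat this file as a template and render it just before launch. A sketch, assuming the `singularity-compose` CLI and `envsubst` (gettext) are available and that a `singularity-compose.yml.template` copy of the file exists; note that `envsubst` does not expand `${VAR:-default}` fallbacks, so defaults would need to be set explicitly in `env.sh`:

```bash
# Sketch: render the compose template with current environment values, then start the stack.
set -a && source ./env.sh && set +a        # export MODEL_PATH, VLLM_PORT, DOCS_DIR, ...
envsubst < singularity/singularity-compose.yml.template \
         > singularity/singularity-compose.yml

cd singularity
singularity-compose build                  # build from Singularity.vllm / Singularity.rag
singularity-compose up                     # start the vllm and rag instances
```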
**Priority**: Medium | **Effort**: Low

Create configuration presets for common HPC environments:
**New File**: `configs/hpc-presets.yaml`
```yaml
presets:
# Navy DSRC systems
navy-hpc:
scheduler: pbs
container_source: bucket
container_bucket: "gs://navy-containers/activate"
offline_mode: true
defaults:
gpu_type: "nvidia_a100"
max_model_len: 32768
# AFRL systems
afrl-hpc:
scheduler: slurm
container_source: local
offline_mode: true
defaults:
partition: "gpu"
qos: "normal"
# AWS cloud
aws-cloud:
scheduler: slurm
container_source: pull
offline_mode: false
defaults:
instance_type: "p4d.24xlarge"
# Local development
local-dev:
scheduler: ssh
container_source: build
offline_mode: false
defaults:
      gpu_type: "auto-detect"
```
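How a preset actually gets applied is left open above; one simple option is a small helper that flattens the selected preset into environment variables before `start_service.sh` runs. A sketch, assuming the mikefarah `yq` v4 CLI is available; the exported variable names are illustrative:

```bash
#!/bin/bash
# Sketch: export the top-level keys of one preset (scheduler, container_source, offline_mode).
set -euo pipefail

PRESET="${1:?usage: apply_preset.sh <preset-name>}"
PRESET_FILE="configs/hpc-presets.yaml"

SCHEDULER="$(yq ".presets[\"${PRESET}\"].scheduler" "$PRESET_FILE")"
CONTAINER_SOURCE="$(yq ".presets[\"${PRESET}\"].container_source" "$PRESET_FILE")"
OFFLINE_MODE="$(yq ".presets[\"${PRESET}\"].offline_mode" "$PRESET_FILE")"

export SCHEDULER CONTAINER_SOURCE OFFLINE_MODE
echo "Applied preset '${PRESET}': scheduler=${SCHEDULER}, container_source=${CONTAINER_SOURCE}, offline=${OFFLINE_MODE}"
```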
**Priority**: Medium | **Effort**: Low

**New File**: `lib/preflight.sh`
```bash
#!/bin/bash
# lib/preflight.sh - Pre-flight checks for HPC deployment
preflight_checks() {
local errors=0
info "Running pre-flight checks..."
# Check Singularity
if ! command -v singularity &>/dev/null; then
error "Singularity not found in PATH"
    errors=$((errors + 1))
else
local version=$(singularity --version 2>/dev/null)
info "Singularity: $version"
fi
# Check GPU access
if ! nvidia-smi &>/dev/null; then
warn "nvidia-smi not available - GPU may not be accessible"
else
local gpu_count=$(nvidia-smi -L | wc -l)
info "GPUs detected: $gpu_count"
fi
# Check model path
if [[ "$MODEL_SOURCE" == "local" ]]; then
if [[ ! -d "$MODEL_PATH" ]]; then
error "Model path not found: $MODEL_PATH"
      errors=$((errors + 1))
fi
fi
# Check disk space for cache
local cache_dir="${MODEL_CACHE_BASE:-$HOME/.cache}"
local free_gb=$(df -BG "$cache_dir" | awk 'NR==2 {print $4}' | tr -d 'G')
if (( free_gb < 50 )); then
warn "Low disk space for model cache: ${free_gb}GB free"
fi
# Check network (if HF download needed)
if [[ "$MODEL_SOURCE" == "huggingface" ]]; then
if ! curl -s --connect-timeout 5 https://huggingface.co &>/dev/null; then
error "Cannot reach HuggingFace Hub - check network/proxy"
      errors=$((errors + 1))
fi
fi
if (( errors > 0 )); then
error "Pre-flight checks failed with $errors error(s)"
return 1
fi
info "Pre-flight checks passed ✓"
return 0
}
```

**Priority**: Medium | **Effort**: Low
`start_service.sh` changes:
```bash
#!/bin/bash
set -euo pipefail # Add -e for exit on error
trap cleanup EXIT ERR
cleanup() {
local exit_code=$?
if (( exit_code != 0 )); then
error "Script failed with exit code: $exit_code"
# Capture logs for debugging
if [[ -d "./logs" ]]; then
tar -czf "debug-logs-$(date +%Y%m%d-%H%M%S).tar.gz" ./logs/
fi
fi
}
```

**Priority**: Medium | **Effort**: Medium
**New File**: `lib/config_validator.py`
```python
#!/usr/bin/env python3
"""Validate configuration before service launch."""
import os
import sys
import json
from pathlib import Path
def validate_model_config(config: dict) -> list[str]:
"""Validate model configuration."""
errors = []
model_path = config.get("MODEL_PATH") or config.get("model_path")
if not model_path:
errors.append("MODEL_PATH not specified")
elif not Path(model_path).exists():
errors.append(f"Model path does not exist: {model_path}")
else:
# Check for required model files
required = ["config.json"]
for f in required:
if not (Path(model_path) / f).exists():
errors.append(f"Missing required file: {model_path}/{f}")
return errors
def validate_port_config(config: dict) -> list[str]:
"""Validate port configuration."""
errors = []
ports = {
"VLLM_PORT": config.get("VLLM_PORT", 8000),
"RAG_PROXY_PORT": config.get("RAG_PROXY_PORT", 8081),
"RAG_SERVER_PORT": config.get("RAG_SERVER_PORT", 8080),
"CHROMA_PORT": config.get("CHROMA_PORT", 8001),
}
# Check for port conflicts
used_ports = list(ports.values())
if len(used_ports) != len(set(used_ports)):
errors.append("Port conflict detected - duplicate port assignments")
return errors
def main():
"""Run all validations."""
config = dict(os.environ)
# Also load from env.sh if present
env_file = Path("env.sh")
if env_file.exists():
for line in env_file.read_text().splitlines():
if "=" in line and not line.startswith("#"):
key, _, value = line.partition("=")
key = key.replace("export ", "").strip()
config[key] = value.strip().strip('"').strip("'")
all_errors = []
all_errors.extend(validate_model_config(config))
all_errors.extend(validate_port_config(config))
if all_errors:
print("Configuration validation failed:", file=sys.stderr)
for error in all_errors:
print(f" ✗ {error}", file=sys.stderr)
sys.exit(1)
print("Configuration validation passed ✓")
sys.exit(0)
if __name__ == "__main__":
    main()
```

**Priority**: Low | **Effort**: Low
Add to `lib/functions.sh`:
```bash
# Logging functions with timestamps
LOG_FILE="${LOG_DIR:-./logs}/service-$(date +%Y%m%d-%H%M%S).log"
mkdir -p "$(dirname "$LOG_FILE")"
log() {
local level="$1"
shift
local message="$*"
local timestamp=$(date '+%Y-%m-%d %H:%M:%S')
echo "[$timestamp] [$level] $message" | tee -a "$LOG_FILE"
}
info() { log "INFO" "$@"; }
warn() { log "WARN" "$@" >&2; }
error() { log "ERROR" "$@" >&2; }
debug() { [[ "${DEBUG:-0}" == "1" ]] && log "DEBUG" "$@" || true; }  # '|| true' keeps set -e scripts alive when DEBUG is off
```

**Priority**: Medium | **Effort**: Low
Update `README.md` with a clear quickstart:
## Quick Start
### Option 1: Local Model (Recommended for HPC)
1. **Ensure model weights are available**:
   ```bash
   ls /path/to/your/model/
   # Should contain: config.json, tokenizer.json, *.safetensors
   ```
2. **Deploy via ParallelWorks**:
   - Select "Local Path" as Model Source
   - Enter full path to model directory
   - Choose your scheduler (SLURM/PBS/SSH)
   - Submit workflow

### Option 2: HuggingFace Download

1. **Get a HuggingFace token** (for gated models):
   - Visit https://huggingface.co/settings/tokens
   - Create a token with "read" permissions
2. **Deploy via ParallelWorks**:
   - Select "HuggingFace Hub" as Model Source
   - Enter the model ID (e.g., `meta-llama/Llama-3.1-8B-Instruct`)
   - Paste your HF token
   - Submit workflow
### 5.2 Interactive Configuration Wizard
**Priority**: Low | **Effort**: Medium
**New File**: `scripts/configure.sh`
```bash
#!/bin/bash
# Interactive configuration wizard for local development
echo "=== ACTIVATE RAG-vLLM Configuration Wizard ==="
echo
# Model source
echo "How will you provide the model?"
select model_source in "Local Path" "HuggingFace Download"; do
case $model_source in
"Local Path")
read -p "Enter model path: " MODEL_PATH
if [[ ! -d "$MODEL_PATH" ]]; then
echo "Warning: Path does not exist"
fi
break
;;
"HuggingFace Download")
read -p "Enter HuggingFace model ID: " HF_MODEL_ID
read -sp "Enter HuggingFace token (optional): " HF_TOKEN
echo
MODEL_PATH="$HOME/.cache/activate-models/$HF_MODEL_ID"
break
;;
esac
done
# Deployment mode
echo
echo "What do you want to deploy?"
select runtype in "vLLM + RAG (Full Stack)" "vLLM Only"; do
case $runtype in
"vLLM + RAG"*) RUNTYPE="all"; break ;;
"vLLM Only") RUNTYPE="vllm"; break ;;
esac
done
# Generate env.sh
cat > env.sh << EOF
# Generated by configure.sh on $(date)
export MODEL_PATH="$MODEL_PATH"
export RUNTYPE="$RUNTYPE"
export HF_TOKEN="${HF_TOKEN:-}"
export TRANSFORMERS_OFFLINE=1
EOF
echo
echo "Configuration saved to env.sh"
echo "Run: ./start_service.sh"
```

**Priority**: Low | **Effort**: High
**New Directory**: `tests/`
```
tests/
├── conftest.py
├── test_rag_server.py
├── test_rag_proxy.py
├── test_indexer.py
└── integration/
    └── test_e2e.py
```
**Priority**: Low | **Effort**: Medium
**New File**: `.github/workflows/ci.yml`
```yaml
name: CI
on:
push:
branches: [main]
pull_request:
branches: [main]
jobs:
lint:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- uses: actions/setup-python@v5
with:
python-version: "3.11"
- run: pip install ruff
- run: ruff check .
shellcheck:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- run: shellcheck *.sh lib/*.sh
test:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- uses: actions/setup-python@v5
with:
python-version: "3.11"
- run: pip install -r requirements.txt pytest
      - run: pytest tests/ -v
```

| Phase | Tasks | Duration | Dependencies |
|---|---|---|---|
| 1 | Branch merge, YAML consolidation, script unification | 1 week | None |
| 2 | Model management, conditional forms | 1 week | Phase 1 |
| 3 | Singularity optimization, HPC presets | 1 week | Phase 1-2 |
| 4 | Error handling, validation, logging | 3 days | Phase 1 |
| 5 | Documentation, wizard | 2 days | Phase 1-3 |
| 6 | Testing, CI/CD | 1 week | Phase 1-4 |
Total Estimated Time: 4-5 weeks
```
activate-rag-vllm/
├── workflow.yaml # Unified workflow (replaces 4 files)
├── start_service.sh # Main entrypoint
├── indexer.py
├── rag_proxy.py
├── rag_server.py
├── indexer_config.yaml
├── README.md # Updated with quickstart
├── lib/ # NEW: Shared functions
│ ├── functions.sh
│ ├── model_manager.sh
│ ├── preflight.sh
│ └── config_validator.py
├── configs/ # NEW: Configuration presets
│ ├── hpc-presets.yaml
│ └── defaults.yaml
├── singularity/
│ ├── singularity-compose.yml # Updated
│ ├── Singularity.rag
│ ├── Singularity.vllm
│ └── env.sh.example
├── docker/ # Retained for local dev
│ └── ...
├── scripts/ # NEW: Utility scripts
│ └── configure.sh
├── tests/ # NEW: Test suite
│ └── ...
├── docs/
│ ├── IMPLEMENTATION_PLAN.md # This document
│ ├── ARCHITECTURE.md # NEW: Architecture docs
│ └── HPC_GUIDE.md # NEW: HPC deployment guide
└── .github/
└── workflows/
        └── ci.yml                  # NEW: CI/CD
```
- ✅ Single `workflow.yaml` handles all deployment modes
- ✅ Users can specify local model path OR HuggingFace model ID
- ✅ Git-lfs based HuggingFace downloads work on HPC systems
- ✅ Pre-flight checks validate configuration before deployment
- ✅ Clear error messages guide users to resolution
- ✅ Documentation enables self-service onboarding
- ✅ Singularity deployment works reliably on HPC clusters
| Risk | Mitigation |
|---|---|
| Breaking existing workflows | Maintain backward compatibility, gradual rollout |
| HPC network restrictions | Default to offline mode, pre-pull containers |
| Model download failures | Implement retry logic, resume capability |
| GPU detection issues | Explicit CUDA_VISIBLE_DEVICES configuration |
| Port conflicts | Dynamic port allocation with conflict detection |
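As a concrete shape for the port-conflict mitigation above, the launch script could probe for a free port before exporting it (a sketch; `find_free_port` is a hypothetical helper, not existing repo code):

```bash
# Sketch: walk upward from a preferred port until nothing is listening on it.
# Uses bash's /dev/tcp probe, so no extra tools are required.
find_free_port() {
    local port="${1:-8000}"
    while (timeout 1 bash -c "</dev/tcp/127.0.0.1/${port}") 2>/dev/null; do
        port=$((port + 1))
    done
    echo "$port"
}

VLLM_PORT="$(find_free_port "${VLLM_PORT:-8000}")"
export VLLM_PORT
```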
- Immediate: Create feature branch for Phase 1
- Week 1: Complete branch merge and YAML consolidation
- Week 2: Implement model management system
- Week 3: Optimize Singularity deployment
- Week 4: Documentation and testing
- Week 5: User acceptance testing and rollout
Document maintained by: ACTIVATE Team
Last updated: January 16, 2026