ops(judge-ft): merge LoRA, push to HF Hub, stand up vLLM serving

## Context

Child of #90. Blocks on #108 (acceptance-gate pass).

Production deployment of the trained model — merging the LoRA, publishing weights, running vLLM as a service.

## Scope

### 1. Merge LoRA

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct", torch_dtype="auto")
merged = PeftModel.from_pretrained(base, "path/to/lora-best").merge_and_unload()
merged.save_pretrained("./llmtrace-qwen-judge-v1")
AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct").save_pretrained("./llmtrace-qwen-judge-v1")
```

Script location: `autoresearch-rl/examples/security-judge/scripts/merge_and_export.py`.

### 2. HF Hub publication

Private repo `epappas/llmtrace-qwen-judge-v1`. Include:

- Model + tokenizer weights.
- `README.md` — training corpus, hyperparameters, headline metrics from #108, ship-or-no-ship context, intended use.
- `eval_protocol.json` — the same metric definitions used during training, for reproducibility.
- Reference to the #108 evaluation report in LLMTrace.

Gate: make it `public=false` until legal/compliance approves. Fine-tuned models derived from open weights can inherit licence obligations.

### 3. vLLM serving

Target: one dedicated container in the cluster running:

```bash
vllm serve ./llmtrace-qwen-judge-v1 \
  --host 0.0.0.0 \
  --port 8000 \
  --max-model-len 4096 \
  --dtype auto \
  --gpu-memory-utilization 0.85 \
  --served-model-name llmtrace-qwen-judge-v1
```

Helm chart or Compose file under `deployments/` tracked in this issue.

Expected resource footprint: ~2 GiB VRAM at fp16 for Qwen-0.5B; fits on any small GPU (even A10).

### 4. Health checks

- `/v1/models` reachable.
- One canned-prompt call returns a valid 6-field verdict within 500 ms.
- Auto-restart on crash (Kubernetes liveness probe or Compose restart policy).

### 5. Security

- Internal-only network. No public ingress.
- No auth required for internal calls (vLLM doesn't enforce it anyway). LLMTrace authenticates its own egress.
- Log scraping: vLLM access logs → the same OTEL pipeline the proxy uses.

## Acceptance

- [ ] `merge_and_export.py` produces a loadable HF directory.
- [ ] HF Hub repo populated, private.
- [ ] vLLM container running in the target cluster, reachable from LLMTrace pods.
- [ ] Canned-prompt smoke test passes from inside the cluster.
- [ ] Helm/Compose artefact committed.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ops(judge-ft): merge LoRA, push to HF Hub, stand up vLLM serving #109

Context

Scope

1. Merge LoRA

2. HF Hub publication

3. vLLM serving

4. Health checks

5. Security

Acceptance

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

ops(judge-ft): merge LoRA, push to HF Hub, stand up vLLM serving #109

Description

Context

Scope

1. Merge LoRA

2. HF Hub publication

3. vLLM serving

4. Health checks

5. Security

Acceptance

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions