Context
Child of #90. Blocks on #108 (acceptance-gate pass).
Production deployment of the trained model — merging the LoRA, publishing weights, running vLLM as a service.
Scope
1. Merge LoRA
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer
base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct", torch_dtype="auto")
merged = PeftModel.from_pretrained(base, "path/to/lora-best").merge_and_unload()
merged.save_pretrained("./llmtrace-qwen-judge-v1")
AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct").save_pretrained("./llmtrace-qwen-judge-v1")
Script location: autoresearch-rl/examples/security-judge/scripts/merge_and_export.py.
2. HF Hub publication
Private repo epappas/llmtrace-qwen-judge-v1. Include:
Gate: make it public=false until legal/compliance approves. Fine-tuned models derived from open weights can inherit licence obligations.
3. vLLM serving
Target: one dedicated container in the cluster running:
vllm serve ./llmtrace-qwen-judge-v1 \
--host 0.0.0.0 \
--port 8000 \
--max-model-len 4096 \
--dtype auto \
--gpu-memory-utilization 0.85 \
--served-model-name llmtrace-qwen-judge-v1
Helm chart or Compose file under deployments/ tracked in this issue.
Expected resource footprint: ~2 GiB VRAM at fp16 for Qwen-0.5B; fits on any small GPU (even A10).
4. Health checks
/v1/models reachable.
- One canned-prompt call returns a valid 6-field verdict within 500 ms.
- Auto-restart on crash (Kubernetes liveness probe or Compose restart policy).
5. Security
- Internal-only network. No public ingress.
- No auth required for internal calls (vLLM doesn't enforce it anyway). LLMTrace authenticates its own egress.
- Log scraping: vLLM access logs → the same OTEL pipeline the proxy uses.
Acceptance
Context
Child of #90. Blocks on #108 (acceptance-gate pass).
Production deployment of the trained model — merging the LoRA, publishing weights, running vLLM as a service.
Scope
1. Merge LoRA
Script location:
autoresearch-rl/examples/security-judge/scripts/merge_and_export.py.2. HF Hub publication
Private repo
epappas/llmtrace-qwen-judge-v1. Include:README.md— training corpus, hyperparameters, headline metrics from research(judge-ft): acceptance-gate evaluation + DeBERTa head-to-head #108, ship-or-no-ship context, intended use.eval_protocol.json— the same metric definitions used during training, for reproducibility.Gate: make it
public=falseuntil legal/compliance approves. Fine-tuned models derived from open weights can inherit licence obligations.3. vLLM serving
Target: one dedicated container in the cluster running:
Helm chart or Compose file under
deployments/tracked in this issue.Expected resource footprint: ~2 GiB VRAM at fp16 for Qwen-0.5B; fits on any small GPU (even A10).
4. Health checks
/v1/modelsreachable.5. Security
Acceptance
merge_and_export.pyproduces a loadable HF directory.