Skip to content

ops(judge-ft): merge LoRA, push to HF Hub, stand up vLLM serving #109

@epappas

Description

@epappas

Context

Child of #90. Blocks on #108 (acceptance-gate pass).

Production deployment of the trained model — merging the LoRA, publishing weights, running vLLM as a service.

Scope

1. Merge LoRA

from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct", torch_dtype="auto")
merged = PeftModel.from_pretrained(base, "path/to/lora-best").merge_and_unload()
merged.save_pretrained("./llmtrace-qwen-judge-v1")
AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct").save_pretrained("./llmtrace-qwen-judge-v1")

Script location: autoresearch-rl/examples/security-judge/scripts/merge_and_export.py.

2. HF Hub publication

Private repo epappas/llmtrace-qwen-judge-v1. Include:

Gate: make it public=false until legal/compliance approves. Fine-tuned models derived from open weights can inherit licence obligations.

3. vLLM serving

Target: one dedicated container in the cluster running:

vllm serve ./llmtrace-qwen-judge-v1 \
  --host 0.0.0.0 \
  --port 8000 \
  --max-model-len 4096 \
  --dtype auto \
  --gpu-memory-utilization 0.85 \
  --served-model-name llmtrace-qwen-judge-v1

Helm chart or Compose file under deployments/ tracked in this issue.

Expected resource footprint: ~2 GiB VRAM at fp16 for Qwen-0.5B; fits on any small GPU (even A10).

4. Health checks

  • /v1/models reachable.
  • One canned-prompt call returns a valid 6-field verdict within 500 ms.
  • Auto-restart on crash (Kubernetes liveness probe or Compose restart policy).

5. Security

  • Internal-only network. No public ingress.
  • No auth required for internal calls (vLLM doesn't enforce it anyway). LLMTrace authenticates its own egress.
  • Log scraping: vLLM access logs → the same OTEL pipeline the proxy uses.

Acceptance

  • merge_and_export.py produces a loadable HF directory.
  • HF Hub repo populated, private.
  • vLLM container running in the target cluster, reachable from LLMTrace pods.
  • Canned-prompt smoke test passes from inside the cluster.
  • Helm/Compose artefact committed.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions