privacy-filter: revise GPU-leak fix per review (watchdog + expandable_segments)
Addresses the code review of the first cut:
- Root cause now fixed at the source: PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
lets the CUDA allocator shrink reserved segments instead of ratcheting up.
- Drop per-request torch.cuda.empty_cache(): a synchronizing cudaFree on the hot
path stalled the shared GPU and the co-located models it was meant to protect.
A 30s watchdog thread now releases idle blocks off the request path.
- Real fail-safe instead of a silent 500-storm: the watchdog hard-restarts the
container (os._exit -> restart:unless-stopped) if this process's reserved VRAM
exceeds GPU_MEM_LIMIT_GB, and an acute CUDA-OOM in a request also exits. The
prior "self-OOMs and restarts" comment was false — a caught OOM returned 500
while the process stayed up behind a still-healthy /v1/models probe.
- Drop set_per_process_memory_fraction: the 0.10 (~14GB) guess could OOM legit
batch_size=32 requests, and device_map="auto" planned against the full card
and ignored the cap anyway. Bound the work via PRIVACY_BATCH_SIZE instead;
inputs are NOT truncated (a privacy filter must see the whole text).
- device=0 instead of device_map="auto" (no accelerate planner mismatch).
- Drop torch.inference_mode(): redundant with the pipeline's internal no_grad
and stricter (risked raising under trust_remote_code custom models).
- Tolerant env parsing + clamps so a malformed knob can't crash-loop boot.
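A minimal sketch of what tolerant parsing with clamps could look like. The helper names, defaults, and bounds below are hypothetical and are not taken from server.py; only the knob names PRIVACY_BATCH_SIZE and GPU_MEM_LIMIT_GB come from this message.

```python
import os


def env_int(name, default, lo, hi, environ=os.environ):
    """Read an integer knob; fall back to `default` on junk, then clamp to [lo, hi]."""
    raw = environ.get(name)
    try:
        value = int(raw) if raw is not None else default
    except ValueError:
        # Malformed value (e.g. "abc", "3.5", ""): do not crash at boot.
        value = default
    return max(lo, min(hi, value))


def env_float(name, default, lo, hi, environ=os.environ):
    """Read a float knob; fall back to `default` on junk or NaN, then clamp."""
    raw = environ.get(name)
    try:
        value = float(raw) if raw is not None else default
    except ValueError:
        value = default
    if value != value:  # NaN never compares equal to itself
        value = default
    return max(lo, min(hi, value))


# Hypothetical defaults and bounds, for illustration only.
BATCH_SIZE = env_int("PRIVACY_BATCH_SIZE", default=32, lo=1, hi=256)
MEM_LIMIT_GB = env_float("GPU_MEM_LIMIT_GB", default=20.0, lo=1.0, hi=80.0)
```

With this shape, a typo such as PRIVACY_BATCH_SIZE=abc falls back to the default, and an absurd value is clamped into range, so neither can stop the process from starting.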
Validated: small-models.yaml parses and the embedded server.py compiles.
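For reviewers, the watchdog behaviour described in the list above can be sketched roughly as follows. This is an illustration under assumptions, not the code in server.py: the function names are hypothetical, and releasing idle blocks before checking the limit is an assumed ordering. The three callables are injected so the sketch runs without a GPU; in the service they would plausibly be torch.cuda.memory_reserved, torch.cuda.empty_cache, and os._exit.

```python
import threading

GIB = 1024 ** 3


def watchdog_tick(reserved_bytes, limit_gb, release, hard_exit):
    """One pass: release idle cached blocks, then hard-exit if still over the limit."""
    release()  # off the request path, so the sync cost is not paid per request
    if reserved_bytes() > limit_gb * GIB:
        hard_exit(1)  # the container restart policy brings the process back
        return "exit"
    return "ok"


def start_watchdog(reserved_bytes, limit_gb, release, hard_exit, interval_s=30.0):
    """Run watchdog_tick every `interval_s` seconds on a daemon thread."""
    stop = threading.Event()

    def loop():
        # Event.wait returns False on timeout, True once stop.set() is called.
        while not stop.wait(interval_s):
            if watchdog_tick(reserved_bytes, limit_gb, release, hard_exit) == "exit":
                return

    threading.Thread(target=loop, daemon=True, name="gpu-watchdog").start()
    return stop  # call stop.set() to end the loop (useful in tests)
```

Keeping the single pass in its own function makes the limit check testable with fake callables, without a GPU or a real process exit.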