privacy-filter: cap GPU memory + release cache to stop VRAM leak#51
Open
lloydmak99 wants to merge 2 commits into
privacy-filter is an inline HF Transformers token-classification server (`pipeline(..., device_map="auto")`) with no memory bound. Under steady traffic the CUDA caching allocator's reserved memory ratchets up and is never released, so the process slowly hoards the GPU it shares with Qwen3-VL, FLUX, embeddings, reranker and whisper (GPU 7). Observed ~93 GB held on an H200 for a model that needs ~1-2 GB.

As privacy-filter fills the card (free ~50 GB -> ~0 over 1-2 days) the largest co-tenant, Qwen3-VL (~49 GB at --gpu-memory-utilization 0.35), can no longer load and crash-loops with `torch.AcceleratorError: CUDA error: out of memory`. The same leak OOM'd embeddings/whisper on 2026-05-25. Hits both small-models hosts (gpu11, gpu02) since they run identical config.

Fix (inline server + container env):
- empty_cache() after every request (core fix): returns cached-but-unused CUDA blocks to the driver so reserved memory stops ratcheting.
- set_per_process_memory_fraction(GPU_MEM_FRACTION, 0) (fail-safe): hard ceiling so the process self-OOMs/restarts instead of starving neighbours. Default 0.10 (~14 GB on a 140 GB H200), env-tunable.
- torch.inference_mode() around inference: no autograd state retained.

Interim mitigation already applied by recreating the container, which frees the leaked VRAM but recurs in ~1-2 days; this makes it permanent. Ship via the normal tag + compose/up redeploy of small-models.yaml.
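The three first-cut changes listed in this commit message can be sketched roughly as below. This is a minimal illustration, not the PR's actual `server.py`: the helper names `cap_gpu_memory` and `classify` are hypothetical, and `pipe` stands in for the Transformers pipeline object. Note that the follow-up commit in this PR later drops all three measures.

```python
import os
from contextlib import nullcontext

try:
    import torch
    HAVE_CUDA = torch.cuda.is_available()
except ImportError:  # lets the sketch run as a dry run on a machine without PyTorch
    torch = None
    HAVE_CUDA = False

# Env-tunable ceiling; 0.10 is the default quoted in the commit message.
GPU_MEM_FRACTION = float(os.environ.get("GPU_MEM_FRACTION", "0.10"))


def cap_gpu_memory(fraction: float = GPU_MEM_FRACTION, device: int = 0) -> bool:
    """Apply the per-process ceiling; returns False when no CUDA device exists."""
    if not HAVE_CUDA:
        return False
    torch.cuda.set_per_process_memory_fraction(fraction, device)
    return True


def classify(pipe, texts, batch_size: int = 32):
    """Run one request, then return cached-but-unused blocks to the driver."""
    # inference_mode() keeps autograd from retaining state for this call.
    guard = torch.inference_mode() if torch is not None else nullcontext()
    try:
        with guard:
            return pipe(texts, batch_size=batch_size)
    finally:
        if HAVE_CUDA:
            torch.cuda.empty_cache()
```

In this sketch `cap_gpu_memory()` would be called once at startup and `classify()` once per request; the `finally` clause is what puts `empty_cache()` on the request path.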
Tracking issue: nearai/infra#158
…_segments)

Addresses the code review of the first cut:
- Root cause now fixed at the source: PYTORCH_CUDA_ALLOC_CONF=expandable_segments lets the CUDA allocator shrink reserved segments instead of ratcheting up.
- Drop per-request torch.cuda.empty_cache(): a synchronizing cudaFree on the hot path stalled the shared GPU and the co-located models it was meant to protect. A 30s watchdog thread now releases idle blocks off the request path.
- Real fail-safe instead of a silent 500-storm: the watchdog hard-restarts the container (os._exit -> restart:unless-stopped) if this process's reserved VRAM exceeds GPU_MEM_LIMIT_GB, and an acute CUDA-OOM in a request also exits. The prior "self-OOMs and restarts" comment was false: a caught OOM returned 500 while the process stayed up behind a still-healthy /v1/models probe.
- Drop set_per_process_memory_fraction: the 0.10 (~14GB) guess could OOM legit batch_size=32 requests, and device_map="auto" planned against the full card and ignored the cap anyway. Bound the work via PRIVACY_BATCH_SIZE instead; inputs are NOT truncated (a privacy filter must see the whole text).
- device=0 instead of device_map="auto" (no accelerate planner mismatch).
- Drop torch.inference_mode(): redundant with the pipeline's internal no_grad and stricter (risked raising under trust_remote_code custom models).
- Tolerant env parsing + clamps so a malformed knob can't crash-loop boot.

Validated: small-models.yaml parses and the embedded server.py compiles.
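A rough sketch of how the watchdog and the tolerant env parsing described above might fit together. The function names (`env_float`, `watchdog_tick`, `start_watchdog`) and the clamp bounds are hypothetical, and the sketch assumes the watchdog simply calls `torch.cuda.empty_cache()` on every tick; the PR's real idle detection is not shown in this page.

```python
import os
import threading


def env_float(name, default, lo, hi):
    """Parse a float knob; fall back to the default on garbage, then clamp."""
    try:
        value = float(os.environ.get(name, default))
    except (TypeError, ValueError):
        value = default
    if value != value:  # NaN never equals itself; treat it as malformed
        value = default
    return max(lo, min(hi, value))


def watchdog_tick(reserved_gb, limit_gb, release, restart):
    """One pass: restart when over the limit, otherwise release idle blocks."""
    if reserved_gb() > limit_gb:
        restart()
        return "restart"
    release()
    return "release"


def start_watchdog(limit_gb, interval_s=30.0):
    """Run watchdog_tick every interval_s seconds on a daemon thread."""
    import torch  # imported lazily so the helpers above work without PyTorch

    def reserved_gb():
        # Reserved (not merely allocated) bytes held by this process on GPU 0.
        return torch.cuda.memory_reserved(0) / 1024**3

    def hard_restart():
        # Exit without cleanup so the container restart policy takes over.
        os._exit(1)

    def loop():
        never_set = threading.Event()
        while not never_set.wait(interval_s):
            watchdog_tick(reserved_gb, limit_gb, torch.cuda.empty_cache, hard_restart)

    thread = threading.Thread(target=loop, name="vram-watchdog", daemon=True)
    thread.start()
    return thread
```

Keeping `watchdog_tick` free of any PyTorch calls means the restart decision can be exercised with fake callbacks, without a GPU.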
Revised in 7295a65 to address review:
Validated: YAML parses and the embedded
Problem
`privacy-filter` (inline HF Transformers token-classification server in `small-models.yaml`, `pipeline(..., device_map="auto")`, per-request `batch_size=32`) has no GPU memory bound. Under steady traffic PyTorch's CUDA caching allocator ratchets its reserved memory up and never releases it, so the process slowly hoards the GPU it shares with Qwen3-VL, FLUX, embeddings, reranker and whisper (GPU 7).

Measured on 2026-05-29 (gpu11, H200 ~140 GB): recreating `privacy-filter` freed ~93 GB, for a model that needs ~1–2 GB.

Impact
As privacy-filter fills the card (free ~50 GB → ~0 over 1–2 days), the largest co-tenant, Qwen3-VL (~49 GB at `--gpu-memory-utilization 0.35`), can no longer load and crash-loops with `torch.AcceleratorError: CUDA error: out of memory`. The same leak OOM'd embeddings/whisper on 2026-05-25 ("No available memory for cache blocks"). Affects both small-models hosts (gpu11 and gpu02, identical config).

This is not a static GPU-budget misconfig of the small models, and not gemma (different GPUs): the vLLM/SGLang co-tenants hard-cap their VRAM, so the only unbounded consumer is the raw-HF privacy-filter.
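The budget arithmetic behind these figures can be checked with a few lines. This is a simplified model that treats the card as exactly 140 GB and reduces "can load" to a single comparison of free against requested memory; the function names are illustrative only.

```python
H200_GB = 140.0  # approximate card size quoted in this description


def budget_gb(utilization: float, total_gb: float = H200_GB) -> float:
    """Memory a tenant requests when it is given a fraction of the card."""
    return utilization * total_gb


def can_load(free_gb: float, needed_gb: float) -> bool:
    """Simplified admission check: the tenant loads only if its budget fits."""
    return free_gb >= needed_gb


qwen_gb = budget_gb(0.35)  # roughly 49 GB, matching the figure quoted above

print(f"Qwen3-VL needs about {qwen_gb:.0f} GB")
print("loads with ~50 GB free:", can_load(50.0, qwen_gb))
print("loads with ~0 GB free:", can_load(0.0, qwen_gb))
```

The margin is thin even on a healthy card: with only ~50 GB free against a ~49 GB request, a few GB of growth in privacy-filter is enough to flip the check.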
How it was isolated
Recreate-and-watch (per-process `nvidia-smi` is unreachable because CVMs reject SSH and compose-manager has no exec): recreating FLUX freed only its ~22 GB static pool and Qwen3-VL kept crash-looping; recreating privacy-filter freed ~93 GB and Qwen3-VL recovered.

Fix
Inline `server.py` + container env:
- `torch.cuda.empty_cache()` after every request (core fix): returns cached-but-unused CUDA blocks to the driver so reserved memory stops ratcheting up.
- `torch.cuda.set_per_process_memory_fraction(GPU_MEM_FRACTION, 0)` (fail-safe): hard ceiling so the process self-OOMs/restarts instead of starving its neighbours. Default `GPU_MEM_FRACTION=0.10` (~14 GB on a 140 GB H200), env-tunable without an image rebuild.
- `torch.inference_mode()` around inference: no autograd state retained across requests.

Validated: `small-models.yaml` parses and the embedded `server.py` compiles.

Deploy
Normal tag + redeploy of `small-models.yaml` to both hosts (`POST :8080/compose/up` with the new tag, `services:["<privacy-filter container>"]`, `force_recreate:true`).

Follow-up (optional)
If reserved-memory fragmentation still creeps, add `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True` (left out here to avoid any interaction with the per-process fraction cap on torch 2.5.1).
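If that follow-up is taken, the option has to be in the environment before the CUDA allocator initialises. In this deployment it would most naturally go in the container env; as an alternative, a sketch of setting it from Python while preserving any allocator options already present is shown below. The helper name `with_expandable_segments` is hypothetical.

```python
import os


def with_expandable_segments(existing: str = "") -> str:
    """Return an allocator config string that includes expandable_segments:True.

    PYTORCH_CUDA_ALLOC_CONF is a comma-separated list of option:value pairs,
    so other options are kept and any earlier expandable_segments entry is
    replaced.
    """
    options = [opt.strip() for opt in existing.split(",") if opt.strip()]
    options = [opt for opt in options if not opt.startswith("expandable_segments:")]
    options.append("expandable_segments:True")
    return ",".join(options)


# Set this before the first CUDA allocation; doing it ahead of `import torch`
# is the safe ordering.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = with_expandable_segments(
    os.environ.get("PYTORCH_CUDA_ALLOC_CONF", "")
)
```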