Commit a0b614b

Lloyd authored and committed
privacy-filter: cap GPU memory + release cache to stop VRAM leak
privacy-filter is an inline HF Transformers token-classification server (`pipeline(..., device_map="auto")`) with no memory bound. Under steady traffic the CUDA caching allocator's reserved memory ratchets up and is never released, so the process slowly hoards the GPU it shares with Qwen3-VL, FLUX, embeddings, reranker and whisper (GPU 7). Observed ~93 GB held on an H200 for a model that needs ~1-2 GB.

As privacy-filter fills the card (free ~50 GB -> ~0 over 1-2 days) the largest co-tenant, Qwen3-VL (~49 GB at --gpu-memory-utilization 0.35), can no longer load and crash-loops with `torch.AcceleratorError: CUDA error: out of memory`. The same leak OOM'd embeddings/whisper on 2026-05-25. Hits both small-models hosts (gpu11, gpu02) since they run identical config.

Fix (inline server + container env):
- empty_cache() after every request (core fix): returns cached-but-unused CUDA blocks to the driver so reserved memory stops ratcheting.
- set_per_process_memory_fraction(GPU_MEM_FRACTION, 0) (fail-safe): hard ceiling so the process self-OOMs/restarts instead of starving neighbours. Default 0.10 (~14 GB on a 140 GB H200), env-tunable.
- torch.inference_mode() around inference: no autograd state retained.

Interim mitigation already applied by recreating the container, which frees the leaked VRAM but recurs in ~1-2 days; this makes it permanent. Ship via the normal tag + compose/up redeploy of small-models.yaml.
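The figures quoted in the message can be checked with a few lines of arithmetic. This is only an illustrative sketch: `TOTAL_GB` and `budget_gb` are names invented here, the 140 GB capacity and the 0.10 / 0.35 fractions are taken from the message, and the sketch assumes each fraction is applied to the card's total capacity.

```python
# Worked arithmetic for the numbers in the commit message.
# TOTAL_GB and budget_gb are illustrative names, not part of the commit.
TOTAL_GB = 140  # H200 capacity as stated in the commit message


def budget_gb(fraction: float, total_gb: float = TOTAL_GB) -> float:
    """Memory a process may claim when limited to `fraction` of the card."""
    return fraction * total_gb


privacy_filter_cap = budget_gb(0.10)  # GPU_MEM_FRACTION default
qwen3_vl_budget = budget_gb(0.35)     # --gpu-memory-utilization 0.35

# Before the fix: with ~93 GB held by privacy-filter, what remains for
# every other tenant is already below what Qwen3-VL alone asks for.
left_before_fix = TOTAL_GB - 93
qwen_fits_before_fix = left_before_fix >= qwen3_vl_budget

# After the fix: headroom for FLUX, embeddings, reranker and whisper when
# privacy-filter and Qwen3-VL both sit at their ceilings.
left_for_others = TOTAL_GB - privacy_filter_cap - qwen3_vl_budget

print(f"privacy-filter cap:   {privacy_filter_cap:.0f} GB")
print(f"Qwen3-VL budget:      {qwen3_vl_budget:.0f} GB")
print(f"Qwen3-VL fits before: {qwen_fits_before_fix}")
print(f"left for others:      {left_for_others:.0f} GB")
```

Under these assumptions the cap bounds the worst case at about 14 GB instead of the ~93 GB observed, which is why the message treats it as the fail-safe and `empty_cache()` as the actual fix.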
1 parent f8ad79e commit a0b614b

1 file changed

Lines changed: 48 additions & 23 deletions

small-models.yaml

@@ -193,6 +193,7 @@ x-privacy-filter-common: &privacy-filter-common
 "uvicorn[standard]"
 WORKDIR /app
 COPY <<'PYEOF' /app/server.py
+import os
 import torch
 from fastapi import FastAPI, HTTPException
 from pydantic import BaseModel, Field
@@ -201,6 +202,16 @@ x-privacy-filter-common: &privacy-filter-common
 MODEL_ID = "openai/privacy-filter"
 MODEL_REVISION = "7ffa9a043d54d1be65afb281eddf0ffbe629385b"
 
+# GPU 7 is shared with Qwen3-VL / FLUX / embeddings / reranker / whisper.
+# The HF pipeline's CUDA caching allocator ratchets its reserved memory up
+# under traffic and never releases it, slowly hoarding the whole card and
+# starving the co-located models until they OOM (Qwen3-VL crash-loops).
+# Cap this process to a small fraction of the device so it is fail-safe:
+# it self-OOMs and restarts instead of stealing VRAM from its neighbours.
+GPU_MEM_FRACTION = float(os.environ.get("GPU_MEM_FRACTION", "0.10"))
+if torch.cuda.is_available():
+    torch.cuda.set_per_process_memory_fraction(GPU_MEM_FRACTION, 0)
+
 clf = pipeline(
     "token-classification",
     model=MODEL_ID,
@@ -231,29 +242,39 @@ x-privacy-filter-common: &privacy-filter-common
     if not texts or any(not isinstance(t, str) for t in texts):
         raise HTTPException(400, "input must be a non-empty string or list of strings")
 
-    # HF pipeline defaults batch_size=1 — pass 32 so the GPU is actually
-    # fed in parallel for list inputs.
-    raw = clf(texts, batch_size=32)
-
-    # Single batched tokenize for usage counts instead of N sequential calls.
-    tok_lens = [len(ids) for ids in tokenizer(texts).input_ids]
-
-    data = []
-    for i, spans in enumerate(raw):
-        kept = [
-            {
-                "category": s["entity_group"],
-                "score": float(s["score"]),
-                "text": s["word"],
-                "start": int(s["start"]),
-                "end": int(s["end"]),
-            }
-            for s in spans
-            if float(s["score"]) >= req.threshold
-        ]
-        data.append({"index": i, "spans": kept, "usage": {"input_tokens": tok_lens[i]}})
-
-    return {"model": MODEL_ID, "data": data}
+    try:
+        # HF pipeline defaults batch_size=1 — pass 32 so the GPU is actually
+        # fed in parallel for list inputs. inference_mode avoids retaining
+        # any autograd state across requests.
+        with torch.inference_mode():
+            raw = clf(texts, batch_size=32)
+
+        # Single batched tokenize for usage counts instead of N sequential calls.
+        tok_lens = [len(ids) for ids in tokenizer(texts).input_ids]
+
+        data = []
+        for i, spans in enumerate(raw):
+            kept = [
+                {
+                    "category": s["entity_group"],
+                    "score": float(s["score"]),
+                    "text": s["word"],
+                    "start": int(s["start"]),
+                    "end": int(s["end"]),
+                }
+                for s in spans
+                if float(s["score"]) >= req.threshold
+            ]
+            data.append({"index": i, "spans": kept, "usage": {"input_tokens": tok_lens[i]}})
+
+        return {"model": MODEL_ID, "data": data}
+    finally:
+        # Return cached-but-unused CUDA blocks to the driver after every
+        # request so reserved memory does not ratchet up over time on the
+        # shared GPU. This is the core leak fix; the fraction cap above is
+        # the fail-safe.
+        if torch.cuda.is_available():
+            torch.cuda.empty_cache()
 PYEOF
 EXPOSE 8000
 CMD ["uvicorn", "server:app", "--host", "0.0.0.0", "--port", "8000"]
@@ -263,6 +284,10 @@ x-privacy-filter-common: &privacy-filter-common
       - HF_TOKEN=${HUGGING_FACE_HUB_TOKEN}
       - HF_HUB_OFFLINE=${HF_HUB_OFFLINE:-0}
       - NVIDIA_DRIVER_CAPABILITIES=compute,utility
+      # Hard ceiling on this process's share of the shared GPU (see server.py).
+      # Tune without rebuilding the image. ~0.10 of a 140 GB H200 ≈ 14 GB, ample
+      # for the classifier and leaves the card for Qwen3-VL/FLUX/etc.
+      - GPU_MEM_FRACTION=0.10
     restart: unless-stopped
     stop_grace_period: 5m
     logging: *logging-conf
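Because `GPU_MEM_FRACTION` is now tunable from the compose file, a mistyped value would only surface when the server starts. The sketch below is not part of this commit: `read_fraction` and `vram_snapshot` are hypothetical helpers showing one way to reject out-of-range values early and to compare allocated against reserved memory when checking that the cache release works. It assumes the container sees the shared GPU as device 0 (as the diff's `set_per_process_memory_fraction(..., 0)` call suggests), and rejecting 0 is a choice made here, not a PyTorch requirement.

```python
import os


def read_fraction(name: str = "GPU_MEM_FRACTION", default: float = 0.10) -> float:
    """Parse the cap from the environment, rejecting values outside (0, 1]."""
    raw = os.environ.get(name, str(default))
    try:
        value = float(raw)
    except ValueError:
        raise ValueError(f"{name}={raw!r} is not a number") from None
    # NaN and infinities fail this comparison too, so they are rejected here.
    if not 0.0 < value <= 1.0:
        raise ValueError(f"{name}={value} must be in (0, 1]")
    return value


def vram_snapshot(device: int = 0):
    """Return allocator and device memory in bytes, or None without torch/CUDA."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    free, total = torch.cuda.mem_get_info(device)
    return {
        "allocated": torch.cuda.memory_allocated(device),  # live tensors
        "reserved": torch.cuda.memory_reserved(device),    # live + cached blocks
        "free": free,
        "total": total,
    }


if __name__ == "__main__":
    print("fraction:", read_fraction())
    print("snapshot:", vram_snapshot())
```

If the fix behaves as described, `reserved` should stay close to `allocated` between requests instead of climbing; a steadily widening gap would suggest cached blocks are still not being returned.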
