Round-trip PII redaction for Claude Code. A transparent local proxy sits between Claude Code and the Anthropic API:
- Outbound: names, emails, phones, addresses, URLs, dates, account numbers, and secrets are detected by `openai/privacy-filter` and replaced with deterministic short tokens like `name-a3f2b1`.
- Inbound: when Claude's response (streaming or not) refers to those tokens, they are restored to the real values before the response reaches Claude Code. From your perspective, nothing changed.

The Anthropic API never sees the real values; Claude operates on opaque
typed tokens and can still reason about them (name-a3f2b1 ≠ name-9c7e21,
email-fad6db is "an email", secret-aabbcc is "a credential"). A
deterministic HMAC keeps the same real value mapping to the same token
across requests, sessions, and proxy restarts.
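
The deterministic mapping described above can be sketched with Python's standard `hmac` module. This is a minimal illustration, not the proxy's actual code: the function name `mint_token`, the `kind:value` message layout, SHA-256, and truncation to six hex digits are all assumptions chosen to match the token shape shown above.

```python
import hashlib
import hmac


def mint_token(kind: str, value: str, secret: bytes, digits: int = 6) -> str:
    """Return a typed token such as 'name-a3f2b1' for a real value.

    Hypothetical helper: the same (kind, value, secret) always yields
    the same token, which is what makes the mapping stable across runs.
    """
    message = f"{kind}:{value}".encode("utf-8")
    digest = hmac.new(secret, message, hashlib.sha256).hexdigest()
    return f"{kind}-{digest[:digits]}"


# Placeholder seed for the example only; the proxy keeps its own seed on disk.
secret = b"example-seed-not-a-real-key"

first = mint_token("name", "Ada Lovelace", secret)
again = mint_token("name", "Ada Lovelace", secret)
other = mint_token("name", "Grace Hopper", secret)
```

Because the token depends only on the value and the seed, no lookup is needed to mint the same token again; the stored map is only needed for the reverse direction.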
/plugin marketplace add https://github.com/NodeNestor/nestor-plugins
/plugin install pii-proxy
Restart your terminal and start a new Claude Code session. The SessionStart
hook will:
- Create `proxy/.venv` and install `onnxruntime`, `tokenizers`, `huggingface_hub`, `numpy`.
- Set `ANTHROPIC_BASE_URL=http://127.0.0.1:5599` in `~/.claude/settings.json` (chains transparently if rolling-context or another proxy is already configured).
- Start the proxy in the background.

First request downloads ~1.5 GB of int8 ONNX weights from Hugging Face
(or ~770 MB if you pin PII_PROXY_QUANT=model_q4f16.onnx). After that,
detection is local — no network calls outside of the Anthropic upstream
itself.
git clone https://github.com/NodeNestor/claude-pii-proxy
cd claude-pii-proxy
# Windows
powershell -ExecutionPolicy Bypass -File install.ps1
# Linux / macOS
./install.sh

The installer:
- Creates `proxy/.venv` and installs `onnxruntime`, `tokenizers`, `huggingface_hub`, `numpy`.
- Sets `ANTHROPIC_BASE_URL=http://127.0.0.1:5599` in `~/.claude/settings.json` (chains transparently if another proxy was already configured).
- Symlinks the repo into `~/.claude/plugins/pii-proxy` so the `SessionStart` hook keeps the proxy alive.

/pii-gpu cuda # NVIDIA + CUDA via onnxruntime-gpu
/pii-gpu directml # Win/AMD/Intel via DirectX 12 (note: broken on q4 graphs)
/pii-gpu coreml # Apple Silicon
/pii-gpu cpu # back to default
The command reinstalls onnxruntime in the proxy venv and writes
PII_PROXY_PROVIDERS into settings.json. Restart Claude Code afterwards.
Claude Code ──► PII Proxy (:5599) ──► Anthropic API
                  │  detect & redact text fields
                  │  persist real ↔ token map
                  │
            ◄──  restore tokens in SSE stream / JSON response
- Detection: `openai/privacy-filter` (1.5 B total / 50 M active params, MoE, Apache-2.0). Runs locally via onnxruntime (no PyTorch).
- Default quant: int8 `model_quantized.onnx`, the fastest variant on CPU (~45 ms / single, ~6 ms / string when batched). Override with `PII_PROXY_QUANT=model_q4f16.onnx` for smallest disk (~770 MB) at higher latency.
- Length bucketing: power-of-two padded lengths so a single 2 K-token system prompt doesn't drag short messages along to the same forward-pass size.
- Volatile-tag-aware caching: `<system-reminder>`, `<local-command-stdout>`, `<local-command-caveat>` and `<available-deferred-tools>` are split out of the cache key. Stable message content survives across turns even though Claude Code refreshes the timestamps each request.
- 1M-context support: inputs >8 000 chars are split into overlapping sub-chunks (320-char overlap) and detected per chunk. Each sub-chunk hashes/caches independently, so re-sending the same 100 KB code paste is 0 ms after the first time.
- Persistent span cache at `~/.claude/pii-proxy-spans.jsonl`: detection results survive proxy restarts and are shared across Claude Code sessions. Slow first-time inference happens at most once per unique chunk on this machine, ever. Append-only JSONL, unbounded by design, with no eviction: we paid the inference cost once, we keep the result forever. Use `/pii-clear-cache` if you ever want to reclaim disk.

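
To make the sub-chunking and length bucketing above concrete, here is a rough sketch. The 8 000-char limit and 320-char overlap come from the list above; the function names, the minimum bucket of 16, and the exact stepping are assumptions for illustration and may differ from what `engine.py` does.

```python
from typing import List, Tuple


def split_with_overlap(text: str, max_chars: int = 8000,
                       overlap: int = 320) -> List[Tuple[int, str]]:
    """Split text into (offset, chunk) pairs.

    Consecutive chunks share `overlap` characters so that an entity
    sitting on a chunk boundary is fully visible in at least one chunk.
    """
    if len(text) <= max_chars:
        return [(0, text)]
    step = max_chars - overlap
    chunks = []
    start = 0
    while True:
        chunks.append((start, text[start:start + max_chars]))
        if start + max_chars >= len(text):
            break
        start += step
    return chunks


def bucket_length(n_tokens: int, minimum: int = 16) -> int:
    """Round a token count up to the next power-of-two bucket."""
    size = minimum
    while size < n_tokens:
        size *= 2
    return size


sample = "".join(chr(97 + i % 26) for i in range(20000))
pieces = split_with_overlap(sample)
```

Returning the offset with each chunk lets detected spans be shifted back into the coordinates of the original text. With bucketing, a 20-token message pads to 32 rather than to the length of the longest input in the request.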
| Command | What it does |
|---|---|
| `/pii-stats` | Show config, providers, cache sizes, mapping count |
| `/pii-config` | Show every tunable + how to change it |
| `/pii-clear-cache` | Drop the span cache (token map kept; safe, just slows the next request on already-seen content) |
| `/pii-clear-tokens --confirm` | DANGEROUS: wipes the token↔value map. Past tokens in conversation history will stop restoring |
| `/pii-gpu cuda\|directml\|coreml\|cpu` | Switch ORT package + provider |

Every tunable can be set in `~/.claude/settings.json` under `env`:

| Var | Default | Meaning |
|---|---|---|
| `PII_PROXY_PORT` | `5599` | Listen port |
| `PII_PROXY_UPSTREAM` | `https://api.anthropic.com` | Upstream URL (auto-set when chaining behind another proxy) |
| `PII_PROXY_MODEL` | `openai/privacy-filter` | HF model id |
| `PII_PROXY_QUANT` | `model_quantized.onnx` (int8) | Pin a specific ONNX variant, e.g. `model_q4f16.onnx` for smallest disk |
| `PII_PROXY_PROVIDERS` | auto | Comma-separated ORT providers in priority order, e.g. `CUDAExecutionProvider,CPUExecutionProvider` |
| `PII_PROXY_THREADS` | `min(8, cpu/2)` | ORT `intra_op_num_threads`. 8 is the sweet spot on most CPUs for this model |
| `PII_PROXY_MIN_SCORE` | `0.5` | Minimum classifier confidence to redact |
| `PII_PROXY_WARMUP` | `1` | Pre-load the model at startup |

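
As a rough illustration of how these variables and defaults fit together, the sketch below reads them from an environment mapping. The function name `load_config`, the returned key names, the floor of one thread, and the reading of `PII_PROXY_WARMUP=1` as a boolean are assumptions; only the variable names and defaults come from the table.

```python
import os
from typing import Dict, Mapping, Optional


def load_config(env: Optional[Mapping[str, str]] = None) -> Dict[str, object]:
    """Resolve proxy settings from environment variables (hypothetical helper)."""
    env = os.environ if env is None else env

    # Table default: min(8, cpu/2). The floor of 1 is an added guard.
    cpus = os.cpu_count() or 1
    default_threads = max(1, min(8, cpus // 2))

    # "auto" means: let the runtime choose; otherwise keep priority order.
    raw = env.get("PII_PROXY_PROVIDERS", "auto")
    providers = None if raw == "auto" else [
        name.strip() for name in raw.split(",") if name.strip()
    ]

    return {
        "port": int(env.get("PII_PROXY_PORT", "5599")),
        "upstream": env.get("PII_PROXY_UPSTREAM", "https://api.anthropic.com"),
        "model": env.get("PII_PROXY_MODEL", "openai/privacy-filter"),
        "quant": env.get("PII_PROXY_QUANT", "model_quantized.onnx"),
        "providers": providers,
        "threads": int(env.get("PII_PROXY_THREADS", default_threads)),
        "min_score": float(env.get("PII_PROXY_MIN_SCORE", "0.5")),
        "warmup": env.get("PII_PROXY_WARMUP", "1") == "1",
    }


defaults = load_config({})
gpu = load_config({
    "PII_PROXY_PROVIDERS": "CUDAExecutionProvider,CPUExecutionProvider",
    "PII_PROXY_THREADS": "4",
})
```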
| Path | What | Safe to delete? |
|---|---|---|
| `~/.claude/pii-proxy-spans.jsonl` | Cached PII spans (append-only JSONL, unbounded). Each unique chunk takes ~200–1000 B. Restoring from this file is essentially free, so we never evict; large files are not a performance problem because appends are O(new entries), not O(file size). | Yes, via `/pii-clear-cache` |
| `~/.claude/pii-proxy-map.json` | Token ↔ real-value map | Not while past conversations exist: old `name-XXXXXX` tokens become unrestorable forever |
| `~/.claude/pii-proxy.secret` | HMAC seed for deterministic tokens | If deleted, future tokens differ; old map still works for old tokens |
| `~/.claude/pii-proxy.log` | Proxy stdout/stderr | Yes |
| `~/.claude/pii-proxy-debug.log` | Request lifecycle | Yes |

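
The append-only span cache can be pictured roughly as follows. This is a sketch under assumptions: the class name `SpanCache`, the record layout (`key` and `spans` fields), and SHA-256 as the chunk hash are illustrative and are not taken from the proxy's actual file format.

```python
import hashlib
import json
import os
from typing import Dict, List, Optional


class SpanCache:
    """Append-only JSONL cache keyed by a hash of the chunk text."""

    def __init__(self, path: str) -> None:
        self.path = path
        self.entries: Dict[str, List[dict]] = {}
        # Replay the file once at startup; later records win on duplicates.
        if os.path.exists(path):
            with open(path, "r", encoding="utf-8") as fh:
                for line in fh:
                    line = line.strip()
                    if line:
                        record = json.loads(line)
                        self.entries[record["key"]] = record["spans"]

    @staticmethod
    def key_for(chunk: str) -> str:
        return hashlib.sha256(chunk.encode("utf-8")).hexdigest()

    def get(self, chunk: str) -> Optional[List[dict]]:
        return self.entries.get(self.key_for(chunk))

    def put(self, chunk: str, spans: List[dict]) -> None:
        key = self.key_for(chunk)
        if key in self.entries:
            return  # already cached; nothing new to append
        self.entries[key] = spans
        # Appending one line costs the same regardless of file size.
        with open(self.path, "a", encoding="utf-8") as fh:
            fh.write(json.dumps({"key": key, "spans": spans}) + "\n")
```

Storing only the hash and the span offsets means the cache file itself never needs to hold the chunk text in this sketch.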
GET /health JSON: status, upstream, model, mapping count
GET /stats Human-readable config + cache report
POST /admin/cache/clear Wipe span cache (in-memory + disk)
POST /admin/tokens/clear Wipe token map (in-memory + disk) — destructive
GET /debug/map Counts only, never reveals real values
All other paths under /v1/* are forwarded to the Anthropic upstream after redaction.
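
For scripting against these endpoints, a small standard-library client might look like the sketch below. The helper names are hypothetical, and the base URL assumes the default port; the endpoint paths and methods are the ones listed above.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:5599"  # default listen port; adjust if PII_PROXY_PORT is set


def build_request(path: str, method: str = "GET",
                  base_url: str = BASE_URL) -> urllib.request.Request:
    """Build a request for one of the proxy's local endpoints."""
    return urllib.request.Request(base_url.rstrip("/") + path, method=method)


def call(path: str, method: str = "GET", timeout: float = 5.0) -> str:
    """Send the request and return the response body as text."""
    with urllib.request.urlopen(build_request(path, method), timeout=timeout) as resp:
        return resp.read().decode("utf-8")


def health() -> dict:
    """Fetch GET /health, which returns JSON according to the list above."""
    return json.loads(call("/health"))
```

With the proxy running, `health()` returns the status dictionary and `call("/admin/cache/clear", "POST")` wipes the span cache.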
Conservative by default: only text content blocks and `tool_result` text
get redacted. Tool definitions, `tool_use` input dicts, model IDs, and other
metadata are left alone so file paths and command arguments aren't mangled.
A name embedded inside a path like `C:\Users\ludde\foo.py` will not be
redacted there, because the substitution would break the path.
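
The scoping rule above can be sketched as a small payload walker. This is an illustration only: the function names are hypothetical, the payload shape follows the Anthropic Messages format (`messages` with string or block-list `content`), treating plain-string message content as text is an assumption, and the `system` field is left out for brevity.

```python
import copy
from typing import Any, Callable, Dict


def redact_block(block: Any, redact: Callable[[str], str]) -> None:
    """Redact one content block in place, touching only text-bearing fields."""
    if not isinstance(block, dict):
        return
    kind = block.get("type")
    if kind == "text" and isinstance(block.get("text"), str):
        block["text"] = redact(block["text"])
    elif kind == "tool_result":
        inner = block.get("content")
        if isinstance(inner, str):
            block["content"] = redact(inner)
        elif isinstance(inner, list):
            for sub in inner:
                redact_block(sub, redact)
    # Anything else (tool_use inputs, images, metadata) is left untouched.


def redact_payload(payload: Dict[str, Any],
                   redact: Callable[[str], str]) -> Dict[str, Any]:
    """Return a redacted copy of a request body; the input is not modified."""
    out = copy.deepcopy(payload)
    for message in out.get("messages", []):
        content = message.get("content")
        if isinstance(content, str):
            message["content"] = redact(content)
        elif isinstance(content, list):
            for block in content:
                redact_block(block, redact)
    return out
```

Passing the redaction step in as a callable keeps the walker independent of how detection and token minting are done.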
| Scenario | Latency |
|---|---|
| Cold model load | ~10 s (one-time per process; SessionStart hook hides it) |
| Single short string | ~45 ms |
| Batch of 16 short strings | ~89 ms (5.5 ms / string) |
| Batch of 32 short strings | ~280 ms (8.8 ms / string) |
| 25 KB doc — first time ever | ~3.8 s |
| 25 KB doc — after proxy restart | 0 ms (loads from disk cache) |
| Warmed-up 10-turn conversation, per turn | ~175 ms |
| 1M-token-equivalent conversation, subsequent turns | 0 ms |
For this model (50 M active params, MoE, int8) GPU is not a universal speedup.
Numbers from onnxruntime-gpu 1.25 on Windows + driver 591.86 / CUDA 12.8:
| Input | CPU (24-core, 8 ORT threads) | RTX 5060 Ti | RTX 4060 |
|---|---|---|---|
| Single short string | ~45 ms | ~44 ms | ~230 ms |
| Batch of 16 short strings | ~89 ms | ~80 ms | ~265 ms |
| 25 KB doc (cold) | ~3.8 s | ~1.05 s | ~1.66 s |
Read this as: CPU is the right default for typical Claude Code conversations (short messages, lots of cache hits). CUDA on a strong GPU helps mainly for cold-start on long pastes (3–4× speedup on a 25 KB doc, scaling further for larger docs and 1 M-token first turns). On weaker GPUs (the 4060 here) CUDA can be slower than CPU for short inputs because kernel-launch overhead dominates.
If you regularly paste big docs, run /pii-gpu cuda. Otherwise stay on CPU.
- Detection misses are possible. The model card warns about uncommon names, non-English text, and novel credential formats. For high-stakes redaction, pair with a regex pass for known token formats (`sk-ant-*`, AWS keys, etc.) and review.
- Reverse map loss = unrecoverable. Deleting `pii-proxy-map.json` orphans any tokens still alive in conversation history: they will appear in chat as `name-a3f2b1` instead of being restored. The HMAC keeps minting consistent tokens for new occurrences.
- Streaming token restoration uses a small per-block tail buffer. Tokens that span SSE delta boundaries are still restored correctly; output is delayed by at most one token's worth of streaming (~16 chars).
- Not a compliance tool. Use as part of a privacy-by-design approach, not as standalone anonymization for regulated workloads.

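
The tail-buffer idea behind streaming restoration can be sketched as follows. This is one possible approach, not the code in `redactor.py`: the class name `StreamRestorer` is hypothetical, and holding back one character less than the longest known token is an assumption that matches the "one token's worth" delay described above.

```python
import re
from typing import Dict


class StreamRestorer:
    """Restore tokens in streamed text, even when a token spans two deltas."""

    def __init__(self, token_to_value: Dict[str, str]) -> None:
        self.map = dict(token_to_value)
        keys = sorted(self.map, key=len, reverse=True)
        self.pattern = re.compile("|".join(re.escape(k) for k in keys)) if keys else None
        # An incomplete token is at most (longest token - 1) characters long,
        # so holding back that many characters never emits a partial token.
        self.hold = max((len(k) for k in keys), default=1) - 1
        self.buffer = ""

    def _restore(self, text: str) -> str:
        if self.pattern is None:
            return text
        return self.pattern.sub(lambda m: self.map[m.group(0)], text)

    def feed(self, delta: str) -> str:
        """Accept one streamed delta and return the text that is safe to emit."""
        self.buffer = self._restore(self.buffer + delta)
        cut = len(self.buffer) - self.hold
        if cut <= 0:
            return ""
        ready, self.buffer = self.buffer[:cut], self.buffer[cut:]
        return ready

    def flush(self) -> str:
        """Emit whatever is still held once the content block ends."""
        ready, self.buffer = self._restore(self.buffer), ""
        return ready


restorer = StreamRestorer({"name-a3f2b1": "Ada"})
deltas = ["Hello na", "me-a3", "f2b1, welcome", "!"]
restored = "".join(restorer.feed(d) for d in deltas) + restorer.flush()
```

Here the token arrives split across three deltas, and `restored` still comes out as `Hello Ada, welcome!`. A production version would need one restorer per content block so that held text from one block never leaks into another.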
proxy/
  server.py          HTTP proxy + SSE stream rewriter + admin endpoints
  redactor.py        payload walker, token map, HMAC minting, streaming restorer
  engine.py          onnxruntime + tokenizers, length bucketing, BIOES decoder
  requirements.txt
commands/            plugin slash commands (/pii-stats, /pii-gpu, ...)
hooks/               SessionStart hook + Windows/POSIX starters
test/
  mock_anthropic.py  capture-and-echo upstream for round-trip assertions
  test_proxy.py      18-assertion mock e2e (streaming + non-streaming)
  e2e_real.py        real-API e2e using your Claude Code subscription
  Dockerfile.mock, Dockerfile.test
.claude-plugin/plugin.json
docker-compose.test.yml
install.ps1, install.sh

MIT. See LICENSE.
The openai/privacy-filter model is Apache-2.0; this repo does not
redistribute it — it is downloaded on first use from Hugging Face.