This guide covers how to configure LLMTrace's security policies, rate limiting, cost controls, alerting, and other operational features. Every option shown here corresponds to a real configuration field — copy the snippets directly into your config.yaml.
Quick links: Minimal config · Production config · High-security config · Cost-control config
- Configuration Basics
- Security Analysis Policies
- Rate Limiting
- Cost Caps & Budgets
- Alert Channels
- Circuit Breaker
- Anomaly Detection
- Streaming Security Analysis
- PII Detection & Redaction
- Compliance Reporting
- OWASP LLM Top 10 Coverage
- Putting It All Together
LLMTrace loads configuration from a YAML file. Settings can be overridden via environment variables or CLI flags.
Precedence (highest wins): CLI flags → environment variables → config file → defaults.
# Start with the example config
cp config.example.yaml config.yaml
# Validate before running
./target/release/llmtrace-proxy validate --config config.yaml
# Run the proxy
./target/release/llmtrace-proxy --config config.yamlChoose one of three storage profiles depending on your environment:
| Profile | Backend | Use Case |
|---|---|---|
memory |
In-memory (lost on restart) | Testing, CI/CD |
lite |
SQLite | Development, single-node production |
production |
ClickHouse + PostgreSQL + Redis | Multi-node, high-throughput production |
storage:
profile: "lite" # "memory", "lite", or "production"
database_path: "llmtrace.db" # Used by "lite" profile
# Production profile requires all three:
# clickhouse_url: "http://localhost:8123"
# clickhouse_database: "llmtrace"
# postgres_url: "postgres://llmtrace:llmtrace@localhost:5432/llmtrace"
# redis_url: "redis://127.0.0.1:6379"LLMTrace provides regex-based security analysis by default. When the binary is built with the ml feature and ML is enabled in config, the proxy uses an ensemble analyser that combines regex findings with ML classifiers.
Regex analysis is enabled by a single toggle and requires no additional infrastructure:
# Master toggle — enables the regex-based security engine
enable_security_analysis: true
# How long to wait for security analysis before falling back
# Set lower for latency-sensitive workloads, higher for thoroughness
security_analysis_timeout_ms: 5000The regex engine detects a mix of prompt injection, role injection, jailbreak, and leakage patterns. Examples include:
| Category | Examples | Severity |
|---|---|---|
| Prompt injection | "ignore previous instructions", "forget everything" | High |
| Role injection | system:, assistant: appearing in user messages |
High/Medium |
| Jailbreaks | DAN-style patterns with "no restrictions" | Critical |
| Encoding attacks | Base64-encoded malicious instructions | High |
| Delimiter injection | ---system:, ===instructions: |
High |
| PII patterns | Email, phone, SSN, credit card, IBAN, UK NIN | Medium |
| Data leakage | System prompt leaks, credential exposure | High/Critical |
| Agent actions | Dangerous commands, suspicious URLs, sensitive files | Critical/High |
No configuration is needed beyond enable_security_analysis: true — all patterns are compiled at startup.
For higher accuracy and fewer false positives, enable ML-based detection. When ml_preload: true, the proxy loads HuggingFace models at startup and runs local inference alongside regex analysis. If ml_preload is set to false, the proxy currently stays in regex-only mode (ML models are not loaded lazily).
security_analysis:
# Enable ML prompt injection detection
# Requires: binary built with `cargo build --features ml`
ml_enabled: true
# HuggingFace model for prompt injection classification
# This DeBERTa v3 model is specifically trained for prompt injection
ml_model: "protectai/deberta-v3-base-prompt-injection-v2"
# Confidence threshold (0.0–1.0)
# Lower = more sensitive (more detections, more false positives)
# Higher = more specific (fewer detections, fewer false positives)
# 0.8 is a good starting point; tune based on your false-positive tolerance
ml_threshold: 0.8
# Where to cache downloaded models (avoids re-downloading on restart)
ml_cache_dir: "~/.cache/llmtrace/models"
# Pre-load models at startup (recommended for production)
# If false, first request incurs model download latency
ml_preload: true
# Timeout for model download at startup (seconds)
# Large models (DeBERTa) can be ~500MB — give enough time on slow connections
ml_download_timeout_seconds: 300
# Enable NER-based PII detection for person names, organizations, locations
# Catches PII that regex patterns miss (e.g., "John Smith lives in London")
ner_enabled: true
# HuggingFace model for Named Entity Recognition
ner_model: "dslim/bert-base-NER"
# Optional jailbreak classifier (runs alongside prompt injection)
jailbreak_enabled: true
jailbreak_threshold: 0.7
# Optional feature-level fusion classifier (ADR-013)
fusion_enabled: false
fusion_model_path: nullWhen to use ML detection:
- You need to catch adversarial prompt injections that evade regex patterns
- You handle untrusted user input at scale
- You want NER-based PII detection for names and organisations (regex can only catch structured formats like emails/SSNs)
When regex-only is sufficient:
- Internal tools where inputs are semi-trusted
- You want zero external dependencies and minimal latency overhead
- Your primary concern is PII in structured formats
When ML is enabled, LLMTrace uses the ensemble analyser:
- Regex findings are always included.
- ML findings are added when the model is loaded successfully.
- Findings are merged and deduplicated; overlapping items include metadata such as
ensemble_agreement.
Rate limiting protects your LLM provider from excessive requests and prevents runaway agents from consuming your entire quota.
rate_limiting:
enabled: true
# Sustained requests per second (across all tenants unless overridden)
# 100 RPS is a safe starting point for most OpenAI plans
requests_per_second: 100
# Burst size — how many requests can fire in a quick burst
# Set to 2x your RPS for bursty agent workloads
# Set equal to RPS for smooth, predictable traffic
burst_size: 200
# Sliding window for rate calculation (seconds)
window_seconds: 60Different tenants (teams, applications, environments) often need different limits:
rate_limiting:
enabled: true
requests_per_second: 100 # Default for all tenants
burst_size: 200
window_seconds: 60
# Override specific tenants by UUID
tenant_overrides:
# Production AI assistant — needs higher throughput
"550e8400-e29b-41d4-a716-446655440001":
requests_per_second: 500
burst_size: 1000
# Development/testing tenant — keep it low
"550e8400-e29b-41d4-a716-446655440002":
requests_per_second: 20
burst_size: 40
# Batch processing tenant — moderate steady, low burst
"550e8400-e29b-41d4-a716-446655440003":
requests_per_second: 200
burst_size: 250How it works:
- Requests exceeding the limit receive HTTP
429 Too Many Requests - The proxy includes
Retry-Afterheaders so well-behaved clients back off - Tenant is identified via the
X-LLMTrace-Tenant-IDheader or API key derivation
| Scenario | RPS | Burst | Why |
|---|---|---|---|
| Single chatbot | 10–50 | 2× RPS | Human typing speed limits natural request rate |
| Agent swarm (10 agents) | 100–500 | 2× RPS | Agents fire requests concurrently |
| Batch embedding pipeline | 200–1000 | 1× RPS | Steady throughput, no burst needed |
| Rate-limited LLM plan | Match plan limit | +20% headroom | Prevent provider-side 429s |
Cost caps prevent billing surprises by enforcing per-agent and per-tenant spending limits. When a budget is exceeded, requests are rejected before they reach the LLM provider.
cost_caps:
enabled: true
# Default budgets applied to all tenants/agents
default_budget_caps:
# Hourly cap: catch runaway loops quickly
- window: hourly
hard_limit_usd: 10.0 # Reject requests above this
soft_limit_usd: 8.0 # Alert (but allow) above this
# Daily cap: overall spending control
- window: daily
hard_limit_usd: 100.0
soft_limit_usd: 80.0| Limit Type | Behaviour | Use Case |
|---|---|---|
| Soft limit | Triggers an alert but allows the request | Early warning — "you're spending fast" |
| Hard limit | Rejects the request with HTTP 429 | Hard stop — "no more spending this period" |
Set soft limits at ~80% of hard limits to give teams time to react before hitting the wall.
Prevent individual requests from consuming excessive tokens (useful for catching infinite loops or excessively long prompts):
cost_caps:
enabled: true
default_token_cap:
max_prompt_tokens: 8192 # Reject prompts longer than this
max_completion_tokens: 4096 # Limit response length
max_total_tokens: 16384 # Combined capDifferent agents have different cost profiles. Override budgets per agent using the X-LLMTrace-Agent-ID header:
cost_caps:
enabled: true
default_budget_caps:
- window: daily
hard_limit_usd: 50.0
agents:
# Heavy research agent — needs a bigger budget
- agent_id: "research-agent"
budget_caps:
- window: daily
hard_limit_usd: 500.0
soft_limit_usd: 400.0
- window: hourly
hard_limit_usd: 100.0
token_cap:
max_prompt_tokens: 16384
max_completion_tokens: 8192
# Simple Q&A bot — keep it tight
- agent_id: "faq-bot"
budget_caps:
- window: daily
hard_limit_usd: 10.0
soft_limit_usd: 8.0
token_cap:
max_prompt_tokens: 2048
max_completion_tokens: 1024
max_total_tokens: 4096
# Batch processing — weekly budget, no hourly limit
- agent_id: "batch-embedder"
budget_caps:
- window: weekly
hard_limit_usd: 1000.0
soft_limit_usd: 800.0| Window | Duration | Good For |
|---|---|---|
hourly |
1 hour rolling | Catching runaway loops quickly |
daily |
24 hours rolling | Day-to-day budget control |
weekly |
7 days rolling | Batch workloads with variable daily usage |
monthly |
30 days rolling | Long-term budget planning |
If you use fine-tuned or self-hosted models, add custom pricing so cost estimation is accurate:
cost_estimation:
enabled: true
# Load pricing from an external file (hot-reloaded on SIGHUP)
# pricing_file: "config/pricing.yaml"
# Inline pricing overrides (per 1M tokens, in USD)
custom_models:
my-fine-tuned-gpt4:
input_per_million: 5.0
output_per_million: 10.0
local-llama-70b:
input_per_million: 0.0 # Self-hosted = no API cost
output_per_million: 0.0Alerts notify your team when security findings exceed configured thresholds. LLMTrace supports multiple simultaneous channels with independent severity filters.
The simplest setup — one webhook URL:
alerts:
enabled: true
webhook_url: "https://hooks.slack.com/services/T00/B00/xxx"
min_severity: "High" # Only alert on High and Critical
min_security_score: 70 # Minimum confidence score (0–100)
cooldown_seconds: 300 # 5 minutes between duplicate alertsFor production, send different severity levels to different channels:
alerts:
enabled: true
cooldown_seconds: 300 # Global deduplication window
channels:
# Slack: Medium and above — for the security team's awareness
- type: slack
url: "https://hooks.slack.com/services/T00/B00/xxx"
min_severity: "Medium"
min_security_score: 50
# PagerDuty: Critical only — wake someone up
- type: pagerduty
routing_key: "your-pagerduty-events-v2-routing-key"
min_severity: "Critical"
min_security_score: 90
# Custom webhook: High and above — for your SIEM or log aggregator
- type: webhook
url: "https://your-siem.internal/api/llm-alerts"
min_severity: "High"
min_security_score: 70| Type | Required Fields | Notes |
|---|---|---|
slack |
url (Incoming Webhook URL) |
Posts formatted Slack messages with finding details |
pagerduty |
routing_key (Events API v2) |
Creates PagerDuty events |
webhook |
url (any HTTP endpoint) |
POSTs a JSON payload with full finding data |
email |
(none) | Accepted in config but currently skipped (not implemented) |
LLMTrace compares each finding's severity (Info → Critical) and confidence score (0–100) against the configured minimums. There is no fixed mapping between score and severity; both are emitted by the analysers.
The cooldown_seconds setting prevents alert storms:
alerts:
cooldown_seconds: 300 # 5 minutesThis means: if the same finding type fires multiple times within 5 minutes, only the first alert is sent. This prevents a flood of identical alerts when an attacker probes your system repeatedly.
The config schema includes an alerts.escalation block, but the proxy does not currently implement escalation behaviour. Any escalation settings are ignored at runtime.
The circuit breaker degrades LLMTrace to a pure pass-through proxy when internal subsystems (storage, security analysis) fail repeatedly. This ensures your application keeps working even if LLMTrace's analysis layer has issues.
circuit_breaker:
enabled: true
# Open the circuit after this many consecutive failures
# Lower = faster degradation (more protective of upstream)
# Higher = more tolerant of transient errors
failure_threshold: 10
# How long to wait before trying again (half-open state)
# 30s is a good default; increase for slow-recovering backends
recovery_timeout_ms: 30000
# Number of probe requests in half-open state
# If these succeed, the circuit closes and normal operation resumes
half_open_max_calls: 3CLOSED (normal) → failures exceed threshold → OPEN (pass-through)
│
recovery_timeout_ms
│
HALF-OPEN
(probe calls)
╱ ╲
success failure
│ │
CLOSED OPEN
| Environment | failure_threshold |
recovery_timeout_ms |
Why |
|---|---|---|---|
| Development | 3 | 5000 | Fail fast, recover fast |
| Production (standard) | 10 | 30000 | Tolerate transient blips |
| Production (high-availability) | 5 | 60000 | Open quickly, recover cautiously |
Anomaly detection uses statistical analysis (moving averages and standard deviations) to identify unusual behaviour per tenant — cost spikes, token spikes, velocity surges, and latency outliers.
anomaly_detection:
enabled: true
# Sliding window size (number of recent observations)
# Larger = more stable baseline, slower to react to real changes
# Smaller = faster to react, but more false positives
window_size: 100
# Sigma threshold for anomaly flagging
# 2.0 = ~5% false positive rate (aggressive)
# 3.0 = ~0.3% false positive rate (balanced, recommended)
# 4.0 = ~0.006% false positive rate (conservative)
sigma_threshold: 3.0
# Which dimensions to check
check_cost: true # Flag requests with abnormally high estimated cost
check_tokens: true # Flag requests with abnormally high token count
check_velocity: true # Flag tenants with abnormally high request rate
check_latency: true # Flag requests with abnormally high latency| Type | What It Detects | Typical Cause |
|---|---|---|
cost_spike |
Single request costs much more than the tenant's average | Runaway prompt, model upgrade without budget adjustment |
token_spike |
Unusually high token count for a single request | Prompt injection inflating context, copy-paste of large documents |
velocity_spike |
Tenant is sending requests much faster than usual | Automated loop, DDoS-like behaviour, misconfigured retry logic |
latency_spike |
Response taking much longer than typical | Provider degradation, overly complex prompt, model overload |
Sigma Threshold
2.0 3.0 4.0
┌────────┬────────┬────────┐
Window 50 │ Noisy │ OK │ Quiet │
Window 100 │ OK │ Best │ OK │
Window 200 │ Good │ OK │ Slow │
└────────┴────────┴────────┘
Start with window_size: 100 and sigma_threshold: 3.0. Adjust based on your false-positive tolerance.
For streaming responses (SSE), LLMTrace can run incremental security checks during the stream rather than waiting for completion. This provides early warning when a response starts leaking sensitive data mid-stream.
# Must also have streaming enabled
enable_streaming: true
streaming_analysis:
enabled: true
# Check every N tokens during the stream
# Lower = faster detection, marginally more CPU
# Higher = less overhead, slower detection
# 50 is a good balance for most workloads
token_interval: 50
# Enable output-side checks (PII/secrets/toxicity) on streaming content
output_enabled: true
# If a critical finding is detected mid-stream, inject a warning and stop
early_stop_on_critical: trueHow it works:
- As SSE chunks arrive, the proxy accumulates tokens
- Every
token_intervaltokens, it runs lightweight regex pattern matching on the accumulated content (streaming analysis uses regex only) - If a critical finding is detected mid-stream, an alert fires immediately (doesn't wait for stream completion)
- After the stream completes, full analysis runs on the complete response
Note: Output-side streaming checks require output_safety.enabled: true in addition to streaming_analysis.output_enabled.
When to enable:
- Long-running streaming responses (creative writing, code generation)
- High-security environments where early detection matters
- When you want real-time alerts, not post-hoc analysis
Beyond detecting PII, LLMTrace can optionally redact it inside the security analyser output. The proxy does not currently replace upstream responses or stored traces with redacted text — redaction is available for downstream processing and future pipeline use.
pii:
# "alert_only" — Detect and report PII, but don't modify text (default)
# "alert_and_redact" — Detect, report, AND replace PII with [PII:TYPE] tags
# "redact_silent" — Redact PII silently without generating findings
action: "alert_only"| Type | Pattern | Example | Confidence |
|---|---|---|---|
email |
Standard email format | user@example.com |
0.90 |
phone_number |
US formats | 555-123-4567, (555) 123-4567 |
0.85 |
ssn |
US Social Security Number | 456-78-9012 |
0.95 |
credit_card |
16-digit card numbers | 4111 1111 1111 1111 |
0.90 |
uk_nin |
UK National Insurance Number | AB 12 34 56 C |
0.90 |
iban |
International Bank Account Number | DE89 3704 0044 0532 0130 00 |
0.85 |
eu_passport_de |
German passport | C01X00T2Z |
0.60 |
eu_passport_fr |
French passport | 12AB34567 |
0.65 |
eu_passport_it |
Italian passport | AA1234567 |
0.60 |
eu_passport_es |
Spanish passport | ABC123456 |
0.60 |
eu_passport_nl |
Dutch passport | NX1A2B3C4 |
0.60 |
intl_phone |
International phone numbers | +44 20 7946 0958 |
0.80 |
nhs_number |
UK NHS number | 943 476 5919 |
0.70 |
canadian_sin |
Canadian Social Insurance Number | 046-454-286 |
0.80 |
australian_tfn |
Australian Tax File Number | 123 456 789 |
0.70 |
LLMTrace automatically suppresses PII matches that are likely false positives:
- Matches inside fenced code blocks (
```) - Matches on indented code lines (4+ spaces)
- Matches inside URLs
- Well-known placeholder values (e.g.,
123-45-6789, all-zeros)
With action: "alert_and_redact", the text:
Contact John at john@example.com or call 555-123-4567
Becomes:
Contact John at [PII:EMAIL] or call [PII:PHONE_NUMBER]
LLMTrace generates compliance reports (SOC2, GDPR, HIPAA) from stored audit events and trace data. Reports are persisted in the metadata repository and can be retrieved via the API.
- Audit events are recorded for tenant and API key operations (and report generation)
- Reports aggregate audit events and trace data over a time period into a structured compliance document
- Storage: Reports are stored as JSON in the metadata repository (SQLite or PostgreSQL)
| Type | Standard | What It Covers |
|---|---|---|
soc2 |
SOC 2 Type II | Audit trail of all operations, access control events, security findings |
gdpr |
GDPR | Data processing activity records, PII detection events, data retention compliance |
hipaa |
HIPAA | Audit logs for healthcare data, access tracking, security incident records |
Audit events are generated automatically when trace storage is enabled. For complete compliance coverage, ensure:
# Recommended: traces must be stored to include trace data in reports
enable_trace_storage: true
# Optional: security analysis adds findings that enrich reports
enable_security_analysis: true
# Recommended: production storage for durable audit trail
storage:
profile: "production"
postgres_url: "postgres://llmtrace:llmtrace@localhost:5432/llmtrace"
# Disable auto-migrate in production — run migrations explicitly
auto_migrate: falseLLMTrace includes built-in tests mapped to the OWASP Top 10 for LLM Applications. The current test suite covers the following categories:
| OWASP ID | Category | Tests |
|---|---|---|
| LLM01 | Prompt Injection | ✅ |
| LLM02 | Insecure Output Handling | ✅ |
| LLM06 | Sensitive Information Disclosure | ✅ |
| LLM07 | Insecure Plugin Design | ✅ |
Other OWASP categories are not currently covered by tests in this repo.
# All OWASP tests
cargo test --test owasp_llm_top10
# Specific category
cargo test --test owasp_llm_top10 owasp_llm01 # Prompt Injection
cargo test --test owasp_llm_top10 owasp_llm06 # PII Detection
cargo test --test owasp_llm_top10 owasp_llm07 # Agent Action AnalysisFor detailed coverage breakdown, see docs/security/OWASP_LLM_TOP10.md.
The example configurations in the examples/ directory show complete, ready-to-use setups:
| Config | Use Case | Key Features |
|---|---|---|
config-minimal.yaml |
Getting started | SQLite, regex security, basic rate limits |
config-production.yaml |
Full production | ClickHouse + Postgres + Redis, all features, multi-channel alerts |
config-high-security.yaml |
Maximum security | ML detection, streaming analysis, strict limits, PagerDuty escalation |
config-cost-control.yaml |
Cost management | Tight budgets, per-agent caps, cost anomaly alerts |
Start with minimal: — Get the proxy running with config-minimal.yaml
Add alerting: — Connect Slack/webhook so you see what's happening
Enable cost caps: — Prevent billing surprises
Tune rate limits: — Adjust per-tenant based on observed traffic
Enable anomaly detection: — After you have ~100 requests of baseline data
Consider ML detection: — If you handle untrusted user input
Enable streaming analysis: — For long-running streaming responses
Set up compliance: — When you need SOC2/GDPR/HIPAA audit trails
| Variable | Overrides | Example |
|---|---|---|
LLMTRACE_LISTEN_ADDR |
listen_addr |
0.0.0.0:9090 |
LLMTRACE_UPSTREAM_URL |
upstream_url |
http://localhost:11434 |
LLMTRACE_STORAGE_PROFILE |
storage.profile |
production |
LLMTRACE_STORAGE_DATABASE_PATH |
storage.database_path |
/var/lib/llmtrace/traces.db |
LLMTRACE_CLICKHOUSE_URL |
storage.clickhouse_url |
http://clickhouse:8123 |
LLMTRACE_CLICKHOUSE_DATABASE |
storage.clickhouse_database |
llmtrace |
LLMTRACE_POSTGRES_URL |
storage.postgres_url |
postgres://user:pass@pg:5432/llmtrace |
LLMTRACE_REDIS_URL |
storage.redis_url |
redis://redis:6379 |
LLMTRACE_LOG_LEVEL |
logging.level (via CLI env) |
debug |
LLMTRACE_LOG_FORMAT |
logging.format (via CLI env) |
json |
RUST_LOG |
Fine-grained tracing | llmtrace_proxy=debug,info |