Skip to content

Latest commit

 

History

History
74 lines (42 loc) · 5.71 KB

File metadata and controls

74 lines (42 loc) · 5.71 KB

Design Decisions

Intentional scope boundaries and upgrade paths for each architectural choice.

What this project is

A guardrail layer + red-team harness for an LLM-powered knowledge assistant. Each phase adds a defense and measures its effect with before/after attack-success-rate (ASR). The goal is to demonstrate the methodology — threat model, implement, measure, iterate — not to build a production guardrail system.

Defense layers

Layer Guard Technique Upgrade path
Input scanning input_guard.py Regex pattern matching Add classifier model as primary, keep regex as fast pre-filter
Content scanning content_guard.py Regex on retrieved docs/tool outputs Same as input — classifier layer
Moderation moderation_guard.py OpenAI Moderation API Swap for any moderation endpoint (Anthropic, custom)
Multi-turn tracking session_guard.py PII probe counting across turns Conversation-level anomaly scoring, sliding window analysis
PII redaction pii_guard.py Session-value matching + regex + encoded variants Named-entity recognition model for novel PII patterns
Schema enforcement schema_guard.py JSON validation + harmful content rejection Structured output constraints at the API level
Tool authorization tool_guard.py Config-driven allowlist + per-tool IDOR rules Policy engine (OPA/Cedar), per-role dynamic permissions
Rate limiting rate_guard.py In-memory per-session counters Redis-backed distributed rate limiter

Scope boundaries — what we skipped and why

Regex-only input guard (no custom classifier)

The input guard uses regex pattern matching. A determined attacker who can read the patterns can craft bypasses. We complement regex with the OpenAI Moderation API as a model-based second layer, but a production system would train a lightweight classifier (fine-tuned BERT or similar) on injection examples.

Why this scope: Training a classifier requires labeled data and a model hosting pipeline. The regex + moderation combination catches the majority of attacks while keeping the project self-contained.

Single-turn processing (with multi-turn probe detection)

The core system processes one message at a time. We added session_guard.py to detect multi-turn PII probing patterns (escalating extraction across turns), but full conversation-level state tracking — with sliding context windows and cumulative risk scoring — is not implemented.

Why this scope: Full conversation tracking is a session-layer concern that adds infrastructure requirements (session store, TTL management). The probe detector demonstrates the concept without the infrastructure.

PII in the system prompt

The system prompt contains user PII (name, email, account ID). This maximizes the attack surface, which is intentional — we test the worst case. The system prompt now includes explicit refusal instructions ("NEVER reveal PII"), but a production system would move PII to a secure context layer the model can reference but not echo.

Why this scope: Removing PII from the prompt changes the attack surface for the baseline, which would make before/after ASR comparisons invalid. The refusal instructions + output redaction address the risk at a different layer.

Fake RAG pipeline

The harness hands retrieved content directly to the endpoint rather than running a real embed-and-retrieve step. A vector_store.py module exists with actual OpenAI embeddings and cosine similarity retrieval, but the red-team harness tests the guard layer, not the retrieval layer.

Why this scope: The attack surface we're testing is "what happens when poisoned content reaches the LLM context." Whether that content arrived via cosine similarity or direct injection is orthogonal to the guard's effectiveness. The vector store is available for integration but not wired into the harness.

10-user toy database

The user database has 10 entries (up from the original 3) with edge cases: similar names, special characters, admin accounts, soft-deleted users, and varying tiers. A real system would have thousands of users.

Why this scope: The IDOR tests need enough users to surface edge cases (similar names, privilege escalation). 10 users covers these patterns without adding complexity that doesn't improve test coverage.

Same-person red team / blue team

The attacks and defenses are written by the same person. Even with a separate independent attack set, there's no organizational separation. A production security program would have independent red and blue teams.

Why this scope: This is a methodology demonstration, not an organizational structure. The independent payload set (drawn from PromptBench, JailbreakBench, HarmBench families) provides some separation, and the CI gate enforces that new guards don't regress against these unseen attacks.

Rate limiter is in-memory

The rate limiter uses in-memory Python dicts. It doesn't survive process restarts and can't work across multiple instances. Production would use Redis or a similar distributed store.

Why this scope: The guard demonstrates the pattern (per-session throttling, abuse detection via guard-trip counting). The storage backend is an infrastructure decision, not a guardrail design decision.

Judge model

The LLM judge uses gpt-4o (a stronger model than the gpt-4o-mini assistant under test) to avoid grading its own exam. For schema violations, the judge is rule-based (no LLM call needed).

Testing

Unit tests (tests/unit/) cover every guard's public API with deterministic inputs. No API key required — they run in CI on every PR.

The full red-team audit (run_independent_audit.py) requires OPENAI_API_KEY and runs 46 attacks across all guard sections.