Commit fc71a5a
* feat(e2e): attack-scenario schema, validator, and 3 examples (#92)
Lock the YAML contract every e2e attack scenario follows. Blocks every
other loop in the framework tracked under #91.
New files:
- benchmarks/attacks/schema.json — JSON Schema (Draft 2020-12). Closed
enums for family (10 values), proxy_outcome, recommended_action, and
severity. Required fields: id, source, family, prompt, expected.
judge_verdict block is optional so judge-disabled runs work. Uses
additionalProperties: false at every object level so typos are
rejected loudly.
- benchmarks/attacks/SCHEMA.md — human-readable reference: top-level
fields, every enum with semantics, expectation comparators, skip
block contract, what the schema does NOT validate (out-of-scope to
triage loops).
- benchmarks/attacks/{prompt_injection,over_defense,encoding_evasion}/
*.yaml — 3 hand-written canonical examples that double as schema
documentation. All tagged 'pr-gate' so they exercise the L9 subset.
- scripts/e2e/validate_scenarios.py — walks benchmarks/attacks/**/
*.yaml, validates each against schema.json, detects duplicate ids
across the corpus, prints per-file summary, exits non-zero on any
failure. Functions all under 50 LOC.
- requirements-e2e.txt — pinned jsonschema + PyYAML.
Modified:
- .github/workflows/ci.yml — new e2e-validate-scenarios job. No Rust
toolchain, runs on every push/PR.
Verification:
- python3 scripts/e2e/validate_scenarios.py — 3/3 valid, exit 0
- failure-path sanity (exit 1 + actionable messages):
* bad enum value
* missing required field
* unparseable YAML
* duplicate ids across files
- python3 -c "import yaml; yaml.safe_load(open('.github/workflows/ci.yml'))" — OK
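The corpus-level checks described above (required fields, rejection of unknown keys, duplicate ids across files) can be illustrated with a small stdlib-only sketch. This is not the validator from the commit, which uses jsonschema against schema.json. The function names and the `tags` / `skip` key names are assumptions made for illustration, and only the top level of a scenario is checked.

```python
"""Hand-rolled stand-in for the corpus checks; the real validator uses jsonschema."""
from collections import defaultdict

# Required top-level fields, as listed in the commit message.
REQUIRED_FIELDS = ("id", "source", "family", "prompt", "expected")
# Hypothetical allow-list: "tags" and "skip" are assumed key names, and the
# real schema may permit other optional keys.
ALLOWED_FIELDS = set(REQUIRED_FIELDS) | {"tags", "skip"}


def check_scenario(doc: dict) -> list[str]:
    """Return human-readable errors for one parsed scenario document."""
    errors = [f"missing required field: {name}"
              for name in REQUIRED_FIELDS if name not in doc]
    # Mirrors the effect of additionalProperties: false (top level only).
    errors += [f"unknown field: {name}"
               for name in sorted(doc) if name not in ALLOWED_FIELDS]
    return errors


def find_duplicate_ids(docs_by_path: dict[str, dict]) -> dict[str, list[str]]:
    """Map each id that appears in more than one file to the files using it."""
    paths_by_id = defaultdict(list)
    for path, doc in sorted(docs_by_path.items()):
        if "id" in doc:
            paths_by_id[doc["id"]].append(path)
    return {sid: paths for sid, paths in paths_by_id.items() if len(paths) > 1}
```

A caller would parse every YAML file into a dict first, run `check_scenario` per file, run `find_duplicate_ids` once over the whole corpus, and exit non-zero if either reports anything.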
* feat(e2e): pytest harness skeleton + mock upstream + first scenario test (#93)
Loop E2E-L3 of the e2e adversarial test framework (umbrella #91).
Boots the LLMTrace proxy as a subprocess against an in-process FastAPI
mock upstream, fires every scenario YAML under benchmarks/attacks/ at
it, and asserts the per-scenario expected.proxy_outcome.at_* constraints.
Asserts proxy outcome only — metrics-delta and judge-verdict observability
land in Loops E2E-L4 / L5 / L6.
New files:
- tests/e2e/conftest.py — session-scoped fixtures:
* proxy_config_path: copies the e2e config to a temp file.
* mock_upstream: free-port FastAPI subprocess, /health-gated.
* proxy: free-port llmtrace-proxy subprocess wired to the mock via
LLMTRACE_LISTEN_ADDR / LLMTRACE_UPSTREAM_URL / LLMTRACE_STORAGE_*
env-var overrides; binary discovered via LLMTRACE_PROXY_BIN, then
target/release/, then target/debug/.
* scenarios: walks benchmarks/attacks/, parametrises tests by id;
--family / --tag CLI filters; respects skip blocks.
* Hard guard at collection time that rejects pytest-xdist (-n)
because counter-diff observability (L4) requires serial execution.
* Reliable teardown: SIGTERM-then-wait-10s in finally blocks for
both subprocesses; verified no zombie processes remain.
- tests/e2e/test_cascade.py — first parametrised test. Outcome
classifier maps response to allow/warn/block heuristically (refined
by L6 DSL later). Failure messages include scenario id, expected vs
observed, status, flagged header, and trace_id.
- tests/e2e/mock_upstream.py — FastAPI canned OpenAI-compatible
/v1/chat/completions response. Always returns the same assistant
message; PR-gate tests assert on what LLMTrace decides, not on
upstream content.
- tests/e2e/fixtures/config-e2e.yaml — judge OFF base config.
- tests/e2e/fixtures/config-e2e-judge.yaml — judge cascade ON, slow
tier null (matches the cascade-null-slow PR-gate matrix dimension).
Both configs:
* shutdown.timeout_seconds: 5 (per Loop E2E-L1 finding E2E-006 so
SIGTERM teardown completes inside the harness's 10 s budget).
* auth.enabled: false (no API-key bootstrap needed in tests).
* enforcement.mode: flag, min_severity: Low — surfaces findings as
response headers (x-llmtrace-flagged) so the harness can observe
them. Default mode is `log`, which detects findings but never
surfaces them; the harness needs the response-header signal.
- tests/e2e/README.md — how to run, filter, inspect logs, and where
things live.
- pytest.ini — repo-local config so CI doesn't depend on a parent
pyproject.toml. Declares `serial` and `pr_gate` markers.
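The subprocess handling in conftest.py (free-port selection, binary discovery order, SIGTERM-then-wait teardown) can be sketched as follows. This is a minimal illustration under stated assumptions, not the fixture code from the commit: the function names are hypothetical, the SIGKILL fallback after the timeout is an assumption, and `spawn_sleeper` is only a demo stand-in for a real child process.

```python
import os
import signal
import socket
import subprocess
import sys
from pathlib import Path


def free_port() -> int:
    """Ask the OS for an unused TCP port by binding to port 0."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind(("127.0.0.1", 0))
        return sock.getsockname()[1]


def find_proxy_binary(repo_root: Path) -> Path:
    """Resolve the binary: LLMTRACE_PROXY_BIN, then target/release, then target/debug."""
    override = os.environ.get("LLMTRACE_PROXY_BIN")
    candidates = [Path(override)] if override else []
    candidates += [repo_root / "target" / kind / "llmtrace-proxy"
                   for kind in ("release", "debug")]
    for candidate in candidates:
        if candidate.is_file():
            return candidate
    raise FileNotFoundError(f"llmtrace-proxy not found; tried {candidates}")


def stop(proc: subprocess.Popen, timeout: float = 10.0) -> int:
    """Send SIGTERM, wait up to `timeout` seconds, then kill as a last resort."""
    if proc.poll() is None:
        proc.send_signal(signal.SIGTERM)
        try:
            proc.wait(timeout=timeout)
        except subprocess.TimeoutExpired:
            proc.kill()  # assumption: the real harness may handle this differently
            proc.wait()
    return proc.returncode


def spawn_sleeper() -> subprocess.Popen:
    """Demo stand-in for a long-running child such as the proxy."""
    return subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
```

In a fixture, the call to `stop` would sit in a `finally` block so teardown runs even when a test raises. There is a small window between `free_port` returning and the child binding the port, which is usually acceptable for a serial test session.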
Modified:
- requirements-e2e.txt — pinned pytest, requests, prometheus-client,
fastapi, uvicorn (in addition to L2's jsonschema + PyYAML).
- .gitignore — tests/e2e/.logs/ (per-session subprocess logs) and
.venv-e2e/ (local virtualenv).
Verification:
- pytest tests/e2e/ -v — 3/3 example scenarios pass in ~25s.
* dan-classic-001 → warn (regex flagged 3 findings).
* xstest-violence-question-001 → allow (correctly NOT flagged).
* base64-command-001 → warn (regex flagged 4 findings).
- pgrep llmtrace-proxy / mock_upstream after exit → empty.
- Failure-message quality verified by tightening one xstest
expectation: message includes id, expected vs observed, status,
flagged value, trace_id.
- pytest --collect-only confirms rootdir = repo root and pytest.ini
is picked up (no parent pyproject.toml leakage).
Stacked on top of feat/issue-92-e2e-scenario-schema (PR #115) because
this loop reads benchmarks/attacks/*.yaml + reuses requirements-e2e.txt.
* feat(e2e): metrics-delta + trace-id observer (#94)
Loop E2E-L4 of the e2e adversarial test framework (umbrella #91).
Per-scenario observability of the /metrics surface so harness
assertions can talk in terms of "this scenario produced N findings
of type X" — the foundation Loops E2E-L5 (judge verdicts) and L6
(expectation DSL) build on.
New files:
- tests/e2e/observer.py — MetricsSnapshot + helpers.
* fetch(url) / parse(text): read the Prometheus exposition format
into a flat (sample_name, labels) -> value dict.
* diff(before) / series(name, labels) / __contains__: counter
subtraction, gauge "latest wins", histogram _count/_sum/_bucket
diffs, subset-label matching that sums across unspecified labels.
* `_family_name` strips `_total` alongside `_bucket/_count/_sum/_created`
so queries with or without the _total suffix both work.
* render_nonzero() / render_assertion_context() produce the
deterministic pretty-print that gets attached to every assertion
failure message for triage.
* fetch_after_until_settled(): LLMTrace records security findings
in a background task that outlives the synchronous upstream
response, so a naive MetricsSnapshot.fetch misses them. This
helper polls /metrics until the delta plateaus across two reads
(or a 10 s timeout).
* collect_finding_types(): extracts observed finding_type labels
from the llmtrace_security_findings_total delta for the
findings_include assertion.
- tests/e2e/test_observer_unit.py — 18 unit tests covering parser,
counter/gauge/histogram diffs, label-subset matching, render
determinism, and the diff-self invariant. All tests use recorded
/metrics text fixtures; no live proxy.
- tests/e2e/fixtures/sample_metrics_before.txt and sample_metrics_after.txt
— hand-built fixtures exercising counter+gauge+histogram with
overlapping and disjoint label sets.
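The parse / diff / subset-label matching idea behind MetricsSnapshot can be sketched with the standard library alone. This is a simplified illustration, not the observer.py implementation: it treats every sample as a counter, does not unescape label values, and the free functions `parse`, `diff`, and `series` are hypothetical stand-ins for the methods named above.

```python
import re

# name, optional {labels}, then the value; a trailing timestamp is ignored.
_SAMPLE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
_LABEL = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse(text: str) -> dict:
    """Read exposition text into a flat {(sample_name, labels): value} dict."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip HELP/TYPE comments
            continue
        match = _SAMPLE.match(line)
        if match is None:
            continue
        name, raw_labels, value = match.groups()
        labels = frozenset(_LABEL.findall(raw_labels or ""))
        samples[(name, labels)] = float(value)
    return samples


def diff(before: dict, after: dict) -> dict:
    """Counter subtraction; samples that did not change are dropped."""
    delta = {}
    for key, value in after.items():
        change = value - before.get(key, 0.0)
        if change != 0.0:
            delta[key] = change
    return delta


def series(samples: dict, name: str, **labels: str) -> float:
    """Sum every sample of `name` whose labels include the requested subset."""
    wanted = set(labels.items())
    return sum((value for (sample_name, sample_labels), value in samples.items()
                if sample_name == name and wanted <= sample_labels), 0.0)
```

Because `series` sums across unspecified labels, a query with no labels gives the total for the metric, and adding labels narrows it. Diffing a snapshot against itself yields an empty delta, which is the diff-self invariant the unit tests mention.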
Modified:
- tests/e2e/test_cascade.py — after each scenario fires, diffs
/metrics (polling through fetch_after_until_settled when the
scenario expects findings) and:
* asserts every declared expected.findings_include finding_type
appears in the delta,
* attaches render_assertion_context(delta) to every failure
message so the first reader sees what LLMTrace actually recorded,
not just the expected-vs-observed summary.
* marks test_scenario with @pytest.mark.serial (E2E-034).
- benchmarks/attacks/prompt_injection/dan-classic-001.yaml,
benchmarks/attacks/encoding_evasion/base64-command-001.yaml —
findings_include updated to match the actual finding_type values
the ensemble emits (jailbreak for DAN, encoding_attack for the
base64 payload). Inline comment explains the rationale so future
authors pick stable labels over detector-specific ones.
Verification:
- python3 -m pytest tests/e2e/ -v: 21/21 pass (~20s)
* 3 scenarios integration (dan + xstest + base64)
* 18 observer unit tests
- Failure-message quality verified by injecting findings_include:
[prompt_injection] into the benign xstest scenario — message
includes scenario id, expected types, observed types, trace_id,
and the full non-zero metrics delta.
- No regression on the L3 proxy_outcome assertions.
E2E-034 (pytest-xdist guard) and E2E-035 (trace-id header on every
request) already landed in L3 and are unchanged here.
Stacked on feat/issue-93-e2e-pytest-harness-stacked (#116).
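The settle-polling behaviour described for fetch_after_until_settled() (poll until two consecutive reads agree, or give up at the timeout) might look roughly like this. It is a sketch under assumptions: the name `fetch_until_settled`, the injected `read` callable, and the 0.25 s interval are all hypothetical, and the real helper compares deltas against a "before" snapshot, which is equivalent to comparing raw snapshots when "before" is fixed.

```python
import time
from typing import Callable


def fetch_until_settled(
    read: Callable[[], dict],
    timeout: float = 10.0,
    interval: float = 0.25,  # assumed polling interval
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> dict:
    """Return the first snapshot that equals the one read just before it.

    Falls back to the most recent snapshot once `timeout` seconds elapse.
    """
    deadline = clock() + timeout
    previous = read()
    while clock() < deadline:
        sleep(interval)
        current = read()
        if current == previous:  # plateau across two reads
            return current
        previous = current
    return previous
```

Injecting `read`, `sleep`, and `clock` keeps the helper testable without a live proxy, in the same spirit as the recorded-fixture unit tests above.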
* feat(e2e): judge verdict collector + debug endpoint + degraded-mode (#95)
Loop E2E-L5 of the e2e adversarial test framework (umbrella #91).
Stitches the async judge verdict surface back into per-scenario
assertions so the harness can declare expected.judge_verdict.* in the
scenario YAML and have it verified against the persisted verdict.
## Rust
- New `ServerConfig { debug_endpoints: bool }` (default false) on
`ProxyConfig`. Production proxies must not enable this — debug
routes return verdicts un-auth-gated by trace_id.
- New `crates/llmtrace-proxy/src/debug.rs` with
`verdict_by_trace_id_handler`. Thin wrapper over the existing
`JudgeVerdictStore::query_verdicts(JudgeVerdictQuery { trace_id, .. })`
trait method (no new trait surface needed — Loop E2E-L1 audit
finding E2E-003 already confirmed `JudgeVerdictQuery.trace_id`
exists). Returns 200 + verdict JSON, 404 when absent or flag off,
400 on non-UUID query param.
- `build_router` registers `GET /debug/judge/verdicts` only when
`server.debug_endpoints: true`. When the flag is off the route is
not mounted at all (axum's not-found handler returns 404).
Operator gets a loud `WARN` log on boot when the flag is on.
- 4 new Rust integration tests in `main.rs::tests`:
* 200 + verdict JSON when flag on + verdict exists
* 404 when flag on but no verdict for trace
* 404 when flag off (proves the route is NOT mounted)
* 400 on non-UUID trace_id
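The status-code contract of the debug endpoint, as exercised by the four tests above, can be modelled in a few lines of Python. This is only a model of the documented behaviour, not a translation of the Rust handler: `verdict_response` and the dict-backed `store` are hypothetical, and Python's `uuid.UUID` accepts a few more spellings than a strict parser might.

```python
import uuid
from typing import Optional


def verdict_response(
    debug_endpoints: bool, raw_trace_id: str, store: dict
) -> tuple[int, Optional[dict]]:
    """Return (status, body) following the documented 200/400/404 contract."""
    if not debug_endpoints:
        return 404, None  # route is not mounted at all
    try:
        trace_id = uuid.UUID(raw_trace_id)
    except ValueError:
        return 400, None  # query parameter is not a UUID
    verdict = store.get(trace_id)
    if verdict is None:
        return 404, None  # flag on, but no verdict persisted for this trace
    return 200, verdict
```

Note that a 404 is ambiguous by design: a client cannot tell "flag off" from "no verdict yet", so the harness has to rely on its own config to know the route exists.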
## Python harness
- `tests/e2e/observer.py`:
* `poll_judge_verdict(base_url, trace_id, timeout=10s)` — 250 ms
polling against `/debug/judge/verdicts`. Returns the verdict dict
on 200, None on timeout, raises HTTPError on other 4xx/5xx.
* `shadow_would_block_count(delta, category=…, recommended_action=…)`
— sums `llmtrace_judge_shadow_would_block_total` deltas; returns
0.0 when the metric is absent so callers can treat absence and
zero symmetrically.
* `judge_backend_errored(delta)` — True iff
`llmtrace_judge_requests_total{status="backend_error"}` ticked
in the window.
- `tests/e2e/test_cascade.py` — wires `expected.judge_verdict.*` into
the per-scenario assertions:
* `is_threat`, `category`, `recommended_action.at_least/at_most`.
* On verdict-not-found + judge_backend_errored=True: pytest.skip
with explanation (degraded mode — provider/upstream flake, not
LLMTrace regression).
* On verdict-not-found + no backend_error: pytest.fail with the
full metrics-delta context attached.
- `tests/e2e/fixtures/config-e2e-judge.yaml` — flips
`server.debug_endpoints: true`, `action_router.enabled: true`, and
adds `judge_route` to `default_actions` so the cascade fast tier
actually runs and verdicts get persisted. Without these the worker
spawns but never receives requests.
- `tests/e2e/conftest.py` — default config switched from
`config-e2e.yaml` (judge OFF) to `config-e2e-judge.yaml` so the
cascade-null-slow PR-gate matrix dimension is exercised on every
run.
- 7 new observer unit tests covering shadow_would_block_count
(filtered + unfiltered), judge_backend_errored (positive/negative/
absent), and the empty-snapshot edge case. Total observer suite
now 25 tests; integration suite 3.
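The polling and degraded-mode rules above can be sketched without any network dependency by injecting the HTTP call. Everything here is illustrative: the `get` callable returning `(status, body)`, the `trace_id` query parameter name, the `VerdictHTTPError` class, and `missing_verdict_outcome` are assumptions, not the signatures used in observer.py or test_cascade.py.

```python
import time


class VerdictHTTPError(RuntimeError):
    """Hypothetical stand-in for the HTTPError raised on unexpected statuses."""


def poll_judge_verdict(get, base_url, trace_id, timeout=10.0, interval=0.25,
                       sleep=time.sleep, clock=time.monotonic):
    """Poll the debug endpoint: dict on 200, None on timeout, raise otherwise."""
    # Assumption: the trace id travels as a `trace_id` query parameter.
    url = f"{base_url}/debug/judge/verdicts?trace_id={trace_id}"
    deadline = clock() + timeout
    while True:
        status, body = get(url)
        if status == 200:
            return body
        if status != 404:  # 404 means "not persisted yet", so keep polling
            raise VerdictHTTPError(f"unexpected status {status} from {url}")
        if clock() >= deadline:
            return None
        sleep(interval)


def missing_verdict_outcome(backend_errored: bool) -> str:
    """Degraded mode: skip on a backend error, fail when the verdict is just absent."""
    return "skip" if backend_errored else "fail"
```

The split between "skip" and "fail" keeps provider or upstream flakiness from showing up as an LLMTrace regression, while a silently missing verdict still fails loudly.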
## Docs
- `docs/guides/e2e-testing.md` — full operator guide: quick start,
architecture diagram, debug endpoint contract with the production-
safety call-out, judge collector lifecycle, degraded-mode handling,
shadow-mode signal, failure-message anatomy, file map.
## Verification
- cargo test -p llmtrace --bin llmtrace-proxy: 11/11 ok
- cargo clippy --workspace -- -D warnings: clean
- cargo fmt --check: clean
- pytest tests/e2e/ -v: 28/28 ok in ~45s
L1 audit corrections honoured:
- E2E-040, E2E-041 (proposed `find_by_trace_id` trait method)
remain SUPERSEDED — `query_verdicts` is the entry point used here.
Stacked on feat/issue-94-e2e-metrics-observer (#118). The whole
stack (#115 → #116 → #118 → this) was rebased onto main after
#114 (L1a) and #117 (rustls bump) merged earlier today.
* feat(e2e): expectation DSL + assertion helpers (#96)
Loop E2E-L6 of the e2e adversarial test framework (umbrella #91).
Formalises the per-comparator if-blocks that L3/L4/L5 accumulated in
test_cascade.py into a small, pure DSL. The orchestrator returns one
AssertionResult per declared comparator; the test wrapper aggregates
into a pytest verdict. Makes the harness extensible for L7 (50-scenario
seed corpus) and L8 (upstream-fell-for-it detector) without growing
test_cascade.py further.
New files:
- tests/e2e/expect.py — the DSL.
* `Severity`, `ProxyOutcome`, `RecommendedAction` IntEnums (E2E-053).
Total ordering with parse() / .label round-trip and explicit
rejection of unknown labels.
* `AssertionResult { comparator, passed, soft, message, fields }`
dataclass (E2E-050).
* One `_compare_*` helper per supported `expected.*` key (E2E-052):
- proxy_outcome.at_least / at_most
- findings_include
- findings_min_severity (NEW — schema declared it but no loop
had wired it up; uses the peak severity across the per-scenario
delta of llmtrace_security_findings_total)
- judge_verdict.is_threat
- judge_verdict.category
- judge_verdict.recommended_action.at_least / at_most
* `assert_scenario(scenario, response, delta, verdict, judge_degraded)`
orchestrator (E2E-051). Pure: no I/O, all inputs handed in.
* `classify_proxy_outcome(response)` returning ProxyOutcome (the
enum form of L3's allow/warn/block heuristic).
* `render_assertion_summary(results)` for failure-message rendering
with `[ok]`/`[soft]`/`[FAIL]` per-row markers.
* Unknown top-level OR judge-block keys produce explicit failure
rows so typos in scenario YAML cannot silently skip an assertion.
- tests/e2e/test_expect_unit.py — 26 unit tests (E2E-054), every
comparator with at least one passing + one failing case so failure-
message wording is exercised. Synthesises responses, deltas, and
verdicts in-process; no proxy boot, no network.
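A compact sketch of the ordered-enum idea and one comparator may help show how the pieces fit. It assumes the ordering allow < warn < block and lowercase labels; the field names of `AssertionResult` follow the commit message, but `compare_proxy_outcome` and the message wording are hypothetical rather than the code in expect.py.

```python
from dataclasses import dataclass, field
from enum import IntEnum


class ProxyOutcome(IntEnum):
    """Totally ordered outcome; assumed ordering is allow < warn < block."""
    ALLOW = 0
    WARN = 1
    BLOCK = 2

    @classmethod
    def parse(cls, label: str) -> "ProxyOutcome":
        try:
            return cls[label.upper()]
        except KeyError:
            raise ValueError(f"unknown proxy_outcome label: {label!r}") from None

    @property
    def label(self) -> str:
        return self.name.lower()


@dataclass
class AssertionResult:
    comparator: str
    passed: bool
    soft: bool
    message: str
    fields: dict = field(default_factory=dict)


def compare_proxy_outcome(expected: dict, observed: ProxyOutcome) -> list[AssertionResult]:
    """Emit one result per declared bound (at_least and/or at_most)."""
    checks = (("at_least", lambda seen, limit: seen >= limit),
              ("at_most", lambda seen, limit: seen <= limit))
    results = []
    for bound, holds in checks:
        if bound not in expected:
            continue
        limit = ProxyOutcome.parse(expected[bound])
        results.append(AssertionResult(
            comparator=f"proxy_outcome.{bound}",
            passed=holds(observed, limit),
            soft=False,
            message=f"expected {bound} {limit.label}, observed {observed.label}",
            fields={"expected": limit.label, "observed": observed.label},
        ))
    return results
```

Using an IntEnum means the bounds reduce to ordinary integer comparisons, and `parse` rejecting unknown labels keeps a typo such as `wran` from being silently treated as a valid bound.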
Modified:
- tests/e2e/test_cascade.py — collapsed from 196 lines of
per-comparator if-blocks to 117 lines that wire I/O into
assert_scenario(). Hard failures → pytest.fail with the assertion
summary + metrics-delta context. All-soft failures → pytest.skip
(degraded judge tier; not LLMTrace's fault).
- docs/guides/e2e-testing.md — comparator reference table,
per-comparator semantics, soft-vs-hard aggregation rules, and the
"adding a new comparator" three-step recipe.
Verification:
- pytest tests/e2e/: 54/54 ok in ~40s
* 3 integration scenarios (dan, xstest, base64)
* 25 observer unit tests (unchanged)
* 26 expect.py unit tests (new)
- Failure-message quality verified by deliberately tightening the
benign xstest scenario with bogus findings_include + an aggressive
findings_min_severity. The resulting pytest.fail message lists
every comparator with a marker, the missing finding types, the
observed peak severity, and the full non-zero metrics delta.
E2E-053 Severity IntEnum is Info < Low < Medium < High < Critical
matching the proxy's `SecuritySeverity`.
Stacked on feat/issue-95-e2e-judge-verdict-collector (#119).
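The soft-versus-hard aggregation rule from this loop (any hard failure fails the test, failures that are all soft skip it) is easy to state as code. The sketch below is an assumption-laden illustration: `Row` is a trimmed stand-in for AssertionResult, and it assumes the `[soft]` marker denotes a failed soft assertion.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Row:
    """Trimmed stand-in for AssertionResult: only what aggregation needs."""
    comparator: str
    passed: bool
    soft: bool = False


def marker(row: Row) -> str:
    if row.passed:
        return "[ok]"
    return "[soft]" if row.soft else "[FAIL]"


def aggregate(rows: list[Row]) -> str:
    """Collapse per-comparator rows into one verdict: pass, skip, or fail."""
    failures = [row for row in rows if not row.passed]
    if not failures:
        return "pass"
    if all(row.soft for row in failures):
        return "skip"  # degraded judge tier only
    return "fail"


def render_summary(rows: list[Row]) -> str:
    """One line per comparator, prefixed with its marker."""
    return "\n".join(f"{marker(row)} {row.comparator}" for row in rows)
```

A test wrapper would map "fail" to `pytest.fail(render_summary(rows))` and "skip" to `pytest.skip(...)`, keeping the orchestrator itself free of pytest calls.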
* feat(security): add rot13/leetspeak encoding-attack detection; triage 8 corpus gaps
RegexSecurityAnalyzer.detect_injection_patterns now also fires
detect_rot13_injection and detect_leetspeak_injection, both emitting
finding_type="encoding_attack". The jailbreak detector already handled
these via "jailbreak" findings; the regex analyzer was only doing base64.
Three new unit tests validate the new detectors in isolation. All 1025
existing security tests pass unchanged.
Corpus (50 scenarios):
- enc-003 (rot13) and enc-004 (leetspeak) now pass end-to-end.
- 8 harmbench/jailbreakbench scenarios relaxed to proxy_outcome.at_most:
warn with known-gap annotation. These are harmful-content requests, not
injection attacks; the proxy is an injection detector, not a content
moderator.
Also: pytest.ini gains pythonpath=. so CI does not need PYTHONPATH export.
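The general technique behind the two new detectors (normalise the text, then re-run an injection pattern on the result) can be sketched in Python. The real detectors are Rust and their patterns are not shown in this commit, so the phrase regex, the leetspeak table, and the `findings` helper below are hypothetical. Only the function names and the `encoding_attack` finding type come from the commit message.

```python
import codecs
import re

# Hypothetical pattern; the real detectors' phrase list is not shown here.
INJECTION_PHRASES = re.compile(
    r"ignore (all )?(previous|prior) instructions", re.IGNORECASE)

# Hypothetical leetspeak substitution table.
LEET = str.maketrans(
    {"0": "o", "1": "i", "3": "e", "4": "a", "5": "s", "7": "t", "@": "a", "$": "s"})


def detect_rot13_injection(text: str) -> bool:
    """Fire only when the rot13-decoded text matches and the raw text does not."""
    if INJECTION_PHRASES.search(text):
        return False  # plain-text injection belongs to other detectors
    return bool(INJECTION_PHRASES.search(codecs.encode(text, "rot_13")))


def detect_leetspeak_injection(text: str) -> bool:
    """Fire only when undoing leetspeak substitutions reveals a match."""
    normalised = text.translate(LEET)
    return normalised != text and bool(INJECTION_PHRASES.search(normalised))


def findings(text: str) -> list[dict]:
    """Both detectors report the same finding_type, as in the commit message."""
    if detect_rot13_injection(text) or detect_leetspeak_injection(text):
        return [{"finding_type": "encoding_attack"}]
    return []
```

Requiring the raw text not to match keeps these detectors from double-reporting prompts that are already caught in plain form.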
1 parent 82539bb commit fc71a5a
77 files changed
Lines changed: 5209 additions & 2 deletions
File tree
- .github/workflows
- benchmarks/attacks
  - data_exfiltration
  - encoding_evasion
  - indirect_injection
  - jailbreak
  - over_defense
  - prompt_extraction
  - prompt_injection
  - role_injection
- crates
  - llmtrace-core/src
  - llmtrace-proxy/src
  - llmtrace-security/src
- docs/guides
- scripts/e2e
- tests/e2e
  - fixtures