HARNESS_ASSESSMENT.txt
╭─── Claude Code v2.1.92 ──────────────────────────────────────────────────────╮
│ │ Tips for getting │
│ Welcome back! │ started │
│ │ Run /init to create a … │
│ ▐▛███▜▌ │ ─────────────────────── │
│ ▝▜█████▛▘ │ Recent activity │
│ ▘▘ ▝▝ │ No recent activity │
│ │ │
│ stepfun/step-3.5-flash:free · API Usage Billing │ │
│ ~/BrokeLLM │ │
╰──────────────────────────────────────────────────────────────────────────────╯
❯ comprehensive code gap & security analysis
● I'll conduct a comprehensive code gap and security analysis. Let me start by
mapping the project structure and then run parallel deep-dive analyses.
Searched for 1 pattern, read 11 files (ctrl+o to expand)
● Launching comprehensive analysis across security, code gaps, architecture, and
test coverage in parallel.
● Agent(Security vulnerability analysis)
⎿ Backgrounded agent (↓ to manage · ctrl+o to expand)
Read 2 files (ctrl+o to expand)
● The security analysis is running in background. Let me work on code gaps,
architecture consistency, and test coverage in parallel.
● Bash(python3 -c "
import json…)
⎿ === cmd_ functions in _mapping.py ===
cmd_config
cmd_doctor
… +34 lines (ctrl+o to expand)
Searched for 1 pattern (ctrl+o to expand)
● Bash(python3 -m unittest -v /home/bamn/BrokeLLM/tests/test_mapping.py 2>&1 |
tail -30)
⎿ test_harness_evaluate_reuses_cached_role_and_final_verdict (tests.test
_mapping.MappingTestCase) ... ok
test_harness_run_register_creates_active_run_and_ledger (tests.test_ma
… +33 lines (ctrl+o to expand)
● Bash(python3 -c "
import pathlib…)
⎿ bin/broke: 1359 lines
bin/_mapping.py: 3114 lines
bin/_proxy.py: 735 lines
… +17 lines (ctrl+o to expand)
● Security analysis is still running in the background. Let me compile what I've
found so far across the other dimensions, then fold in the security report
when it's ready.
● Bash(python3 -c "
import pathlib…)
⎿ === CODE GAP ANALYSIS ===
cmd functions and whether they have a broke bash handler:
… +23 lines (ctrl+o to expand)
● Agent(Check background security analysis)
⎿ Done (15 tool uses · 106.0k tokens · 3m 27s)
(ctrl+o to expand)
● Here's the comprehensive code gap & security analysis:
---
BrokeLLM — Comprehensive Code Gap & Security Analysis
Date: 2026-04-04
Scope: bin/{broke, _mapping.py, _proxy.py, _harness_shim.py,
_socket_bridge.py}, install.sh, tests/
Total codebase: 5,653 lines (34 cmd functions, 80 internal functions, 33
tests)
---
1. SECURITY VULNERABILITIES
CRITICAL (1)
┌─────┬─────────────────────────┬──────────────────────────────────────────┐
│ #   │ File                    │ Issue                                    │
├─────┼─────────────────────────┼──────────────────────────────────────────┤
│     │                         │ Secret values written to                 │
│ 1   │ bin/_mapping.py:433-456 │ .launch_audit.log - secret_surface list  │
│     │                         │ captures actual key names that match     │
│     │                         │ `TOKEN                                   │
└─────┴─────────────────────────┴──────────────────────────────────────────┘
HIGH (6)
#: 2
File: bin/broke:293-314
Issue: PATH hijacking via shim directory: append_harness_runtime_env prepends
$HARNESS_RUN_DIR/shims to $PATH. If a worker can write to that directory,
it can plant malicious binaries named after common tools (git, curl, make).
Fix: lock the directory to 0700 and validate that every shim is a symlink to
_harness_shim.py.
────────────────────────────────────────
#: 3
File: bin/broke:517-542
Issue: Symlink race in ensure_secret_file() — writes tokens without `O_CREAT
────────────────────────────────────────
#: 4
File: bin/_proxy.py:434-438
Issue: Static token auth — CLIENT_TOKEN loaded once at startup, plain-text
comparison, no expiration or rate limiting on auth attempts. Fix: add auth
attempt throttling, periodic rotation.
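
A minimal sketch of the throttling part of this fix, paired with a constant-time comparison. The names check_token, MAX_FAILURES and WINDOW_SECONDS are hypothetical and not taken from _proxy.py, and the limits are placeholders rather than recommendations.

```python
import hmac
import time
from collections import defaultdict, deque

MAX_FAILURES = 5        # hypothetical: failures tolerated per window
WINDOW_SECONDS = 60.0   # hypothetical: length of the sliding window

_failures = defaultdict(deque)  # peer id -> timestamps of recent failures


def check_token(peer, presented, expected, now=None):
    """Return True if the token matches and the peer is not throttled."""
    now = time.monotonic() if now is None else now
    recent = _failures[peer]
    # Forget failures that have aged out of the window.
    while recent and now - recent[0] > WINDOW_SECONDS:
        recent.popleft()
    if len(recent) >= MAX_FAILURES:
        return False  # throttled, even if the token happens to be correct
    # compare_digest takes time independent of where the inputs differ.
    ok = hmac.compare_digest(presented.encode(), expected.encode())
    if not ok:
        recent.append(now)
    return ok
```

Token rotation is not covered here; it would need the proxy to reload CLIENT_TOKEN at runtime instead of reading it once at startup.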
────────────────────────────────────────
#: 5
File: bin/_mapping.py:200-214
Issue: Symlink race on .env.claude — set_claude_env_export writes without
validating path isn't a symlink first. Fix: lstat() check before write.
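
One way to close this race without a separate lstat() step is to let the kernel refuse symlinks at open time. The sketch below uses O_NOFOLLOW, which is a substitution for the lstat() check named above, not the project's current code; write_private_file is a hypothetical helper name.

```python
import os


def write_private_file(path, data):
    """Write data to path with mode 0600, refusing to follow a symlink."""
    flags = os.O_WRONLY | os.O_CREAT | os.O_TRUNC | os.O_NOFOLLOW
    # With O_NOFOLLOW, open() fails with OSError if the final path
    # component is a symlink, so there is no check-then-write gap.
    fd = os.open(path, flags, 0o600)
    with os.fdopen(fd, "w") as fh:
        # Enforce the mode even when the file already existed.
        os.fchmod(fh.fileno(), 0o600)
        fh.write(data)
```

O_NOFOLLOW only guards the last path component; a symlinked parent directory would still be followed.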
────────────────────────────────────────
#: 6
File: bin/_mapping.py:330-338
Issue: Permission drift not enforced — _bad_runtime_permissions() reports but
never fixes. Secret files could end up world-readable via manual edits. Fix:
preflight should be hard-fail on permission drift.
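
A hard-fail check could look roughly like this. assert_private is a hypothetical helper, and the list of secret paths would have to come from whatever _bad_runtime_permissions() already inspects.

```python
import os
import stat


def assert_private(paths):
    """Raise PermissionError if any existing path is group- or world-accessible."""
    drifted = []
    for path in paths:
        try:
            # lstat() does not follow symlinks, so a symlinked secret file
            # is reported by its own (usually 0777) mode and fails the check.
            mode = stat.S_IMODE(os.lstat(path).st_mode)
        except FileNotFoundError:
            continue  # missing files are a separate preflight concern
        if mode & 0o077:
            drifted.append(f"{path} has mode {mode:04o}")
    if drifted:
        raise PermissionError("; ".join(drifted))
```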
────────────────────────────────────────
#: 7
File: bin/broke:550-603
Issue: Sandbox escape via bwrap bind mounts — --ro-bind-try $HOME/.local,
$HOME/.config, $HOME/.cache are overly broad. A malicious process could
access user configs, browser data, SSH keys in those trees. Fix: scope to
only required paths.
MEDIUM (6)
#: 8
File: bin/broke:28-29
Issue: Pass-through env leakage — CLIENT_PASS_ENV_KEYS includes SSH_AUTH_SOCK,
DBUS_SESSION_BUS_ADDRESS, WAYLAND_DISPLAY, DISPLAY — these are sensitive
session handles that don't need to reach the LLM client.
────────────────────────────────────────
#: 9
File: bin/_proxy.py:447
Issue: Unbounded Content-Length — reads self.rfile.read(int(...)) with no size
limit. Fix: cap at a reasonable max (e.g., 1 MiB).
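
A small sketch of the cap, using the 1 MiB figure from the finding. bounded_content_length and MAX_BODY_BYTES are hypothetical names.

```python
MAX_BODY_BYTES = 1 << 20  # 1 MiB, the example cap from the finding


def bounded_content_length(header_value, limit=MAX_BODY_BYTES):
    """Parse a Content-Length header; raise ValueError if invalid or too large."""
    if header_value is None:
        return 0
    value = header_value.strip()
    # Accept plain ASCII digits only (rejects "-1", "1_000", "abc", "").
    if not (value.isascii() and value.isdigit()):
        raise ValueError(f"invalid Content-Length: {header_value!r}")
    length = int(value)
    if length > limit:
        raise ValueError(f"Content-Length {length} exceeds limit {limit}")
    return length
```

In the handler this would sit in front of self.rfile.read(length), with a 400 or 413 response on ValueError instead of reading the body.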
────────────────────────────────────────
#: 10
File: bin/broke:116-132
Issue: Broad regex for env var matching — append_matching_file_env uses
^[A-Z_][A-Z0-9_]*$ which matches any uppercase variable, not just provider
keys.
────────────────────────────────────────
#: 11
File: bin/broke:517-542
Issue: ensure_secret_file() reads existing file — if file exists with prior
token, it cats it back out to stdout — could leak to command substitution.
────────────────────────────────────────
#: 12
File: bin/_proxy.py
Issue: No KeyboardInterrupt handler — graceful shutdown missing. Unix socket
left behind on SIGINT.
────────────────────────────────────────
#: 13
File: install.sh:112
Issue: cp .env.template .env creates files with placeholder secrets; the
installer doesn't verify or clear placeholder values.
LOW (2)
┌─────┬─────────────────────────┬──────────────────────────────────────────┐
│ #   │ File                    │ Issue                                    │
├─────┼─────────────────────────┼──────────────────────────────────────────┤
│     │                         │ Git status artifact leakage -            │
│ 14  │ bin/_mapping.py:400-412 │ collect_harness_checkpoint_artifact      │
│     │                         │ exposes branch names, file paths in git  │
│     │                         │ status output.                           │
├─────┼─────────────────────────┼──────────────────────────────────────────┤
│ 15  │ _mapping.py             │ TOCTOU race - path.exists() then open()  │
│     │                         │ pattern used in multiple places.         │
└─────┴─────────────────────────┴──────────────────────────────────────────┘
---
2. CODE GAPS
Untested Functions (13 of 34 cmd functions lack dedicated tests)
┌──────────────────────────────┬────────┬─────────────────────────────────┐
│ Function                     │ Risk   │ Notes                           │
├──────────────────────────────┼────────┼─────────────────────────────────┤
│ cmd_init                     │ Low    │ Indirectly tested via setUp     │
│                              │        │ calling cmd_init()              │
├──────────────────────────────┼────────┼─────────────────────────────────┤
│ cmd_list                     │ Medium │ Display logic, no state         │
│                              │        │ mutation                        │
├──────────────────────────────┼────────┼─────────────────────────────────┤
│ cmd_metrics                  │ Medium │ Requires running gateway - hard │
│                              │        │ to unit test                    │
├──────────────────────────────┼────────┼─────────────────────────────────┤
│ cmd_probe                    │ Medium │ Sends live HTTP requests -      │
│                              │        │ needs mock                      │
├──────────────────────────────┼────────┼─────────────────────────────────┤
│ cmd_validate                 │ Medium │ Configuration validation not    │
│                              │        │ tested                          │
├──────────────────────────────┼────────┼─────────────────────────────────┤
│ cmd_team_save /              │ Medium │ Team persistence untested       │
│ cmd_team_load                │        │                                 │
├──────────────────────────────┼────────┼─────────────────────────────────┤
│ cmd_team_delete /            │ Low    │ CRUD operations untested        │
│ cmd_team_list                │        │                                 │
├──────────────────────────────┼────────┼─────────────────────────────────┤
│ cmd_profile_new /            │ Medium │ Profile creation/loading        │
│ cmd_profile_load             │        │ untested                        │
├──────────────────────────────┼────────┼─────────────────────────────────┤
│ cmd_profile_delete /         │ Low    │ Profile CRUD untested           │
│ cmd_profile_list             │        │                                 │
└──────────────────────────────┴────────┴─────────────────────────────────┘
Missing or only partial unit tests for:
- bin/_proxy.py (only 3 proxy tests via shim, not direct)
- bin/_harness_shim.py (zero tests)
- bin/_socket_bridge.py (only constant checks, no network tests)
- install.sh (only grep assertions, no execution tests)
- Harness evaluate caching flow — partially tested but no replay validation
- Import with --overwrite flag
- Snapshot restore with index vs timestamp
- Profile inheritance over team access policies
---
3. ARCHITECTURE CONSISTENCY
Area: Token generation
Finding: ensure_secret_file() in bash and the equivalent flow in Python use
different seed prefixes (broke-client- vs auto-generated). Not necessarily a
bug, but inconsistent.
────────────────────────────────────────
Area: Event hashing
Finding: _mapping.py and _harness_shim.py each implement their own sha256_text
+ canonical_json + append_event. Duplicated logic — single source of truth
needed.
────────────────────────────────────────
Area: File locking
Finding: Both _proxy.py and _mapping.py implement locked_file()/_locked_file()
context managers independently. Same pattern, different names.
────────────────────────────────────────
Area: State normalization
Finding: _proxy.py:143-164 and _mapping.py both normalize key/model cooldown
state. The proxy does it on every request; _mapping.py does it on load.
Potential for stale state between proxy and mapping state files.
────────────────────────────────────────
Area: Config generation
Finding: cmd_config() writes config.json for LiteLLM but also writes
.env.claude. These are coupled operations with no rollback if one fails.
---
4. TEST COVERAGE SUMMARY
- 33 tests, all passing
- 34 cmd functions, 13 without dedicated tests (cmd_init is exercised only
indirectly via setUp)
- ~80 internal functions, many untested (evidence packet builders, prefix
resolution, artifact stores, harness verdict logic)
- 0 integration tests (no end-to-end gateway → proxy → backend flow test)
- 0 performance tests (request latency through proxy, concurrent client
limits)
- 0 security regression tests (no tests preventing token leaks, path
traversal, etc.)
---
5. RECOMMENDED PRIORITY ACTIONS
┌──────────┬───────────────────────────────────────────────────────────────┐
│ Priority │ Action                                                        │
├──────────┼───────────────────────────────────────────────────────────────┤
│ P0       │ Remove secret key names from audit log entries in             │
│          │ _mapping.py:433-456                                           │
├──────────┼───────────────────────────────────────────────────────────────┤
│ P1       │ Add file permission enforcement to preflight (fail on drift,  │
│          │ don't just warn)                                              │
├──────────┼───────────────────────────────────────────────────────────────┤
│ P1       │ Replace ensure_secret_file() with safe file creation          │
│          │ (`O_CREAT                                                     │
├──────────┼───────────────────────────────────────────────────────────────┤
│ P1       │ Lock shim directory to 0700 and validate all entries are      │
│          │ symlinks to _harness_shim.py                                  │
├──────────┼───────────────────────────────────────────────────────────────┤
│ P2       │ Deduplicate sha256_text/canonical_json/append_event across    │
│          │ _mapping.py and _harness_shim.py                              │
├──────────┼───────────────────────────────────────────────────────────────┤
│ P2       │ Add unit tests for _proxy.py request handler (mock            │
│          │ connections)                                                  │
├──────────┼───────────────────────────────────────────────────────────────┤
│ P2       │ Add tests for untested cmd functions: validate, profile_*,    │
│          │ team_load/team_delete                                         │
├──────────┼───────────────────────────────────────────────────────────────┤
│ P2       │ Narrow CLIENT_PASS_ENV_KEYS to remove SSH_AUTH_SOCK,          │
│          │ DBUS_SESSION_BUS_ADDRESS, DISPLAY                             │
├──────────┼───────────────────────────────────────────────────────────────┤
│ P3       │ Add Content-Length limit in _proxy.py                         │
├──────────┼───────────────────────────────────────────────────────────────┤
│ P3       │ Add KeyboardInterrupt handler to _proxy.py:main()             │
└──────────┴───────────────────────────────────────────────────────────────┘
✻ Sautéed for 13m 18s
● Agent "Security vulnerability analysis" completed
● The security analysis agent's findings have already been incorporated into the
report above — all severity-rated findings are there. No additional follow-up
needed.
❯ /copy
⎿ Copied to clipboard (164 characters, 1 lines)
Also written to /tmp/claude/response.md
❯ /n-agentic-harnesses
● Using n-agentic-harnesses in evaluation mode — the BrokeLLM harness itself is
the subject, and we just completed a full security + code gap analysis. Let me
read the evaluation references to frame findings against established harness
primitives.
Read 2 files (ctrl+o to expand)
● Now let me read the remaining key references to evaluate BrokeLLM's harness
against established primitives.
Read 4 files (ctrl+o to expand)
● Now I have the full evaluation framework. Here's the harness-specific
evaluation of BrokeLLM, structured per the skill's output contract:
---
BrokeLLM — Agentic Harness Evaluation
Classification: code agent + embedded AI feature
Harness shape: Single-process gateway orchestrator with policy-driven model
routing and event-harness evaluation
Architecture level: ~Level 2-3 (durable harness + extensible surface, but
approaching complexity limits for solo maintenance)
---
FINDINGS
F1 — Tool/Execution Boundary: Shim PATH Trust Model
Severity: HIGH — safety gap
bin/broke:293-314 creates a shims/ directory and prepends it to $PATH. Every
entry is a symlink to _harness_shim.py, but nothing validates that symlinks
haven't been replaced between creation and execution. A compromised agent
could swap a symlink to point at an attacker binary.
Why it matters: The entire safety story of harness mediation depends on all
tool calls flowing through _harness_shim.py. A swapped symlink bypasses the
event ledger entirely.
Operator impact: Silent execution — no event is logged for hijacked commands.
Fix direction: After prepare_harness_shims(), create an immutable manifest of
expected symlink targets (e.g., hash each link destination). At execution
time, validate before exec. Better: use execve() with the absolute binary path
instead of relying on $PATH resolution, or have bwrap bind-mount each shim
read-only (--ro-bind).
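
The validation step might be sketched as follows. validate_shim_dir is a hypothetical helper; the directory and target paths are whatever prepare_harness_shims() uses, which this sketch takes as arguments instead of guessing.

```python
import os
import stat


def validate_shim_dir(shim_dir, expected_target):
    """Return a list of problems; an empty list means the directory looks sane."""
    problems = []
    mode = stat.S_IMODE(os.lstat(shim_dir).st_mode)
    if mode != 0o700:
        problems.append(f"{shim_dir} has mode {mode:04o}, expected 0700")
    want = os.path.realpath(expected_target)
    for name in sorted(os.listdir(shim_dir)):
        entry = os.path.join(shim_dir, name)
        if not os.path.islink(entry):
            problems.append(f"{name} is not a symlink")
        elif os.path.realpath(entry) != want:
            problems.append(f"{name} points at {os.path.realpath(entry)}")
    return problems
```

A check like this narrows the window but does not remove it: a link can still be swapped between validation and exec, which is why executing by absolute path or mounting the shims read-only is the stronger option.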
F2 — Missing Permission Tiers for Harness Commands
Severity: HIGH — missing primitive
The shim classifies commands into verification:test, verification:task, vcs,
runtime, tool — but no policy engine blocks or gates anything based on
classification. The cwd_allowed() check is the sole permission gate. A git
push --force or rm -rf classified as vcs executes without approval.
Why it matters: The skill's permission tier model (always-allow / ask-first /
never-allow) is absent.
Operator impact: Workers can execute high-risk actions through the shim
without review.
Fix direction: Add a policy.deny / policy.require_approval table keyed by
command + argument patterns. At minimum, deny destructive verbs (rm, dd, mkfs,
git reset --hard) regardless of cwd scope.
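
As a rough illustration of a deny table, the sketch below matches the command name and then requires every listed argument pattern to appear somewhere in argv. DENY_RULES and is_denied are hypothetical; this is not the shim's existing classification code, and the rule set is deliberately small.

```python
import fnmatch
import os

# Hypothetical rules: (command glob, argument globs that must all be present).
DENY_RULES = [
    ("rm", ["-r*"]),
    ("rm", ["-fr"]),
    ("dd", []),
    ("mkfs*", []),
    ("git", ["reset", "--hard"]),
    ("git", ["push", "--force*"]),
    ("git", ["push", "-f"]),
]


def is_denied(argv):
    """Return True if argv matches any deny rule."""
    if not argv:
        return False
    command = os.path.basename(argv[0])
    args = argv[1:]
    for command_glob, required in DENY_RULES:
        if not fnmatch.fnmatchcase(command, command_glob):
            continue
        # Every required pattern must match at least one argument.
        if all(any(fnmatch.fnmatchcase(a, pat) for a in args) for pat in required):
            return True
    return False
```

Pattern matching on argv is easy to sidestep (sh -c, aliases, combined short flags in another order), so a table like this is a floor under the approval tiers, not a replacement for them.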
F3 — Workflow State Conflated with Logging
Severity: MEDIUM — durability risk
The harness writes events to events.jsonl with chained hashes (good), but
there is no explicit "waiting", "retry", "approved", or "blocked" state
machine. The harness knows only run.registered → command.started →
command.completed → verdict.issued. If a worker crashes mid-task, there's no
retry_scheduled or compensated state.
Why it matters: Crashes leave runs in implicit "running" state forever; no
resumability path.
Operator impact: Stuck runs require manual cleanup; no visibility into what
was in flight.
Fix direction: Add a status field to the run record that progresses through:
registered → executing → collecting_checkpoint → awaiting_verdict →
completed/failed/escalated. Write state changes atomically. Add a broke
harness recover <run-id> command.
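
A sketch of what the status field and atomic write could look like. The status names come from the fix direction above; the transition table (in particular, allowing failed from every non-terminal state), the record layout, and the helper names are assumptions.

```python
import json
import os
import tempfile
from enum import Enum


class RunStatus(str, Enum):
    REGISTERED = "registered"
    EXECUTING = "executing"
    COLLECTING_CHECKPOINT = "collecting_checkpoint"
    AWAITING_VERDICT = "awaiting_verdict"
    COMPLETED = "completed"
    FAILED = "failed"
    ESCALATED = "escalated"


# Assumed transitions; terminal states have no entry and so allow nothing.
ALLOWED = {
    RunStatus.REGISTERED: {RunStatus.EXECUTING, RunStatus.FAILED},
    RunStatus.EXECUTING: {RunStatus.COLLECTING_CHECKPOINT, RunStatus.FAILED},
    RunStatus.COLLECTING_CHECKPOINT: {RunStatus.AWAITING_VERDICT, RunStatus.FAILED},
    RunStatus.AWAITING_VERDICT: {
        RunStatus.COMPLETED, RunStatus.FAILED, RunStatus.ESCALATED,
    },
}


def advance(record, new_status):
    """Return a copy of record with the new status, or raise on an illegal move."""
    current = RunStatus(record["status"])
    if new_status not in ALLOWED.get(current, set()):
        raise ValueError(f"illegal transition {current.value} -> {new_status.value}")
    return {**record, "status": new_status.value}


def write_record_atomic(path, record):
    """Write the record to a temp file in the same directory, then rename over path."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory, prefix=".run-", suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as fh:
            json.dump(record, fh)
            fh.flush()
            os.fsync(fh.fileno())
        os.replace(tmp, path)  # atomic on POSIX within one filesystem
    except BaseException:
        if os.path.exists(tmp):
            os.unlink(tmp)
        raise
```

A recover command could then list records whose status is non-terminal and decide per state whether to resume, fail, or escalate.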
F4 — Event Hash Chain Not Validated on Read
Severity: MEDIUM — audit integrity gap
Events are chained with prev_event_hash, but nothing verifies the chain on
_read_json() or at verdict time. A malicious process could modify an earlier
event without detection unless someone manually replays the hashes.
Why it matters: Chain integrity is a write-time-only guarantee; tampering is
detectable only if explicitly checked.
Fix direction: Add broke harness verify-chain <run-id> that walks the file and
confirms each event_hash matches recomputation. Consider validating on
verdict issuance.
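
The chain walk could be sketched like this. The field names prev_event_hash and event_hash come from the findings; how the harness actually canonicalizes an event before hashing is not shown in this report, so the scheme below (sorted keys, compact separators, hash over everything except event_hash, first event with a null predecessor) is an assumption that would need to match append_event.

```python
import hashlib
import json


def canonical_json(obj):
    # Assumed canonical form: sorted keys, no insignificant whitespace.
    return json.dumps(obj, sort_keys=True, separators=(",", ":"), ensure_ascii=False)


def event_hash(event):
    # Assumed scheme: hash every field except the stored hash itself.
    body = {k: v for k, v in event.items() if k != "event_hash"}
    return hashlib.sha256(canonical_json(body).encode("utf-8")).hexdigest()


def verify_chain(lines):
    """Return the 0-based index of the first bad event, or None if the chain holds."""
    prev = None
    for index, line in enumerate(lines):
        event = json.loads(line)
        if event.get("prev_event_hash") != prev:
            return index  # link to the previous event is broken
        if event.get("event_hash") != event_hash(event):
            return index  # event content no longer matches its hash
        prev = event["event_hash"]
    return None
```

This detects edits to individual events, but a process that can rewrite the whole file can also recompute every hash; catching that would need the latest hash to be recorded somewhere the worker cannot write.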
F5 — Evaluation Coverage Is Thin for Production-Harness Claims
Severity: MEDIUM — missing primitive
33 tests, all in test_mapping.py. Zero tests for:
- _proxy.py request handler error paths (e.g., 502 from LiteLLM, malformed
JSON body)
- _harness_shim.py policy enforcement (e.g., cwd violation, classification
accuracy)
- _socket_bridge.py connection behavior (idle timeout, slot exhaustion)
- Import/export integrity (overwrite flag, malformed JSON)
- Profile inheritance over team policies
By the skill's rule: "do not evaluate from vibes alone" — the harness claims
balanced/high_assurance verdicts but has no tests confirming those verdicts
produce correct outcomes under adversarial input.
Operator impact: Regression risk on every edit to proxy policy or harness
verdict logic.
Fix direction: At minimum, add proxy handler unit tests (mock upstream), shim
cwd/classification tests, and a verdict correctness matrix with known
inputs/outputs.
F6 — Context Assembly Has No Budget Discipline
Severity: LOW — maintainability
cmd_config() generates ALL models for ALL key variants into a single
config.json LiteLLM file. There's no concept of "load only the tools relevant
to the current context." If a user has 2 keys per provider across 6 providers
plus fallback chains, the config file sprawls.
Operator impact: Larger config → slower provider startup → higher chance of
config errors.
Fix direction: Dynamic profile generation — emit only active models + fallback
chain for the current team/slot mapping.
F7 — No Cost/Token Visibility
Severity: LOW — operational gap
cmd_metrics parses Prometheus success/failure counts from LiteLLM, but there's
no token or cost tracking, no cost-per-lane visibility, and no "runaway
behavior indicator." The entire value proposition (route to free backends)
lacks the single metric that proves it's working: cost saved.
Operator impact: Cannot validate ROI or detect a backend drifting from free to
paid tier.
Fix direction: Parse litellm_total_tokens and litellm_response_tokens from
metrics; derive per-request cost estimates based on provider pricing tables.
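
A naive sketch of the parsing half, summing every sample of one metric from Prometheus text exposition output. sum_metric is a hypothetical helper; the metric name is passed in by the caller because the names above were not checked against a specific LiteLLM release, and the sample text in the usage note is made up.

```python
import re

# metric name, optional {labels}, then the sample value
_SAMPLE = re.compile(r'^([A-Za-z_:][A-Za-z0-9_:]*)(\{[^}]*\})?\s+(\S+)')


def sum_metric(text, name):
    """Sum all samples of the named metric across label sets."""
    total = 0.0
    for line in text.splitlines():
        if line.startswith("#"):
            continue  # HELP / TYPE comments
        match = _SAMPLE.match(line)
        if match and match.group(1) == name:
            total += float(match.group(3))
    return total
```

Called as sum_metric(metrics_text, "litellm_total_tokens"), this gives a single token total. The regex does not handle a "}" inside a quoted label value, and per-lane cost would still need grouping by label plus a pricing table, neither of which is sketched here.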
F8 — Duplication Across Components
Severity: LOW — maintainability degradation
The following logic is duplicated across files:
- sha256_text / sha256_obj / canonical_json — replicated in _mapping.py,
_harness_shim.py, and _proxy.py
- File locking (locked_file / _locked_file) — two independent implementations
- read_json with fallback / read with parsing — three versions
This is exactly the pattern solo-devs regret at 6 months.
Fix direction: Create bin/_harness_common.py (or reuse _mapping.py as the
single source of truth) for shared hashing, JSON, locking, and event append.
Import everywhere.
---
STRENGTHS (brief)
1. Event-driven audit trail — events.jsonl with chained hashes, artifact refs,
and role separation is a strong foundation
2. Preflight integrity checks — cmd_preflight() validates PYTHONPATH
pollution, lockfile drift, file permissions, .pth injection
3. Multi-layer sandbox — normal/hardened/strict profiles with bwrap isolation,
network policy, and runtime directory separation
4. Policy-driven verdict — throughput/balanced/high_assurance with block_on,
retry_on, escalation thresholds
5. Provider abstraction — clean claude/codex/gemini env sanitization with
sanitize_client_env
6. Key rotation state — cooldown timers, auth_failed tracking, generation
numbers
---
UPGRADE PATH (sequenced)
┌───────┬───────────────────────────┬─────────────────────────────────────┐
│ Order │ Fix                       │ Rationale                           │
├───────┼───────────────────────────┼─────────────────────────────────────┤
│ 1     │ F1: Validate shims before │ Closes the highest-severity safety  │
│       │ exec                      │ gap in a single-function change     │
├───────┼───────────────────────────┼─────────────────────────────────────┤
│ 2     │ F2: Add deny table for    │ Lowest-effort safety win; prevents  │
│       │ destructive commands      │ silent destructive actions          │
├───────┼───────────────────────────┼─────────────────────────────────────┤
│ 3     │ F4: Verify event chain on │ Makes audit integrity               │
│       │ verdict                   │ bidirectional, not write-only       │
├───────┼───────────────────────────┼─────────────────────────────────────┤
│ 4     │ F5: Add proxy + shim      │ Covers the biggest blind spots in   │
│       │ tests                     │ the test matrix                     │
├───────┼───────────────────────────┼─────────────────────────────────────┤
│ 5     │ F3: Run state machine     │ Adds resumability and crash         │
│       │                           │ recovery                            │
├───────┼───────────────────────────┼─────────────────────────────────────┤
│ 6     │ F8: Deduplicate           │ Prevents future divergence          │
│       │ hash/lock/JSON            │                                     │
├───────┼───────────────────────────┼─────────────────────────────────────┤
│ 7     │ F7: Add cost tracking     │ Proves the system's value; detects  │
│       │                           │ paid-tier drift                     │
├───────┼───────────────────────────┼─────────────────────────────────────┤
│ 8     │ F6: Dynamic profile       │ Nice-to-have; only needed if config │
│       │ config                    │ sprawl becomes real                 │
└───────┴───────────────────────────┴─────────────────────────────────────┘
---
ACCEPTANCE CRITERIA AFTER FIXES
1. All shim symlinks verified before any command execution
2. Destructive commands (rm -rf, git reset --hard, dd) blocked by shim
regardless of cwd scope
3. Event chain verification passes for all completed runs
4. Proxy tests cover: 502 upstream failure, JSON parse error, auth bypass
attempt, missing token
5. Harness state machine shows clear status transitions in broke harness
status
6. All 34 cmd functions have at least one dedicated test
✻ Sautéed for 2m 0s