reports/introspect-development-2026-03-28.md
Lines changed: 18 additions & 38 deletions
@@ -300,29 +300,9 @@ This is the same Datomic API, same schema, same queries — just no persistence.

---

-## 6. Bugs Found and Fixed
-
-Eleven bugs were found and fixed across two testing passes. (Beware of the above code; I have only proved it correct, not tested it. The testing revealed that proving and correctness are, as usual, unrelated.)
-
-| # | Severity | Bug | Fix |
-|---|----------|-----|-----|
-| 1 | Critical | `->>` threading swapped argument order in `resolve-question-params` → NPE on startup | Direct function call instead of threading |
-| 2 | Critical | `format-history` NPE on skipped records (nil `:target`) | Default to `"unknown"` for nil fields |
-| 3 | Medium | `load-history` crash on corrupted/partial EDN files | try/catch with `[]` fallback |
-| 4 | Medium | `parse-proposal` NPE when LLM returns nil text | Early return nil |
-| 5 | Security | `git add -A` staged `.env` and `data/` | Allowlist: only `resources/` and `src/noumenon/` |
-| 6 | Medium | Division by zero in model evaluation on empty dataset | Early return zero scores |
-| 7 | Medium | `cross-entropy-loss` ArrayIndexOutOfBounds on OOB labels | Bounds check with max penalty fallback |
-| 8 | Low | Revert destroyed file formatting (`pr-str` flattens multi-line) | Save/restore raw bytes instead of parsed data |
-| 10 | Low | CLI error showed global help instead of introspect help | Include `:subcommand` in parse results |
-| 11 | Critical | No exception recovery: thrown exception after apply leaves modified file on disk | try/catch with automatic revert, record as `:error` |
+## 6. Preliminary Test Results

----
-
-## 7. Preliminary Test Results
-
-### 7.1 End-to-end runs
+### 6.1 End-to-end runs

Four end-to-end runs were completed during development, exercising all major code paths. All runs targeted the Noumenon repository itself using the GLM provider with Sonnet.
@@ -333,7 +313,7 @@ Four end-to-end runs were completed during development, exercising all major cod
**Result:** Mean score dropped from **0.682 to 0.636** (-4.5 percentage points). The system correctly reverted the change, restoring the original example selection with exact byte-level fidelity.
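
The revert described in this result can be illustrated with a small sketch. It is written in Python rather than the project's Clojure, and every name in it (`apply_with_revert`, `evaluate`, the outcome strings, the `examples.edn` file) is hypothetical. The comparison is also simplified to a plain `score > baseline`. The point is only that snapshotting the raw bytes before applying a change lets a rejected change be undone byte for byte.

```python
import tempfile
from pathlib import Path


def apply_with_revert(path, new_text, evaluate, baseline):
    """Apply new_text to path; keep it only if evaluate() beats baseline.

    Hypothetical helper: it snapshots the raw bytes first, so a revert
    restores the file exactly instead of re-serializing parsed data.
    """
    original = path.read_bytes()          # raw snapshot, formatting included
    try:
        path.write_text(new_text, encoding="utf-8")
        score = evaluate()
    except Exception:
        path.write_bytes(original)        # revert on any failure after apply
        return "error", baseline
    if score > baseline:
        return "improved", score
    path.write_bytes(original)            # no gain: restore the exact bytes
    return "reverted", score


# Demo using the two scores reported above: 0.682 before, 0.636 after.
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "examples.edn"   # hypothetical file name
    target.write_bytes(b"[{:q 1}\n {:q 2}]\n")
    before = target.read_bytes()
    outcome, score = apply_with_revert(target, "[{:q 3}]", lambda: 0.636, 0.682)
    restored = target.read_bytes() == before

print(outcome, score, restored)
```

Comparing the restored bytes with the snapshot, as the last lines do, is one simple way to check the byte-level fidelity claim.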

-### 7.4 Runs 3-4: Graceful parse failure handling
+### 6.4 Runs 3-4: Graceful parse failure handling

The optimizer LLM returned malformed EDN. Actual error message:
@@ -399,7 +379,7 @@ introspect: failed to parse proposal, skipping
The parse error was caught, the iteration was logged as `:skipped`, no files were modified, and the loop completed normally. This validates the nil-handling and error recovery paths.
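
The skip-and-continue behavior can be sketched as follows. This is an illustrative Python stand-in, not the project's Clojure code: JSON replaces EDN purely for convenience, and `run_iterations` and the `"proposed"` outcome label are assumptions made for the example.

```python
import json


def parse_proposal(text):
    """Return the parsed proposal, or None when the text is missing or malformed.

    JSON stands in for EDN here purely for illustration.
    """
    if text is None:
        return None
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return None


def run_iterations(responses):
    """Record one outcome per optimizer response without aborting the loop."""
    history = []
    for i, text in enumerate(responses, start=1):
        proposal = parse_proposal(text)
        if proposal is None:
            # Nothing is applied, so no file is touched for this iteration.
            history.append({"iteration": i, "outcome": "skipped"})
            continue
        history.append({"iteration": i, "outcome": "proposed", "proposal": proposal})
    return history


# One well-formed response, one truncated response, one missing response.
history = run_iterations(['{"target": "examples"}', '{"target": ', None])
print([h["outcome"] for h in history])
```

Because the parser returns `None` instead of raising, the loop has a single place where a bad response is recorded and skipped.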

-### 7.5 Datomic persistence verified
+### 6.5 Datomic persistence verified

After the e2e run, the meta database was queried to confirm persistence:
@@ -410,21 +390,21 @@ Runs: 1
The run ID, baseline, final score, and all iteration records survived the Datomic round-trip.

-### 7.6 Baseline variability
+### 6.6 Baseline variability

The baseline scores varied across runs (0.523, 0.682, 0.659, 0.636) despite using the same database and prompt configuration. This is because the evaluation runs each question through `agent/ask`, which makes multiple LLM calls. The agent may choose different query strategies each run, and the LLM's output varies even at temperature 0 due to server-side batching effects.

This variability is the main technical risk for the introspect loop: a modification that appears to improve the score by +0.02 might just be noise. The current threshold of +0.001 is conservative in the wrong direction — it catches true improvements but also false positives. Future work should either run evaluations multiple times or increase the threshold.
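
To put a number on that noise, the four reported baselines can be summarized directly. The sketch below is Python rather than the project's Clojure and uses only the scores quoted above. With just four samples the estimate is rough, but it gives a sample standard deviation of about 0.07, far larger than both the +0.02 example and the +0.001 threshold.

```python
import statistics

# Baseline scores reported above for four runs of the same configuration.
baselines = [0.523, 0.682, 0.659, 0.636]

mean = statistics.mean(baselines)
spread = statistics.stdev(baselines)      # sample standard deviation
score_range = max(baselines) - min(baselines)

threshold = 0.001                         # current improvement threshold
print(f"mean={mean:.3f} stdev={spread:.3f} range={score_range:.3f}")
print(f"stdev / threshold = {spread / threshold:.1f}")
```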

---

-## 8. User and Agent Affordances
+## 7. User and Agent Affordances

-### 8.1 The need
+### 7.1 The need

Two audiences use Noumenon: humans via the CLI, and AI agents via MCP. Both need to be able to trigger self-improvement runs, control their cost, and inspect results. The CLI user might run an overnight optimization session. The MCP agent might trigger introspect when it notices the ask agent performing poorly on a particular class of questions.
The `noumenon_introspect_start` tool launches an async run and returns a run ID. `noumenon_introspect_status` and `noumenon_introspect_stop` monitor and control it. `noumenon_introspect_history` routes introspect queries to the internal meta database.

-### 8.4 Queryable history
+### 7.4 Queryable history

All iterations are persisted to the internal Datomic meta database as component entities of the run. This enables Datalog queries for post-hoc analysis — `introspect-improvements` shows all kept improvements with deltas, `introspect-failed-approaches` shows what was tried and didn't work (so the optimizer can avoid repeating failures), and `introspect-score-trend` tracks progress over time.

---

-## 9. Known Limitations
+## 8. Known Limitations

-### 9.1 No statistical significance testing
+### 8.1 No statistical significance testing

The evaluation runs each question once per iteration (or N times with `--eval-runs`). With LLM non-determinism, small deltas may be noise. The current improvement threshold of +0.001 catches true improvements but also false positives. Increasing `--eval-runs` helps but also increases cost.
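
One possible shape for a significance-aware acceptance rule is sketched here. It is a hypothetical Python illustration, not the project's current logic: `clears_noise` and the factor `k` are assumptions, and the input scores are invented for the example rather than measured. The rule accepts a change only when the mean gain across repeated evaluations exceeds `k` standard errors of the difference.

```python
import math
import statistics


def clears_noise(baseline_runs, candidate_runs, k=2.0):
    """Return True when the mean gain exceeds k standard errors of the difference.

    Hypothetical decision rule. Each argument is a list of mean scores from
    repeated evaluations and needs at least two entries.
    """
    gain = statistics.mean(candidate_runs) - statistics.mean(baseline_runs)
    se = math.sqrt(
        statistics.variance(baseline_runs) / len(baseline_runs)
        + statistics.variance(candidate_runs) / len(candidate_runs)
    )
    return gain > k * se


# Made-up scores for illustration only; they are not measured results.
noisy_baseline = [0.62, 0.68, 0.60, 0.66]
small_gain = [0.64, 0.70, 0.62, 0.68]     # +0.02 on average, inside the noise
large_gain = [0.80, 0.84, 0.79, 0.83]     # clearly separated from the baseline

print(clears_noise(noisy_baseline, small_gain))
print(clears_noise(noisy_baseline, large_gain))
```

The trade-off is the one noted above: every additional evaluation run tightens the standard error but multiplies the LLM cost of each iteration.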

-### 9.2 No human review gate for code changes
+### 8.2 No human review gate for code changes

The `:code` target auto-reverts on lint or test failure, but there is no mechanism for human review before applying code changes. For production use, code changes should be proposed on a branch.

-### 9.3 No prompt caching across evaluations
+### 8.3 No prompt caching across evaluations

Each question creates a fresh `agent/ask` session. The system prompt is re-sent with every LLM call. [Anthropic API](https://docs.anthropic.com/en/docs/build-with-claude/prompt-caching) prompt caching is used within a single agent session but not across questions.

---

-## 10. Implemented Since Initial Development
+## 9. Implemented Since Initial Development

The following items were originally listed as future directions and have since been implemented:
@@ -486,7 +466,7 @@ The following items were originally listed as future directions and have since b
- **Human target constraint** (`--target examples,rules`) — restricts which targets the optimizer may choose
1. **[Deep Diamond](https://github.com/uncomplicate/deep-diamond) GPU training** — swap the pure-Clojure model for GPU-accelerated training when the model grows beyond toy size (requires a feasibility spike)
2. **Prompts and queries in Datomic** — store prompt templates and named queries in the meta database instead of classpath resources, enabling transactional modification with automatic rollback via Datomic's immutable history (significant refactor, better as its own branch)