Commit 5fce484

leifericf and claude committed
docs(report): remove bug list section from introspect report
Visible in git history; not needed in the report.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent 3d5a24c commit 5fce484

1 file changed

Lines changed: 18 additions & 38 deletions

reports/introspect-development-2026-03-28.md

@@ -300,29 +300,9 @@ This is the same Datomic API, same schema, same queries — just no persistence.

 ---

-## 6. Bugs Found and Fixed
-
-Eleven bugs were found and fixed across two testing passes. (Beware of the above code; I have only proved it correct, not tested it. The testing revealed that proving and correctness are, as usual, unrelated.)
-
-| # | Severity | Bug | Fix |
-|---|----------|-----|-----|
-| 1 | Critical | `->>` threading swapped argument order in `resolve-question-params` → NPE on startup | Direct function call instead of threading |
-| 2 | Critical | `format-history` NPE on skipped records (nil `:target`) | Default to `"unknown"` for nil fields |
-| 3 | Medium | `load-history` crash on corrupted/partial EDN files | try/catch with `[]` fallback |
-| 4 | Medium | `parse-proposal` NPE when LLM returns nil text | Early return nil |
-| 5 | Security | `git add -A` staged `.env` and `data/` | Allowlist: only `resources/` and `src/noumenon/` |
-| 6 | Medium | Division by zero in model evaluation on empty dataset | Early return zero scores |
-| 7 | Medium | `cross-entropy-loss` ArrayIndexOutOfBounds on OOB labels | Bounds check with max penalty fallback |
-| 8 | Low | Revert destroyed file formatting (`pr-str` flattens multi-line) | Save/restore raw bytes instead of parsed data |
-| 9 | Security | Path traversal: `src/noumenon/../../etc/passwd.clj` passed validation | Reject paths containing `..` |
-| 10 | Low | CLI error showed global help instead of introspect help | Include `:subcommand` in parse results |
-| 11 | Critical | No exception recovery: thrown exception after apply leaves modified file on disk | try/catch with automatic revert, record as `:error` |
+## 6. Preliminary Test Results

----
-
-## 7. Preliminary Test Results
-
-### 7.1 End-to-end runs
+### 6.1 End-to-end runs

 Four end-to-end runs were completed during development, exercising all major code paths. All runs targeted the Noumenon repository itself using the GLM provider with Sonnet.

@@ -333,7 +313,7 @@ Four end-to-end runs were completed during development, exercising all major cod
 | 3 | 0.659 | -- | -- | Skipped | -- | LLM parse failure |
 | 4 | 0.636 | -- | -- | Skipped | -- | LLM parse failure |

-### 7.2 Run 1: Successful improvement (+6.8%)
+### 6.2 Run 1: Successful improvement (+6.8%)

 **Optimizer's gap analysis input** (excerpt from the actual meta-prompt sent to the LLM):

@@ -374,7 +354,7 @@ WRONG answers (highest priority):

 **Result:** Mean score improved from **0.523 to 0.591** (+6.8 percentage points). The system kept the modification.

-### 7.3 Run 2: Correctly reverted regression (-4.5%)
+### 6.3 Run 2: Correctly reverted regression (-4.5%)

 **Optimizer's response** (verbatim):

@@ -388,7 +368,7 @@ WRONG answers (highest priority):

 **Result:** Mean score dropped from **0.682 to 0.636** (-4.5 percentage points). The system correctly reverted the change, restoring the original example selection with exact byte-level fidelity.

-### 7.4 Runs 3-4: Graceful parse failure handling
+### 6.4 Runs 3-4: Graceful parse failure handling

 The optimizer LLM returned malformed EDN. Actual error message:

@@ -399,7 +379,7 @@ introspect: failed to parse proposal, skipping

 The parse error was caught, the iteration was logged as `:skipped`, no files were modified, and the loop completed normally. This validates the nil-handling and error recovery paths.

-### 7.5 Datomic persistence verified
+### 6.5 Datomic persistence verified

 After the e2e run, the meta database was queried to confirm persistence:

@@ -410,21 +390,21 @@ Runs: 1

 The run ID, baseline, final score, and all iteration records survived the Datomic round-trip.

-### 7.6 Baseline variability
+### 6.6 Baseline variability

 The baseline scores varied across runs (0.523, 0.682, 0.659, 0.636) despite using the same database and prompt configuration. This is because the evaluation runs each question through `agent/ask`, which makes multiple LLM calls. The agent may choose different query strategies on each run, and the LLM's output varies even at temperature 0 due to server-side batching effects.

 This variability is the main technical risk for the introspect loop: a modification that appears to improve the score by +0.02 might just be noise. The current threshold of +0.001 errs in the wrong direction: it accepts true improvements but also false positives. Future work should either run evaluations multiple times or increase the threshold.
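
One way to make the threshold noise-aware, sketched in Python under stated assumptions: evaluate baseline and candidate several times each (for example via `--eval-runs`) and keep a change only when the mean gain exceeds both the fixed threshold and a multiple of the standard error of the difference. The function name `should_keep`, the parameter `z`, and its default of 2.0 are hypothetical illustrations, not part of Noumenon.

```python
from math import sqrt
from statistics import mean, stdev


def should_keep(baseline_scores, candidate_scores, min_delta=0.001, z=2.0):
    """Return True if the mean gain clears min_delta and z standard errors.

    Hypothetical helper: each list needs at least two scores so that a
    sample standard deviation can be computed.
    """
    delta = mean(candidate_scores) - mean(baseline_scores)
    # Standard error of the difference between two independent sample means.
    se = sqrt(
        stdev(baseline_scores) ** 2 / len(baseline_scores)
        + stdev(candidate_scores) ** 2 / len(candidate_scores)
    )
    return delta > max(min_delta, z * se)


# The four baseline scores reported above, treated as repeated measurements.
baselines = [0.523, 0.682, 0.659, 0.636]
noise = stdev(baselines)  # roughly 0.07, far larger than the +0.001 threshold

# A uniform +0.02 shift is indistinguishable from noise at this spread.
shifted = [b + 0.02 for b in baselines]
keep_shifted = should_keep(baselines, shifted)
```

With a spread of about 0.07 across the four reported baselines, a gain of +0.02 would be rejected by this rule; more evaluation runs or a less variable evaluation would shrink the standard error and let smaller gains through.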

 ---

-## 8. User and Agent Affordances
+## 7. User and Agent Affordances

-### 8.1 The need
+### 7.1 The need

 Two audiences use Noumenon: humans via the CLI, and AI agents via MCP. Both need to be able to trigger self-improvement runs, control their cost, and inspect results. The CLI user might run an overnight optimization session. The MCP agent might trigger introspect when it notices the ask agent performing poorly on a particular class of questions.

-### 8.2 CLI interface
+### 7.2 CLI interface

 ```bash
 # Run 10 iterations with default settings
@@ -447,33 +427,33 @@ clj -M:run introspect --max-iterations 20 --git-commit .
 | `--git-commit` | Auto-commit each improvement |
 | `--verbose` | Log verbose output to stderr |

-### 8.3 MCP tools
+### 7.3 MCP tools

 The `noumenon_introspect_start` tool launches an async run and returns a run ID. `noumenon_introspect_status` and `noumenon_introspect_stop` monitor and control it. `noumenon_introspect_history` routes introspect queries to the internal meta database.

-### 8.4 Queryable history
+### 7.4 Queryable history

 All iterations are persisted to the internal Datomic meta database as component entities of the run. This enables Datalog queries for post-hoc analysis: `introspect-improvements` shows all kept improvements with deltas, `introspect-failed-approaches` shows what was tried and didn't work (so the optimizer can avoid repeating failures), and `introspect-score-trend` tracks progress over time.

 ---

-## 9. Known Limitations
+## 8. Known Limitations

-### 9.1 No statistical significance testing
+### 8.1 No statistical significance testing

 The evaluation runs each question once per iteration (or N times with `--eval-runs`). With LLM non-determinism, small deltas may be noise. The current improvement threshold of +0.001 catches true improvements but also false positives. Increasing `--eval-runs` helps but also increases cost.

-### 9.2 No human review gate for code changes
+### 8.2 No human review gate for code changes

 The `:code` target auto-reverts on lint or test failure, but there is no mechanism for human review before applying code changes. For production use, code changes should be proposed on a branch.

-### 9.3 No prompt caching across evaluations
+### 8.3 No prompt caching across evaluations

 Each question creates a fresh `agent/ask` session. The system prompt is re-sent with every LLM call. [Anthropic API](https://docs.anthropic.com/en/docs/build-with-claude/prompt-caching) prompt caching is used within a single agent session but not across questions.

 ---

-## 10. Implemented Since Initial Development
+## 9. Implemented Since Initial Development

 The following items were originally listed as future directions and have since been implemented:

@@ -486,7 +466,7 @@ The following items were originally listed as future directions and have since b
 - **Human target constraint** (`--target examples,rules`): restricts which targets the optimizer may choose
 - **Pre-trained weight shipping** (documented workflow: `cp data/models/latest.edn resources/model/weights.edn`)

-## 11. Remaining Future Directions
+## 10. Remaining Future Directions

 1. **[Deep Diamond](https://github.com/uncomplicate/deep-diamond) GPU training**: swap the pure-Clojure model for GPU-accelerated training when the model grows beyond toy size (requires a feasibility spike)
 2. **Prompts and queries in Datomic**: store prompt templates and named queries in the meta database instead of classpath resources, enabling transactional modification with automatic rollback via Datomic's immutable history (significant refactor, better as its own branch)
