@mikeyobrien
Owner

Summary

  • Introduces preflight/backpressure improvements (auto-preflight, new gates for coverage, cargo audit, performance regression, verifier quality) plus spec completeness/acceptance criteria checks.
  • Updates orchestration/robot flows and skills/presets (Self-Healer hat, decision confidence protocol, skill consolidation, SOP alignment, preset trimming, human-in-the-loop refinements).
  • Expands CLI/tooling & TUI behavior (doctor command, tutorial onboarding, skill discovery, token/tsx parsing fixes, task status normalization, TUI width/iteration buffer fixes).
  • Broadens test coverage across CLI/core/adapters and adds/refreshes docs and templates.
  • Adds --yolo to Codex CLI invocation and adjusts tests.

Testing

  • cargo test

Commits (last 71)

Integrates the existing `ralph preflight` checks into the loop start
sequence so issues are caught before orchestration begins.

- Run preflight checks after loop context setup, before loop execution
- Configurable via features.preflight (enabled by default)
- --skip-preflight CLI flag overrides config
- Critical failures block the loop; warnings logged but allowed
- Strict mode treats warnings as failures
- Worktree cleanup on preflight failure for non-primary loops
- Deferred worktree registry registration until preflight passes
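
The gating rules in the list above can be sketched as a single decision function. This is a minimal illustration, not the project's code: the `CheckStatus` type, the `loop_may_start` name, and the parameter names are all assumptions made for the example.

```rust
/// Outcome of a single preflight check (hypothetical type for illustration).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum CheckStatus {
    Pass,
    Warn,
    Fail,
}

/// Returns true when the loop is allowed to start.
///
/// - `enabled` stands in for the `features.preflight` setting.
/// - `skip_preflight` stands in for the `--skip-preflight` flag, which
///   overrides the config.
/// - `strict` promotes warnings to failures.
pub fn loop_may_start(
    results: &[CheckStatus],
    enabled: bool,
    skip_preflight: bool,
    strict: bool,
) -> bool {
    // Skipped or disabled preflight never blocks the loop.
    if skip_preflight || !enabled {
        return true;
    }
    results.iter().all(|status| match status {
        CheckStatus::Pass => true,
        // Warnings are logged but allowed unless strict mode is on.
        CheckStatus::Warn => !strict,
        // Critical failures always block.
        CheckStatus::Fail => false,
    })
}
```
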

When `persistent: true` is set in event_loop config, the loop stays
alive after LOOP_COMPLETE instead of terminating. A task.resume event
is injected to keep the loop idling until new work arrives. Hard limits
(max_iterations, max_runtime, max_cost) still terminate normally.

Closes task-1769829997-dba7
…eflight opt-in

- Add 🩹 Self-Healer hat with automated recovery strategies (rollback, skip, reduce scope, fallback, escalate)
- Fix doctor.rs to strip Windows .exe/.cmd/.bat/.com extensions from backend names
- Add comprehensive tests for loops, preflight, and task CLI
- Make preflight checks opt-in (enabled: false by default) and skip outside git repos
- Fix UTF-8 boundary issue in event loop content truncation
- Require completion event to be last in JSONL batch
- Update backpressure docs to include mutation testing (warning-only)
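
For the UTF-8 boundary fix mentioned above, one common approach is to back the cut point up to the nearest character boundary before slicing. The helper below is a sketch under that assumption; its name is hypothetical and it is not taken from the commit.

```rust
/// Truncates `content` to at most `max_bytes` bytes without splitting a
/// multi-byte UTF-8 character. (Hypothetical helper for illustration.)
pub fn truncate_at_char_boundary(content: &str, max_bytes: usize) -> &str {
    if content.len() <= max_bytes {
        return content;
    }
    // Walk backwards until the cut point lands on a character boundary.
    // Index 0 is always a boundary, so the loop terminates.
    let mut end = max_bytes;
    while !content.is_char_boundary(end) {
        end -= 1;
    }
    &content[..end]
}
```

Slicing a `&str` at a byte index that falls inside a character panics, which is the kind of failure this guard avoids.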

Adds a new 'specs' preflight check that validates that .spec.md files contain
Given/When/Then acceptance criteria, a prerequisite for the Level 5
spec-driven pipeline where specs automatically generate acceptance tests.

The check:
- Recursively scans the specs directory for .spec.md files
- Skips specs marked as 'status: implemented' (already reviewed)
- Detects acceptance criteria in bold, plain text, and list formats
- Warns (non-blocking) when specs lack testable criteria
- Integrates into existing preflight infrastructure (ralph preflight --check specs)

Includes 13 unit tests covering all paths: empty dirs, complete specs,
incomplete specs, implemented specs, subdirectories, and all three
Given/When/Then format variants.
…parser

Add structured Given/When/Then parser (extract_acceptance_criteria) to
ralph-core that returns AcceptanceCriterion triples from spec content.
Create spec-to-test skill that teaches AI agents to generate test stubs
with 1:1 mapping to spec criteria. Wire into spec-driven.yml implementer
hat as step 0 (red phase of TDD). 13 new tests for the parser.
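
A parser of this shape could look roughly like the sketch below. Only the names `extract_acceptance_criteria` and `AcceptanceCriterion` come from the commit message; the field names, the `clause` helper, and the matching rules (bold, plain, and list-prefixed keywords) are assumptions for illustration.

```rust
/// One Given/When/Then triple. Field names are assumed.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct AcceptanceCriterion {
    pub given: String,
    pub when: String,
    pub then: String,
}

/// Returns the clause text if `line` starts with `keyword`, ignoring list
/// markers, bold markers, and ASCII case. (Hypothetical helper.)
fn clause(line: &str, keyword: &str) -> Option<String> {
    let cleaned =
        line.trim_start_matches(|c: char| c == '-' || c == '*' || c.is_whitespace());
    let head = cleaned.get(..keyword.len())?;
    if !head.eq_ignore_ascii_case(keyword) {
        return None;
    }
    let after = &cleaned[keyword.len()..];
    // Reject words that merely begin with the keyword, e.g. "Thenceforth".
    if after.chars().next().map_or(false, |c| c.is_alphanumeric()) {
        return None;
    }
    let text = after
        .trim_start_matches(|c: char| c == '*' || c == ':' || c.is_whitespace())
        .trim_end();
    if text.is_empty() {
        None
    } else {
        Some(text.to_string())
    }
}

/// Collects complete Given/When/Then triples in document order.
pub fn extract_acceptance_criteria(content: &str) -> Vec<AcceptanceCriterion> {
    let mut criteria = Vec::new();
    let mut given: Option<String> = None;
    let mut when: Option<String> = None;
    for line in content.lines() {
        if let Some(text) = clause(line, "given") {
            given = Some(text);
            when = None; // a new Given starts a new criterion
        } else if let Some(text) = clause(line, "when") {
            when = Some(text);
        } else if let Some(text) = clause(line, "then") {
            // Only emit a criterion when all three parts are present.
            if let (Some(g), Some(w)) = (given.take(), when.take()) {
                criteria.push(AcceptanceCriterion { given: g, when: w, then: text });
            }
        }
    }
    criteria
}
```
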

Add `specs: pass/fail` as a new optional backpressure dimension that verifies
spec acceptance criteria are satisfied by passing tests. When reported as
`specs: fail`, it blocks build.done events (like performance regression).
When omitted, it does not block (backwards compatible).

Changes:
- BackpressureEvidence: new `specs_verified: Option<bool>` field
- QualityReport: new `specs_verified: Option<bool>` field with failed_dimensions
- EventParser: parse `specs: pass/fail` from build.done and `quality.specs` from verify payloads
- Event loop: include specs status in backpressure rejection logs
- Instructions: mention specs in backpressure check list
- 9 new tests covering all spec evidence parsing paths
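
The three-state behaviour (pass, fail, omitted) maps naturally onto `Option<bool>`. The sketch below assumes a comma- or newline-separated `key: value` payload; that format and both function names are hypothetical, and only the `specs: pass/fail` token and the `Option<bool>` representation come from the commit message.

```rust
/// Parses an optional `specs: pass` / `specs: fail` entry from a payload.
/// Returns `None` when the dimension is omitted or has an unknown value.
pub fn parse_specs_status(payload: &str) -> Option<bool> {
    payload
        .split(|c: char| c == ',' || c == '\n')
        .filter_map(|part| part.trim().strip_prefix("specs:"))
        .find_map(|value| match value.trim() {
            "pass" => Some(true),
            "fail" => Some(false),
            _ => None,
        })
}

/// Only an explicit failure blocks `build.done`; an omitted value does not,
/// which keeps older payloads working.
pub fn specs_block_build_done(specs_verified: Option<bool>) -> bool {
    specs_verified == Some(false)
}
```
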

Update `ralph plan` (PDD SOP) to output artifacts in the directory
structure expected by spec-driven and pdd-to-code-assist presets:

- Default output dir: specs/{task_name}/ (was .sop/planning/)
- Flat layout: design.md, plan.md, requirements.md (was nested design/, implementation/)
- Renamed idea-honing.md → requirements.md (matches preset expectations)
- Added Given-When-Then acceptance criteria section to design template
- Updated Ralph Integration step to suggest spec-driven presets
- Synced .claude/skills/pdd/SKILL.md with bundled SOP
…eline

Update `ralph task` (Code Task Generator SOP) to output artifacts in
the directory structure expected by spec-driven presets:

- Default output dir: specs/{task_name}/tasks/ (was .ralph/tasks/)
- Reference design.md (was design/detailed-design.md) matching flat layout
- Updated Ralph Integration step to suggest spec-driven presets
- Updated examples to show specs/ directory paths
- Synced .claude/skills/code-task-generator/SKILL.md with bundled SOP

The task.start event handler ran `*self = Self::new()`, which wiped
iteration buffers, current_view, and following_latest state. This
caused the header to show "iter 1/0" and all previous iteration
output to disappear (garbled display).

Now preserves iterations, current_view, following_latest, and
new_iteration_alert across the reset, matching the existing pattern
for hat_map and loop_started.

Adds a regression test to prevent recurrence.
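
A reset that keeps selected fields can be written with struct update syntax. The sketch below is illustrative only: the `ViewState` type, the field types, and the `status_line` field are assumptions, while the four preserved field names follow the commit message.

```rust
/// Hypothetical TUI state; field types are assumed for the example.
#[derive(Debug, Default)]
pub struct ViewState {
    pub iterations: Vec<String>,
    pub current_view: usize,
    pub following_latest: bool,
    pub new_iteration_alert: Option<usize>,
    /// Stand-in for per-task state that should be cleared on reset.
    pub status_line: String,
}

impl ViewState {
    /// Resets per-task state while carrying iteration history across.
    pub fn on_task_start(&mut self) {
        // Move the buffers out first so they survive the reset.
        let iterations = std::mem::take(&mut self.iterations);
        let current_view = self.current_view;
        let following_latest = self.following_latest;
        let new_iteration_alert = self.new_iteration_alert;
        *self = Self {
            iterations,
            current_view,
            following_latest,
            new_iteration_alert,
            // Everything else goes back to its default value.
            ..Self::default()
        };
    }
}
```
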
…core 15

Audit of 27 presets identified 13 as redundant, experimental, or aspirational.
Removed to reduce user confusion and present a clear, opinionated set.

Removed (with rationale):
- feature-minimal: stripped version of feature.yml, confusing duplication
- tdd-red-green: code-assist.yml covers TDD with more flexibility
- adversarial-review: niche security review, review.yml suffices
- socratic-learning: teaching experiment, not a real workflow
- mob-programming: interesting concept but untested/unused
- scientific-method: overlaps with debug.yml hypothesis-driven approach
- code-archaeology: research.yml covers legacy code exploration
- performance-optimization: niche, debug + profiling covers it
- api-design: feature.yml or spec-driven.yml covers API work
- documentation-first: docs.yml covers documentation-driven work
- incident-response: aspirational, no production monitoring integration
- migration-safety: aspirational, very niche
- confession-loop: experimental quality pattern, code-assist has scoring
- planning.yml: web UI specific, not embedded

Remaining 15 presets: bugfix, code-assist, debug, deploy, docs, feature,
gap-analysis, hatless-baseline, merge-loop, pdd-to-code-assist, pr-review,
refactor, research, review, spec-driven

Updated: presets.rs, sync-embedded-files.sh, docs/guide/presets.md,
presets/index.json, and all tests referencing removed presets.

wait_for_response() and the Telegram message handler both used the
default events.jsonl path instead of reading the current-events marker
to find the active timestamped events file. As a result, interact.human
sent questions via Telegram but never received responses: the loop was
watching the wrong file, and responses were written to the wrong file.
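
The lookup described above amounts to "prefer the marker, fall back to the default". In the sketch below the function name, the idea that the marker holds a relative path, and the example file names are all assumptions; the commit message does not specify the marker's format or location.

```rust
use std::path::{Path, PathBuf};

/// Resolves the events file to read and write.
///
/// `marker_contents` is the text of the current-events marker, if it exists.
/// An empty or missing marker falls back to the default `events.jsonl`.
pub fn resolve_events_path(base_dir: &Path, marker_contents: Option<&str>) -> PathBuf {
    match marker_contents.map(str::trim).filter(|s| !s.is_empty()) {
        // Assumed: the marker stores a path relative to `base_dir`.
        Some(relative) => base_dir.join(relative),
        None => base_dir.join("events.jsonl"),
    }
}
```

Using one resolver for both the reader and the writer keeps the two sides pointed at the same file.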

ContentPane::render() always advanced x by 1 per character, but Unicode
wide characters (emoji, CJK, etc.) occupy 2 terminal columns. This caused
cascading misalignment where text after any wide character appeared garbled
with dropped/shifted characters.

Uses unicode-width to determine actual display width, resets trailing cells
for wide characters, and wraps before the edge when a wide character would
straddle the right boundary.
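
The layout rule can be sketched without a terminal. To stay self-contained, the example uses a deliberately simplified `display_width` stand-in that covers only a few wide ranges; the commit itself uses the unicode-width crate, and the `layout` function and cell representation here are assumptions.

```rust
/// Simplified stand-in for a real width lookup: treats a few common
/// CJK and emoji ranges as two columns wide, everything else as one.
fn display_width(c: char) -> usize {
    match c as u32 {
        0x1100..=0x115F
        | 0x2E80..=0xA4CF
        | 0xAC00..=0xD7A3
        | 0xF900..=0xFAFF
        | 0xFF00..=0xFF60
        | 0x1F300..=0x1FAFF => 2,
        _ => 1,
    }
}

/// Lays `text` out into rows of `width` cells. `None` marks an empty cell
/// or the trailing cell of a wide character.
pub fn layout(text: &str, width: usize) -> Vec<Vec<Option<char>>> {
    let mut rows = vec![vec![None; width]];
    let mut x = 0;
    for c in text.chars() {
        let w = display_width(c);
        if w > width {
            continue; // cannot fit on any row
        }
        // Wrap before the edge if the character would straddle it.
        if x + w > width {
            rows.push(vec![None; width]);
            x = 0;
        }
        let row = rows.last_mut().expect("rows is never empty");
        row[x] = Some(c);
        // The trailing cell of a wide character stays `None` (reset).
        x += w; // advance by display width, not by 1
    }
    rows
}
```
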

Chaos mode was an experimental feature that was never fully implemented:
the loop_runner only logged a TODO and immediately returned ChaosModeComplete.
Removes ~560 lines of unused code across 10 files:
- Delete chaos_mode.rs (254 LOC)
- Remove ChaosModeConfig, ResearchFocus, ChaosOutput from config.rs
- Remove ChaosModeComplete/ChaosModeMaxIterations TerminationReason variants
- Remove triggers_chaos_mode() method
- Remove --chaos and --chaos-max-iterations CLI args
- Clean up match arms in display, summary_writer, loop_runner, bench
- Update tests to remove chaos-related assertions
…ISS #7)

Session recording modules (cli_capture, session_recorder, session_player)
and their dependents (replay_backend, smoke_runner) are now conditionally
compiled behind `#[cfg(feature = "recording")]`. This removes ~1,147 LOC
from the default build when recording is not needed.

Workspace-internal crates enable the feature explicitly, so all existing
functionality and tests continue to work unchanged.
… fragments

Enable YAML anchors to de-duplicate instruction blocks across hats.
HatConfig.extra_instructions is a Vec<String> that gets merged into
instructions during config normalization.
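
The merge step might look like the sketch below. `HatConfig` and `extra_instructions` are named in the commit message; the `instructions` field type, the `normalize` method name, and the blank-line separator are assumptions.

```rust
/// Reduced, hypothetical view of a hat's config.
#[derive(Debug, Default, Clone)]
pub struct HatConfig {
    pub instructions: String,
    /// Shared fragments, e.g. pulled in through YAML anchors.
    pub extra_instructions: Vec<String>,
}

impl HatConfig {
    /// Appends each non-empty fragment to `instructions` and empties the
    /// list, so running normalization twice has no further effect.
    pub fn normalize(&mut self) {
        for fragment in self.extra_instructions.drain(..) {
            let fragment = fragment.trim();
            if fragment.is_empty() {
                continue;
            }
            if !self.instructions.is_empty() {
                self.instructions.push_str("\n\n");
            }
            self.instructions.push_str(fragment);
        }
    }
}
```
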

Also: derive Default for PreflightConfig (clippy derivable_impls),
remove unused default_false helper.

KISS item #3 — prep for hat config de-duplication.

Redundant with code-assist phase 2.2 which already performs project
analysis during implementation. Most projects already have AGENTS.md
and README.md files, making standalone documentation generation
unnecessary. Removes 314 lines of SOP guidance.
…rations skill (KISS item #10)

Merged two overlapping skills (531 lines) into a single "ralph-operations" skill
(213 lines), eliminating ~318 lines of duplication. One reference point for loop
lifecycle management, diagnostics analysis, and troubleshooting.
…(KISS item #11)

111 "You MUST" directives created constraint overload, reducing LLM compliance.
Consolidated to 32 focused constraints by:
- Extracting repeated rules to Important Notes (doc/code separation, snippet labeling)
- Lifting shared Code Phase constraints to phase-wide section
- Removing obvious/implied behaviors (mkdir before use, handle errors)
- Condensing verbose Troubleshooting into concise paragraphs
- Trusting agent judgment for non-critical style decisions

All critical invariants preserved: TDD cycle, no broken commits, no push,
convention compliance, CODEASSIST.md integration, separation of concerns.

Net reduction: 463 → 214 lines (-54%), 111 → 32 MUSTs (-71%)
… item #11)

PDD: 85 → 19 MUSTs (-78%), 298 → 147 lines (-51%)
Code-Task-Generator: 57 → 8 MUSTs (-86%), 349 → 159 lines (-54%)

Applied same simplification pattern as code-assist (cf79fc3):
- Extracted repeated cross-step rules to Important Notes section
- Removed obvious/implied behaviors (create dirs, use tools)
- Condensed verbose troubleshooting into concise paragraphs
- Converted "because this could..." rationale into short descriptions
- Trusted agent judgment for non-critical decisions

All critical invariants preserved: user-driven flow, one-question-at-a-time
requirements, user approval before generation, Given-When-Then acceptance
criteria, code task format spec, Ralph integration offering.

Combined Item #11 totals: 253 → 59 MUSTs across all 3 SOPs (-77%).

Fixes accumulated clippy pedantic warnings that fail under -D warnings:
- preflight.rs: unnecessary raw string hashes (r#"" → r"") in 16 test literals
- event_loop/tests.rs: bool_assert_comparison, unnecessary_map_or, cloned_ref_to_slice_refs
- bot.rs: manual_string_new ("".to_string() → String::new())
- memory.rs: useless_format and manual_div_ceil
- content.rs: single-char string pattern (.contains("x") → .contains('x'))
…(KISS item #2)

Consolidate spec-to-test (247 lines) and test-generation (127 lines) into
one test-driven-development skill (134 lines) with three input modes:
- Mode A: From Spec (.spec.md) — replaces spec-to-test
- Mode B: From Task (.code-task.md) — replaces code-assist phase 4 guidance
- Mode C: From Description — replaces test-generation

Updated references in spec-driven.yml presets, integration tests, and
skill_registry.rs test fixtures. Net reduction: ~240 lines.
…S item #13)

ralph-tui declared ralph-adapters as a dependency but never imported or
used any types from it. Removing this dead dependency cleans up the crate
dependency graph and avoids pulling in ~6K LOC of backend adapter code
(plus transitive deps like portable-pty, vt100, termimad) when building
the TUI.
…ait (KISS item #9)

Introduce a RobotService trait in ralph-proto that abstracts the
human-in-the-loop communication surface (send_question, wait_for_response,
send_checkin, shutdown_flag, stop). The EventLoop now holds an
Option<Box<dyn RobotService>> instead of a concrete TelegramService,
and the CLI layer (loop_runner.rs) creates and injects the service.

This removes ralph-telegram from ralph-core's dependency graph entirely,
keeping the core event loop decoupled from any specific communication
platform.
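
The shape of this abstraction can be sketched as follows. The trait name and method names come from the commit message; every signature, the `String` error type, the reduced `EventLoop`, its `ask_human` method, and the `ScriptedRobot` test double are assumptions made for illustration.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

/// Human-in-the-loop surface. Signatures are assumed, not the real API.
pub trait RobotService {
    fn send_question(&mut self, question: &str) -> Result<(), String>;
    fn wait_for_response(&mut self) -> Result<Option<String>, String>;
    fn send_checkin(&mut self, summary: &str) -> Result<(), String>;
    fn shutdown_flag(&self) -> Arc<AtomicBool>;
    fn stop(&mut self);
}

/// Reduced event loop that only knows the trait, not a concrete backend.
pub struct EventLoop {
    robot: Option<Box<dyn RobotService>>,
}

impl EventLoop {
    /// The caller (the CLI layer in the commit) builds and injects the service.
    pub fn new(robot: Option<Box<dyn RobotService>>) -> Self {
        Self { robot }
    }

    /// Asks a question if a service is configured; otherwise returns `None`.
    pub fn ask_human(&mut self, question: &str) -> Result<Option<String>, String> {
        match self.robot.as_mut() {
            Some(robot) => {
                robot.send_question(question)?;
                robot.wait_for_response()
            }
            None => Ok(None),
        }
    }
}

/// Test double that always answers with a fixed reply.
pub struct ScriptedRobot {
    reply: String,
    shutdown: Arc<AtomicBool>,
}

impl ScriptedRobot {
    pub fn new(reply: &str) -> Self {
        Self { reply: reply.to_string(), shutdown: Arc::new(AtomicBool::new(false)) }
    }
}

impl RobotService for ScriptedRobot {
    fn send_question(&mut self, _question: &str) -> Result<(), String> {
        Ok(())
    }
    fn wait_for_response(&mut self) -> Result<Option<String>, String> {
        Ok(Some(self.reply.clone()))
    }
    fn send_checkin(&mut self, _summary: &str) -> Result<(), String> {
        Ok(())
    }
    fn shutdown_flag(&self) -> Arc<AtomicBool> {
        Arc::clone(&self.shutdown)
    }
    fn stop(&mut self) {
        self.shutdown.store(true, Ordering::SeqCst);
    }
}
```

Because the loop holds a trait object, a scripted double like this can replace the Telegram backend in tests.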
…adata, event naming

Core event handling:
- Add separate human_pending queue in EventBus for human.* events
- Rename interact.human → human.interact for consistency with human.response/human.guidance
- Route human events to Ralph hat when no other pending events
- Update RobotService trait docs to reflect new event naming
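
The separate queue can be sketched as a two-queue bus in which human.* events are delivered only once the regular queue is empty. `EventBus` and `human_pending` are named in the commit message; the `Event` struct, the method names, and the exact delivery rule are assumptions.

```rust
use std::collections::VecDeque;

/// Minimal event for illustration; the real type is not shown in the PR.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Event {
    pub topic: String,
    pub payload: String,
}

#[derive(Debug, Default)]
pub struct EventBus {
    pending: VecDeque<Event>,
    human_pending: VecDeque<Event>,
}

impl EventBus {
    /// Routes `human.*` topics to their own queue.
    pub fn publish(&mut self, event: Event) {
        if event.topic.starts_with("human.") {
            self.human_pending.push_back(event);
        } else {
            self.pending.push_back(event);
        }
    }

    /// Regular events go first; human events are handed out only when
    /// no other events are pending.
    pub fn next_event(&mut self) -> Option<Event> {
        self.pending
            .pop_front()
            .or_else(|| self.human_pending.pop_front())
    }
}
```
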

TUI improvements:
- Track per-iteration hat/backend metadata for accurate review display
- Show max_iterations in header (e.g., [iter 3/50])
- Add human interaction state tracking to TuiState
- Update header widget to use iteration metadata when reviewing past iterations
- Add prepare_tui_iteration helper in loop_runner

Documentation:
- Update AGENTS.md with corrected event names (human.interact, human.response)
- Update ralph-telegram README and robot-interaction-skill.md
- Update presets (bugfix.yml, code-assist.yml) and ralph.bot.yml config
@mikeyobrien changed the title from "feat: preflight/backpressure gates, skills updates, and CLI improvements" to "chore: preflight gates + skills refresh" on Feb 2, 2026
@claude

claude bot commented Feb 2, 2026

Code review

No issues found. Checked for bugs and CLAUDE.md compliance.

@mikeyobrien merged commit 92be62f into main on Feb 2, 2026
17 checks passed