Improve fitness evaluation reliability by extracting the first valid score payload from mixed model output, including fenced JSON and malformed-leading-object cases.
Add focused unit tests for parsing edge cases and wire the helper into evaluateFitness to reduce fallback aggregate=0 outcomes.
Update IMPLEMENTATION_PLAN.md with the completed resilience task and validation notes.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
IMPLEMENTATION_PLAN.md: 15 additions and 2 deletions
@@ -236,6 +236,19 @@ This plan lists prioritized tasks required to bring the implementation into full
  - **Dependencies:** Task 16
  - **Notes:**
  - Targets the regression observed at iteration 25 where evaluation timed out and fallback scoring forced `aggregate=0`.
- - Expanded timeout detection to inspect string errors, `Error` instances, and nested `cause` chains used by SDK-wrapped errors.
- - Keeps retry behavior behavior-safe while reducing false negatives in timeout detection.
+ - Expanded timeout detection to inspect string errors, `Error` instances, and nested `cause` chains used by SDK-wrapped errors.
+ - Keeps retry behavior behavior-safe while reducing false negatives in timeout detection.
+ - Validation run after this change: `npm run typecheck`, `npm run lint`, `npm test`, and `npm audit --production` all pass; audit reports 0 vulnerabilities.
+
+ ## 19. Ralph Loop Evaluation JSON Extraction Resilience
+ - **Task:** Harden fitness-evaluation response parsing so valid scoring JSON is recovered from mixed prose/code-fence outputs instead of triggering fallback aggregate scoring. **[COMPLETE]**
+ - Targets the score-regression pattern where evaluation responses may include extra wrapper text and cause JSON parse misses that force fallback scores (`aggregate=0`).
+ - Added `extractFitnessJsonPayload()` with balanced-brace scanning to find the first valid JSON object containing required fitness score fields, including content embedded in markdown code fences.
+ - Updated `evaluateFitness()` in `ralph-loop.ts` to use the new helper, preserving existing score clamping and checklist normalization.
+ - Added unit coverage for plain JSON, fenced JSON with surrounding text, malformed-leading-object recovery, and null return when no valid payload exists.
  - Validation run after this change: `npm run typecheck`, `npm run lint`, `npm test`, and `npm audit --production` all pass; audit reports 0 vulnerabilities.
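The timeout-detection note in the diff above (string errors, `Error` instances, nested `cause` chains) could look roughly like the following. This is a minimal sketch, not the code from this repository: the function name `isTimeoutErrorSketch`, the regular expression, and the depth limit are all assumptions made for illustration.

```typescript
// Hypothetical pattern: matches "timeout", "timed out", "ETIMEDOUT", "TimeoutError".
const TIMEOUT_PATTERN = /timed?[\s-]?out/i;

// Hypothetical helper: walks an error and its `cause` chain looking for timeout hints.
// `maxDepth` bounds the walk so a cyclic `cause` chain cannot loop forever.
function isTimeoutErrorSketch(err: unknown, maxDepth = 5): boolean {
  let current: unknown = err;
  for (let depth = 0; depth < maxDepth; depth++) {
    // Plain string errors are matched directly.
    if (typeof current === "string") return TIMEOUT_PATTERN.test(current);
    if (typeof current !== "object" || current === null) return false;

    // Covers `Error` instances and error-like objects from SDK wrappers.
    const { name, message, code, cause } = current as {
      name?: unknown;
      message?: unknown;
      code?: unknown;
      cause?: unknown;
    };
    for (const field of [name, message, code]) {
      if (typeof field === "string" && TIMEOUT_PATTERN.test(field)) return true;
    }

    // Nothing matched at this level; descend into the wrapped cause, if any.
    current = cause;
  }
  return false;
}
```

Checking `name`, `message`, and `code` at every level is one way to reduce false negatives when an SDK wraps the original failure; the retry logic itself is untouched by a helper like this.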
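To make the balanced-brace scanning described for `extractFitnessJsonPayload()` concrete, here is a minimal sketch of the technique. It is not the implementation from this commit: the function names carry a `Sketch` suffix to mark them as hypothetical, and the required field list (just `aggregate` by default) is an assumption, since the commit does not enumerate the required fitness score fields.

```typescript
type FitnessPayloadSketch = Record<string, unknown>;

// Returns the index of the "}" that closes the "{" at `start`, or -1 if unbalanced.
// Braces inside JSON string literals are ignored.
function findMatchingBraceSketch(text: string, start: number): number {
  let depth = 0;
  let inString = false;
  let escaped = false;
  for (let i = start; i < text.length; i++) {
    const ch = text[i];
    if (inString) {
      if (escaped) escaped = false;
      else if (ch === "\\") escaped = true;
      else if (ch === '"') inString = false;
      continue;
    }
    if (ch === '"') inString = true;
    else if (ch === "{") depth++;
    else if (ch === "}") {
      depth--;
      if (depth === 0) return i;
    }
  }
  return -1;
}

// Hypothetical helper: tries every "{" as a candidate start and returns the first
// balanced span that parses as a JSON object containing all required fields.
function extractFitnessJsonPayloadSketch(
  text: string,
  requiredFields: readonly string[] = ["aggregate"], // assumed field name
): FitnessPayloadSketch | null {
  for (let start = text.indexOf("{"); start !== -1; start = text.indexOf("{", start + 1)) {
    const end = findMatchingBraceSketch(text, start);
    if (end === -1) continue; // unbalanced from here; try the next "{"
    try {
      const parsed: unknown = JSON.parse(text.slice(start, end + 1));
      if (typeof parsed === "object" && parsed !== null && !Array.isArray(parsed)) {
        const candidate = parsed as FitnessPayloadSketch;
        if (requiredFields.every((field) => field in candidate)) return candidate;
      }
    } catch {
      // Malformed candidate (for example a leading draft object); keep scanning.
    }
  }
  return null;
}
```

Because the scan restarts at each `{`, prose and code-fence markers around the payload are skipped without special handling, and a malformed leading object does not prevent a later valid one from being found. A caller would fall back to its existing default scoring only when the result is `null`.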