Commit 2714a61
fix(training): repair TOON bullet/indexed outputs + tighten synth prompts
Three quality issues surfaced when running the synthesizers against
gpt-oss-120b on Groq (the canonical first real run, 2375 evaluator
records). Pre-repair conformance was 84% across 5 evaluators.
scripts/transform_repair_toon_bullets.py
Two repair passes:
1. Markdown-bullet repair: `key:\\n - x\\n - y` → `key[2]:\\n - x\\n - y`
gpt-oss emits this for `strengths`/`improvements`/`learnings` in
reflection records (2/3 of failures).
2. Indexed-assign repair: `topics[0]: x\\ntopics[1]: y\\ntopics[2]: z`
→ `topics[3]:\\n - x\\n - y\\n - z`. gpt-oss emits this for
`topics`/`keyPoints` in summarization (all 497 failures pre-repair).
Idempotent. Drops records that don't parse even after repair.
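The two passes above can be sketched roughly as follows. This is a minimal illustration, not the contents of transform_repair_toon_bullets.py: the function names, the regexes, and the two-space bullet indent are all assumptions, and the "drop records that still fail to parse" step is left out because it depends on the project's TOON parser.

```python
import re

# Hypothetical patterns; the real script may match more (or fewer) shapes.
_HEADER = re.compile(r"^(\s*)([A-Za-z_]\w*):\s*$")       # bare `key:` line
_BULLET = re.compile(r"^\s*-\s+(.*)$")                   # `- item` line
_INDEXED = re.compile(r"^(\s*)([A-Za-z_]\w*)\[(\d+)\]:\s+(\S.*)$")  # `key[i]: v`


def repair_bullets(text: str) -> str:
    """Pass 1: rewrite `key:` followed by N bullet lines as `key[N]:`."""
    lines = text.split("\n")
    out, i = [], 0
    while i < len(lines):
        header = _HEADER.match(lines[i])
        if header:
            j = i + 1
            while j < len(lines) and _BULLET.match(lines[j]):
                j += 1
            count = j - i - 1
            if count > 0:
                out.append(f"{header.group(1)}{header.group(2)}[{count}]:")
                out.extend(lines[i + 1:j])  # bullet lines are kept unchanged
                i = j
                continue
        out.append(lines[i])
        i += 1
    return "\n".join(out)


def repair_indexed(text: str) -> str:
    """Pass 2: collapse `key[0]: x`, `key[1]: y`, ... into `key[N]:` + bullets."""
    lines = text.split("\n")
    out, i = [], 0
    while i < len(lines):
        first = _INDEXED.match(lines[i])
        if first and first.group(3) == "0":
            indent, key = first.group(1), first.group(2)
            items, j = [], i
            while j < len(lines):
                m = _INDEXED.match(lines[j])
                # Stop at the first line that breaks the 0, 1, 2, ... run.
                if not m or m.group(2) != key or int(m.group(3)) != len(items):
                    break
                items.append(m.group(4))
                j += 1
            out.append(f"{indent}{key}[{len(items)}]:")
            out.extend(f"{indent}  - {item}" for item in items)
            i = j
            continue
        out.append(lines[i])
        i += 1
    return "\n".join(out)
```

Both functions leave already-repaired text alone, since a `key[N]:` header matches neither the bare-header pattern nor the indexed-assign pattern, which is one way to get the idempotence the commit message describes.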
scripts/synthesize_evaluator_prompts.py
- FACT_EXTRACTION_TEMPLATE: explicit STRICT op vocabulary section that
forbids `op: insert`/`op: add` (gpt-oss emitted these in 24% of
fact_extractor records). Canonical ops are add_durable, add_current,
strengthen, decay, contradict.
- INITIAL_SUMMARIZATION_TEMPLATE: replaced the runtime example
`topics[0]: ... topics[1]: ...` with the canonical TOON array form
`topics[N]:\\n - x\\n - y`; the template now explicitly forbids nested
sub-keys in keyPoints (the source of 47% of summarization failures).
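A check for the op vocabulary named above might look like the sketch below. Only the five canonical op names come from this commit; the function name, the regex, and the assumption that each op appears on its own `op: <value>` line are illustrative.

```python
import re

# The five canonical ops listed in the commit message.
CANONICAL_OPS = frozenset(
    {"add_durable", "add_current", "strengthen", "decay", "contradict"}
)

# Hypothetical: assumes each op sits on its own line, optionally after a `- `.
_OP_LINE = re.compile(r"^\s*(?:-\s+)?op:\s*(\S+)\s*$", re.MULTILINE)


def invalid_ops(record_text: str) -> list[str]:
    """Return every op value in the record that is not in the canonical set."""
    return [op for op in _OP_LINE.findall(record_text) if op not in CANONICAL_OPS]
```

A record for which `invalid_ops` returns an empty list uses only canonical ops; values such as `insert` or `add` are reported back so they can be counted or rejected.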
scripts/audit_pipeline_shapes.py
- validate_reflection now checks the correct fields per the runtime's
reflectionTemplate (thought/quality_score/strengths/improvements/
learnings). It previously checked task_completed, which belongs
to reflection_evaluator. The quality_score validator accepts int, float,
"78", and "78/100" forms.
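As a rough illustration of the four accepted quality_score forms, a lenient parser could be written as below. The function name and the choice to return `None` for anything else are assumptions, and no range check is applied because the commit message does not state one.

```python
import re

# Matches "78", "78.5", "78/100", with optional surrounding whitespace.
_SCORE = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*(?:/\s*100)?\s*$")


def parse_quality_score(value):
    """Return the score as a float, or None if the form is not recognised."""
    if isinstance(value, bool):
        # bool is a subclass of int in Python; treat it as invalid here.
        return None
    if isinstance(value, (int, float)):
        return float(value)
    if isinstance(value, str):
        match = _SCORE.match(value)
        if match:
            return float(match.group(1))
    return None
```

A validator built on this would accept a record when `parse_quality_score(record["quality_score"]) is not None`, where `record` is a hypothetical parsed reflection record.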
scripts/publish_dataset_to_hf.py
Publish allowlist extended to include data/synthesized/evaluators/ and
data/synthesized/phase3/ so the new Phase-4 + Phase-3 records ship
with the next dataset push.
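A directory allowlist of this kind can be sketched as a prefix check. Only the two directory paths come from this commit; the constant name, the function name, and the use of `pathlib` are assumptions, and the real allowlist in publish_dataset_to_hf.py contains other entries not shown here.

```python
from pathlib import PurePosixPath

# Only the two directories added in this commit; existing entries are omitted.
ALLOWED_PREFIXES = (
    PurePosixPath("data/synthesized/evaluators"),
    PurePosixPath("data/synthesized/phase3"),
)


def is_publishable(path: str) -> bool:
    """True if `path` lies somewhere under one of the allowlisted directories."""
    parents = PurePosixPath(path).parents
    return any(prefix in parents for prefix in ALLOWED_PREFIXES)
```

Comparing whole path components rather than string prefixes keeps a sibling such as `data/synthesized/evaluators_old/` from slipping through. The sketch does not normalise `..` segments, so callers would need to pass already-resolved relative paths.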
Conformance after the repair pass on the same 2375 records:
reflection_evaluator: 100.0%
long_term_extraction: 100.0%
reflection (post-repair): 97.7%
fact_extractor (with stricter prompt, re-running): TBD
summarization (with stricter prompt, re-running): TBD
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent 50e940e commit 2714a61
4 files changed
Lines changed: 326 additions & 16 deletions
File tree
- packages/training/scripts
Lines changed: 93 additions & 11 deletions