
[codex] Record 2026-04-13 fleet hotspot stage 1#47

Draft
NeapolitanIcecream wants to merge 4 commits into main from
codex/fleet-day-e2e-hotspot-study-stage1

Conversation

@NeapolitanIcecream

Summary

This PR records the first completed measurement wave for the 2026-04-13 fleet day e2e hotspot study.

It adds the missing DFX and instrumentation needed to compare shadow replays, captures the stage-1 experiment writeup in-repo, and lands the first measured prepare/enrich treatment branch.

What Changed

  • add scripts/compare_shadow_day_runs.py to diff shadow control vs candidate runs and emit delta.json and report.md
  • add translation counters and pipeline.translate.materialize_localized.duration_ms
  • relax arXiv html_document reuse so existing html_document + html_document_md can skip cleanup/pandoc/write on replay
  • add optional metrics_recorder support to export_trend_static_site(...) and wire workflow site-build substep metrics with low-cardinality names
  • add focused coverage for compare math, translation counters, arXiv reuse, and site-build metric callback behavior
  • add docs/plans/2026-04-22-fleet-day-e2e-hotspot-study-stage-1.md with commands, baseline/control/candidate results, and acceptance verdict
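The compare script's core idea can be sketched as follows. This is a minimal illustration, not the actual API of `scripts/compare_shadow_day_runs.py`; the `diff_runs` function, the `metrics.json` filename, and the flat metric layout are all assumptions for the sake of the example:

```python
import json
from pathlib import Path


def diff_runs(control_dir: str, candidate_dir: str, out_dir: str) -> dict:
    """Diff control vs candidate metric totals, emit delta.json and report.md."""
    # Assumed layout: each run directory holds a flat metrics.json of numeric sums.
    control = json.loads(Path(control_dir, "metrics.json").read_text())
    candidate = json.loads(Path(candidate_dir, "metrics.json").read_text())
    delta = {
        key: {
            "control": control[key],
            "candidate": candidate.get(key),
            "delta": candidate.get(key, 0) - control[key],
        }
        for key in control
    }
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    (out / "delta.json").write_text(json.dumps(delta, indent=2))
    lines = ["# Shadow compare report", ""]
    for key, row in sorted(delta.items()):
        lines.append(f"- {key}: {row['control']} -> {row['candidate']} ({row['delta']:+})")
    (out / "report.md").write_text("\n".join(lines) + "\n")
    return delta
```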

Experiment Result

Live baseline (bench-out/e2e-20260413-baseline) showed:

  • fleet wall time: 565.77s
  • aggregate hotspots: ingest 29.80%, translate 22.88%, analyze 15.97%, ideas:day 14.69%, trends:day 13.31%
  • first branch selector: arXiv html_document reuse, not HN fetch reuse
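The aggregate hotspot percentages above are shares of total step time. A minimal sketch of that computation (the `hotspot_shares` helper is illustrative, not code from this PR):

```python
def hotspot_shares(step_ms: dict[str, float]) -> dict[str, float]:
    """Convert per-step duration sums into percentage shares of the total."""
    total = sum(step_ms.values())
    return {step: round(100.0 * ms / total, 2) for step, ms in step_ms.items()}
```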

Shadow control vs first treatment (bench-out/shadow-20260413-control vs bench-out/shadow-20260413-arxiv-html-reuse) showed:

  • fleet wall time: 706.11s -> 701.55s (-4.56s, -0.65%)
  • software_intelligence target mechanism improved:
    • pipeline.enrich.arxiv.html_document.fetch_ms_sum: 59971 -> 51398
    • cleanup_ms_sum: 59238 -> 44513
    • pandoc_ms_sum: 80330 -> 64880
    • ingest: 179433ms -> 169506ms
  • branch verdict: not accepted
    • fleet gain stayed below 8%
    • target dominant step stayed below 15%
    • terminal state behavior stayed unchanged
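The verdict criteria above can be expressed as a small acceptance check. This is a hedged sketch, not the study's actual gate: the function name, the assumption that clearing either bar suffices, and the fixed 8%/15% defaults are all mine, inferred from the rejection reasons listed:

```python
def branch_verdict(
    fleet_gain_pct: float,
    target_step_gain_pct: float,
    terminal_state_changed: bool,
    fleet_bar: float = 8.0,   # minimum fleet wall-time gain to promote
    step_bar: float = 15.0,   # minimum gain on the target dominant step
) -> str:
    """Accept a treatment branch only if it clears a promotion bar
    without altering terminal state behavior."""
    if terminal_state_changed:
        return "rejected: terminal state changed"
    if fleet_gain_pct >= fleet_bar or target_step_gain_pct >= step_bar:
        return "accepted"
    return "not accepted"
```

Under this sketch, the first treatment (0.65% fleet gain, target step well under 15%, terminal state unchanged) lands on "not accepted", matching the recorded verdict.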

Why This Matters

This stage turns the experiment from ad-hoc timing notes into a repeatable shadow-compare workflow with enough metrics to explain wins and reject false positives. The first treatment did reduce the intended arXiv cleanup/pandoc work, but not enough to clear the promotion bar, so the next branch can move on with evidence instead of intuition.

Validation

  • uv run pytest tests/test_compare_shadow_day_runs.py tests/test_translation_runtime_parallelism.py tests/test_recoleta_specs_arxiv_html_document_md.py tests/test_trends_static_site.py -q
  • uv run pytest tests/test_recoleta_specs_run_once_cli.py -q -k 'site_build or trends_static_site or arxiv_html_document or translation_runtime_parallelism or compare_shadow_day_runs'
  • uv run pytest tests/test_recoleta_specs_trends_cli_billing_report.py -q -k site_build
  • uv run ruff check recoleta/site.py recoleta/cli/workflow_steps.py recoleta/pipeline/enrich_stage.py scripts/compare_shadow_day_runs.py tests/test_trends_static_site.py tests/test_recoleta_specs_arxiv_html_document_md.py tests/test_translation_runtime_parallelism.py tests/test_compare_shadow_day_runs.py

