
[codex] Record 2026-04-13 fleet hotspot stage 1#47

Draft
NeapolitanIcecream wants to merge 4 commits into main from
codex/fleet-day-e2e-hotspot-study-stage1

Conversation

@NeapolitanIcecream

Summary

This PR records the first completed measurement wave for the 2026-04-13 fleet day e2e hotspot study.

It adds the missing DFX and instrumentation needed to compare shadow replays, captures the stage-1 experiment writeup in-repo, and lands the first measured prepare/enrich treatment branch.

What Changed

  • add scripts/compare_shadow_day_runs.py to diff shadow control vs candidate runs and emit delta.json and report.md
  • add translation counters and pipeline.translate.materialize_localized.duration_ms
  • relax arXiv html_document reuse so existing html_document + html_document_md can skip cleanup/pandoc/write on replay
  • add optional metrics_recorder support to export_trend_static_site(...) and wire workflow site-build substep metrics with low-cardinality names
  • add focused coverage for compare math, translation counters, arXiv reuse, and site-build metric callback behavior
  • add docs/plans/2026-04-22-fleet-day-e2e-hotspot-study-stage-1.md with commands, baseline/control/candidate results, and acceptance verdict
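The compare script's core idea can be sketched as follows. This is a minimal illustration, not the actual API of `scripts/compare_shadow_day_runs.py`; the `diff_runs` function, the `metrics.json` filename, and the flat metric layout are all assumptions for the sake of the example:

```python
import json
from pathlib import Path


def diff_runs(control_dir: str, candidate_dir: str, out_dir: str) -> dict:
    """Diff control vs candidate metric totals, emit delta.json and report.md."""
    # Assumed layout: each run directory holds a flat metrics.json of numeric sums.
    control = json.loads(Path(control_dir, "metrics.json").read_text())
    candidate = json.loads(Path(candidate_dir, "metrics.json").read_text())
    delta = {
        key: {
            "control": control[key],
            "candidate": candidate.get(key),
            "delta": candidate.get(key, 0) - control[key],
        }
        for key in control
    }
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    (out / "delta.json").write_text(json.dumps(delta, indent=2))
    lines = ["# Shadow compare report", ""]
    for key, row in sorted(delta.items()):
        lines.append(f"- {key}: {row['control']} -> {row['candidate']} ({row['delta']:+})")
    (out / "report.md").write_text("\n".join(lines) + "\n")
    return delta
```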

Experiment Result

Live baseline (bench-out/e2e-20260413-baseline) showed:

  • fleet wall time: 565.77s
  • aggregate hotspots: ingest 29.80%, translate 22.88%, analyze 15.97%, ideas:day 14.69%, trends:day 13.31%
  • first branch selector: arXiv html_document reuse, not HN fetch reuse
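The aggregate hotspot percentages above are shares of total step time. A minimal sketch of that computation (the `hotspot_shares` helper is illustrative, not code from this PR):

```python
def hotspot_shares(step_ms: dict[str, float]) -> dict[str, float]:
    """Convert per-step duration sums into percentage shares of the total."""
    total = sum(step_ms.values())
    return {step: round(100.0 * ms / total, 2) for step, ms in step_ms.items()}
```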

Shadow control vs first treatment (bench-out/shadow-20260413-control vs bench-out/shadow-20260413-arxiv-html-reuse) showed:

  • fleet wall time: 706.11s -> 701.55s (-4.56s, -0.65%)
  • software_intelligence target mechanism improved:
    • pipeline.enrich.arxiv.html_document.fetch_ms_sum: 59971 -> 51398
    • cleanup_ms_sum: 59238 -> 44513
    • pandoc_ms_sum: 80330 -> 64880
    • ingest: 179433ms -> 169506ms
  • branch verdict: not accepted
    • fleet gain stayed below 8%
    • target dominant step stayed below 15%
    • terminal state behavior stayed unchanged
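The verdict criteria above can be expressed as a small acceptance check. This is a hedged sketch, not the study's actual gate: the function name, the assumption that clearing either bar suffices, and the fixed 8%/15% defaults are all mine, inferred from the rejection reasons listed:

```python
def branch_verdict(
    fleet_gain_pct: float,
    target_step_gain_pct: float,
    terminal_state_changed: bool,
    fleet_bar: float = 8.0,   # minimum fleet wall-time gain to promote
    step_bar: float = 15.0,   # minimum gain on the target dominant step
) -> str:
    """Accept a treatment branch only if it clears a promotion bar
    without altering terminal state behavior."""
    if terminal_state_changed:
        return "rejected: terminal state changed"
    if fleet_gain_pct >= fleet_bar or target_step_gain_pct >= step_bar:
        return "accepted"
    return "not accepted"
```

Under this sketch, the first treatment (0.65% fleet gain, target step well under 15%, terminal state unchanged) lands on "not accepted", matching the recorded verdict.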

Why This Matters

This stage turns the experiment from ad-hoc timing notes into a repeatable shadow-compare workflow with enough metrics to explain wins and reject false positives. The first treatment did reduce the intended arXiv cleanup/pandoc work, but not enough to clear the promotion bar, so the next branch can move on with evidence instead of intuition.

Validation

  • uv run pytest tests/test_compare_shadow_day_runs.py tests/test_translation_runtime_parallelism.py tests/test_recoleta_specs_arxiv_html_document_md.py tests/test_trends_static_site.py -q
  • uv run pytest tests/test_recoleta_specs_run_once_cli.py -q -k 'site_build or trends_static_site or arxiv_html_document or translation_runtime_parallelism or compare_shadow_day_runs'
  • uv run pytest tests/test_recoleta_specs_trends_cli_billing_report.py -q -k site_build
  • uv run ruff check recoleta/site.py recoleta/cli/workflow_steps.py recoleta/pipeline/enrich_stage.py scripts/compare_shadow_day_runs.py tests/test_trends_static_site.py tests/test_recoleta_specs_arxiv_html_document_md.py tests/test_translation_runtime_parallelism.py tests/test_compare_shadow_day_runs.py

