[codex] Record 2026-04-13 fleet hotspot stage 1 #47
Draft
NeapolitanIcecream wants to merge 4 commits into main from
Summary
This PR records the first completed measurement wave for the 2026-04-13 fleet day e2e hotspot study.
It adds the missing DFX and instrumentation needed to compare shadow replays, captures the stage-1 experiment writeup in-repo, and lands the first measured prepare/enrich treatment branch.
What Changed
- `scripts/compare_shadow_day_runs.py` to diff shadow control vs candidate runs and emit `delta.json` and `report.md`
- `pipeline.translate.materialize_localized.duration_ms`
- `html_document` reuse so existing `html_document` + `html_document_md` can skip cleanup/pandoc/write on replay
- `metrics_recorder` support to `export_trend_static_site(...)` and wire workflow site-build substep metrics with low-cardinality names
- `docs/plans/2026-04-22-fleet-day-e2e-hotspot-study-stage-1.md` with commands, baseline/control/candidate results, and acceptance verdict

Experiment Result
Live baseline (`bench-out/e2e-20260413-baseline`) showed:
- `565.77s`
- `ingest 29.80%`, `translate 22.88%`, `analyze 15.97%`, `ideas:day 14.69%`, `trends:day 13.31%`
- `html_document` reuse, not HN fetch reuse

Shadow control vs first treatment (`bench-out/shadow-20260413-control` vs `bench-out/shadow-20260413-arxiv-html-reuse`) showed:
- `706.11s -> 701.55s` (`-4.56s`, `0.65%`)
- `software_intelligence`
- target mechanism improved:
  - `pipeline.enrich.arxiv.html_document.fetch_ms_sum`: `59971 -> 51398`
  - `cleanup_ms_sum`: `59238 -> 44513`
  - `pandoc_ms_sum`: `80330 -> 64880`
- `ingest`: `179433ms -> 169506ms`
- `8%`
- `15%`

Why This Matters
This stage turns the experiment from ad-hoc timing notes into a repeatable shadow-compare workflow with enough metrics to explain wins and reject false positives. The first treatment did reduce the intended arXiv cleanup/pandoc work, but not enough to clear the promotion bar, so the next branch can move on with evidence instead of intuition.
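The control-vs-candidate arithmetic behind the figures under Experiment Result can be reproduced with a short sketch. This is only an illustration, not the contents of `scripts/compare_shadow_day_runs.py`: the helper name `metric_delta`, the dictionary keys, and the output layout are assumptions, and the input numbers are copied from this description.

```python
# Hypothetical sketch; names and layout are assumptions, not the real script.
import json


def metric_delta(control: float, candidate: float) -> dict:
    """Return the absolute and relative change from control to candidate."""
    delta = candidate - control
    # Guard against a zero control value to avoid division by zero.
    pct = (delta / control * 100.0) if control else 0.0
    return {
        "control": control,
        "candidate": candidate,
        "delta": round(delta, 2),
        "pct": round(pct, 2),
    }


# (control, candidate) pairs copied from the Experiment Result section.
reported = {
    "total_s": (706.11, 701.55),
    "html_document.fetch_ms_sum": (59971, 51398),
    "cleanup_ms_sum": (59238, 44513),
    "pandoc_ms_sum": (80330, 64880),
    "ingest_ms": (179433, 169506),
}

deltas = {name: metric_delta(c, t) for name, (c, t) in reported.items()}
print(json.dumps(deltas, indent=2))
```

Negative values mean the candidate was faster than control. For the total, this gives a delta of `-4.56` seconds and `-0.65` percent, matching the reported end-to-end change.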
Validation
- `uv run pytest tests/test_compare_shadow_day_runs.py tests/test_translation_runtime_parallelism.py tests/test_recoleta_specs_arxiv_html_document_md.py tests/test_trends_static_site.py -q`
- `uv run pytest tests/test_recoleta_specs_run_once_cli.py -q -k 'site_build or trends_static_site or arxiv_html_document or translation_runtime_parallelism or compare_shadow_day_runs'`
- `uv run pytest tests/test_recoleta_specs_trends_cli_billing_report.py -q -k site_build`
- `uv run ruff check recoleta/site.py recoleta/cli/workflow_steps.py recoleta/pipeline/enrich_stage.py scripts/compare_shadow_day_runs.py tests/test_trends_static_site.py tests/test_recoleta_specs_arxiv_html_document_md.py tests/test_translation_runtime_parallelism.py tests/test_compare_shadow_day_runs.py`