- Goal: compare observed-only GRP deployment rules on top of the same retuned GRP fit, and pick a final slide-ready procedure.
- Primary metric(s): retrospective `Regret@1`, predicted BPB realism, and deployment movement measured by mean phase TV.
- Constraints: keep the nonlinear retuning procedure fixed; vary only deployment.
- Date: 2026-04-02
- Code refs:
  - `/Users/calvinxu/Projects/Work/Marin/marin/experiments/domain_phase_mix/exploratory/two_phase_many/benchmark_grp_retuned.py`
  - `/Users/calvinxu/Projects/Work/Marin/marin/experiments/domain_phase_mix/exploratory/two_phase_many/benchmark_grp_observed_hull.py`
- Baseline numbers:
  - raw retuned optimum reaches retrospective `Regret@1 = 0` from `k >= 80`, but predicted optima are unrealistically optimistic (`~1.029`) and the mixture moves a lot.
  - observed-only full hull keeps `Regret@1 = 0` from `k >= 80`, with more realistic predicted BPB (`~1.065`) and lower movement.
- Hypothesis: restricting deployment to observed-run mixtures should preserve retrospective choice quality while stabilizing the optimum and making predicted BPBs more realistic.
- Command:
  - `uv run python /Users/calvinxu/Projects/Work/Marin/marin/experiments/domain_phase_mix/exploratory/two_phase_many/benchmark_grp_deployment_variants.py`
- Config:
  - same per-subset GRP retuning as `benchmark_grp_retuned.py`
  - deployment variants:
    - best predicted observed run
    - convex hull of top-4 predicted observed runs
    - convex hull of top-8 predicted observed runs
    - convex hull of top-16 predicted observed runs
    - convex hull of all observed runs
- Result:
  - all variants have the same retrospective `Regret@1` profile: misses only at `k=40,60`, then zero from `k >= 80`
  - after `k >= 80`, movement / predicted-value summary:
    - `top1_observed`: mean move `0.256`, mean predicted `1.0792`
    - `top4_hull`: mean move `0.147`, mean predicted `1.0739`
    - `top8_hull`: mean move `0.195`, mean predicted `1.0702`
    - `top16_hull`: mean move `0.127`, mean predicted `1.0666`
    - `all_observed_hull`: mean move `0.120`, mean predicted `1.0653`
- Interpretation:
  - the full observed-run hull is the cleanest local rule among those tried
  - it matches the other variants on retrospective choice quality, but gives the lowest movement and the most realistic predicted BPB
- Next action:
  - use the observed-only full hull as the documented deployment rule in the slides
  - if needed later, validate a few observed-hull subset deployments by training them
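The two headline metrics in this entry can be sketched as follows. This is a minimal illustration, not the benchmark's implementation: it assumes that "mean phase TV" is the total-variation distance between two mixtures averaged over phases, and that `Regret@1` is the actual BPB of the run the model ranks first minus the best actual BPB among the candidates. The function names and all numbers below are hypothetical.

```python
def mean_phase_tv(p, q):
    """Total-variation distance between two mixtures, averaged over phases.

    p, q: lists of phases, each phase a list of domain weights summing to 1.
    (Assumed definition; the log does not spell it out.)
    """
    per_phase = [
        0.5 * sum(abs(a - b) for a, b in zip(phase_p, phase_q))
        for phase_p, phase_q in zip(p, q)
    ]
    return sum(per_phase) / len(per_phase)


def regret_at_1(predicted_bpb, actual_bpb):
    """Actual BPB of the top-ranked run minus the best actual BPB (assumed definition)."""
    chosen = min(range(len(predicted_bpb)), key=lambda i: predicted_bpb[i])
    return actual_bpb[chosen] - min(actual_bpb)


# Toy two-phase, two-domain mixtures (illustrative values only).
previous_mix = [[0.5, 0.5], [1.0, 0.0]]
current_mix = [[0.25, 0.75], [0.5, 0.5]]
move = mean_phase_tv(previous_mix, current_mix)

# Toy predicted / actual BPBs for three candidate runs (illustrative values only).
regret = regret_at_1([1.08, 1.07, 1.09], [1.10, 1.09, 1.08])
```

Under these assumptions, a `Regret@1` of zero means the model's top-ranked run is also the best run in hindsight, and a smaller `move` means the deployed mixture shifts less between consecutive fits.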
- Hypothesis: a hull over the top actual observed runs should retain the good behavior of the full observed hull, while being cleaner and slightly more local.
- Command:
  - `uv run python /Users/calvinxu/Projects/Work/Marin/marin/experiments/domain_phase_mix/exploratory/two_phase_many/benchmark_grp_deployment_variants.py`
  - `uv run python -m marin.run.iris_run --config /Users/calvinxu/Projects/Work/Marin/marin/lib/iris/examples/marin.yaml -- --no-wait --job-name dm-genericfamily-top8actual-hull-subset-optima-$(date +%Y%m%d-%H%M%S) --region us-east5 --zone us-east5-a --cpu 2 --memory 16GB --disk 20GB --extra marin:tpu --extra marin:eval -- python -m experiments.domain_phase_mix.launch_two_phase_many_genericfamily_top8actual_hull_subset_optima --tpu-type v5p-8 --max-concurrent 4`
- Config:
  - added these observed-only variants:
    - `top4_actual_hull`
    - `top8_actual_hull`
    - `top16_actual_hull`
    - `all_hull_disp0.01`, `all_hull_disp0.02`
    - `all_hull_to_bestactual0.02`, `all_hull_to_bestactual0.05`
- Result:
  - after `k >= 80`:
    - `all_observed_hull`: mean predicted `1.0653`, mean move `0.1200`, mean support `8.75`
    - `top8_actual_hull`: mean predicted `1.0661`, mean move `0.1100`, mean support `6.75`
    - `top16_actual_hull`: mean predicted `1.0660`, mean move `0.1096`, mean support `9.13`
  - dispersion / locality penalties reduced nearest-TV distance but generally worsened predicted BPB and often collapsed toward a single observed run
  - launched representative subset validation for `top8_actual_hull`:
    - parent job `/calvinxu/dm-genericfamily-top8actual-hull-subset-optima-20260402-181147`
- Interpretation:
  - `top8_actual_hull` is the best of the cleaner policies: essentially tied with `top16_actual_hull`, but simpler and sparser
  - the full observed hull still wins slightly on pure predicted BPB (`1.0653` vs `1.0661`), but the difference is negligible
- Next action:
  - validate representative `top8_actual_hull` subset runs first
  - only run an all-subset sweep if the representative validations look materially better than the raw retuned subset optima
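The structure of a top-k hull deployment can be sketched as below. This is an illustration under stated assumptions, not the code in `benchmark_grp_deployment_variants.py`: it assumes a deployed mixture is a convex combination of the selected observed-run mixtures, and that "support" counts the observed runs receiving non-negligible hull weight. How the hull weights are optimized is not shown, and whether the `_actual_` variants rank runs by actual or predicted BPB is not stated in the log, so the ranking scores are passed in generically. All names and numbers are hypothetical.

```python
def top_k_runs(scores, k):
    """Indices of the k observed runs with the lowest score (e.g. BPB), best first."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    return order[:k]


def hull_mixture(run_mixtures, weights):
    """Convex combination of observed-run mixtures.

    run_mixtures: list of runs; each run is a list of phases; each phase is a
                  list of domain weights.
    weights: nonnegative hull weights over the runs, summing to 1.
    """
    if any(w < 0 for w in weights) or abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("hull weights must be nonnegative and sum to 1")
    n_phases = len(run_mixtures[0])
    n_domains = len(run_mixtures[0][0])
    return [
        [
            sum(w * run[phase][domain] for w, run in zip(weights, run_mixtures))
            for domain in range(n_domains)
        ]
        for phase in range(n_phases)
    ]


def support_size(weights, tol=1e-6):
    """Number of observed runs with hull weight above tol (assumed meaning of 'support')."""
    return sum(1 for w in weights if w > tol)


# Three toy observed runs, two phases, two domains (illustrative values only).
runs = [
    [[1.0, 0.0], [0.5, 0.5]],
    [[0.0, 1.0], [0.5, 0.5]],
    [[0.5, 0.5], [1.0, 0.0]],
]
scores = [1.08, 1.07, 1.09]

selected = top_k_runs(scores, k=2)            # restrict the hull to the top-2 runs
hull_weights = [0.75, 0.25]                   # placeholder weights; normally optimized
deployed = hull_mixture([runs[i] for i in selected], hull_weights)
sparsity = support_size(hull_weights)
```

Because every deployed mixture is an average of mixtures that were actually trained, the rule cannot extrapolate outside the observed region, which is consistent with the more realistic predicted BPBs reported above.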