marin-community · Calvin-Xu · Jan 14, 2026 · Jan 14, 2026 · Jan 17, 2026 · Jan 19, 2026
diff --git a/.agents/logbooks/grp_deployment_variants.md b/.agents/logbooks/grp_deployment_variants.md
@@ -0,0 +1,70 @@
+# GRP Deployment Variants: Research Logbook
+
+## Scope
+- Goal: compare observed-only GRP deployment rules on top of the same retuned GRP fit, and pick a final slide-ready procedure.
+- Primary metric(s): retrospective `Regret@1`, realism of the predicted bits-per-byte (BPB), and deployment movement measured by mean phase total variation (TV) distance.
+- Constraints: keep the nonlinear retuning procedure fixed; vary only deployment.
+
+## Baseline
+- Date: 2026-04-02
+- Code refs:
+  - `/Users/calvinxu/Projects/Work/Marin/marin/experiments/domain_phase_mix/exploratory/two_phase_many/benchmark_grp_retuned.py`
+  - `/Users/calvinxu/Projects/Work/Marin/marin/experiments/domain_phase_mix/exploratory/two_phase_many/benchmark_grp_observed_hull.py`
+- Baseline numbers:
+  - raw retuned optimum reaches retrospective `Regret@1 = 0` from `k >= 80`, but its predicted BPB is unrealistically optimistic (`~1.029`) and the optimal mixture moves a lot.
+  - observed-only full hull keeps `Regret@1 = 0` from `k >= 80`, with more realistic predicted BPB (`~1.065`) and lower movement.
+
+## Experiment Log
+### 2026-04-02 17:31 - Observed-only deployment variants
+- Hypothesis: restricting deployment to observed-run mixtures should preserve retrospective choice quality while stabilizing the optimum and making predicted BPBs more realistic.
+- Command:
+  - `uv run python /Users/calvinxu/Projects/Work/Marin/marin/experiments/domain_phase_mix/exploratory/two_phase_many/benchmark_grp_deployment_variants.py`
+- Config:
+  - same per-subset GRP retuning as `benchmark_grp_retuned.py`
+  - deployment variants:
+    - best predicted observed run
+    - convex hull of top-4 predicted observed runs
+    - convex hull of top-8 predicted observed runs
+    - convex hull of top-16 predicted observed runs
+    - convex hull of all observed runs
+- Result:
+  - all variants have the same retrospective `Regret@1` profile: misses only at `k=40,60`, then zero from `k >= 80`
+  - after `k >= 80`, movement / predicted-value summary:
+    - `top1_observed`: mean move `0.256`, mean predicted `1.0792`
+    - `top4_hull`: mean move `0.147`, mean predicted `1.0739`
+    - `top8_hull`: mean move `0.195`, mean predicted `1.0702`
+    - `top16_hull`: mean move `0.127`, mean predicted `1.0666`
+    - `all_observed_hull`: mean move `0.120`, mean predicted `1.0653`
+- Interpretation:
+  - the full observed-run hull is the cleanest local rule among those tried
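+  - it matches the other variants on retrospective choice quality, but gives the lowest mean movement (`0.120`) and the lowest mean predicted BPB (`1.0653`) of the observed-only variants, still well above the unrealistic raw retuned `~1.029`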
+- Next action:
+  - use the observed-only full hull as the documented deployment rule in the slides
+  - if needed later, validate a few observed-hull subset deployments by training them
+
+### 2026-04-02 18:12 - More deployment regularizers and representative validation launch
+- Hypothesis: a hull over the top actual observed runs should retain the good behavior of the full observed hull, while being cleaner and slightly more local.
+- Command:
+  - `uv run python /Users/calvinxu/Projects/Work/Marin/marin/experiments/domain_phase_mix/exploratory/two_phase_many/benchmark_grp_deployment_variants.py`
+  - `uv run python -m marin.run.iris_run --config /Users/calvinxu/Projects/Work/Marin/marin/lib/iris/examples/marin.yaml -- --no-wait --job-name dm-genericfamily-top8actual-hull-subset-optima-$(date +%Y%m%d-%H%M%S) --region us-east5 --zone us-east5-a --cpu 2 --memory 16GB --disk 20GB --extra marin:tpu --extra marin:eval -- python -m experiments.domain_phase_mix.launch_two_phase_many_genericfamily_top8actual_hull_subset_optima --tpu-type v5p-8 --max-concurrent 4`
+- Config:
+  - added these observed-only variants:
+    - `top4_actual_hull`
+    - `top8_actual_hull`
+    - `top16_actual_hull`
+    - `all_hull_disp0.01`, `all_hull_disp0.02`
+    - `all_hull_to_bestactual0.02`, `all_hull_to_bestactual0.05`
+- Result:
+  - after `k >= 80`:
+    - `all_observed_hull`: mean predicted `1.0653`, mean move `0.1200`, mean support `8.75`
+    - `top8_actual_hull`: mean predicted `1.0661`, mean move `0.1100`, mean support `6.75`
+    - `top16_actual_hull`: mean predicted `1.0660`, mean move `0.1096`, mean support `9.13`
+  - dispersion / locality penalties reduced nearest-TV distance but generally worsened predicted BPB and often collapsed toward a single observed run
+  - launched representative subset validation for `top8_actual_hull`:
+    - parent job `/calvinxu/dm-genericfamily-top8actual-hull-subset-optima-20260402-181147`
+- Interpretation:
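+  - `top8_actual_hull` is the best of the cleaner policies: essentially tied with `top16_actual_hull` on predicted BPB (`1.0661` vs `1.0660`) and movement (`0.1100` vs `0.1096`), but sparser (mean support `6.75` vs `9.13`)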
+  - the full observed hull still wins slightly on pure predicted BPB (`1.0653` vs `1.0661`), but the difference is negligible
+- Next action:
+  - validate representative `top8_actual_hull` subset runs first
+  - only run an all-subset sweep if the representative validations look materially better than the raw retuned subset optima
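
The two metrics this logbook relies on can be sketched as below. This is a minimal illustration, not the code in `benchmark_grp_deployment_variants.py`: the function names, the `(phase, domain)` list layout, and the exact definitions (TV as half the L1 distance averaged over phases; `Regret@1` as the actual-BPB gap between the top-predicted run and the best actual run) are assumptions.

```python
def total_variation(p, q):
    """TV distance between two mixtures over the same domains (half the L1 distance)."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def mean_phase_tv(mix_a, mix_b):
    """Average the per-phase TV distance; each argument is a list of per-phase mixtures."""
    tvs = [total_variation(p, q) for p, q in zip(mix_a, mix_b)]
    return sum(tvs) / len(tvs)


def regret_at_1(predicted_bpb, actual_bpb):
    """Actual BPB of the run ranked first by prediction, minus the best actual BPB."""
    chosen = min(range(len(predicted_bpb)), key=predicted_bpb.__getitem__)
    return actual_bpb[chosen] - min(actual_bpb)


# Hypothetical two-phase, three-domain mixtures (not taken from the logbook).
prev_mix = [[0.5, 0.5, 0.0], [0.2, 0.3, 0.5]]
curr_mix = [[0.25, 0.5, 0.25], [0.2, 0.3, 0.5]]
move = mean_phase_tv(prev_mix, curr_mix)  # phase 0 moves by 0.25, phase 1 by 0.0

# Hypothetical predictions: the top-predicted run (index 1) is also the best actual run.
regret = regret_at_1([1.07, 1.06, 1.08], [1.09, 1.08, 1.10])
```

Under these definitions, a "mean move" like those reported above would be `mean_phase_tv` between the deployed mixtures at consecutive `k`, averaged over `k >= 80`; that pairing is also an assumption.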
diff --git a/.agents/logbooks/offline_rl_v5.md b/.agents/logbooks/offline_rl_v5.md
@@ -0,0 +1,70 @@
+# Offline RL v5: Research Logbook
+
+## Scope
+- Goal: find a new three-phase offline-control method that beats the current legacy `outcome_planner` under the pooled dense v4 evaluation setup.
+- Primary metric(s): `fqe_value_mean`, `dr_value_mean`, fold win counts against `legacy_outcome_planner`.
+- Constraints: keep evaluation on the existing three-phase target folds; do not justify rollout unless the method also beats `fixed_best_schedule`.
+
+## Baseline
+- Date: 2026-03-14
+- Code refs:
+  - `experiments/domain_phase_mix/offline_rl/train_offline_policy_bench.py`
+  - `experiments/domain_phase_mix/offline_rl/train_three_phase_policy_bench_v4.py`
+- Baseline numbers on the pooled dense v4 folds:
+  - `legacy_outcome_planner`: `fqe_value_mean = 4.0709`, `dr_value_mean = 3.1376`
+  - `fixed_best_schedule`: `fqe_value_mean = 4.1268`, `dr_value_mean = 4.3350`
+
+## Experiment Log
+### 2026-03-14 16:00 - pooled direct/hybrid follow-up
+- Hypothesis: the old planner's advantage came from direct final-objective scoring, while v3/v4 Q-only variants improved support behavior but lost ranking power. A pooled direct planner on dense features should beat the legacy planner, and a hybrid `Q + direct` planner may also help.
+- Command:
+  - `uv run python /tmp/eval_v5_candidates.py`
+- Config:
+  - dataset: `/Users/calvinxu/Projects/Coursework/CS234/Project/RL_Bench/offline_rl_v4_three_phase_target_pooled_aux_20260312/dataset_v4`
+  - candidates:
+    - `dense_direct_v5`: pooled dense direct planner with `reward_bonus_weight = 0.08`, `support_lambda = 0.02`
+    - `hybrid_q_direct_v5`: pooled dynamic-Q plus direct-utility hybrid with `direct_alpha = 2.0`, `support_lambda = 0.05`
+- Result:
+  - `dense_direct_v5`
+    - `fqe_value_mean = 4.0860`
+    - `dr_value_mean = 4.1887`
+    - `beat_legacy_fqe_folds = 5/5`
+    - `beat_legacy_fqe_and_dr_folds = 4/5`
+  - `hybrid_q_direct_v5`
+    - `fqe_value_mean = 4.0870`
+    - `dr_value_mean = 4.0522`
+    - `beat_legacy_fqe_folds = 5/5`
+    - `beat_legacy_fqe_and_dr_folds = 3/5`
+- Interpretation:
+  - `dense_direct_v5` is the best new offline method so far.
+  - The direct objective model was the right inductive bias to recover; pure Q-style models were too noisy on this action-sparse dataset.
+  - Neither v5 method beats `fixed_best_schedule`, so rollout remains unjustified.
+- Next action:
+  - keep `dense_direct_v5` as the new offline baseline and focus future work on closing the gap to `fixed_best_schedule`.
+
+### 2026-03-14 17:10 - v6/v7/v8 follow-ups against fixed schedule
+- Hypothesis:
+  - v6: the remaining gap comes from overusing StarCoder in phase 0; cap phase 0 and train only on three-phase targets.
+  - v7: preserve the best historical prefix and only adapt later phases.
+  - v8: use a conservative phase-2 adapter on top of the fixed best schedule.
+- Command:
+  - `uv run python /tmp/eval_v6_candidates.py`
+  - `uv run python /tmp/eval_v7_fixed_prefix.py`
+  - `uv run python /tmp/eval_v8_conservative.py`
+  - `uv run python /tmp/eval_v8_conservative_hi.py`
+- Result:
+  - best v6 candidate: `three_only_capped_hybrid_v6`
+    - `fqe_value_mean = 4.1002`
+    - `dr_value_mean = 4.1781`
+  - best v7 candidate: `fixed_phase0_plus_hybrid_v7`
+    - `fqe_value_mean = 4.1026`
+    - `dr_value_mean = 4.1869`
+  - best v8 conservative candidate: `conservative_phase2_direct_m0.10`
+    - `fqe_value_mean = 4.1201`
+    - `dr_value_mean = 4.4583`
+- Interpretation:
+  - phase-0 control was part of the issue, but fixing it alone was not enough.
+  - the conservative phase-2 adapter gets very close to the fixed schedule and even exceeds it on mean DR (`4.4583` vs `4.3350`), but still misses on mean FQE (`4.1201` vs `4.1268`).
+  - none of these variants beats `fixed_best_schedule` on both FQE and DR, so none is rollout-ready.
+- Next action:
+  - keep `dense_direct_v5` as the best replacement for the legacy planner, and treat conservative/fixed-prefix adapters as promising but still incomplete branches.
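
The rollout gate stated in the Scope section can be checked directly against the means recorded above. This is a minimal sketch: the metric values are copied from the log, while the dictionary layout, the `beats` helper, and the reading of "beats" as strictly higher on both mean FQE and mean DR are assumptions. The log's per-fold win counts are not modeled here.

```python
# Mean FQE / DR values copied from the logbook entries above.
BASELINES = {
    "legacy_outcome_planner": {"fqe": 4.0709, "dr": 3.1376},
    "fixed_best_schedule": {"fqe": 4.1268, "dr": 4.3350},
}
CANDIDATES = {
    "dense_direct_v5": {"fqe": 4.0860, "dr": 4.1887},
    "hybrid_q_direct_v5": {"fqe": 4.0870, "dr": 4.0522},
    "three_only_capped_hybrid_v6": {"fqe": 4.1002, "dr": 4.1781},
    "fixed_phase0_plus_hybrid_v7": {"fqe": 4.1026, "dr": 4.1869},
    "conservative_phase2_direct_m0.10": {"fqe": 4.1201, "dr": 4.4583},
}


def beats(candidate, reference):
    """Assumed criterion: strictly higher on both mean FQE and mean DR."""
    return candidate["fqe"] > reference["fqe"] and candidate["dr"] > reference["dr"]


beats_legacy = [
    name for name, m in CANDIDATES.items()
    if beats(m, BASELINES["legacy_outcome_planner"])
]
rollout_ready = [
    name for name, m in CANDIDATES.items()
    if beats(m, BASELINES["fixed_best_schedule"])
]

print("beats legacy on both means:", beats_legacy)
print("rollout-ready:", rollout_ready)
```

On these numbers every candidate clears the legacy planner on both means, and none clears `fixed_best_schedule`, which matches the log's conclusion that no variant is rollout-ready.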