Draft
209 commits
fadf8b4
Olmo3 (3B), RegMix 1M, 60M (1B) Swarm Test
Calvin-Xu Jan 14, 2026
2475d75
lr sweep
Calvin-Xu Jan 14, 2026
a0403c1
Merge branch 'main' of https://github.com/marin-community/marin into …
Calvin-Xu Jan 17, 2026
40665a1
initial ver
Calvin-Xu Jan 19, 2026
58197da
Merge branch 'main' of https://github.com/marin-community/marin into …
Calvin-Xu Jan 19, 2026
4610692
Merge branch 'main' of https://github.com/marin-community/marin into …
Calvin-Xu Jan 19, 2026
0219fc3
refactor experiment setup
Calvin-Xu Jan 20, 2026
9fb3ed1
revamp analysis as step
Calvin-Xu Jan 20, 2026
03ff0ed
rename dir
Calvin-Xu Jan 20, 2026
d9b5b69
tweak
Calvin-Xu Jan 20, 2026
c2dd95d
use dolmino for midtrain
Calvin-Xu Jan 20, 2026
fa8be5e
pass chat template
Calvin-Xu Jan 20, 2026
a1107ba
better names
Calvin-Xu Jan 20, 2026
ffb69bd
fix natural proportions
Calvin-Xu Jan 20, 2026
46c07bd
add analyze
Calvin-Xu Jan 20, 2026
a5d74d1
actually count tokens
Calvin-Xu Jan 20, 2026
f441493
tweaks
Calvin-Xu Jan 20, 2026
dd3770e
Add Dolma 3 Pool, Domino Pool
Calvin-Xu Jan 20, 2026
9ec84da
fix optimizer, misc
Calvin-Xu Jan 21, 2026
b487f49
Merge branch 'main' of https://github.com/marin-community/marin into …
Calvin-Xu Jan 21, 2026
37f9c87
Merge branch 'main' of https://github.com/marin-community/marin into …
Calvin-Xu Jan 21, 2026
9f3ca53
tweak domain ratios
Calvin-Xu Jan 21, 2026
9733268
weight sampling tweaks
Calvin-Xu Jan 21, 2026
4e07498
fix
Calvin-Xu Jan 21, 2026
7628afe
fix mixture logging stages
Calvin-Xu Jan 21, 2026
445d834
executor parallelism cap
Calvin-Xu Jan 21, 2026
167918d
Merge branch 'main' of https://github.com/marin-community/marin into …
Calvin-Xu Jan 21, 2026
e82c930
Merge branch 'main' of https://github.com/marin-community/marin into …
Calvin-Xu Jan 21, 2026
29cf94d
Allow baseline runs
Calvin-Xu Jan 23, 2026
095a139
more baselines
Calvin-Xu Jan 23, 2026
c6fd3e9
Merge branch 'main' of https://github.com/marin-community/marin into …
Calvin-Xu Jan 23, 2026
aa8b93e
bad tweak
Calvin-Xu Jan 23, 2026
5e7de36
artifacts
Calvin-Xu Jan 24, 2026
92a9e77
plots
Calvin-Xu Jan 25, 2026
97bef18
new Dolma 3 split
Calvin-Xu Jan 25, 2026
0fd56f5
Merge branch 'main' of https://github.com/marin-community/marin into …
Calvin-Xu Jan 25, 2026
afd6e25
fit on task bpb artifacts
Calvin-Xu Jan 26, 2026
a074f81
fix
Calvin-Xu Jan 26, 2026
32ac0f5
artifacts
Calvin-Xu Jan 26, 2026
07ddba3
Merge branch 'main' of https://github.com/marin-community/marin into …
Calvin-Xu Jan 26, 2026
2856e32
HF & zephyr file validation logging
Calvin-Xu Jan 26, 2026
0d25de6
tweak
Calvin-Xu Jan 26, 2026
1b7c8c8
Merge branch 'main' of https://github.com/marin-community/marin into …
Calvin-Xu Jan 26, 2026
33e3288
Merge branch 'main' of https://github.com/marin-community/marin into …
Calvin-Xu Jan 27, 2026
56ec24e
Add CC only
Calvin-Xu Jan 29, 2026
fe49895
Merge branch 'main' of https://github.com/marin-community/marin into …
Calvin-Xu Jan 29, 2026
b73e988
two-phase replay experiment
Calvin-Xu Jan 29, 2026
148fa46
tweak
Calvin-Xu Jan 29, 2026
a88f408
Merge branch 'main' of https://github.com/marin-community/marin into …
Calvin-Xu Jan 29, 2026
12a654c
Merge branch 'main' of https://github.com/marin-community/marin into …
Calvin-Xu Jan 30, 2026
32cd6b9
save_tokenizer_to_gcs
Calvin-Xu Jan 30, 2026
3eea5e7
Merge branch 'main' of https://github.com/marin-community/marin into …
Calvin-Xu Jan 30, 2026
d190a6a
Merge branch 'main' of https://github.com/marin-community/marin into …
Calvin-Xu Jan 30, 2026
71c8100
save hf datasets to gcs
Calvin-Xu Jan 30, 2026
a972d07
Merge branch 'main' of https://github.com/marin-community/marin into …
Calvin-Xu Jan 30, 2026
3eb6184
visualize mix weight sampling, tweaks
Calvin-Xu Jan 31, 2026
9f23ae1
Merge branch 'main' of https://github.com/marin-community/marin into …
Calvin-Xu Jan 31, 2026
1e379a8
more cache tokenizer
Calvin-Xu Jan 31, 2026
ae57dae
Merge branch 'main' of https://github.com/marin-community/marin into …
Calvin-Xu Jan 31, 2026
63dbfef
Merge branch 'main' of https://github.com/marin-community/marin into …
Calvin-Xu Jan 31, 2026
4494ced
Merge branch 'main' of https://github.com/marin-community/marin into …
Calvin-Xu Feb 1, 2026
519eb57
Merge branch 'main' of https://github.com/marin-community/marin into …
Calvin-Xu Feb 2, 2026
565a193
Merge branch 'main' of https://github.com/marin-community/marin into …
Calvin-Xu Feb 2, 2026
ac8640d
Merge branch 'main' of https://github.com/marin-community/marin into …
Calvin-Xu Feb 3, 2026
2b17b54
Merge branch 'main' of https://github.com/marin-community/marin into …
Calvin-Xu Feb 3, 2026
497b2df
bunch of fixes
Calvin-Xu Feb 3, 2026
c239c1d
Merge branches 'calvin/swarm-olmo3-regmix-test' and 'main' of https:/…
Calvin-Xu Feb 5, 2026
4ba0484
Update eval dataset cache with improved manifest tracking
Calvin-Xu Feb 5, 2026
c4b9940
Fix trailing slash in fs.put/fs.get for correct directory handling
Calvin-Xu Feb 5, 2026
ccf9397
fix
Calvin-Xu Feb 5, 2026
e2a75bb
Make inference engine parameters configurable to avoid OOM
Calvin-Xu Feb 5, 2026
e318fba
Further reduce inference params to avoid OOM
Calvin-Xu Feb 6, 2026
baa3eda
tokenization default parallelism tweak
Calvin-Xu Feb 6, 2026
2ceb7b5
ruff
Calvin-Xu Feb 6, 2026
02f7656
temp fixes
Calvin-Xu Feb 9, 2026
4e639f0
artifacts
Calvin-Xu Feb 9, 2026
108a553
tweak
Calvin-Xu Feb 9, 2026
613386e
allow resumption during write_levanter_cache
Calvin-Xu Feb 10, 2026
3514775
bump zephyr_max_parallelism
Calvin-Xu Feb 10, 2026
ea758ea
Merge branch 'main' of https://github.com/marin-community/marin into …
Calvin-Xu Feb 10, 2026
92df358
restore logging
Calvin-Xu Feb 10, 2026
2e86ed8
Support resumable writes in write_levanter_cache
Calvin-Xu Feb 10, 2026
dafdd39
Merge branch 'main' of https://github.com/marin-community/marin into …
Calvin-Xu Feb 10, 2026
0b9a047
artifacts
Calvin-Xu Feb 10, 2026
475138b
improve
Calvin-Xu Feb 10, 2026
edda7cd
Merge branch 'main' of https://github.com/marin-community/marin into …
Calvin-Xu Feb 10, 2026
66543be
Merge branch 'calvin/resumable-levanter-cache-writes' of https://gith…
Calvin-Xu Feb 10, 2026
ac3932f
update new regmix run
Calvin-Xu Feb 10, 2026
fc00be0
tweaks
Calvin-Xu Feb 10, 2026
7f3df9f
parametric regression & artifacts
Calvin-Xu Feb 12, 2026
4b7ff0c
artifacts (not a good plot actually)
Calvin-Xu Feb 12, 2026
29abf9e
artifacts
Calvin-Xu Feb 13, 2026
ac5dff7
lint
Calvin-Xu Feb 13, 2026
b3dd812
remove all html artifacts
Calvin-Xu Feb 13, 2026
e2760bf
Merge branch 'main' of https://github.com/marin-community/marin into …
Calvin-Xu Feb 13, 2026
dee55ea
single phase exp
Calvin-Xu Feb 13, 2026
cacf869
artifacts
Calvin-Xu Feb 13, 2026
1de4e7e
lint
Calvin-Xu Feb 13, 2026
682f806
Merge branch 'main' of https://github.com/marin-community/marin into …
Calvin-Xu Feb 15, 2026
3fc26e3
update plot
Calvin-Xu Feb 15, 2026
71b30bf
update plot
Calvin-Xu Feb 15, 2026
cc53a4a
cleanup
Calvin-Xu Feb 15, 2026
46c0d3d
preliminary parametric fits results
Calvin-Xu Feb 15, 2026
9b2b09d
two_phase_starcoder_5 (50 more)
Calvin-Xu Feb 15, 2026
f5548e2
analyze step dep fix
Calvin-Xu Feb 15, 2026
706883c
artifacts
Calvin-Xu Feb 15, 2026
05d5bce
rename
Calvin-Xu Feb 15, 2026
b653a5c
plotting update & artifacts
Calvin-Xu Feb 15, 2026
53b8e87
holdout analysis refactor
Calvin-Xu Feb 16, 2026
3c50f97
lint
Calvin-Xu Feb 16, 2026
4531f8e
more baselines
Calvin-Xu Feb 16, 2026
2c52dcb
new models & artifacts
Calvin-Xu Feb 16, 2026
4f550f7
clean
Calvin-Xu Feb 16, 2026
f67f74c
validated on predicted optima
Calvin-Xu Feb 16, 2026
82099c7
try fitting on single phase, 3 phases
Calvin-Xu Feb 17, 2026
87988cf
tweak to Dirichlet weight sampler
Calvin-Xu Feb 17, 2026
4adf4f1
Merge branch 'main' of https://github.com/marin-community/marin into …
Calvin-Xu Feb 17, 2026
780b2cf
Add CES overfit + artifacts
Calvin-Xu Feb 17, 2026
4c7a608
more evals & artifacts
Calvin-Xu Feb 17, 2026
acd9675
Merge branch 'main' of https://github.com/marin-community/marin into …
Calvin-Xu Feb 17, 2026
2ca2b64
Merge branch 'main' of https://github.com/marin-community/marin into …
Calvin-Xu Feb 18, 2026
5570d17
More 3 phase starcoder
Calvin-Xu Feb 20, 2026
5af2f58
use scipy optimize; new plot
Calvin-Xu Feb 20, 2026
ec09da5
big dump
Calvin-Xu Feb 24, 2026
fbdaab3
Merge branch 'main' of https://github.com/marin-community/marin into …
Calvin-Xu Feb 24, 2026
754519e
more forms
Calvin-Xu Feb 25, 2026
71ab638
"next gen"
Calvin-Xu Feb 27, 2026
95bea5a
plot updates
Calvin-Xu Feb 28, 2026
03231cf
Merge branch 'main' of https://github.com/marin-community/marin into …
Calvin-Xu Feb 28, 2026
9623531
Merge branch 'main' of https://github.com/marin-community/marin into …
Calvin-Xu Mar 2, 2026
843456c
misc
Calvin-Xu Mar 2, 2026
a9c9b4f
jitter analysis
Calvin-Xu Mar 2, 2026
052de59
addendum
Calvin-Xu Mar 2, 2026
1c53821
DS-RE-CEQ, plots
Calvin-Xu Mar 2, 2026
beff0d6
misc ng
Calvin-Xu Mar 3, 2026
27ffdc3
misc presentation
Calvin-Xu Mar 3, 2026
a312ba4
Merge origin/main into calvin/swarm-olmo3-regmix-test
Calvin-Xu Mar 5, 2026
318f3cf
fix dolma counts
Calvin-Xu Mar 5, 2026
6a4e9b9
d3rlpy
Calvin-Xu Mar 5, 2026
2eb4d6a
single phase misc
Calvin-Xu Mar 7, 2026
2f1a307
misc
Calvin-Xu Mar 7, 2026
6996c13
RL baseline 2
Calvin-Xu Mar 7, 2026
f5662d7
Fix offline RL rollout state handling
Calvin-Xu Mar 7, 2026
2a3e16c
_validation_experiment_name
Calvin-Xu Mar 7, 2026
497ee9e
Add StarCoder static selector benchmark tooling
Calvin-Xu Mar 9, 2026
a85d00d
Improve offline RL StarCoder rollout evaluation
Calvin-Xu Mar 9, 2026
76e6d40
Add StarCoder benchmark artifacts
Calvin-Xu Mar 9, 2026
443d21e
Add StarCoder proportional and Olmix baselines
Calvin-Xu Mar 10, 2026
7ace2d4
Preserve native schedules in offline RL rollouts
Calvin-Xu Mar 10, 2026
5632c3a
Refresh StarCoder validation plot artifacts
Calvin-Xu Mar 10, 2026
e756fbd
Merge remote-tracking branch 'origin/main' into calvin/swarm-olmo3-re…
Calvin-Xu Mar 13, 2026
101e7e0
Add StarCoder WSD repeat pipelines and validation plots
Calvin-Xu Mar 14, 2026
581bcab
Add dense offline RL policy benchmarks
Calvin-Xu Mar 14, 2026
78870cf
Add pooled direct offline RL v5 benchmark
Calvin-Xu Mar 14, 2026
38cc562
Record v6-v8 offline RL follow-up results
Calvin-Xu Mar 14, 2026
cbaf357
wsd experiment
Calvin-Xu Mar 15, 2026
d2d7d18
Audit Dolma 3 pools and guard Stack-Edu tokenization
Calvin-Xu Mar 15, 2026
b073ad5
Merge remote-tracking branch 'origin/main' into calvin/swarm-olmo3-re…
Calvin-Xu Mar 15, 2026
9cd64aa
Align tokenize pipeline with main
Calvin-Xu Mar 15, 2026
8184700
Add Dolma3 Dolmino nextgen loop and canary fixes
Calvin-Xu Mar 16, 2026
972d3d7
Use merged top-level caches for Dolma3 Dolmino swarm
Calvin-Xu Mar 16, 2026
14ead23
Use hierarchical runtime loading for nextgen swarm
Calvin-Xu Mar 18, 2026
9939855
Skip empty hierarchical eval splits
Calvin-Xu Mar 18, 2026
48b171b
Speed up hierarchical nextgen data startup
Calvin-Xu Mar 18, 2026
c433450
multi domain swarm prep
Calvin-Xu Mar 18, 2026
d72c16b
Add two-phase-many surrogate baseline analysis
Calvin-Xu Mar 21, 2026
6a605de
Add run_00097 seed study and result recovery
Calvin-Xu Mar 21, 2026
ba98f25
Restore mixture weight logging for wrapped datasets
Calvin-Xu Mar 21, 2026
c5837e8
Add fixed-subset epoching and seed-noise study tooling
Calvin-Xu Mar 23, 2026
22da85c
Add budget and model overrides for run_00097 compute studies
Calvin-Xu Mar 23, 2026
830c6d3
Add observed-run helpers and update noise/runtime plot
Calvin-Xu Mar 23, 2026
b570fc6
Add first-10 fixed-subset panel launcher
Calvin-Xu Mar 23, 2026
e46feaf
Add fixed-subset swarm exports and 300M noise study
Calvin-Xu Mar 25, 2026
b254fcb
Update run_00097 noise and rank plots
Calvin-Xu Mar 25, 2026
185a891
Merge remote-tracking branch 'origin/main' into calvin/swarm-olmo3-re…
Calvin-Xu Mar 25, 2026
eb8435a
dump
Calvin-Xu Mar 30, 2026
7deab93
dump
Calvin-Xu Mar 31, 2026
db1c6e7
Add seeded phase-composition surrogates
Calvin-Xu Mar 31, 2026
f7f706d
Merge origin/main into calvin/swarm-olmo3-regmix-test
Calvin-Xu Mar 31, 2026
463c88f
[domain_phase_mix] Migrate launchers to Iris and validate no-groups
Calvin-Xu Apr 1, 2026
b3b1291
Merge remote-tracking branch 'origin/main' into calvin/swarm-olmo3-re…
Calvin-Xu Apr 1, 2026
a011b9b
[domain_phase_mix] Add Iris qsplit 300M replay launcher
Calvin-Xu Apr 5, 2026
e19c6a4
[domain_phase_mix] Add GRP follow-up baselines and launchers
Calvin-Xu Apr 5, 2026
97e8c27
[domain_phase_mix] Check in GRP convergence benchmarks
Calvin-Xu Apr 5, 2026
315aada
[domain_phase_mix] Stabilize qsplit reruns and overlap evals
Calvin-Xu Apr 6, 2026
810fa62
[domain_phase_mix] Add GRP power-law and intrinsic follow-ups
Calvin-Xu Apr 7, 2026
f10db22
[domain_phase_mix] Harden overlap eval collection
Calvin-Xu Apr 8, 2026
40774c2
[domain_phase_mix] Make top-level caches region-aware
Calvin-Xu Apr 8, 2026
e2517f7
[domain_phase_mix] Refresh validated GRP artifacts
Calvin-Xu Apr 8, 2026
9af4abf
[iris] Fix stale JAX coordinator retries
Calvin-Xu Apr 10, 2026
d82faef
[levanter] Defer Iris TPU init on TPU jobs
Calvin-Xu Apr 10, 2026
9a84ed5
[domain_phase_mix] Add qsplit pilot launchers and stratified baselines
Calvin-Xu Apr 11, 2026
0ec2631
[domain_phase_mix] Add GRP family-curvature and penalty variants
Calvin-Xu Apr 11, 2026
9f6bcc3
lint
Calvin-Xu Apr 11, 2026
c1c7c13
also lint
Calvin-Xu Apr 11, 2026
33d2300
[docs] Add qsplit pilot debug logs
Calvin-Xu Apr 11, 2026
62aeb59
[levanter] Retry lm-eval task loads safely
Calvin-Xu Apr 11, 2026
e54f904
[domain_phase_mix] Add GRP raw convergence tooling
Calvin-Xu Apr 11, 2026
0b13d7e
[fray] Format pip package parsing
Calvin-Xu Apr 11, 2026
191d302
[domain_phase_mix] Refresh exploratory cached artifacts
Calvin-Xu Apr 11, 2026
4d87c67
[domain_phase_mix] Support region-agnostic swarm relaunches
Calvin-Xu Apr 12, 2026
a3a4f77
[domain_phase_mix] Add GRP per-domain penalty probe
Calvin-Xu Apr 12, 2026
4d3ebec
[domain_phase_mix] Add power-family-penalty analysis outputs
Calvin-Xu Apr 13, 2026
d3be0d1
[domain_phase_mix] Harden region-agnostic replay launches
Calvin-Xu Apr 13, 2026
2a04529
[levanter] Fallback multihost sync without JAX client
Calvin-Xu Apr 13, 2026
92d87d7
Merge origin/main into calvin/swarm-olmo3-regmix-test
Calvin-Xu Apr 13, 2026
268c6ea
[domain-phase-mix] Add run registries and parity reruns
Calvin-Xu Apr 16, 2026
b33dc8f
[domain-phase-mix] Add scaling sweep and Olmix analysis
Calvin-Xu Apr 17, 2026
0df3226
[levanter] Keep cached eval datasets offline
Calvin-Xu Apr 17, 2026
70 changes: 70 additions & 0 deletions .agents/logbooks/grp_deployment_variants.md
@@ -0,0 +1,70 @@
# GRP Deployment Variants: Research Logbook

## Scope
- Goal: compare observed-only GRP deployment rules on top of the same retuned GRP fit, and pick a final slide-ready procedure.
- Primary metric(s): retrospective `Regret@1`, predicted BPB realism, and deployment movement measured by mean phase TV.
- Constraints: keep the nonlinear retuning procedure fixed; vary only deployment.
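
The two quantitative metrics named in the scope can be sketched as follows. This is a minimal illustration, not the benchmark's code: it assumes `Regret@1` means the actual-BPB gap between the run the fit ranks first and the best run actually observed, and that "mean phase TV" is the total-variation distance between two mixtures averaged over phases. The function names and the list-of-rows mixture layout are hypothetical.

```python
def regret_at_1(predicted_bpb, actual_bpb):
    """Actual BPB of the top-predicted run minus the best actual BPB (lower BPB is better).

    Assumed definition; both arguments are equal-length lists indexed by run.
    """
    chosen = min(range(len(predicted_bpb)), key=predicted_bpb.__getitem__)
    return actual_bpb[chosen] - min(actual_bpb)


def mean_phase_tv(mix_a, mix_b):
    """Total-variation distance between two mixtures, averaged over phases.

    Each mixture is a list of phases; each phase is a list of domain weights summing to 1.
    """
    per_phase = [
        0.5 * sum(abs(a - b) for a, b in zip(row_a, row_b))
        for row_a, row_b in zip(mix_a, mix_b)
    ]
    return sum(per_phase) / len(per_phase)
```

Under these assumptions, `Regret@1 = 0` means the fit's top-ranked run is also the best observed run, and a mean phase TV of `0.12` means that on average 12% of each phase's probability mass differs between the two mixtures being compared.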

## Baseline
- Date: 2026-04-02
- Code refs:
  - `/Users/calvinxu/Projects/Work/Marin/marin/experiments/domain_phase_mix/exploratory/two_phase_many/benchmark_grp_retuned.py`
  - `/Users/calvinxu/Projects/Work/Marin/marin/experiments/domain_phase_mix/exploratory/two_phase_many/benchmark_grp_observed_hull.py`
- Baseline numbers:
  - raw retuned optimum reaches retrospective `Regret@1 = 0` from `k >= 80`, but predicted optima are unrealistically optimistic (`~1.029`) and the mixture moves a lot.
  - observed-only full hull keeps `Regret@1 = 0` from `k >= 80`, with more realistic predicted BPB (`~1.065`) and lower movement.

## Experiment Log
### 2026-04-02 17:31 - Observed-only deployment variants
- Hypothesis: restricting deployment to observed-run mixtures should preserve retrospective choice quality while stabilizing the optimum and making predicted BPBs more realistic.
- Command:
  - `uv run python /Users/calvinxu/Projects/Work/Marin/marin/experiments/domain_phase_mix/exploratory/two_phase_many/benchmark_grp_deployment_variants.py`
- Config:
  - same per-subset GRP retuning as `benchmark_grp_retuned.py`
  - deployment variants:
    - best predicted observed run
    - convex hull of top-4 predicted observed runs
    - convex hull of top-8 predicted observed runs
    - convex hull of top-16 predicted observed runs
    - convex hull of all observed runs
- Result:
  - all variants have the same retrospective `Regret@1` profile: misses only at `k=40,60`, then zero from `k >= 80`
  - after `k >= 80`, movement / predicted-value summary:
    - `top1_observed`: mean move `0.256`, mean predicted `1.0792`
    - `top4_hull`: mean move `0.147`, mean predicted `1.0739`
    - `top8_hull`: mean move `0.195`, mean predicted `1.0702`
    - `top16_hull`: mean move `0.127`, mean predicted `1.0666`
    - `all_observed_hull`: mean move `0.120`, mean predicted `1.0653`
- Interpretation:
  - the full observed-run hull is the cleanest local rule among those tried
  - it matches the other variants on retrospective choice quality, but gives the lowest movement and the most realistic predicted BPB
- Next action:
  - use the observed-only full hull as the documented deployment rule in the slides
  - if needed later, validate a few observed-hull subset deployments by training them
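
The hull-restricted deployment variants above can be sketched as a search over convex combinations of the top-k predicted observed runs. This is an illustrative sketch only: `predict` stands in for the fitted GRP surrogate, the mixture layout (list of phases, each a list of domain weights) is assumed, and the Dirichlet random search is a deliberately simple substitute for whatever optimizer `benchmark_grp_deployment_variants.py` actually uses.

```python
import random


def mix_rows(weights, runs):
    """Convex combination of run mixtures; each run is a list of phase rows."""
    n_phases, n_domains = len(runs[0]), len(runs[0][0])
    return [
        [sum(w * run[p][d] for w, run in zip(weights, runs)) for d in range(n_domains)]
        for p in range(n_phases)
    ]


def deploy_top_k_hull(runs, predict, k, n_samples=2000, seed=0):
    """Pick the lowest-predicted mixture inside the hull of the k best-predicted runs.

    `predict` maps a mixture to predicted BPB (hypothetical surrogate interface).
    Returns (mixture, hull_weights).
    """
    rng = random.Random(seed)
    ranked = sorted(runs, key=predict)[:k]
    # Include the vertices so the result is never worse than the best predicted observed run.
    candidates = [
        [1.0 if i == j else 0.0 for j in range(len(ranked))] for i in range(len(ranked))
    ]
    for _ in range(n_samples):
        # Normalized Gamma(1, 1) draws are uniform on the simplex (Dirichlet(1, ..., 1)).
        draws = [rng.gammavariate(1.0, 1.0) for _ in ranked]
        total = sum(draws)
        candidates.append([d / total for d in draws])
    best = min(candidates, key=lambda w: predict(mix_rows(w, ranked)))
    return mix_rows(best, ranked), best
```

With `k=1` this reduces to the "best predicted observed run" variant, and with `k=len(runs)` it corresponds to the hull over all observed runs. Because every candidate is a convex combination of observed mixtures, the deployed mixture cannot leave the region the surrogate was fit on, which is the stated motivation for the observed-only rules.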

### 2026-04-02 18:12 - More deployment regularizers and representative validation launch
- Hypothesis: a hull over the top actual observed runs should retain the good behavior of the full observed hull, while being cleaner and slightly more local.
- Command:
  - `uv run python /Users/calvinxu/Projects/Work/Marin/marin/experiments/domain_phase_mix/exploratory/two_phase_many/benchmark_grp_deployment_variants.py`
  - `uv run python -m marin.run.iris_run --config /Users/calvinxu/Projects/Work/Marin/marin/lib/iris/examples/marin.yaml -- --no-wait --job-name dm-genericfamily-top8actual-hull-subset-optima-$(date +%Y%m%d-%H%M%S) --region us-east5 --zone us-east5-a --cpu 2 --memory 16GB --disk 20GB --extra marin:tpu --extra marin:eval -- python -m experiments.domain_phase_mix.launch_two_phase_many_genericfamily_top8actual_hull_subset_optima --tpu-type v5p-8 --max-concurrent 4`
- Config:
  - added these observed-only variants:
    - `top4_actual_hull`
    - `top8_actual_hull`
    - `top16_actual_hull`
    - `all_hull_disp0.01`, `all_hull_disp0.02`
    - `all_hull_to_bestactual0.02`, `all_hull_to_bestactual0.05`
- Result:
  - after `k >= 80`:
    - `all_observed_hull`: mean predicted `1.0653`, mean move `0.1200`, mean support `8.75`
    - `top8_actual_hull`: mean predicted `1.0661`, mean move `0.1100`, mean support `6.75`
    - `top16_actual_hull`: mean predicted `1.0660`, mean move `0.1096`, mean support `9.13`
  - dispersion / locality penalties reduced nearest-TV distance but generally worsened predicted BPB and often collapsed toward a single observed run
  - launched representative subset validation for `top8_actual_hull`:
    - parent job `/calvinxu/dm-genericfamily-top8actual-hull-subset-optima-20260402-181147`
- Interpretation:
  - `top8_actual_hull` is the best of the cleaner policies: essentially tied with `top16_actual_hull` on predicted BPB and movement, but simpler and sparser (mean support `6.75` vs. `9.13`)
  - the full observed hull still wins slightly on pure predicted BPB (`1.0653` vs. `1.0661`), but the difference is negligible
- Next action:
  - validate representative `top8_actual_hull` subset runs first
  - only run an all-subset sweep if the representative validations look materially better than the raw retuned subset optima
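
For readers unfamiliar with the "support" and locality-penalty terms in this entry, here is a hedged sketch of what they might compute. Both definitions are assumptions inferred from the variant names: support is taken to be the number of observed runs with non-negligible hull weight, and `all_hull_to_bestactual0.02` is read as a penalty of weight `0.02` on the TV distance to the best actual run. The function names and the tolerance are hypothetical.

```python
def support_size(hull_weights, tol=1e-6):
    """Number of observed runs that receive non-negligible weight in the hull solution."""
    return sum(1 for w in hull_weights if w > tol)


def locality_penalized_bpb(predicted_bpb, tv_to_anchor, penalty):
    """Predicted BPB plus a penalty on distance from an anchor mixture.

    Assumed form: objective = predicted_bpb + penalty * tv_to_anchor, where the
    anchor would be the best actual observed run for the `to_bestactual` variants.
    """
    return predicted_bpb + penalty * tv_to_anchor
```

Under this reading, a larger `penalty` trades predicted BPB for proximity to the anchor, which is consistent with the observation that the penalized variants often collapsed toward a single observed run.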
70 changes: 70 additions & 0 deletions .agents/logbooks/offline_rl_v5.md
@@ -0,0 +1,70 @@
# Offline RL v5: Research Logbook

## Scope
- Goal: find a new three-phase offline-control method that beats the current legacy `outcome_planner` under the pooled dense v4 evaluation setup.
- Primary metric(s): `fqe_value_mean`, `dr_value_mean`, fold win counts against `legacy_outcome_planner`.
- Constraints: keep evaluation on the existing three-phase target folds; do not justify rollout unless the method also beats `fixed_best_schedule`.
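
The fold win counts listed among the primary metrics can be tallied with a few lines. This sketch assumes per-fold results are available as dictionaries keyed by metric name and that higher values are better for both FQE and DR; the function name, metric keys, and data layout are hypothetical.

```python
def fold_win_counts(candidate_folds, baseline_folds, metrics=("fqe_value", "dr_value")):
    """Count folds where the candidate strictly beats the baseline.

    Returns (per_metric_wins, wins_on_all_metrics); both fold lists must be aligned.
    """
    per_metric = {m: 0 for m in metrics}
    wins_on_all = 0
    for cand, base in zip(candidate_folds, baseline_folds):
        beaten = [cand[m] > base[m] for m in metrics]
        for m, won in zip(metrics, beaten):
            per_metric[m] += int(won)
        wins_on_all += int(all(beaten))
    return per_metric, wins_on_all
```

A result such as `beat_legacy_fqe_folds = 5/5` would correspond to `per_metric["fqe_value"] == 5` over five folds, and `beat_legacy_fqe_and_dr_folds` to `wins_on_all`.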

## Baseline
- Date: 2026-03-14
- Code refs:
  - `experiments/domain_phase_mix/offline_rl/train_offline_policy_bench.py`
  - `experiments/domain_phase_mix/offline_rl/train_three_phase_policy_bench_v4.py`
- Baseline numbers on the pooled dense v4 folds:
  - `legacy_outcome_planner`: `fqe_value_mean = 4.0709`, `dr_value_mean = 3.1376`
  - `fixed_best_schedule`: `fqe_value_mean = 4.1268`, `dr_value_mean = 4.3350`

## Experiment Log
### 2026-03-14 16:00 - pooled direct/hybrid follow-up
- Hypothesis: the old planner's advantage came from direct final-objective scoring, while v3/v4 Q-only variants improved support behavior but lost ranking power. A pooled direct planner on dense features should beat the legacy planner, and a hybrid `Q + direct` planner may also help.
- Command:
  - `uv run python /tmp/eval_v5_candidates.py`
- Config:
  - dataset: `/Users/calvinxu/Projects/Coursework/CS234/Project/RL_Bench/offline_rl_v4_three_phase_target_pooled_aux_20260312/dataset_v4`
  - candidates:
    - `dense_direct_v5`: pooled dense direct planner with `reward_bonus_weight = 0.08`, `support_lambda = 0.02`
    - `hybrid_q_direct_v5`: pooled dynamic-Q plus direct-utility hybrid with `direct_alpha = 2.0`, `support_lambda = 0.05`
- Result:
  - `dense_direct_v5`
    - `fqe_value_mean = 4.0860`
    - `dr_value_mean = 4.1887`
    - `beat_legacy_fqe_folds = 5/5`
    - `beat_legacy_fqe_and_dr_folds = 4/5`
  - `hybrid_q_direct_v5`
    - `fqe_value_mean = 4.0870`
    - `dr_value_mean = 4.0522`
    - `beat_legacy_fqe_folds = 5/5`
    - `beat_legacy_fqe_and_dr_folds = 3/5`
- Interpretation:
  - `dense_direct_v5` is the best new offline method so far.
  - The direct objective model was the right inductive bias to recover; pure Q-style models were too noisy on this action-sparse dataset.
  - Neither v5 method beats `fixed_best_schedule` (`fqe_value_mean = 4.1268`, `dr_value_mean = 4.3350`) on either metric, so rollout remains unjustified.
- Next action:
  - keep `dense_direct_v5` as the new offline baseline and focus future work on closing the gap to `fixed_best_schedule`.
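
One plausible reading of the `Q + direct` hybrid in this entry is an additive score with a support penalty. The sketch below is an assumption, not the benchmark's implementation: only the parameter names `direct_alpha` and `support_lambda` (and their values) come from the log, while the additive form, the candidate dictionary keys, and the function names are hypothetical.

```python
def hybrid_score(q_value, direct_utility, support_distance,
                 direct_alpha=2.0, support_lambda=0.05):
    """Assumed hybrid: Q estimate + weighted direct utility - penalty for leaving data support."""
    return q_value + direct_alpha * direct_utility - support_lambda * support_distance


def pick_action(candidates, **score_kwargs):
    """Choose the candidate action with the highest hybrid score.

    Each candidate is a dict with hypothetical keys "q", "direct", and "dist".
    """
    return max(
        candidates,
        key=lambda c: hybrid_score(c["q"], c["direct"], c["dist"], **score_kwargs),
    )
```

Setting `direct_alpha=0` recovers a pure Q-style planner with a support penalty, which makes it easy to see how much of a ranking change is due to the direct-utility term alone.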

### 2026-03-14 17:10 - v6/v7/v8 follow-ups against fixed schedule
- Hypothesis:
  - v6: the remaining gap comes from overusing StarCoder in phase 0; cap phase 0 and train only on three-phase targets.
  - v7: preserve the best historical prefix and only adapt later phases.
  - v8: use a conservative phase-2 adapter on top of the fixed best schedule.
- Command:
  - `uv run python /tmp/eval_v6_candidates.py`
  - `uv run python /tmp/eval_v7_fixed_prefix.py`
  - `uv run python /tmp/eval_v8_conservative.py`
  - `uv run python /tmp/eval_v8_conservative_hi.py`
- Result:
  - best v6 candidate: `three_only_capped_hybrid_v6`
    - `fqe_value_mean = 4.1002`
    - `dr_value_mean = 4.1781`
  - best v7 candidate: `fixed_phase0_plus_hybrid_v7`
    - `fqe_value_mean = 4.1026`
    - `dr_value_mean = 4.1869`
  - best v8 conservative candidate: `conservative_phase2_direct_m0.10`
    - `fqe_value_mean = 4.1201`
    - `dr_value_mean = 4.4583`
- Interpretation:
  - phase-0 control was part of the issue, but fixing it alone was not enough.
  - the conservative phase-2 adapter gets very close to the fixed schedule and even exceeds it on mean DR (`4.4583` vs. `4.3350`), but still misses on mean FQE (`4.1201` vs. `4.1268`).
  - none of these variants beats `fixed_best_schedule` on both FQE and DR, so none is rollout-ready.
- Next action:
  - keep `dense_direct_v5` as the best replacement for the legacy planner, and treat conservative/fixed-prefix adapters as promising but still incomplete branches.
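
The rollout criterion used throughout this logbook (beat `fixed_best_schedule` on both mean FQE and mean DR) can be written as a small check. The numbers below are copied from the entries above; the function name and dictionary layout are illustrative.

```python
def beats_on_both(candidate, reference):
    """True only if the candidate is strictly better on mean FQE and mean DR (higher is better)."""
    return all(candidate[m] > reference[m] for m in ("fqe_value_mean", "dr_value_mean"))


# Values taken from the logbook entries above.
fixed_best_schedule = {"fqe_value_mean": 4.1268, "dr_value_mean": 4.3350}
legacy_outcome_planner = {"fqe_value_mean": 4.0709, "dr_value_mean": 3.1376}
dense_direct_v5 = {"fqe_value_mean": 4.0860, "dr_value_mean": 4.1887}
conservative_phase2_direct = {"fqe_value_mean": 4.1201, "dr_value_mean": 4.4583}

print(beats_on_both(dense_direct_v5, legacy_outcome_planner))          # True
print(beats_on_both(dense_direct_v5, fixed_best_schedule))             # False
print(beats_on_both(conservative_phase2_direct, fixed_best_schedule))  # False: misses on FQE
```

This makes the conclusion explicit: `dense_direct_v5` replaces the legacy planner, while the v8 adapter fails the gate only because of the FQE gap.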