
Commit 00bdcc8

Merge origin/main into tdv/reasoning-gym-pr1
2 parents: e3e2aa5 + a136498

217 files changed: +22230 −7323 lines


.agents/skills/fix-docs/SKILL.md

Lines changed: 5 additions & 0 deletions

@@ -1,3 +1,8 @@
+---
+name: fix-docs
+description: Fix markdown docs in `lib/iris`, `lib/zephyr`, and `lib/fray` to align with Marin's agent-doc principles. Use when asked to repair, modernize, or de-rot docs in those directories.
+---
+
 Your task is to fix the markdown docs within `lib/iris`, `lib/zephyr` and `lib/fray` so that they maximally comply with the principles below. Do NOT fix docs outside of the aforementioned directories.

 Your output: You will dispatch sub-agents that will (1) thoroughly parse the code and the docs and (2) make all the documentation changes that are deemed appropriate, locally. You will commit the changes locally into a single commit, inform the user of the commit, and summarize the changes you made. Under no circumstances should you push any commit to the repo without explicit approval from the user.

.github/workflows/iris-coreweave-ci.yaml

Lines changed: 1 addition & 1 deletion

@@ -104,7 +104,7 @@ jobs:
         run: |
           cd lib/iris && uv run --group dev iris -v \
             --config=examples/coreweave-ci.yaml \
-            cluster start
+            cluster start --fresh

       - name: Run integration tests
         env:

.github/workflows/iris-dev-restart.yaml

Lines changed: 2 additions & 2 deletions

@@ -2,8 +2,8 @@ name: Iris - Dev Cluster Daily Restart

 on:
   schedule:
-    # Daily at 06:00 UTC
-    - cron: "0 6 * * *"
+    # Daily at 05:00 UTC — staggered before canary ferry (06:00 UTC)
+    - cron: "0 5 * * *"
   workflow_dispatch:

 permissions:

.github/workflows/marin-canary-ferry-cw.yaml

Lines changed: 3 additions & 3 deletions

@@ -58,7 +58,7 @@ jobs:
           enable-cache: true

       - name: Install dependencies
-        run: uv sync --all-packages --extra=cpu --no-default-groups
+        run: uv sync --all-packages --extra=cpu --extra=controller --no-default-groups

       - name: Write CoreWeave kubeconfig
         run: |
@@ -89,7 +89,7 @@ jobs:
         run: |
           JOB_ID=$(.venv/bin/iris --config=${{ env.IRIS_CONFIG }} \
             job run --no-wait \
-            --memory=16G --disk=16G --cpu=1 --extra=cpu \
+            --memory=2G --disk=4G --cpu=1 --extra=cpu \
             -e MARIN_PREFIX s3://marin-na/marin/ \
             -e RUN_ID "$RUN_ID" \
             -e CANARY_ACCELERATOR "$CANARY_ACCELERATOR" \
@@ -195,7 +195,7 @@ jobs:
           Read .agents/skills/canary-triage/SKILL.md and follow it.
         claude_args: |
           --model opus
-          --max-turns 50
+          --max-turns 500
           --allowedTools "Bash(kubectl:*),Bash(gh:*),Bash(.venv/bin/iris:*),Bash(.venv/bin/python:*),Bash(cat:*),Bash(jq:*),Bash(head:*),Bash(tail:*),Bash(grep:*)"
         env:
           CANARY_LANE: gpu

.github/workflows/marin-canary-ferry.yaml

Lines changed: 2 additions & 2 deletions

@@ -78,7 +78,7 @@ jobs:
         run: |
           JOB_ID=$(.venv/bin/iris --config=${{ env.IRIS_CONFIG }} \
             job run --no-wait \
-            --memory=16G --disk=16G --cpu=1 --extra=cpu \
+            --memory=2G --disk=4G --cpu=1 --extra=cpu \
             --reserve v5p-8 \
             -e RUN_ID "$RUN_ID" \
             -e CANARY_ACCELERATOR "$CANARY_ACCELERATOR" \
@@ -165,7 +165,7 @@ jobs:
           Read .agents/skills/canary-triage/SKILL.md and follow it.
         claude_args: |
           --model opus
-          --max-turns 50
+          --max-turns 500
           --allowedTools "Bash(gh:*),Bash(.venv/bin/iris:*),Bash(.venv/bin/python:*),Bash(cat:*),Bash(jq:*),Bash(head:*),Bash(tail:*),Bash(grep:*)"
         env:
           CANARY_LANE: tpu

.github/workflows/marin-datakit-smoke.yaml

Lines changed: 18 additions & 5 deletions

@@ -19,7 +19,7 @@ jobs:
     cancel-in-progress: true
     env:
       SMOKE_RUN_ID: datakit-smoke-${{ github.run_id }}-${{ github.run_attempt }}
-      # MARIN_PREFIX is defaulted by the ferry entrypoint to marin_temp_bucket(ttl_days=1).
+      FERRY_STATUS_PATH: gs://marin-tmp-us-central1/ttl=1d/ci/datakit-smoke-${{ github.run_id }}-${{ github.run_attempt }}/ferry_run_status.json
       WANDB_ENTITY: marin-community
       WANDB_PROJECT: marin
       IRIS_CONFIG: lib/iris/examples/marin-dev.yaml
@@ -70,6 +70,7 @@ jobs:
             job run --no-wait \
             --memory=2G --disk=4G --cpu=1 --extra=cpu \
             -e SMOKE_RUN_ID "$SMOKE_RUN_ID" \
+            -e FERRY_STATUS_PATH "$FERRY_STATUS_PATH" \
             -e WANDB_ENTITY "$WANDB_ENTITY" \
             -e WANDB_PROJECT "$WANDB_PROJECT" \
             -e WANDB_API_KEY "$WANDB_API_KEY" \
@@ -113,12 +114,24 @@ jobs:
           esac
         done

+      - name: Read ferry status
+        id: ferry_status
+        shell: bash -l {0}
+        run: |
+          PREFIX=$(.venv/bin/python -c "
+          import json
+          from rigging.filesystem import url_to_fs
+          fs, _ = url_to_fs('$FERRY_STATUS_PATH')
+          with fs.open('$FERRY_STATUS_PATH') as f:
+              print(json.load(f)['marin_prefix'])
+          ")
+          echo "marin_prefix=$PREFIX" >> "$GITHUB_OUTPUT"
+          echo "Ferry output prefix: $PREFIX"
+
       - name: Validate datakit smoke outputs
         shell: bash -l {0}
         env:
-          SMOKE_RUN_ID: ${{ env.SMOKE_RUN_ID }}
-          # MARIN_PREFIX intentionally unset — validate script defaults via marin_temp_bucket,
-          # matching the ferry entrypoint default.
+          MARIN_PREFIX: ${{ steps.ferry_status.outputs.marin_prefix }}
         run: .venv/bin/python scripts/datakit/validate_ferry_outputs.py

       - name: Capture failure diagnostics
@@ -143,7 +156,7 @@ jobs:
           Read .agents/skills/canary-triage/SKILL.md and follow it.
         claude_args: |
           --model opus
-          --max-turns 50
+          --max-turns 500
           --allowedTools "Bash(gh:*),Bash(.venv/bin/iris:*),Bash(.venv/bin/python:*),Bash(cat:*),Bash(jq:*),Bash(head:*),Bash(tail:*),Bash(grep:*)"
         env:
           CANARY_LANE: datakit-smoke
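The "Read ferry status" step above parses `marin_prefix` out of the JSON blob the ferry entrypoint writes, replacing the old "MARIN_PREFIX intentionally unset" convention with an explicit handshake file. A minimal stdlib sketch of that consumer side, using a local temp file in place of the `gs://` status path (the helper name `read_marin_prefix` is illustrative, not part of the repo):

```python
import json
import tempfile
from pathlib import Path

def read_marin_prefix(status_path: str) -> str:
    """Parse the ferry status JSON and return the resolved MARIN_PREFIX."""
    payload = json.loads(Path(status_path).read_text())
    return payload["marin_prefix"]

# Simulate the ferry entrypoint writing its status, then the CI step reading it.
status_file = Path(tempfile.mkdtemp()) / "ferry_run_status.json"
status_file.write_text(
    json.dumps({"status": "succeeded", "marin_prefix": "gs://example-bucket/run-1"})
)
print(read_marin_prefix(str(status_file)))  # → gs://example-bucket/run-1
```

In the real workflow the read goes through `rigging.filesystem.url_to_fs` so the same code path handles `gs://` and local URLs.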
.github/workflows/marin-libs-wheels.yaml (new file)

Lines changed: 100 additions & 0 deletions

@@ -0,0 +1,100 @@
+name: marin-libs - Build Wheels
+
+on:
+  workflow_dispatch:
+    inputs:
+      mode:
+        description: "Build mode"
+        type: choice
+        options: [nightly, manual]
+        default: manual
+  schedule:
+    - cron: "0 6 * * *" # 06:00 UTC daily
+  push:
+    tags:
+      - "marin-libs-v*"
+  pull_request:
+    paths:
+      - "lib/**"
+      - "scripts/python_libs_package.py"
+      - ".github/workflows/marin-libs-wheels.yaml"
+
+concurrency:
+  group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.ref }}
+  cancel-in-progress: false # don't kill an in-flight nightly mid-publish
+
+permissions:
+  contents: write # creating GH releases
+  pull-requests: read
+
+jobs:
+  resolve:
+    runs-on: ubuntu-latest
+    outputs:
+      mode: ${{ steps.pick.outputs.mode }}
+      version: ${{ steps.pick.outputs.version }}
+    steps:
+      - id: pick
+        run: |
+          set -euo pipefail
+          if [[ "${GITHUB_EVENT_NAME}" == "push" && "${GITHUB_REF}" == refs/tags/marin-libs-v* ]]; then
+            echo "mode=stable" >> "$GITHUB_OUTPUT"
+            echo "version=${GITHUB_REF_NAME#marin-libs-v}" >> "$GITHUB_OUTPUT"
+          elif [[ "${GITHUB_EVENT_NAME}" == "schedule" ]]; then
+            echo "mode=nightly" >> "$GITHUB_OUTPUT"
+            echo "version=" >> "$GITHUB_OUTPUT"
+          elif [[ "${GITHUB_EVENT_NAME}" == "workflow_dispatch" ]]; then
+            echo "mode=${{ github.event.inputs.mode }}" >> "$GITHUB_OUTPUT"
+            echo "version=" >> "$GITHUB_OUTPUT"
+          else
+            # pull_request: build-only smoke test
+            echo "mode=manual" >> "$GITHUB_OUTPUT"
+            echo "version=" >> "$GITHUB_OUTPUT"
+          fi
+
+  build:
+    needs: resolve
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v4
+        with:
+          fetch-depth: 0 # for git rev-parse in manual mode
+      - uses: astral-sh/setup-uv@v7
+
+      - name: Build wheels
+        run: |
+          uv run python scripts/python_libs_package.py \
+            --mode "${{ needs.resolve.outputs.mode }}" \
+            ${{ needs.resolve.outputs.version && format('--version {0}', needs.resolve.outputs.version) || '' }} \
+            --skip-publish
+
+      - uses: actions/upload-artifact@v4
+        with:
+          name: marin-libs-wheels
+          # BUILD_INFO.json travels with the wheels so the publish job uses
+          # the same resolved version the build job stamped in, instead of
+          # re-computing it (which would drift across midnight UTC).
+          path: |
+            dist/*.whl
+            dist/BUILD_INFO.json
+          retention-days: 14
+
+  publish:
+    needs: [resolve, build]
+    if: github.event_name != 'pull_request'
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v4
+      - uses: astral-sh/setup-uv@v7
+      - uses: actions/download-artifact@v4
+        with:
+          name: marin-libs-wheels
+          path: dist
+      - name: Publish releases and prune nightlies
+        env:
+          GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
+        run: |
+          uv run python scripts/python_libs_package.py \
+            --mode "${{ needs.resolve.outputs.mode }}" \
+            ${{ needs.resolve.outputs.version && format('--version {0}', needs.resolve.outputs.version) || '' }} \
+            --publish-only
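The `resolve` job's shell branching above maps each triggering event to a (mode, version) pair. That logic can be mirrored as a small pure function for reasoning about the mapping (a sketch; `resolve_build` is a hypothetical helper, not code from the repo):

```python
def resolve_build(
    event_name: str,
    ref: str = "",
    ref_name: str = "",
    dispatch_mode: str = "manual",
) -> tuple[str, str]:
    """Mirror the workflow's resolve step: map a GitHub event to (mode, version)."""
    if event_name == "push" and ref.startswith("refs/tags/marin-libs-v"):
        # Stable release: version comes straight from the tag name.
        return "stable", ref_name.removeprefix("marin-libs-v")
    if event_name == "schedule":
        return "nightly", ""
    if event_name == "workflow_dispatch":
        return dispatch_mode, ""
    # pull_request: build-only smoke test, no publish
    return "manual", ""

print(resolve_build("push", "refs/tags/marin-libs-v1.2.0", "marin-libs-v1.2.0"))
# → ('stable', '1.2.0')
```

Keeping the `pull_request` case as plain `manual` matches the workflow's `publish` job guard, which skips publishing for PR builds entirely.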

experiments/defaults.py

Lines changed: 15 additions & 1 deletion

@@ -206,6 +206,9 @@ def default_tokenize(
     *,
     sample_count: int | VersionedValue[int] | None = None,
     is_validation: bool = False,
+    levanter_batch_size: int | None = None,
+    resources: ResourceConfig | None = None,
+    worker_resources: ResourceConfig | None = None,
 ) -> ExecutorStep:
     """
     Tokenizes a dataset using the specified tokenizer and Levanter's tokenization infrastructure.
@@ -228,6 +231,11 @@
         An ExecutorStep that represents the tokenized dataset.
     """

+    # Common kwargs for config constructors
+    extra_kwargs: dict = {}
+    if worker_resources is not None:
+        extra_kwargs["worker_resources"] = worker_resources
+
     # sniff out if it's a HuggingFace dataset
     if isinstance(dataset, HfDatasetSpec):
         config = HfTokenizeConfig(
@@ -237,6 +245,8 @@
             tokenizer=ensure_versioned(tokenizer),
             format=format,
             sample_count=ensure_versioned(sample_count) if sample_count is not None else None,
+            levanter_batch_size=levanter_batch_size,
+            **extra_kwargs,
         )
     elif (
         isinstance(dataset, str)
@@ -250,6 +260,8 @@
             tokenizer=ensure_versioned(tokenizer),
             format=format,
             sample_count=ensure_versioned(sample_count) if sample_count is not None else None,
+            levanter_batch_size=levanter_batch_size,
+            **extra_kwargs,
         )
     else:
         config = TokenizeConfig(
@@ -259,14 +271,16 @@
             tokenizer=ensure_versioned(tokenizer),
             format=format,
             sample_count=ensure_versioned(sample_count) if sample_count is not None else None,
+            levanter_batch_size=levanter_batch_size,
+            **extra_kwargs,
         )

     return ExecutorStep(
         name=os.path.join("tokenized", name),
         description=f"Tokenize raw text using the {tokenizer} tokenizer.",
         fn=remote(
             tokenize,
-            resources=ResourceConfig.with_cpu(cpu=4, ram="16g", disk="10g"),
+            resources=resources or ResourceConfig.with_cpu(cpu=4, ram="16g", disk="10g"),
             pip_dependency_groups=["cpu"],
             env_vars={
                 "TRANSFORMERS_NO_TORCH": "1",

experiments/ferries/datakit_ferry.py

Lines changed: 19 additions & 2 deletions

@@ -7,10 +7,11 @@
 Output paths are placed under ``$MARIN_PREFIX/datakit-smoke/$SMOKE_RUN_ID/...``.
 """

+import json
 import logging
 import os

-from rigging.filesystem import marin_temp_bucket
+from rigging.filesystem import marin_temp_bucket, url_to_fs
 from rigging.log_setup import configure_logging

 from fray import ResourceConfig
@@ -109,14 +110,30 @@ def build_steps(run_id: str) -> list[StepSpec]:
     return [downloaded, normalized, deduped, consolidated, tokenized]


+def _write_status(status: str, marin_prefix: str) -> None:
+    """Write ferry run status to FERRY_STATUS_PATH if set."""
+    status_path = os.environ.get("FERRY_STATUS_PATH")
+    if not status_path:
+        return
+    payload = json.dumps({"status": status, "marin_prefix": marin_prefix})
+    fs, _ = url_to_fs(status_path)
+    with fs.open(status_path, "w") as f:
+        f.write(payload)
+    logger.info("Wrote ferry status to %s", status_path)
+
+
 def main() -> None:
     configure_logging()
     if not os.environ.get("MARIN_PREFIX"):
         os.environ["MARIN_PREFIX"] = marin_temp_bucket(ttl_days=1)

-    logger.info("MARIN_PREFIX defaulted to %s", os.environ["MARIN_PREFIX"])
+    marin_prefix = os.environ["MARIN_PREFIX"]
+    logger.info("MARIN_PREFIX defaulted to %s", marin_prefix)
     run_id = os.environ["SMOKE_RUN_ID"]
+
+    _write_status("running", marin_prefix)
     StepRunner().run(build_steps(run_id))
+    _write_status("succeeded", marin_prefix)


 if __name__ == "__main__":
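The `_write_status` calls bracket the run so CI can distinguish a ferry that never started from one that started and died: "running" is written before any step executes, "succeeded" only after all steps finish. A local-filesystem sketch of that lifecycle; note that the `failed` branch here is an assumption for illustration, the actual diff only writes `running` and `succeeded`:

```python
import json
import tempfile
from pathlib import Path

def write_status(status_path: Path, status: str, marin_prefix: str) -> None:
    # Stand-in for the fsspec-backed _write_status: one small JSON blob a
    # CI step can read after the job exits.
    status_path.write_text(json.dumps({"status": status, "marin_prefix": marin_prefix}))

def run_ferry(status_path: Path, marin_prefix: str, steps) -> None:
    write_status(status_path, "running", marin_prefix)
    try:
        for step in steps:
            step()
        write_status(status_path, "succeeded", marin_prefix)
    except Exception:
        # Hypothetical: the real entrypoint leaves "running" in place on failure.
        write_status(status_path, "failed", marin_prefix)
        raise

path = Path(tempfile.mkdtemp()) / "ferry_run_status.json"
run_ferry(path, "gs://example-bucket/run-1", [lambda: None])
print(json.loads(path.read_text())["status"])  # → succeeded
```

Because the status file also carries `marin_prefix`, the downstream validate step no longer has to re-derive the temp-bucket default itself.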

experiments/pretraining_datasets/__init__.py

Lines changed: 9 additions & 0 deletions

@@ -48,6 +48,10 @@
     downloads as nemotron_v2_downloads,
     tokenize_nemotron_v2_family,
 )
+from experiments.pretraining_datasets.common_corpus import (
+    common_corpus_download,
+    tokenize_common_corpus,
+)
 from experiments.pretraining_datasets.nsf_awards import (
     nsf_awards_download,
     nsf_awards_tokenized,
@@ -117,6 +121,11 @@
         "download": dolmino_downloads["dolmino"],
         "tokenize_fn": lambda: {"dolmino_math/all": tokenize_dolmino_math()},
     },
+    "common_corpus": {
+        "subsets": ["all"],
+        "download": common_corpus_download,
+        "tokenize_fn": lambda: {"common_corpus/all": tokenize_common_corpus()},
+    },
     "nsf_awards": {
         "subsets": ["all"],
         "download": nsf_awards_download,
