[wip] OLMo3 anneals #126
Open
aetting wants to merge 193 commits into main from olmo3-anneals
Commits
2740d49 add round1 anneal configs (aetting)
39da9f2 Pin to swafix and add smoketest config (undfined)
e765967 Trainer config updates (undfined)
14233fa More fixes (undfined)
d753873 oops (undfined)
528bb19 More config tweaks (undfined)
610a9de Imports (undfined)
2a3e8cb Fix for WSD class bug (undfined)
068b776 Match ac_config from swafix (undfined)
613ab1a Match sliding window changes (undfined)
4bbb982 More shenans (undfined)
65eb9e4 Typo (undfined)
a1ac0da comment (undfined)
1353c5a Can't load state with new dataset (undfined)
7c523ee OOM (undfined)
b9df024 olmo3 settings and new paths (aetting)
09f3a38 resources and web name (aetting)
13ef591 new web paths (aetting)
7b2afe8 Use improved scheduler branch (undfined)
1b35a4d Merge branch 'undfined/swafix-core' into olmo3-anneals (aetting)
335283b update round1 anneal paths (missing two) (aetting)
8d834bb update example configs (aetting)
657084e add web paths (aetting)
c167043 Merge branch 'undfined/swafix-core' into olmo3-anneals (aetting)
de07e7a consistency updates (aetting)
71cf247 Use new dolmino math and update weights (undfined)
6104825 Allow repetitions in hqweb (undfined)
953ae4a Not enough tokens for dolmino (undfined)
e614cae Adjust reddit target (undfined)
70684e2 Try double rbz (undfined)
95859f9 oops (undfined)
ec6a6bc Back to 8192 rbz (undfined)
106f175 try with float8 (undfined)
035d19a Newer torch (undfined)
36cc983 dp tweaks (undfined)
2520bdb match pretrain (undfined)
aee96c2 More tweaks for large job (undfined)
8b085bd baseline dolmino anneal config (aetting)
d897712 paths bucket and format fix (aetting)
95ef206 mj anneals rd1
4339fdd Tweaks for mj anneals (undfined)
a0544c2 Rejiggered ratios for OMR rewrites
e3b8261 merge
a562e4c restore trainer state from save folder (epwalsh)
8520a13 Merge pull request #127 from allenai/epwalsh/olmo3-anneals (aetting)
dba258a update example priority (aetting)
2979e40 restore model_and_optim (aetting)
a209758 add lr-test-config (aetting)
ba57e1f Added a bunch of nanoanneals
e595476 Merge branch 'olmo3-anneals' of github.com:allenai/olmo-cookbook into…
7983868 fixed typo
9fe0df4 added submodular dolmino math curves
70a1a94 path format consistency (aetting)
a11ea98 fix name (aetting)
47211e3 added gs->weka tool
14d46c9 Merge branch 'olmo3-anneals' of github.com:allenai/olmo-cookbook into…
148b539 idk this is some luca thing maybe\?
6cbe734 Added convert from config
c78dc1e diff convert
acbd449 tyler wanted me to do this, idk
fd233ad Merge branch 'main' into olmo3-anneals
de49cd6 Adds v2 hq fim stackedu microanneal (undfined)
cd707d9 convert with custom branching
14b95d3 Merge branch 'olmo3-anneals' of github.com:allenai/olmo-cookbook into…
90ac849 Adds v2++ hq fim stackedu microanneal (undfined)
1670251 Merge branch 'olmo3-anneals' of github.com:allenai/olmo-cookbook into… (undfined)
ba62835 add 7T anneals (aetting)
b41ca95 step update (aetting)
d06c704 fix run names (aetting)
12befcd add ae microanneals (aetting)
6d42d1d bump up rank microbatch size (aetting)
e7e55ac Added eval script for midtraining
40408cf uncomment
5ed2c7b merge
d849517 Update README.md (revbucket)
9dc008a add testrun (aetting)
d5f4d88 Rename olmo2 anneals and add olmo3-fim-code configs (undfined)
b1dd917 Added 'missing eval' stuff
f0eeaa6 Too many workers counting tokens (undfined)
8eb33fc Merge branch 'olmo3-anneals' of github.com:allenai/olmo-cookbook into…
c471589 merged davidhs backfill stuff
8645829 increment eval version
0657fb0 Wrong weight for hqweb (undfined)
29a3653 Merge branch 'olmo3-anneals' of github.com:allenai/olmo-cookbook into… (undfined)
38aeeea add highthresh diverse qa config (aetting)
6fd7250 Added mjnewmath-bestof
aba11c2 add wip anneal round 2 config (aetting)
2144071 added kodkode mjicroanneals
3f32ad1 Merge branch 'olmo3-anneals' of github.com:allenai/olmo-cookbook into…
079cdf5 update reasoning and math ratios (aetting)
8ae28c7 add round2 8T (aetting)
371e551 8 nodes (aetting)
ed104ab update reasoning paths and run names (aetting)
687bce5 updated code path (aetting)
fc81e62 updated paths (soldni)
830c5c1 adjusting ratios (soldni)
35d7cde merged main
d2cd5a9 add follow-up reasoning microanneals (aetting)
e8dc411 Adds 10b anneal with 35/30/35 web/code/etc ratios (undfined)
984e751 add reddit lowthresh663 microanneal (aetting)
5773e19 added megamath-web-pro-max anneals
7182650 more reddit lowthresh microanneals (aetting)
647294d fix nonmc name (aetting)
bee78f9 cleaned up
0b90f6b Merge branch 'olmo3-anneals' of github.com:allenai/olmo-cookbook into…
b108729 remove model_and_optim in load_path (aetting)
c18c541 add round 3 macroanneal configs (aetting)
bc2c081 add 12T configs (aetting)
31e480b lowthresh mcplusfull (aetting)
bcf3459 added rewrite checks
aa985bf convert from config hashes updated
d187de5 Added swallow anneals
e66c626 lowthresh add context v1 (aetting)
0825df8 fix path (aetting)
68c6fee add 200B round3 config (aetting)
0401ec8 more web paths (aetting)
6ef5068 adjust math ratios and code path (aetting)
a9de822 adjust reasoning ratios (aetting)
795e26d adjust reasoning ratios (aetting)
2cf4c9c update name (aetting)
b12faae 16 nodes (aetting)
59d42fc add omr fullthoughts baseline (aetting)
fafdcc5 psgqa microanneal (aetting)
fa8d330 psgqa microanneal name (aetting)
da8433d psgqa microanneal name (aetting)
30d25b3 add no reasoning no instruct (aetting)
fbf6ff9 add no reasoning no instruct (aetting)
b50797f add nodes (aetting)
a5d0d3b fix dolmino ratio (aetting)
65f7ebf add sub8k llamanemotron (aetting)
cc1f690 Added check of swallowmatt stuff
d79785a bumped nodes on fm4p
c71cd1a Added megamatt test anneals
05249d8 changed names
78bb0e4 some more swallowmath diversity experiments
5193269 correct token counts
1cf8997 add round-4-decon macroanneal (aetting)
1150623 remove outdated comments (aetting)
30d554e millianneal for swallowcode sgcr
9be0874 Merge branch 'olmo3-anneals' of github.com:allenai/olmo-cookbook into…
15aa880 updated math ratios (aetting)
bd0b76a added scor config
93e3349 Merge branch 'olmo3-anneals' of github.com:allenai/olmo-cookbook into…
8de3f02 add round 5 wip config (aetting)
c4b7b19 updated midtrain eval script
fbfa3da Merge branch 'olmo3-anneals' of github.com:allenai/olmo-cookbook into…
61850cb updated paths and token counts (aetting)
d2bc774 updated fim token count (aetting)
82fa294 full set of faeze reasoning paths (aetting)
07b20b3 add pdf and web p* ratios (aetting)
fbf6e0f remove old sections (aetting)
2ecd073 tweak ratio formatting (aetting)
eb323a2 remove old instruction (aetting)
1be7856 remove zero topics (aetting)
ff24ef3 all sponge paths (aetting)
18cbf13 add empty set check (aetting)
4cd3d60 fix fim path (aetting)
c4ea958 Merge pull request #151 from allenai/ae-debug-round5 (aetting)
058c175 glob in sponge path (aetting)
fa23fcb added megamath anneals for rewrite saliency
72be05c Merge branch 'olmo3-anneals' of github.com:allenai/olmo-cookbook into…
4b6a99e mixing gen/mc validation configs (aetting)
9abb64b rename runs (aetting)
e9b938e urgent priority (aetting)
4342883 Added swallowmatt2 tests
a053dc0 downbumped priority
1a6d58e add 3 vs 2.5 comparison microanneals (aetting)
6956874 added restart
d7a4da1 swm2 restart | good name
60691df Adds config for compression filtered code fim microanneal (undfined)
9e6112b Added swallowcode2 anneals
4ec7e0b Merge branch 'olmo3-anneals' of github.com:allenai/olmo-cookbook into…
ea60c03 fixed name for lint Q4
5df6ec0 Added tinyMath POT mjicroanneals
6e171ab Added tinyMath POT mjicroanneals
c32e56e Added olmo2.5 mjicroanneals
dd7d0ff Added MIND anneals
132b2e2 fixed gs2weka for better pyvenv stuff
e54fabb TinyMATH3 PoT
7c8141b Added tinymath3-pot
4108234 Added tinymath4 pot
717a2b4 Added Pot2 of tinymath4
f390610 Added tinymath4 MIND
9df2260 Added final tinyMATH runs
8a6eb27 Added final tinyMATH runs
6c4ea11 Added swallowcode sgcr multi stuff
de5e2ba Added allholy decon
b122f1f added scor-py
b1778f2 Added swallowCodeMulti SCOR
8097269 Added megamatt anneal
e93b696 Switched workspace to one that has nonpreempt slots
97a61a3 back to microanneal workspace
f5b67a6 Added cranecode micro
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,266 @@ | ||
| #!/usr/bin/env python3 | ||
| """ | ||
| Script to process YAML file and run olmo-cookbook command with latest checkpoint | ||
| """ | ||
| import argparse | ||
| import re | ||
| import subprocess | ||
| import sys | ||
| from pathlib import Path | ||
|
|
||
| import yaml | ||
|
|
||
|
|
||
| def run_command(cmd, shell=False, errs_okay=False): | ||
| """Run a shell command and return stdout""" | ||
| try: | ||
| if shell: | ||
| result = subprocess.run(cmd, shell=True, capture_output=True, text=True, check=True) | ||
| else: | ||
| result = subprocess.run(cmd, capture_output=True, text=True, check=True) | ||
| return result.stdout.strip() | ||
| except subprocess.CalledProcessError as e: | ||
| print(f"Error running command: {' '.join(cmd) if isinstance(cmd, list) else cmd}") | ||
| print(f"Error: {e.stderr}") | ||
| if not errs_okay: | ||
| sys.exit(1) | ||
| raise e | ||
|
|
||
|
|
||
| def get_yaml_name(yaml_file): | ||
| """Extract the 'name' attribute from YAML file""" | ||
| try: | ||
| with open(yaml_file, "r") as f: | ||
| data = yaml.safe_load(f) | ||
|
|
||
| if "name" not in data: | ||
| print(f"Error: 'name' attribute not found in {yaml_file}") | ||
| sys.exit(1) | ||
|
|
||
| return data["name"] | ||
| except Exception as e: | ||
| print(f"Error reading YAML file {yaml_file}: {e}") | ||
| sys.exit(1) | ||
|
|
||
|
|
||
| def get_beaker_name(): | ||
| """Get the NAME from 'beaker account whoami' output""" | ||
| output = run_command(["beaker", "account", "whoami"]) | ||
|
|
||
| # Parse the table output to extract NAME | ||
| lines = output.strip().split("\n") | ||
| if len(lines) < 2: | ||
| print("Error: Unexpected output from 'beaker account whoami'") | ||
| sys.exit(1) | ||
|
|
||
| # Look for the data row (skip header) | ||
| for line in lines[1:]: | ||
| parts = line.split() | ||
| if len(parts) >= 2: | ||
| return parts[1] # NAME is the second column | ||
|
|
||
| print("Error: Could not extract NAME from beaker account whoami output") | ||
| sys.exit(1) | ||
|
|
||
|
|
||
| def find_latest_checkpoint(beaker_name, yaml_name): | ||
| """Find the latest checkpoint directory in weka""" | ||
|
|
||
| weka_path = f"weka://oe-training-default/ai2-llm/checkpoints/{beaker_name}/{yaml_name}-*" | ||
|
|
||
| # Convert weka:// path to s3:// path for s5cmd | ||
| s3_path = weka_path.replace("weka://oe-training-default/", "s3://oe-training-default/") | ||
|
|
||
| # Add wildcard to check for any files in the directory | ||
| s3_path_wildcard = f"{s3_path}/*" | ||
|
|
||
| print(f"Checking if weka path exists: {weka_path}") | ||
| print(f"Using s5cmd to check: {s3_path_wildcard}") | ||
|
|
||
| cmd = [ | ||
| "s5cmd", | ||
| "--profile", | ||
| "WEKA", | ||
| "--endpoint-url", | ||
| "https://weka-aus.beaker.org:9000", | ||
| "ls", | ||
| s3_path_wildcard, | ||
| ] | ||
|
|
||
| try: | ||
| output = run_command(cmd, errs_okay=True) | ||
| if not output: | ||
| print(f"No checkpoints found with prefix: {prefix}") | ||
| sys.exit(1) | ||
|
|
||
| # Get all matching paths | ||
| paths = output.strip().split("\n") | ||
|
|
||
| # Sort paths to get the latest one (lexicographically) | ||
| paths = [_.split(" ")[-1].strip() for _ in paths] | ||
| ckpts = set() | ||
| for p in paths: | ||
| re_string = yaml_name + r"-[0-9a-f]{8}/step\d+/" | ||
| if re.match(re_string, p): | ||
| ckpts.add(re.match(re_string, p).group()) | ||
| assert ( | ||
| len(ckpts) > 0 | ||
| ), "No valid checkpoints found??? [this assert should never fail if we got here to begin with]" | ||
| max_ckpt = max(ckpts) | ||
| print(max_ckpt) | ||
| return "weka://oe-training-default/ai2-llm/checkpoints/%s/%s" % (beaker_name, max_ckpt) | ||
|
|
||
| except subprocess.CalledProcessError as e: | ||
| print("No weka paths found!") | ||
| print( | ||
| f"Make sure you have access to weka://oe-training-default/ai2-llm/checkpoints/{beaker_name}/{yaml_name}-* directories" | ||
| ) | ||
| raise e | ||
| # sys.exit(1) | ||
| except Exception as e: | ||
| print("ERR CODE ", e) | ||
| raise e | ||
|
|
||
|
|
||
| def check_hf_path_exists(latest_ckpt): | ||
| """Check if the corresponding weka path already exists""" | ||
| # Derive the converted-checkpoint path by appending "-hf" to the checkpoint path | ||
| hf_path = latest_ckpt.rstrip("/") + "-hf/*" | ||
|
|
||
| print(f"Checking if weka path exists: {hf_path}") | ||
| cmd = [ | ||
| "s5cmd", | ||
| "--profile", | ||
| "WEKA", | ||
| "--endpoint-url", | ||
| "https://weka-aus.beaker.org:9000", | ||
| "ls", | ||
| hf_path, | ||
| ] | ||
|
|
||
| # Convert weka:// path to s3:// path for s5cmd | ||
| hf_path = hf_path.replace("weka://oe-training-default/", "s3://oe-training-default/") | ||
|
|
||
| print(f"Checking if weka path exists: {hf_path}") | ||
| cmd = [ | ||
| "s5cmd", | ||
| "--profile", | ||
| "WEKA", | ||
| "--endpoint-url", | ||
| "https://weka-aus.beaker.org:9000", | ||
| "ls", | ||
| hf_path, | ||
| ] | ||
|
|
||
| try: | ||
| # Run the command - if it succeeds, the path exists | ||
| output = run_command(cmd, errs_okay=True) | ||
| print(f"✅ Weka path exists - found %s files:" % len(output.split("\n"))) | ||
| return True | ||
| except subprocess.CalledProcessError as e: | ||
| # If the command fails, the path doesn't exist | ||
| print(f"❌ Weka path does not exist (s5cmd failed as expected)") | ||
| return False | ||
|
|
||
|
|
||
| def run_olmo_cookbook(weka_path): | ||
| """Run the olmo-cookbook command with the GCS path""" | ||
| print("Converting %s" % weka_path) | ||
| weka_path = weka_path.replace("weka://", "/").rstrip("/") | ||
| cmd = [ | ||
| "olmo-cookbook-eval", | ||
| "convert", | ||
| weka_path, | ||
| "-t", | ||
| "olmo-core-v2", | ||
| "--use-beaker", | ||
| "--huggingface-transformers-git-url", | ||
| "https://github.com/2015aroras/transformers.git", | ||
| "--huggingface-transformers-commit-hash", | ||
| "ae3889ced6ed7362e5883671fc6dc4cb4fece5fa", | ||
| "--olmo-core-v2-commit-hash", | ||
| "57a04d0b69047d797c96eede056a211e75b5914a", | ||
| ] | ||
| print(f"Running: {' '.join(cmd)}") | ||
|
|
||
| try: | ||
| # Run the command and stream output in real-time | ||
| process = subprocess.Popen( | ||
| cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True, bufsize=1, universal_newlines=True | ||
| ) | ||
|
|
||
| beaker_url = None | ||
| beaker_url_pattern = re.compile(r"https://beaker\.org/ex/[A-Z0-9]+") | ||
|
|
||
| for line in process.stdout: | ||
| print(line, end="") | ||
|
|
||
| # Look for the beaker URL in the output | ||
| match = beaker_url_pattern.search(line) | ||
| if match: | ||
| beaker_url = match.group(0) | ||
|
|
||
| process.wait() | ||
|
|
||
| if process.returncode != 0: | ||
| print(f"Error: olmo-cookbook command failed with return code {process.returncode}") | ||
| sys.exit(1) | ||
|
|
||
| # Print the extracted Beaker URL | ||
| if beaker_url: | ||
| print(f"\n" + "=" * 60) | ||
| print(f"🔗 Beaker Experiment URL: {beaker_url}") | ||
| print(f"=" * 60) | ||
| return beaker_url | ||
| else: | ||
| print("\nWarning: Could not extract Beaker experiment URL from output") | ||
|
|
||
| except Exception as e: | ||
| print(f"Error running olmo-cookbook: {e}") | ||
| sys.exit(1) | ||
|
|
||
|
|
||
| def main(): | ||
| parser = argparse.ArgumentParser(description="Process YAML file and run olmo-cookbook with latest checkpoint") | ||
| parser.add_argument("yaml_file", help="Path to the YAML file") | ||
| parser.add_argument("--beaker-name", required=False, default=None) | ||
| parser.add_argument("--overwrite", action="store_true", default=False) | ||
| args = parser.parse_args() | ||
|
|
||
| # Validate input file exists | ||
| if not Path(args.yaml_file).exists(): | ||
| print(f"Error: YAML file {args.yaml_file} does not exist") | ||
| sys.exit(1) | ||
|
|
||
| print(f"Processing YAML file: {args.yaml_file}") | ||
|
|
||
| # Step 1: Get name from YAML | ||
| yaml_name = get_yaml_name(args.yaml_file) | ||
| print(f"YAML name: {yaml_name}") | ||
|
|
||
| # Step 2: Get beaker name | ||
| if args.beaker_name is None: | ||
| beaker_name = get_beaker_name() | ||
| else: | ||
| beaker_name = args.beaker_name | ||
| print(f"Beaker name: {beaker_name}") | ||
|
|
||
| # Step 3: Find latest checkpoint | ||
| print( | ||
| f"Searching for checkpoints with prefix: weka://oe-training-default/ai2-llm/checkpoints/{beaker_name}/{yaml_name}-" | ||
| ) | ||
| latest_checkpoint = find_latest_checkpoint(beaker_name, yaml_name) | ||
| print(f"Latest checkpoint: {latest_checkpoint}") | ||
|
|
||
| # Step 4: Check if weka path already exists | ||
| if check_hf_path_exists(latest_checkpoint) and not args.overwrite: | ||
| print(f"\n🚫 Converted checkpoint already exists in weka storage. Skipping cookbook command.") | ||
| print(f"The checkpoint has already been copied to weka://oe-training-default/") | ||
| return | ||
|
|
||
| # Step 5: Run olmo-cookbook command | ||
| run_olmo_cookbook(latest_checkpoint) | ||
|
|
||
|
|
||
| if __name__ == "__main__": | ||
| main() | ||
Bug: Undefined Variable in Error Message
The `find_latest_checkpoint` function references an undefined `prefix` variable in an error message. If no checkpoints are found, this will lead to a `NameError` at runtime.
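One possible fix is to build the message only from names that are already in scope in `find_latest_checkpoint`. The sketch below assumes the message should report the same search prefix that `main()` prints before calling the function. The helper name `missing_checkpoint_message` is hypothetical and is not part of the PR.

```python
def missing_checkpoint_message(beaker_name: str, yaml_name: str) -> str:
    """Build the 'no checkpoints' message from the function's own arguments.

    Hypothetical helper: it rebuilds the search prefix locally, so the
    message cannot hit a NameError from an undefined `prefix` variable.
    """
    prefix = f"weka://oe-training-default/ai2-llm/checkpoints/{beaker_name}/{yaml_name}-"
    return f"No checkpoints found with prefix: {prefix}"


if __name__ == "__main__":
    # Example values are placeholders, not real account or run names.
    print(missing_checkpoint_message("some-user", "some-run"))
```

Inside `find_latest_checkpoint`, the `print` call in the empty-output branch could use this string, or simply interpolate `weka_path`, which is already defined at that point.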