Commit f0dd18b

Add benchmark plots; add autobatchsize update
1 parent 10b912e commit f0dd18b

14 files changed, +2039 -301 lines changed

CLAUDE.md

Lines changed: 26 additions & 4 deletions
@@ -1,4 +1,4 @@
-Always test your changes by running the appropriate script or CLI command. Never complete a task without testing your changes until the script or CLI command runs without issues for 3 minutes+ (at minimum). If you find an error unrelated to your task, at minimum quote the exact error back to me when you have completed your task and offer to investigate and fix it.
+Always test your changes by running the appropriate script or CLI command. Never complete a task without testing your changes until the script or CLI command runs without issues. If it's a long-running script let it run for at least a few iterations of the main loop. If you find an error unrelated to your task, at minimum quote the exact error back to me when you have completed your task and offer to investigate and fix it.
 
 ## Project Structure and Conventions
 
@@ -18,17 +18,39 @@ Put imports at the top of the file unless you have a good reason to do otherwise
 
 # Development
 
+Never use try/except blocks - fail fast, fail explicitly.
+
+Never use "fallbacks".
+
+Do not write lines longer than 88 characters.
+
+Don't use ALL CAPS unless it's proper English (e.g. an acronym).
+
+Don't keep default run path values inside low level code - if a module calls another module, the higher level module should always pass through inject a base path.
+
+Don't save data to a directory that is not in the .gitignore - especially the data/ directory.
+
+Don't remove large datasets from the HF cache without asking.
+
 You can call CLI commands without prefixing `python -m`, like `bergson build`.
 
 Use `pre-commit run --all-files` if you forget to install pre-commit and it doesn't run in the hook.
 
 Run bash commands in the dedicated tmux pane named "claude" if it is available.
 
-Don't keep default run path values inside low level code - if a module calls another module, the higher level module should always pass through inject a base path.
+Don't betray lineage. An example of betraying lineage is duplicating a file, making changes in the duplicate, then calling it "foo_fixed" rather than "foo". Instead, commit the file and modify it directly. Another example is adding a RoundButton to a module containing a Button but not updating the original Button to be called RectangleButton. This betrays that the rectangular button was written first.
 
-Don't save data to a directory that is not in the gitignore - especially the data/ directory.
+If you think some data files (e.g. CSVs) have been invalidated but you're not 100% sure, you can add them to a .gitignore'd archive directory along with an equivalentally named markdown file explaining the context.
 
-Don't remove large datasets from the HF cache without asking.
+File names always use snake case - in_memory, not inmemory.
+
+When writing files to disk python scripts should choose their own filenames but be provided with their file paths.
+
+### Documentation
+
+Do not mark documentation for code that has been removed as deprecated - simply remove the documentation.
+
+No context leakage: do not write code or comments that link features to the specific experiment for which the feature was developed, unless it's only useful for that particular experiment. Be as generic as is correctly possible and not more.
 
 ### Tests
 

bergson/build.py

Lines changed: 23 additions & 0 deletions
@@ -3,16 +3,21 @@
 import shutil
 from dataclasses import asdict
 from datetime import timedelta
+from pathlib import Path
 
 import torch
 import torch.distributed as dist
 from datasets import Dataset, IterableDataset
 from tqdm.auto import tqdm
 
 from bergson.collection import collect_gradients
+from bergson.collector.gradient_collectors import GradientCollector
 from bergson.config import IndexConfig
 from bergson.data import allocate_batches
 from bergson.distributed import launch_distributed_run
+from bergson.utils.auto_batch_size import (
+    determine_batch_size,
+)
 from bergson.utils.utils import assert_type, setup_reproducibility
 from bergson.utils.worker_utils import (
     create_processor,
@@ -63,6 +68,24 @@ def build_worker(
     model, target_modules = setup_model_and_peft(cfg)
     processor = create_processor(model, ds, cfg, target_modules)
 
+    # Auto batch size determination if enabled
+    if cfg.autobatchsize:
+        cfg.token_batch_size = determine_batch_size(
+            root=Path(".cache"),
+            cfg=cfg,
+            model=model,
+            collector=GradientCollector(
+                model=model.base_model,
+                cfg=cfg,
+                processor=processor,
+                target_modules=target_modules,
+                data=ds,
+                scorer=None,
+                reduce_cfg=None,
+            ),
+            starting_batch_size=cfg.token_batch_size,
+        )
+
     attention_cfgs = {module: cfg.attention for module in cfg.split_attention_modules}
 
     kwargs = {

bergson/config.py

Lines changed: 3 additions & 0 deletions
@@ -144,6 +144,9 @@ class IndexConfig:
     token_batch_size: int = 2048
     """Batch size in tokens for building the index."""
 
+    autobatchsize: bool = False
+    """Whether to automatically determine the optimal batch size."""
+
     processor_path: str = ""
     """Path to a precomputed processor."""
 
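
The hunks shown here wire `determine_batch_size` into `build_worker` behind the new `autobatchsize` flag, but the body of that function is not part of this excerpt. As a rough illustration only, the sketch below shows one common way such a search can work: start from the configured batch size and halve it until a trial step fits. The names `find_largest_batch_size`, `fits`, and `min_batch_size` are hypothetical and are not taken from the bergson code base; the real implementation may differ (for example, it receives a `root` path and a collector, whose use is not visible here).

```python
from typing import Callable


def find_largest_batch_size(
    fits: Callable[[int], bool],
    starting_batch_size: int,
    min_batch_size: int = 1,
) -> int:
    """Halve the candidate batch size until `fits` accepts it.

    `fits` is a hypothetical probe: it should run one trial step at the
    given batch size and report whether the step stayed within memory.
    """
    if starting_batch_size < min_batch_size:
        raise ValueError("starting_batch_size is below min_batch_size")

    batch_size = starting_batch_size
    while batch_size >= min_batch_size:
        if fits(batch_size):
            return batch_size
        # The trial step did not fit, so try half the size next.
        batch_size //= 2

    # Fail explicitly instead of silently picking a default value.
    raise RuntimeError("no batch size at or above min_batch_size fits")


# Toy probe for illustration: pretend anything up to 600 tokens fits.
chosen = find_largest_batch_size(lambda n: n <= 600, starting_batch_size=2048)
print(chosen)  # 2048 -> 1024 -> 512, so this prints 512
```

The sketch raises when nothing fits rather than returning a default, in keeping with the "fail fast" and no-"fallbacks" rules added to CLAUDE.md in this commit.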
