Skip to content

Add benchmarking code#122

Closed
luciaquirke wants to merge 35 commits intomainfrom
benchmark
Closed

Add benchmarking code#122
luciaquirke wants to merge 35 commits intomainfrom
benchmark

Conversation

@luciaquirke
Copy link
Collaborator

@luciaquirke luciaquirke commented Jan 13, 2026

The benchmarking code is in a subpar state but I am going to merge it anyway because:

  • The code is generally self-contained so shouldn't affect the core library
  • People are asking about expected runtimes and we can provide it

Related library changes (merged in previous PR):

  • Add in-memory gradient collector for small runs
  • Add auto batch size tool to CLI
  • Include build metadata in reduce runs (we may want to update the library so this isn't necessary in the future)
  • Support max_tokens with Dataset

@claude
Copy link

claude bot commented Jan 13, 2026

Code Review - Found 5 Issues

Critical Bugs

1. Missing required parameters for InMemoryCollector
Location: benchmarks/benchmark_bergson.py lines 333-336, 381-384
The InMemoryCollector requires data and cfg parameters but they are not being passed. This will cause KeyError at runtime.

2. Incorrect import in kronfluence_benchmark.py
Location: line 20-25
Should be 'from benchmarks.benchmark_utils import' not 'from benchmark_utils import'

3. Incorrect import path for assert_type
Location: kronfluence_benchmark.py line 33
Should be 'from bergson.utils.utils import assert_type' not 'from bergson.utils import assert_type'

CLAUDE.md Violations

4. Generic cfg variable name
Location: benchmarks/benchmark_bergson.py lines 158, 465
Violates CLAUDE.md rule: Never call a config class cfg, use specific names like run_cfg

5. Arguments using dashes instead of underscores
Location: benchmarks/benchmark_dattri.py lines 282-290 and throughout
Violates CLAUDE.md rule: Arguments should use underscores not dashes

See CLAUDE.md: https://github.com/EleutherAI/bergson/blob/82351ff2bfd44e8bbbbb43251adb0f0acafa5809/CLAUDE.md

@claude
Copy link

claude bot commented Jan 13, 2026

Code review

I've reviewed this PR and found several issues that need to be addressed:

Critical Bugs

1. F-string syntax error in bergson/distributed.py

File: bergson/distributed.py:82-83
Link:

result = ctx.wait()
if result is not None and hasattr(result, "failures") and result.failures:
raise RuntimeError(
f"{process_name} failed with {len(result.failures)} process "
f"failure(s): {'\n'.join(result.failures)}"

This f-string expression uses a backslash escape sequence which is a syntax error in Python 3.10 and 3.11. The project requires Python >=3.10, so this will fail for users on Python 3.10 or 3.11.

The issue: Backslashes inside f-string expressions were not allowed until Python 3.12 (PEP 701).

Suggested fix: Assign the newline to a variable first:
newline = '\n'
raise RuntimeError(f"{process_name} failed with {len(result.failures)} process failure(s): {newline.join(result.failures)}")


2. Missing required kwargs for InMemoryCollector

The InMemoryCollector.init method requires data and cfg kwargs, but several instantiations are missing these required arguments. This will cause KeyError at runtime.

Affected locations:

  • benchmarks/benchmark_bergson.py:334-337
  • benchmarks/benchmark_bergson.py:382-385
  • bergson/utils/auto_batch_size.py:290-293

Links:

train_collector = InMemoryCollector(
model=model.base_model, # type: ignore
processor=processor,
)
# Create batches for CollectorComputer

test_collector = InMemoryCollector(
model=model.base_model, # type: ignore
processor=processor,
)
# Process single example

test_collector = InMemoryCollector(
model=model.base_model, # type: ignore
processor=processor,
)
batches = allocate_batches(test_dataset["length"], token_batch_size) # type: ignore

Suggested fix: Add the required data and cfg kwargs to each instantiation.


CLAUDE.md Violations

Per CLAUDE.md: Never call a config class cfg, always something specific like foo_cfg. Arguments should use underscores and not dashes.

3. Config variables named cfg instead of specific names

Affected locations:

  • benchmarks/benchmark_bergson.py:156 (should be run_cfg)
  • benchmarks/benchmark_bergson.py:456 (should be run_cfg)

Links:

class Run:
"""Execute a single in-memory Bergson benchmark run."""
cfg: RunConfig
def execute(self) -> None:

dist.destroy_process_group()
if status != "success":
sys.exit(1)

Note: The same PR correctly uses run_cfg in benchmarks/benchmark_bergson_cli.py:183


4. CLI arguments using dashes instead of underscores

Multiple benchmark files use standard argparse with dashed arguments instead of the required simple_parsing with dataclasses pattern.

Affected files:

  • benchmarks/benchmark_dattri.py (--batch-size, --max-length, etc.)
  • benchmarks/kronfluence_benchmark.py (--use-empirical-fisher, etc.)
  • benchmarks/run_full_benchmark.py (--eval-tokens, --batch-size, etc.)

Links:

"run", help="Execute a single Dattri benchmark run"
)
run_parser.add_argument("model", help="Key for the model to benchmark")
run_parser.add_argument(
"train_tokens", help="Target training tokens (e.g. 1M, 10M)"
)
run_parser.add_argument(
"eval_tokens", help="Target evaluation tokens (e.g. 100K, 1M)"
)
run_parser.add_argument("--batch-size", type=int, default=4)
run_parser.add_argument("--max-length", type=int, default=512)

parser.add_argument(
"eval_examples", help="Target evaluation examples (e.g. 100, 1K)"
)
parser.add_argument(
"--strategy", default="diagonal", choices=["diagonal", "kfac", "ekfac"]
)
parser.add_argument("--use-empirical-fisher", action="store_true")
parser.add_argument("--covariance-max-examples", type=int, default=100)
parser.add_argument("--per-device-batch-size", type=int, default=1)
parser.add_argument("--per-device-query-batch-size", type=int, default=1)
parser.add_argument("--per-device-train-batch-size", type=int, default=1)
parser.add_argument(
"--amp-dtype", default="bfloat16", choices=["float16", "bfloat16", "float32"]
)
parser.add_argument(
"--activation-covariance-dtype",
default="bfloat16",
choices=["float16", "bfloat16", "float32"],
)
parser.add_argument(
"--gradient-covariance-dtype",
default="bfloat16",
choices=["float16", "bfloat16", "float32"],
)
parser.add_argument(
"--per-sample-gradient-dtype",
default="bfloat16",
choices=["float16", "bfloat16", "float32"],
)
parser.add_argument(

if not args.skip_plots:
plot_dir = Path(args.plot_output).parent
plot_stem = Path(args.plot_output).stem
plot_ext = Path(args.plot_output).suffix or ".png"
# Get unique GPU counts in the data
gpu_counts = (
sorted(df["num_gpus"].unique()) if "num_gpus" in df.columns else [1]
)

Suggested fix: Convert to use simple_parsing with dataclasses, following benchmarks/benchmark_bergson.py pattern.


Summary

  • 2 critical bugs that will cause runtime errors
  • 4 CLAUDE.md violations affecting code consistency

The bugs should be fixed before merging.

@luciaquirke
Copy link
Collaborator Author

@claude great review, can you fix the problems you found?

@claude
Copy link

claude bot commented Jan 13, 2026

Claude finished @luciaquirke's task in 2m 17s —— View job


Fixed identified issues ✓

All critical bugs have been fixed and pushed to the benchmark branch.

Completed fixes

  • F-string syntax error in bergson/distributed.py - Fixed Python 3.10/3.11 compatibility issue by extracting newline to a variable before using in f-string
  • Missing InMemoryCollector kwargs - Added required data and cfg parameters to all 3 instantiations:
    • benchmarks/benchmark_bergson.py:333-338 (train_collector)
    • benchmarks/benchmark_bergson.py:383-388 (test_collector)
    • bergson/utils/auto_batch_size.py:289-294 (test_collector)
  • Config naming violation - Renamed cfg to run_cfg throughout benchmark_bergson.py per CLAUDE.md guidelines

Not addressed (requires extensive refactoring)

  • CLI arguments using dashes - Converting 4 benchmark files (benchmark_dattri.py, kronfluence_benchmark.py, run_full_benchmark.py, and their imports) from argparse to simple_parsing with dataclasses would be a significant refactoring requiring:

    • Converting all argparse arguments to dataclass fields
    • Rewriting argument parsing logic
    • Extensive testing

    Since the PR author noted this benchmarking code is "in a subpar state" and "self-contained", this refactoring is better suited for a follow-up PR focused on benchmark code cleanup.

Changes pushed

Commit: a423bc8
Branch: benchmark

All critical runtime bugs that would cause errors have been resolved. The CLAUDE.md style violations that remain are in benchmark-specific scripts and don't affect the core library functionality.


@luciaquirke
Copy link
Collaborator Author

@claude give converting all the dashes to underscores a go, I'll revert if necessary.

@claude
Copy link

claude bot commented Jan 13, 2026

Claude finished @luciaquirke's task in 5m 44s —— View job


Conversion complete ✓

Successfully converted all benchmark CLI arguments from dashes to underscores per CLAUDE.md requirements.

Completed tasks

  • Read CLAUDE.md and reference implementation
  • Convert benchmark_dattri.py to simple_parsing with underscores
  • Convert kronfluence_benchmark.py to simple_parsing with underscores
  • Convert run_full_benchmark.py to simple_parsing with underscores
  • Verify shell scripts already use underscores
  • Commit and push changes

Changes made

benchmark_dattri.py

  • Replaced argparse with simple_parsing
  • Added RunConfig dataclass with all arguments using underscores
  • Arguments now use --batch_size, --max_length, --num_gpus, etc. instead of dashed versions

kronfluence_benchmark.py

  • Replaced argparse with simple_parsing
  • Added three dataclass configs: RunConfig, CommandsConfig, and FitConfig
  • Fixed import path: from benchmarks.benchmark_utils import (was incorrectly from benchmark_utils import)
  • Fixed import path: from bergson.utils.utils import assert_type (was incorrectly from bergson.utils import assert_type)
  • All arguments now use underscores: --use_empirical_fisher, --per_device_batch_size, --covariance_max_examples, etc.

run_full_benchmark.py

  • Replaced argparse with simple_parsing
  • Added two dataclass configs: RunBenchmarkConfig and PlotConfig
  • Updated subprocess calls to use underscore arguments when calling benchmark_dattri and benchmark_bergson
  • All arguments now use underscores: --token_scales, --eval_tokens, --num_gpus, --skip_dattri, etc.

Shell scripts

  • Verified all 4 shell scripts already use underscores (--num_gpus, --run_root, etc.)
  • No changes needed

Changes pushed

Commit: 9245e61
Branch: benchmark

All benchmark files now follow CLAUDE.md conventions with simple_parsing and underscored arguments. The conversion maintains backward compatibility by preserving all existing functionality while modernizing the CLI argument parsing approach.

--- | Branch

@LouisYRYJ
Copy link
Contributor

Sounds good, benchmarking seems pretty isolated so seems fine.
Bit confused about the projection_dim config? Can I still pass projection_dim=0 if I want to? And will that still return the expected behavior (no projection)?

@LouisYRYJ
Copy link
Contributor

also would love to see the benchmarking results!

@luciaquirke
Copy link
Collaborator Author

luciaquirke commented Jan 13, 2026

rojection_dim config? Can I still pass projection_dim=0 if I want to? And will that still return the expected behavior (no projection)?

Yeah, we just had a user getting OOM and being confused because they were just ramping up with the library and using the documented commands + wanted preconditioners

@luciaquirke
Copy link
Collaborator Author

@claude extract out the library code (not in benchmarks or docs) into a standalone PR.

@claude
Copy link

claude bot commented Jan 17, 2026

Claude finished @luciaquirke's task in 1m 19s —— View job


Extracted library code into standalone PR ✓

Tasks

  • Identify library-only changed files (exclude benchmarks/ and docs/)
  • Create new branch from main
  • Apply library changes to new branch
  • Push new branch and create PR

Summary

Successfully extracted library improvements from this PR into a standalone PR: #126

Library Changes Included

  • In-memory gradient collector: Add InMemoryCollector for small runs that don't need disk I/O
  • Better distributed error logging: Improved error messages with Python 3.10+ compatibility
  • Auto batch size CLI tool: New bergson auto_batch_size command
  • Build metadata in reduce: Include build metadata in reduce runs
  • max_tokens support: Add max_tokens parameter support with Dataset
  • CLAUDE.md updates: Document project conventions
  • Test coverage: Add tests for new functionality

Files Extracted (20 total)

Modified:

  • .pre-commit-config.yaml, CLAUDE.md, README.md
  • bergson/__main__.py, bergson/config.py, bergson/distributed.py
  • bergson/query/attributor.py, bergson/reduce.py
  • bergson/score/score.py, bergson/score/score_writer.py, bergson/score/scorer.py
  • bergson/utils/utils.py, bergson/utils/worker_utils.py
  • examples/slurm/score_worker.sh
  • pyproject.toml
  • tests/test_build.py, tests/test_score.py

New files:

  • bergson/cli/auto_batch_size.py
  • bergson/collector/in_memory_collector.py
  • bergson/utils/auto_batch_size.py

New PR: #126
Branch: library-improvements |

@luciaquirke
Copy link
Collaborator Author

luciaquirke commented Jan 19, 2026

On further thought I'm going to leave this as a separate branch and extract out the plots for now so I have a convenient spot to keep data

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants