# Bergson Benchmarks

This directory contains benchmarking scripts for measuring Bergson's performance across different models and configurations.

## Benchmark Scripts

### Core Benchmarks

- **`benchmark_bergson.py`** - Programmatic benchmarks for Bergson
  - `run` - In-memory benchmark using `InMemoryCollector` (fast, single GPU)
  - `run-disk` - Disk-based benchmark using the real `build()`, `reduce()`, and `score_dataset()` (single GPU)

- **`benchmark_bergson_cli.py`** - CLI-based benchmark that runs via subprocess
  - Tests the actual CLI commands (`bergson build`, `bergson reduce`, `bergson score`)
  - Supports multi-GPU via `--num-gpus`

### Comparison Benchmarks

- **`benchmark_dattri.py`** - Dattri influence function benchmark
- **`kronfluence_benchmark.py`** - Kronfluence influence function benchmark

### Utilities

- **`benchmark_utils.py`** - Shared utilities for all benchmarks
  - Model specifications
  - Token parsing
  - Path generation
  - Timestamp utilities
  - `load_benchmark_dataset()` - Loads the on-disk tokenized dataset with filtering

- **`save_to_disk.py`** - Utility for preprocessing and saving tokenized datasets to disk

### Analysis

- **`plot_cli_benchmark.py`** - Plots benchmark results
  - Automatically separates plots by GPU count and hardware
  - Generates `cli_benchmark_1gpu.png`, `cli_benchmark_8gpu.png`, etc.
  - Each PNG contains only results from the same GPU/hardware configuration
- **`run_full_benchmark.py`** - Orchestrates the full benchmark suite

## Usage Examples

### Loading the Benchmark Dataset

All benchmarks should use the pre-tokenized on-disk dataset for consistency:

```python
from benchmarks.benchmark_utils import load_benchmark_dataset

# Load and filter to sequences >= 1024 tokens
ds = load_benchmark_dataset()
```

Or test the loader directly:
```bash
python -m benchmarks.test_load_dataset
```

This will:
- Load the tokenized dataset from `data/EleutherAI/SmolLM2-135M-10B-tokenized`
- Filter out sequences shorter than 1024 tokens (for even batching)
- Print statistics about the total tokens available

### In-Memory Benchmark (fastest)
```bash
python -m benchmarks.benchmark_bergson run pythia-14m 1M 100K
```

### Disk-Based Benchmark (tests real code paths)
```bash
python -m benchmarks.benchmark_bergson run-disk pythia-14m 1M 100K
```

### CLI Benchmark (multi-GPU support)

Single GPU (default):
```bash
python -m benchmarks.benchmark_bergson_cli pythia-70m 10M
```

Multi-GPU (8 GPUs):
```bash
python -m benchmarks.benchmark_bergson_cli pythia-70m 10M --num_gpus 8
```

### Running Full Benchmark Suites

**Small models (1 GPU):**
```bash
./benchmarks/run_small_models_cli_benchmark.sh
```

**Small models (8 GPUs):**
```bash
./benchmarks/run_small_models_8gpu.sh
```

**Large models (1 GPU):**
```bash
./benchmarks/run_large_models_cli_benchmark.sh
```

**Large models (8 GPUs):**
```bash
./benchmarks/run_large_models_8gpu.sh
```

### Generating Plots

The plotting script automatically separates results by GPU count and hardware:

```bash
python -m benchmarks.plot_cli_benchmark
```

This will:
- Load all benchmark results from `runs/bergson_cli_benchmark/`
- Group them by (num_gpus, hardware) combination
- Generate separate plots and CSVs for each configuration:
  - `figures/cli_benchmark_1gpu.png` - single-GPU results
  - `figures/cli_benchmark_8gpu.png` - 8-GPU results
  - `runs/benchmarks/cli_benchmark_1gpu.csv` - single-GPU data
  - `runs/benchmarks/cli_benchmark_8gpu.csv` - 8-GPU data

Each plot contains only results from the same GPU/hardware configuration, making comparisons fair and meaningful.
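
The grouping step can be sketched as follows. The records here are hypothetical; the real ones are loaded from `runs/bergson_cli_benchmark/`, and only the `num_gpus` and `hardware` fields are taken from the "Benchmark Records" section below:

```python
# Sketch: bucket result records by (num_gpus, hardware) so each plot/CSV
# compares only like-for-like configurations.
from collections import defaultdict

records = [  # hypothetical records standing in for the loaded JSON results
    {"model": "pythia-14m", "num_gpus": 1, "hardware": "node1 (1x A100)", "seconds": 12.3},
    {"model": "pythia-70m", "num_gpus": 1, "hardware": "node1 (1x A100)", "seconds": 45.6},
    {"model": "pythia-70m", "num_gpus": 8, "hardware": "node2 (8x A100)", "seconds": 9.8},
]

groups = defaultdict(list)
for rec in records:
    groups[(rec["num_gpus"], rec["hardware"])].append(rec)

# One plot and one CSV would then be emitted per key.
for (num_gpus, hardware), recs in sorted(groups.items()):
    print(f"{num_gpus} GPU(s) on {hardware}: {len(recs)} record(s)")
```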

## Benchmark Comparison

| Benchmark | Method | Multi-GPU | Disk I/O | Use Case |
|-----------|--------|-----------|----------|----------|
| `run` | In-memory collector | No (FSDP only) | None | Quick memory scaling tests |
| `run-disk` | Real build/reduce/score | No | Yes | Tests production code paths |
| CLI (1 GPU) | Subprocess CLI commands | No | Yes | Single-GPU baseline |
| CLI (8 GPU) | Subprocess CLI commands | Yes | Yes | Full multi-GPU distributed runs |

## Benchmark Records

Every benchmark record includes:
- **num_gpus**: the number of GPUs used for the run
- **hardware**: hardware information (node name plus GPU type/count)

This allows proper comparison between single-GPU and multi-GPU runs.
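
A record might look like the following; every field other than `num_gpus` and `hardware` is illustrative, not the actual schema:

```json
{
  "model": "pythia-70m",
  "seconds": 9.8,
  "num_gpus": 8,
  "hardware": "node2 (8x A100)"
}
```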

## Adding New Benchmarks

1. Add your benchmark script to this directory
2. Import from `benchmarks.benchmark_utils` for shared functionality
3. Follow the existing pattern for saving results (JSON records)
4. Update this README with your benchmark's purpose and usage
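
The steps above can be sketched as a minimal script skeleton. The name `my_benchmark.py`, the helper `run_benchmark()`, the output directory, and all record fields except `num_gpus` and `hardware` are hypothetical; real scripts should pull shared helpers from `benchmarks.benchmark_utils`:

```python
# Hypothetical skeleton for a new benchmark script (my_benchmark.py).
import json
import platform
import time
from pathlib import Path


def run_benchmark(model: str, num_gpus: int, out_dir: Path) -> Path:
    """Time a workload and save a JSON record the plotting script can group."""
    start = time.perf_counter()
    # ... run the workload being measured here ...
    elapsed = time.perf_counter() - start

    record = {
        "model": model,
        "seconds": elapsed,
        "num_gpus": num_gpus,         # required for per-configuration grouping
        "hardware": platform.node(),  # node name; real scripts also add GPU type/count
    }
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / f"{model}_{num_gpus}gpu.json"
    out_path.write_text(json.dumps(record, indent=2))
    return out_path
```

The key point is the record shape: as long as each saved JSON carries `num_gpus` and `hardware`, the plotting script can place the results in the correct per-configuration plot.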