# Bergson Benchmarks

This directory contains benchmarking scripts for measuring Bergson's performance across different models and configurations.

## Benchmark Scripts

### Core Benchmarks

- **`benchmark_bergson.py`** - Programmatic benchmarks for Bergson
  - `run` - In-memory benchmark using `InMemoryCollector` (fast, single GPU)
  - `run-disk` - Disk-based benchmark using the real `build()`, `reduce()`, and `score_dataset()` (single GPU)

- **`benchmark_bergson_cli.py`** - CLI-based benchmark that runs via subprocess
  - Tests the actual CLI commands (`bergson build`, `bergson reduce`, `bergson score`)
  - Supports multi-GPU via `--num-gpus`

### Comparison Benchmarks

- **`benchmark_dattri.py`** - Dattri influence function benchmark
- **`kronfluence_benchmark.py`** - Kronfluence influence function benchmark

### Utilities

- **`benchmark_utils.py`** - Shared utilities for all benchmarks
  - Model specifications
  - Token parsing
  - Path generation
  - Timestamp utilities
  - `load_benchmark_dataset()` - Loads the on-disk tokenized dataset with filtering

- **`save_to_disk.py`** - Utility for preprocessing and saving tokenized datasets to disk

### Analysis

- **`plot_cli_benchmark.py`** - Plots benchmark results
  - Automatically separates plots by GPU count and hardware
  - Generates `cli_benchmark_1gpu.png`, `cli_benchmark_8gpu.png`, etc.
  - Each PNG contains only results from the same GPU/hardware configuration
- **`run_full_benchmark.py`** - Orchestrates the full benchmark suite

## Usage Examples

### Loading the Benchmark Dataset

All benchmarks should use the pre-tokenized on-disk dataset for consistency:

```python
from benchmarks.benchmark_utils import load_benchmark_dataset

# Load and filter to sequences >= 1024 tokens
ds = load_benchmark_dataset()
```

Or test the loader directly:
```bash
python -m benchmarks.test_load_dataset
```

This will:
- Load the tokenized dataset from `data/EleutherAI/SmolLM2-135M-10B-tokenized`
- Filter out sequences shorter than 1024 tokens (for even batching)
- Print statistics about the total tokens available

### In-Memory Benchmark (fastest)
```bash
python -m benchmarks.benchmark_bergson run pythia-14m 1M 100K
```

### Disk-Based Benchmark (tests real code paths)
```bash
python -m benchmarks.benchmark_bergson run-disk pythia-14m 1M 100K
```

### CLI Benchmark (multi-GPU support)

Single GPU (default):
```bash
python -m benchmarks.benchmark_bergson_cli pythia-70m 10M
```

Multi-GPU (8 GPUs):
```bash
python -m benchmarks.benchmark_bergson_cli pythia-70m 10M --num_gpus 8
```

### Running Full Benchmark Suites

**Small models (1 GPU):**
```bash
./benchmarks/run_small_models_cli_benchmark.sh
```

**Small models (8 GPUs):**
```bash
./benchmarks/run_small_models_8gpu.sh
```

**Large models (1 GPU):**
```bash
./benchmarks/run_large_models_cli_benchmark.sh
```

**Large models (8 GPUs):**
```bash
./benchmarks/run_large_models_8gpu.sh
```

### Generating Plots

The plotting script automatically separates results by GPU count and hardware:

```bash
python -m benchmarks.plot_cli_benchmark
```

This will:
- Load all benchmark results from `runs/bergson_cli_benchmark/`
- Group them by (num_gpus, hardware) combination
- Generate separate plots and CSVs for each configuration:
  - `figures/cli_benchmark_1gpu.png` - single-GPU results
  - `figures/cli_benchmark_8gpu.png` - 8-GPU results
  - `runs/benchmarks/cli_benchmark_1gpu.csv` - single-GPU data
  - `runs/benchmarks/cli_benchmark_8gpu.csv` - 8-GPU data

Each plot contains only results from the same GPU/hardware configuration, making comparisons fair and meaningful.
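
The grouping step can be sketched as follows. The records here are hypothetical; the real ones are loaded from `runs/bergson_cli_benchmark/`, and only the `num_gpus` and `hardware` fields are taken from the "Benchmark Records" section below:

```python
# Sketch: bucket result records by (num_gpus, hardware) so each plot/CSV
# compares only like-for-like configurations.
from collections import defaultdict

records = [  # hypothetical records standing in for the loaded JSON results
    {"model": "pythia-14m", "num_gpus": 1, "hardware": "node1 (1x A100)", "seconds": 12.3},
    {"model": "pythia-70m", "num_gpus": 1, "hardware": "node1 (1x A100)", "seconds": 45.6},
    {"model": "pythia-70m", "num_gpus": 8, "hardware": "node2 (8x A100)", "seconds": 9.8},
]

groups = defaultdict(list)
for rec in records:
    groups[(rec["num_gpus"], rec["hardware"])].append(rec)

# One plot and one CSV would then be emitted per key.
for (num_gpus, hardware), recs in sorted(groups.items()):
    print(f"{num_gpus} GPU(s) on {hardware}: {len(recs)} record(s)")
```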

## Benchmark Comparison

| Benchmark | Method | Multi-GPU | Disk I/O | Use Case |
|-----------|--------|-----------|----------|----------|
| `run` | In-memory collector | No (FSDP only) | None | Quick memory scaling tests |
| `run-disk` | Real build/reduce/score | No | Yes | Tests production code paths |
| CLI (1 GPU) | Subprocess CLI commands | No | Yes | Single-GPU baseline |
| CLI (8 GPU) | Subprocess CLI commands | Yes | Yes | Full multi-GPU distributed runs |

## Benchmark Records

Every benchmark record includes:
- **num_gpus**: the number of GPUs used for the run
- **hardware**: hardware information (node name plus GPU type/count)

This allows proper comparison between single-GPU and multi-GPU runs.
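
A record might look like the following; every field other than `num_gpus` and `hardware` is illustrative, not the actual schema:

```json
{
  "model": "pythia-70m",
  "seconds": 9.8,
  "num_gpus": 8,
  "hardware": "node2 (8x A100)"
}
```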

## Adding New Benchmarks

1. Add your benchmark script to this directory
2. Import from `benchmarks.benchmark_utils` for shared functionality
3. Follow the existing pattern for saving results (JSON records)
4. Update this README with your benchmark's purpose and usage
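
The steps above can be sketched as a minimal script skeleton. The name `my_benchmark.py`, the helper `run_benchmark()`, the output directory, and all record fields except `num_gpus` and `hardware` are hypothetical; real scripts should pull shared helpers from `benchmarks.benchmark_utils`:

```python
# Hypothetical skeleton for a new benchmark script (my_benchmark.py).
import json
import platform
import time
from pathlib import Path


def run_benchmark(model: str, num_gpus: int, out_dir: Path) -> Path:
    """Time a workload and save a JSON record the plotting script can group."""
    start = time.perf_counter()
    # ... run the workload being measured here ...
    elapsed = time.perf_counter() - start

    record = {
        "model": model,
        "seconds": elapsed,
        "num_gpus": num_gpus,         # required for per-configuration grouping
        "hardware": platform.node(),  # node name; real scripts also add GPU type/count
    }
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / f"{model}_{num_gpus}gpu.json"
    out_path.write_text(json.dumps(record, indent=2))
    return out_path
```

The key point is the record shape: as long as each saved JSON carries `num_gpus` and `hardware`, the plotting script can place the results in the correct per-configuration plot.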