
Commit d0fb577

Code release for "Scaling DoRA: High-Rank Adaptation via Factored Norms and Fused Kernels"
Repository contents:
- code/bench_dora_comprehensive.py: comprehensive benchmark suite (10 sections: norm, compose, backward, e2e, memory, models, stability, precision, vram, micro)
- code/bench_it6/: raw JSON results from 6 GPUs × 3 dtypes (microbenchmarks, 200 repeats) and 3 GPUs × 6 models (model-level)
- code/convergence_runs/: TensorBoard events for eager/fused convergence validation (3 seeds × 2 modes, Qwen3.5-9B, 2000 steps)
- code/peft_patched/: standalone copy of the modified PEFT source files
- code/scripts/: reference upstream dora.py, dataset repacking, inference audit
- code/requirements.txt: frozen pip environment from the reproducibility Docker image (alexazel/dorafactors-env:cu131-pt210-vllm-t52-base)
- paper/generate_figures.py: regenerate all 13 paper figures from bench_it6 JSON
- paper/generate_training_figure.py: convergence figure from TensorBoard events
- hf.patch: reconstruction patch against upstream PEFT commit 20a9829
- vendor/dorafactors-peft: git submodule → sockeye44/dorafactors-peft branch v1

Vendor dependency validation: the benchmark script validates vendor/dorafactors-peft (submodule, commit 9bb1084 reachable, valid PEFT package structure) and the reference dora.py (SHA-1 prefix 86def591d41, cascading search: code/scripts/ then vendor/) at startup, with interactive y/N acknowledgment for non-standard conditions.

70 files changed

Lines changed: 4333855 additions & 0 deletions


.gitignore

Lines changed: 3 additions & 0 deletions
# Vendor: upstream HF PEFT reference file fetched via wget (not committed)
vendor/dora.reference_hf_peft.py
__pycache__/

.gitmodules

Lines changed: 4 additions & 0 deletions
[submodule "vendor/dorafactors-peft"]
	path = vendor/dorafactors-peft
	url = https://github.com/sockeye44/dorafactors-peft
	branch = v1

README.md

Lines changed: 150 additions & 0 deletions
# Official code for "Scaling DoRA: High-Rank Adaptation via Factored Norms and Fused Kernels"

Memory-efficient DoRA (Weight-Decomposed Low-Rank Adaptation) for PEFT, featuring factored column norms, fused Triton kernels with custom autograd, and automatic dispatch across eager PyTorch and Triton backends.

From the paper: *Scaling DoRA: High-Rank Adaptation via Factored Norms and Fused Kernels* (arXiv preprint arXiv:XXXX.XXXXX, 2026).
## Reproducing benchmark results

**Clone with submodules** (recommended):

```bash
git clone --recurse-submodules https://github.com/sockeye44/dorafactors
cd dorafactors
```

If you already cloned without `--recurse-submodules`:

```bash
git submodule update --init
```

Install dependencies:

```bash
pip install -r code/requirements.txt
```

**Run the benchmark** (from the repo root):

```bash
python code/bench_dora_comprehensive.py --suite all --verbose
```
The script validates its vendor dependencies at startup and will guide you
through any missing pieces. All paths are resolved relative to the script's
location, so the benchmark finds its data and vendor dependencies from any
working directory — but **always invoke it via the repo-root-relative path**
shown above to keep commands unambiguous and copy-pasteable.

<details>
<summary>What the script expects at startup</summary>

| Artifact | Source | How it gets there |
|---|---|---|
| `vendor/dorafactors-peft/` | Git submodule → [`sockeye44/dorafactors-peft`](https://github.com/sockeye44/dorafactors-peft) branch `v1` | `--recurse-submodules` or `git submodule update --init` |
| Reference `dora.py` | Upstream HF PEFT @ [`20a9829`](https://github.com/huggingface/peft/blob/20a9829f76419149f5e447b856bc0abe865c28a7/src/peft/tuners/lora/dora.py) | Searched at `code/scripts/dora.reference_hf_peft.py`, then `vendor/` (SHA-1 verified); fetch with `wget` if absent |

If either is missing or fails integrity checks, the script prints exact
remediation commands. In interactive sessions it offers a `[y/N]` prompt to
continue under non-standard conditions; in non-interactive sessions (CI, piped
stdin) it exits with an error.

</details>
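For orientation, here is a minimal sketch of the reference-file check described above. The search order and the `86def591d41` prefix come from the repository's own description; the assumption that the prefix is a git-blob-style SHA-1, and every name below, are illustrative rather than copied from `bench_dora_comprehensive.py`.

```python
# Illustrative sketch only: mirrors the startup check described above.
# The SHA-1 prefix and search order come from the release notes; treating the
# hash as a git-blob-style SHA-1 is an assumption, not the repo's documented behavior.
import hashlib
import sys
from pathlib import Path

REPO_ROOT = Path(__file__).resolve().parent           # hypothetical location
REFERENCE_SHA1_PREFIX = "86def591d41"                  # from the release notes
SEARCH_PATHS = [                                       # cascading search order
    REPO_ROOT / "code" / "scripts" / "dora.reference_hf_peft.py",
    REPO_ROOT / "vendor" / "dora.reference_hf_peft.py",
]

def git_blob_sha1(path: Path) -> str:
    """SHA-1 of the file hashed as a git blob object (assumed form of the check)."""
    data = path.read_bytes()
    return hashlib.sha1(b"blob %d\0" % len(data) + data).hexdigest()

def find_reference_dora() -> Path | None:
    for candidate in SEARCH_PATHS:
        if candidate.is_file():
            if git_blob_sha1(candidate).startswith(REFERENCE_SHA1_PREFIX):
                return candidate
            # Integrity mismatch: ask for acknowledgment, or abort when non-interactive.
            if sys.stdin.isatty():
                answer = input(f"{candidate} fails the SHA-1 check; continue anyway? [y/N] ")
                if answer.strip().lower() == "y":
                    return candidate
            sys.exit(f"{candidate} does not match the expected reference file.")
    return None  # caller prints the wget remediation instructions
```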
### Alternative: reconstruct the PEFT fork from patch

Instead of using the submodule, you can reconstruct the patched PEFT module
from `hf.patch` against upstream PEFT commit
[`20a9829`](https://github.com/huggingface/peft/commit/20a9829) (`v0.18.0.rc0`):

```bash
git clone https://github.com/huggingface/peft /tmp/peft-fork
cd /tmp/peft-fork && git checkout 20a9829
git apply /path/to/dorafactors/hf.patch
```
### Generating paper figures

The figure-generation scripts read pre-collected benchmark JSON from
`code/bench_it6/` and TensorBoard events from `code/convergence_runs/`.
Both resolve data paths relative to their own location, so they work from
any working directory.

```bash
# Microbenchmark + model-level figures (requires matplotlib, numpy)
python paper/generate_figures.py        # PDF only
python paper/generate_figures.py --png  # PDF + PNG

# Training convergence figure (requires matplotlib, numpy, tensorboard)
python paper/generate_training_figure.py
```

Outputs go to `paper/figures/`.
### Running the test suite

Unit tests for the factored norm, fused kernels, and DoRA math live in the
PEFT fork submodule. One test (`test_reference_vs_optimized_forward_equivalence`)
loads the upstream HF PEFT baseline from `docs/dora.reference_hf_peft.py`
inside the fork tree — copy it from the parent repo before running:

```bash
cp code/scripts/dora.reference_hf_peft.py vendor/dorafactors-peft/docs/
cd vendor/dorafactors-peft
pytest tests/test_lora_variants.py \
  tests/tuners/lora/test_dora_fused.py \
  tests/tuners/lora/test_dora_math.py -v
```

Triton kernel tests require an SM 80+ GPU (Ampere or newer); validated on
SM 80 (A100) through SM 120 (RTX 6000 PRO).
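A quick way to check whether the current GPU qualifies (a convenience snippet, not part of the repository's test suite):

```python
# Convenience check: the Triton kernel tests need compute capability >= 8.0 (Ampere or newer).
import torch

if torch.cuda.is_available():
    major, minor = torch.cuda.get_device_capability()
    status = "OK" if (major, minor) >= (8, 0) else "too old for the Triton kernel tests"
    print(f"SM {major}{minor}: {status}")
else:
    print("No CUDA device visible; the Triton kernel tests cannot run.")
```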
## Documentation

Full documentation, how-to guides, and API reference are available at:

- [**Home**](https://sockeye44.github.io/dorafactors-docs/) — overview and quick-start
- [**Getting Started**](https://sockeye44.github.io/dorafactors-docs/getting-started/) — installation, setup, and first steps
- [**Configuration**](https://sockeye44.github.io/dorafactors-docs/config/) — all configuration options and environment variables

## Modules

| Module | Description | Reference |
|--------|-------------|-----------|
| `peft.tuners.lora.dora` | Layer classes (`DoraLinearLayer`, `DoraEmbeddingLayer`, conv variants), configuration functions, FSDP/ZeRO-3 integration, and eager composition helpers | [Layer Classes](layers.md), [Configuration](config.md) |
| `peft.tuners.lora.dora_fused` | Fused Triton kernels for DoRA compose, norm assembly, and forward+inner products; custom autograd function; PyTorch fallbacks; autotune configs | [Fused Kernels](fused.md) |
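The "factored" column norm referred to throughout concerns the column norms of the composed weight W0 + s·BA, computed from the low-rank factors instead of the materialized delta. As a purely illustrative sketch (not the code in `dora.py`, and not necessarily the paper's exact formulation), one such identity looks like this:

```python
# Illustrative sketch of a factored column-norm computation.
# NOT a copy of dora.py; shown only to convey the idea of computing
# ||W0 + s*B@A|| per column without materializing the (out x in) delta.
import torch

def factored_column_norms(w0: torch.Tensor, b: torch.Tensor, a: torch.Tensor, s: float) -> torch.Tensor:
    """Column norms of w0 + s * b @ a, computed from the factors.

    Shapes: w0 (out, in), b (out, r), a (r, in); returns a vector of length `in`.
    """
    w0_sq = (w0 * w0).sum(dim=0)                                                   # ||w0_j||^2
    cross = 2.0 * s * ((w0.transpose(0, 1) @ b) * a.transpose(0, 1)).sum(dim=1)    # 2 s * w0_j . (BA)_j
    delta_sq = (s * s) * (((b.transpose(0, 1) @ b) @ a) * a).sum(dim=0)            # s^2 * ||(BA)_j||^2
    return torch.sqrt(torch.clamp(w0_sq + cross + delta_sq, min=0.0))

# Quick self-check against the naive computation:
w0, b, a = torch.randn(64, 32), torch.randn(64, 8), torch.randn(8, 32)
naive = torch.linalg.vector_norm(w0 + 0.5 * b @ a, dim=0)
assert torch.allclose(factored_column_norms(w0, b, a, 0.5), naive, atol=1e-4)
```

Only (in × r) and (r × r) intermediates appear in the factored path, never the full (out × in) delta.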
## Environment Variables

| Variable | Default | Description |
|----------|---------|-------------|
| `PEFT_DORA_FUSED` | unset (auto) | Enable fused Triton kernels: `"1"`, `"0"`, or unset (auto: use if Triton available) |
| `PEFT_DORA_FUSED_BACKWARD` | unset (on) | Fused backward pass: `"1"` (force on, bypass shape heuristic), `"0"` (off), or unset (on, with shape-based filtering for linear layers) |
| `PEFT_DORA_NORM_CHUNK_MB` | `256` | Column-norm chunking threshold in MB; matrices exceeding this are chunked (min 16) |
| `PEFT_DORA_FWD_CHUNK_MB` | `256` | Forward-pass chunking threshold in MB (min 16) |
| `DORA_AUTOTUNE_COMPREHENSIVE` | `"0"` | Enable comprehensive Triton autotuning (`"1"` for full search) |
| `PEFT_DORA_ALLOW_PARTIAL_GATHER` | `"0"` | Allow partial parameter gathering under ZeRO-3 (`"1"` to enable) |
| `PEFT_FORCE_GATHER` | unset (auto) | Force full parameter gathering: `"1"`, `"0"`, or unset (auto-detect) |
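As a usage sketch with the standard PEFT API (the base model name and target modules are placeholders, and r = 384 simply mirrors the model-level benchmark setting), the fused backend can be opted into before the PEFT model is built:

```python
# Sketch: opting in to the fused DoRA backend via environment variables.
# Set the variables before the DoRA layers are constructed; the model name,
# rank, and target modules below are illustrative placeholders.
import os

os.environ["PEFT_DORA_FUSED"] = "1"             # explicitly enable the fused Triton kernels
os.environ["PEFT_DORA_NORM_CHUNK_MB"] = "512"   # raise the column-norm chunking threshold

from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("some-org/some-base-model")  # placeholder
config = LoraConfig(r=384, lora_alpha=384, use_dora=True,
                    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"])
model = get_peft_model(model, config)
```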
## Module Relationships

`dora.py` is the primary module: it defines all layer classes and configuration functions.
It lazy-imports `dora_fused.py` on first use (guarded by `_get_dora_fused()`) so that Triton
is not required at import time.
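The guard is the usual cached lazy-import pattern; a minimal sketch of the idea (not the actual body of `_get_dora_fused()`):

```python
# Minimal sketch of the lazy-import guard described above; the real
# _get_dora_fused() in dora.py may differ in naming and error handling.
_dora_fused_module = None

def _get_dora_fused():
    """Import dora_fused (and hence Triton) only when a fused path is actually used."""
    global _dora_fused_module
    if _dora_fused_module is None:
        from peft.tuners.lora import dora_fused  # deferred import: requires Triton
        _dora_fused_module = dora_fused
    return _dora_fused_module
```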
## Citation

```bibtex
@article{zelenin2026dorafactors,
  title         = {Scaling DoRA: High-Rank Adaptation via Factored Norms and Fused Kernels},
  author        = {Zelenin, Alexandra and Zhuravlyova, Alexandra},
  journal       = {arXiv preprint arXiv:XXXX.XXXXX},
  eprint        = {XXXX.XXXXX},
  archivePrefix = {arXiv},
  year          = {2026}
}
```

code/RESULTS.md

Lines changed: 127 additions & 0 deletions
# Data Provenance & Figure Mapping

This document maps every benchmark data artifact to its origin, purpose, and usage in the paper.

## Software Stack

**Benchmarks + convergence (single pinned stack):**
- **PyTorch**: 2.10.0+cu130 (built against CUDA 13.0)
- **Triton**: 3.6.0
- **Transformers**: 5.2.0
- **CUDA toolkit**: 13.1 (ptxas V13.1.115)
- **NVIDIA Driver**: 580.126.09
- **Python**: 3.12.12
- **OS**: Linux 6.8.0 (Ubuntu 22.04, glibc 2.35)

**Convergence runs (additional dependencies):**
- **ms-swift**: commit `a807cb9`
- **DeepSpeed**: 0.18.6
- **Flash-Attention**: 2.8.3
---
## Source Code Files

| File | Description |
|------|-------------|
| `dora.py` | Our DoRA layer implementation: factored norm, fused dispatch, `DoraLinearLayer` / `DoraEmbeddingLayer` / `_DoraConvNdLayer`. |
| `dora_fused.py` | Fused Triton kernels: compose forward/backward, norm assembly, `FusedDoRAComposeFunction` autograd wrapper. |
| `dora_diagnostics.py` | Diagnostic instrumentation gated by `PEFT_DORA_DIAGNOSE=1`; zero-cost no-op when disabled. |
| `dora_ci.py` | Modal CI entrypoint: orchestrates pytest + benchmarks on remote GPUs. |
| `bench_dora_comprehensive.py` | Comprehensive benchmark suite: norm, compose, backward, e2e, memory, stability, models. Produces JSON consumed by `generate_figures.py`. |
| `test_dora_fused.py` | Regression and performance tests for fused kernels (749 tests total with `test_dora_math.py`). |
| `test_dora_math.py` | Mathematical correctness tests: factored norm equivalence, numerical stability, edge cases. |
| `hf.patch` | Git diff against upstream PEFT `20a9829` (`v0.18.0.rc0`). Symlinked at repo root. |
| `peft_patched/` | Patched PEFT source tree (apply `hf.patch` to upstream for equivalent). |
| `scripts/check_compose_parity.py` | Quick compose parity check at production-scale dimensions. |
| `scripts/dora.reference_hf_peft.py` | Unmodified HF PEFT `DoraLinearLayer` snapshot for baseline comparisons. |
| `scripts/dora_inference_audit.py` | Mechanistic 6-phase DoRA inference audit (norm, compose, backward, decode, dispatch). |
| `scripts/repack_mmfinereason_qr.py` | Dataset preprocessing: field renames + tok_len filtering for MMFineReason. |
| `scripts/run_revision_benchmarks.sh` | High-rank + loss_tokens sensitivity benchmark runner. |
| `kernelagent_sols/` | KernelAgent (Meta) optimized kernel artifacts for compose and backward. |
---
## Benchmark Data (bench_it6)

**Directory**: `code/bench_it6/`

All paper figures, tables, and claims derive from this single data collection.

### Microbenchmarks (6 GPUs, 200 repeats, extended shapes)

| GPU | SM | Memory | Files |
|-----|----|--------|-------|
| L40S | SM89 (Ada) | 48 GB GDDR6 | `sm89_l40s_comprehensive_extended_{bf16,fp16,fp32}.json` |
| A100 | SM80 (Ampere) | 80 GB HBM2e | `sm80_a100_comprehensive_extended_{bf16,fp16,fp32}.json` |
| RTX 6000 PRO | SM120 (Blackwell) | 96 GB GDDR7 | `sm120_rtx6000_comprehensive_extended_{bf16,fp16,fp32}.json` |
| H200 | SM90 (Hopper) | 141 GB HBM3e | `sm90_h200_comprehensive_extended_{bf16,fp16,fp32}.json` |
| B200 | SM100 (Blackwell) | 192 GB HBM3e | `sm100_b200_comprehensive_extended_{bf16,fp16,fp32}.json` |
| B300 | SM103 (Blackwell) | 268 GB HBM3e | `sm103_b300_comprehensive_extended_{bf16,fp16,fp32}.json` |

### Model-level (3 GPUs, r=384, bs=1, seq=4096, ga=8, loss_tokens=1024, 20 repeats)

| GPU | File |
|-----|------|
| RTX 6000 PRO | `rtx6000_seq4096_bs1_gas8_seq4k_loss1k_n20w2_*.json` |
| H200 | `h200_seq4096_bs1_gas8_seq4k_loss1k_n20w2_*.json` |
| B200 | `b200_seq4096_bs1_gas8_seq4k_loss1k_n20w2_*.json` |

### High-rank (H200, r=512, loss_tokens=1024, 20 repeats)

| File | Models |
|------|--------|
| `h200_r512_loss1k_n20w2_*.json` | Qwen3.5-27B, Qwen3-VL-32B |

6 models total: Qwen2.5-VL-32B, Qwen3-VL-32B, Qwen3.5-27B, Gemma3-27B, Mistral-Small-24B, Qwen3-VL-8B.
---
79+
80+
## Convergence Data
81+
82+
**Directory**: `code/convergence_runs/`
83+
84+
### Primary (Qwen3.5-9B-Base, r=384, AdamW, 3 seeds × eager/fused = 6 runs)
85+
86+
| Seed | Mode | File |
87+
|------|------|------|
88+
| 1 | eager | `events.out.tfevents.*eager_seed1sft*` |
89+
| 1 | fused | `events.out.tfevents.*fused_seed1sft*` |
90+
| 2 | eager | `events.out.tfevents.*eager_seed2sft*` |
91+
| 2 | fused | `events.out.tfevents.*fused_seed2sft*` |
92+
| 3 | eager | `events.out.tfevents.*eager_seed3sft*` |
93+
| 3 | fused | `events.out.tfevents.*fused_seed3sft*` |
94+
95+
- Dataset: eyes-ml/MMFineReason-SFT-123K-Qwen3-VL-235B-Thinking-QR-max4096
96+
- Hardware: 1×RTX 6000 PRO (96 GB GDDR7)
97+
- Grand mean |Δloss|: 7.1e-4, worst max: 1.1e-2 (seed 1), mean final eval |Δ|: 8.9e-5
98+
- Wall-clock: 330 min fused vs 360 min eager (8.3% reduction)
99+
100+
### Cross-model + cross-optimizer (Qwen3-VL-8B, r=256, Muon+AdamW, seed 4)
101+
102+
| Mode | File |
103+
|------|------|
104+
| eager | `events.out.tfevents.*q3vl_8b_muon_eager_seed4sft*` |
105+
| fused | `events.out.tfevents.*q3vl_8b_muon_fused_seed4sft*` |
106+
107+
- Mean |Δloss|: 7.7e-4, max: 5.7e-3, final eval |Δ|: 3.9e-5
108+
- Wall-clock: 325 min fused vs 354 min eager (8.2% reduction)
109+
110+
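A sketch of how the eager/fused loss deltas above can be recomputed from the event files. The scalar tag name and the file names are assumptions: check `acc.Tags()` for the tag actually written by the training framework, and substitute real event-file paths from `code/convergence_runs/`.

```python
# Sketch: compare an eager run against its fused counterpart from TensorBoard events.
from tensorboard.backend.event_processing.event_accumulator import EventAccumulator

def load_scalar(event_file: str, tag: str = "train/loss") -> dict[int, float]:
    acc = EventAccumulator(event_file)
    acc.Reload()
    return {e.step: e.value for e in acc.Scalars(tag)}  # step -> scalar value

eager = load_scalar("code/convergence_runs/EAGER_SEED1_EVENT_FILE")  # placeholder path
fused = load_scalar("code/convergence_runs/FUSED_SEED1_EVENT_FILE")  # placeholder path

common = sorted(set(eager) & set(fused))
deltas = [abs(eager[s] - fused[s]) for s in common]
print(f"mean |Δloss| = {sum(deltas) / len(deltas):.2e}, max = {max(deltas):.2e}")
```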
---
## Figure Generation

| Script | Generates | Data Source |
|--------|-----------|-------------|
| `paper/generate_figures.py` | 13 benchmark figures | `code/bench_it6/` JSON files |
| `paper/generate_training_figure.py` | 1 convergence figure | `code/convergence_runs/` TensorBoard events |
---
## Historical Data

Benchmark iterations 1–5 (`bench_it1/` through `bench_it5/`), the old convergence
JSONL data (`tboard/`), and the old proprietary-dataset SFT runs (`sft_runs/`) have
been removed from the repository. They are preserved in the git history for
provenance. All paper claims derive exclusively from `bench_it6/` and
`convergence_runs/`.
