VAC: Variable Allocation Compression

Compress an LLM with minimal capability loss.

VAC is a structured compression toolkit that combines Fisher-informed low-rank factorization with an evolutionary strategy search to decide how each weight matrix in a transformer model is compressed. Unlike uniform compression, which applies the same ratio to every matrix, VAC allocates a compression budget per matrix using a multiple-choice knapsack (MCKP) solver. In the results below, this brings pre-KD perplexity down from 9,739 (uniform naive SVD at 2.0x) to 90.54 (evolved VAC at 1.8x).

Note: VAC produces factorized models that run via HuggingFace Transformers with trust_remote_code=True. GGUF/llama.cpp/Ollama support is not yet available — the factorized format requires a custom inference path. See Limitations for details.

Key Results

Method	PPL (pre-KD unless noted)	Compression	Notes
Naive SVD (uniform 2x)	9,739	2.0x	Model destroyed
Sequential Fisher (v1)	144	2.0x	67x better than naive
VAC evolved (v2)	90.54	1.8x	39% better than v1
VAC evolved (v2), after full recovery	~27	1.8x	Within 6 PPL of teacher

The evolved strategy discovered three key insights:

Middle-out compression order (+21% over front-to-back): compress the easy middle layers first so hard layers get accurate Fisher on a barely-distorted model
Cube-root Fisher scaling (+18% over sqrt): gentler weighting avoids over-trusting the diagonal Fisher approximation
Attention-heavy allocation: attention absorbs 4x compression with minimal damage; MLP is the sensitive component
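
To make the first insight concrete, here is a minimal sketch of one way to generate a middle-out layer order. The function name `middle_out_order` and the tie-breaking rule (start just left of center, then alternate outward) are assumptions for illustration; the exact order used by VAC's evolved strategy is not specified here.

```python
def middle_out_order(n_layers):
    """Return layer indices starting at the middle and alternating outward."""
    left = (n_layers - 1) // 2   # index just left of (or at) the center
    right = left + 1
    order = []
    while left >= 0 or right < n_layers:
        if left >= 0:
            order.append(left)
            left -= 1
        if right < n_layers:
            order.append(right)
            right += 1
    return order


print(middle_out_order(6))  # [2, 3, 1, 4, 0, 5]
```

The outermost layers come last, so by the time they are profiled the middle of the network has already been compressed.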

Installation

pip install vac-compress

Or from source:

git clone https://github.com/asystemoffields/v-a-c.git
cd v-a-c
pip install -e .

Quick Start

import torch
from vac import compress_model

# Compress any HuggingFace model
model, metadata = compress_model(
    "allenai/OLMo-3-7B-Think",
    target_ratio=2.0,
    device="cuda",
)
# model is now a factorized transformer with ~50% fewer stored parameters
# and ~2x faster inference (fewer FLOPs per layer)

Loading a Compressed Model

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load the VAC-compressed model (requires trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "asystemoffields/OLMo-3-7B-Think-VAC",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("allenai/OLMo-3-7B-Think")

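# For smaller GPUs (8+ GB), quantize to INT8 on load (requires bitsandbytes):
# from transformers import BitsAndBytesConfig
# model = AutoModelForCausalLM.from_pretrained(
#     "asystemoffields/OLMo-3-7B-Think-VAC",
#     trust_remote_code=True,
#     quantization_config=BitsAndBytesConfig(load_in_8bit=True),
#     device_map="auto",
# )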

messages = [{"role": "user", "content": "What is the capital of France?"}]
inputs = tokenizer.apply_chat_template(messages, return_tensors="pt", add_generation_prompt=True)
output = model.generate(inputs.to(model.device), max_new_tokens=512, temperature=0.6, top_p=0.95, do_sample=True)
print(tokenizer.decode(output[0], skip_special_tokens=True))

How It Works

1. Profile each weight matrix with diagonal Fisher information (sensitivity analysis)
2. Allocate per-matrix rank budgets via MCKP knapsack (spend bits where they matter)
3. Compress sequentially with Fisher recomputed per layer (accounts for error propagation). The evolved strategy uses middle-out order and cube-root Fisher scaling.
4. Recover via knowledge distillation on the model's original training data

Sequential compression is the key step: each layer is optimized for the distorted activations it will actually see at inference, not the pristine activations of the original model. In the Key Results table this takes pre-KD perplexity from 9,739 (naive SVD) to 144 at the same 2.0x compression ratio, a roughly 67x improvement.
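
The idea of compressing each layer against the activations produced by the already-compressed layers can be sketched on a toy ReLU MLP. This is an illustration under stated assumptions, not VAC's implementation: it processes layers front-to-back, and it substitutes an activation second-moment (whitening) weight for the recomputed Fisher information, since Fisher needs a loss and gradients. The names `whitened_truncate` and `compress_sequentially` are hypothetical.

```python
import numpy as np


def whitened_truncate(W, X, rank, eps=1e-6):
    """Rank-`rank` factors (A, B) that minimise ||X @ (W - A @ B).T||_F
    (up to the small eps regulariser)."""
    gram = X.T @ X / X.shape[0] + eps * np.eye(W.shape[1])  # input second moment
    L = np.linalg.cholesky(gram)                            # gram = L @ L.T
    U, S, Vt = np.linalg.svd(W @ L, full_matrices=False)
    A = U[:, :rank] * S[:rank]                              # (d_out, rank)
    B = np.linalg.solve(L.T, Vt[:rank].T).T                 # (rank, d_in) = Vt_r @ inv(L)
    return A, B


def compress_sequentially(weights, X, rank):
    """Compress layers in order; each layer sees inputs from the compressed prefix."""
    factors = []
    h = X  # calibration activations entering the current layer
    for W in weights:
        A, B = whitened_truncate(W, h, rank)
        factors.append((A, B))
        # Propagate through the *compressed* layer so the next layer is
        # fitted on the distorted activations it will see at inference.
        h = np.maximum(h @ B.T @ A.T, 0.0)
    return factors


rng = np.random.default_rng(0)
weights = [rng.standard_normal((16, 16)) for _ in range(3)]
X = rng.standard_normal((256, 16))
factors = compress_sequentially(weights, X, rank=4)
```

A non-sequential variant would fit every layer on activations from the original model, which ignores the error introduced by earlier compressed layers.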

Comparison

Format	Download	VRAM	Quality	Inference Speed
Original (bf16)	14.6 GB	14.6 GB	Baseline (PPL 21)	1.0x
Original GPTQ Q4	4.1 GB	~5 GB	Good (PPL ~23)	~1.0x
VAC 1.8x (bf16)	8.9 GB	8.9 GB	PPL 27	~1.8x faster
VAC 1.8x (INT8)	8.9 GB	~4.5 GB	PPL 27.3	~1.8x faster

VAC reduces both storage and compute. Quantization-only methods (GPTQ, AWQ) shrink storage but perform the same number of FLOPs. A VAC layer computes x @ B.T @ A.T instead of x @ W.T, which needs fewer multiply-accumulate operations whenever the rank r is below d_in * d_out / (d_in + d_out).
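
The multiply-accumulate (MAC) arithmetic behind this can be checked with a few lines. The layer size and rank below are illustrative values, not figures from the VAC results, and the helper names are hypothetical. MAC counts are only a proxy: measured speedups also depend on kernels and memory bandwidth.

```python
def dense_macs(d_in, d_out):
    """MACs per token for x @ W.T with W of shape (d_out, d_in)."""
    return d_in * d_out


def factorized_macs(d_in, d_out, rank):
    """MACs per token for x @ B.T @ A.T with B (rank, d_in) and A (d_out, rank)."""
    return rank * (d_in + d_out)


def break_even_rank(d_in, d_out):
    """Rank below which the factorized layer needs fewer MACs than the dense one."""
    return d_in * d_out / (d_in + d_out)


d_in = d_out = 4096
rank = 1024
ratio = dense_macs(d_in, d_out) / factorized_macs(d_in, d_out, rank)
print(f"{ratio:.2f}x fewer MACs")      # 2.00x fewer MACs
print(break_even_rank(d_in, d_out))    # 2048.0
```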

Limitations

No GGUF support. VAC models cannot currently be run in llama.cpp, Ollama, or LM Studio. The factorized layer format (two smaller matrices per layer) is not supported by these inference engines. A plan exists for integration via llama.cpp's LoRA mechanism, but it requires ecosystem work.
Requires trust_remote_code=True for loading via HuggingFace Transformers.
Loading requires ~16 GB system RAM (the model loads to CPU first, then moves to GPU). GPU VRAM requirement is only 8.9 GB (bf16) or ~4.5 GB (INT8).
Not lossless. There is a ~6 PPL gap from the teacher model. For most interactive use this is imperceptible, but on precise benchmarks (math, code) there may be measurable differences.

Pipeline Overview

Profile (Fisher)  -->  Allocate (MCKP)  -->  Compress (Sequential Fisher SVD)
     30 min               5 min                    1 hour

  -->  KD Recovery (DOLMA)  -->  Post-training (SFT/DPO)  -->  Package
         4-6 hours                  4-6 hours                  10 min

Total: ~12-20 H100-hours for a complete 7B compression run.

Advanced Usage

Evolutionary Strategy Search

Search for a compression strategy tuned to your model:

from vac.evolve import run_evolution

best_genome, fitness = run_evolution(
    model_name="your-model",
    target_ratio=2.0,
    population_size=30,
    n_generations=50,
)
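
For intuition about what such a search does, here is a toy evolutionary loop with elitist selection and mutation. It is a sketch under assumptions, not the internals of `run_evolution`: the genome fields (`fisher_power`, `order`), the mutation rule, and `toy_fitness` are all hypothetical, and the fitness is an arbitrary stand-in rather than a perplexity measurement.

```python
import random


def evolve(fitness, init_genome, mutate, population_size=8, n_generations=20, seed=0):
    """Minimise `fitness` using elitist truncation selection plus mutation."""
    rng = random.Random(seed)
    population = [init_genome] + [
        mutate(init_genome, rng) for _ in range(population_size - 1)
    ]
    n_elite = max(1, population_size // 4)
    history = []  # best fitness at the start of each generation
    for _ in range(n_generations):
        ranked = sorted(population, key=fitness)
        elite = ranked[:n_elite]            # survivors are carried over unchanged
        history.append(fitness(elite[0]))
        children = [
            mutate(rng.choice(elite), rng)
            for _ in range(population_size - n_elite)
        ]
        population = elite + children
    best = min(population, key=fitness)
    return best, fitness(best), history


def mutate(genome, rng):
    """Perturb the Fisher exponent and occasionally resample the layer order."""
    child = dict(genome)
    power = genome["fisher_power"] + rng.gauss(0.0, 0.05)
    child["fisher_power"] = min(1.0, max(0.1, power))
    if rng.random() < 0.2:
        child["order"] = rng.choice(["front_to_back", "back_to_front", "middle_out"])
    return child


def toy_fitness(genome):
    """Arbitrary stand-in objective (lower is better); NOT a real evaluation."""
    penalty = 0.0 if genome["order"] == "middle_out" else 0.1
    return (genome["fisher_power"] - 0.4) ** 2 + penalty


start = {"fisher_power": 0.5, "order": "front_to_back"}
best, score, history = evolve(toy_fitness, start, mutate)
```

In a real run the fitness call would compress and evaluate the model, which is why population size and generation count dominate the cost of the search.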

Custom Rank Allocation

from vac.allocate import solve_allocation
from vac.profile import profile_model

# Profile all matrices
results = profile_model("your-model", n_calib=64, seq_len=4096)

# Solve optimal allocation
allocation = solve_allocation(results, target_ratio=2.0, n_layers=32)
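
To illustrate the multiple-choice knapsack formulation, the sketch below picks exactly one rank option per matrix so that total estimated loss is minimal under a storage budget, using dynamic programming over integer costs. It is not the `solve_allocation` implementation: the function name `solve_mckp`, the (cost, loss) option format, and the example numbers are all assumptions made for illustration.

```python
def solve_mckp(choices, budget):
    """choices[i] is a list of (cost, loss) options for matrix i, with integer costs.
    Returns (picked option index per matrix, total loss, total cost)."""
    inf = float("inf")
    best = [0.0] + [inf] * budget      # best[c]: min loss with total cost exactly c
    picks = [[]] + [None] * budget     # option indices achieving best[c]
    for options in choices:
        new_best = [inf] * (budget + 1)
        new_picks = [None] * (budget + 1)
        for c in range(budget + 1):
            if best[c] == inf:
                continue
            for idx, (cost, loss) in enumerate(options):
                nc = c + cost
                if nc <= budget and best[c] + loss < new_best[nc]:
                    new_best[nc] = best[c] + loss
                    new_picks[nc] = picks[c] + [idx]
        best, picks = new_best, new_picks
    c_star = min(range(budget + 1), key=lambda c: best[c])
    if best[c_star] == inf:
        raise ValueError("budget too small for any valid allocation")
    return picks[c_star], best[c_star], c_star


# Made-up options: (storage cost, estimated loss) at low / medium / high rank.
attention = [(2, 1.0), (4, 0.5), (8, 0.1)]
mlp = [(2, 5.0), (4, 2.0), (8, 0.2)]
selection, total_loss, total_cost = solve_mckp([attention, mlp], budget=10)
print(selection, total_cost)  # [0, 2] 10
```

With these made-up numbers the solver gives the low rank to the less sensitive matrix and spends the remaining budget on the sensitive one.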

Knowledge Distillation

from vac.kd import train_kd

train_kd(
    student=compressed_model,
    teacher_name="original-model",
    dataset="allenai/dolma",
    n_steps=5000,
    seq_len=4096,
)
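
For reference, the soft-target objective from Hinton et al. (2015) can be written in a few lines of plain Python. This is a sketch of the standard temperature-scaled KL loss for a single token position; whether `train_kd` uses exactly this loss, and with which temperature, is not stated here, so `kd_loss` and its defaults are assumptions.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kd_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions,
    scaled by T^2 so gradient magnitudes stay comparable across temperatures."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * (math.log(pi) - math.log(qi)) for pi, qi in zip(p, q) if pi > 0)
    return temperature ** 2 * kl


teacher = [2.0, 1.0, 0.1]
print(kd_loss(teacher, teacher))            # 0.0
print(kd_loss([0.5, 0.5, 0.5], teacher) > 0)  # True
```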

Mathematical Foundation

VAC uses a separable Fisher scaling approximation for weighted SVD:

1. Compute the diagonal Fisher: F_ij = E[(dL/dW_ij)^2]
2. Extract row and column marginals: s_row = sqrt(mean(F, dim=1)), s_col = sqrt(mean(F, dim=0))
3. Scale the weight: W_scaled = diag(s_row) @ W @ diag(s_col)
4. Run a standard SVD on the scaled matrix: W_scaled = U S V^T
5. Truncate to rank r and un-scale back to original coordinates: W_r = diag(1/s_row) @ U_r S_r V_r^T @ diag(1/s_col)

This makes the SVD preferentially discard directions that are functionally unimportant (low Fisher), preserving the model's behavior even at aggressive compression ratios.
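
The scaling, truncation, and un-scaling steps can be sketched in NumPy as follows. This is a minimal illustration, not the code in the `vac` package: the function name `fisher_scaled_svd`, the `eps` guard, and the `power` argument are assumptions. Setting `power=1/3` is one plausible reading of the cube-root scaling mentioned under Key Results, applied in place of the square root.

```python
import numpy as np


def fisher_scaled_svd(W, F, rank, power=0.5, eps=1e-8):
    """Return factors (A, B) with W ~= A @ B, truncated in Fisher-scaled coordinates.

    W: (d_out, d_in) weight matrix
    F: (d_out, d_in) diagonal Fisher estimate, non-negative
    """
    s_row = np.mean(F, axis=1) ** power + eps      # (d_out,) row marginals
    s_col = np.mean(F, axis=0) ** power + eps      # (d_in,) column marginals
    W_scaled = s_row[:, None] * W * s_col[None, :]  # diag(s_row) @ W @ diag(s_col)
    U, S, Vt = np.linalg.svd(W_scaled, full_matrices=False)
    A = (U[:, :rank] * S[:rank]) / s_row[:, None]   # undo the row scaling
    B = Vt[:rank] / s_col[None, :]                  # undo the column scaling
    return A, B


rng = np.random.default_rng(0)
W = rng.standard_normal((8, 6))
F = rng.random((8, 6)) ** 2 + 1e-3   # synthetic stand-in for a Fisher estimate
A, B = fisher_scaled_svd(W, F, rank=2)
```

By the Eckart-Young theorem the truncation is optimal in the scaled norm, so the error is pushed toward rows and columns with low Fisher marginals.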

See docs/math.md for the complete mathematical specification.

Prior Art & References

VAC builds on and extends prior work in neural network compression:

GPTQ (Frantar et al., 2022): Layer-wise sequential quantization with Hessian information. VAC's sequential compression (processing layers one at a time, with statistics recomputed on the partially compressed model) draws on this insight.
ASVD (Activation-aware SVD): Using activation/gradient information to weight the SVD decomposition. VAC's Fisher-scaled SVD is in this family.
Optimal Brain Damage (LeCun et al., 1990) / Optimal Brain Surgeon (Hassibi & Stork, 1993): Using second-order (Fisher/Hessian) information to identify removable parameters. The foundational insight behind Fisher-guided compression.
Knowledge Distillation (Hinton et al., 2015): Training a student to match a teacher's soft predictions. Used in VAC's recovery phase.
QuaRot (Ashkboos et al., 2024) / SpinQuant (Liu et al., 2024): Rotations (Hadamard in QuaRot, learned in SpinQuant) that improve quantization. Related to VAC's exploration of rotation-before-compression.
PMRA (asystemoffields): Per-matrix rate allocation for quantization. VAC extends this framework to structural (rank) allocation.

Citation

If you use VAC in your research, please cite:

@software{vac2026,
  title={VAC: Variable Allocation Compression},
  author={asystemoffields},
  year={2026},
  url={https://github.com/asystemoffields/v-a-c},
}

Acknowledgments

Allen AI for publishing OLMo with full training data, post-training datasets, and evaluation infrastructure. Their radical openness made this research possible.
Built on PyTorch and HuggingFace Transformers.

License

Apache 2.0. See LICENSE.
