Production Mixed-Rate Allocation

Production Mixed-Rate Allocation (PMRA) is a method for mixing existing GGUF quantizations of the same model into one standard GGUF.

To mix a model with PMRA, start with a small production GGUF, then promote selected tensor payloads to stronger production GGUF formats where calibration shows the extra bytes buy quality. The mixer writes the selected payloads into a single GGUF that loads in normal GGUF runtimes.

PMRA is not a new quantizer and does not need a custom runtime. It is a selection-and-materialization method over production GGUF payloads. Each mix records its source allocation in pmra.* metadata and in the artifact report.

Released model mixes, metrics, and upstream attribution are tracked separately from this method overview. Public PMRA GGUF mixes are published in the PMRA collection on Hugging Face.

Vocabulary

mix, verb: run PMRA selection and build the resulting mixed GGUF.
mix, noun: the resulting GGUF artifact with tensor payloads from more than one source quantization.
source: one existing GGUF quantization used as a tensor-payload donor.
target/control: the uniform GGUF budget or quality point the mix is compared against.
selector: the allocation strategy, usually calibration knapsack or a recorded comparison selector.
payload budget: the tensor-payload byte ceiling the mix must fit under.

How PMRA Works

PMRA treats mixed quantization as a byte-budgeted allocation problem.

1. Pick a low-bit source GGUF, such as IQ2_M.
2. Pick a target/control budget, such as IQ3_XS.
3. Load stronger GGUF sources from the same checkpoint.
4. For each tensor group, temporarily promote it to each stronger source.
5. Measure calibration NLL improvement and added payload bytes.
6. Select the best set of promotions under the byte budget.
7. Mix the selected tensor payloads into one standard GGUF.
8. Evaluate the full mix against uniform controls and random same-budget mixes.

Uniform quantization spends one format choice broadly across the model, even though tensors are not equally sensitive. PMRA spends stronger formats only where calibration shows the bytes matter.
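The probing steps above (temporarily promote each tensor group, then measure calibration NLL improvement and added payload bytes) can be pictured as building a table of candidate promotions. The following is a minimal sketch, not the repo's implementation: `Candidate`, `probe_candidates`, the `payload_bytes` mapping, and the `measure_nll` callable are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Candidate:
    group: str        # tensor group being promoted
    source: str       # stronger source quantization, e.g. "IQ3_XS"
    gain: float       # calibration NLL improvement over the all-base allocation
    extra_bytes: int  # added payload bytes relative to the base source


def probe_candidates(
    groups: List[str],
    sources: List[str],
    base: str,
    payload_bytes: Dict[str, Dict[str, int]],
    measure_nll: Callable[[Dict[str, str]], float],
) -> List[Candidate]:
    """Promote one tensor group at a time and record its gain and byte cost.

    measure_nll is a stand-in for a real calibration pass: it takes a
    {group: source} allocation and returns the calibration NLL.
    """
    base_alloc = {g: base for g in groups}
    base_nll = measure_nll(base_alloc)
    candidates = []
    for g in groups:
        for s in sources:
            trial = dict(base_alloc)
            trial[g] = s  # temporary promotion of a single group
            gain = base_nll - measure_nll(trial)
            extra = payload_bytes[g][s] - payload_bytes[g][base]
            candidates.append(Candidate(g, s, gain, extra))
    return candidates
```

Each probe changes exactly one group, so the table holds single-promotion effects only. Interactions between promotions are not captured here, which is why the final mix is still evaluated as a whole.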

Why Knapsack

The default selector is a multiple-choice knapsack:

maximize total calibration NLL improvement
subject to total extra bytes <= payload budget
and at most one source choice per tensor group
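
Written out in symbols, the same objective might look like the block below. The notation is introduced here for illustration and does not come from the repo: $G$ is the set of tensor groups, $S_g$ the stronger sources available for group $g$, $\Delta_{g,s}$ the measured calibration NLL improvement from promoting $g$ to $s$, $c_{g,s}$ the added payload bytes, $B$ the byte budget for promotions, and $x_{g,s}$ a 0/1 selection variable.

```latex
\begin{aligned}
\max_{x}\quad & \sum_{g \in G} \sum_{s \in S_g} \Delta_{g,s}\, x_{g,s} \\
\text{subject to}\quad & \sum_{g \in G} \sum_{s \in S_g} c_{g,s}\, x_{g,s} \le B, \\
& \sum_{s \in S_g} x_{g,s} \le 1 \quad \text{for all } g \in G, \\
& x_{g,s} \in \{0, 1\}.
\end{aligned}
```

This form treats the per-promotion improvements as additive, which is an approximation.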

Each candidate promotion has a value, which is its measured calibration improvement, and a cost, which is its added tensor payload bytes. Knapsack fits better than a pure greedy gain-per-byte ranking because the whole byte budget matters: several modest tensor promotions can together beat one large promotion, even when the large one has the most tempting single-tensor score.
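A toy example makes the difference concrete. The numbers below are invented for illustration, and each promotion is assumed to belong to a different tensor group, so this is a plain 0/1 case. `greedy_by_ratio` and `best_subset` are hypothetical helpers, not functions from the repo.

```python
from itertools import combinations

# Hypothetical promotions: (name, calibration gain, extra bytes).
PROMOTIONS = [("big", 9.0, 6), ("mod_a", 6.0, 5), ("mod_b", 6.0, 5)]
BUDGET = 10


def greedy_by_ratio(items, budget):
    """Take promotions in order of gain per byte while they still fit."""
    chosen, used = [], 0
    for name, gain, cost in sorted(items, key=lambda it: it[1] / it[2], reverse=True):
        if used + cost <= budget:
            chosen.append(name)
            used += cost
    return chosen


def best_subset(items, budget):
    """Exhaustive search; fine for a toy example, not for real tensor counts."""
    best, best_gain = [], 0.0
    for r in range(1, len(items) + 1):
        for combo in combinations(items, r):
            cost = sum(c for _, _, c in combo)
            gain = sum(g for _, g, _ in combo)
            if cost <= budget and gain > best_gain:
                best, best_gain = [n for n, _, _ in combo], gain
    return best
```

With these numbers, the greedy ranking takes `big` first because it has the best gain per byte, and then neither modest promotion fits in the remaining 4 bytes, for a total gain of 9. The exhaustive search instead picks `mod_a` and `mod_b`, which fill the budget exactly for a total gain of 12.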

When the byte state space is compact, PMRA uses an exact scaled dynamic program. When it is too large, it keeps a Pareto-pruned frontier so the run stays practical. The selected mix still has to pass held-out evaluation because tensor interactions are real and calibration is the selection objective, not the final claim.
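Both solver styles can be sketched in a few lines. This is a hedged illustration under stated assumptions, not the repo's code: `mckp_scaled_dp`, `mckp_pareto`, the `Options` layout, and the `unit` parameter are hypothetical, gains are treated as additive, and `extra_bytes` is assumed to be non-negative.

```python
import math
from typing import Dict, List, Tuple

# options[group] = list of (source, gain, extra_bytes); names are illustrative.
Options = Dict[str, List[Tuple[str, float, int]]]


def mckp_scaled_dp(options: Options, budget: int, unit: int = 1):
    """Multiple-choice knapsack by dynamic programming over scaled byte units.

    Costs are rounded up to whole units and the budget is rounded down, so any
    selection returned also fits the real byte budget.
    """
    cap = budget // unit
    # best[w] = (total gain, {group: source}) using at most w units
    best = [(0.0, {}) for _ in range(cap + 1)]
    for group, choices in options.items():
        nxt = list(best)  # default: leave this group at the base source
        for source, gain, extra in choices:
            cost = math.ceil(extra / unit)
            if gain <= 0 or cost > cap:
                continue
            for w in range(cost, cap + 1):
                prev_gain, prev_pick = best[w - cost]  # read previous groups only
                if prev_gain + gain > nxt[w][0]:
                    nxt[w] = (prev_gain + gain, {**prev_pick, group: source})
        best = nxt
    return best[cap]


def mckp_pareto(options: Options, budget: int):
    """Same problem, keeping only non-dominated (bytes used, gain) states."""
    frontier = [(0, 0.0, {})]
    for group, choices in options.items():
        grown = list(frontier)  # states that leave this group at the base source
        for used, total, picks in frontier:
            for source, gain, extra in choices:
                if gain > 0 and used + extra <= budget:
                    grown.append((used + extra, total + gain, {**picks, group: source}))
        # Sort by bytes ascending, then gain descending, and keep a state only
        # if it beats every cheaper state on gain.
        grown.sort(key=lambda s: (s[0], -s[1]))
        frontier, top = [], -1.0
        for state in grown:
            if state[1] > top:
                frontier.append(state)
                top = state[1]
    _, gain, picks = frontier[-1]  # last state has the highest gain
    return gain, picks
```

In the dynamic program, a larger `unit` shrinks the table at the price of rounding costs up, so the result is exact only with respect to the scaled costs. The frontier version avoids a table sized by the budget. A practical run might additionally cap or coarsen the frontier, which this sketch does not do.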

The repo also supports search refinements around that default: seeded genetic search, direct genetic search, seeded simulated annealing, and direct simulated annealing. These are tested as candidate finders, then judged by the same held-out controls as knapsack.
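As one illustration of such a refinement, the sketch below shows a generic simulated-annealing loop over allocations. It assumes that "seeded" means starting from an existing feasible allocation, such as the knapsack result, and that "direct" means starting from the all-base allocation. That reading, along with `anneal`, its parameters, and the `cost` and `score` callables, is an assumption made for this example rather than the repo's API.

```python
import math
import random
from typing import Callable, Dict, List, Optional

Allocation = Dict[str, Optional[str]]  # group -> promoted source, or None for base


def anneal(
    seed_alloc: Allocation,
    choices: Dict[str, List[str]],
    cost: Callable[[Allocation], int],
    score: Callable[[Allocation], float],
    budget: int,
    steps: int = 2000,
    temp: float = 1.0,
    cooling: float = 0.995,
    rng: Optional[random.Random] = None,
) -> Allocation:
    """Simulated annealing over allocations; over-budget moves are rejected.

    seed_alloc is assumed to fit the budget already. score is a stand-in for a
    calibration objective where higher is better.
    """
    rng = rng or random.Random(0)
    current, current_score = dict(seed_alloc), score(seed_alloc)
    best, best_score = dict(current), current_score
    groups = list(choices)
    for _ in range(steps):
        g = rng.choice(groups)
        trial = dict(current)
        trial[g] = rng.choice([None] + choices[g])  # re-pick one group's source
        if cost(trial) > budget:
            continue
        trial_score = score(trial)
        delta = trial_score - current_score
        # Always accept improvements; accept regressions with a probability
        # that shrinks as the temperature cools.
        if delta >= 0 or rng.random() < math.exp(delta / temp):
            current, current_score = trial, trial_score
            if current_score > best_score:
                best, best_score = dict(current), current_score
        temp = max(temp * cooling, 1e-6)
    return best
```

Because the loop tracks the best feasible allocation it has seen, the result never scores below the seed on the objective it was given. Whether it does better on held-out text is a separate question, which is why these finders are judged by the same controls as knapsack.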

Prior Art And Positioning

PMRA is new in this repo as a GGUF-native mixing workflow, but it sits inside a longer line of mixed-precision and sensitivity-aware compression work.

Relevant predecessors include:

HAQ, which used hardware feedback to choose mixed-precision quantization policies.
HAWQ-V2, which used Hessian-aware sensitivity analysis for mixed-precision quantization.
LLM.int8(), which used mixed-precision decomposition to preserve transformer outlier dimensions.
GPTQ, which made post-training LLM weight quantization practical at large scale.
AWQ, which protected salient weights using activation-aware calibration.
SpQR and SqueezeLLM, which combined dense low-bit quantization with special handling for sensitive or outlier weights.
OmniQuant, which used calibration to optimize quantization parameters across LLM settings.

Applications

PMRA is useful anywhere a deployable GGUF has to balance quality, file size, memory budget, and runtime compatibility without introducing a custom inference path.

In practice, that covers cases where you want to:

hit a local memory budget more precisely than one uniform preset allows
recover quality at the same size by protecting sensitive tensors
publish one normal GGUF instead of a custom runtime path
reuse public quantization ladders rather than recomputing every quant
expose exactly where the bytes went through pmra.* metadata and reports
create practical local-model deployment points for laptops, small GPUs, and apps

How To Read This Repo

For a quick pass through the repo, use this order:

1. This README - explanation of what PMRA is, how mixing works, and why knapsack is the default selector.
2. The Guide - more detailed explanation of what PMRA does, why uniform quantization wastes bits, how probing and Fisher-guided profiling work, and what the output looks like.
3. PMRA Hugging Face Collection - public GGUF mixes and model cards.
4. Artifact Index - released mixes, reports, and metrics.
5. Method - implementation-level method notes.
6. Reproduce - local, Colab, and optional Modal paths for rebuilding selector results and GGUF artifacts.
7. Evidence Docs - full research trail, including failed and superseded gates.

Repository Layout

docs/       Human-facing release notes, method notes, evidence, reproduction.
scripts/    PMRA selector, GGUF mixer, public evaluators, helpers.
modal/      Modal A100 harness used for the public-calibrated runs.
results/    Result cards and JSON reports for released and validation mixes.
artifacts/  Earlier artifact reports, not the GGUF files themselves.
tools/      Hugging Face upload and verification helpers.

Install

pip install -r requirements.txt

Full selector runs expect access to the base model weights and matching GGUF source files. They can run locally, in Colab, on rented GPU machines, or through the optional Modal harness. Local CPU runs are useful for small checks, but the larger selector runs are GPU-heavy.

Minimal Reproduction Shape

1. Run public-calibrated PMRA mix selection on Wikitext-2 raw train/validation.
2. Evaluate the frozen selection on held-out public text.
3. Mix selected tensor payloads into one GGUF.
4. Load and smoke-test the GGUF with llama.cpp.

Exact commands are in docs/REPRODUCE.md.

Attribution

PMRA code and docs in this repo are released under Apache-2.0. Individual mixes inherit the licensing and attribution requirements of their base models and GGUF source quantizations; release-specific attribution is tracked in the model release docs.
