Production Mixed-Rate Allocation (PMRA) is a method for mixing existing GGUF quantizations of the same model into one standard GGUF.
To mix a model with PMRA, start with a small production GGUF, then promote selected tensor payloads to stronger production GGUF formats where calibration shows the extra bytes buy quality. The mixer writes the selected payloads into a single GGUF that loads in normal GGUF runtimes.
PMRA is not a new quantizer and does not need a custom runtime. It is a selection-and-materialization method over production GGUF payloads. Each mix records its source allocation in pmra.* metadata and in the artifact report.
Released model mixes, metrics, and upstream attribution are tracked separately from this method overview. Public PMRA GGUF mixes are published in the PMRA collection on Hugging Face.
- mix, verb: run PMRA selection and build the resulting mixed GGUF.
- mix, noun: the resulting GGUF artifact with tensor payloads from more than one source quantization.
- source: one existing GGUF quantization used as a tensor-payload donor.
- target/control: the uniform GGUF budget or quality point the mix is compared against.
- selector: the allocation strategy, usually calibration knapsack or a recorded comparison selector.
- payload budget: the tensor-payload byte ceiling the mix must fit under.
PMRA treats mixed quantization as a byte-budgeted allocation problem.
- Pick a low-bit source GGUF, such as IQ2_M.
- Pick a target/control budget, such as IQ3_XS.
- Load stronger GGUF sources from the same checkpoint.
- For each tensor group, temporarily promote it to each stronger source.
- Measure calibration NLL improvement and added payload bytes.
- Select the best set of promotions under the byte budget.
- Mix the selected tensor payloads into one standard GGUF.
- Evaluate the full mix against uniform controls and random same-budget mixes.
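The probing steps above (promote one tensor group at a time, then record the NLL improvement and the added bytes) can be sketched as follows. This is a minimal illustration, not the repo's selector: `Candidate`, `probe_candidates`, `toy_nll`, and all byte sizes and NLL values are hypothetical, and dropping candidates with no measured gain is an assumption of this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    group: str        # tensor group name
    source: str       # stronger source quantization label
    gain: float       # calibration NLL improvement from this single promotion
    extra_bytes: int  # payload bytes added relative to the low-bit source


def probe_candidates(base_bytes, source_bytes, nll_of):
    """Promote one tensor group at a time and record its gain and byte cost.

    base_bytes:   {group: payload bytes in the low-bit source}
    source_bytes: {source: {group: payload bytes in that stronger source}}
    nll_of:       callable taking {group: source} promotions, returning calibration NLL
    """
    base_nll = nll_of({})  # no promotions: the low-bit source as-is
    found = []
    for source, sizes in source_bytes.items():
        for group, size in sizes.items():
            extra = size - base_bytes[group]
            gain = base_nll - nll_of({group: source})
            # Assumption: keep only promotions that cost bytes and help.
            if extra > 0 and gain > 0:
                found.append(Candidate(group, source, gain, extra))
    return found


# Toy stand-in for a real calibration pass (hypothetical numbers).
TOY_GAIN = {
    ("blk.0.ffn", "Q4_K"): 0.5,
    ("blk.1.attn", "IQ3_XS"): 0.25,
    ("blk.1.attn", "Q4_K"): 0.375,
}


def toy_nll(promotions):
    return 3.0 - sum(TOY_GAIN.get(item, 0.0) for item in promotions.items())


base = {"blk.0.ffn": 100, "blk.1.attn": 60}
sources = {
    "IQ3_XS": {"blk.0.ffn": 130, "blk.1.attn": 80},
    "Q4_K": {"blk.0.ffn": 200, "blk.1.attn": 130},
}
candidates = probe_candidates(base, sources, toy_nll)
```

Each resulting `Candidate` carries the value and cost that a byte-budgeted selector needs. In a real run `nll_of` would be a full calibration pass over the temporarily promoted model, which is why probing dominates the compute cost.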
Uniform quantization spends one format choice broadly across the model, even though tensors are not equally sensitive. PMRA spends stronger formats only where calibration shows the bytes matter.
The default selector is a multiple-choice knapsack:
maximize total calibration NLL improvement
subject to total extra bytes <= payload budget
and at most one source choice per tensor group
Each candidate promotion has a value, which is measured calibration improvement, and a cost, which is added tensor payload bytes. Knapsack is a better fit than a pure greedy ratio because the whole byte budget matters: several modest tensor promotions can beat one large promotion even if the large one has a tempting single-tensor score.
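A small worked example of the multiple-choice knapsack, written as an exact dynamic program over scaled byte units. The function name `mckp_select`, the `unit` scaling parameter, and the toy gains and byte costs are assumptions made for illustration; the repo's selector may be organized differently.

```python
def mckp_select(candidates, budget_bytes, unit=1):
    """Multiple-choice knapsack: at most one source per tensor group.

    candidates: {group: [(source, gain, extra_bytes), ...]}
    Returns (total_gain, {group: source}).
    """
    cap = budget_bytes // unit
    # best[b] = (gain, picks) of the best selection costing at most b units.
    best = [(0.0, {})] * (cap + 1)
    for group, options in candidates.items():
        updated = list(best)  # start from "skip this group"
        for source, gain, extra_bytes in options:
            # Round costs up so a scaled solution never exceeds the real budget.
            cost = -(-extra_bytes // unit)
            for b in range(cost, cap + 1):
                # Read from `best`, not `updated`, so a group is used at most once.
                prev_gain, prev_picks = best[b - cost]
                if prev_gain + gain > updated[b][0]:
                    updated[b] = (prev_gain + gain, {**prev_picks, group: source})
        best = updated
    return best[cap]


# Hypothetical gains (NLL improvement) and costs (extra payload bytes).
toy = {
    "blk.0.ffn": [("Q4_K", 0.5, 100)],                           # one large promotion
    "blk.1.attn": [("IQ3_XS", 0.3125, 40), ("Q4_K", 0.375, 70)],
    "blk.2.attn": [("IQ3_XS", 0.25, 50)],
}
total_gain, picks = mckp_select(toy, budget_bytes=100, unit=10)
print(total_gain, picks)
# 0.5625 {'blk.1.attn': 'IQ3_XS', 'blk.2.attn': 'IQ3_XS'}
```

With a 100-byte budget, the single large promotion (gain 0.5) uses the whole budget, while the two modest promotions together cost 90 bytes and reach 0.5625. Rounding costs up to whole units keeps the result feasible, at the price of possibly missing selections that only fit at byte granularity.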
When the byte state space is compact, PMRA uses an exact scaled dynamic program. When it is too large, it keeps a Pareto-pruned frontier so the run stays practical. The selected mix still has to pass held-out evaluation because tensor interactions are real and calibration is the selection objective, not the final claim.
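One way to picture the Pareto-pruned frontier: instead of a table indexed by every byte value, keep only (cost, gain) states that no cheaper-or-equal state beats. The sketch below assumes dominance pruning only; the function name `pareto_select` and the toy data are hypothetical, and any extra frontier capping the repo applies is not shown.

```python
def pareto_select(candidates, budget_bytes):
    """Frontier search for the multiple-choice knapsack.

    candidates: {group: [(source, gain, extra_bytes), ...]}
    Returns (cost, gain, {group: source}) for the best state under the budget.
    """
    states = [(0, 0.0, {})]  # (total extra bytes, total gain, picks)
    for group, options in candidates.items():
        expanded = list(states)  # keeping a state unchanged means "skip this group"
        for cost, gain, picks in states:
            for source, g, extra_bytes in options:
                if cost + extra_bytes <= budget_bytes:
                    expanded.append(
                        (cost + extra_bytes, gain + g, {**picks, group: source})
                    )
        # Sort by cost ascending, gain descending, then keep a state only if it
        # improves on every cheaper state. Dominated states are dropped.
        expanded.sort(key=lambda s: (s[0], -s[1]))
        states, best_gain = [], float("-inf")
        for state in expanded:
            if state[1] > best_gain:
                states.append(state)
                best_gain = state[1]
    return max(states, key=lambda s: s[1])


# Hypothetical gains (NLL improvement) and costs (extra payload bytes).
toy = {
    "blk.0.ffn": [("Q4_K", 0.5, 100)],
    "blk.1.attn": [("IQ3_XS", 0.3125, 40), ("Q4_K", 0.375, 70)],
    "blk.2.attn": [("IQ3_XS", 0.25, 50)],
}
cost, gain, picks = pareto_select(toy, budget_bytes=100)
print(cost, gain, picks)
# 90 0.5625 {'blk.1.attn': 'IQ3_XS', 'blk.2.attn': 'IQ3_XS'}
```

Dominance pruning alone does not change the optimum of the additive objective; it only avoids storing states that can never win. The frontier size depends on how many distinct cost and gain trade-offs exist, not on the raw byte range.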
The repo also supports search refinements around that default: seeded genetic search, direct genetic search, seeded simulated annealing, and direct simulated annealing. These are tested as candidate finders, then judged by the same held-out controls as knapsack.
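As a rough illustration of a search refinement, here is a simulated annealing sketch. It rests on two assumptions not confirmed by this overview: that "seeded" means starting from an existing selection (for example the knapsack result), and that the objective is the sum of single-promotion gains. The function name `anneal_selection`, its parameters, and the toy data are hypothetical.

```python
import math
import random


def anneal_selection(candidates, budget_bytes, start=None, steps=2000, t0=0.05, seed=0):
    """Simulated annealing over per-group source choices under a byte budget.

    candidates: {group: {source: (gain, extra_bytes)}}
    start:      optional {group: source} starting selection (the "seed")
    Returns (best_picks, best_gain).
    """
    rng = random.Random(seed)
    groups = list(candidates)

    def totals(picks):
        gain = sum(candidates[g][s][0] for g, s in picks.items())
        cost = sum(candidates[g][s][1] for g, s in picks.items())
        return gain, cost

    current = dict(start or {})
    cur_gain, cur_cost = totals(current)
    if cur_cost > budget_bytes:
        raise ValueError("starting selection exceeds the payload budget")
    best, best_gain = dict(current), cur_gain

    for step in range(steps):
        temp = t0 * (1.0 - step / steps) + 1e-9  # linear cooling, kept above zero
        group = rng.choice(groups)
        choice = rng.choice([None] + list(candidates[group]))  # None = no promotion
        proposal = dict(current)
        if choice is None:
            proposal.pop(group, None)
        else:
            proposal[group] = choice
        gain, cost = totals(proposal)
        if cost > budget_bytes:
            continue  # infeasible moves are rejected outright
        delta = gain - cur_gain
        # Always accept improvements; accept worse moves with Boltzmann probability.
        if delta >= 0 or rng.random() < math.exp(delta / temp):
            current, cur_gain = proposal, gain
            if gain > best_gain:
                best, best_gain = dict(proposal), gain
    return best, best_gain


# Hypothetical gains (NLL improvement) and costs (extra payload bytes).
toy = {
    "blk.0.ffn": {"Q4_K": (0.5, 100)},
    "blk.1.attn": {"IQ3_XS": (0.3125, 40), "Q4_K": (0.375, 70)},
    "blk.2.attn": {"IQ3_XS": (0.25, 50)},
}
best, best_gain = anneal_selection(toy, budget_bytes=100, start={"blk.0.ffn": "Q4_K"})
```

Because the best state seen so far is tracked separately, the returned selection is never worse than the starting one on this proxy objective. A search that scores real calibration NLL instead of summed gains could also capture tensor interactions, at a much higher evaluation cost.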
PMRA is new in this repo as a GGUF-native mixing workflow, but it sits inside a longer line of mixed-precision and sensitivity-aware compression work.
Relevant predecessors include:
- HAQ, which used hardware feedback to choose mixed-precision quantization policies.
- HAWQ-V2, which used Hessian-aware sensitivity analysis for mixed-precision quantization.
- LLM.int8(), which used mixed-precision decomposition to preserve transformer outlier dimensions.
- GPTQ, which made post-training LLM weight quantization practical at large scale.
- AWQ, which protected salient weights using activation-aware calibration.
- SpQR and SqueezeLLM, which combined dense low-bit quantization with special handling for sensitive or outlier weights.
- OmniQuant, which used calibration to optimize quantization parameters across LLM settings.
PMRA is useful anywhere a deployable GGUF has to balance quality, file size, memory budget, and runtime compatibility without introducing a custom inference path.
PMRA is useful when you want to:
- hit a local memory budget more precisely than one uniform preset allows
- recover quality at the same size by protecting sensitive tensors
- publish one normal GGUF instead of a custom runtime path
- reuse public quantization ladders rather than recomputing every quant
- expose exactly where the bytes went through pmra.* metadata and reports
- create practical local-model deployment points for laptops, small GPUs, and apps
For a quick pass through the repo, use this order:
- This README - explanation of what PMRA is, how mixing works, and why knapsack is the default selector.
- The Guide - more detailed explanation of what PMRA does, why uniform quantization wastes bits, how probing and Fisher-guided profiling work, and what the output looks like.
- PMRA Hugging Face Collection - public GGUF mixes and model cards.
- Artifact Index - released mixes, reports, and metrics.
- Method - implementation-level method notes.
- Reproduce - local, Colab, and optional Modal paths for rebuilding selector results and GGUF artifacts.
- Evidence Docs - full research trail, including failed and superseded gates.
docs/ Human-facing release notes, method notes, evidence, reproduction.
scripts/ PMRA selector, GGUF mixer, public evaluators, helpers.
modal/ Modal A100 harness used for the public-calibrated runs.
results/ Result cards and JSON reports for released and validation mixes.
artifacts/ Earlier artifact reports, not the GGUF files themselves.
tools/ Hugging Face upload and verification helpers.
pip install -r requirements.txt
Full selector runs expect access to the base model weights and matching GGUF source files. They can run locally, in Colab, on rented GPU machines, or through the optional Modal harness. Local CPU runs are useful for small checks, but the larger selector runs are GPU-heavy.
- Run public-calibrated PMRA mix selection on Wikitext-2 raw train/validation.
- Evaluate the frozen selection on held-out public text.
- Mix selected tensor payloads into one GGUF.
- Load and smoke-test the GGUF with llama.cpp.
Exact commands are in docs/REPRODUCE.md.
PMRA code and docs in this repo are released under Apache-2.0. Individual mixes inherit the licensing and attribution requirements of their base models and GGUF source quantizations; release-specific attribution is tracked in the model release docs.