Commit 881409f

chore(gguf): support prebuilt llama.cpp binaries + add Q3_K_M for 8 GB VRAM
scripts/build_gguf.sh now resolves the convert script and the llama-quantize binary independently:

- LLAMACPP_REPO points at a cloned llama.cpp checkout (provides convert_hf_to_gguf.py).
- LLAMACPP_BIN points at the directory holding the quantize binary and probes both the bare and the .exe filename.
- LLAMACPP (legacy) still works as an alias for LLAMACPP_REPO so the prior single-env-var docs do not break.

This lets the build run against either a from-source build (Linux MI300X) or the official ggml-org Windows prebuilt zip without forking the script.

QUANTS now also produces Q3_K_M (~6.5 GB) ahead of Q4_K_M / Q5_K_M / Q6_K / Q8_0. Q3_K_M is the new headline 8 GB-class consumer target (RTX 4070 Laptop, RTX 3060 Ti); Q4_K_M still ships for 12-16 GB cards.

README and model-card quantization tables updated to reflect the new hardware tier breakdown.
1 parent ad3e8dd commit 881409f
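
The lookup order described in the commit message can be sketched in isolation. This is a minimal illustration, not the script itself: `resolve_paths` and the example paths `/opt/llama.cpp`, `/src/llama.cpp`, and `/opt/llama-bin` are hypothetical; only the three environment variable names and the `../llama.cpp` default come from the commit.

```shell
#!/usr/bin/env bash
# Hypothetical helper for illustration; build_gguf.sh inlines these expansions.
resolve_paths() {
  # LLAMACPP_REPO wins, then the legacy LLAMACPP, then the ../llama.cpp default.
  local repo="${LLAMACPP_REPO:-${LLAMACPP:-../llama.cpp}}"
  # LLAMACPP_BIN wins; otherwise derive build/bin under the resolved repo.
  local bin="${LLAMACPP_BIN:-$repo/build/bin}"
  echo "repo=$repo bin=$bin"
}

# Each case runs in a subshell so the assignments do not leak into the next one.
( unset LLAMACPP LLAMACPP_REPO LLAMACPP_BIN; resolve_paths )
# prints: repo=../llama.cpp bin=../llama.cpp/build/bin
( unset LLAMACPP_REPO LLAMACPP_BIN; LLAMACPP=/opt/llama.cpp; resolve_paths )
# prints: repo=/opt/llama.cpp bin=/opt/llama.cpp/build/bin
( unset LLAMACPP; LLAMACPP_REPO=/src/llama.cpp; LLAMACPP_BIN=/opt/llama-bin; resolve_paths )
# prints: repo=/src/llama.cpp bin=/opt/llama-bin
```

The third case corresponds to the Windows prebuilt layout: the repo checkout supplies the convert script while the quantize binary lives in a separate unzip directory.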

3 files changed

Lines changed: 58 additions & 24 deletions

README.md

Lines changed: 6 additions & 2 deletions
@@ -178,8 +178,12 @@ generation work is never wasted.
 - **Training (one-shot).** Single AMD Instinct MI300X (192 GB HBM3) on ROCm 7.0.
   Full-parameter SFT of a 14B Qwen1 model at sequence length 8192 does not fit on
   80 GB-class hardware; the MI300X is not optional for the training path.
-- **Consumer inference.** The Q4_K_M GGUF (≈9.45 GB) targets a single
-  RTX 4060 Ti 16 GB via llama.cpp. Pass-1 per-section context fits at 4-6K tokens.
+- **Consumer inference.** The **Q3_K_M GGUF (≈6.5 GB)** is the
+  recommended 8 GB-class target (RTX 4070 Laptop, RTX 3060 Ti, etc.) and
+  fits fully in VRAM at Pass-1 4-6K context. The Q4_K_M GGUF (≈9.45 GB)
+  is shipped for 12-16 GB cards (RTX 4060 Ti 16 GB, RTX 3080); on 8 GB
+  cards Q4_K_M still runs via partial GPU offload (`llama-cli -ngl 25`).
+  Larger Q5_K_M / Q6_K / Q8_0 quants ship for prosumer / dual-GPU rigs.
 - **Demo / research inference.** Any ROCm or CUDA host with the BF16 checkpoint;
   the `MemoCriticAgent` is pure orchestration and adds zero VRAM cost beyond the
   base model.
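
The partial-offload path mentioned in the added README text can be spelled out as a full command line. This is a rough sketch under stated assumptions: `offload_cmd` is a hypothetical helper, the model file name follows the `yuholens-14b-${quant}.gguf` pattern used by `scripts/build_gguf.sh`, and the 4096-token context is an assumed value inside the 4-6K range the README quotes. The helper only prints the command so it can be reviewed before running.

```shell
#!/usr/bin/env bash
# Hypothetical helper: prints a llama-cli invocation for partial GPU offload.
#   $1 = path to the GGUF file
#   $2 = number of layers to offload to the GPU (README example uses 25)
#   $3 = context size in tokens (assumed default: 4096)
offload_cmd() {
  local model="$1" ngl="${2:-25}" ctx="${3:-4096}"
  printf 'llama-cli -m %s -ngl %s -c %s\n' "$model" "$ngl" "$ctx"
}

offload_cmd ./yuholens-14b-Q4_K_M.gguf
# prints: llama-cli -m ./yuholens-14b-Q4_K_M.gguf -ngl 25 -c 4096
```

The right `-ngl` value depends on the card and on what else is using VRAM; 25 is the figure given in the README, not a measured optimum.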

docs/model-card.md

Lines changed: 7 additions & 6 deletions
@@ -206,12 +206,13 @@ checkpoint by `scripts/build_gguf.sh`, which calls llama.cpp's
 target quant. See the script's prereq header for the required
 llama.cpp checkout and disk-budget notes.
 
-| Quant | Approx. size | Intended hardware | Target throughput (tok/s) |
-|-----------|--------------|----------------------------------|---------------------------|
-| Q4_K_M | ~9.45 GB | 16 GB consumer GPU (RTX 4060 Ti) | ≥ 18 |
-| Q5_K_M | ~10.5 GB | 16-24 GB consumer GPU | TBD |
-| Q6_K | ~12.1 GB | 24 GB+ consumer or prosumer | TBD |
-| Q8_0 | ~15.7 GB | 24 GB+ prosumer / dual-GPU CPU offload | TBD |
+| Quant | Approx. size | Intended hardware | Target throughput (tok/s) |
+|-----------|--------------|---------------------------------------------------------|---------------------------|
+| Q3_K_M | ~6.5 GB | 8 GB consumer GPU (RTX 4070 Laptop, RTX 3060 Ti) | TBD |
+| Q4_K_M | ~9.45 GB | 12-16 GB consumer GPU (RTX 4060 Ti 16 GB, RTX 3080) | TBD |
+| Q5_K_M | ~10.5 GB | 16-24 GB consumer GPU | TBD |
+| Q6_K | ~12.1 GB | 24 GB+ consumer or prosumer | TBD |
+| Q8_0 | ~15.7 GB | 24 GB+ prosumer / dual-GPU CPU offload | TBD |
 
 Pass-1 per-section context of 4-6K tokens is the supported consumer
 operating point; longer contexts require the BF16 checkpoint served via

scripts/build_gguf.sh

Lines changed: 45 additions & 16 deletions
@@ -2,20 +2,33 @@
 #
 # Build the YuhoLens-14B GGUF release set from a HuggingFace checkpoint.
 #
-# Required tools (operator must install before running):
-#   - python (matching the llama.cpp checkout's environment)
-#   - llama.cpp cloned and built with the quantize binary:
-#       git clone https://github.com/ggerganov/llama.cpp ../llama.cpp
-#       cd ../llama.cpp && cmake -B build && cmake --build build --target llama-quantize
-#       pip install -r ../llama.cpp/requirements.txt
-#   - At least 80 GB free on the target disk (f16 + four quants for a 14B model).
+# Two-directory layout:
+#   - LLAMACPP_REPO: the cloned llama.cpp repo (provides convert_hf_to_gguf.py).
+#   - LLAMACPP_BIN: the directory holding the llama-quantize binary. May be
+#     the repo's build/bin (when llama.cpp is built from source)
+#     OR a flat directory of prebuilt Windows binaries (which
+#     is what ggml-org publishes on the GitHub releases page).
+#
+# When LLAMACPP_BIN is unset it auto-derives to "$LLAMACPP_REPO/build/bin".
+# Both the bare ("llama-quantize") and the .exe ("llama-quantize.exe") name
+# are probed, so the same script works on Linux source builds and on
+# Windows prebuilt-binary checkouts.
+#
+# Required tools the operator must install BEFORE running:
+#   - python with `gguf` and `safetensors` packages (pip install gguf
+#     safetensors).
+#   - llama.cpp cloned somewhere readable (default ../llama.cpp).
+#   - A llama-quantize binary, either built from source or unzipped from
+#     the official prebuilt Windows release.
+#   - At least 80 GB free disk for a 14B model (f16 intermediate + 5 quants).
 #
 # Usage:
 #   scripts/build_gguf.sh <checkpoint_dir> [output_dir]
 #
 # Defaults:
-#   - LLAMACPP env var overrides the llama.cpp path (default: ../llama.cpp).
 #   - output_dir defaults to <checkpoint_dir> when omitted.
+#   - LLAMACPP_REPO defaults to ../llama.cpp.
+#   - LLAMACPP_BIN defaults to $LLAMACPP_REPO/build/bin.
 #
 # This script does not run automatically. Operator runs it after the HF
 # checkpoint is downloaded locally; the resulting GGUFs are uploaded to
@@ -38,31 +51,47 @@ fi
 
 mkdir -p "$OUT_DIR"
 
-LLAMACPP="${LLAMACPP:-../llama.cpp}"
-CONVERT_SCRIPT="$LLAMACPP/convert_hf_to_gguf.py"
-QUANT_BIN="$LLAMACPP/build/bin/llama-quantize"
+# Back-compat: legacy LLAMACPP env var maps to LLAMACPP_REPO.
+LLAMACPP_REPO="${LLAMACPP_REPO:-${LLAMACPP:-../llama.cpp}}"
+LLAMACPP_BIN="${LLAMACPP_BIN:-$LLAMACPP_REPO/build/bin}"
+CONVERT_SCRIPT="$LLAMACPP_REPO/convert_hf_to_gguf.py"
+
+resolve_quant_bin() {
+  for candidate in "$LLAMACPP_BIN/llama-quantize" "$LLAMACPP_BIN/llama-quantize.exe"; do
+    if [[ -f "$candidate" ]]; then
+      echo "$candidate"
+      return 0
+    fi
+  done
+  return 1
+}
 
 if [[ ! -f "$CONVERT_SCRIPT" ]]; then
   echo "error: convert script not found at $CONVERT_SCRIPT" >&2
-  echo " set LLAMACPP=/path/to/llama.cpp or clone llama.cpp at ../llama.cpp" >&2
+  echo " set LLAMACPP_REPO=/path/to/llama.cpp clone" >&2
   exit 66
 fi
 
-if [[ ! -x "$QUANT_BIN" ]]; then
-  echo "error: llama-quantize binary not found at $QUANT_BIN" >&2
-  echo " build llama.cpp first: cmake -B build && cmake --build build --target llama-quantize" >&2
+if ! QUANT_BIN="$(resolve_quant_bin)"; then
+  echo "error: llama-quantize binary not found in $LLAMACPP_BIN" >&2
+  echo " set LLAMACPP_BIN to the directory containing llama-quantize(.exe)" >&2
+  echo " (build/bin/ for a source build, or the unzip dir for prebuilts)" >&2
   exit 66
 fi
 
 OUT_F16="${OUT_DIR%/}/yuholens-14b-f16.gguf"
 
+echo "[gguf] using convert script: $CONVERT_SCRIPT"
+echo "[gguf] using quantize bin: $QUANT_BIN"
 echo "[gguf] converting $CKPT -> $OUT_F16"
 python "$CONVERT_SCRIPT" \
   --outfile "$OUT_F16" \
   --outtype f16 \
   "$CKPT"
 
-QUANTS=("Q4_K_M" "Q5_K_M" "Q6_K" "Q8_0")
+# Q3_K_M is the 8 GB consumer headline quant; everything from Q4_K_M up
+# wants 10 GB+ VRAM or partial CPU offload at runtime.
+QUANTS=("Q3_K_M" "Q4_K_M" "Q5_K_M" "Q6_K" "Q8_0")
 for quant in "${QUANTS[@]}"; do
   out="${OUT_DIR%/}/yuholens-14b-${quant}.gguf"
   echo "[gguf] quantising $quant -> $out"
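
The binary probing added in this hunk can be exercised on its own against a scratch directory. The sketch below is an approximation for illustration: `probe_quant_bin` is a hypothetical variant of `resolve_quant_bin` that takes the directory as an argument instead of reading `LLAMACPP_BIN`, and it uses `[ -f ... ]` where the script uses `[[ -f ... ]]`. The empty files stand in for real binaries.

```shell
#!/usr/bin/env bash
# Hypothetical variant of resolve_quant_bin: the directory is a parameter.
probe_quant_bin() {
  local dir="$1" candidate
  # Probe order matches the script: bare name first, then the .exe name.
  for candidate in "$dir/llama-quantize" "$dir/llama-quantize.exe"; do
    if [ -f "$candidate" ]; then
      echo "$candidate"
      return 0
    fi
  done
  return 1
}

tmp="$(mktemp -d)"
probe_quant_bin "$tmp" || echo "not found"   # empty directory: prints "not found"
touch "$tmp/llama-quantize.exe"              # simulate an unzipped Windows prebuilt
probe_quant_bin "$tmp"                       # prints the .exe path
touch "$tmp/llama-quantize"                  # simulate a source build in the same dir
probe_quant_bin "$tmp"                       # bare name is probed first, so it wins
rm -rf "$tmp"
```

Note that the commit switches the check from `-x` to `-f`, so a file that exists but lacks the executable bit now passes the probe and would only fail later when invoked.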
