Commit 7bb2d88

Update llama.cpp submodule to dbe9c0c (+ embed real web UI) (#983)
* Updated skill
* Update llama.cpp submodule to dbe9c0c
* llama.cpp: embed web UI assets via Hugging Face bucket
* llama.cpp: widen worker-thread sigmask to avoid EINTR-from-cv.wait_for crash
* Updated patches README
* llamafile: drop dead SERVER_ASSETS references in BUILD.mk
* tinyblas: add tinyblasSgemmStridedBatched for cuBLAS compat
* llama.cpp: add GGML_CALL to Vulkan *_tensor_2d backend callbacks
* Fix to CUDA version detection in cuda*.bat files
* llama.cpp UI: pick newest published tag <= build number
* Fix GPU DSO extraction failing on Windows ("Permission denied")
1 parent a0f4f28 commit 7bb2d88
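
The "pick newest published tag <= build number" item in the commit message comes down to a small selection rule. The following is a minimal offline sketch of that rule, not the committed code: `build_number` and `newest_tag_at_most` are hypothetical helper names, and the tag list is read from stdin rather than from the Hugging Face tree API.

```shell
#!/bin/sh
# Hypothetical helpers illustrating the selection rule; not the committed code.

# Print the numeric build number from a `git describe --tags` style string
# such as "b6150-3-gdeadbee"; print nothing if it does not start with bNNNN.
build_number() {
    printf '%s\n' "$1" | sed -n 's/^b\([0-9][0-9]*\).*/\1/p'
}

# Read one tag per line on stdin and print the newest "bNNNN" tag whose
# number is <= $1. Prints nothing (and still returns 0) if none qualifies.
newest_tag_at_most() {
    local cur="$1" best="" tag n
    while IFS= read -r tag; do
        case "$tag" in
            b[0-9]*) n="${tag#b}" ;;
            *) continue ;;              # e.g. "latest" is not a build tag
        esac
        case "$n" in
            *[!0-9]*) continue ;;       # reject things like "b12x"
        esac
        if [ "$n" -le "$cur" ] && { [ -z "$best" ] || [ "$n" -gt "$best" ]; }; then
            best="$n"
        fi
    done
    if [ -n "$best" ]; then
        printf 'b%s\n' "$best"
    fi
}

# Demo: the pinned commit describes as b6150-..., the bucket has three tags.
printf '%s\n' b6000 b6100 b6200 latest \
    | newest_tag_at_most "$(build_number b6150-3-gdeadbee)"
```

Unlike the committed function, this sketch does not assume the tags arrive in ascending order, so it never stops early; it simply scans every tag it is given.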

40 files changed

Lines changed: 787 additions & 503 deletions

.llamafile_plugin/.claude-plugin/marketplace.json

Lines changed: 1 addition & 1 deletion
@@ -9,7 +9,7 @@
 {
   "name": "llamafile",
   "description": "Build guidance and commands for the llamafile project",
-  "version": "0.1.1",
+  "version": "0.1.2",
   "author": {
     "name": "Mozilla AI",
     "email": "davide@mozilla.ai"
Lines changed: 1 addition & 1 deletion
@@ -1,5 +1,5 @@
 {
   "name": "llamafile",
-  "version": "0.1.1",
+  "version": "0.1.2",
   "description": "Build guidance and commands for the llamafile project"
 }

docs/skills/llamafile/SKILL.md

Lines changed: 1 addition & 1 deletion
@@ -1,7 +1,7 @@
 ---
 name: llamafile
 description: This skill should be used when the user asks to "build llamafile", "rebuild llamafile", "run llamafile", "run llamafile tests", "debug llamafile", "set up llamafile", "update patches", "fix patch conflict", "update llama.cpp", "pull latest llama.cpp", "sync upstream llama.cpp", "reset submodules", "write a test for llamafile", "how does llamafile work", "llamafile architecture", or needs guidance on the llamafile build system, patch workflow, submodule integration, cosmocc toolchain, or development practices.
-version: 0.1.2
+version: 0.1.3
 ---
 
 # Llamafile Development Guide

docs/skills/llamafile/update_llamacpp.md

Lines changed: 14 additions & 1 deletion
@@ -4,7 +4,18 @@ llamafile relies on llama.cpp for many of its functionalities. Keeping it up-to-
 with the latest version upstream is generally a good practice, as it brings both
 bugfixes and support for recent models and features.
 
-This document describes the steps to keep llamafile updated with upstream.
+This document describes the steps to keep llamafile updated with upstream. At a high level,
+a llama.cpp update consists of the following tasks:
+
+1. Updating the submodule
+2. Verifying and updating the current patches (important: never create patches manually,
+   but run `../tools/generate-patches.sh --output-dir ../llama.cpp.patches` from the
+   llama.cpp directory)
+3. Updating build dependencies, to make sure new/changed deps are taken into account
+4. Updating llamafile code: llamafile is built on top of llama.cpp and you want to
+   make sure its code still works with the new, updated submodule. NOTE that this also
+   includes GPU acceleration libs (cuda, rocm, vulkan .sh and .bat scripts in the
+   `llamafile/` directory)
 
 ## Step 1: Update the submodule
 
@@ -78,6 +89,8 @@ target.
 - Check if the llamafile code that calls llama.cpp server/main needs updates
 - Review `llamafile/` for any API changes in llama.cpp that need to be reflected
 - Pay attention to changes in `llama.cpp/include/` for API modifications
+- Also check GPU acceleration libraries code (cuda, rocm, vulkan .sh and .bat
+  scripts in the `llamafile/` directory)
 
 At the end of this step, you should be able to build all targets in this repo,
 i.e. the following verification step should return a successful result

llama.cpp

llama.cpp.patches/README.md

Lines changed: 40 additions & 1 deletion
@@ -52,6 +52,7 @@ The `GGML_CALL` macro (defined as `__attribute__((__ms_abi__))` when `GGML_MULTI
 | `ggml_src_ggml-cuda_ggml-cuda.cu.patch` | Adds `GGML_CALL` to all CUDA backend callback implementations (60+ functions); also adds `free_struct` and TinyBLAS BF16 guard (see below) |
 | `ggml_src_ggml-metal_ggml-metal.cpp.patch` | Adds `GGML_CALL` to all Metal backend callback implementations (62 functions); also adds `free_struct` (see below) |
 | `ggml_src_ggml-vulkan_ggml-vulkan.cpp.patch` | Adds `GGML_CALL` to all Vulkan backend callback implementations; also adds `free_struct` and a heap memory underflow fix (see below) |
+| `ggml_src_ggml-backend-meta.cpp.patch` | Adds `GGML_CALL` to all meta-device, meta-buffer-type, meta-buffer, and meta-backend callback implementations (the meta backend aggregates several simple backends behind one interface, so its callbacks are reached through the same function-pointer structs) |
 
 ### Cross-Module Memory Management
 
@@ -97,6 +98,28 @@ Llamafile uses TinyBLAS as a lightweight replacement for cuBLAS, enabling GPU su
 | `ggml_src_ggml-cuda_common.cuh.patch` | Disables BF16 MMA when using TinyBLAS (TinyBLAS would incorrectly interpret BF16 as FP16) |
 | `ggml_src_ggml-cuda_ggml-cuda.cu.patch` | Disables BF16 in `ggml_cuda_op_mul_mat_cublas` when using TinyBLAS |
 
+### Optional IQ-Quant Exclusion (CUDA)
+
+The IQ ("importance") quantization formats (`IQ1_S`, `IQ2_XXS`/`XS`/`S`, `IQ3_S`/`XXS`, `IQ4_NL`/`XS`) pull in a large amount of CUDA template instantiation that inflates compile time and binary size. These patches gate every IQ code path behind `#ifndef GGML_CUDA_NO_IQ_QUANTS`, so a build can compile them out by defining `GGML_CUDA_NO_IQ_QUANTS`. When the macro is undefined (the default), behavior is unchanged.
+
+| Patch | Description |
+|-------|-------------|
+| `ggml_src_ggml-cuda_convert.cu.patch` | Guards IQ dequantization cases in `ggml_get_to_fp16_cuda` and `ggml_get_to_fp32_cuda` |
+| `ggml_src_ggml-cuda_cpy.cu.patch` | Guards the `f32 → IQ4_NL` copy helper and its dispatch case |
+| `ggml_src_ggml-cuda_mmq.cu.patch` | Guards IQ cases in `ggml_cuda_mul_mat_q_switch_type` and in the `ggml_cuda_should_use_mmq` support/heuristic switches |
+| `ggml_src_ggml-cuda_mmq.cuh.patch` | Guards the `extern DECL_MMQ_CASE(...)` declarations for IQ types |
+| `ggml_src_ggml-cuda_mmvq.cu.patch` | Guards IQ cases in `get_vec_dot_q_cuda` and `get_vdr_mmvq` |
+
+### CPU Performance Optimizations (llamafile #975)
+
+These patches restore llamafile's optimized CPU kernels (TinyBLAS matmul, AVX-512 flash-attention helpers) on top of upstream's CPU backend, and tune CPU-only defaults. The hooks call into symbols exported from `llamafile/sgemm.cpp` and are compiled only when `GGML_USE_LLAMAFILE` is defined.
+
+| Patch | Description |
+|-------|-------------|
+| `ggml_src_ggml-cpu_ggml-cpu.c.patch` | Routes MoE matmul (`ggml_compute_forward_mul_mat_id`) through `llamafile_mixmul` / `llamafile_mixmul_iqk`, mirroring the dense-matmul `llamafile_sgemm` hook; reserves work-buffer space for the MoE kernel in `ggml_graph_plan` via `llamafile_mixmul_needs` |
+| `ggml_src_ggml-cpu_ops.cpp.patch` | Routes flash-attention inner loops through llamafile's AVX-512 helpers (`llamafile_fa_vec_dot_f16`, `llamafile_fa_fp16_to_fp32_row`, `llamafile_fa_simd_gemm`) in both the one-chunk and tiled FA paths; also accumulates VKQ in f32 on CPUs lacking native f16 FMA (avoiding costly f16↔f32 round-trips per KV step) |
+| `src_llama-context.cpp.patch` | Defaults `-fa auto` to **off** on CPU-only setups (no GPU devices), since the CPU flash-attention path is slower than the non-FA path on x86; users can still force `-fa on` for memory savings on long contexts |
+
 ### Llamafile File Handling
 
 These patches integrate llamafile's file handling APIs for loading models from bundled zip archives and `.llamafile` containers.
@@ -111,14 +134,30 @@ These patches integrate llamafile's file handling APIs for loading models from b
 
 | Patch | Description |
 |-------|-------------|
-| `tools_server_server.cpp.patch` | Renames `main()` to `server_main()` with `on_ready`/`on_shutdown_available` callbacks for combined TUI+server mode; adds Metal/GPU backend trigger before `common_init()`; adds Cosmopolitan-specific standalone `main()` with `cosmo_args`, verbose flag handling, and GPU pre-initialization; handles `LLAMAFILE_TUI` exit to avoid Metal cleanup crashes |
+| `tools_server_server.cpp.patch` | Renames upstream's `llama_server()` to `server_main()` and adds `on_ready`/`on_shutdown_available` callbacks for combined TUI+server mode; adds Metal/GPU backend trigger before `common_init()`; adds Cosmopolitan-specific standalone `main()` with `cosmo_args`, verbose flag handling, and GPU pre-initialization; handles `LLAMAFILE_TUI` exit to avoid Metal cleanup crashes |
+
+The web UI moved upstream from prebuilt `tools/server/public/*` assets to
+a Svelte project under `tools/ui/`, embedded at CMake time via
+`tools/ui/embed.cpp`. cosmocc has no JS toolchain, so `apply-patches.sh`
+(run by `make setup`) downloads the prebuilt Svelte bundle from the
+`ggml-org/llama-ui` Hugging Face bucket into `llama.cpp/tools/ui/dist/`.
+At build time, `llama.cpp/BUILD.mk` compiles `tools/ui/embed.cpp` with
+`cosmoc++` (its APE output runs on the host) and runs it against the
+downloaded assets to emit `o/$(MODE)/llama.cpp/tools/ui/ui.{cpp,h}`,
+which is then compiled like any other C++ source and linked into
+`llama-server` and `llamafile`. If the download fails (offline, version
+not yet on HF) `apply-patches.sh` warns and continues with `dist/`
+empty; `embed.cpp` then emits a no-op `llama_ui_find_asset`, and
+`server-http.cpp` skips UI route registration via its
+`LLAMA_UI_HAS_ASSETS` guard so the API still works.
 
 ### Bug Fixes
 
 | Patch | Description |
 |-------|-------------|
 | `ggml_src_ggml-backend-reg.cpp.patch` | Suppresses debug log noise for non-existent backend search paths (irrelevant for llamafile's DSO loading approach) |
 | `ggml_src_ggml-vulkan_ggml-vulkan.cpp.patch` | Fixes unsigned integer underflow in `ggml_backend_vk_get_device_memory` where Vulkan's `heapUsage` can exceed `heapBudget` (clamps to zero instead of wrapping) |
+| `src_models_t5.cpp.patch` | Forward-declares the `graph<false>`/`graph<true>` explicit specializations before `build_arch_graph` so clang's `-std=gnu++23` doesn't reject them as specializations after implicit instantiation |
 
 ## Creating New Patches
 
llama.cpp.patches/apply-patches.sh

Lines changed: 3 additions & 0 deletions
@@ -37,6 +37,9 @@ for patch_file in "$PATCHES_DIR"/*.patch; do
     fi
 done
 
+# Fetch the prebuilt web UI assets (see fetch-ui-assets.sh for details).
+"$SCRIPT_DIR/fetch-ui-assets.sh"
+
 echo ""
 echo "Patches applied successfully!"
 echo "Note: These changes are not committed to the submodule."
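
The `fetch-ui-assets.sh` script wired in by this hunk enumerates bucket tags page by page, following the `Link: <...>; rel="next"` response header. Below is a minimal offline sketch of just that header-parsing step, reusing the same `grep` pipeline as the script. `next_page_url` is a hypothetical helper name, the `example.invalid` URL is made up for the demo, and the sketch assumes (as the script does) that the server sends absolute `http(s)` URLs in the header.

```shell
#!/bin/sh
# Hypothetical helper: read HTTP response headers on stdin and print the URL
# marked rel="next" in the Link header, or nothing if there is no next page.
next_page_url() {
    grep -i '^link:' \
        | grep -oE '<[^>]+>; *rel="next"' \
        | grep -oE 'https?://[^>]+' \
        | head -n 1
}

# Demo with a canned header block (CRLF line endings, as `curl -D` writes them).
printf 'HTTP/2 200\r\nlink: <https://example.invalid/tree?cursor=abc>; rel="next"\r\n\r\n' \
    | next_page_url
```

Because `grep -o` prints only the matched text, the trailing carriage return in each header line never reaches the extracted URL. A relative URL in the header would not match `https?://` and would be treated as "no next page".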
Lines changed: 135 additions & 0 deletions
@@ -0,0 +1,135 @@
+#!/bin/bash
+# Fetch the prebuilt llama.cpp web UI assets from Hugging Face.
+#
+# Upstream's tools/ui/scripts/ui-assets.cmake pulls the Svelte build outputs
+# from the ggml-org/llama-ui HF bucket. We do the same here so the cosmocc
+# build never has to run a JS toolchain. If the fetch fails (no network,
+# version not yet published, HF down) we leave tools/ui/dist empty; BUILD.mk's
+# embed step then generates a no-asset ui.cpp and the server still works, just
+# without the web UI.
+#
+# Run by apply-patches.sh (i.e. `make setup`); also safe to run standalone to
+# re-fetch the UI without re-applying patches.
+
+set -e
+
+SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
+LLAMA_DIR="$SCRIPT_DIR/../llama.cpp"
+
+HF_BUCKET="${LLAMAFILE_UI_HF_BUCKET:-llama-ui}"
+HF_BASE="https://huggingface.co/buckets/ggml-org/${HF_BUCKET}/resolve"
+HF_TREE_API="https://huggingface.co/api/buckets/ggml-org/${HF_BUCKET}/tree"
+UI_DIST="$LLAMA_DIR/tools/ui/dist"
+UI_ASSETS=(bundle.css bundle.js index.html loading.html)
+
+# Echo the highest bNNNN tag in the bucket whose build number is <= $1, or
+# nothing. The tree API returns directories in ascending order, 100 per page,
+# with a `Link: <...>; rel="next"` header for the following page.
+pick_ui_tag() {
+    local cur="$1"
+    local url="${HF_TREE_API}?limit=100&recursive=false"
+    local hdrs best="" n saw_newer guard=0 body
+    hdrs="$(mktemp)"
+    while [ -n "$url" ] && [ "$guard" -lt 100 ]; do
+        guard=$((guard + 1))
+        body="$(curl -fsSL --max-time 30 -D "$hdrs" "$url" 2>/dev/null)" || break
+        saw_newer=0
+        for n in $(printf '%s' "$body" | grep -oE '"path":"b[0-9]+"' \
+                | grep -oE '[0-9]+'); do
+            if [ "$n" -le "$cur" ]; then
+                if [ -z "$best" ] || [ "$n" -gt "$best" ]; then
+                    best="$n"
+                fi
+            else
+                saw_newer=1
+            fi
+        done
+        # Once a page holds a tag newer than us, all later pages are newer too,
+        # so there is nothing better to find.
+        if [ "$saw_newer" -eq 1 ]; then
+            break
+        fi
+        url="$(grep -i '^link:' "$hdrs" \
+            | grep -oE '<[^>]+>; *rel="next"' | grep -oE 'https?://[^>]+' | head -1)"
+    done
+    rm -f "$hdrs"
+    [ -z "$best" ] || printf 'b%s' "$best"  # status 0 even with no match, so the caller's assignment under `set -e` cannot abort the script
+}
+
+echo ""
+echo "Fetching prebuilt web UI assets from Hugging Face..."
+
+# Pick the version to download. Upstream's CMake just tries the exact build
+# number and then "latest", but for llamafile that's fragile: the exact tag for
+# our pinned commit is often not published yet, and "latest" can be built
+# against a newer backend than the llama.cpp we've pinned. Instead we enumerate
+# the tags actually present in the bucket and pick the newest one that is <= our
+# build number. "latest" is kept only as a last-resort fallback (enumeration
+# failed, e.g. offline / API change, or our commit predates every published
+# tag) so the build can still get *some* UI rather than none.
+UI_CUR_BUILD="$(cd "$LLAMA_DIR" && git describe --tags --always 2>/dev/null \
+    | grep -oE '^b[0-9]+' | grep -oE '[0-9]+' || true)"
+
+UI_CANDIDATES=()
+if [ -n "$UI_CUR_BUILD" ]; then
+    echo " resolving newest UI tag <= b$UI_CUR_BUILD ..."
+    UI_BEST_TAG="$(pick_ui_tag "$UI_CUR_BUILD")"
+    if [ -n "$UI_BEST_TAG" ]; then
+        echo " selected UI tag $UI_BEST_TAG"
+        UI_CANDIDATES+=("$UI_BEST_TAG")
+    else
+        echo " no UI tag <= b$UI_CUR_BUILD found in bucket; will try 'latest'"
+    fi
+fi
+UI_CANDIDATES+=("latest")
+
+mkdir -p "$UI_DIST"
+ui_ok=false
+for v in "${UI_CANDIDATES[@]}"; do
+    echo " trying $HF_BASE/$v ..."
+    fail=false
+    for asset in "${UI_ASSETS[@]}" checksums.txt; do
+        if ! curl -fsSL --max-time 60 -o "$UI_DIST/$asset" \
+                "$HF_BASE/$v/$asset?download=true"; then
+            fail=true
+            break
+        fi
+    done
+    if $fail; then
+        continue
+    fi
+
+    # Best-effort sha256 verification against checksums.txt (one "<hash> <name>"
+    # line per asset). Skip if shasum/sha256sum isn't around.
+    if command -v shasum >/dev/null 2>&1; then
+        sha_cmd="shasum -a 256"
+    elif command -v sha256sum >/dev/null 2>&1; then
+        sha_cmd="sha256sum"
+    else
+        sha_cmd=""
+    fi
+    if [ -n "$sha_cmd" ] && [ -f "$UI_DIST/checksums.txt" ]; then
+        bad=false
+        for asset in "${UI_ASSETS[@]}"; do
+            want=$(awk -v a="$asset" '$2 == a { print $1 }' "$UI_DIST/checksums.txt")
+            got=$($sha_cmd "$UI_DIST/$asset" | awk '{print $1}')
+            if [ -z "$want" ] || [ "$want" != "$got" ]; then
+                echo " checksum mismatch for $asset (want=$want got=$got)"
+                bad=true
+                break
+            fi
+        done
+        if $bad; then
+            continue
+        fi
+    fi
+
+    echo " fetched UI assets from $v"
+    ui_ok=true
+    break
+done
+
+if ! $ui_ok; then
+    echo " warning: could not download UI assets; server will build without the web UI"
+    rm -f "$UI_DIST"/*
+fi
